# Non-targeted detection of food adulteration using an ensemble machine-learning model

### Normal and spiked raw milk samples

Archive data of 65,547 normal bovine raw milk samples collected between 2017 and 2019 were provided by Mengniu and retrieved from in-house laboratory information management systems (LIMS). The data included results from tests routinely performed during industrial quality checks; one such routine test was performed on a MilkoScan FT120 (FOSS Analytical, Denmark) using Fourier-transform infrared (FTIR) spectroscopy. Compositional data from the MilkoScan FT120 comprised eight physicochemical properties of the milk samples: fat, protein, non-fat solids (NFS), total solids (TS), lactose, relative density (RD), freezing point depression (FPD), and acidity. The numerical values for the milk components were calculated with a multiple linear regression (MLR) model from the absorbance of light by the sample in specific wavelength regions measured by the FTIR equipment. Each reading was performed once. Among the 65,547 raw milk samples, 1,469 (2.21%) were removed: samples labelled “testing in progress”, (normal) samples labelled “fail”, and samples labelled “pass”, “unlabelled”, or “untreated” that had one or more compositional features outside the range of mean ± 3 standard deviations (SD) based on the three-sigma rule31.
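The three-sigma exclusion described above can be sketched as follows. This is a minimal illustration on synthetic data, not the study's pipeline: the function name `three_sigma_mask`, the use of the sample SD (`ddof=1`), and the simulated feature values are all assumptions.

```python
import numpy as np


def three_sigma_mask(features: np.ndarray, n_sd: float = 3.0) -> np.ndarray:
    """Return True for rows whose every feature lies within mean ± n_sd * SD."""
    mean = features.mean(axis=0)
    sd = features.std(axis=0, ddof=1)  # assumption: sample SD, computed per feature
    within = np.abs(features - mean) <= n_sd * sd
    return within.all(axis=1)


# Synthetic stand-in for the eight compositional features (hypothetical values)
rng = np.random.default_rng(0)
data = rng.normal(loc=3.5, scale=0.2, size=(1000, 8))
data[0, 2] = 10.0  # one implausible reading in a single feature

mask = three_sigma_mask(data)
cleaned = data[mask]  # rows kept for modelling
print(f"removed {np.count_nonzero(~mask)} of {len(data)} samples")
```

A sample is dropped as soon as any one of its features falls outside the band, which matches the "one or more compositional features" wording.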

Because no real adulterated milk had been found, spiked samples were used to train and test the model. From April to August 2020, 912 raw bovine milk samples were tested by Mengniu using the MilkoScan FT120. A total of 27 samples (2.96%), which included samples labelled “unlabelled” but with one or more compositional features outside the range of mean ± 3 SD, were excluded. Among the remaining 885 samples, 834 (94.24%) were normal and 51 (5.76%) were intentionally spoilt with cow smell, improperly stored for 36 h, or spiked with potassium sulfate, potassium dichromate, water, citric acid, or sodium citrate. Table 5 shows the concentrations of the adulterants added. The compositional data (n = 885) were obtained using FTIR spectroscopy, and each reading was performed once.

From September 2020 to February 2021, 770 raw bovine milk samples were tested by Mengniu using the FOSS FT120. A total of 113 samples (14.68%), which included samples labelled “unlabelled” but with one or more compositional features outside the range of mean ± 3 SD, were removed. Among the 657 remaining samples, 372 (56.62%) were normal raw milk samples and 285 (43.38%) were spiked raw milk samples. The spiked raw milk samples included samples spiked with potassium sulfate, citric acid, potassium dichromate, ammonium sulfate, melamine, urea, lactose, glucose, sucrose, maltodextrin, fructose, water, whole milk powder, whey protein, skimmed milk powder, starch, soy milk, and trisodium citrate. Table 5 shows the number of samples spiked with the corresponding concentrations of adulterants. The compositional data and full absorbance spectra with a wavenumber range of 1000–3550 cm−1 were considered in triplicate. Infrared spectra were obtained using the FTIR technique and were formed by 1056 points measured at wavenumbers ranging from 3000 to 1000 cm−1.

In April 2021, 155 raw bovine milk samples were tested by Mengniu using the FOSS FT120 and used for cross-validation. A total of 65 (41.94%) samples were normal raw milk samples, and 90 (58.06%) samples were spiked raw milk samples. The spiked raw milk samples included samples spiked with hydrogen peroxide, glucose, sodium hydroxide, salt, fructose, and sucrose. Table 5 shows the number of spiked samples with the corresponding concentrations of adulterants. The compositional data and full absorbance spectra with a wavenumber range of 1000–3550 cm−1 were considered in triplicate.

Table 6 presents a summary of the number of normal, spiked, and all raw milk samples in their respective years of sampling. Potassium dichromate, potassium sulfate, and hydrogen peroxide are common chemicals used to increase shelf life; sodium citrate, citric acid, sodium hydroxide, and salt are common chemicals used to maintain the correct pH. Nitrogen-based adulterants, such as ammonium sulfate and urea, are used to increase shelf life and volume, whereas melamine, whey protein, soy milk, and whole and skimmed milk powder are used to artificially alter the protein content after dilution with water. Carbohydrate-based adulterants, such as starch, sucrose, glucose, lactose, fructose, and maltodextrin, are used to increase the carbohydrate content and density of the milk. Finally, water is commonly used as a diluent in milk32. None of the abovementioned adulterants is commonly tested for in the dairy industry, and specific tests for these adulterants are not required by the national standard GB 19301-201017.

### Standardisation of full absorbance spectra into selected coordinates of 7 peaks and 1 average

Standardisation of the full absorbance spectra into eight coordinates was performed by the selection of seven peaks within the spectrum regions 1000–1100, 1500–1600, 1730–1800, 2840–2940, and 3450–3550 cm−1 and an average absorbance value for 1250–1450 cm−1 for each sample (Fig. 2)33,34,35.

### Squared Mahalanobis distance (MD) scoring method

The performances of the decision tree and non-decision tree methods were compared. The MD scoring method is a non-decision tree method used to authenticate raw milk samples. The compositional and absorbance spectral data were used to calculate the squared MD score between each sample and the centroid. A range of candidate MD scores was then iterated over, and the MD score giving the highest F1 score was taken as the cutoff for distinguishing atypical from typical raw milk. The F1 score accounts for both false positives and false negatives because it is the harmonic mean of precision and recall.
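The scoring and cutoff search can be sketched as below. This is a hedged illustration on synthetic data: taking the centroid and covariance from the normal samples only, the evenly spaced candidate grid, and the helper names `squared_mahalanobis` and `best_cutoff` are assumptions, since the text does not specify them.

```python
import numpy as np


def squared_mahalanobis(X: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance of each row of X to the centroid of `reference`."""
    centroid = reference.mean(axis=0)
    inv_cov = np.linalg.pinv(np.cov(reference, rowvar=False))
    diff = X - centroid
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)


def f1_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """F1 with spiked (1) as the positive class; returns 0.0 when there are no true positives."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))


def best_cutoff(scores, y_true, candidates):
    """Return the candidate cutoff with the highest F1 and that F1 value."""
    f1s = [f1_score(y_true, (scores > c).astype(int)) for c in candidates]
    best = int(np.argmax(f1s))
    return float(candidates[best]), f1s[best]


# Synthetic data: 500 "normal" and 50 "spiked" samples with eight features each
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 8))
spiked = rng.normal(3.0, 1.0, size=(50, 8))
X = np.vstack([normal, spiked])
y = np.r_[np.zeros(500, dtype=int), np.ones(50, dtype=int)]

scores = squared_mahalanobis(X, reference=normal)
candidates = np.linspace(scores.min(), scores.max(), 200)
cutoff, f1 = best_cutoff(scores, y, candidates)
print(f"cutoff={cutoff:.2f}, F1={f1:.3f}")
```

Samples with a score above the selected cutoff are flagged as atypical.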

### ExtraTrees

ExtraTrees is a machine-learning algorithm proposed by Geurts et al. in 2006 that consists of multiple decision trees36. Compared with random forest (RF), ExtraTrees has a high discrimination ability and can be more resilient to noise in the dataset because it uses the entire original sample instead of a bootstrap replica to train each decision tree. In this study, we used compositional and spectral data to evaluate how ExtraTrees can be used for the binary classification of a sample as typical or atypical37,38. The original dataset was randomly split into training and testing datasets. The training dataset was first used to train the ExtraTrees predictive model, and the model was then verified on the testing dataset by comparing the actual and predicted labels. The selection of the best proportion for splitting into the training and test datasets and the number of iterations are discussed in the next section.
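A minimal sketch of this train-then-verify workflow, assuming scikit-learn's `ExtraTreesClassifier` as the implementation (the text does not name a library). The synthetic data, the 90:10 split, and `n_estimators=100` are placeholders, not the study's settings.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 300 typical (0) and 100 atypical (1) samples, eight features
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (300, 8)), rng.normal(1.5, 1.0, (100, 8))])
y = np.r_[np.zeros(300, dtype=int), np.ones(100, dtype=int)]

# Random split into training and testing datasets (ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)

# bootstrap=False (the scikit-learn default) trains every tree on the whole
# training sample rather than on a bootstrap replica
model = ExtraTreesClassifier(n_estimators=100, bootstrap=False, random_state=0)
model.fit(X_train, y_train)

# Compare predicted and actual labels on the held-out data
y_pred = model.predict(X_test)
test_f1 = f1_score(y_test, y_pred)
print(f"F1 on test split: {test_f1:.3f}")
```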

### XGBoost

XGBoost is an ensemble learning approach based on classification and regression trees (CART)30. XGBoost ensembles trees in a top-down manner. Each tree consists of internal (or split) nodes and terminal (or leaf) nodes. Each split node makes a binary decision, and the final decision is based on the terminal node reached by the input feature. Tree-ensemble methods regard different decision trees as weak learners and then construct a strong learner by either bagging or boosting. Mathematically, the model can be represented by the following objective function with respect to the model parameter \(\theta\):

$$\text{obj}\left( \theta \right) = L\left( \theta \right) + \Omega \left( \theta \right),$$

where \(L\left( \theta \right)\) is the empirical loss that must be minimised and \(\Omega \left( \theta \right)\) is a regularisation of the model complexity to prevent overfitting. Considering a tree-ensemble model where the overall prediction is the summation of \(K\) predictive values across all trees \(f_{k} \left( x_{i} \right)\),

$$p_{i} = \sum\limits_{k = 1}^{K} f_{k} \left( x_{i} \right),$$

the objective function can be expressed as:

$$\text{obj}\left( \theta \right) = \sum\limits_{i = 1}^{n} l\left( p_{i} ,t_{i} \right) + \sum\limits_{k = 1}^{K} \Omega \left( f_{k} \right),$$

where \(l\left( p_{i} ,t_{i} \right)\) is the mean-squared loss imposed on each sample \(i\), \(p_{i}\) is its predicted value, \(t_{i}\) is its label, and \(\Omega \left( f_{k} \right)\) is the regularisation constraint imposed on each tree.
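For orientation, the per-tree regulariser in XGBoost is commonly written in the form below, following the formulation of Chen and Guestrin. Here \(T\) is the number of leaves in tree \(f_{k}\), \(w_{j}\) is the score assigned to leaf \(j\), and \(\gamma\) and \(\lambda\) are penalty coefficients. These symbols are not defined in the text above, and the values used in this study are not stated, so the expression illustrates the general form rather than the study's exact setting:

$$\Omega \left( f_{k} \right) = \gamma T + \frac{1}{2}\lambda \sum\limits_{j = 1}^{T} w_{j}^{2}$$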

In this study, we used compositional and spectral data to evaluate how XGBoost could be used to classify atypical raw milk samples. The original dataset was randomly split into training and testing datasets. The training dataset was first used to train the XGBoost predictive model, and the model was then applied to the testing dataset to compare the actual and predicted labels. The two basic hyperparameters, the learning rate of XGBoost and the maximum depth of the tree, were set empirically at 0.01 and 5, respectively. The hyperparameters “min_child_weight” and “colsample_bytree” were tuned with a grid search with tenfold cross-validation, and different seeds were applied in each search process to increase the variance of the model and to find an optimal parameter setting that could maximise generalisation. For each search iteration, we used the prediction score to calculate the binary cross-entropy with respect to the ground-truth labels, that is, the labels indicating whether each sample in the testing dataset was normal or spiked. The minimum sum of instance weight (Hessian) required in a child was set to 0.5, the subsample ratio of columns when constructing each tree was set to 0.8, and the learning objective was set to linear. The hyperparameters “subsample” and “num_boost_round”, which relate to the selection of the best proportion for splitting into training and test datasets and the number of boosting iterations, are discussed in the next section.

### Ensemble model: voting and weighting

The ensemble results of the three methods (MD, ExtraTrees, and XGBoost) were investigated to improve the model performance. First, a voting strategy was adopted to combine the results of each method. Training data were used to individually train the MD, ExtraTrees, and XGBoost models. After obtaining one set of initial predicted results from each of the three models, the final predicted result was reported as the majority vote among the three results. The voting strategy was evaluated by comparing the voted result with the label.
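The majority vote can be sketched as follows, assuming each model outputs a binary label per sample (0 for normal, 1 for spiked). The function name `majority_vote` and the example predictions are hypothetical.

```python
import numpy as np


def majority_vote(*predictions: np.ndarray) -> np.ndarray:
    """Return 1 where more than half of the models predict 1, else 0."""
    votes = np.vstack(predictions)  # shape: (n_models, n_samples)
    return (votes.sum(axis=0) * 2 > votes.shape[0]).astype(int)


# Hypothetical binary predictions from the three models for five samples
md_pred = np.array([1, 0, 0, 1, 0])
extratrees_pred = np.array([1, 1, 0, 0, 0])
xgboost_pred = np.array([0, 1, 0, 1, 1])

voted = majority_vote(md_pred, extratrees_pred, xgboost_pred)
print(voted)  # [1 1 0 1 0]
```

With three voters and binary labels a tie cannot occur, so no tie-breaking rule is needed.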

In addition to voting, a weighting strategy was adopted. Weights for each of the three methods were assigned based on the individual F1 scores. After training the model for MD, ExtraTrees, and XGBoost individually, the initial predicted results of the testing data were obtained in binary form (\(r_{1}\), \(r_{2}\), \(r_{3}\)). The F1 score of each model (\(f_{m_{1}}\), \(f_{m_{2}}\), and \(f_{m_{3}}\)) and weights (\(w_{1}\), \(w_{2}\), \(w_{3}\)) were calculated as follows:

$$w_{1} + w_{2} + w_{3} = 1,$$

$$\frac{f_{m_{1}}}{w_{1}} = \frac{f_{m_{2}}}{w_{2}} = \frac{f_{m_{3}}}{w_{3}}$$

The final predicted result, calculated as \(r = w_{1} r_{1} + w_{2} r_{2} + w_{3} r_{3}\), was evaluated against the labels.
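The two constraints above (weights summing to 1 and a common ratio of F1 score to weight) imply that each weight is the model's F1 score divided by the sum of the three F1 scores. A minimal sketch under that reading follows. The F1 values and predictions are invented for illustration, and the 0.5 threshold used to turn \(r\) back into a binary label is an assumption, as the text does not state how \(r\) is compared with the labels.

```python
import numpy as np


def f1_weights(f1_scores) -> np.ndarray:
    """Weights proportional to each model's F1 score, normalised to sum to 1."""
    f1 = np.asarray(f1_scores, dtype=float)
    return f1 / f1.sum()


def weighted_result(predictions, weights) -> np.ndarray:
    """r = w1*r1 + w2*r2 + w3*r3, computed per sample."""
    return np.asarray(weights) @ np.vstack(predictions)


# Hypothetical F1 scores for MD, ExtraTrees, and XGBoost
weights = f1_weights([0.60, 0.90, 0.75])

# Hypothetical binary predictions r1, r2, r3 for four samples
r1 = np.array([1, 0, 0, 1])
r2 = np.array([1, 1, 0, 0])
r3 = np.array([0, 1, 0, 1])

r = weighted_result([r1, r2, r3], weights)
labels = (r >= 0.5).astype(int)  # assumed decision threshold
print(weights, r, labels)
```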

### Selection of the best proportion for splitting into training and test datasets and number of iterations for ExtraTrees and XGBoost

An arbitrary range of proportions for splitting into training and test datasets was examined to determine the optimal proportion of training and testing datasets for ExtraTrees and XGBoost. Each arbitrary splitting was repeated thrice and with one iteration. The splitting proportions of training-to-testing ratios attempted were 50:50, 60:40, 70:30, 80:20, and 90:10. The proportion with the highest F1 score was selected as the optimal proportion for the corresponding model.
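The proportion search can be sketched as below for ExtraTrees, using scikit-learn as an assumed implementation. Averaging the F1 score over the three repeats, the stratified splits, `n_estimators=50`, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def mean_f1_for_split(X, y, test_size: float, repeats: int = 3) -> float:
    """Average test F1 over `repeats` random splits with the given test fraction."""
    scores = []
    for seed in range(repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed
        )
        model = ExtraTreesClassifier(n_estimators=50, random_state=seed)
        model.fit(X_tr, y_tr)
        scores.append(f1_score(y_te, model.predict(X_te)))
    return float(np.mean(scores))


# Synthetic stand-in data: 300 normal (0) and 100 spiked (1) samples
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (300, 8)), rng.normal(1.5, 1.0, (100, 8))])
y = np.r_[np.zeros(300, dtype=int), np.ones(100, dtype=int)]

# Test fractions for the 50:50, 60:40, 70:30, 80:20, and 90:10 splits
test_sizes = [0.5, 0.4, 0.3, 0.2, 0.1]
results = {ts: mean_f1_for_split(X, y, ts) for ts in test_sizes}
best_test_size = max(results, key=results.get)
print(results, best_test_size)
```

The same loop structure would apply to the search over the number of iterations, with the split fixed and the iteration count varied instead.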

Similarly, a range of iteration counts was examined to determine the optimal number of iterations for ExtraTrees and XGBoost. Splitting was performed with the selected proportion of the training and testing datasets, and each splitting was repeated thrice. The numbers of iterations tested were 1, 5, 10, 50, and 100. The number of iterations with the highest F1 score was selected as optimal for the corresponding model.

The results were reported in terms of accuracy, sensitivity or recall, specificity, precision or positive predictive value, negative predictive value, false alarm, and F1 score. TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively, and normal raw milk was considered negative whereas spiked raw milk was considered positive.

$$\text{Accuracy}: \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

$$\text{Sensitivity or recall}: \frac{\text{TP}}{\text{TP} + \text{FN}}$$

$$\text{Specificity}: \frac{\text{TN}}{\text{FP} + \text{TN}}$$

$$\text{Precision or positive predictive value}: \frac{\text{TP}}{\text{TP} + \text{FP}}$$

$$\text{Negative predictive value}: \frac{\text{TN}}{\text{TN} + \text{FN}}$$

$$\text{False alarm}: \frac{\text{FP}}{\text{FP} + \text{TN}}$$

$$\text{F}_{1} \text{ score}: \frac{\text{Precision} \times \text{recall}}{\text{Precision} + \text{recall}} \times 2$$
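The metric definitions above translate directly into code. The sketch below uses made-up confusion-matrix counts, with spiked milk as the positive class. The function name is hypothetical, and zero denominators are not handled.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the reported metrics from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": recall,
        "specificity": tn / (fp + tn),
        "precision": precision,
        "negative_predictive_value": tn / (tn + fn),
        "false_alarm": fp / (fp + tn),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts: 50 spiked samples (45 detected) and 100 normal samples (10 flagged)
metrics = classification_metrics(tp=45, tn=90, fp=10, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

By these definitions, false alarm is the complement of specificity, so the two always sum to 1.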

The model parameters were selected based on the highest F1 scores. In detecting food adulteration, outlier detection implies an unbalanced dataset, with the vast majority of samples being normal raw milk. With such an uneven class distribution, the cost of false positives and false negatives in our dataset can differ significantly. Hence, the F1 score was used instead of accuracy to select the best model, because the F1 score accounts for both false positives and false negatives as the harmonic mean of precision and recall. The MD calculations, ExtraTrees, and XGBoost were performed in Python and visualised using PyCharm Community Edition 2021.3.

### Assessment of seasonal and annual variations in raw milk samples

Normal raw milk samples were sub-grouped according to season and year using SPSS. Statistical analysis of annual variations was performed using analysis of variance (ANOVA) with Fisher’s least significant difference (LSD) post hoc test. To further examine whether drift effects affected the modelling results, raw milk samples from 2020 (n = 1002, of which 821 were normal and 181 treated) were used to train the models using XGBoost (found to be the best model in terms of compositional data), and the same trained model was used to predict different samples from 2020 (n = 273, of which 226 were normal and 47 treated) and 2021 (n = 276, of which 171 were normal and 105 treated). A comparison of the models for the raw milk samples from 2020 and 2021 was performed in SPSS using an independent-samples t-test with equal variances not assumed. Statistical significance was set at P < 0.05.
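The analysis above was run in SPSS. As a rough Python equivalent, the sketch below uses SciPy's `f_oneway` for a one-way ANOVA and `ttest_ind` with `equal_var=False` for a t-test that does not assume equal variances (Welch's test). The yearly values are synthetic, the choice of fat content as the feature is an assumption, and the LSD post hoc test is omitted.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for one compositional feature (e.g. fat, g/100 g) by year
rng = np.random.default_rng(0)
by_year = {
    2017: rng.normal(3.80, 0.20, 300),
    2018: rng.normal(3.85, 0.22, 300),
    2019: rng.normal(3.90, 0.25, 300),
}

# One-way ANOVA across the three years
f_stat, p_anova = stats.f_oneway(*by_year.values())

# Independent-samples t-test with equal variances not assumed (Welch's test)
t_stat, p_welch = stats.ttest_ind(by_year[2018], by_year[2019], equal_var=False)

alpha = 0.05  # significance threshold stated in the text
print(f"ANOVA p={p_anova:.4g}, significant={p_anova < alpha}")
print(f"Welch t-test p={p_welch:.4g}, significant={p_welch < alpha}")
```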

### Cross validation of the selected machine-learning model with blinded samples

Cross-validation was performed by testing spiked samples blinded from the training data. Model training was performed on previously available samples using both compositional and spectral data (n = 657, of which 372 were normal raw milk samples and 285 were spiked raw milk samples). Model testing was performed on 65 normal raw milk samples and 90 raw milk samples spiked with hydrogen peroxide (n = 15), sodium hydroxide (n = 15), salt (n = 15), glucose (n = 15), fructose (n = 15), and sucrose (n = 15) at serial dilutions of 0.01, 0.02, 0.05, 0.1, and 0.2 g/100 g of raw milk, provided by Mengniu. Hydrogen peroxide, sodium hydroxide, and salt represented new adulterants not present in the previous dataset. The compositional data and full absorbance spectra were both used for the testing. The weighting method of ExtraTrees and XGBoost was used to model the compositional data and selected coordinates from the full absorbance spectra of raw milk.

To address the problem of drift effect, the inclusion and exclusion of sugar (glucose, fructose, and sucrose) adulterants (n = 45) from the cross-validation dataset into the training dataset were studied and compared. To examine the effect of training the model with more data from the cross-validation dataset, a comparison was also performed by including and excluding the other adulterants not included in the training dataset. Furthermore, model testing with each adulterant blinded from the training dataset was evaluated.

### Effect of sample size using 8 compositional features

A range of sample sizes was used to determine the relationship between the sample size and predictive power for ExtraTrees and XGBoost. Each splitting was repeated thrice, with one iteration and a training-to-testing ratio of 90:10. The sample sizes tested were 20%, 40%, 60%, 80%, and 100% of the original sample size (n = 65,632).

### Performance comparison to GB 19301-2010

Each sample was labelled as “pass” or “fail” according to the national standards, as described in GB 19301-2010.

### Ethical approval

None to declare as no human subjects or animal models were required in this study.