Using vis-NIRS and Machine Learning methods to diagnose sugarcane soil chemical properties

Knowing chemical soil properties might be determinant in crop management and total yield production. Traditional soil properties estimation approaches are time-consuming and require complex lab setups, refraining farmers from promptly taking steps towards optimal practices in their crops. Soil properties estimation from its spectral signals, vis-NIRS, emerged as a low-cost, non-invasive, and non-destructive alternative. Current approaches use mathematical and statistical techniques, avoiding machine learning frameworks. This proposal uses vis-NIRS in sugarcane soils and machine learning techniques such as three regression and six classification methods. The scope is to assess performance in predicting and inferring categories of common soil properties (pH, soil organic matter OM, Ca, Na, K, and Mg), evaluated by the most common metrics. We use regression to estimate properties and classification to assess soil property status. In both cases, we achieved comparable performance on similar setups reported in the literature for property estimation for pH($R^2$=0.8, $\rho$=0.89), OM($R^2$=0.37, $\rho$=0.63), Ca($R^2$=0.54, $\rho$=0.74), Mg($R^2$=0.44, $\rho$=0.66) in the validation set.


Introduction
As the population grows, the demand for food continues to increase. However, unsustainable practices reduce the arable soil. Soils are dynamic systems that change in response to different natural and anthropogenic activities. Soil health must be a priority, particularly in agricultural practices, to increase productivity without affecting the soil. Soil chemical properties are important for agricultural production because they determine the amount of nutrients that plants uptake from soil to grow. It is essential to monitor soil quality through physicochemical analyses to provide a specific assessment looking towards sustainability [1].
Laboratory analysis of soils is widely used to know soil properties; it employs traditional chemical analyses that are expensive, time-consuming, and generates environmental contamination due to the number of chemical reagents used [2]. There is currently a growing demand to This study uses vis-NIRS and ML approaches from sugarcane for panela Colombian soil samples (653 data points) with two specific objectives. First, to evaluate the capacity of ML approaches to estimate six soil chemical properties: pH, organic matter (OM), calcium (Ca), magnesium (Mg), sodium (Na), and potassium (K) content. We compare the selected ML model for each property with two scenarios that simulate traditional chemometric techniques (1) using the band with the highest regression coefficient(s) and (2) Partial Least Squares Regression (PLSR) [19] [20]. Second, to infer categories for soil properties to see whether this is a viable alternative.

Study area and sample collection procedure
This study uses a data set derived from a previous study in the Hoya del río Suárez region in Colombia (Coordinates: 73°22' -73°39' West longitude and 5°53' -6°10' North latitude). This region covers an area of about 470 km 2 . Entisols, inceptisols, and vertisols characterize this region, according to the soil survey [15]. The area has two principals crops: sugar cane for panela agro-industry and grasslands.
The sampling stage occurred during 2015 and 2016; samples are from the surface to a depth of 20 cm. Each sample point represents four sub-samples collected and mixed. The sample design corresponds to a reticulate grid of 700 meters ( Figure 1) wherein each point corresponds to four sub-samples to compose only one sample. This study uses a total of 653 samples.

Chemical measurements
We dried and analyzed the samples for classic laboratory analysis. The laboratory procedure employed a pH meter with 1:2.5 soil-water suspension (NTC 5264, 2008), organic matter (OM) with Walkley and Black's wet digestion method. We used the ammonium acetate extraction method to measure exchangeable cations (Ca +2 , K + , Mg +2 , and Na + ) by NTC 5349 -2008 [21].

Vis-NIR spectroscopy
The samples passed through a humidity homogenization process; samples were dried at 40°C (for 48-96 h, depending on the type of soil). Samples were placed in a 50 mm diameter annular cup and scanned from 850 to 2500 nm using a NIR spectrophotometer (FOOS-DS2500).

Regression and classification strategy
This study used the strategy presented in figure 2 for regression and classification. First, the spectrum was transformed to obtain informative features (section 2.2.1). Second, we followed two paths, one towards regression with three models from scikit-learn [22]in Python and another path towards classification with six classification methods from the Statistics and Machine learning toolbox in MATLAB. At follows, we describe the details for the ML regression (section 2.2.2) and classification (section 2.2.3) models. The coefficient of determination R 2 and correlation coefficient ρ were used to compare and select the best regression model. We used the accuracy, confusion matrix, and Mathews correlation coefficient [18] to compare the classification models.

Features and pre-processing
The Vis-NIR spectra for each sample cover the range between 400 and 2491 nm with steps of 8.5 nm (vector of 247 elements). We took each data point as a feature and applied transformations to the spectra, such as the first derivative (D1), second derivative (D2), and the Fast Fourier Transform (FFT). We applied standard normalization by feature in the whole samples dataset to ensure unit variance, and zero mean [23] and concatenated each feature set, resulting in 247x4 = 988 features for every sample.

ML regression
The dataset contains 653 samples and 988 features for six soil properties. First, we randomly split this dataset 70% (457 samples) for training (to evaluate and adjust parameters) and 30% (196 samples) only for validation purposes.
Regression models: We evaluated three regression models from [22] such as (1) Linear Regression (LR) that fits a predictor based on a model of ordinary least squares linear regression Pedregosa et al. [22], (2) Support Vector Regression (SVR) with the linear kernel (based on libsvm library Chang and Lin [24], implements an Epsilon-Support Vector Regression predictor with free parameters C = 1 and = 0.1), (3) LASSO by using cross-validation, implements a linear model that estimates sparse coefficients reducing the number of features, its fitting is iterative, and 5-fold cross-validation helps to select the best model (free parameters are = 0.001).
Cross-validation: We selected the best model among the three previously mentioned using a 5-fold cross-validation in the 70 % defined for training (457 samples) and the 988 features. We performed the selection by using the distribution, 5-fold (for the median and deviation), for the correlation (ρ), determination coefficients (R 2 ), and the mean squared error (MSE). However, for Figure 2: Strategy for regression and classification model selection from vis-NIRS data. Both start from the same data and features. At left, the steps that are taken to obtain the best regression model for each property; and at right, the classification steps some properties, none of the regression ML appears promising (ρ >0.6); we did not proceed in this case because we consider that ML regression is not suitable for the property in this dataset.
Simulating chemometric approaches: We simulate two chemometric approaches. For the first, we obtain the feature (from the 988) with the highest correlation coefficient to the target label; then, we used a model based on linear regression with the best feature (LR-bf). For the second approach, we used Partial Least Squares Regression (PLSR), a method frequently used in the literature [22]. Although we used two up to eight components, the results reported correspond to the best tuned with six principal components and 10000 iterations.
Comparison ML regression to simulated chemometric approaches: We used the 30 % test set (196 samples not used during the training) to compare the regression model selected against (1) the linear regression with the best feature (LR-bf), and (2) PLSR with six principal components.

ML classifiers
Classes: We defined the target classes for the properties according to soil fertility requirements for sugarcane for panela: K (Low: <0. Cross-validation: Due to the class imbalance of the dataset, we opted to perform a 5fold cross-validation in the entire dataset to select the best performing ML model. The crossvalidation gives us an estimate of the model accuracy trained and evaluated with all the data. This method generally results in a less biased and less optimistic evaluation of the model performance than other methods, such as train/test split. [25].
Misclassification cost grid search Due to the class imbalance of the dataset, we perform a grid search over a penalty cost calculated from the confusion matrix to optimize the model hyperparameters. As usual, the confusion matrix represents how accurately the current model predicted the observation class when that particular observation was a part of the held-out fold in the cross-validation while the model is training. Therefore, the values of all the confusion matrices are integers that aggregate the complete dataset, as shown in Figures 4 A-F. This penalty cost was applied to all Type I and Type II errors in the Confusion Matrix. By default, the penalty is set to one in all errors and cero otherwise. Hence, all observations have the same penalty in the cost function. Changing this penalty affects the cost function of the model.
We use a grid search to find the best-performing combination of penalty costs for all errors. In the models for pH, OM, Ca, Mg, K, we use a grid-search of 6 parameters corresponding to all type I and II errors on a three-class confusion matrix; each parameter varied between gs = {1, 2, ..., 7} for a total of 117649 different full cross-validation experiments. For Na, we optimize the two misclassification cost, we increased the grid search to gs = {1, 2, ..., 150} for a total of 22500 cross-validations. Finally, we use the Mathews correlation coefficient or MCC as a metric to evaluate the performance due to the advantage of MCC over the accuracy or F1-score in unbalanced datasets [18].

Feature ranking
Finally, we propose a feature ranking approach to unveil each wavelength or band's effects from the spectrum and the properties. We obtained and normalized scores for feature selection such as (1) the correlation coefficient, (2) LASSO ranking, (3) F-Score, and (4) variance. We 6 only consider the derivatives for this procedure because the best bands for all properties belong to these transforms. The variance score, unsupervised, aids in filtering features with low variance and thus insufficient information. The LASSO ranking through regularization removes features with low information content and redundant, reducing dimensionality. Finally, F-Score and correlation coefficient rank the bands that have more effects on the target based on univariate regressions. We added these four scores and obtained a unique value to rank the features. We used the feature selection module form scikit learn [22] to obtain the scores.

Regression
The best pH estimates result using an SVR regressor in the test set with a correlation between true and predicted ρ = 0.898 (R 2 = 0.802). When using the feature that best correlates with pH, we get ρ = 0.694 (R 2 = 0.479) using linear regression. LASSO performed slightly better than PLSR (ρ = 0.865 and R 2 = 0.745). LASSO and PLSR regressor significantly improved the accuracy of the pH estimates, shown by the non-overlapping 95% confidence interval of all three LASSO, LR-bf, and PLSR ( Figure 3A).
Moreover, we tested several regression models on the remaining soil properties (K and Na), obtaining a correlation ρ below 0.5. Figure 3 E. summarises regressor results and presents the best ML regressor and the results simulating chemometric approaches. We use linear regression to compare to a simple approach, but this method leads to the worst results in all the cases.

Classification
Remarkably SVM with a linear kernel was the best performing classifier of the 24 evaluated for all soil properties. The confusion matrixes of the best ML classifiers are shown in Figure 4 A to F. As stated by Figure 4 H, we obtain accuracies greater than 73% for pH, OM and Ca, and 68% for Mg. However, for OM, K, and Na some of the classes were under-represented by less than 5% of the total, this makes the precision results misleading. i.e., for Na, a precision of 99% is not as good an indicator as appeared, since the medium class only has eight samples against 645 of the Low class. In this case, the F1 score, the True Positive Ratio TPR, and the MCC are preferred to compare the performance.

Feature ranking
At last, Figure 5.A shows the feature ranking for each property and region with distinct absorption (towards red). The visible 450-670nm range contains highly ranked features for all properties. pH and Ca have similar feature ranking heatmaps with the highest bands ranked around 600nm . Mg has a highly correlated range of 600-670nm, while K has a highly ranked area near 500nm. Thus, the visible spectrum bands are hihgly correlated to these properties. Instead, Na presents highly ranked features between 2100-2400nm. 7

Discussion
Regression and vis-NIRS provide an alternative to soil property estimation, with an uncertainty higher than traditional laboratory analyses but with certain advantages (cost, timing, residues). The most frequently used method PLSR, has similar results to the best machine learning approaches but is slightly improved by them.
We proposed an alternative strategy for property estimation from soil samples by recasting the regression into a classification problem. We labeled conventional test results depending on the property to be estimated. We then implemented standard mappings of real property values into qualitative classes from literature. For our dataset, such mappings ended in unbalanced classes, with few samples (classes with less than 5% of the samples). Surveyed ML classified the test samples mostly in the label with the largest training set. We marginally improved the classification of the under-represented labels by introducing weighted metrics in the cross-validation stage.We obtained the best results from a maximum margin method such as linear SVM, we believe that this result is due to the high tolerance to overfitting given by the cross-validation and the MCC. We also tested oversampling approaches to balance our training dataset synthetically. However, weighted metrics outperformed such an approach. These classifiers can be used as a qualitative assessment tool that might help optimal sampling design for further expensive conventional lab tests or initial intervention plans.
It is worth noting efforts in the literature attempt to use vis-NIRS data in proposals aiming towards soil diagnosis. Viscarra et al. [4] used Vis-NIRS, in sugarcane soils, to predict soil properties and moved towards a soil fertility index. They used 184 soil samples, PLSR, to estimate soil properties, in addition to 17 terrain attributes to derive the index. Awiti et al. [26] used vis-NIR into an odds logistic model to classify soil into good, average and poor condition. The usage of vis-NIRS and ML is a rapid strategy that offers the possibility to diagnose soil condi-tions. This study is the first step to evaluate the performance of vis-NIRS, ML regressors, and classifiers, but we look forward to getting into soil diagnosis.
Bands with the highest correlations and chemical hypotheses Regions for features highly ranked are centered near to 500, 600, 1400, 1700, 1900, 2200, and 2400 nm ( Figure 5). For the pH(H 2 O), absorptions near 500 and 600 nm are primarily associated with some minerals containing hematite, and goethite [27,5]. At the same time, absorptions near 600 nm result from chromophores and the darkness of organic. In the Vis-NIRS, the overtones and combination bands due to organic matter result from the stretching and bending of CO, CH, and NH groups [28]. The band around 1400 nm is linked to the vibration of OH and residual water in organic matter [29]. On the other hand, the wavelengths at 1700 and 1930 nm are assigned to groups (C-H) and (C=O) that correspond to aromatic asymmetric alkylsymmetric doublet and carboxylic acids, respectively [30]. These bands have been identified as essential bands for organic matter calibration [5]. The band near 2200 nm can be attributed to metal-OH bend plus O-H stretch combinations of several clay minerals, among them illitic types [31], organic compounds, and carbonate. The wavelength at 2350 nm is related to Mg-OH [32]. Finally, in Figure 5 the region between 500 and 600nm has a high correlation with the chemical parameters analyzed. This could be related to both the dissolution mechanisms of iron oxides within the soils, and particularly within the rhizosphere (protonation, reduction, complexation) [33]; as well as the reactions that organic matter (humic acids and fulvic acids) with cations (Ca +2 , K + , Mg +2 , and Na + ) ( [34,35,36,37,38]). Although there is no direct association between properties and the NIRS, highly classified characteristics could be associated with property components.

Conclusions
The combination of spectra, its first and second derivative, and ML regressors have the best accuracy results for pH, OM, Ca, and Mg soil content. Despite the estimation performance being close to reported in the literature, it is critical increasing the number of samples, adding soil samples with extreme values to enhance prediction power. PLSR has a comparable performance estimating chemical properties; although it has a similar performance to the best ML regression models, the best ML regressors outperform PLSR.
ML classifiers are a feasible strategy when ML regressors poorly perform. Also, ML classifiers can be used as a qualitative assessment tool for optimal sampling design.
The feature ranking approach enables the researcher to get insight into the bands that highly correlate with each property. It is essential to understand what is behind ML approaches; thus, feature ranking is the first step in getting back to the data.

Data availability upon acceptance
The filtered datasets and scripts are archived at github (available upon acceptance).

Acknowledegments
Special thanks to Oscar Daniel Torres Rodríguez and Andrés Felipe Mariño Guerra for a preliminary study. We are also grateful for the project 243. Recomendaciones técnicas preliminares de manejo de suelos en ladera para el sistema de producción de caña panelera en la HRS from 11