9th INTERNATIONAL ZEUGMA CONFERENCE ON SCIENTIFIC RESEARCH
MACHINE LEARNING-BASED RICE GRAIN CLASSIFICATION THROUGH
NUMERICAL FEATURE EXTRACTION FROM RICE IMAGE DATA
İsmail KIRBAŞ1*, Ahmet ÇİFCİ2**
1 Burdur Mehmet Akif Ersoy University, Faculty of Engineering and Architecture, Department
of Computer Engineering, Burdur, Turkey
ORCID: 0000-0002-1206-8294
2 Burdur Mehmet Akif Ersoy University, Faculty of Engineering and Architecture, Department
of Electrical-Electronics Engineering, Burdur, Turkey
ORCID: 0000-0001-7679-9945
Abstract
Rice is a staple food for over half of the world, making it a crucial crop on a global scale. It is
a major source of food and income for millions of people and is widely cultivated in many
countries. The diversity of rice species and its many uses make it an important crop, both
economically and culturally. Accurate classification of rice grains is important in various
stages of the rice industry, including quality control, grain sorting, and species identification.
Grain morphology plays a vital role in the classification of rice. Traditional classification
methods, which rely on morphological features such as grain length, width, weight, and
shape, are subject to human error, time-consuming, labour-intensive, and subjective in their
results. The growing need for accurate and efficient rice species classification has
led to the development of machine learning models, which can process large amounts of data
and provide accurate results in real-time. In recent years, machine learning models have
shown promising results in the classification of different rice species. In this study, we
evaluated the performance of several machine learning models, including Support Vector
Machines, k-Nearest Neighbors, Stochastic Gradient Descent, Naïve Bayes and Random
Forest for classifying different rice species based on numerical features extracted from images
of rice grains. The rice (Cammeo and Osmancik) dataset comprises 3810 instances,
separated into 2180 instances of the Osmancik species and 1630 instances of the Cammeo
species. Seven morphological features were identified, namely, the area, perimeter, major axis
length, minor axis length, extent, convex area, and eccentricity for each grain of rice. The
results show that the Naïve Bayes model achieved the highest area under the curve (0.969),
while the Stochastic Gradient Descent model achieved the highest cumulative accuracy
(92.8%).
Keywords: classification, machine learning, morphological features, rice species
INTRODUCTION
Half of the world’s population depends on rice (Oryza sativa L.), an essential staple food
(Wang et al., 2014; Cao et al., 2021). Worldwide, significant amounts of rice are grown and
traded in dozens of different species. Demand for rice is increasing rapidly, creating a
need for more efficient production and sorting processes. Rice classification is the process
of sorting and grading different types of rice. Quality grading is an important step that
rice-producing companies use to establish rice pricing on the commercial market
(Thammastitkul and Petsuwan, 2023). Rice quality grading is based on various metrics such as
the length, width, shape,
color, and texture of the grains. Traditionally, experts manually classify grains based on their
appearance. However, this procedure is time-consuming and subjective.
www.zeugmakongresi.org 420 Gaziantep, Türkiye
Together with big data technologies and high-performance computing, machine learning
(ML) has emerged to open up new possibilities for unravelling, quantifying, and
comprehending data-intensive processes in agricultural operational environments (Liakos et
al., 2018). Evaluating rice grains using ML models has been the subject of several studies
(Sun et al., 2014; Chen et al., 2020; Kiratiratanapruk et al., 2020; Ruslan et al., 2022). ML
models have also demonstrated promising results in the classification of various rice species
in recent years. Kong et al. (2013) used a near-infrared hyperspectral imaging system and ML
models (k-Nearest Neighbor (k-NN), Support Vector Machines (SVM) and Random Forest
(RF)) to classify the four rice seed cultivars. Liu et al. (2016) achieved an accuracy of 87.16%
through the use of the SVM model on 200 rice grain samples belonging to 16 classes. Cinar
and Koklu (2019) performed a comparison of the performance of various ML models
(Logistic Regression (LR), Multilayer Perceptron (MLP), SVM, Decision Tree (DT), RF,
Naïve Bayes (NB) and k-NN) by feeding them with seven morphological features extracted
from each rice grain to classify two rice species. Sethy and Chatterjee (2018) classified
six rice varieties from geometric and texture features using the Multi-class Support Vector
Machine (M-SVM) algorithm and achieved an accuracy of 92%. Ibrahim et al. (2019) used the
same algorithm to classify three types of rice grains (basmati, ponni, and brown rice) and
obtained an accuracy of 92.22%.
Tarakci and Ozkan (2021) evaluated the performance of k-NN and weighted k-NN models on
the rice (Cammeo and Osmancik) dataset. They found that k-NN, with an accuracy of 92.6%,
performed slightly better than weighted k-NN, which achieved an accuracy of 91.7%.
This study compared the performance of several ML models (SVM, k-NN, Stochastic
Gradient Descent (SGD), NB, and RF) in classifying rice species based on numerical features
extracted from rice grain images. Two different rice species (Cammeo and Osmancik) were
classified based on their morphological characteristics. The goal of this study is to
assess and compare the performance of these classification models.
The structure of this study is as follows. The second section describes the dataset and the
ML models used in the study. The third section presents the performance metrics and the
experimental findings. The remaining sections assess the experimental findings and make
suggestions.
MATERIALS AND METHODS
Rice (Cammeo and Osmancik) Dataset
The rice dataset contains features extracted from images of two rice grain species grown in
Turkey: Osmancik and Cammeo (Cinar and Koklu, 2019). Grains of the Osmancik species
generally have a wide, long, glassy, and dull appearance. Grains of the Cammeo species are
likewise wide and long, glassy and dull. The features encompass seven morphological
properties: area, perimeter, major axis length, minor axis length, extent, convex area, and
eccentricity. The Osmancik species has 2180 instances and the Cammeo species has 1630
instances. Figure 1 depicts the rice varieties used in the study.
Figure 1. Rice species used in the study: a. Osmancik, b. Cammeo
For each of the 3810 samples, seven features were measured and tabulated. The contribution
of each feature to class separation was calculated using the information gain (InfoGain)
criterion and is listed in Table 1 in descending order.
Table 1. Features of the observations in the dataset and classification contribution rates
according to the InfoGain criterion
Features InfoGain
Major axis length 0.643
Perimeter 0.612
Area 0.512
Convex area 0.512
Eccentricity 0.305
Minor axis length 0.095
Extent 0.036
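The idea behind the InfoGain ranking in Table 1 can be illustrated with a minimal sketch. The source does not state how InfoGain was computed or how the continuous features were discretized, so the single-threshold split, the function names, and the toy values below are assumptions for illustration only and not the procedure used in the study.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(values, labels, threshold):
    """Information gain of splitting one numeric feature at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

# Made-up major axis lengths (NOT taken from the dataset).
lengths = [176.0, 181.0, 185.0, 190.0, 198.0, 203.0, 207.0, 212.0]
species = ["Osmancik"] * 4 + ["Cammeo"] * 4

gain_good_split = info_gain(lengths, species, threshold=194.0)  # separates the classes
gain_no_split = info_gain(lengths, species, threshold=300.0)    # everything on one side
```

A feature whose values separate the classes well yields a gain close to the class entropy, while a feature that does not separate them yields a gain close to zero.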
In Figure 2 and Figure 3, scatter plots of the major axis length, minor axis length and
perimeter features are shown to illustrate the difficulty of the classification process. The
classes are shown in red and blue; the large number of overlapping samples complicates the
classification problem.
Figure 2. Scatter plot of major axis length vs. perimeter
Figure 3. Scatter plot of major axis length vs. minor axis length
Machine Learning Models
Five ML models (SVM, k-NN, SGD, NB, and RF) were used to classify rice species. A brief
explanation of each model follows.
Support Vector Machines are a type of supervised learning algorithm that is used for
classification and regression analysis (Battineni et al., 2019). In a nutshell, an SVM algorithm
seeks the boundary between classes in a dataset that maximizes the margin between the
classes. The margin is defined as the distance between the boundary and the nearest data
points from each class, which are referred to as the support vectors (Bilgin and Çifci, 2021).
An SVM creates a decision boundary that separates the classes in classification problems. In
regression problems, an SVM fits a function that keeps as many data points as possible within
a specified tolerance margin around it.
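The margin criterion can be illustrated with a minimal sketch that compares two candidate linear boundaries on toy points. It does not train an SVM; the weights, biases, and data points are hypothetical values chosen by hand to show which boundary the maximum-margin criterion would prefer.

```python
import math

def decision_value(w, b, x):
    """Value of the linear function w.x + b for a point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the boundary w.x + b = 0.

    Labels must be -1 or +1; a positive result means every point is on
    the correct side of the boundary.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x) / norm
               for x, y in zip(points, labels))

# Hypothetical two-feature toy data, two points per class.
points = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (5.0, 5.0)]
labels = [-1, -1, 1, 1]

margin_a = geometric_margin((1.0, 1.0), -6.0, points, labels)  # boundary midway
margin_b = geometric_margin((1.0, 1.0), -5.0, points, labels)  # boundary off-centre
```

Both boundaries separate the toy classes, but the first one leaves a wider margin and is therefore the one a maximum-margin classifier would favour. The points closest to it, (2, 2) and (4, 4), play the role of support vectors.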
The k-Nearest Neighbors algorithm is a simple ML algorithm used for classification and
regression problems (Reddy et al., 2020). The basic idea behind k-NN is to predict the class or
value of a new data point based on its similarity to the training data set’s k nearest neighbors.
Because it does not learn a function or model during the training phase, the k-NN algorithm is
considered a lazy learning algorithm (Moorthy and Pabitha, 2020; Cano-Ortiz et al., 2022).
Instead, it memorizes the entire training set and makes predictions based on a similarity
comparison between the new data point and the previously stored training set. As a result,
while k-NN is a simple and quick algorithm for small data sets, it may not be the best choice
for large or complex data sets (Moorthy and Pabitha, 2020).
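The prediction step described above can be sketched in a few lines. The value of k, the Euclidean distance, the unscaled features, and the toy measurements are assumptions made for illustration; the source does not report the k-NN settings used in the study.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training points (Euclidean distance)."""
    distances = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Made-up (major axis length, minor axis length) pairs, not from the dataset.
train_x = [(180.0, 80.0), (185.0, 82.0), (178.0, 79.0),
           (205.0, 88.0), (210.0, 90.0), (208.0, 87.0)]
train_y = ["Osmancik", "Osmancik", "Osmancik", "Cammeo", "Cammeo", "Cammeo"]

label_long = knn_predict(train_x, train_y, (207.0, 89.0))
label_short = knn_predict(train_x, train_y, (182.0, 81.0))
```

Note that no training step exists: the "model" is the stored training set itself, which is why prediction cost grows with the size of the dataset.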
Stochastic Gradient Descent is an optimization algorithm commonly used in training ML
models (Pang et al., 2020). It is a variation of the traditional Gradient Descent optimization
algorithm. Instead of computing the gradient of the loss function with respect to the
parameters using the entire training set, Stochastic Gradient Descent estimates the gradient
using only one randomly selected training sample at a time. Because the gradient is computed
from a single sample, each update step in SGD is much faster (Deepa et al., 2021). It does, however,
introduce a high level of stochasticity, or randomness, into the optimization process, which
can sometimes result in slower convergence or suboptimal solutions.
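The single-sample update can be sketched as follows. The source does not state which loss function or learning rate was used with SGD, so the logistic loss, the fixed learning rate, the epoch count, and the standardized toy data are all assumptions made for this illustration.

```python
import math
import random

def sgd_logistic(xs, ys, lr=0.1, epochs=200, seed=0):
    """Train a linear classifier with logistic loss, one sample per update."""
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])
    b = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)                      # visit samples in random order
        for i in order:
            z = sum(wj * xj for wj, xj in zip(w, xs[i])) + b
            p = 1.0 / (1.0 + math.exp(-z))      # predicted probability of class 1
            err = p - ys[i]                     # gradient of the loss w.r.t. z
            w = [wj - lr * err * xj for wj, xj in zip(w, xs[i])]
            b -= lr * err
    return w, b

# Made-up, already standardized two-feature data (0 and 1 are class codes).
xs = [(-1.2, -0.8), (-0.9, -1.1), (-1.0, -0.5),
      (1.1, 0.9), (0.8, 1.2), (1.3, 0.7)]
ys = [0, 0, 0, 1, 1, 1]

w, b = sgd_logistic(xs, ys)
predictions = [1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0 for x in xs]
```

Each update touches a single sample, so its cost does not depend on the size of the training set; the price is a noisier path towards the optimum than full-batch gradient descent would take.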
Naïve Bayes is a probabilistic ML algorithm based on Bayes' theorem (Abbas et al.,
2019; Venkatesh et al., 2019; Ressan and Hassan, 2022) that treats the features as
conditionally independent given the class. It is a quick and easy-to-use
algorithm for binary and multiclass classification problems. One of Naïve Bayes’ main
advantages is its simplicity and efficiency (Balaji et al., 2020; Wickramasinghe and
Kalutarage, 2021; Zhang et al., 2021). Even for large datasets with many features, the
algorithm is simple to implement and computationally fast. As a result, it is a popular choice
for text classification and spam filtering tasks. Another advantage of Naïve Bayes is that it
can deal with noisy or irrelevant features (Suasnawa et al., 2021; Cano-Ortiz et al., 2022). The
algorithm makes predictions by aggregating evidence from all features, so it can produce
good results even if some features are irrelevant or noisy.
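How evidence from all features is aggregated can be shown with a minimal sketch. The source does not say which Naïve Bayes variant was used; because the rice features are continuous, a Gaussian likelihood per feature is assumed here, and the function names and toy measurements are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(xs, ys):
    """Estimate a class prior and a per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for x, y in zip(xs, ys):
        by_class[y].append(x)
    model = {}
    for label, rows in by_class.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (len(rows) / len(xs), stats)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log-posterior for point x."""
    def log_posterior(prior, stats):
        total = math.log(prior)
        for v, (mean, var) in zip(x, stats):
            # Per-feature log-likelihoods are simply added (independence assumption).
            total += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
        return total
    return max(model, key=lambda label: log_posterior(*model[label]))

# Made-up (major axis length, minor axis length) pairs, not from the dataset.
train_x = [(180.0, 80.0), (185.0, 82.0), (178.0, 79.0),
           (205.0, 88.0), (210.0, 90.0), (208.0, 87.0)]
train_y = ["Osmancik", "Osmancik", "Osmancik", "Cammeo", "Cammeo", "Cammeo"]

nb_model = fit_gaussian_nb(train_x, train_y)
label_long = predict_gaussian_nb(nb_model, (207.0, 89.0))
label_short = predict_gaussian_nb(nb_model, (180.0, 80.0))
```

Because each feature contributes an additive term to the log-posterior, a single uninformative feature shifts all classes by a similar amount and rarely changes the winning class.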
Random Forest is an ensemble ML algorithm (Ibrahim and Abdulazeez, 2021) that can be
used to solve classification and regression problems (Kilimci, 2022). The algorithm generates
a set of decision trees, each of which is trained on a randomly selected subset of the training
data. The final prediction is obtained by aggregating the predictions of all trees in the
forest: a majority vote for classification and an average for regression. RF is a
powerful algorithm capable of dealing with large datasets with many features as well as non-
linear relationships between features and target variables. The algorithm is also resistant to
overfitting, as averaging multiple trees can help to reduce prediction variance (Islam and
Nahiduzzaman, 2022; Cano-Ortiz et al., 2022). RF is widely used in practice, and it is
frequently one of the first algorithms that practitioners try when faced with a new problem. It
is a powerful algorithm that can produce good results with little effort, and it is an excellent
choice for many classification and regression problems.
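The two ingredients described above, training each tree on a random subset and combining the trees by voting, can be sketched in simplified form. To keep the example short, each "tree" is reduced to a one-split decision stump on one randomly chosen feature; this simplification, the number of trees, and the toy data are assumptions and do not reflect the configuration used in the study.

```python
import random
from collections import Counter

def fit_stump(xs, ys, feature):
    """Find the single threshold on `feature` with the fewest training errors."""
    best = None
    for t in sorted({x[feature] for x in xs}):
        left = [y for x, y in zip(xs, ys) if x[feature] <= t]
        right = [y for x, y in zip(xs, ys) if x[feature] > t]
        if not left or not right:
            continue
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0]
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    if best is None:  # the sample holds one distinct value: predict the majority
        label = Counter(ys).most_common(1)[0][0]
        return (feature, float("inf"), label, label)
    return (feature,) + best[1:]

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train stumps on bootstrap samples, each using one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]   # sample with replacement
        feature = rng.randrange(len(xs[0]))          # random feature choice
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx], feature))
    return forest

def predict_forest(forest, x):
    """Majority vote over all stumps."""
    votes = Counter(l if x[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]

# Made-up (major axis length, minor axis length) pairs, not from the dataset.
train_x = [(180.0, 80.0), (185.0, 82.0), (178.0, 79.0),
           (205.0, 88.0), (210.0, 90.0), (208.0, 87.0)]
train_y = ["Osmancik", "Osmancik", "Osmancik", "Cammeo", "Cammeo", "Cammeo"]

forest = fit_forest(train_x, train_y)
label_short = predict_forest(forest, (177.0, 78.0))
label_long = predict_forest(forest, (211.0, 91.0))
```

Individual stumps trained on unlucky bootstrap samples can be wrong, but the vote across many of them is stable, which is the variance-reduction effect the paragraph refers to.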
RESULTS
To measure model performance in classification problems, a confusion matrix is created from
the counts of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative
(FN) predictions. An FP is also referred to as a Type 1 error: the predicted value is
positive while the actual value is negative. An FN is also called a Type 2 error: the
predicted value is negative while the actual value is positive. Figure 4 shows the confusion
matrix generated for a binary classification problem.
Figure 4. The confusion matrix used to compare binary classification results
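The four counts can be obtained directly from paired lists of actual and predicted labels, as in the following sketch. The choice of Cammeo as the positive class and the short label lists are assumptions made for illustration; the source does not state which class was treated as positive.

```python
def confusion_counts(actual, predicted, positive):
    """Return (TP, FP, TN, FN) for a binary problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:      # predicted positive, actually negative
            fp += 1              # Type 1 error
        elif a == positive:      # predicted negative, actually positive
            fn += 1              # Type 2 error
        else:
            tn += 1
    return tp, fp, tn, fn

# Made-up labels: C = Cammeo (assumed positive class), O = Osmancik.
actual    = ["C", "C", "C", "O", "O", "O", "O", "C"]
predicted = ["C", "O", "C", "O", "C", "O", "O", "C"]

tp, fp, tn, fn = confusion_counts(actual, predicted, positive="C")
```

Swapping the positive class exchanges TP with TN and FP with FN, so the class convention should always be stated alongside precision and recall values.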
Several well-known measures can be used to compare the classification performance of
different models. All of them are derived from the confusion matrices of the models. Table 2
presents the confusion matrices of the two best-performing models.
Table 2. Confusion matrix data for the two most successful classification models
SGD Model           Predicted Cammeo   Predicted Osmancik   Total
Actual Cammeo       1490               140                  1630
Actual Osmancik     135                2045                 2180
Total               1625               2185                 3810

NB Model            Predicted Cammeo   Predicted Osmancik   Total
Actual Cammeo       1480               150                  1630
Actual Osmancik     210                1970                 2180
Total               1690               2120                 3810
The formulas used to calculate the model classification performance metrics are given in
Table 3 (Kırbaş and Çifci, 2022).
Table 3. Formulas for calculating model performance metrics
Performance Metric   Equation
Precision            $\frac{TP}{TP + FP}$
Recall               $\frac{TP}{TP + FN}$
Accuracy             $\frac{TP + TN}{TP + FN + TN + FP}$
F1-score             $\frac{2 \cdot (\mathrm{precision} \cdot \mathrm{recall})}{\mathrm{precision} + \mathrm{recall}}$
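As a worked example, the formulas of Table 3 can be applied to the SGD confusion matrix from Table 2. Treating Cammeo as the positive class is an assumption made here; the source does not state the class convention or whether the values it reports are averaged over both classes, so the per-class precision, recall, and F1-score below need not coincide with the reported figures. Accuracy does not depend on that choice.

```python
# SGD confusion matrix values from Table 2, with Cammeo assumed positive.
TP = 1490   # actual Cammeo, predicted Cammeo
FN = 140    # actual Cammeo, predicted Osmancik
FP = 135    # actual Osmancik, predicted Cammeo
TN = 2045   # actual Osmancik, predicted Osmancik

precision = TP / (TP + FP)
recall = TP / (TP + FN)
accuracy = (TP + TN) / (TP + FN + TN + FP)
f1_score = 2 * (precision * recall) / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"accuracy={accuracy:.3f} f1={f1_score:.3f}")
```

The computed accuracy, 3535 / 3810, rounds to 0.928, which agrees with the cumulative accuracy reported for the SGD model.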
Table 4 presents the evaluation results of the five models in terms of Area Under Curve
(AUC), Cumulative Accuracy (CA), F1-score, Precision, and Recall. The best results are the
values closest to 1 and are marked with an asterisk in Table 4.
Table 4. Classification metrics for five different classification models
Model   AUC      CA       F1-score   Precision   Recall
k-NN    0.933    0.883    0.883      0.883       0.883
SVM     0.826    0.726    0.714      0.733       0.726
SGD     0.926    0.928*   0.928*     0.928*      0.928*
RF      0.967    0.918    0.918      0.918       0.918
NB      0.969*   0.906    0.906      0.906       0.906
The values in Table 4 are shown graphically in Figure 5.
Figure 5. Model performance comparison for five classification metrics
The results in Table 4 show that the SGD model is the best option in terms of cumulative
accuracy, F1-score, precision, and recall (0.928 for each). In terms of the area under the
curve, however, the NB model (0.969) performed slightly better than RF (0.967) and clearly
better than the remaining models. Among all of the models examined, the SVM model had the
lowest scores on every metric, followed by the k-NN model in terms of accuracy, F1-score,
precision, and recall.
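That one model can lead on AUC while another leads on accuracy follows from what the two metrics measure: AUC evaluates how well the scores rank positive samples above negative ones, whereas accuracy evaluates the labels produced at one fixed threshold. The sketch below uses the pairwise interpretation of AUC; the source does not describe how AUC was computed, and the scores, labels, and the 0.5 threshold are made-up values for illustration.

```python
def auc_from_scores(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive sample
    receives the higher score; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy_at(scores, labels, threshold=0.5):
    """Share of samples labelled correctly when thresholding the scores."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)

# Made-up classifier scores for three positive and three negative samples.
scores = [0.9, 0.8, 0.45, 0.6, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]

auc = auc_from_scores(scores, labels)      # 8 of 9 pairs ranked correctly
acc = accuracy_at(scores, labels)          # 4 of 6 labels correct at 0.5
```

In this toy case the ranking is nearly perfect while the thresholded labels are not, which shows why the two metrics can order models differently.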
DISCUSSION
In this study, the rice dataset was used to classify Osmancik and Cammeo rice grains, both of
which are widely produced in Turkey. The data contains seven morphological features
extracted from images of two rice grain species. There are 3810 data rows in total, which are
not evenly distributed between the two varieties: Cammeo has 1630 rows of data, while
Osmancik has 2180. Having more data for one class provides more training examples for that
class, which may result in better classification of that class than of the other.
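The effect of the uneven class sizes can be quantified with a short calculation based on the class counts and the SGD confusion matrix in Table 2. The variable names are illustrative; the "majority baseline" is the accuracy of a trivial model that always predicts the larger class, which is a common reference point rather than something reported in the study.

```python
# Class sizes and correctly classified SGD counts, taken from Table 2.
class_counts = {"Cammeo": 1630, "Osmancik": 2180}
sgd_correct = {"Cammeo": 1490, "Osmancik": 2045}

total = sum(class_counts.values())

# Accuracy of always predicting the larger class (Osmancik).
majority_baseline = max(class_counts.values()) / total

# Share of each class that the SGD model classified correctly.
per_class_recall = {name: sgd_correct[name] / class_counts[name]
                    for name in class_counts}

for name, value in per_class_recall.items():
    print(f"{name}: {value:.3f}")
print(f"majority baseline: {majority_baseline:.3f}")
```

The larger class (Osmancik) is recognized somewhat more reliably than the smaller one, and the reported accuracy of 92.8% lies well above the 57.2% that the trivial baseline would reach.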
Based on Table 4 and Figure 5, which show the comparative classification performance, SGD
achieved the best accuracy, F1-score, precision, and recall, and NB achieved the best AUC,
while SVM showed the lowest performance. The 92.8% accuracy obtained with SGD is slightly
higher than the 92.6% reported by Tarakci and Ozkan (2021) for k-NN on the same dataset.
CONCLUSION
The rice varieties Osmancik and Cammeo were evaluated in this study, and classification was
performed using seven morphological features extracted for each of 3810 samples. The binary
classification experiments with five ML models showed that a cumulative accuracy of 92.8% is
attainable. The performance of each of the five models was evaluated using five criteria
(AUC, CA, F1-score, precision, and recall). The results indicate that ML algorithms can be
used for reliable, high-performance classification of this and similar datasets.
Future work will focus on improving the classification results by experimenting with various
artificial neural network and deep learning methods.
REFERENCES
1. Abbas, M., Memon, K. A., Jamali, A. A., Memon, S., & Ahmed, A. (2019). Multinomial
Naive Bayes classification model for sentiment analysis. IJCSNS International Journal of
Computer Science and Network Security, 19(3), 62.
2. Balaji, V. R., Suganthi, S. T., Rajadevi, R., Kumar, V. K., Balaji, B. S., & Pandiyan, S.
(2020). Skin disease detection and segmentation using dynamic graph cut algorithm and
classification through Naive Bayes classifier. Measurement, 163, 107922.
3. Battineni, G., Chintalapudi, N., & Amenta, F. (2019). Machine learning in medicine:
Performance calculation of dementia prediction by support vector machines (SVM).
Informatics in Medicine Unlocked, 16, 100200.
4. Bilgin, G., & Çifci, A. (2021). Eritematöz skuamöz hastalıkların teşhisinde makine öğrenme
algoritmaları performanslarının değerlendirilmesi. Journal of Intelligent Systems: Theory and
Applications, 4(2), 195-202.
5. Cano-Ortiz, S., Pascual-Munoz, P., & Castro-Fresno, D. (2022). Machine learning algorithms
for monitoring pavement performance. Automation in Construction, 139, 104309.
6. Cao, J., Zhang, Z., Tao, F., Zhang, L., Luo, Y., Zhang, J., Han, J., & Xie, J. (2021).
Integrating multi-source data for rice yield prediction across China using machine learning
and deep learning approaches. Agricultural and Forest Meteorology, 297, 108275.
7. Chen, J., Lian, Y., & Li, Y. (2020). Real-time grain impurity sensing for rice combine
harvesters using image processing and decision-tree algorithm. Computers and Electronics in
Agriculture, 175, 105591.
8. Cinar, I., & Koklu, M. (2019). Classification of rice varieties using artificial intelligence
methods. International Journal of Intelligent Systems and Applications in Engineering, 7(3),
188-194.
9. Deepa, N., Prabadevi, B., Maddikunta, P. K., Gadekallu, T. R., Baker, T., Khan, M. A., &
Tariq, U. (2021). An AI-based intelligent system for healthcare analysis using Ridge-Adaline
Stochastic Gradient Descent Classifier. The Journal of Supercomputing, 77, 1998-2017.
10. Ibrahim, I., & Abdulazeez, A. (2021). The role of machine learning algorithms for diagnosing
diseases. Journal of Applied Science and Technology Trends, 2(01), 10-19.
11. Ibrahim, S., Zulkifli, N. A., Sabri, N., Shari, A. A., & Noordin, M. R. M. (2019). Rice grain
classification using multi-class support vector machine (SVM). IAES International Journal of
Artificial Intelligence, 8(3), 215.
12. Islam, M. R., & Nahiduzzaman, M. (2022). Complex features extraction with deep learning
model for the detection of COVID19 from CT scan images using ensemble based machine
learning approach. Expert Systems with Applications, 195, 116554.
13. Kırbaş, İ., & Çifci, A. (2022). An effective and fast solution for classification of wood
species: A deep transfer learning approach. Ecological Informatics, 69, 101633.
14. Kilimci, Z. H. (2022). The effectiveness of homogeneous classifier ensembles on customer
churn prediction in banking, insurance, and telecommunication sectors. International Journal
of Computational and Experimental Science and Engineering, 8(3), 77-85.
15. Kiratiratanapruk, K., Temniranrat, P., Sinthupinyo, W., Prempree, P., Chaitavon, K.,
Porntheeraphat, S., & Prasertsak, A. (2020). Development of paddy rice seed classification
process using machine learning techniques for automatic grading machine. Journal of
Sensors, 2020.
16. Kong, W., Zhang, C., Liu, F., Nie, P., & He, Y. (2013). Rice seed cultivar identification using
near-infrared hyperspectral imaging and multivariate data analysis. Sensors, 13(7), 8916-
8927.
17. Liakos, K. G., Busato, P., Moshou, D., Pearson, S., & Bochtis, D. (2018). Machine learning in
agriculture: A review. Sensors, 18(8), 2674.
18. Liu, T., Wu, W., Chen, W., Sun, C., Chen, C., Wang, R., Zhu, X., & Guo, W. (2016). A
shadow-based method to calculate the percentage of filled rice grains. Biosystems
Engineering, 150, 79-88.
19. Moorthy, R. S., & Pabitha, P. (2020). Optimal detection of phising attack using SCA based K-
NN. Procedia Computer Science, 171, 1716-1725.
20. Pang, B., Nijkamp, E., & Wu, Y. N. (2020). Deep learning with tensorflow: A review.
Journal of Educational and Behavioral Statistics, 45(2), 227-248.
21. Reddy, G. T., Bhattacharya, S., Ramakrishnan, S. S., Chowdhary, C. L., Hakak, S., Kaluri, R.,
& Reddy, M. P. K. (2020, February). An ensemble based machine learning model for
diabetic retinopathy classification. 2020 International Conference on Emerging Trends in
Information Technology and Engineering (ic-ETITE) (pp. 1-6). IEEE.
22. Ressan, M. B., & Hassan, R. F. (2022). Naive-Bayes family for sentiment analysis during
COVID-19 pandemic and classification tweets. Indonesian Journal of Electrical Engineering
and Computer Science, 28(1), 375.
23. Ruslan, R., Khairunniza-Bejo, S., Jahari, M., & Ibrahim, M. F. (2022). Weedy rice
classification using image processing and a machine learning approach. Agriculture, 12(5),
645.
24. Sethy, P. K., & Chatterjee, A. (2018). Rice variety identification of western Odisha based on
geometrical and texture feature. International Journal of Applied Engineering Research,
13(4), 35-39.
25. Suasnawa, I. W., Caturbawa, I. G. N. B., Widharma, I. G. S., Sapteka, A. A. N. G., Indrayana,
I. N. E., & Sunaya, I. G. A. M. (2021). Twitter sentiment analysis on the implementation of
online learning during the pandemic using Naive Bayes and Support Vector Machine.
Proceedings of the 4th International Conference on Applied Science and Technology on
Engineering Science (iCAST-ES 2021) (pp. 348-353).
26. Sun, C., Liu, T., Ji, C., Jiang, M., Tian, T., Guo, D., Wang, L., Chen, Y., & Liang, X. (2014).
Evaluation and analysis the chalkiness of connected rice kernels based on image processing
technology and support vector machine. Journal of Cereal Science, 60(2), 426-432.
27. Tarakci, F., & Ozkan, I. A. (2021). Comparison of classification performance of kNN and
WKNN algorithms. Selcuk University Journal of Engineering Sciences, 20(2), 32-37.
28. Thammastitkul, A., & Petsuwan, J. (2023). Thai Hom Mali rice grading using machine
learning and deep learning approaches. IAES International Journal of Artificial Intelligence,
12(2), 815.
29. Venkatesh, R., Balasubramanian, C., & Kaliappan, M. (2019). Development of big data
predictive analytics model for disease prediction using machine learning technique. Journal of
Medical Systems, 43, 1-8.
30. Wang, P., Zhang, Z., Song, X., Chen, Y., Wei, X., Shi, P., & Tao, F. (2014). Temperature
variations and rice yields in China: historical contributions and future trends. Climatic
Change, 124, 777-789.
31. Wickramasinghe, I., & Kalutarage, H. (2021). Naive Bayes: applications, variations and
vulnerabilities: a review of literature with code snippets for implementation. Soft Computing,
25(3), 2277-2293.
32. Zhang, H., Jiang, L., & Yu, L. (2021). Attribute and instance weighted naive Bayes. Pattern
Recognition, 111, 107674.