Updated Presentation
Why I selected this dataset:
I selected this dataset because it provides a substantial amount of data, with 8,124 rows, making it ideal for machine learning models as it allows for robust training and
potentially higher accuracy. The dataset, sourced from the UCI Machine Learning Repository, is derived from reliable scientific observations, ensuring its credibility. The
dataset consists entirely of categorical data (e.g., cap shape, odor, habitat, and population), offering a rich opportunity for classification tasks. Additionally, the topic is
highly compelling, offering insights into the characteristics that determine whether a mushroom is edible or poisonous. This combination of a rich dataset, a reputable
source, and a meaningful subject makes it both an engaging and impactful project.
Key features and attributes in the dataset:
The dataset consists of 8,124 rows and 23 columns, providing a rich mix of categorical features that describe various mushroom characteristics. Each row represents a unique observation of a
mushroom, while the columns capture attributes such as cap shape, cap surface, cap color, and odor. All data in the dataset is categorical, with no missing values, making it well-suited for classification
tasks. The target column, class, specifies whether a mushroom is edible (e) or poisonous (p). This dataset offers a unique opportunity to uncover patterns and relationships among features to predict
the edibility of mushrooms with high accuracy (Dua and Graff, 2019).
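To make this concrete, below is a minimal loading sketch in Python. It assumes the raw data file is available at the standard UCI path and that the 23 column names match the dataset documentation; adjust the path if working from a local copy.

```python
# A minimal loading sketch (assumed path and column names, per the UCI
# dataset documentation); the raw file ships without a header row.
import pandas as pd

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "mushroom/agaricus-lepiota.data")
COLUMNS = [
    "class", "cap-shape", "cap-surface", "cap-color", "bruises", "odor",
    "gill-attachment", "gill-spacing", "gill-size", "gill-color",
    "stalk-shape", "stalk-root", "stalk-surface-above-ring",
    "stalk-surface-below-ring", "stalk-color-above-ring",
    "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
    "ring-type", "spore-print-color", "population", "habitat",
]

df = pd.read_csv(URL, header=None, names=COLUMNS)
print(df.shape)                      # expected: (8124, 23)
print(df["class"].value_counts())    # e = edible, p = poisonous
```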
Problem Statement:
The problem at hand is the difficulty of identifying the characteristics that determine whether a mushroom is edible or poisonous, posing a potential risk to foragers and consumers.
Objective:
The goal of this project is to determine the characteristics that most accurately predict whether a mushroom is edible or poisonous.
Data Preprocessing, Cleaning and Transformation
No Outlier Removal
Outlier removal was not required as all features were categorical, and the dataset was already clean and well-structured.
Missing Values and Encoding
I checked for missing values (the dataset contains none) and encoded the categorical columns into numerical values, because machine learning algorithms require numerical inputs; for this I used one-hot encoding.
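As a rough illustration of these preprocessing steps, the sketch below assumes the `df` DataFrame loaded earlier. It checks for missing values and one-hot encodes the categorical features with pandas; `get_dummies` is one possible encoder, and scikit-learn's `OneHotEncoder` would work equally well.

```python
# A preprocessing sketch, assuming `df` is the DataFrame loaded earlier.
import pandas as pd

# 1. Check for missing values (the dataset is documented as complete).
print(df.isnull().sum().sum())       # expected: 0

# 2. Separate the target and one-hot encode the categorical features.
y = (df["class"] == "p").astype(int)                       # 1 = poisonous, 0 = edible
X = pd.get_dummies(df.drop(columns=["class"]), dtype=int)  # one column per category value
print(X.shape)
```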
Exploratory Data Analysis (EDA)
The three graphs highlight important patterns and relationships within the mushroom dataset. The first bar chart
visualizes the distribution of the target variable (class), showing the number of edible (e) and poisonous (p)
mushrooms. This balanced distribution ensures that both classes are well-represented for classification tasks. The
second graph explores the distribution of cap-shape, revealing that certain shapes, such as x and f, are more
prevalent, while others are less common. The third graph, a stacked bar chart, illustrates the relationship between
odor and habitat, showing how certain odors are dominant in specific habitats. For instance, n is prominent in
wooded areas (d), while other odors vary across habitats. Together, these visualizations provide valuable insights
into the distribution and interaction of features, helping to identify key patterns for predicting mushroom edibility.
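One possible way to reproduce the three plots described above, assuming `df` still holds the raw (un-encoded) data with the column names used earlier:

```python
# A sketch of the three EDA plots described above, assuming `df` holds the
# raw (un-encoded) data with the column names used earlier.
import matplotlib.pyplot as plt
import pandas as pd

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Distribution of the target variable (edible vs. poisonous).
df["class"].value_counts().plot(kind="bar", ax=axes[0], title="Class distribution")

# 2. Distribution of cap-shape values.
df["cap-shape"].value_counts().plot(kind="bar", ax=axes[1], title="Cap shape")

# 3. Stacked bar chart: odor counts within each habitat.
pd.crosstab(df["habitat"], df["odor"]).plot(
    kind="bar", stacked=True, ax=axes[2], title="Odor by habitat"
)

plt.tight_layout()
plt.show()
```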
Machine Learning Models
I will begin with a Logistic Regression model as a baseline due to its simplicity and effectiveness in
binary classification tasks. It provides interpretable coefficients, helping to understand the
relationship between features like odor and gill size with the target variable. Next, I will use a
Random Forest Classifier, which is reliable and capable of handling non-linear relationships. It also
provides feature importance metrics, which are crucial for identifying the most influential features,
such as cap shape and bruises. Finally, I will implement XGBoost, a more advanced gradient boosting
algorithm that refines predictions by reducing residual errors and capturing complex interactions
among features. These three models will enable a comprehensive evaluation of the dataset, ensuring
robust and accurate classification of mushrooms.
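A short sketch of how the three candidate models could be instantiated, assuming scikit-learn and the xgboost package are installed; the hyperparameters shown are illustrative defaults, not tuned values.

```python
# A sketch of the three candidate models, assuming scikit-learn and the
# xgboost package are installed; hyperparameters are illustrative, not tuned.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}
```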
Model Evaluation:
For model evaluation, I will use accuracy to measure how well the models correctly classify mushrooms
as edible or poisonous. Additionally, I will consider precision, recall, and the F1-score to evaluate the
performance of each model in distinguishing between the two classes. I will compare these metrics for
Logistic Regression, Random Forest, and XGBoost on the test data to determine which model performs
best (Hossain, Muhammad and Kwon, 2020).
I will train and test the Logistic Regression, Random Forest, and XGBoost models by splitting the dataset into training and testing subsets, using an 80/20 split. The models will be trained on the training data to learn the patterns and relationships between the features and the target variable, and then evaluated on the held-out test data.
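A minimal train-and-evaluate sketch under the 80/20 split described above, assuming the `X`, `y`, and `models` objects from the earlier snippets:

```python
# A minimal train-and-evaluate sketch, assuming `X`, `y`, and `models`
# from the earlier snippets. 80/20 split, stratified on the target.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name}: accuracy = {accuracy_score(y_test, y_pred):.4f}")
    print(classification_report(y_test, y_pred, target_names=["edible", "poisonous"]))
```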
Potential Challenges
My dataset is balanced, so I do not expect major class imbalance issues. However, I will monitor for skewed distributions in the target variable (class). If an imbalance is identified, I will address it using techniques such as oversampling with SMOTE or undersampling the majority class (Ganganwar, 2012).
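If an imbalance did appear, one possible remedy is sketched below; it assumes the imbalanced-learn package and applies SMOTE to the training split only, after one-hot encoding, so the test set remains untouched.

```python
# A hypothetical remedy if the classes were imbalanced, assuming the
# imbalanced-learn package. SMOTE is fitted on the training split only.
from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
print(Counter(y_train_res))          # both classes should now have equal counts
```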
I will handle the categorical variables by converting them into numerical representations
using one-hot encoding for all non-numeric features. This ensures that the machine
learning models can interpret the data accurately and effectively (Ganganwar, 2012).
Conclusion
In conclusion, by leveraging Logistic Regression, Random Forest, and XGBoost models,
I aim to accurately classify mushrooms as edible or poisonous while identifying the key
features driving these classifications. Logistic Regression provides a simple and
interpretable baseline, offering insights into the relationships between features and
the target variable. Random Forest serves as a robust model, highlighting feature
importance and handling non-linear relationships effectively. XGBoost builds on this by
refining predictions through its advanced capabilities in capturing complex interactions
among features. Looking ahead, I plan to improve the models by experimenting with
hyperparameter tuning and exploring additional feature engineering techniques.
Incorporating larger or more diverse datasets in the future could further enhance the
accuracy and reliability of the analysis.
References
Dua, D., & Graff, C. (2019). Mushroom Dataset. [online] UCI Machine Learning Repository. Available at: https://archive.ics.uci.edu/ml/datasets/mushroom (Accessed: 19 November
2024).
Singh, A., Halgamuge, M.N., & Lakshmiganthan, R. (2017). Impact of Different Data Types on Classifier Performance of Random Forest, Naïve Bayes, and K-Nearest Neighbors Algorithms.
International Journal of Advanced Computer Science and Applications, 8(12). doi: https://doi.org/10.14569/ijacsa.2017.081201.
Ganganwar, V. (2012). An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering, 2(4), 42–47.
Available at: https://www.researchgate.net/publication/292018027 (Accessed: 19 November 2024).
Hossain, M.S., Muhammad, G., & Kwon, J. (2020). A survey of big data architectures and machine learning algorithms for internet of things (IoT). Journal of Sensors, 15(8), 260.
Available at: https://www.mdpi.com/1999-5903/15/8/260 (Accessed: 17 November 2024).