
Introduction to the dataset
I selected this dataset because it provides a substantial amount of data, with 8,124 rows, making it ideal for machine learning models as it allows for robust training and
potentially higher accuracy. The dataset, sourced from the UCI Machine Learning Repository, is derived from reliable scientific observations, ensuring its credibility. The
dataset includes entirely categorical data (e.g., cap shape, odor, habitat, and population), offering a rich opportunity for classification tasks. Additionally, the topic is
highly compelling, offering insights into the characteristics that determine whether a mushroom is edible or poisonous. This combination of a rich dataset, a reputable
source, and a meaningful subject makes it both an engaging and impactful project.
Key features or variables/attributes in the dataset:

The dataset consists of 8,124 rows and 23 columns, providing a rich mix of categorical features that describe various mushroom characteristics. Each row represents a unique observation of a
mushroom, while the columns capture attributes such as cap shape, cap surface, cap color, and odor. All data in the dataset is categorical, with no missing values, making it well-suited for classification
tasks. The target column, class, specifies whether a mushroom is edible (e) or poisonous (p). This dataset offers a unique opportunity to uncover patterns and relationships among features to predict
the edibility of mushrooms with high accuracy (Dua and Graff, 2019).
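
A rough sketch of loading and inspecting the data with pandas is shown below. The file name mushrooms.csv and the presence of a header row with a "class" column are assumptions about how the UCI data was exported, so adjust them to match your copy.

import pandas as pd

# Load the exported mushroom dataset (assumed file name).
data = pd.read_csv("mushrooms.csv")

print(data.shape)                     # expected: (8124, 23)
print(data["class"].value_counts())   # counts of edible (e) vs. poisonous (p)
print(data.dtypes.value_counts())     # every column is categorical (object dtype)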
Problem definition

Problem Statement:

The problem at hand is the difficulty of identifying the characteristics that determine whether a mushroom is edible or poisonous, posing a potential risk to foragers and consumers.

Objective:

The goal of this project is to determine the characteristics that most accurately predict whether a mushroom is edible or poisonous.
Data Preprocessing: Cleaning and Transformation

Checked for Missing Values


I started by checking for any missing values in the mushroom dataset using data.isnull().any(). Fortunately, there were no missing values in any of the
columns, so no imputation or handling of null entries was required.

Encoded Categorical Columns


All columns in the dataset were categorical. I used OneHotEncoder to convert these columns into numerical representations. This transformation was essential because machine learning algorithms require numerical data to process and interpret the features effectively (a short code sketch follows this list).

Dropped Irrelevant Columns


Since all features were relevant to the classification task, no columns were removed. However, care was taken to ensure that no duplicates or redundant
data were present.

Validated Feature Distribution


I checked the distribution of categorical features to ensure there were no anomalies or imbalances that would adversely affect the model.

No Outlier Removal
Outlier removal was not required as all features were categorical, and the dataset was already clean and well-structured.
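
The following sketch walks through the missing-value check and the OneHotEncoder step described in this list. It assumes the DataFrame data from the loading snippet above and the target column name class from the dataset description.

from sklearn.preprocessing import OneHotEncoder

# 1. Check for missing values; the dataset has none, so no imputation is needed.
print(data.isnull().any())

# 2. Separate the target from the features.
X = data.drop(columns=["class"])
y = data["class"]

# 3. One-hot encode every categorical feature column.
encoder = OneHotEncoder(handle_unknown="ignore")
X_encoded = encoder.fit_transform(X)
print(X_encoded.shape)  # one binary column per category level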
Data Preprocessing: Cleaning and Transformation (summary)

Checking for missing values.
Encoding categorical columns into numerical values, because ML algorithms require numerical inputs; I used OneHotEncoder.
Exploratory Data Analysis (EDA)
The three graphs highlight important patterns and relationships within the mushroom dataset. The first bar chart
visualizes the distribution of the target variable (class), showing the number of edible (e) and poisonous (p)
mushrooms. This balanced distribution ensures that both classes are well-represented for classification tasks. The
second graph explores the distribution of cap-shape, revealing that certain shapes, such as x and f, are more
prevalent, while others are less common. The third graph, a stacked bar chart, illustrates the relationship between
odor and habitat, showing how certain odors are dominant in specific habitats. For instance, n is prominent in
wooded areas (d), while other odors vary across habitats. Together, these visualizations provide valuable insights
into the distribution and interaction of features, helping to identify key patterns for predicting mushroom edibility.
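
One possible way to reproduce the three plots with matplotlib and seaborn is sketched below; the column names class, cap-shape, odor, and habitat follow the dataset description, and the DataFrame data is assumed to be loaded already.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# 1. Distribution of the target variable (edible vs. poisonous).
sns.countplot(data=data, x="class", ax=axes[0])
axes[0].set_title("Class distribution")

# 2. Distribution of cap shapes.
sns.countplot(data=data, x="cap-shape", ax=axes[1])
axes[1].set_title("Cap-shape distribution")

# 3. Stacked bar chart of odor within each habitat.
pd.crosstab(data["habitat"], data["odor"]).plot(kind="bar", stacked=True, ax=axes[2])
axes[2].set_title("Odor by habitat")

plt.tight_layout()
plt.show()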
Machine learning models

I will begin with a Logistic Regression model as a baseline due to its simplicity and effectiveness in
binary classification tasks. It provides interpretable coefficients, helping to understand the
relationship between features like odor and gill size with the target variable. Next, I will use a
Random Forest Classifier, which is reliable and capable of handling non-linear relationships. It also
provides feature importance metrics, which are crucial for identifying the most influential features,
such as cap shape and bruises. Finally, I will implement XGBoost, a more advanced gradient boosting
algorithm that refines predictions by reducing residual errors and capturing complex interactions
among features. These three models will enable a comprehensive evaluation of the dataset, ensuring
robust and accurate classification of mushrooms.
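
A minimal sketch of how the three models could be instantiated is shown below; the hyperparameter values are illustrative defaults rather than tuned settings, and XGBoost is assumed to be installed as the xgboost package.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Baseline, tree ensemble, and gradient boosting models kept in one dict
# so they can be trained and compared in a single loop.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}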
Model evaluation:
For model evaluation, I will use accuracy to measure how well the models correctly classify mushrooms
as edible or poisonous. Additionally, I will consider precision, recall, and the F1-score to evaluate the
performance of each model in distinguishing between the two classes. I will compare these metrics for
Logistic Regression, Random Forest, and XGBoost on the test data to determine which model performs
best (Hossain, Muhammad and Kwon, 2020).

I will train and test the Random Forest and XGBoost models by splitting the dataset into training and testing subsets, typically using an 80/20 split. The models will be trained on the training data to learn patterns and relationships between features and the target variable.
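
The sketch below illustrates the 80/20 split and the metrics mentioned above. It assumes X_encoded and y from the preprocessing snippet and the models dictionary from the previous slide; the string labels are converted to 0/1 because recent XGBoost versions expect numeric classes.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report

# Encode the target: e -> 0, p -> 1.
y_numeric = LabelEncoder().fit_transform(y)

# 80/20 train/test split, stratified so both classes keep their proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y_numeric, test_size=0.2, random_state=42, stratify=y_numeric
)

for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(f"{name}: accuracy = {accuracy_score(y_test, predictions):.4f}")
    print(classification_report(y_test, predictions))  # precision, recall, F1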
Potential Challenges

My dataset has a balanced target distribution, so I do not expect major class imbalance issues. However, I will monitor for skewed distributions in the target variable (class). If imbalance is
identified in any future classification tasks, I will address it using techniques like oversampling with
SMOTE or undersampling the majority class (Ganganwar, 2012).
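
If an imbalance were found, oversampling with SMOTE could look roughly like the sketch below; it assumes the imbalanced-learn package and the training split from the evaluation snippet, and SMOTE is applied to the training data only.

from collections import Counter
from imblearn.over_sampling import SMOTE

print("Before resampling:", Counter(y_train))

# Generate synthetic minority-class samples until the classes are balanced.
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

print("After resampling:", Counter(y_resampled))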

I will handle the categorical variables by converting them into numerical representations using one-hot encoding for all non-numeric features. This ensures that the machine learning models can interpret the data accurately and effectively (Ganganwar, 2012).
Conclusion
In conclusion, by leveraging Logistic Regression, Random Forest, and XGBoost models,
I aim to accurately classify mushrooms as edible or poisonous while identifying the key
features driving these classifications. Logistic Regression provides a simple and
interpretable baseline, offering insights into the relationships between features and
the target variable. Random Forest serves as a robust model, highlighting feature
importance and handling non-linear relationships effectively. XGBoost builds on this by
refining predictions through its advanced capabilities in capturing complex interactions
among features. Looking ahead, I plan to improve the models by experimenting with
hyperparameter tuning and exploring additional feature engineering techniques.
Incorporating larger or more diverse datasets in the future could further enhance the
accuracy and reliability of the analysis.
References

Dua, D. and Graff, C. (2019). Mushroom Dataset. [online] UCI Machine Learning Repository. Available at: https://archive.ics.uci.edu/ml/datasets/mushroom (Accessed: 19 November 2024).

Singh, A., N., M. and Lakshmiganthan, R. (2017). Impact of Different Data Types on Classifier Performance of Random Forest, Naïve Bayes, and K-Nearest Neighbors Algorithms. International Journal of Advanced Computer Science and Applications, 8(12). doi: https://doi.org/10.14569/ijacsa.2017.081201.

Ganganwar, V. (2012). An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering, 2(4), pp. 42–47. Available at: https://www.researchgate.net/publication/292018027 (Accessed: 19 November 2024).

Hossain, M.S., Muhammad, G. and Kwon, J. (2020). A survey of big data architectures and machine learning algorithms for internet of things (IoT). Journal of Sensors, 15(8), 260. Available at: https://www.mdpi.com/1999-5903/15/8/260 (Accessed: 17 November 2024).
