
Exploratory Data Analysis and Case

Uploaded by

shadowalker2276

EXPLORATORY DATA ANALYSIS AND CASE

PREDICTION ON
WINE QUALITY DATASET

ENGINEERING CLINIC PROJECT REPORT

Submitted by

1. S. MIRUTHUVIKASINI 18BCS021
2. N. ABINAYASRI 18BCS033
3. R. PRAGATHI 18BCS038
4. N. JENFERO 18BCS043
5. S. RAGHAVI 18BCS056

In partial fulfillment for the award of the degree

of

BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING

KUMARAGURU COLLEGE OF TECHNOLOGY

COIMBATORE-641 049
(An Autonomous Institution Affiliated to Anna University, Chennai)

December 2020

Verified by

(V. Sudha)

TABLE OF CONTENTS

ABSTRACT
1. INTRODUCTION
    1.1 CONCEPTUAL STUDY OF THE PROJECT
    1.2 OBJECTIVE OF THE PROJECT
    1.3 SCOPE OF THE PROJECT
2. LITERATURE REVIEW
    2.1 LITERATURE REVIEW OF JOURNALS
3. PROBLEM DEFINITION
4. LOADING THE DATASET
    4.1 BASIC DATA EXPLORATION
    4.2 DATA CLEANING
        4.2.1 CHECKING FOR NULL VALUE
        4.2.2 CHECKING FOR OUTLIERS
        4.2.3 TREATING THE OUTLIER
    4.3 DATA VISUALIZATION
        4.3.1 HISTOGRAM
        4.3.2 STRIPPLOT
        4.3.3 COUNTPLOT
    4.4 NORMALIZATION
    4.5 PREDICTION OF TARGET VARIABLE
5. CONCLUSION
6. REFERENCE LINK

ABSTRACT

Machine learning has improved dramatically in the past few years. Here we
apply machine learning techniques to predict the quality of wine from
physicochemical data. In our project, the Wine Quality (red) dataset is used
to analyze and infer information about the data. We use Python libraries such
as pandas, numpy, matplotlib, seaborn and scikit-learn to explore and
validate the dataset. Data visualization plays a vital role in making data
more natural for the human mind to comprehend, and therefore makes it easier
to identify trends, patterns, and outliers within large data sets. We use
visualization to draw clear inferences and to identify the outliers that we
then remove. We trained a machine learning model, DecisionTreeClassifier, to
predict the quality of wine from the available physicochemical data. Training
is done on 80 percent of the data and testing on the remaining 20 percent.
The results are reported in terms of accuracy.

1. INTRODUCTION
1.1 INTRODUCTION

The red wine industry has grown rapidly in recent years as social drinking is
on the rise. Nowadays, industry players use product quality certifications to
promote their products. Certification is a time-consuming process that
requires assessment by human experts, which makes it very expensive. Another
vital factor in red wine certification and quality assessment is
physicochemical testing, which is laboratory-based and considers factors
like acidity, pH level, sugar, and other chemical properties.

Our analysis uses the Red Wine Quality Data Set, available on Kaggle:
https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009

The wine samples were obtained from the north of Portugal to model red wine
quality based on physicochemical tests. The dataset contains a total of 12
variables, which were recorded for 1,599 observations.
Attribute Details:

Input variables (based on physicochemical tests):

1 - fixed acidity

2 - volatile acidity

3 - citric acid

4 - residual sugar

5 - chlorides

6 - free sulfur dioxide

7 - total sulfur dioxide

8 - density

9 - pH

10 - sulphates

11 - alcohol

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

With the advent of machine learning techniques, it is now possible to
classify wines and to determine the importance of each chemical analysis
parameter, including which ones can be ignored to reduce cost. Comparing
performance across different feature sets also helps to classify wines more
distinctly. In this report, a machine learning approach is proposed: a model
is trained on the dataset and tested on its ability to predict the quality of
wine from the physicochemical data.

1.2 OBJECTIVES OF THE PROJECT

The main objective of this project is to study the wine quality prediction
dataset, which is available on Kaggle, and to explore the Python libraries
that help with exploratory data analysis.

To analyse the data by using various data visualization techniques.

To preprocess the data by removing the NULL values and treating the missing
values.

To identify and remove the outliers by using various Python techniques.

Finally, to predict the quality of wine by using machine learning models.

1.3 SCOPE OF THE PROJECT

Learning the attributes of the dataset and understanding the relationships
between them.

Cleaning the data by removing the NULL values and treating outliers.

Visualizing the data to understand it more precisely, by exploring Python
libraries such as numpy, pandas and seaborn.

Understanding various machine learning algorithms and using them
appropriately.

Splitting the data into train and test sets and predicting wine quality on
the test set, using alternative machine learning models where appropriate.

2. LITERATURE REVIEW

2.1 LITERATURE REVIEW OF JOURNALS

Title of the paper: Wine Quality

Research Focus: Exploratory Data Analysis (EDA) in Python for the analysis
of the Wine Quality dataset.

Student level: Undergraduate

Abstract:

Wine classification is a difficult task since taste is the least understood
of the human senses. A good wine quality prediction can be very useful in the
certification phase, since currently the sensory analysis is performed by
human tasters, being clearly a subjective approach. An automatic predictive
system can be integrated into a decision support system, helping the speed
and quality of the performance. Furthermore, a feature selection process can
help to analyze the impact of the analytical tests. If it is concluded that several
input variables are highly relevant to predict the wine quality, since in the
production process some variables can be controlled, this information can be
used to improve the wine quality.

The following research papers are related to our dataset; their details are
summarized below.

Research paper 1:
Selection of important features and predicting wine
quality using machine learning techniques

Y Gupta - Procedia Computer Science, 2018 – Elsevier

Nowadays, industries are using product quality certifications to promote
their products. This is a time taking process and requires the assessment given
by human experts which makes this process very expensive. This paper
explores the usage of machine learning techniques such as linear regression,
neural network and support vector machine for product quality in two ways.
Firstly, determine the dependency of target variable on independent
variables and secondly, predicting the value of target variable. In this paper,
linear regression is used to determine the dependency of target variable on
independent variables. On the basis of computed dependency, important
variables are selected those make significant impact on dependent variable.
Further, neural network and support vector machine are used to predict the
values of dependent variable. All the experiments are performed on Red Wine
and White Wine datasets. This paper proves that the better prediction can be
made if selected features (variables) are being considered rather than
considering all the features.

Proposed methodology:

o Linear regression
o Neural network
o Support vector machine

Experimental results and analysis:

o Determining important features for prediction
o Predicting value of dependent variable (Quality)

Conclusion and future directions:

The interest has been increased in wine industry in recent years which
demands growth in this industry. Therefore, companies are investing in new
technologies to improve wine production and selling. In this direction, wine
quality certification plays a very important role for both processes and it
requires wine testing by human experts. This paper explores the usage of
machine learning techniques in two ways. Firstly, how linear regression
determines important features for prediction. Secondly, the usage of neural
network and support vector machine in predicting the values. The benchmark
Wine dataset is used for all experiments. This dataset has two parts: Red Wine
and White Wine data. Red wine contains 1599 samples and white wine
contains 4898 samples. Both red and white wine dataset consists of 12
physicochemical characteristics. One (quality) is dependent variable and
other 11 are predictors. The experiments shows that the value of dependent
variable can be predicted more accurately if only important features are
considered in prediction rather than considering all features. In future, large
dataset can be taken for experiments and other machine learning techniques
may be explored for wine quality prediction.

Reference link:
https://www.sciencedirect.com/science/article/pii/S1877050917328053

Research paper 2:
Assessing wine quality using a decision tree

S Lee, J Park, K Kang - 2015 IEEE International Symposium on …, 2015 -
ieeexplore.ieee.org

Even though wine-drinkers generally agree that wines may be ranked by
quality, wine-tasting is famously subjective. There have been many attempts
to construct a more methodical approach to the assessment of wines. We
propose a method of assessing wine quality using a decision tree, and test it
against the wine-quality dataset from the UC Irvine Machine Learning
Repository. Results are 60% in agreement with traditional assessment
techniques.

Reference link: https://ieeexplore.ieee.org/abstract/document/7302752

Research paper 3:

Prediction of Quality for Different Type of Wine based on Different Feature
Sets Using Supervised Machine Learning Techniques

S Aich, AA Al-Absi, KL Hui… - 2019 21st International …, 2019 -
ieeexplore.ieee.org

In recent years, most of the industries promoting their products based on
the quality certification they received on the products. The traditional way of
assessing the product quality is time consuming, however with the invent of
machine learning techniques the processes has become more efficient and
consumed less time than before. In this paper we have explored, some of the
machine learning techniques to assess the quality of wine based on the
attributes of wine that depends on quality. We have used white wine and red
wine quality dataset for this research work. We have used different feature
selection technique such as genetic algorithm (GA) based feature selection
and simulated annealing (SA) based feature selection to check the prediction
performance. We have used different performance measure such as
accuracy, sensitivity, specificity, positive predictive value, negative predictive
value for comparison using different feature sets and different supervised
machine learning techniques. We have used nonlinear, linear and
probabilistic classifiers. We have found that feature selection-based feature
sets able to provide better prediction than considering all the features for
performance prediction. We have found accuracy ranging from 95.23% to
98.81% with different feature sets. This analysis will help the industries to
access the quality of the products at less time and more efficient way.

Reference link: https://ieeexplore.ieee.org/abstract/document/8702017

Research paper 4:

The classification of wine according to their physicochemical qualities

Y Er, A Atasoy - International Journal of Intelligent Systems and …, 2016 -
ijisae.org

The main purpose of this study is to predict wine quality based on
physicochemical data. In this study, two large separate data sets which were
taken from UC Irvine Machine Learning Repository were used. These data sets
contain 1599 instances for red wine and 4898 instances for white wine with
11 features of physicochemical data such as alcohol, chlorides, density, total
sulfur dioxide, free sulfur dioxide, residual sugar, and pH. First, the instances
were successfully classified as red wine and white wine with the accuracy of
99.5229% by using Random Forests Algorithm. Then, the following three
different data mining algorithms were used to classify the quality of both red
wine and white wine: k-nearest-neighbourhood, random forests and support
vector machines. There are 6 quality classes of red wine and 7 quality classes
of white wine. The most successful classification was obtained by using
Random Forests Algorithm. In this study, it is also observed that the use of
principal component analysis in the feature selection increases the success
rate of classification in Random Forests Algorithm.

Reference link: https://www.ijisae.org/IJISAE/article/view/914

3. PROBLEM DEFINITION

This data will allow us to create different regression models to determine
how different independent variables help predict our dependent variable,
quality. Knowing how each variable will impact the wine quality will help
producers, distributors, and businesses in the wine industry better assess
their production, distribution, and pricing strategy.

4. LOADING THE DATASET

We use pandas to load and analyse the dataset.
import pandas as pd
We use numpy for scientific computing.
import numpy as np
We use matplotlib for data visualization; it provides bar charts, pie charts,
line plots, scatter plots and so on.
import matplotlib.pyplot as plt
We use seaborn, a data visualization library based on matplotlib. It
provides a high-level interface for drawing attractive and informative
statistical graphics.
import seaborn as sns
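
With the libraries imported, the dataset can be read into a pandas DataFrame. The sketch below is only an illustration: the rows are made up to mimic the column layout listed in the attribute details, and the file name mentioned in the comment is an assumption about how the Kaggle download is saved locally.

```python
import io

import pandas as pd

# Illustrative stand-in for the Kaggle CSV: made-up rows that only
# mimic the 12-column layout of the red wine file.
SAMPLE_CSV = """fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.8,0.88,0.00,2.6,0.098,25,67,0.9968,3.20,0.68,9.8,5
11.2,0.28,0.56,1.9,0.075,17,60,0.9980,3.16,0.58,9.8,6
"""

# With the real file this would be something like
# pd.read_csv("winequality-red.csv"); that file name is an assumption.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)  # (number of rows, number of columns)
```

Reading from an in-memory string keeps the sketch self-contained; replacing the `io.StringIO(...)` argument with a file path is the only change needed for a real file.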

4.1 BASIC DATA EXPLORATION

• Head of the dataset - The head function displays the first records in
the data set. By default, pandas shows the first 5 records.

• Tail of the dataset - The tail function displays the last records in the
data set. By default, pandas shows the last 5 records.

• Shape of the dataset - The shape attribute gives the dimensions of the
data (rows, columns).

• Info of the dataset - info() shows summary information about the
data, including the data type of each attribute.

• Summary of the dataset - The describe method shows how the numerical
values are distributed (count, mean, standard deviation, minimum,
quartiles, maximum).

• Columns of the dataset - The columns attribute lists the names of the
columns the dataset contains.

• Unique values of the dataset - The unique function returns the
unique values in a specific column of the dataset.

• Number of unique values - The nunique function returns the number of
unique values in each column of the dataset.
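
The exploration steps listed above can be sketched in one short script. The DataFrame here is a small made-up sample, not the real dataset, so that the snippet runs on its own.

```python
import pandas as pd

# Made-up rows for illustration; the real dataset has 1,599 rows
# and 12 columns.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 10.5, 11.2, 12.0],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.30, 3.39],
    "quality": [5, 5, 6, 6, 7, 5],
})

print(df.head())                # first 5 rows by default
print(df.tail(2))               # last 2 rows (the default is also 5)
print(df.shape)                 # (rows, columns)
df.info()                       # dtypes and non-null counts, printed directly
print(df.describe())            # count, mean, std, min, quartiles, max
print(df.columns.tolist())      # column names
print(df["quality"].unique())   # distinct values in one column
print(df.nunique())             # number of distinct values per column
```

Note that `shape` and `columns` are attributes rather than methods, so they are written without parentheses.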

4.2 DATA CLEANING

4.2.1 CHECKING FOR NULL VALUES

TREATING THE NULL VALUES:

Here we treated the null values with spaces, i.e. we replaced the
null values with empty space.
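
A minimal sketch of the null check and the replacement described above, using made-up values. The median fill at the end is not the method used in this report; it is shown only as a common alternative that keeps the columns numeric.

```python
import numpy as np
import pandas as pd

# Made-up values with two gaps, for illustration only.
df = pd.DataFrame({
    "pH": [3.51, np.nan, 3.26, 3.16],
    "alcohol": [9.4, 9.8, np.nan, 9.8],
})

# Number of missing values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# Replacement with a blank string, as described in the text.
# The columns are cast to object first, because a column that holds
# both numbers and strings can no longer be a numeric column.
blank_filled = df.astype(object).fillna(" ")

# Alternative (an assumption, not the report's method): fill each gap
# with the column median so the columns stay numeric.
median_filled = df.fillna(df.median())
print(median_filled)
```

The trade-off is that blank-filled columns cannot be passed directly to numeric operations such as `describe()` or to a scikit-learn model, while median-filled columns can.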

4.2.2 CHECKING FOR OUTLIERS

USING BOXPLOT:

4.2.3 TREATING THE OUTLIERS
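
One common way to treat the points that a boxplot draws as outliers is the interquartile range (IQR) rule, which is the rule behind the default boxplot whiskers. The sketch below assumes this rule; the helper name `iqr_bounds`, the multiplier of 1.5, and the sample values are illustrative choices, and the report's own treatment may differ.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a column."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up values; 15.5 is deliberately extreme.
df = pd.DataFrame(
    {"residual sugar": [1.9, 2.6, 2.3, 1.9, 2.0, 2.2, 15.5, 2.1]}
)

low, high = iqr_bounds(df["residual sugar"])
inside = df["residual sugar"].between(low, high)

# Option 1: drop the rows that fall outside the fences.
removed = df[inside]

# Option 2: keep every row but cap the values at the fences.
capped = df["residual sugar"].clip(lower=low, upper=high)

print(len(df), len(removed))
```

Dropping rows shrinks the dataset, while capping keeps its size but changes the extreme values; which option suits the data is a judgment call.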

4.3 DATA VISUALIZATION

Data visualization is the graphical representation of information and
data. With visual elements like charts, graphs, and maps, data
visualization tools provide an accessible way to see and understand
trends, outliers, and patterns in data.

4.3.1 HISTOGRAM:

4.3.2 STRIPPLOT:

4.3.3 COUNTPLOT:

4.4 NORMALIZATION

Normalization is a scaling technique in which values are shifted and
rescaled so that they end up ranging between 0 and 1. It is also known as
Min-Max scaling.
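
Min-Max scaling maps each value x in a column to (x - min) / (max - min). A minimal sketch with pandas follows; the helper name `min_max_scale` and the sample values are illustrative.

```python
import pandas as pd


def min_max_scale(frame):
    """Rescale each column to [0, 1] via (x - min) / (max - min)."""
    return (frame - frame.min()) / (frame.max() - frame.min())


# Made-up values for illustration only.
df = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 13.0],
    "pH": [3.0, 3.2, 3.4, 3.5],
})

scaled = min_max_scale(df)
print(scaled)
```

A column whose values are all equal would give a zero denominator, so such columns need separate handling. scikit-learn's `MinMaxScaler` applies the same formula with its default settings.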
4.5 PREDICTION OF TARGET VARIABLE
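
The abstract describes training a DecisionTreeClassifier on 80 percent of the data and testing on the remaining 20 percent. The sketch below follows that outline on synthetic data; the random features, the labelling rule, and the `random_state` values are assumptions made only so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: random features and a made-up quality rule.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "alcohol": rng.uniform(8.0, 14.0, 200),
    "volatile acidity": rng.uniform(0.1, 1.2, 200),
})
y = (X["alcohol"] > 10.5).astype(int) + 5  # labels are 5 or 6

# 80 percent of the rows for training, 20 percent for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Fraction of test rows whose predicted quality matches the true label.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```

With the real dataset, `X` would hold the 11 physicochemical columns and `y` the `quality` column.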

CONCLUSION:

By looking into the details, we can see that good quality wines have higher
levels of alcohol on average, have a lower volatile acidity on average, higher
levels of sulphates on average, and higher levels of residual sugar on average.

Reference link:

● https://towardsdatascience.com/predicting-wine-quality-with-several-classification-techniques-179038ea6434
● https://www.kaggle.com/scsaurabh/red-wine-quality-analysis-python
● https://www.kaggle.com/sgus1318/wine-quality-exploration-and-analysis
● https://medium.com/analytics-vidhya/wine-quality-prediction-with-python-695939d34d87
● https://medium.com/datadriveninvestor/regression-from-scratch-wine-quality-prediction-d61195cb91c8
● https://github.com/vikrantkakad/Red-Wine-Quality-Analysis
● https://dzone.com/articles/predicting-wine-quality-with-several-classificatio
● https://datauab.github.io/red_wine_quality/
