Exploratory Data Analysis and Case
PREDICTION ON
WINE QUALITY DATASET
Submitted by
1. S.MIRUTHUVIKASINI 18BCS021
2. N.ABINAYASRI 18BCS033
3. R.PRAGATHI 18BCS038
4. N.JENFERO 18BCS043
5. S.RAGHAVI 18BCS056
of
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING
COIMBATORE-641 049
(An Autonomous Institution Affiliated to Anna University, Chennai)
December 2020
Verified by
(V. Sudha)
TABLE OF CONTENTS
2. LITERATURE REVIEW
2.1 LITERATURE REVIEW OF JOURNALS
3. PROBLEM DEFINITION
4. LOADING THE DATASET
4.1 BASIC DATA EXPLORATION
4.2 DATA CLEANING
4.2.1 CHECKING FOR NULL VALUES
4.2.2 CHECKING FOR OUTLIERS
4.2.3 TREATING THE OUTLIERS
4.3 DATA VISUALIZATION
4.3.1 HISTOGRAM
4.3.2 STRIPPLOT
4.3.3 COUNTPLOT
4.4 NORMALIZATION
4.5 PREDICTION OF TARGET VARIABLE
5. CONCLUSION
6. REFERENCE LINK
ABSTRACT
1. INTRODUCTION
1.1 INTRODUCTION
The red wine industry has recently shown exponential growth as social drinking
is on the rise. Nowadays, industry players use product quality certifications
to promote their products. Certification is a time-consuming process that
requires assessment by human experts, which makes it very expensive. Another
vital factor in red wine certification and quality assessment is
physicochemical tests, which are laboratory-based and consider factors such as
acidity, pH level, sugar, and other chemical properties.
Our analysis will use the Red Wine Quality Data Set, available on Kaggle:
https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009
The wine samples were obtained from the north of Portugal to model red wine
quality based on physicochemical tests. The dataset contains a total of 12
variables, which were recorded for 1,599 observations.
Attribute Details:
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
12 - quality (output variable)
1.2 OBJECTIVES OF THE PROJECT
The main objective of this project is to study the Red Wine Quality dataset
available on Kaggle and to explore the Python libraries that support
exploratory data analysis.
To preprocess the data by checking for NULL (missing) values and treating any
that are found.
To clean the data by detecting and treating outliers.
To split the data into train and test sets and predict wine quality on the
test set, comparing alternative machine learning models.
2. LITERATURE REVIEW
Abstract:
This section summarises research journal papers that are related to our
dataset.
Research paper 1:
Selection of important features and predicting wine
quality using machine learning techniques
Proposed methodology:
o Linear regression
o Neural network
o Support vector machine
Interest in the wine industry has increased in recent years, which demands
growth in this industry. Therefore, companies are investing in new
technologies to improve wine production and selling. In this direction, wine
quality certification plays a very important role for both processes, and it
requires wine testing by human experts. This paper explores the usage of
machine learning techniques in two ways: first, how linear regression
determines the important features for prediction; second, how a neural
network and a support vector machine predict the values. The benchmark Wine
dataset is used for all experiments. This dataset has two parts: red wine and
white wine data. Red wine contains 1599 samples and white wine contains 4898
samples. Both datasets consist of 12 physicochemical characteristics: one
(quality) is the dependent variable and the other 11 are predictors. The
experiments show that the value of the dependent variable can be predicted
more accurately if only the important features are considered rather than all
features. In future, a larger dataset can be taken for experiments and other
machine learning techniques may be explored for wine quality prediction.
Reference link:
https://www.sciencedirect.com/science/article/pii/S1877050917328053
Research paper 2:
Assessing wine quality using a decision tree
Research paper 3:
Research paper 4:
3. PROBLEM DEFINITION
4. LOADING THE DATASET
4.1 BASIC DATA EXPLORATION
• Head of the dataset- The head() function displays the first records in
the data set. By default, pandas shows only the first 5 records.
• Tail of the dataset- The tail() function displays the last records in the
data set. By default, pandas shows only the last 5 records.
• Info of the dataset- info() is used to check information about the data,
including the data type of each attribute.
• Summary of the dataset- The describe() method shows how the values of the
numerical columns are spread (count, mean, standard deviation, minimum,
quartiles, and maximum).
• Columns of the dataset- The columns attribute shows the names of the
columns the dataset contains.
• Unique values of the dataset- The unique() function shows the unique
values in a specific column of the dataset.
• Number of unique values- The nunique() function shows the number of
unique values each column of the dataset contains.
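The exploration calls listed above can be tried on a small example. The sketch below is only illustrative: it reads a tiny in-memory CSV whose values are made up, and the file name mentioned in the comment (winequality-red.csv) is an assumption about how the Kaggle download would be saved.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the Kaggle CSV (made-up, illustrative values).
# With the real file one would call pd.read_csv("winequality-red.csv");
# that file name is an assumption.
CSV_TEXT = """alcohol,pH,quality
9.4,3.51,5
9.8,3.20,5
9.8,3.26,5
9.8,3.16,6
10.5,3.30,7
11.2,3.35,6
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

print(df.head())                 # first 5 rows by default
print(df.tail(2))                # last 2 rows (the default is also 5)
df.info()                        # column dtypes and non-null counts
print(df.describe())             # count, mean, std, min, quartiles, max
print(list(df.columns))          # column names
print(df["quality"].unique())    # distinct values in one column
print(df.nunique())              # number of distinct values per column
```

On the real dataset, df.shape would be expected to report the 1,599 observations and 12 variables described in the introduction.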
4.2 DATA CLEANING
4.2.1 CHECKING FOR NULL VALUES
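A null-value check of this kind is usually done with isnull(). The following is a minimal sketch on a toy frame with one deliberately missing value; the values are made up, and it makes no claim about what the check reports on the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (illustrative only).
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, np.nan, 10.5],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "quality": [5, 5, 6, 7],
})

null_counts = df.isnull().sum()              # missing values per column
has_nulls = bool(df.isnull().values.any())   # True if any cell is missing
print(null_counts)

# One possible treatment: drop incomplete rows. Filling with the column
# median is another common option.
df_clean = df.dropna()
```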
4.2.2 CHECKING FOR OUTLIERS
USING BOXPLOT:
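A boxplot-based outlier check might look like the sketch below. The column name is taken from the attribute list, but the values are made up, and the 1.5 * IQR fences are matplotlib's default whisker rule rather than a threshold stated in this report.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy column with one obviously extreme value (illustrative only).
df = pd.DataFrame({"residual sugar": [1.9, 2.6, 2.3, 1.9, 2.0, 2.2, 15.5]})

# Points drawn beyond the whiskers are the outlier candidates.
fig, ax = plt.subplots()
ax.boxplot(df["residual sugar"])
ax.set_ylabel("residual sugar")

# The same 1.5 * IQR rule that the default whiskers are based on.
q1 = df["residual sugar"].quantile(0.25)
q3 = df["residual sugar"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["residual sugar"] < lower) | (df["residual sugar"] > upper)]
print(outliers)
```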
4.2.3 TREATING THE OUTLIERS
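The report does not state which treatment was applied, so the following is only one possible approach: capping values at the 1.5 * IQR fences. The helper name cap_outliers_iqr is hypothetical and the values are made up.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Toy column with one extreme value (illustrative only).
df = pd.DataFrame({"chlorides": [0.07, 0.08, 0.09, 0.08, 0.07, 0.10, 0.61]})
df["chlorides_capped"] = cap_outliers_iqr(df["chlorides"])
print(df)
```

Capping keeps every row, which matters for a dataset of this size; dropping the affected rows or replacing them with the median are alternatives with different trade-offs.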
4.3 DATA VISUALIZATION
4.3.1 HISTOGRAM:
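A histogram of a single attribute can be drawn as in this sketch. The alcohol values are made up and the bin count of 4 is an arbitrary choice for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy values (illustrative only).
df = pd.DataFrame({"alcohol": [9.4, 9.8, 9.8, 9.8, 10.5, 11.2, 12.8, 9.5]})

fig, ax = plt.subplots()
# hist() returns the count per bin and the bin edges it used.
counts, bin_edges, _ = ax.hist(df["alcohol"], bins=4)
ax.set_xlabel("alcohol")
ax.set_ylabel("number of wines")
```

Calling df.hist() instead would draw one such panel for every numeric column at once.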
4.3.2 STRIPPLOT:
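A strip plot shows every observation of a numeric attribute within each category. The sketch below assumes seaborn is used and plots alcohol against quality; both the pairing of columns and the values are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import pandas as pd
import seaborn as sns

# Toy values (illustrative only).
df = pd.DataFrame({
    "quality": [5, 5, 5, 6, 6, 7, 7, 8],
    "alcohol": [9.4, 9.8, 9.5, 10.5, 10.0, 11.2, 11.8, 12.8],
})

# One point per wine; jitter spreads points that share a quality score.
ax = sns.stripplot(x="quality", y="alcohol", data=df, jitter=True)
ax.set_title("Alcohol by quality score")
```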
4.3.3 COUNTPLOT:
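A count plot shows how many observations fall into each category, which makes it a natural fit for the quality column. This is a sketch assuming seaborn, with made-up values; value_counts() gives the same numbers as a table.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import pandas as pd
import seaborn as sns

# Toy quality scores (illustrative only).
df = pd.DataFrame({"quality": [5, 5, 5, 6, 6, 7, 7, 8]})

# One bar per quality score; bar height is the number of wines.
ax = sns.countplot(x="quality", data=df)

# The same counts as a table, ordered by quality score.
counts = df["quality"].value_counts().sort_index()
print(counts)
```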
4.4 NORMALIZATION
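The report does not say which normalization was used. A common choice is min-max scaling, which maps each column to the range [0, 1] using (x - min) / (max - min). The sketch below assumes that choice; the helper name min_max_normalize is hypothetical and the values are made up.

```python
import pandas as pd


def min_max_normalize(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale every column to the [0, 1] range: (x - min) / (max - min)."""
    return (frame - frame.min()) / (frame.max() - frame.min())


# Toy feature columns (illustrative only).
features = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 13.0],
    "pH": [3.0, 3.2, 3.4, 3.5],
})
scaled = min_max_normalize(features)
print(scaled)
```

A column whose values are all identical would produce a division by zero here, so such columns need to be handled separately. The target column (quality) is normally left out of the scaling.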
5. CONCLUSION
Comparing the attribute averages across quality levels, we can see that
good-quality wines have, on average, higher levels of alcohol, lower volatile
acidity, higher levels of sulphates, and higher levels of residual sugar.
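A comparison of this kind can be computed with a group-by on a quality flag. In the sketch below, the threshold of 7 for a "good" wine is an assumption, and the toy values were chosen to mirror the stated pattern, so they demonstrate the method only and prove nothing about the real data.

```python
import pandas as pd

# Toy values chosen to mirror the stated pattern (illustrative only).
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.0, 11.5, 12.0, 12.5],
    "volatile acidity": [0.70, 0.66, 0.60, 0.40, 0.35, 0.30],
    "quality": [5, 5, 6, 7, 7, 8],
})

# Assumption: call a wine "good" when its quality score is 7 or higher.
df["good"] = df["quality"] >= 7

# Average of each attribute for good (True) and other (False) wines.
means = df.groupby("good")[["alcohol", "volatile acidity"]].mean()
print(means)
```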
6. REFERENCE LINK
● https://towardsdatascience.com/predicting-wine-quality-with-several-classification-techniques-179038ea6434
● https://www.kaggle.com/scsaurabh/red-wine-quality-analysis-python
● https://www.kaggle.com/sgus1318/wine-quality-exploration-and-analysis
● https://medium.com/analytics-vidhya/wine-quality-prediction-with-python-695939d34d87
● https://medium.com/datadriveninvestor/regression-from-scratch-wine-quality-prediction-d61195cb91c8
● https://github.com/vikrantkakad/Red-Wine-Quality-Analysis
● https://dzone.com/articles/predicting-wine-quality-with-several-classificatio
● https://datauab.github.io/red_wine_quality/