Data Science Mid Syllabus
ABID ISHAQ
Lecturer Computer Science
Islamia University Bahawalpur
Course Books
• Data
• Data types
• Data science
Introduction to Data Science
Data science:
• Def: the interdisciplinary field that combines statistics, computer science, and domain knowledge to extract knowledge and insights from data.
Types of data
• Alphabetical
• Categorical
• Images
• ?????
Types of Data
• What is an attribute?
An attribute is a property or characteristic of an object that may vary, either from one object to another or from one time to another. For example, eye colour varies from person to person, while the temperature of an object varies over time. Note that eye colour is a symbolic attribute with a small number of possible values {brown, black, blue, green, hazel, etc.}, while temperature is a numerical attribute with a potentially unlimited number of values.
Measurement
Understanding data
Data preparation
Data mining tasks
Interpreting data mining results
Data Sets
[Image: Iris versicolor, http://commons.wikimedia.org/wiki/File:Iris_versicolor_3.jpg]
Descriptive Statistics - Univariate
Descriptive Statistics - Multivariate
• Central datapoint
• Correlation
Data Visualization
• Histogram
• Quantile plot
• Distribution plot
• Scatter plot
• Scatter multiple
• Bubble plot
• Density chart
• Parallel chart
• Deviation chart
• Andrews curves
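As a quick, hedged illustration of two of these chart types (histogram and scatter plot), the Python sketch below uses the Iris data pictured on the Data Sets slide; pandas, matplotlib, and scikit-learn are assumed to be available, and the column names come from scikit-learn's built-in copy of the dataset.

# Histogram and scatter plot on the Iris data (illustrative sketch).
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame      # four measurements plus a numeric 'target' class

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: distribution of a single attribute (univariate view).
ax1.hist(df["sepal length (cm)"], bins=15, edgecolor="black")
ax1.set_xlabel("sepal length (cm)")
ax1.set_ylabel("count")
ax1.set_title("Histogram")

# Scatter plot: relationship between two attributes, coloured by class (multivariate view).
ax2.scatter(df["petal length (cm)"], df["petal width (cm)"], c=df["target"], s=25)
ax2.set_xlabel("petal length (cm)")
ax2.set_ylabel("petal width (cm)")
ax2.set_title("Scatter plot")

plt.tight_layout()
plt.show()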
Roadmap for data exploration
Kotu, V., & Deshpande, B. (2014). Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner. Morgan Kaufmann.
FIGURE 1 Distribution of tweets data in Dataset 1 and Dataset 2
Visualization plays an essential role in understanding the dataset. It helps to reveal important patterns in the dataset before a classification model is applied. Dataset 1 contains Tweets about six airline companies: United Airlines, US Airways, Delta Airlines, American Airlines, Southwest Airlines, and Virgin America Airlines. The number of Tweets differs from airline to airline, and the division is shown in Figure 1. The largest share of Tweets in the dataset belongs to United Airlines, making up approximately 26% of the dataset.
The negative-reason attribute of Dataset 1 lists 10 reasons in total for each airline, and the count varies considerably from reason to reason for a particular airline. Figure 2 shows that the highest count is for customer service issues, which is what the majority of customers complain about.
Similarly, Dataset 2 contains Tweets about 20 garment classes. Dresses, Pants, Blouses, Knits,
and Sweaters make up the majority of the dataset. It contains ratings, ranging from 1 to 5, assigned by the consumer, each with a different count, as displayed in Figure 3. Dataset 3 comprises Tweets that contain hostile and sympathetic reviews, and the task is to categorize them as hatred or nonhatred.
FIGURE 3 Ratings assigned by consumers
FIGURE 4 Steps carried out in data pre-processing
TABLE 4 Preprocessing of tweets

Before preprocessing: @VirginAmerica plus you’ve added commercials to the experience … tacky.
After preprocessing: plus added commercials experience tacky

Before preprocessing: @VirginAmerica I didn’t today … Must mean I need to take another trip!
After preprocessing: today must mean need take another trip

Before preprocessing: @VirginAmerica it’s really aggressive to blast obnoxious “entertainment” in your guests’ faces they have little recourse
After preprocessing: really aggressive blast obnoxious entertainment guests faces amp little recourse
Stemming: Stemming reduces inflected word forms (eg, play, playing, played) to a common root so that variants are treated as a single token with the same meaning. Removing the suffixes helps reduce feature complexity and improves the learning capability of classifiers.
Removing @ and bad symbols: After stemming, words starting with @ are removed because Twitter assigns each subscriber a unique name that starts with @. After that, special symbols are removed. This study found that a few symbols remained in the Tweets even after the special-symbol removal phase was complete, so a bad-symbol step follows to remove the remaining symbols (eg, a heart).
As the next step, numeric values are removed from the Tweets because they do not possess
any value for text analysis, and removing them decreases the complexity of the models’ training.
Table 4 shows a few Tweets before and after pre-processing has been performed.
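The sketch below is a minimal, hedged illustration of such a pipeline in Python; the PorterStemmer from nltk and the regular expressions are assumed choices, the step order is simplified, and the output differs slightly from Table 4 because stop-word removal and other details of the study's implementation are omitted.

# Illustrative tweet pre-processing: removes @usernames, special/bad symbols,
# and numeric values, then stems the remaining words (assumed tools: re, nltk).
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess_tweet(text: str) -> str:
    text = text.lower()
    text = re.sub(r"@\w+", " ", text)      # drop @usernames (Twitter handles)
    text = re.sub(r"[^a-z\s]", " ", text)  # drop special/bad symbols and numbers
    stems = [stemmer.stem(tok) for tok in text.split()]  # reduce words to a common root
    return " ".join(stems)

print(preprocess_tweet("@VirginAmerica plus you've added commercials to the experience... tacky."))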
Machine learning has played a vital role in enhancing the accuracy and efficacy of sentiment classification of Twitter data. A rich variety of machine learning classifiers is available for sentiment classification.
Simple guide to confusion matrix terminology
(Source: https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/)
Let's start with an example confusion matrix: out of 165 cases, TP = 100, FP = 10, FN = 5, and TN = 50.
Precision: When it predicts yes, how often is it correct?
TP / predicted yes = 100/110 = 0.91
Prevalence: How often does the yes condition actually occur in our sample?
actual yes / total = 105/165 = 0.64
Cohen's Kappa: A measure of how well the classifier performed compared with how well it would have performed simply by chance, taking the null error rate into account.
F Score: A weighted average of the true positive rate (recall) and precision.
ROC Curve: A commonly used graph that summarizes the performance of a classifier over all possible thresholds. It is generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis) as the threshold for assigning observations to a given class is varied.
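As a plain-Python illustration (not taken from the original page), the sketch below recomputes these metrics from the example counts quoted above.

# Recompute the example metrics from the confusion-matrix counts
# (TP=100, FP=10, FN=5, TN=50, derived from the fractions quoted above).
TP, FP, FN, TN = 100, 10, 5, 50
total = TP + FP + FN + TN                 # 165

precision  = TP / (TP + FP)               # 100/110 ≈ 0.91
recall     = TP / (TP + FN)               # true positive rate, 100/105
prevalence = (TP + FN) / total            # 105/165 ≈ 0.64
f_score    = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"prevalence={prevalence:.2f} F={f_score:.2f}")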
Feature engineering is the process of deriving meaningful features from data for the efficient training of machine learning algorithms; in other words, it is the creation of new features from the original features. The study concludes that feature engineering can boost the performance of machine learning algorithms. "Garbage in, garbage out" is a common saying in machine learning. According to this idea, senseless data produces senseless output; on the other hand, more informative data can produce desirable results. Therefore, feature engineering can extract meaningful features from raw data, which helps to increase the consistency and accuracy of learning algorithms. In this study, we used three feature engineering methods: BoW, TF-IDF, and Chi2.
BAG-OF-WORDS
BoW is a method of extracting features from text data, and it is very easy to
understand and implement. BoW is very useful in problems such as language
modeling and text classification. In this method, we use CountVectorizer to
extract features. CountVectorizer works on term frequency, i.e., counting the
occurrences of tokens and building a sparse matrix of tokens. The resulting BoW representation is a collection of word features, where each feature is assigned a value that represents the number of occurrences of that feature.
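A minimal sketch of BoW extraction with scikit-learn's CountVectorizer follows; the three-tweet corpus is made up for illustration and is not taken from the study's datasets.

# Bag-of-Words features with CountVectorizer (scikit-learn); toy corpus for illustration.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "plus added commercials experience tacky",
    "today must mean need take another trip",
    "really aggressive blast obnoxious entertainment",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)       # sparse matrix of token counts
print(vectorizer.get_feature_names_out())  # the vocabulary (features)
print(X.toarray())                         # each cell: occurrences of a feature in a tweet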
TF-IDF
TF-IDF is another method for extracting features from text data. It is most widely used in text analysis and music information retrieval. TF-
IDF assigns a weight to each term in a document based on its term frequency
(TF) and inverse document frequency (IDF). The terms with higher weight
scores are considered to be more important. TF-IDF computes the weight of each term using the formula in Equation 1:

w(t, d) = tf(t, d) × log(N / df(t))     (1)

where tf(t, d) is the frequency of term t in document d, N is the total number of documents, and df(t) is the number of documents containing t.
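In practice this weighting is usually computed with a library; the sketch below uses scikit-learn's TfidfVectorizer on a made-up corpus (note that scikit-learn applies a smoothed variant of the IDF formula above).

# TF-IDF features with TfidfVectorizer (scikit-learn); toy corpus for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "plus added commercials experience tacky",
    "today must mean need take another trip",
    "really aggressive blast obnoxious entertainment",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)          # rows = documents, columns = TF-IDF weights per term
terms = tfidf.get_feature_names_out()
print(dict(zip(terms, X.toarray()[0].round(3))))  # weights of terms in the first tweet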
CHI2
Chi2 is the most common feature selection method, and it is mostly used on text
data [21]. In feature selection, we use it to check whether the occurrence of a
specific term and the occurrence of a specific class are independent. More formally, for a given document D, we estimate the following quantity for each term and rank the terms by their score. Chi2 finds this score using Equation 2:

χ²(t, c) = Σ (N − E)² / E     (2)

where N is the observed frequency and E the expected frequency, summed over the cells of the term-class contingency table.
For each feature (term), a corresponding high Chi2 score indicates that the null
hypothesis H0 of independence (meaning the document class has no influence
over the term’s frequency) should be rejected, and the occurrence of the term
and class are dependent. In this case, we should select the feature for the text
classification.
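The hedged sketch below shows Chi2-based selection of the top-scoring terms with scikit-learn's SelectKBest; the toy corpus and labels are invented for illustration, not drawn from the study's data.

# Chi2 feature selection over BoW counts (scikit-learn); toy corpus and labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

corpus = [
    "plus added commercials experience tacky",
    "really aggressive blast obnoxious entertainment",
    "love the friendly crew and smooth flight",
    "great service and comfortable seats",
]
labels = [0, 0, 1, 1]                    # 0 = negative tweet, 1 = positive tweet

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
selector = SelectKBest(chi2, k=5)        # keep the 5 terms with the highest Chi2 score
X_selected = selector.fit_transform(X, labels)

kept = vectorizer.get_feature_names_out()[selector.get_support()]
print(kept)                              # the selected (class-dependent) terms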
Decision Tree Algorithm
Muhammad Rizwan
KFUEIT