Data Science Mid Syllabus

This document discusses data science concepts including data types, the data science process, and data exploration techniques. It provides examples of different data types like images and categorical data. The standard data science process involves understanding the problem, preparing and exploring the data, developing a model, applying the model to test its effectiveness, and deploying the model. Data exploration techniques discussed include descriptive statistics, data visualization through various plots, and understanding relationships between attributes. The document provides examples of exploring tweet and product review datasets.


Data Science

ABID ISHAQ
Lecturer Computer Science
Islamia University Bahawalpur
Course Books

• Data Science, Concepts and Practice, Second Edition, Vijay Kotu, Bala Deshpande
• Introduction To Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar
Introduction to Data Science

• Data
• Data types
• Data science

Introduction to Data Science
Data science:

• Data science is a collection of techniques used to extract value from data. It has become an
essential tool for any organization that collects, stores, and processes data as part of its
operations. Data science techniques rely on finding useful patterns, connections, and
relationships within data.
Data science:

• Data science starts with data, which can range from a simple array of a few numeric
observations to a complex matrix of millions of observations with thousands of variables. Data
science utilizes certain specialized computational methods in order to discover meaningful and
useful structures within a dataset.
Presence of data science

The discipline of data science coexists with, and is closely associated with, a number of related
areas such as:
• database systems,
• data engineering,
• visualization,
• data analysis,
• experimentation, and
• business intelligence (BI).
AI, MACHINE LEARNING, AND DATA SCIENCE

• Artificial intelligence, machine learning, and data science are all related to each other.
Unsurprisingly, they are often used interchangeably and conflated with each other in popular
media and business communication. However, the three fields are distinct, and the distinctions
depend on context.
Traditional program and machine learning
Data science models
CASE FOR DATA SCIENCE

A set of frameworks, tools, and techniques is needed to intelligently assist humans in processing
all these data and extracting valuable information. Data science is one such paradigm: it can
handle large volumes of data with multiple attributes and deploy complex algorithms to search
for patterns in the data.
Volume: The sheer volume of data captured by
organizations is exponentially increasing. The rapid
decline in storage costs and advancements in
capturing every transaction and event, combined
with the business need to extract as much leverage
as possible using data, creates a strong motivation to
store more data than ever.
Dimensions: Every single record or data point
contains multiple attributes or variables to provide
context for the record. For example, every user
record of an ecommerce site can contain attributes
such as products viewed, products purchased, user
demographics, frequency of purchase, clickstream,
etc.
Complex Questions: As more complex data are available for analysis, the complexity of the
information that needs to be extracted from the data is increasing as well. If the natural clusters
in a dataset with hundreds of dimensions need to be found, traditional analysis techniques such
as hypothesis testing cannot be used in a scalable fashion.
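To make the point about complex questions concrete, here is a minimal sketch (not from the slides) of discovering natural clusters in a high-dimensional dataset with scikit-learn; the synthetic 100-dimensional data and the choice of k=3 are assumptions for illustration only:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic dataset: 500 observations, 100 attributes, 3 hidden groups.
# Manually hypothesis-testing for structure in 100 dimensions does not
# scale; a clustering algorithm searches for it automatically.
X, _ = make_blobs(n_samples=500, n_features=100, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])  # cluster assignment for the first 10 observations
```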
Types of Data Science
• Today's book: Data Science, Concepts and Practice, Second Edition,
by Vijay Kotu, Bala Deshpande
Data science process

The standard data science process involves:
• Understanding the problem,
• Preparing the data samples,
• Developing the model,
• Applying the model on a dataset to see how the model may work in the real world,
• Deploying and maintaining the models.
Data

• Def:?????????????????
Types of data

• Alphabetical
• Categorical
• Images
• ?????
Types of Data

• A data set can often be viewed as a collection of data objects. Other names for a data object
are record, point, vector, pattern, event, case, sample, observation, or entity. In turn, data
objects are described by a number of attributes that capture the basic characteristics of an
object, such as the mass of a physical object or the time at which an event occurred. Other
names for an attribute are variable, characteristic, field, feature, or dimension.
Attributes and Measurement

• What Is an attribute?
An attribute is a property or characteristic of an object that may vary, either from one object to
another or from one time to another. For example, eye colour varies from person to person,
while the temperature of an object varies over time. Note that eye colour is a symbolic attribute
with a small number of possible values {brown, black, blue, green, hazel, etc.}, while
temperature is a numerical attribute with a potentially unlimited number of values.
Measurement

• A measurement scale is a rule (function) that associates a numerical or symbolic value with an
attribute of an object. Formally, the process of measurement is the application of a
measurement scale to associate a value with a particular attribute of a specific object. While
this may seem a bit abstract, we engage in the process of measurement all the time. For
instance, we step on a bathroom scale to determine our weight, we classify someone as male
or female, or we count the number of chairs in a room to see if there will be enough to seat
all the people coming to a meeting.
The Different Types of Attributes

• The following properties (operations) of numbers are typically used to describe attributes.
1. Distinctness: = and ≠
2. Order: <, ≤, >, and ≥
3. Addition: + and −
4. Multiplication: × and /
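As a hypothetical illustration (not from the book), the properties above map onto the attribute types — nominal, ordinal, interval, and ratio — as follows; all values below are made up:

```python
# Nominal attribute (e.g., eye colour): only distinctness (= and !=) applies.
eye_colors = ["brown", "blue", "brown"]
print(eye_colors[0] == eye_colors[2])  # True

# Ordinal attribute (e.g., size): order (<, >) also applies.
sizes = {"small": 0, "medium": 1, "large": 2}
print(sizes["small"] < sizes["large"])  # True

# Interval attribute (e.g., temperature in Celsius): + and - apply,
# but ratios do not (20 °C is not "twice as hot" as 10 °C).
temp_c = [10.0, 20.0]
print(temp_c[1] - temp_c[0])  # 10.0

# Ratio attribute (e.g., mass): * and / apply; ratios are meaningful.
weights_kg = [40.0, 80.0]
print(weights_kg[1] / weights_kg[0])  # 2.0
```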
Different types of attributes
Today Book

• Introduction To Data Mining, by Pang-Ning Tan, Michael Steinbach, Vipin Kumar
3. Data Exploration
Objectives of Data Exploration

Understanding data
Data preparation
Data mining tasks
Interpreting data mining results
Data Sets

1 http://commons.wikimedia.org/wiki/File:Iris_versicolor_3.jpg#mediaviewer/File:Iris_versicolor_3.jpg
Descriptive Statistics - Univariate
Descriptive Statistics - Multivariate

Central datapoint
Correlation
Descriptive Statistics - Multivariate
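A minimal sketch of the univariate and multivariate descriptive statistics above, using pandas; the toy table stands in for the Iris measurements used on these slides and is an assumption, not the actual data:

```python
import pandas as pd

# Toy stand-in for the Iris measurements (values made up for illustration).
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0],
})

print(df.describe())  # univariate: mean, std, quartiles per attribute
print(df.mean())      # central data point (mean of each attribute)
print(df.corr())      # multivariate: correlation between attributes
```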
Data Visualization

Histogram
Data Visualization

Class stratified Histogram


Data Visualization

Quantile plot
Data Visualization

Distribution plot
Data Visualization

Scatter plot
Data Visualization

Scatter multiple
Data Visualization

Multiple Scatter matrix


Data Visualization

Bubble plot
Data Visualization

Density chart
Data Visualization

Parallel chart
Data Visualization

Deviation chart
Data Visualization

Andrews curves
Data Visualization

Parallel chart
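The plot types listed on these slides can be drawn with matplotlib; this minimal sketch (synthetic data, assumed output file name) produces two of them, a histogram and a scatter plot:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, render to file
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=1.0, size=200)  # synthetic attribute values

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(data, bins=20)  # histogram: distribution of one attribute
ax1.set_title("Histogram")
ax2.scatter(data, data + rng.normal(0, 0.5, 200))  # scatter: two attributes
ax2.set_title("Scatter plot")
fig.tight_layout()
fig.savefig("visualizations.png")
```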
Roadmap for data exploration

1. Organize the data set
2. Find the central point for each attribute
3. Understand the spread of the attributes
4. Visualize the distribution of each attribute
5. Pivot the data
6. Watch out for outliers
7. Understand the relationship between attributes
8. Visualize the relationship between attributes
9. Visualize high-dimensional data sets

Kotu, V., & Deshpande, B. (2014). Predictive analytics and data mining: concepts and practice with rapidminer. Morgan Kaufmann.
UMER et al.

FIGURE 1 Distribution of tweets data in Dataset 1 and Dataset 2

FIGURE 2 Negative reason count in tweets

3.2 Data visualization

Visualization plays an essential role in understanding the dataset. It helps to understand
important patterns in the dataset before the application of a classification model. Dataset 1
contains Tweets about six airline companies: United Airlines, US Airways, Delta Airlines,
American Airlines, Southwest Airlines, and Virgin America Airlines. The number of Tweets for
an individual airline is different, and the divisions are shown in Figure 1. The highest number of
Tweets in the dataset belongs to United Airlines and makes up approximately 26% of the dataset.
The negative reason attribute of Dataset 1 has 10 reasons in total for each airline. The negative
reason count is very different for each reason for a particular airline. Figure 2 shows that the
highest count is for customer service issues, which is what the majority of the customers
complain about.
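A hedged sketch of how a count plot like Figure 2 could be produced; the reason strings below are toy stand-ins, not the actual Dataset 1 values:

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
from collections import Counter

# Toy stand-in for Dataset 1's negative-reason column (assumed values).
negative_reasons = [
    "Customer Service Issue", "Late Flight", "Customer Service Issue",
    "Lost Luggage", "Customer Service Issue", "Late Flight",
]
counts = Counter(negative_reasons)

plt.bar(list(counts.keys()), list(counts.values()))
plt.title("Negative reason count in tweets")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.savefig("negative_reasons.png")
```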
Similarly, Dataset 2 contains Tweets about 20 garment classes. Dresses, Pants, Blouses, Knits,
and Sweaters make up the majority of the dataset. It contains ratings, ranging from 1 to 5,
assigned by the consumer, each with a different count as displayed in Figure 3. Dataset 3
comprises Tweets that contain hostile and sympathetic reviews, and the task is to categorize
them as hatred or nonhatred.

FIGURE 3 Ratings assigned by consumers

3.3 Data pre-processing

The datasets are semi-structured/unstructured, containing a large amount of unnecessary data,
which plays no significant role in the prediction. Moreover, a large dataset requires a longer
training time, and stop words reduce the accuracy of the prediction. Therefore, text
pre-processing is required to both save computational resources and increase prediction
accuracy. Text pre-processing [35] plays a vital role in more accurate prediction and elevates the
model's performance. The following steps are carried out in the pre-processing phase, as shown
in Figure 4.
Tokenization: This involves the splitting of continuous text into words, symbols, and elements
(called tokens) [36]. It has a significant impact on the performance of the subsequent analysis,
so it should be correct and efficient [37].
Stop word removal: In the next step, stop words are removed from the Tweets. Although stop
words make sentences more readable, they do not add value to text analysis. The removal of
stop words increases the efficiency of the classification algorithm [38].
Short word removal: Short words with a character length of less than three are removed from
the Tweets. Research [39] finds that SVC is not robust to short words and its accuracy is
affected if Tweets contain them. Hence, short words are discarded to increase the robustness
and efficacy of classifiers.
Case conversion: After short words are removed, the text in the Tweets is converted to lower
case. This is an important step because the analysis is case sensitive. Probabilistic models, for
example, consider "Bad" and "bad" as different words, and they count the occurrence of each
word separately [38]. If the words are not converted to lowercase, it could impair the efficiency
of the classifier.
Stemming: Stemming is the process of removing the affixes from words and restoring the words
to their root forms [40]. For example, enjoys, enjoying, and enjoyed are variations of "enjoy"
with the same meaning. Removing the suffixes helps reduce feature complexity and improves
the learning capability of classifiers.
Removing @ and bad symbols: After stemming, words starting with @ are removed because
Twitter assigns a unique name to each subscriber, which starts with @. After that, special
symbols are removed. This study found that a few symbols remain in the Tweets even after the
special symbol-removal phase is complete, so the bad symbol step follows to remove such
symbols (e.g., a heart).
As the next step, numeric values are removed from the Tweets because they do not possess any
value for text analysis, and removing them decreases the complexity of the models' training.
Table 4 shows a few Tweets before and after pre-processing has been performed.

FIGURE 4 Steps carried out in data pre-processing

TABLE 4 Preprocessing of tweets

Before Preprocessing: @VirginAmerica plus you've added commercials to the experience … tacky.
After Preprocessing: plus added commercials experience tacky

Before Preprocessing: @VirginAmerica I didn't today … Must mean I need to take another trip!
After Preprocessing: today must mean need take another trip

Before Preprocessing: @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces they have little recourse
After Preprocessing: really aggressive blast obnoxious entertainment guests faces amp little recourse
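The pre-processing steps above can be sketched as one pipeline. This is a simplified illustration: the stop-word list is a toy stand-in (the study's actual list is not given here), and a naive suffix-stripping rule replaces a real stemmer such as Porter's, so its output only approximates Table 4:

```python
import re

# Toy stop-word list (assumed; not the study's actual list).
STOP_WORDS = {"to", "the", "you", "ve", "i", "didn", "t", "it", "s", "in", "your"}

def preprocess(tweet: str) -> str:
    tweet = re.sub(r"@\w+", " ", tweet)           # remove @mentions
    tweet = re.sub(r"[^A-Za-z ]", " ", tweet)     # strip symbols and numerics
    tokens = tweet.lower().split()                # tokenize + case conversion
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    tokens = [t for t in tokens if len(t) >= 3]   # short word removal
    # naive suffix stripping in place of a real stemmer (over-stems slightly)
    tokens = [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t
              for t in tokens]
    return " ".join(tokens)

print(preprocess("@VirginAmerica I didn't today … Must mean I need to take another trip!"))
# → today must mean need take another trip
```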

3.4 Machine learning models for sentiment classification

Machine learning has played a vital role in enhancing the accuracy and efficacy of sentiment
classification of Twitter data. There exist rich variants of machine learning classifiers for
4/28/2019 Simple guide to confusion matrix terminology

March 25, 2014 · MACHINE LEARNING

Simple guide to confusion matrix terminology

A confusion matrix is a table that is often used to describe the performance of a classification
model (or "classifier") on a set of test data for which the true values are known. The confusion
matrix itself is relatively simple to understand, but the related terminology can be confusing.

I wanted to create a "quick reference guide" for confusion matrix terminology because I couldn't
find an existing resource that suited my requirements: compact in presentation, using numbers
instead of arbitrary variables, and explained both in terms of formulas and sentences.

Let's start with an example confusion matrix for a binary classifier (though it can easily be
extended to the case of more than two classes):

What can we learn from this matrix?

• There are two possible predicted classes: "yes" and "no". If we were predicting the presence
of a disease, for example, "yes" would mean they have the disease, and "no" would mean
they don't have the disease.
• The classifier made a total of 165 predictions (e.g., 165 patients were being tested for the
presence of that disease).
• Out of those 165 cases, the classifier predicted "yes" 110 times, and "no" 55 times.
• In reality, 105 patients in the sample have the disease, and 60 patients do not.

Let's now define the most basic terms, which are whole numbers (not rates):

• true positives (TP): These are cases in which we predicted yes (they have the disease), and
they do have the disease.
• true negatives (TN): We predicted no, and they don't have the disease.
• false positives (FP): We predicted yes, but they don't actually have the disease. (Also known
as a "Type I error.")
• false negatives (FN): We predicted no, but they actually do have the disease. (Also known
as a "Type II error.")

I've added these terms to the confusion matrix, and also added the row and column totals:

https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
This is a list of rates that are often computed from a confusion matrix for a binary classifier:

• Accuracy: Overall, how often is the classifier correct?
(TP+TN)/total = (100+50)/165 = 0.91
• Misclassification Rate: Overall, how often is it wrong?
(FP+FN)/total = (10+5)/165 = 0.09
equivalent to 1 minus Accuracy; also known as "Error Rate"
• True Positive Rate: When it's actually yes, how often does it predict yes?
TP/actual yes = 100/105 = 0.95
also known as "Sensitivity" or "Recall"
• False Positive Rate: When it's actually no, how often does it predict yes?
FP/actual no = 10/60 = 0.17
• True Negative Rate: When it's actually no, how often does it predict no?
TN/actual no = 50/60 = 0.83
equivalent to 1 minus False Positive Rate; also known as "Specificity"
• Precision: When it predicts yes, how often is it correct?
TP/predicted yes = 100/110 = 0.91
• Prevalence: How often does the yes condition actually occur in our sample?
actual yes/total = 105/165 = 0.64
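All of these rates follow directly from the four counts in the example matrix (TP=100, TN=50, FP=10, FN=5); a minimal sketch:

```python
# Counts from the example confusion matrix above.
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN  # 165 predictions

accuracy = (TP + TN) / total        # (100+50)/165 ≈ 0.91
error_rate = (FP + FN) / total      # (10+5)/165 ≈ 0.09
recall = TP / (TP + FN)             # true positive rate / sensitivity ≈ 0.95
fpr = FP / (FP + TN)                # false positive rate ≈ 0.17
specificity = TN / (TN + FP)        # true negative rate ≈ 0.83
precision = TP / (TP + FP)          # ≈ 0.91
prevalence = (TP + FN) / total      # ≈ 0.64
f1 = 2 * precision * recall / (precision + recall)  # F Score
```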

A couple other terms are also worth mentioning:

• Null Error Rate: This is how often you would be wrong if you always predicted the majority
class. (In our example, the null error rate would be 60/165 = 0.36 because if you always
predicted yes, you would only be wrong for the 60 "no" cases.) This can be a useful baseline
metric to compare your classifier against. However, the best classifier for a particular
application will sometimes have a higher error rate than the null error rate, as demonstrated
by the Accuracy Paradox.
• Cohen's Kappa: This is essentially a measure of how well the classifier performed as
compared to how well it would have performed simply by chance. In other words, a model
will have a high Kappa score if there is a big difference between the accuracy and the null
error rate. (More details about Cohen's Kappa.)
• F Score: This is a weighted average of the true positive rate (recall) and precision. (More
details about the F Score.)
• ROC Curve: This is a commonly used graph that summarizes the performance of a classifier
over all possible thresholds. It is generated by plotting the True Positive Rate (y-axis) against
the False Positive Rate (x-axis) as you vary the threshold for assigning observations to a
given class. (More details about ROC Curves.)

And finally, for those of you from the world of Bayesian statistics, here's a quick summary of
these terms from Applied Predictive Modeling:

In relation to Bayesian statistics, the sensitivity and specificity are the conditional
probabilities, the prevalence is the prior, and the positive/negative predicted values are the
posterior probabilities.

Want to learn more?
In my new 35-minute video, Making sense of the confusion matrix, I explain these concepts in
more depth and cover more advanced topics:

• How to calculate precision and recall for multi-class problems
• How to analyze a 10-class confusion matrix
• How to choose the right evaluation metric for your problem
• Why accuracy is often a misleading metric

C. FEATURE ENGINEERING METHODS

Feature engineering is a process for finding meaningful features from data for the efficient
training of machine learning algorithms or, in other words, the creation of features derived from
original features. The study concludes that feature engineering can boost the performance of
machine learning algorithms. "Garbage in, garbage out" is a common saying in machine
learning. According to this idea, senseless data produces senseless output. On the other hand,
data that are more informational can produce desirable results. Therefore, feature engineering
can extract meaningful features from raw data, which helps to increase the consistency and
accuracy of learning algorithms. In this study, we used three feature engineering methods:
BoW, TF-IDF, and Chi2.

BAG-OF-WORDS
BoW is a method of extracting features from text data, and it is very easy to
understand and implement. BoW is very useful in problems such as language
modeling and text classification. In this method, we use CountVectorizer to
extract features. CountVectorizer works on term frequency, i.e., counting the
occurrences of tokens and building a sparse matrix of tokens. BoW is a
collection of words and features, where each feature is assigned a value that
represents the occurrences of that feature.

TF-IDF
TF-IDF is a feature extraction method used to extract features from data. TF-IDF is most widely
used in text analysis and music information retrieval. TF-IDF assigns a weight to each term in a
document based on its term frequency (TF) and inverse document frequency (IDF). The terms
with higher weight scores are considered to be more important. TF-IDF computes the weight of
each term using the formula in Equation 1:

w_(i,j) = TF_(i,j) × log(N / Df_t)    (1)

Here, TF_(i,j) is the number of occurrences of term t in document d, Df_t is the number of
documents containing the term t, and N is the total number of documents in the corpus.
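As a sanity check on Equation 1, here is a hand-computed sketch on a toy corpus; the base-10 logarithm is an assumption (any base works up to scaling), and note that library implementations such as scikit-learn's TfidfVectorizer use a smoothed IDF variant rather than this plain formula:

```python
import math

# Toy corpus; each document is a token list (assumed values).
docs = [
    ["good", "service", "good"],
    ["bad", "service"],
    ["good", "flight"],
]
N = len(docs)  # total number of documents in the corpus

def tfidf(term, doc):
    tf = doc.count(term)                    # TF_(i,j): occurrences in document
    df = sum(1 for d in docs if term in d)  # Df_t: documents containing term
    return tf * math.log10(N / df)

# "good" appears twice in doc 0 and in 2 of the 3 documents:
w = tfidf("good", docs[0])  # 2 * log10(3/2) ≈ 0.352
print(w)
```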

CHI2
Chi2 is the most common feature selection method, and it is mostly used on text data [21]. In
feature selection, we use it to check whether the occurrence of a specific term and the
occurrence of a specific class are independent. More formally, for a given document D, we
estimate the following quantity for each term and rank the terms by their score. Chi2 finds this
score using Equation 2:

χ²(D, t, c) = Σ_(e_t ∈ {0,1}) Σ_(e_c ∈ {0,1}) (N_(e_t, e_c) − E_(e_t, e_c))² / E_(e_t, e_c)    (2)

where
• N is the observed frequency and E the expected frequency

• e_t takes the value 1 if the document contains term t and 0 otherwise

• e_c takes the value 1 if the document is in class c and 0 otherwise

For each feature (term), a high Chi2 score indicates that the null hypothesis H0 of independence
(meaning the document class has no influence over the term's frequency) should be rejected,
and the occurrence of the term and class are dependent. In this case, we should select the
feature for the text classification.
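A minimal sketch of Chi2 feature scoring with scikit-learn on top of BoW counts; the mini-corpus and its sentiment labels are assumptions for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

# Hypothetical mini-corpus with sentiment labels (0 = negative, 1 = positive).
docs = [
    "flight delayed bad service",
    "terrible delayed flight",
    "great flight friendly crew",
    "friendly service great experience",
]
labels = [0, 0, 1, 1]

vec = CountVectorizer()
X = vec.fit_transform(docs)         # BoW term counts
scores, p_values = chi2(X, labels)  # Chi2 score per term

# A term concentrated in one class ("delayed") scores higher than a term
# spread evenly across both classes ("flight"), so it is a better feature.
vocab = vec.vocabulary_
print(scores[vocab["delayed"]], scores[vocab["flight"]])
```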
Decision Tree Algorithm

By
Muhammad Rizwan
KFUEIT

Muhammad Rizwan, IT, KFUEIT 1


• First we will look into the Decision Tree algorithm
• Then we will understand Random Forest
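Since the content of the following slides was figures, here is a minimal, hedged sketch (not from the slides) of both models with scikit-learn, using the Iris data as a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# A single decision tree, depth-limited to keep it interpretable.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# A random forest: an ensemble of trees trained on bootstrap samples.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(tree.score(X, y))    # training accuracy of the single tree
print(forest.score(X, y))  # training accuracy of the ensemble
```

On unseen data, the ensemble typically generalizes better than a single deep tree; evaluating on the training set, as here, is only a smoke test.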

