
01. Introduction

The document provides an overview of data science, its techniques, and its applications in various fields. It explains the relationship between data science, artificial intelligence, and machine learning, as well as the importance of data preparation and model building. Additionally, it outlines different types of data science tasks, associated fields, and core algorithms used in the discipline.

Uploaded by

Farzana Akter

Adapted and Modified By: Avdesh Mishra

Data Science: Concepts and Practice
Course slides

Course Book
➢ Data Science: Concepts and Practice
➢ Authors: Vijay Kotu & Bala Deshpande
➢ Publisher: Morgan Kaufmann

Course Software
➢ www.rapidminer.com (free download)
➢ Weka
➢ Python: Scikit-learn/Keras/TensorFlow
1. Introduction
What is Data Science
➢ Data science is a collection of techniques used to extract value from data.
➢ It has become an essential tool for any organization that collects, stores, and
processes data as part of its operations.
➢ Data science techniques rely on finding useful patterns, connections, and
relationships within data.
➢ The underlying methods of data science are decades if not centuries old.
➢ Engineers and scientists have been using predictive models since the
beginning of the nineteenth century.
➢ The term “science” indicates that the methods are evidence based and are built
on empirical knowledge, more specifically historical observations.
➢ Moore’s Law: computing hardware capabilities double roughly every two years.
➢ The ability to collect, store, and process data has increased.
➢ The application of data science in many fields has also increased.
➢ To get meaningful results, a major effort in preparing, cleaning, scrubbing, or
standardizing the data is still required before the learning algorithms can
begin to crunch it.
AI, ML, and DS

➢ Artificial intelligence is about giving machines the capability of mimicking
human behavior, particularly cognitive functions.
➢ Examples: facial recognition, automated driving, sorting mail based on postal
code.
➢ Machine learning, which can be considered either a sub-field or one of the tools
of artificial intelligence, provides machines with the capability of learning
from experience.
➢ ML algorithms, also called “learners,” take both inputs and known outputs
(training data) and figure out a model that converts input to output.
➢ Data science is the business application of ML, AI, and other quantitative
fields like statistics, visualization, and mathematics.
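The idea of a learner that takes inputs and known outputs and produces a model can be sketched in a few lines. The example below is only an illustration, not the course's method: the function names (fit_line, predict) and the training data are hypothetical, and ordinary least squares on one input variable stands in for whatever learner is actually used.

```python
def fit_line(xs, ys):
    """Learn a slope and intercept from paired inputs and outputs
    (ordinary least squares on a single input variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data: inputs paired with their known outputs.
train_inputs = [1.0, 2.0, 3.0, 4.0]
train_outputs = [3.0, 5.0, 7.0, 9.0]

model = fit_line(train_inputs, train_outputs)  # the "learning" step
print(model)                 # learned slope and intercept
print(predict(model, 10.0))  # output for an input the learner never saw
```

The learner is never told the rule that links input to output; it recovers a slope and intercept from the examples alone, and the resulting model is then applied to new input.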
Data Science

➢ Starts with data, e.g.,
    ➢ a simple array of a few numeric observations, or
    ➢ a complex matrix of millions of observations with thousands of variables.
➢ Utilizes specialized computational methods to discover meaningful
structures within a dataset.
➢ Coexists with and is closely associated with a number of related areas, such as
    ➢ database systems, data engineering, visualization, data analysis, and
    business intelligence.
Extracting Meaningful Patterns

➢ Knowledge discovery is the process of
    ➢ identifying valid, novel, potentially useful, and ultimately understandable
    patterns or relationships
    ➢ in order to make important decisions.
➢ Generalization of patterns
    ➢ A pattern should be valid not just for the dataset used to observe it, but
    also for new, unseen data.
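One common way to check whether a pattern generalizes is to hold some records back and evaluate on them. The sketch below assumes a made-up labeled dataset and two hypothetical models (a memorizer and a learned threshold rule); the deterministic every-fourth-record split is chosen only to keep the example reproducible, and random splits are more usual in practice.

```python
def accuracy(predict, records):
    """Fraction of records whose predicted label matches the known label."""
    hits = sum(1 for x, label in records if predict(x) == label)
    return hits / len(records)


# Hypothetical labeled data: the true pattern is "high" when x >= 50.
records = [(x, "high" if x >= 50 else "low") for x in range(0, 100, 5)]

# Hold out every fourth record as unseen data.
test = records[3::4]
train = [r for i, r in enumerate(records) if i % 4 != 3]

# Model A memorizes the training records and knows nothing else.
lookup = dict(train)


def memorizer(x):
    return lookup.get(x)  # None for any value it has not seen


# Model B learns a threshold halfway between the two classes seen in training.
max_low = max(x for x, label in train if label == "low")
min_high = min(x for x, label in train if label == "high")
threshold = (max_low + min_high) / 2


def threshold_rule(x):
    return "high" if x >= threshold else "low"


for name, model in [("memorizer", memorizer), ("threshold rule", threshold_rule)]:
    print(name, "train:", accuracy(model, train), "test:", accuracy(model, test))
```

Both models look perfect on the data they were built from. Only the held-out records show that the memorizer has captured nothing that carries over to new data, while the threshold rule has.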
Building Representative Models

➢ Once a representative model is created, it can be used to predict the value
of the target variable.
➢ A model serves two purposes:
    ➢ Predict the output based on a new and unseen set of input variables.
    ➢ Understand the relationship between the output variable and all the
    input variables.

[Figure: Process of generating a model.]
Combination of Statistics, Machine Learning, and Computing

➢ Data science borrows computational techniques from the disciplines of
➢ Statistics
➢ Machine learning
➢ Experimentation
➢ Database theories
➢ Has also evolved to adopt more diverse techniques such as parallel computing,
evolutionary computing, linguistics, and behavioral studies.
Learning Algorithms
➢ Based on the problem, data science is classified into tasks such as
    ➢ Classification
    ➢ Association analysis
    ➢ Clustering and
    ➢ Regression
➢ Each task uses specific algorithms such as
    ➢ Decision trees
    ➢ Neural networks
    ➢ k-NN (k-nearest neighbors)
    ➢ k-means clustering and others.
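As one concrete instance of a task paired with an algorithm, classification with k-NN can be sketched as follows. This is a minimal illustration with hypothetical two-dimensional points and labels, using Euclidean distance and a simple majority vote; it is not tuned for real data.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    by_distance = sorted(train, key=lambda rec: math.dist(rec[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled training points: (coordinates, class label).
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

print(knn_classify(train, (2.0, 2.0)))  # query close to the "A" points
print(knn_classify(train, (6.0, 7.0)))  # query close to the "B" points
```

k-NN builds no explicit model: the training records themselves are the model, and all the work happens when a new point is classified.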
Associated Fields
➢ Data science relies heavily on the following fields
    ➢ Descriptive statistics: compute the mean, standard deviation, correlation, etc.
    ➢ Exploratory visualization: the process of expressing data in visual coordinates.
    ➢ Dimensional slicing: gather information on the data through dimension slicing,
    filtering, and pivoting, aided by a specialized database schema design (e.g.,
    slice the yearly revenue by product, or by a combination of region and product).
    ➢ Hypothesis testing: experimental data are collected to evaluate whether a
    hypothesis has enough evidence to be supported or not.
    ➢ Data engineering: the process of sourcing, organizing, assembling, storing, and
    distributing data for effective analysis and usage.
    ➢ Business intelligence: helps organizations consume data effectively and assists
    in decision making.
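Two of the fields above, descriptive statistics and dimensional slicing, are easy to illustrate with the Python standard library. The data, column layout, and helper names (pearson_correlation, slice_revenue) below are made up for the example and do not come from the course material.

```python
import math
import statistics
from collections import defaultdict


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def slice_revenue(rows, key):
    """Total the revenue column, grouped by the dimension(s) that key selects."""
    totals = defaultdict(float)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)


# Descriptive statistics on hypothetical observations.
ad_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
revenue = [25.0, 45.0, 65.0, 85.0, 105.0]
print("mean:", statistics.mean(revenue))
print("sample standard deviation:", statistics.stdev(revenue))
print("correlation:", pearson_correlation(ad_spend, revenue))

# Dimensional slicing on hypothetical rows of (region, product, revenue).
sales = [
    ("East", "Widget", 120.0), ("East", "Gadget", 80.0),
    ("West", "Widget", 100.0), ("West", "Gadget", 50.0),
    ("East", "Widget", 30.0),
]
print(slice_revenue(sales, key=lambda r: r[1]))          # by product
print(slice_revenue(sales, key=lambda r: (r[0], r[1])))  # by region and product
```

Passing a different key function changes the slicing dimension without touching the aggregation, which is the same idea a pivot table or an OLAP tool exposes interactively.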
Types of Data Science
➢ Data science can be broadly categorized into
    ➢ Supervised learning
        ➢ Infers a function or relationship based on labeled training data and uses
        this function to map new unlabeled data.
        ➢ Needs a sufficient number of labeled records to learn the model from data.
        ➢ The output label that is being predicted is called the class label or
        target variable.
    ➢ Unsupervised learning
        ➢ Uncovers hidden patterns in unlabeled data.
        ➢ Finds patterns in data based on the relationship between the data points
        themselves.
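Unsupervised learning can be illustrated with a minimal k-means sketch on one-dimensional data. The points and starting centroids below are hypothetical, and the fixed iteration count is a simplification; practical implementations stop when the centroids no longer move and choose starting centroids more carefully.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and moving
    each centroid to the mean of the points assigned to it."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled observations and arbitrary starting centroids.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

No labels are supplied anywhere; the two groups emerge purely from how close the points are to one another.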
➢ Data science can be categorized into tasks such as

| Task | Description | Algorithms | Examples |
|---|---|---|---|
| Classification | Predict whether a data point belongs to one of the predefined classes. The prediction is based on learning from a known data set. | Decision trees, neural networks, Bayesian models, induction rules, k-nearest neighbors | Assigning voters into known buckets by political parties, e.g., soccer moms. Bucketing new customers into one of the known customer groups. |
| Regression | Predict the numeric target label of a data point. The prediction is based on learning from a known data set. | Linear regression, logistic regression | Predicting the unemployment rate for next year. Estimating insurance premiums. |
| Anomaly detection | Predict whether a data point is an outlier compared to other data points in the data set. | Distance based, density based, local outlier factor (LOF) | Detecting fraudulent credit card transactions. Network intrusion detection. |
| Time series | Predict the value of the target variable for a future time frame based on historical values. | Exponential smoothing, ARIMA, regression | Sales forecasting, production forecasting, virtually any growth phenomenon that needs to be extrapolated. |
| Clustering | Identify natural clusters within the data set based on inherent properties within the data set. | k-means, density-based clustering (DBSCAN) | Finding customer segments in a company based on transaction, web, and customer call data. |
| Association analysis | Identify relationships within an itemset based on transaction data. | FP-Growth, Apriori | Finding cross-selling opportunities for a retailer based on transaction purchase history. |
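To make one row of the table concrete, the sketch below applies simple exponential smoothing, one of the time series algorithms listed, to a short sales history. The sales figures and the smoothing factor alpha are illustrative assumptions; this basic form does not model trend or seasonality.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: each smoothed value blends the newest
    observation with the previous smoothed value, weighted by alpha."""
    smoothed = [series[0]]  # start from the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical sales history for four periods.
sales_history = [100.0, 110.0, 105.0, 120.0]
smoothed = exponential_smoothing(sales_history, alpha=0.5)

# With this method, the last smoothed value serves as the next-period forecast.
forecast = smoothed[-1]
print(smoothed)
print(forecast)
```

A larger alpha makes the forecast react faster to recent observations; a smaller alpha gives more weight to the older history.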
Course outline

Process Basics
➢ Data Science Process
➢ Data Exploration
➢ Model Evaluation

Core Algorithms
➢ Classification
    ➢ Decision Trees
    ➢ Rule Induction
    ➢ k-Nearest Neighbors
    ➢ Naïve Bayesian
    ➢ Artificial Neural Networks
    ➢ Support Vector Machines
    ➢ Ensemble Learners
➢ Regression
    ➢ Linear Regression
    ➢ Logistic Regression
➢ Association Analysis
    ➢ Apriori
    ➢ FP-Growth
➢ Clustering
    ➢ k-Means
    ➢ DBSCAN
    ➢ Self-Organizing Maps

Common Applications
➢ Text Mining
➢ Time Series Forecasting
➢ Anomaly Detection
➢ Feature Selection
Practice using RapidMiner
➢ Download the free version of RapidMiner Studio software from
    ➢ https://rapidminer.com/products/studio/
➢ Review Chapter 15: Getting started with RapidMiner to become familiar
with the features of the tool, its basic operations, and the user interface
functionality.
