Da&ml PPT-1

Data Science and Machine Learning

Uploaded by

vtilak109
Copyright
© All Rights Reserved

ML & Data Analytics

Syllabus
Course Code : CSIC 221
Course Title : Machine Learning & Data Analytics
Number of Credits and L/T/P scheme : 4 ; 3/0/2
Prerequisites (Course code) : Problem solving : Programming using C
Course Category : IC
Course Learning Objectives:
1. The major goal of the course is to allow computers to learn (potentially
complex) patterns from data, and then make decisions based on these
patterns.
2. To provide a strong foundation for data science and its related application
areas.
3. To provide the underlying core concepts and emerging technologies in data
science.
4. A data scientist requires an integrated skill set spanning mathematics,
probability and statistics, optimization, and branches of computer science
like databases, machine learning etc.
Course Content:
Unit 1: Introduction to Data Science: What is Data Science? Linear algebra for data
science: algebraic and geometric view, Data Representation & Statistical
Inference: Data objects and attribute types, Types of Data, descriptive statistics,
notion of probability, distributions, mean, variance, covariance, Understanding
univariate and multivariate normal distributions.

Unit 2: Data Analysis: Probability and Random Variables, Correlation, Regression,
Attribute Transformation, Sampling, Feature subset selection, Similarity
measures, High-dimensional Data: Curse of Dimensionality, Dimensionality
reduction: PCA, SVD, etc.

Unit 3: Data Visualization, Bayesian Learning & Evaluating Hypotheses: Basic
principles, Scalar, Vector, & Tensor Visualization, Multivariate Data
Visualization, Text Data Visualization, Network Data Visualization, Visualization
Techniques, Bayesian Approach, Bayes’ Theorem, Evaluating Hypotheses: Z-test,
T-test, Chi-square Test.

Unit 4: Machine Learning (Supervised & Unsupervised Learning): Basic concepts
of Classification, k-Nearest Neighbor, Decision Tree classification, Naïve Bayes
Classifier, Linear Regression Models, Logistic Regression, Basic concepts of
Clustering, K-means, Hierarchical Clustering, DBSCAN.
Text Books:
1. U Dinesh Kumar and Manaranjan Pradhan, Machine Learning using Python,
John Wiley & Sons, 2020.
2. Cathy O’Neil and Rachel Schutt, Doing Data Science: Straight Talk from the
Frontline, O’Reilly, 2014.
3. Ethem Alpaydin, Introduction to Machine Learning, Second Edition, PHI, 2010.

Reference Books:
1. T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning,
Second Edition, Springer, 2009.
2. Christopher M. Bishop, Pattern Recognition and Machine Learning,
Springer, 2006.
3. J. Grus, Data Science from Scratch, Second Edition, O’Reilly, 2019.
4. Douglas C. Montgomery and George C. Runger, Applied Statistics and Probability
for Engineers, Third Edition, John Wiley & Sons, Inc., 2003.
5. Tom M. Mitchell, Machine Learning, McGraw-Hill International Edition, 1997.


What is Data Science?
• Data science, as it’s practiced, is a blend
of Red-Bull-fueled hacking and espresso-
inspired statistics.

• But data science is not merely hacking,
because when hackers finish debugging
their Bash one-liners and Pig scripts, few
of them care about non-Euclidean
distance metrics.
• And data science is not merely statistics,
because when statisticians finish
theorizing the perfect model, few could
read a tab-delimited file into R if their
job depended on it.

• Data science is the civil engineering of
data. Its acolytes possess a practical
knowledge of tools and materials,
coupled with a theoretical understanding
of what’s possible.
Drew Conway’s Venn diagram of data science
“Rise of the Data Scientist”
• Statistics (traditional analysis you’re
used to thinking about)
• Data munging (parsing, scraping, and
formatting data)
• Visualization (graphs, tools, etc.)

Data science is just a rebranding and
unwelcome takeover of statistics.
A Data Science Profile
• Computer science
• Math
• Statistics
• Machine learning
• Domain expertise
• Communication and presentation skills
• Data visualization
Rachel’s data science profile, which she created to illustrate trying to visualize oneself as a
data scientist; she wanted students and guest lecturers to “riff” on this—to add buckets or
remove skills, use a different scale or visualization method, and think about the drawbacks
of self-reporting.
Data science team profiles can be constructed from data scientist profiles; there should be
alignment between the data science team profile and the profile of the data problems they
try to solve.
• Linear algebra is a fundamental part of data science.
• We introduce the use of linear algebra in data science in only a few hours, so we
cannot cover the topic in full detail.
• Instead, we focus on the most important concepts from linear algebra that are
useful in data science, and in particular for the material taught in this course.
Data Representation
• Data is usually represented in matrix form; we are going to talk about this
representation and the related matrix concepts.
• If the data contains several variables of interest, we would like to know how
many of these variables are really important.
• We would also like to know whether there are relationships between these
variables and, if so, how to uncover them.
• Third, ideas from linear algebra are very important in all kinds of machine
learning algorithms.
• One needs a good understanding of some of these concepts before one can
understand more complex machine learning algorithms.
• In that sense too, linear algebra is an important component of data science.
Data Representation
Matrices for Data Science
• We are going to look at matrices and summarize the most important ideas
that are relevant from a data science viewpoint.
• A matrix is a way of organizing data into rows and columns.
• Suppose you are an engineer looking at data for multiple variables at
multiple times. How do you put this data together in a format that can be
used later? The answer is a matrix.
Matrices for Data Science
• Matrices can be used to represent data. In some cases they can also
represent equations, with the coefficients of several equations as the
matrix entries.
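As a small illustration of the second use, the sketch below stores the coefficients of two made-up equations, 2x + 3y = 8 and x - y = -1, as the rows of a matrix and checks that a candidate solution satisfies both. The equations and the names `A`, `b` and `mat_vec` are assumptions chosen for illustration and do not come from the slides. Plain Python lists are used so the example is self-contained.

```python
# Coefficient matrix: one row per equation, one column per unknown (x, y).
A = [
    [2, 3],   # 2x + 3y = 8
    [1, -1],  # 1x - 1y = -1
]
b = [8, -1]   # right-hand sides of the two equations


def mat_vec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


candidate = [1, 2]  # x = 1, y = 2
print(mat_vec(A, candidate) == b)  # True when the candidate solves both equations
```

Writing the system as "A times the vector of unknowns equals b" is what lets matrix tools treat any number of equations uniformly.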
Data Representation example
• Consider a reactor that needs to be controlled using multiple attributes from
various sensors, such as Pressure (Pa), Temperature (K), and Density (g/m³).
• The sensors independently generated 1000 data points.
• This is the complete data set.
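One way to picture the reactor data set is as a matrix with one row per data point and one column per attribute, giving 1000 rows and 3 columns. The sketch below builds such a matrix from randomly generated readings. The value ranges, the seed and the names `ATTRIBUTES`, `N_SAMPLES` and `data` are invented for illustration; the slides do not give actual sensor values.

```python
import random

random.seed(0)  # fixed seed so the made-up readings are reproducible

ATTRIBUTES = ["pressure_Pa", "temperature_K", "density_g_per_m3"]
N_SAMPLES = 1000

# Each row is one data point (one reading from every sensor);
# each column is one attribute. The value ranges are invented.
data = [
    [
        random.uniform(9.0e4, 1.1e5),    # pressure in Pa
        random.uniform(290.0, 310.0),    # temperature in K
        random.uniform(1100.0, 1300.0),  # density in g/m^3
    ]
    for _ in range(N_SAMPLES)
]

n_rows, n_cols = len(data), len(data[0])
print(n_rows, n_cols)  # shape of the data matrix: 1000 3

# A column holds every reading of one attribute.
temperatures = [row[1] for row in data]
```

Storing samples as rows and attributes as columns is a common convention, not the only one; the transpose carries the same information.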


Data Representation example
• Let us take another example
Image representation
Storing
• The image is stored in the machine
as a large matrix of pixel values.
• Thus, storing the pixel value matrix is
equivalent to storing the image for the
machine.
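To make this concrete, the sketch below represents a tiny 4x4 grayscale image as a matrix of pixel intensities. The image, the 0 to 255 intensity scale and the names `image` and `inverted` are assumptions for illustration; real images are much larger, and colour images typically need one such matrix per channel.

```python
# A hypothetical 4x4 grayscale image: 0 is black, 255 is white.
image = [
    [0,   0,   255, 255],
    [0,   0,   255, 255],
    [255, 255, 0,   0],
    [255, 255, 0,   0],
]

height, width = len(image), len(image[0])

# Reading a pixel is an ordinary matrix lookup: row 0, column 2.
pixel = image[0][2]

# An operation on the image is an operation on the matrix,
# e.g. inverting brightness replaces each value v with 255 - v.
inverted = [[255 - v for v in row] for row in image]

print(height, width, pixel)
```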
Identification
• Several machine learning algorithms
are deployed in order to “teach” the
machine how to identify a particular
image.
• Linear algebra and matrix operations
are at the heart of these machine
learning algorithms.
Independent Attributes
• I might be interested in knowing whether all
the variables in the data are important.
• How many are actually independent
variables?
• Is there any method which can
identify whether some attributes are related
to the other attributes?
• If yes, how do we identify the linear
relationship?
• Can we reduce the size of the matrix?
• In the reactor example, take the attributes
P, T, D and viscosity.
• How does one identify the number of
independent attributes?
• Domain knowledge: D ~ f(P, T)
• Thus, in some sense D is a function of
P and T.
• This implies that at least one attribute is
dependent on the others.
• If the relationship is linear, D is a linear
combination of P and T.
• Is there any approach which can be used
to identify the number of linear
relationships between the attributes
purely using data?
• This is addressed by the concept of the
rank of the matrix.
• The rank of a matrix is the number of
linearly independent rows or columns of
the matrix.
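The sketch below shows one way the rank could be computed, using Gaussian elimination with exact fractions on a small made-up data matrix. The columns stand for P, T and D, and D is constructed as 2P + 3T so that exactly one linear relationship exists. The data and the helper name `matrix_rank` are assumptions for illustration. Real sensor data is noisy, so in practice a tolerance-based routine such as NumPy's `numpy.linalg.matrix_rank` would be more appropriate.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank via Gaussian elimination, using exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot: this column depends on earlier columns
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate the entries below the pivot.
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank


# Columns: P, T, D, where D = 2*P + 3*T by construction.
data = [
    [1, 2, 8],
    [2, 1, 7],
    [3, 5, 21],
    [4, 4, 20],
]
print(matrix_rank(data))  # 2
```

Here the matrix has three attributes but rank 2, so only two attributes are linearly independent and there is 3 - 2 = 1 linear relationship among them.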
Identification of Linear Relationships
among Attributes
Null Space
