CS5103 Lecture Plan - Fundamentals of Data Science
The course will introduce students to the data scientist's toolkit and the underlying core concepts. It will cover the full technical pipeline
from data collection to processing, basic notions of statistical analysis, and data visualization techniques. This course will provide students
with the basic toolkit to work with data sets in different formats and tools (CSV, R, Google Refine). To support these learning outcomes, the course
will include exercises and a group project in which students will use existing open data sets and build their own application.
Course Aim: The aim of this course is to enable students to become proficient data scientists and practitioners.
Theory Class Hours: Tue (11-12), Wed (10-11), Thu (10-11), Fri (10-11)
Course Outcomes: At the end of the course the student will be able to:
CO5 Demonstrate key concepts in Data Science, including tools, approaches and application scenarios
Detailed syllabus
1. Introduction to Data Science: Structure and life cycle of a Data Science project, Managing Data Analysis, Question types,
characteristics of a good question, Overview of a data science experiment (3 hours)
3. Overview of Random variables and distributions, Statistical learning, Assessing Model Accuracy, Descriptive statistics,
Dependent and independent events (3 hours)
5. Data analytics (statistical modeling, basic concepts, experiment design, pitfalls, R) (3 hours)
7. Graphical Analysis: Histograms and frequency polygons, Box-plots, Quartiles, Scatter Plots, Heat Maps (3 hours)
9. Data Wrangling: Data Acquisition, Data Formats, Imputation, The split-apply-combine paradigm (8 hours)
10. Simple Hypothesis Testing: Student's t-test, paired t-test and U test, correlation and covariance, tests for association (4 hours)
11. Linear Regression: Simple and multiple linear regression, Comparison of linear regression with K-nearest neighbors (5 hours)
12. Classification: Linear and Logistic Regression, LDA, and comparison of classification methods (4 hours)
13. Singular Value Decomposition: Singular Vectors, SVD, Best Rank-k Approximations, Power Method for computing the SVD,
Eigenvalues, Eigenvectors, Principal Component Analysis (6 hours)
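Illustrative sketch for item 13 (not part of the official syllabus): the power method can approximate the largest singular value of a matrix A by repeatedly applying A^T A to a vector and renormalizing. The example is written in plain Python for illustration only; the course itself uses R. The function name top_singular_triplet, the all-ones starting vector, and the fixed iteration count are assumptions made for this sketch. It also assumes the starting vector is not orthogonal to the top right singular vector.

```python
import math


def top_singular_triplet(a, iterations=200):
    """Approximate the largest singular value of `a` and its singular vectors.

    `a` is a list of equal-length rows. Returns (sigma, u, v), where u and v
    are the approximate left and right singular vectors.
    """
    rows, cols = len(a), len(a[0])
    # Assumed starting point: normalized all-ones vector.
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        # av = A v
        av = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        # w = A^T (A v)
        w = [sum(a[i][j] * av[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # v lies in the null space of A; nothing more to refine
        v = [x / norm for x in w]
    # sigma = ||A v|| and u = A v / sigma
    av = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av] if sigma > 0.0 else av
    return sigma, u, v
```

Calling top_singular_triplet([[3.0, 0.0], [0.0, 1.0]]) returns a singular value close to 3. The outer product of u and v scaled by sigma gives the best rank-1 approximation of the matrix.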
Assessment Process: Internal Assessment (20%), Mid-Semester Examination (30%) and End-Semester Examination (50%).
Internal assessment is evaluated through quizzes using Plickers (10%), case studies (10%) and presentations (10%).
Text Books:
1) An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie, Robert
Tibshirani, February 11, 2013, web link: www.statlearning.com
2) Foundations of Data Science, John Hopcroft and Ravindran Kannan, draft copy, 2013.
3) Beginning R: The Statistical Programming Language, Mark Gardener, Wiley, 2015.
4) Data Science and Big Data Analytics, EMC Education Services, EMC2, Wiley Publication, 2015.
Note: All students are required to use Moodle to access content and assignments for both the theory and the laboratory course.
CS5105 Data Science Lab PCC 0–0–3 2 Credits
Course Aim: The aim of this laboratory course is for students to acquire skill in the R tool and its use for analysis.
Course Outcomes: At the end of the course the student will be able to:
Detailed syllabus
Text Books:
1) An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor
Hastie, Robert Tibshirani, February 11, 2013, web link: www.statlearning.com (Chapters 1 to 4)
2) Beginning R: The Statistical Programming Language, Mark Gardener, Wiley, 2015.
Assessment: Continuous Evaluation: 60%; End-Semester Examination: 40%