CS5103 Lecture Plan - Fundamentals of Data Science

The document provides details about a course on fundamentals of data science including its aims, outcomes, syllabus, textbooks, and assessment details. The course introduces students to the data science toolkit and concepts through lectures and a lab component. Topics include data collection, processing, statistical analysis, visualization, wrangling, and modeling techniques taught using the R programming language.

Uploaded by

NaniChinnu

CS5103 Fundamentals of Data Science PCC 4–0–0 4 Credits

The course will introduce students to the data scientist's toolkit and the underlying core concepts. It will cover the full technical pipeline
from data collection to processing, basic notions of statistical analysis and data visualization techniques. This course will provide students
with the basic toolkit to work with data sets in different formats (CSV, R, Google Refine). To support these learning outcomes, the course
will include exercises and a group project in which students will use existing open data sets and build their own application.

Course Aim: The aim of this course is to help students become proficient data scientists and practitioners.
Theory Class Hours: Tue (11-12), Wed (10-11), Thu (10-11), Fri (10-11)
Course Outcomes: At the end of the course the student will be able to:

CO1 Apply statistical methods to data for inferences.

CO2 Analyze data using Classification, Graphical and computational methods.

CO3 Understand Data Wrangling approaches.

CO4 Perform descriptive analytics over massive data.

CO5 Demonstrate key concepts in Data Science, including tools, approaches and application scenarios.

Detailed syllabus

1. Introduction to Data Science: Structure and life cycle of a Data Science project, Managing Data Analysis, Question types,
characteristics of a good question, Overview of a data science experiment (3 hours)

2. Exploratory Data Analysis (4 hours)

3. Overview of Random variables and distributions, Statistical learning, Assessing Model Accuracy, Descriptive statistics,
Dependent and independent events. (3 hours)

4. Data interpretation and use (visualization techniques, pitfalls, D3) (3 hours)

5. Data analytics (statistical modeling, basic concepts, experiment design, pitfalls, R) (3 hours)

6. Data integration (fundamentals of Linked Data, Google Refine) (3 hours)

7. Graphical Analysis: Histograms and frequency polygons, Box-plots, Quartiles, Scatter Plots, Heat Maps (3 hours)

8. Sparse matrices and Interpolation by divided differences. (2 hours)

9. Data Wrangling: Data Acquisition, Data Formats, Imputation, The split-apply-combine paradigm. (8 hours)

10. Simple Hypothesis Testing, Student's t-test, paired t-test and U test, correlation and covariance, tests for association (4 hours)

11. Linear Regression: Simple and multiple linear regression, Comparison of linear regression with K-nearest neighbors. (5 hours)

12. Classification: Linear and Logistic Regression, LDA and comparison of classification methods (4 hours)

13. Singular Value Decomposition: Singular Vectors, SVD, Best Rank-k Approximations, Power Method for computing the SVD,
Eigenvalues, Eigenvectors, Principal Component Analysis. (6 hours)

Assessment Process: Internal Assessment (20%), Mid-Semester Examination (30%) and End-Semester Examination (50%).
Internal assessment is evaluated through quizzes using Plickers (10%), case studies (10%) and presentations (10%).

Text Books:

1) An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie, Robert
Tibshirani, February 11, 2013, web link: www.statlearning.com
2) Foundations of Data Science, John Hopcroft and Ravindran Kannan, draft copy, 2013.
3) Beginning R: The Statistical Programming Language, Mark Gardener, Wiley, 2015.
4) Data Science and Big Data Analytics, EMC Education Services, EMC2, Wiley Publication, 2015.

Note: All students are required to use Moodle to access content and assignments for both the theory and laboratory courses.
CS5105 Data Science Lab PCC 0–0–3 2 Credits
Course Aim: The aim of this laboratory course is for students to acquire skill in the R tool and its use for analysis.

Theory Class Hours: Thursday (2-5); Venue: Data Engineering Lab

Course Outcomes: At the end of the course the student will be able to:

CO1 Demonstrate skills acquired in R Programming.

CO2 Interpret models in data using statistical analysis.

CO3 Prepare environment for distributed systems applications.

CO4 Write programs for big data using MapReduce.

Detailed syllabus

1. Introduction to R: Installing R on Windows, R Console (R window to edit and execute R Commands),
Commands and Syntax (R commands and R syntax), Packages and Libraries (Install and load a
package in R), Help in R, Workspace in R.
2. Familiarity with Data Structures in R: Introduction to Data Types (Why Data Structures?, Types of Data
Structures in R), Vectors, Matrices, Arrays, Lists, Factors, Data Frames, Importing and Exporting Data.
3. Graphical Analysis: Creating a simple graph (Using plot() command), Modifying the points and lines of
a graph (Using type, pch, font, cex, lty, lwd, col arguments in plot() command), Modifying Title and
Subtitle of graph (Using main, sub, col.main, col.sub, cex.main, cex.sub, font.main, font.sub arguments
in plot() command), Modifying Axes of a Graph (Using xlab, ylab, col.lab, cex.lab, font.lab, xlim, ylim,
col.axis, cex.axis, font.axis arguments and axis() command), Adding Additional Elements to a Graph
(Using points(), text(), abline(), curve() commands), Adding Legend on a Graph (Using legend()
command), Special Graphs (Using pie(), barplot(), hist() commands), Multiple Plots (Using mfrow or
mfcol arguments in par() command and layout() command).
4. Descriptive Statistics: Measure of Central Tendency (Mean, Median and Mode), Measure of Positions
(Quartiles, Deciles, Percentiles and Quantiles), Measure of Dispersion (Range, Median, Absolute
deviation about median, Variance and Standard deviation), Measure of Distribution (Skewness and
Kurtosis), Box and Whisker Plot (Box Plot and its parts, Using Box Plots to compare distribution).
5. Comparing Population: Test of Hypothesis (Concept of Hypothesis testing, Null Hypothesis and
Alternative Hypothesis), Cross Tabulations (Contingency table and their use, Chi-Square test, Fisher's
exact test), One Sample t test (Concept, Assumptions, Hypothesis, Verification of assumptions,
Performing the test and interpretation of results), Independent Samples t test (Concept, Type,
Assumptions, Hypothesis, Verification of assumptions, Performing the test and interpretation of results),
Paired Samples t test (Concept, Assumptions, Hypothesis, Verification of assumptions, Performing the
test and interpretation of results), One way ANOVA (Concept, Assumptions, Hypothesis, Verification of
assumptions, Model fit, Hypothesis testing, Post hoc tests: Fisher's LSD, Tukey's HSD).
6. Experiments based on Linear Regression and Multiple Regression Methods.
7. Set up and practice examples on Hadoop 2.0.
8. MapReduce implementation for relational algebra operations.
9. MapReduce implementation for matrix multiplication.

Text Books:
1) An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor
Hastie, Robert Tibshirani, February 11, 2013, web link: www.statlearning.com (chapters 1 to 4)
2) Beginning R: The Statistical Programming Language, Mark Gardener, Wiley, 2015.
Assessment: Continuous Evaluation: 60%; End-Semester Examination: 40%
