DSOST1
Syllabus:
UNIT – I
Introduction to Data Science: What is Data Science?
Toolboxes for Data Scientists: Introduction
Fundamentals of Python libraries for Data Scientists
Data Science Using Open Source Tools
Installation,
IDE,
Get started with Python for Data Scientists
UNIT – II
Descriptive Statistics: Introduction,
Data Preparation,
Exploratory Data Analysis,
Estimation,
Conclusion
Statistical Inference:
Frequentist Approach,
Measuring variability in Estimates.
UNIT – III
Machine Learning:
Introduction,
Supervised Learning,
Learning Curves
Training,
Validation and Testing,
Learning Models,
Case Study: Toy business case
Regression Analysis:
Linear Regression,
Logistic regression
UNIT – IV
Unsupervised Learning:
Clustering: similarities and distances,
what constitutes a good clustering,
Defining metrics to measure clustering quality,
Taxonomies of clustering techniques
Network Analysis:
Basic Definition of Graphs,
Social Network Analysis,
centrality,
Ego-Networks,
Community Detection
UNIT – V
Recommender System:
How do recommender systems work: Content-based filtering,
Collaborative Filtering,
Hybrid recommenders,
Modelling User preferences,
Evaluating Recommenders
Case study:
MovieLens dataset,
User Based Collaborative Filtering.
Statistical Natural Language Processing for sentiment:
Data cleaning,
Text Representation
Text Books:
“Introduction to Data Science: A Python Approach to
Concepts, Techniques and Applications”, Laura Igual &
Santi Seguí, 2016
Introduction to Data Science
Data Science is a multidisciplinary field that combines
knowledge from statistics, computer science, and
domain expertise to extract meaningful insights from
data.
4. Machine Learning
Open-source libraries provide tools to build, evaluate,
and deploy machine learning models.
5. Model Evaluation & Hyperparameter Tuning
Evaluating and fine-tuning models is crucial to
improving performance.
6. Model Deployment
Once a model is trained and evaluated, the next step is
deployment for real-world use.
7. Big Data & Distributed Computing
Open-source tools help process and analyze big data
efficiently.
What is Data Science?
Data science is a multidisciplinary field that uses scientific
methods, algorithms, processes, and systems to extract
knowledge and insights from structured and unstructured
data.
It combines elements from statistics, computer science,
mathematics, and domain expertise to analyze and interpret
large datasets. The goal of data science is to turn raw data
into actionable insights that can drive decision-making and
solve complex problems.
In general, data science allows us to adopt four
different strategies to explore the world using data:
Probing reality
Pattern discovery
Predicting future events
Understanding people and the world
Probing reality: Data can be gathered by passive or by
active methods. In the latter case, data represents the
response of the world to our actions.
Analysis of those responses can be extremely valuable
when it comes to taking decisions about our subsequent
actions.
Pattern discovery: Divide and conquer is an old
heuristic used to solve complex problems; but it is not
always easy to decide how to apply this common sense
to problems.
Datified problems can be analyzed automatically to
discover useful patterns and natural clusters that can
greatly simplify their solutions.
Now, almost all libraries have been ported to Python 3, and
Python 2.7 reached end of life in January 2020, so Python 3 is
the version to choose.
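Because the choice of Python version affects which libraries will run, it can help to confirm the interpreter a notebook is using. A minimal sketch using only the standard library; the guard at the end is an illustrative assumption, not part of the course material:

```python
import sys

# sys.version_info is a named tuple: (major, minor, micro, releaselevel, serial).
major, minor = sys.version_info.major, sys.version_info.minor
print("Running Python {}.{}".format(major, minor))

# Hypothetical guard: stop early if the notebook is opened with Python 2.
if major < 3:
    raise RuntimeError("This notebook assumes Python 3")
```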
But also more specific tools for other related tasks such
as data visualization, code optimization, and big data
processing.
Integrated Development Environments (IDE)
For any programmer, and by extension, for any data
scientist, the integrated development environment
(IDE) is an essential tool.
In the first cell we put the code to import the Pandas library as
pd.
Reading
Create a new notebook called Open Government Data
Analysis and open it.
The way to read CSV files (or any other separated-value files, given the
separator character) in Pandas is by calling the read_csv function.
Besides the name of the file, we pass the na_values keyword argument
along with the string that represents “not available” data in
the file.
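The call described above can be sketched as follows. This is a minimal illustration, not the course's own notebook: the CSV content, the column names, and the ":" missing-value marker are hypothetical stand-ins, and an in-memory buffer is used in place of a file name so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file; ":" marks missing data.
csv_text = """year,region,value
2000,A,4.5
2001,A,:
2002,A,4.0
"""

# read_csv accepts a file path or any file-like object. na_values lists the
# strings that pandas should treat as "not available" (NaN).
df = pd.read_csv(io.StringIO(csv_text), na_values=":")

print(df)
print("Missing entries in 'value':", df["value"].isna().sum())
```

With a real file, the first argument would be the file's path, and a different separator could be given through the sep argument.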
Pandas also has functions for reading files with formats such as Excel,
HDF5, tabulated files, or even the content from the clipboard
(read_excel(), read_hdf(), read_table(), read_clipboard