
Data Science Using Open Source Tools

Syllabus:
UNIT – I
Introduction to Data Science: What is Data Science?
Toolboxes for Data Scientists: Introduction,
Fundamental Python libraries for Data Scientists,
Installation,
IDE,
Get started with Python for Data Scientists
UNIT – II
Descriptive Statistics: Introduction,
Data Preparation,
Exploratory Data Analysis,
Estimation,
Conclusion
Statistical Inference:
Frequentist Approach,
Measuring Variability in Estimates
UNIT – III
Machine Learning:
Introduction,
Supervised Learning,
Learning Curves,
Training, Validation and Testing,
Learning Models,
Case Study: Toy Business Case
Regression Analysis:
Linear Regression,
Logistic Regression
UNIT – IV
Unsupervised Learning:
Clustering: Similarities and Distances,
What Constitutes a Good Clustering,
Defining Metrics to Measure Clustering Quality,
Taxonomies of Clustering Techniques
Network Analysis:
Basic Definition of Graphs,
Social Network Analysis,
Centrality,
Ego-Networks,
Community Detection
UNIT – V
Recommender Systems:
How Do Recommender Systems Work: Content-Based Filtering,
Collaborative Filtering,
Hybrid Recommenders,
Modelling User Preferences,
Evaluating Recommenders
Case Study:
MovieLens Dataset,
User-Based Collaborative Filtering
Statistical Natural Language Processing for Sentiment:
Data Cleaning,
Text Representation
Text Books:
"Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications", Laura Igual & Santi Seguí, 2016
Introduction to Data Science
Data Science is a multidisciplinary field that combines knowledge from statistics, computer science, and domain expertise to extract meaningful insights from data. It involves a series of processes that transform raw data into actionable insights that can be used for decision-making, predictions, and problem-solving.
What do we mean by DSOST?
Data science using open-source tools has become
increasingly popular because of the accessibility,
flexibility, and active communities that surround these
tools.
Open-source tools allow data scientists to work with a
variety of data types, apply complex algorithms, and
develop solutions without the high costs associated
with proprietary software.
Open-source tools are used at various stages of the data science workflow:
1. Data Collection & Access
2. Data Cleaning & Preparation
3. Data Exploration & Visualization
4. Machine Learning
5. Model Evaluation & Hyperparameter Tuning
6. Model Deployment
7. Big Data & Distributed Computing
1. Data Collection & Access
Data collection involves retrieving data from various
sources, such as databases, APIs, or web scraping.
Open-source tools help with connecting to databases,
making API calls, and scraping data from the web.
2. Data Cleaning & Preparation
Data preparation and cleaning involve transforming raw data into a usable form. Open-source libraries can help in handling missing data, normalizing values, removing duplicates, etc.
3. Data Exploration & Visualization
Exploratory Data Analysis (EDA) is essential to understand patterns, relationships, and anomalies in the data.
4. Machine Learning
Open-source libraries provide tools to build, evaluate, and deploy machine learning models.
5. Model Evaluation & Hyperparameter Tuning
Evaluating and fine-tuning models is crucial to improving performance.
6. Model Deployment
Once a model is trained and evaluated, the next step is
deployment for real-world use.
7. Big Data & Distributed Computing
Open-source tools help process and analyze big data
efficiently.
What is Data Science?
Data science is a multidisciplinary field that uses scientific
methods, algorithms, processes, and systems to extract
knowledge and insights from structured and unstructured
data.
It combines elements from statistics, computer science,
mathematics, and domain expertise to analyze and interpret
large datasets. The goal of data science is to turn raw data
into actionable insights that can drive decision-making and
solve complex problems.
In general, data science allows us to adopt four
different strategies to explore the world using data:
Probing reality
Pattern discovery
Predicting future events
Understanding people and the world
Probing reality: Data can be gathered by passive or active methods. In the latter case, the data represent the response of the world to our actions. Analysis of those responses can be extremely valuable when it comes to making decisions about our subsequent actions.
Pattern discovery: Divide and conquer is an old
heuristic used to solve complex problems; but it is not
always easy to decide how to apply this common sense
to problems.
Datified problems can be analyzed automatically to
discover useful patterns and natural clusters that can
greatly simplify their solutions.
Predicting future events:
Predictive analytics allows decisions to be taken in response to future events, not only reactively.
It is not possible to predict the future in every environment and there will always be unpredictable events; but the identification of predictable events represents valuable knowledge.
For example, predictive analytics can be used to optimize tasks planned in advance, such as staffing or inventory decisions.
Understanding people and the world:
This is an objective that at the moment is beyond the
scope of most companies and people.
But large companies and governments are investing
considerable amounts of money in research areas such
as understanding natural language, computer vision,
psychology and neuroscience.
Toolboxes for Data Scientists
Introduction:
The toolbox of any data scientist, as for any kind of
programmer, is an essential ingredient for success and
enhanced performance.
Choosing the right tools can save a lot of time and thereby allow us to focus on data analysis.
The most basic tool to decide on is which programming language we will use. Many people use only one programming language in their entire life: the first and only one they learn. For many, learning a new language is an enormous task that, if at all possible, should be undertaken only once.
The problem is that some languages are intended for developing high-performance or production code, such as C, C++, or Java, while others are more focused on prototyping code; among the latter, the best known are the so-called scripting languages: Ruby, Perl, and Python.
So, depending on the first language you learned, certain tasks will, at the very least, be rather tedious. The main problem of being stuck with a single language is that many basic tools simply will not be available in it, and eventually you will have either to reimplement them or to create a bridge to use some other language just for a specific task.
You either have to be ready to change to the best language for each task and then glue the results together, or choose a very flexible language with a rich ecosystem (e.g., third-party open-source libraries). In this course we have selected Python as the programming language.
Why Python
Python is an interpreted language, so the code is executed immediately in the Python console without needing the compilation step to machine language. Besides the Python console (which comes included with any Python installation) you can find other interactive consoles, such as IPython, which give you a richer environment in which to execute your Python code.
Python is a mature programming language, but it also has excellent properties for newbie programmers. Python is one of the most flexible programming languages; it is so flexible that it can be seen as a multiparadigm language.
This is especially useful for people who already know how to program with other languages, as they can rapidly start programming with Python in the same way. For example, Java programmers will feel comfortable using Python as it supports the object-oriented paradigm, and C programmers could mix Python and C code using Cython.
Python also has basic statements for functional programming in its own core library. Beyond the language itself, what makes Python a suitable platform for data scientists is its large ecosystem of scientific libraries and its highly active community.
Fundamental Python Libraries for Data Scientists
The Python community is one of the most active programming communities, with a huge number of developed toolboxes. The most popular Python toolboxes for any data scientist are NumPy, SciPy, Pandas, and Scikit-Learn.
Numeric and Scientific Computation: NumPy and SciPy
NumPy is the cornerstone toolbox for scientific computing with Python. NumPy provides, among other things, support for multidimensional arrays with basic operations on them and useful linear algebra functions.
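A minimal sketch of these capabilities (array creation, elementwise arithmetic, and one linear algebra routine):

import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])  # a 2x2 multidimensional array
b = a * 2 + 1                           # elementwise basic operations
inv = np.linalg.inv(a)                  # a linear algebra function: matrix inverse
print(b)
print(inv)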
Many toolboxes use the NumPy array representations as an efficient basic data structure. SciPy provides a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics, and much more.
Another core toolbox in the SciPy ecosystem is the plotting library Matplotlib. This toolbox has many tools for data visualization.
Scikit-Learn: Machine Learning in Python
Scikit-learn is a machine learning library built on top of NumPy, SciPy, and Matplotlib.
Scikit-learn offers simple and efficient tools for
common tasks in data analysis such as
classification,
regression,
clustering,
dimensionality reduction,
model selection, and
preprocessing
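As an illustrative sketch (the dataset and model here are our own choices, not prescribed by the course), a typical classification workflow looks like this:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                 # a small sample dataset bundled with Scikit-learn
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)          # hold out 30% of the data for testing
clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
print(clf.score(X_test, y_test))                  # accuracy on the unseen test data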
Pandas: Python Data Analysis Library
Pandas provides high-performance data structures and data analysis tools. The key feature of Pandas is a fast and efficient DataFrame object for data manipulation with integrated indexing.
The DataFrame structure can be seen as a spreadsheet, which offers very flexible ways of working with it. You can easily transform any dataset in the way you want, by reshaping it and adding or removing columns or rows. It also provides high-performance functions for aggregating, merging, and joining datasets.
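A brief sketch of merging two hypothetical DataFrames on a shared column:

import pandas as pd

left = pd.DataFrame({'country': ['ES', 'FR'], 'gdp': [1.4, 2.9]})   # hypothetical values
right = pd.DataFrame({'country': ['ES', 'FR'], 'pop': [47, 68]})
merged = pd.merge(left, right, on='country')  # join the two tables on 'country'
print(merged)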
Pandas also has tools for importing and exporting data from
different formats: comma-separated value (CSV), text files,
Microsoft Excel, SQL databases, and the fast HDF5 format.
In many situations, the data you have in such formats will
not be complete or totally structured.
For such cases, Pandas offers handling of missing data and
intelligent data alignment. Furthermore, Pandas provides a
convenient Matplotlib interface.
Data Science Ecosystem Installation
We will need to set up our programming environment, starting with the Python language itself. There are currently two different versions of Python: Python 2.X and Python 3.X. The differences between the versions are important, and the two are not compatible: code written in Python 2.X does not work in Python 3.X and vice versa.
Python 3.X was introduced in late 2008. By now, almost all libraries have been ported to Python 3, but Python 2.7 is still maintained, so either version can be chosen. Once we have chosen one of the Python versions, the next thing to decide is whether we want to install the data scientist Python ecosystem as individual toolboxes, or to perform a bundle installation with all the needed toolboxes.
For newbies, the second option is recommended. If the first option is chosen, then it is only necessary to install all the mentioned toolboxes; if a bundle installation is chosen, the Anaconda Python distribution is a good option.
The Anaconda distribution provides integration of all the Python toolboxes and applications needed for data scientists into a single directory, without mixing it with other Python toolboxes installed on the machine. It contains the core toolboxes and applications such as
NumPy,
Pandas,
SciPy,
Matplotlib,
Scikit-learn,
IPython,
Spyder, etc.,
but also more specific tools for other related tasks such as data visualization, code optimization, and big data processing.
Integrated Development Environments (IDE)
For any programmer, and by extension for any data scientist, the integrated development environment (IDE) is an essential tool. IDEs are designed to maximize programmer productivity. Choosing the right IDE for each person is crucial and, unfortunately, there is no "one-size-fits-all" programming environment.
The basic pieces of any IDE are three: the editor, the compiler (or interpreter), and the debugger. Some IDEs can be used with multiple programming languages, via language-specific plugins, such as NetBeans or Eclipse. Others are specific to only one language or even to a particular programming task.
In the case of Python, there are a large number of specific IDEs, both commercial (PyCharm, WingIDE…) and open-source. For example, Spyder (Scientific Python Development EnviRonment) is an IDE customized with the tasks of the data scientist in mind.
Web Integrated Development Environment (WIDE): Jupyter
One of the first applications of this kind of WIDE was developed by William Stein in early 2005 using Python 2.3, as part of his SageMath mathematical software. In SageMath, a server can be set up in a center, such as a university or school, and then students can work on their homework either in the classroom or at home, starting from exactly the same point they left off.
Nowadays, such sessions are called notebooks, and they are not only used in classrooms but also to show results in presentations or on business dashboards. Since December 2011, IPython has been issued as a browser version of its interactive console, called IPython notebook, which shows the Python execution results very clearly and concisely by means of cells. Cells can contain content other than code.
The IPython notebook has been separated from the IPython software and has become part of a larger project: Jupyter. Jupyter (for Julia, Python and R) aims to reuse the same WIDE for all these interpreted languages, not just Python.
All old IPython notebooks are automatically imported to the new version when they are opened with the Jupyter platform; but once they are converted to the new version, they cannot be opened again with old IPython notebook versions. All the examples shown in what follows use the Jupyter notebook style.
Get Started with Python for Data Scientists
To execute our examples, we will use Jupyter notebook, although any other console or IDE can be used.
The Jupyter Notebook Environment
Once all the ecosystem is fully installed, we can start by
launching the Jupyter notebook platform.
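Assuming a standard Anaconda (or pip) installation, the platform is launched from a terminal with:

jupyter notebook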
The browser will immediately be launched, displaying the Jupyter notebook homepage, whose URL is http://localhost:8888/tree. Note that a special port is used; by default it is 8888.
A blank notebook is created, called Untitled. Click on the notebook name and rename it: DataScience-GetStartedExample. Let us begin by importing those toolboxes that we will need for our program.

In the first cell we put the code to import the Pandas library as pd. This is for convenience; every time we need to use some functionality from the Pandas library, we will write pd instead of pandas. We will also import the two core libraries mentioned above: the numpy library as np and the matplotlib library as plt.
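The first cell therefore contains the three import statements (importing matplotlib's pyplot module is the usual way to obtain the plt alias):

import pandas as pd              # the Pandas library, aliased as pd
import numpy as np               # the NumPy library, aliased as np
import matplotlib.pyplot as plt  # the Matplotlib plotting interface, aliased as plt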
To execute just one cell, we press the play button, click on Cell → Run, or press Ctrl + Enter. While execution is underway, the header of the cell shows the * mark. Once the execution is finished, the header of the cell will be replaced by the execution number. Since this will be the first cell executed, the number shown will be 1. If the process of importing the libraries is correct, no output cell is produced.
The DataFrame Data Structure
The key data structure in Pandas is the DataFrame object. A DataFrame is basically a tabular data structure, with rows and columns. Rows have a specific index to access them, which can be any name or value. In Pandas, the columns are called Series, a special type of data. First, we will create a new cell by clicking Insert → Insert Cell Below or pressing Ctrl + B. Then, we write the following code:
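The original code cell is not reproduced in these notes; a minimal sketch of building a DataFrame from a dictionary of lists (the column names and values below are hypothetical) would be:

data = {'year': [2010, 2011, 2012],
        'team': ['FCBarcelona', 'RMadrid', 'ValenciaCF'],
        'wins': [30, 32, 26]}
football = pd.DataFrame(data, columns=['year', 'team', 'wins'])  # column order made explicit
football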

Now, if we execute this cell, the result will be a table. The index of each row is created automatically, taking the position of its elements inside the entry lists, starting from 0.
Open Government Data Analysis Example Using Pandas
We will start by doing some basic analysis of government data. For the sake of transparency, data produced by government entities must be open, meaning that they can be freely used, reused, and distributed by anyone.
An example of this is Eurostat, which is the home of European Commission data. Eurostat's main role is to process and publish comparable statistical information at the European level. The data in Eurostat are provided by each member state, and it is free to reuse them for both noncommercial and commercial purposes.
The first thing to do is to retrieve such data from Eurostat. Since open data have to be delivered in a plain text format, CSV (or any other delimiter-separated value) formats are commonly used to store tabular data. The data we will use can be found already processed in the book's GitHub repository as the educ_figdp_1_Data.csv file.
Reading
Create a new notebook called Open Government Data Analysis and open it. Then, after ensuring that the educ_figdp_1_Data.csv file is stored in the same directory as our notebook, we will write the following code to read and show the content:
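The cell itself is not reproduced here; a sketch of it, assuming (as in the book's Eurostat files) that ':' marks non-available values:

edu = pd.read_csv('educ_figdp_1_Data.csv',
                  na_values=':')  # ':' is read as missing data (NaN)
edu                               # display the resulting DataFrame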

The way to read CSV (or any other separated-value, providing the separator character) files in Pandas is by calling the read_csv method. Besides the name of the file, we add the na_values key argument to this method, along with the character that represents "non available data" in the file. Pandas also has functions for reading files with formats such as Excel, HDF5, tabulated files, or even the content of the clipboard (read_excel(), read_hdf(), read_table(), read_clipboard()). Whichever function we use, the result of reading a file is stored as a DataFrame structure.
Selecting Data
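The selection examples are not preserved in these notes; as a sketch, typical Pandas selection operations (the column names below come from the Eurostat example and are assumptions) include:

edu['Value']                     # select one column as a Series
edu[10:14]                       # select a range of rows by position
edu.loc[90:94, ['TIME', 'GEO']]  # select rows and columns by label with loc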
Filtering Data
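Filtering relies on Boolean indexing: a condition on a column yields a Boolean Series that selects the matching rows. A sketch (column name again assumed from the Eurostat example):

edu[edu['Value'] > 6.5].tail()  # rows where the indicator exceeds 6.5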
Filtering Missing Values
Pandas uses the special value NaN (not a number) to represent missing values. In Python, NaN is a special floating-point value returned by certain operations when one of their results ends in an undefined value. The way to tell whether a value is missing in a DataFrame is by using the isnull() function. Indeed, this function can be used to filter rows with missing values:
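A sketch of both uses, checking for missing values and filtering rows on them (the 'Value' column is assumed from the Eurostat example):

edu.isnull().any()                 # which columns contain at least one missing value
edu[edu['Value'].isnull()].head()  # rows whose 'Value' entry is missing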
Manipulating Data
Once we know how to select the desired data, the next thing we need to know is how to manipulate it. One of the most straightforward things we can do is to operate on columns or rows using aggregation functions. If a function is applied to a DataFrame or a selection of rows and columns, you can specify whether the function should be applied to the rows for each column (setting the axis=0 keyword on the invocation of the function), or to the columns for each row (setting the axis=1 keyword).
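A small self-contained sketch of the two axis options:

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.sum(axis=0)  # applied to the rows for each column: a=6, b=15
df.sum(axis=1)  # applied to the columns for each row: 5, 7, 9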
Sorting
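No code survives for this slide; a hedged sketch of sorting a DataFrame by a column (column name assumed from the Eurostat example):

edu.sort_values(by='Value', ascending=False,
                inplace=True)  # order rows by 'Value', highest first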
Grouping Data
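Likewise, a sketch of grouping rows by a key column and aggregating each group:

group = edu[['GEO', 'Value']].groupby('GEO').mean()  # mean 'Value' per country code
group.head()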
Rearranging Data
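Rearranging typically means pivoting: turning one column's values into new columns. A sketch under the same assumed column names:

pivedu = pd.pivot_table(edu[edu['TIME'] > 2005], values='Value',
                        index=['GEO'], columns=['TIME'])  # countries as rows, years as columns
pivedu.head()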
Ranking Data
Now we can perform the ranking using the rank function. Note here that the parameter ascending=False makes the ranking go from the highest values to the lowest values. The Pandas rank function supports different tie-breaking methods, specified with the method parameter. In our case, we use the first method, in which ranks are assigned in the order they appear in the array, avoiding gaps between rankings.
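A sketch of the ranking cell, applied to the pivoted table from the previous sketch:

pivedu.rank(ascending=False, method='first').head()  # rank per column, highest value gets rank 1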
Plotting
Pandas DataFrames and Series can be plotted using the plot function, which uses the Matplotlib library for graphics. For example, if we want to plot the accumulated values for each country over the last 6 years, we can take the Series obtained in the previous example and plot it directly by calling the plot function, as shown in the next cell:
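A sketch of that cell, building on the pivoted table assumed above:

totalSum = pivedu.sum(axis=1)  # accumulate the values for each country
totalSum.plot(kind='bar')      # rendered through Matplotlib
plt.show()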