Data Science I: Charles C.N. Wang

This document provides an overview of data science and related topics including data processing, visualization, and analysis. It discusses popular Python libraries for working with data like Pandas, NumPy, SciPy, and Matplotlib. It also covers reading and cleaning data from different file formats like CSV, JSON, and XLS and exploring data visualization techniques and chart properties in Python.

Uploaded by

sar
Copyright
© All Rights Reserved

Data Science I

Charles C.N. Wang


http://cnwang.me
[email protected]
Department of Bioinformatics and Medical Engineering
王昭能 Charles C.N. Wang
• CURRENTLY
Department of Bioinformatics and Medical
Engineering

Center for Artificial Intelligence

Center for AI Precision Medicine

• EDUCATION
Ph.D. in Bioinformatics, Asia University, Taiwan

M.S. in Bioinformatics, Asia University, Taiwan

• RESEARCH FIELD
Bioinformatics, Systems Biology, Semantic Computing,
Text Mining and Knowledge Discovery
• Data Science

• Data Processing

• Data Visualization

• Statistical Data Analysis


Data Science
• Data is the new oil.
• Every modern IT system is driven by capturing, storing and analysing data for
various needs.
• Data science is about making business decisions, forecasting weather, studying
protein structures in biology or designing a marketing campaign.
• All of these scenarios involve a multidisciplinary approach of using mathematical
models, statistics, graphs, databases and of course the business or scientific logic
behind the data analysis.
• Data science is the process of deriving knowledge and insights from a
huge and diverse set of data through organizing, processing and
analysing the data.
• Data science involves many different disciplines, such as mathematical and
statistical modelling, extracting data from its source and applying data
visualization techniques.
Data Science - Recommendation systems
• As online shopping becomes more prevalent, e-commerce platforms are
able to capture users' shopping preferences as well as the performance
of various products in the market.
• This leads to the creation of recommendation systems, which build
models that predict the shopper's needs and show the products the
shopper is most likely to buy.
Data Science - Financial Risk management
• The financial risk involved in loans and credit is better analysed by
using the customer's past spending habits, past defaults, other financial
commitments and many socio-economic indicators. This data is
gathered from various sources in different formats. Organising it
and gaining insight into the customer's profile needs the help of
data science. The outcome is a smaller loss for the financial
organization, because bad debt is avoided.
Data Science - Improvement in Health Care services
• The health care industry deals with a variety of data which can be
classified into technical data, financial data, patient information, drug
information and legal rules. All this data needs to be analysed in a
coordinated manner to produce insights that will save cost both for
the health care provider and care receiver while remaining legally
compliant.
Data Science - Computer Vision
• Advances in image recognition by computers involve processing
large sets of image data from multiple objects of the same category,
for example in face recognition. These data sets are modelled,
and algorithms are created to apply the model to newer images to get
a satisfactory result. Processing these huge data sets and creating
the models needs various tools used in data science.
Data Science - Efficient Management of Energy
• As the demand for energy soars, energy-producing companies need
to manage the various phases of energy production and distribution
more efficiently. This involves optimizing the production methods and
the storage and distribution mechanisms, as well as studying the
customers' consumption patterns. Linking the data from all these
sources and deriving insight is a daunting task.
This is made easier by using the tools of data science.
Python in Data Science
• The programming requirements of data science demand a versatile and
flexible language in which code is simple to write but which
can handle highly complex mathematical processing.
• Python is well suited to these requirements, as it has already
established itself as a language for both general computing and
scientific computing. Moreover, it is continuously extended with
new additions to its plethora of libraries aimed at different
programming requirements.
• Below we discuss the features of Python that make it the
preferred language for data science.
Python - Pandas
• Pandas is an open-source Python Library used for high-performance
data manipulation and data analysis using its powerful data structures.
Python with pandas is in use in a variety of academic and commercial
domains, including Finance, Economics, Statistics, Advertising, Web
Analytics, and more. Using Pandas, we can accomplish five typical
steps in the processing and analysis of data, regardless of the origin of
data — load, organize, manipulate, model, and analyse the data.
• Below are some of the important features of Pandas that are
used specifically for data processing and data analysis work.
Key Features of Pandas
• Fast and efficient DataFrame object with default and customized indexing.
• Tools for loading data into in-memory data objects from different file
formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, indexing and subsetting of large data sets.
• Columns from a data structure can be deleted or inserted.
• Group by data for aggregation and transformations.
• High performance merging and joining of data.
• Time Series functionality.
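A few of the features listed above (group by, merging, and pivoting) can be sketched as follows. This is only a minimal illustration; the two small tables and their column names are made up for the example.

```python
import pandas as pd

# Hypothetical example data, not taken from the slides.
employees = pd.DataFrame({
    "Name": ["Rick", "Dan", "Michelle", "Ryan"],
    "Dept": ["IT", "Operations", "IT", "HR"],
    "Salary": [623.3, 515.2, 611.0, 729.0],
})
floors = pd.DataFrame({"Dept": ["IT", "HR"], "Floor": [3, 1]})

# Group by: mean salary per department.
mean_by_dept = employees.groupby("Dept")["Salary"].mean()

# Merging: a left join keeps every employee; departments without a
# matching row in `floors` get a missing value (NaN) in "Floor".
merged = employees.merge(floors, on="Dept", how="left")

# Pivoting: highest salary per department as a small summary table.
top_salary = employees.pivot_table(index="Dept", values="Salary", aggfunc="max")

print(mean_by_dept)
print(merged)
print(top_salary)
```

The left join shows the integrated handling of missing data: the row for the Operations department is kept even though no floor is known for it.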
Pandas - Dimension & Description

DataFrame is the most widely used and most important of these data structures.


Pandas - Series
Pandas - DataFrame
Python - Numpy
• NumPy is a Python package which stands for 'Numerical Python'. It is
a library consisting of multidimensional array objects and a collection
of routines for processing arrays.

• Using NumPy, a developer can perform the following operations:


• Mathematical and logical operations on arrays.
• Fourier transforms and routines for shape manipulation.
• Operations related to linear algebra. NumPy has in-built functions for linear
algebra and random number generation.
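The operations listed above can be tried on small arrays. The following is a minimal sketch; the array values are arbitrary and chosen only so that the results are easy to check by hand.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[0.5, 0.5], [0.5, 0.5]])

# Mathematical and logical operations act element by element.
total = a + b
mask = a > 2.0  # boolean array with the same shape as a

# Shape manipulation: view the 2x2 array as 4 elements in a row.
flat = a.reshape(4)

# Fourier transform of a constant signal: everything lands in bin 0.
spectrum = np.fft.fft(np.ones(4))

# Linear algebra: solve a @ x = y for x.
y = np.array([5.0, 11.0])
x = np.linalg.solve(a, y)

# Random number generation from a seeded generator (repeatable runs).
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=3)

print(total, mask, flat, spectrum, x, samples, sep="\n")
```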
Python - SciPy
• The SciPy library of Python is built to work with NumPy arrays and
provides many user-friendly and efficient numerical routines, such as
routines for numerical integration and optimization.
• NumPy and SciPy run on all popular operating systems, are quick to
install, free of charge and easy to use, yet powerful enough to
be depended upon by some of the world's leading scientists and engineers.
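As a small illustration of the integration and optimization routines mentioned above, the sketch below uses `scipy.integrate.quad` and `scipy.optimize.minimize_scalar`. The two functions being integrated and minimized are chosen for the example because their exact answers are known.

```python
import numpy as np
from scipy import integrate, optimize

# Numerical integration: the integral of sin(x) from 0 to pi is exactly 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Optimization: the minimum of (x - 3)^2 lies at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2)

print(area, result.x)
```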
Python - Matplotlib
• Matplotlib is a Python library used to create 2D graphs and plots from
Python scripts. It has a module named pyplot which makes
plotting easy by providing features to control line styles, font
properties, axis formatting, etc. It supports a very wide variety of
graphs and plots, such as histograms, bar charts, power spectra and error
charts. It is used along with NumPy to provide an environment
that is an effective open-source alternative to MATLAB. It can also be
used with graphics toolkits like PyQt and wxPython.
Conventionally, the pyplot module is imported into the Python script under the alias plt.

Matplotlib Example
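The original example on this slide is not preserved, so the following is only a minimal sketch of the conventional import and a first plot. The sine curve and the title are made up for illustration, and the `Agg` backend is an assumption that lets the script run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt  # the conventional alias
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x))
ax.set_title("Sine wave")
```

In an interactive session the backend line can be dropped and `plt.show()` called to display the figure.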
Data Processing- Data Operations in Numpy
Data Processing- Data Operations in Pandas
Pandas handles data through Series, DataFrame, and Panel. Panel was removed in pandas 0.25, so current versions provide only Series and DataFrame.
Pandas Series
Pandas DataFrame
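The operations shown on the original slides are not preserved. As a rough sketch of typical Series and DataFrame operations, using made-up values:

```python
import pandas as pd

# A Series is a one-dimensional labelled array.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
picked = s["b"]        # label-based access
doubled = s * 2        # element-wise arithmetic

# A DataFrame is a two-dimensional table with labelled columns.
df = pd.DataFrame({
    "Name": ["Rick", "Dan", "Michelle"],
    "Salary": [623.3, 515.2, 611.0],
})
high = df[df["Salary"] > 600]          # row selection by condition
df["Bonus"] = df["Salary"] * 0.1       # insert a new column
without_bonus = df.drop(columns="Bonus")  # delete a column (returns a copy)

print(picked, doubled, high, df, sep="\n")
```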
Pandas Panel
Data Cleansing
• Missing data is always a problem in real-life scenarios. Areas like
machine learning and data mining face severe issues in the accuracy
of their model predictions because of poor quality of data caused by
missing values. In these areas, missing value treatment is a major
point of focus to make their models more accurate and valid.
When and Why Is Data Missed?
Check for Missing Values
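A minimal sketch of checking for missing values with `isnull`, assuming a small made-up DataFrame in which missing entries are stored as NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps.
df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [np.nan, np.nan, 6.0]},
    index=["a", "b", "c"],
)

missing_mask = df.isnull()              # True where a value is missing
missing_per_column = df.isnull().sum()  # count of missing values per column

print(missing_mask)
print(missing_per_column)
```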
Cleaning / Filling Missing Data
Pandas provides various methods for cleaning the missing values. The fillna function
can “fill in” NA values with non-null data in a couple of ways, which we have
illustrated in the following sections.
Replace NaN with a Scalar Value
Fill NA Forward and Backward
Drop Missing Values
Replace Missing (or) Generic Values
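The four treatments named above can be sketched on one small, made-up DataFrame. Forward and backward filling use the `ffill` and `bfill` methods here, which recent pandas versions prefer over passing a `method` argument to `fillna`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps.
df = pd.DataFrame({"one": [1.0, np.nan, 3.0], "two": [4.0, np.nan, np.nan]})

filled_zero = df.fillna(0)         # replace NaN with a scalar value
filled_forward = df.ffill()        # carry the last valid value forward
filled_backward = df.bfill()       # pull the next valid value backward
dropped = df.dropna()              # drop every row that has a missing value
replaced = df.replace({1.0: 10.0}) # replace a generic (non-missing) value

print(filled_zero, filled_forward, filled_backward, dropped, replaced, sep="\n\n")
```

Note that backward filling leaves the trailing gaps in column "two" untouched, because no later valid value exists to copy from.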
Processing CSV Data
• Reading data from CSV (comma-separated values) files is a fundamental
necessity in data science.
Input as CSV File
Reading a CSV File
Reading Specific Rows
Reading Specific Columns
Reading Specific Columns and Rows
Reading Specific Columns for a Range of Rows
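The input file and the code from the original slides are not preserved. The sketch below reads a small made-up CSV from an in-memory string so that it is self-contained; with a real file, the file path would be passed to `read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the input file.
csv_text = """id,name,salary,dept
1,Rick,623.3,IT
2,Dan,515.2,Operations
3,Michelle,611,IT
4,Ryan,729,HR
"""

data = pd.read_csv(io.StringIO(csv_text))

first_rows = data[0:2]                               # specific rows (by position)
some_columns = data.loc[:, ["salary", "name"]]       # specific columns
rows_and_columns = data.loc[[1, 3], ["salary", "name"]]  # both at once
row_range = data.loc[1:2, ["salary", "name"]]        # columns for a range of rows

print(data, first_rows, some_columns, rows_and_columns, row_range, sep="\n\n")
```

Label-based slicing with `loc` includes both end points, so `1:2` returns two rows.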
Processing JSON Data
A JSON file stores data as text in a human-readable format. JSON stands for JavaScript
Object Notation. Pandas can read JSON files using the read_json function.
Input Data

{ "ID":["1","2","3","4","5","6","7","8" ],
"Name":["Rick","Dan","Michelle","Ryan","Gary","Nina","Simon","Guru" ],
"Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5" ],
"StartDate":[ "1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013",
"7/30/2013","6/17/2014"],
"Dept":[ "IT","Operations","IT","HR","Finance","IT","Operations","Finance"] }
Read the JSON File
Reading Specific Columns and Rows
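A minimal sketch of both steps, using the input data shown above. The JSON text is wrapped in `io.StringIO` so the example needs no file on disk; with a real file, its path would be passed to `read_json`. The particular rows and columns selected are arbitrary.

```python
import io
import pandas as pd

json_text = """
{ "ID":["1","2","3","4","5","6","7","8"],
  "Name":["Rick","Dan","Michelle","Ryan","Gary","Nina","Simon","Guru"],
  "Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5"],
  "StartDate":["1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015",
               "5/21/2013","7/30/2013","6/17/2014"],
  "Dept":["IT","Operations","IT","HR","Finance","IT","Operations","Finance"] }
"""

# Read the JSON text into a DataFrame: one column per key.
data = pd.read_json(io.StringIO(json_text))

# Specific columns and rows, selected by label.
subset = data.loc[[1, 3, 5], ["Salary", "Name"]]

print(data)
print(subset)
```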
Processing XLS Data
Microsoft Excel is a very widely used spreadsheet program.

Input as Excel File


Reading an Excel File
Reading Specific Columns and Rows
Reading Multiple Excel Sheets
Data Visualization- Chart Properties
Python has excellent libraries for data visualization. A combination of Pandas, NumPy and Matplotlib can help in
creating nearly all types of visualization charts. In this chapter we get started by looking at a simple
chart and its various properties.

Creating a Chart
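The chart from the original slide is not preserved. A minimal sketch of creating a chart, assuming made-up data (the squares of 0 to 9) and the `Agg` backend so that no display is needed:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 10)
y = x ** 2  # ** is the power operator (^ would be bitwise XOR)

fig = plt.figure()
# Axes placed at [left, bottom, width, height] in figure fractions.
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
ax.plot(x, y)
```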

Labelling the Axes

Formatting Line type and Colour

Saving the Chart File
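The chart properties covered in this section (axis labels, line type and colour, saving to a file) can be combined in one short sketch. The data, label texts, and output file name are made up for the example, and the file is written to a temporary directory.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 10)

fig, ax = plt.subplots()
# Format string "r--": red colour, dashed line.
ax.plot(x, x ** 2, "r--")

# Title and axis labels.
ax.set_title("Squares")
ax.set_xlabel("x")
ax.set_ylabel("x squared")

# Save the chart; the format is inferred from the file extension.
out_path = os.path.join(tempfile.mkdtemp(), "chart.png")
fig.savefig(out_path)
```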
Data Visualization- Chart Styling
Charts created in Python can be styled further using methods from the
libraries used for charting.

Adding Annotations
Adding Legends
Chart presentation Style
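Annotations, legends, and a presentation style can be sketched together as follows. The curves, the annotation text, and the choice of the built-in "ggplot" style are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 5)

# Apply a presentation style only inside this block.
with plt.style.context("ggplot"):
    fig, ax = plt.subplots()
    ax.plot(x, x ** 2, label="square")
    ax.plot(x, x ** 3, label="cube")

    # Annotation: text with an arrow pointing at the data point (2, 8).
    ax.annotate("x = 2", xy=(2, 8), xytext=(0.5, 40),
                arrowprops={"arrowstyle": "->"})

    # Legend built from the label of each plotted line.
    legend = ax.legend(loc="upper left")
```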
Data Visualization- Box Plots
A box plot shows how the values in a data set are distributed. It divides the data set at
three quartiles (the first quartile, the median and the third quartile), giving four groups of roughly equal size.
Drawing a Box Plot
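A minimal sketch of drawing a box plot, assuming randomly generated data in three made-up columns. The quartiles that define the box for column "A" are also computed directly for comparison.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["A", "B", "C"])

fig, ax = plt.subplots()
# One box per column of the 2-D array.
parts = ax.boxplot(df.to_numpy())
ax.set_xticklabels(list(df.columns))

# The three quartiles behind the box for column "A".
quartiles = df["A"].quantile([0.25, 0.5, 0.75])
print(quartiles)
```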

Data Visualization- Heat Maps
A heatmap represents each value to be plotted as a shade of the same colour, so that differences in shade show differences in value.
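As a minimal sketch, a heatmap of a small made-up matrix can be drawn with Matplotlib's `imshow`; the single-hue "Blues" colour map and the `Agg` backend are assumptions for the example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical 3x4 matrix of values.
values = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [3.0, 6.0, 9.0, 12.0],
])

fig, ax = plt.subplots()
image = ax.imshow(values, cmap="Blues")  # darker blue = larger value
fig.colorbar(image, ax=ax)               # scale linking shade to value
```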
Data Visualization- Scatter Plots
Scatter plots show many points plotted in the Cartesian plane. Each point represents the values of two
variables. One variable is plotted on the horizontal axis and the other on the vertical axis.

Drawing a Scatter Plot
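A minimal sketch using the pandas plotting interface, assuming random data in two made-up columns "a" and "b":

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)
df = pd.DataFrame(rng.random((50, 2)), columns=["a", "b"])

# Column "a" goes on the horizontal axis, column "b" on the vertical axis.
ax = df.plot.scatter(x="a", y="b")
```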

Data Visualization- Time Series
A time series is a series of data points in which each data point is associated with a timestamp.
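The data set used on the original slide is not preserved. The sketch below builds a made-up daily series, plots it, and computes weekly means to show what the timestamp index makes possible.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import numpy as np
import pandas as pd

# Ten daily observations starting on Monday, 1 January 2024.
dates = pd.date_range("2024-01-01", periods=10, freq="D")
series = pd.Series(np.arange(10.0), index=dates)

# Line plot with the timestamps on the horizontal axis.
ax = series.plot()

# Resample to weekly frequency (weeks ending on Sunday) and average.
weekly_mean = series.resample("W").mean()
print(weekly_mean)
```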
Statistical Data Analysis-
Measuring Central Tendency
Mathematically, central tendency means measuring the center, or typical location, of the values of
a data set. It gives an idea of the average value of the data in the data set. How widely the values
are spread is a separate property, measured by the variance and standard deviation.

There are three main measures of central tendency, which can be calculated using
the methods in the pandas Python library.

•Mean - It is the average value of the data: the sum of the values divided by the
number of values.
•Median - It is the middle value in a distribution when the values are arranged in ascending or
descending order.
•Mode - It is the most commonly occurring value in a distribution.
Measuring Central Tendency-
Calculating Mean and Median
The pandas functions can be directly used to calculate these values.
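A minimal sketch with a small made-up data set, chosen so the results can be checked by hand:

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "Age": [25, 26, 25, 23, 30],
    "Rating": [4.2, 3.2, 4.0, 2.6, 3.8],
})

mean_age = df["Age"].mean()      # (25 + 26 + 25 + 23 + 30) / 5
median_age = df["Age"].median()  # middle of 23, 25, 25, 26, 30

# Calling the methods on the whole DataFrame works column by column.
all_means = df.mean()

print(mean_age, median_age)
print(all_means)
```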
Measuring Central Tendency-
Calculating Mode
A mode may or may not exist in a distribution, depending on whether the data is continuous and
whether any values occur with maximum frequency.
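A short sketch with made-up values. The pandas `mode` method returns a Series because several values can tie for the highest frequency.

```python
import pandas as pd

ages = pd.Series([25, 26, 25, 23, 30])
modes = ages.mode()  # 25 occurs twice, every other value once

# When every value occurs equally often, all of them are returned.
no_single_mode = pd.Series([1, 2, 3]).mode()

print(modes.tolist(), no_single_mode.tolist())
```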

Statistical Data Analysis-
Measuring Variance
In statistics, variance is a measure of how far the values in a data set lie from the mean value. In other
words, it indicates how dispersed the values are. Its square root, the standard deviation, expresses the
same spread in the units of the data. A related descriptive measure, skewness, describes the asymmetry of the distribution rather than its spread.
Measuring Standard Deviation
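A minimal sketch with made-up values whose mean is 5 and whose squared deviations sum to 32. Note that pandas divides by n - 1 by default (the sample formula); passing `ddof=0` gives the population formula, which divides by n.

```python
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# Population variance and standard deviation (divide by n = 8).
pop_var = values.var(ddof=0)  # 32 / 8
pop_std = values.std(ddof=0)  # square root of the variance

# Sample standard deviation, the pandas default (divide by n - 1 = 7).
sample_std = values.std()

print(pop_var, pop_std, sample_std)
```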
Statistical Data Analysis-
Measuring Variance
Skewness is used to determine whether the data is symmetric or skewed. As a rule of thumb, if the skewness index
is between -1 and 1, the distribution is roughly symmetric. If the index is no more than -1, it is skewed to the left, and if
it is at least 1, it is skewed to the right.
Measuring Skewness
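A short sketch with two made-up data sets, one symmetric and one with a long right tail, using the pandas `skew` method:

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([1, 1, 1, 2, 10])  # one large value stretches the right tail

skew_symmetric = symmetric.skew()        # zero for perfectly symmetric data
skew_right_tailed = right_tailed.skew()  # positive: skewed to the right

print(skew_symmetric, skew_right_tailed)
```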
Statistical Data Analysis-
Normal Distribution
The normal distribution is a way of presenting data by arranging the probability of each
value in the data. Most values lie around the mean value, making the arrangement symmetric.
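As a minimal sketch, samples can be drawn from a normal distribution with NumPy and checked for the properties described above. The mean of 50 and standard deviation of 5 are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=50.0, scale=5.0, size=10_000)

sample_mean = samples.mean()

# Fraction of values within one standard deviation of the mean;
# for a normal distribution this is about 68%.
within_one_sd = np.mean(np.abs(samples - 50.0) < 5.0)

# Bin counts for a histogram, which shows the symmetric bell shape.
counts, bin_edges = np.histogram(samples, bins=20)

print(sample_mean, within_one_sd)
```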
Statistical Data Analysis-
P-Value
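The p-value measures the strength of the evidence against a hypothesis: it is the probability of observing data at
least as extreme as the sample if the hypothesis were true. We build a hypothesis based on some statistical
model and assess its validity using the p-value. One way to get the p-value is by using a t-test.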
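A minimal sketch of a one-sample t-test using `scipy.stats.ttest_1samp`. The measurements and the two hypothesized means are made up for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements with a sample mean of about 5.0.
sample = np.array([5.1, 4.9, 5.0, 5.2, 4.8, 5.1, 5.0, 4.9])

# Hypothesis 1: the population mean is 5.0 (consistent with the data).
t_same, p_same = stats.ttest_1samp(sample, popmean=5.0)

# Hypothesis 2: the population mean is 6.0 (far from the data).
t_far, p_far = stats.ttest_1samp(sample, popmean=6.0)

print(p_same, p_far)
```

A large p-value means the data gives no reason to reject the hypothesis; a very small one means the data would be very unlikely if the hypothesis were true.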
Statistical Data Analysis-
Linear Regression
In linear regression, two variables are related through an equation in which the exponent (power)
of both variables is 1. Mathematically, a linear relationship represents a straight line when
plotted as a graph. A non-linear relationship, where the exponent of any variable is not equal to 1,
creates a curve.
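The example from the original slide is not preserved. As a rough sketch, a straight line y = slope * x + intercept can be fitted by least squares with `numpy.polyfit`; this is a swapped-in technique, and the data points are made up so that they lie exactly on y = 2x + 1.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0  # points lying exactly on a straight line

# Degree-1 polynomial fit: returns the highest power first.
slope, intercept = np.polyfit(x, y, deg=1)

# Use the fitted line to predict y for a new x.
predicted = slope * 5.0 + intercept

print(slope, intercept, predicted)
```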

Thank you for listening

Asia University
Health|Care|Innovation|Excellence
