Data Science I: Charles C.N. Wang
• EDUCATION
Ph.D. in Bioinformatics, Asia University, Taiwan
• RESEARCH FIELD
Bioinformatics, Systems Biology, Semantic Computing,
Text Mining and Knowledge Discovery
• Data Science
• Data Processing
• Data Visualization
Matplotlib Example
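A minimal sketch of a Matplotlib line chart. The data, title, and axis labels are made up for illustration, and the "Agg" backend is an assumption chosen so the script also runs on a machine without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data: the squares of 0..9
x = np.arange(0, 10)
y = x ** 2

fig, ax = plt.subplots()
ax.plot(x, y)                 # draw a single line
ax.set_title("y = x squared")
ax.set_xlabel("x")
ax.set_ylabel("y")

# Render the figure to an in-memory PNG instead of a file on disk
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

In an interactive session, plt.show() would display the figure instead of saving it.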
Data Processing- Data Operations in Numpy
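A short sketch of typical NumPy data operations. The array values are illustrative only.

```python
import numpy as np

# Build a 2-D array (2 rows, 3 columns) from a nested list
a = np.array([[1, 2, 3], [4, 5, 6]])

doubled = a * 2              # element-wise arithmetic
col_sums = a.sum(axis=0)     # sum down each column
reshaped = a.reshape(3, 2)   # same six values, arranged as 3 rows and 2 columns
first_row = a[0]             # indexing returns the first row
second_col = a[:, 1]         # slicing returns the second column

print(a.shape)
print(doubled)
print(col_sums)
```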
Data Processing- Data Operations in Pandas
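Pandas handles data through Series, DataFrame, and Panel. Panel was deprecated in pandas 0.20 and is not available in pandas 1.0 and later; a DataFrame with a MultiIndex is the usual replacement.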
Pandas Series
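A Series is a one-dimensional labelled array. The sketch below uses made-up values and index labels.

```python
import pandas as pd

# A Series pairs each value with an index label
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

value_b = s["b"]        # look up a value by its label
above_15 = s[s > 15]    # boolean filtering keeps the matching labels

print(s)
print(above_15)
```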
Pandas DataFrame
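A DataFrame is a two-dimensional table whose columns can hold different types. This sketch uses a few illustrative names and salaries; the "Bonus" column is a hypothetical derived value.

```python
import pandas as pd

# Each dictionary key becomes a column
df = pd.DataFrame({
    "Name": ["Rick", "Dan", "Michelle"],
    "Salary": [623.3, 515.2, 611.0],
})

names = df["Name"]                    # select one column (a Series)
high_paid = df[df["Salary"] > 600]    # filter rows with a boolean condition
df["Bonus"] = df["Salary"] * 0.1      # add a derived column (hypothetical)

print(df)
print(high_paid)
```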
Pandas Panel
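Panel held three-dimensional data, but it is not available in current pandas releases. As a substitute (not the Panel API itself), the sketch below stores several two-dimensional tables in one DataFrame with a MultiIndex; the item names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# The outer index level names the "item"; the inner level is the row within it
index = pd.MultiIndex.from_product(
    [["Item1", "Item2"], [0, 1, 2]], names=["item", "row"]
)
stacked = pd.DataFrame(
    np.arange(12).reshape(6, 2), index=index, columns=["A", "B"]
)

# Selecting one outer label returns that item's 2-D table
item1 = stacked.loc["Item1"]

print(stacked)
print(item1)
```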
Data Cleansing
• Missing data is a constant problem in real-world data sets. In machine
learning and data mining, missing values lower the quality of the data
and can severely reduce the accuracy of model predictions. Treating
missing values is therefore a major step toward more accurate and
valid models.
When and Why Is Data Missed?
Check for Missing Values
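A minimal sketch of checking for missing values with isnull, using a small made-up DataFrame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "one": [1.0, np.nan, 3.0],
    "two": [np.nan, np.nan, 6.0],
})

mask = df.isnull()                      # True wherever a value is missing
missing_per_column = df.isnull().sum()  # count of missing values per column

print(mask)
print(missing_per_column)
```

notnull returns the inverse mask.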
Cleaning / Filling Missing Data
Pandas provides several methods for cleaning missing values. The fillna function
can “fill in” NA values with non-null data in a couple of ways, which are
illustrated in the following sections.
Replace NaN with a Scalar Value
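A sketch of replacing every NaN with a single scalar (here 0) using fillna. The DataFrame is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "one": [1.0, np.nan, 3.0],
    "two": [np.nan, 5.0, np.nan],
})

# fillna returns a new DataFrame; the original is left unchanged
filled = df.fillna(0)

print(filled)
```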
Fill NA Forward and Backward
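A sketch of forward and backward filling on a made-up Series. It uses the ffill and bfill methods, since passing method= to fillna is deprecated in recent pandas versions.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: carry the last known value forward
forward = s.ffill()

# Backward fill: pull the next known value backward;
# the trailing NaN stays because nothing follows it
backward = s.bfill()

print(forward)
print(backward)
```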
Drop Missing Values
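A sketch of dropping missing values with dropna, by row (the default) and by column. The column names and values are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "one": [1.0, np.nan, 3.0],
    "two": [4.0, 5.0, np.nan],
    "three": [7.0, 8.0, 9.0],
})

complete_rows = df.dropna()        # drop every row that contains a NaN
complete_cols = df.dropna(axis=1)  # drop every column that contains a NaN

print(complete_rows)
print(complete_cols)
```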
Replace Missing (or) Generic Values
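A sketch of replacing generic placeholder values with replace. The placeholder values (1000, 2000, and the sentinel -999) are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "one": [10, 20, 1000],
    "two": [2000, 40, 50],
})

# Map each placeholder value to its replacement
cleaned = df.replace({1000: 10, 2000: 60})

# A sentinel such as -999 can also be turned into a real missing value
readings = pd.Series([5, -999, 7]).replace(-999, np.nan)

print(cleaned)
print(readings)
```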
Processing CSV Data
• Reading data from CSV (comma-separated values) files is a fundamental
task in data science.
Input as CSV File
Reading a CSV File
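A sketch of reading CSV data with read_csv. To keep the example self-contained, the CSV content is held in a string and wrapped in io.StringIO; with a real file, the file path would be passed instead. The column names and rows are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
"""

# read_csv accepts a path or any file-like object
data = pd.read_csv(io.StringIO(csv_text))

print(data)
```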
Reading Specific Rows
Reading Specific Columns
Reading Specific Columns and Rows
Reading Specific Columns for a Range of Rows
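The four selections named above (specific rows, specific columns, both, and columns for a range of rows) can be sketched with iloc and loc. The DataFrame stands in for data already read from a CSV file; its contents are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["Rick", "Dan", "Michelle", "Ryan", "Gary"],
    "salary": [623.3, 515.2, 611.0, 729.0, 843.25],
    "dept": ["IT", "Operations", "IT", "HR", "Finance"],
})

first_rows = data.iloc[0:3]                      # specific rows (positions 0, 1, 2)
two_columns = data.loc[:, ["salary", "name"]]    # specific columns, all rows
subset = data.loc[[1, 3], ["salary", "name"]]    # specific rows and columns
row_range = data.loc[1:3, ["salary", "name"]]    # columns for a range of rows

print(subset)
print(row_range)
```

Note that a loc slice such as 1:3 is label-based and includes both endpoints, while an iloc slice excludes the end position.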
Processing JSON Data
A JSON file stores data as text in a human-readable format. JSON stands for JavaScript
Object Notation. Pandas can read JSON files using the read_json function.
Input Data
{ "ID":["1","2","3","4","5","6","7","8" ],
"Name":["Rick","Dan","Michelle","Ryan","Gary","Nina","Simon","Guru" ],
"Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5" ],
"StartDate":[ "1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013",
"7/30/2013","6/17/2014"],
"Dept":[ "IT","Operations","IT","HR","Finance","IT","Operations","Finance"] }
Read the JSON File
Reading Specific Columns and Rows
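A sketch of reading the input data above with read_json and then selecting specific rows and columns. The JSON text is held in a string and wrapped in io.StringIO so the example is self-contained; with a real file, the file path would be passed instead. The chosen row labels are arbitrary.

```python
import io

import pandas as pd

json_text = """
{ "ID":["1","2","3","4","5","6","7","8"],
  "Name":["Rick","Dan","Michelle","Ryan","Gary","Nina","Simon","Guru"],
  "Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5"],
  "StartDate":["1/1/2012","9/23/2013","11/15/2014","5/11/2014",
               "3/27/2015","5/21/2013","7/30/2013","6/17/2014"],
  "Dept":["IT","Operations","IT","HR","Finance","IT","Operations","Finance"] }
"""

# Each JSON key becomes a column; list positions become row labels 0..7
data = pd.read_json(io.StringIO(json_text))

# Pick rows 1, 3, and 5 and two of the columns
selected = data.loc[[1, 3, 5], ["Name", "Dept"]]

print(data)
print(selected)
```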
Processing XLS Data
Microsoft Excel is a very widely used spreadsheet program. Pandas can read Excel files using the read_excel function.
Creating a Chart
Adding Annotations
Data Visualization- Chart Styling
Charts created in Python can be styled further using methods provided by the
charting libraries.
Adding Annotations
Adding Legends
Chart presentation Style
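The three styling steps above (annotations, legends, and presentation style) can be sketched together in Matplotlib. The data, labels, annotation text, and the choice of the "ggplot" style are assumptions for illustration, as is the "Agg" backend, which lets the script run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 10)

# Apply a built-in presentation style only inside this block
with plt.style.context("ggplot"):
    fig, ax = plt.subplots()
    ax.plot(x, x ** 2, label="x squared")
    ax.plot(x, x ** 3, label="x cubed", linestyle="--")

    # Annotation: text with an arrow pointing at a data point
    ax.annotate(
        "x squared at x = 8",
        xy=(8, 64),
        xytext=(2, 400),
        arrowprops=dict(arrowstyle="->"),
    )

    # Legend: built from the label= arguments given to plot
    ax.legend(loc="upper left")
    ax.set_title("Styled chart")
```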
Data Visualization- Box Plots
A box plot shows how the values in a data set are distributed. It is built on three quartiles
(the first quartile, the median, and the third quartile), which divide the data into four parts.
Drawing a Box Plot
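A sketch that computes the three quartiles with pandas and then draws the box plot with Matplotlib. The scores are made up, and the "Agg" backend is an assumption so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data with one unusually large value
scores = pd.Series([2, 4, 4, 5, 7, 9, 10, 12, 13, 30])

# The three quartiles that define the box and its middle line
q1, median, q3 = scores.quantile([0.25, 0.5, 0.75])
print(q1, median, q3)

fig, ax = plt.subplots()
# The box spans q1 to q3, with a line at the median
parts = ax.boxplot(scores.to_numpy())
ax.set_ylabel("score")
```

With Matplotlib's default settings, points far outside the box are drawn individually as outliers.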
There are three main measures of central tendency, each of which can be calculated
with methods in the pandas library.
•Mean - the average value: the sum of the values divided by the
number of values.
•Median - the middle value of the distribution when the values are arranged in ascending or
descending order.
•Mode - the most frequently occurring value in the distribution.
Measuring Central Tendency- Calculating Mean and Median
The pandas functions can be directly used to calculate these values.
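A minimal sketch of calculating the mean and median with pandas. The column names and values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 26, 25, 23, 30],
    "Rating": [4.2, 3.2, 4.0, 2.6, 3.9],
})

mean_values = df.mean()      # column-wise average
median_values = df.median()  # column-wise middle value

print(mean_values)
print(median_values)
```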
Measuring Central Tendency- Calculating Mode
A mode may or may not exist in a distribution, depending on whether the data is continuous and
whether any value occurs more often than the others.
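A sketch of calculating the mode with pandas, using made-up values. The second Series illustrates the case where no value repeats.

```python
import pandas as pd

# 25 and 26 each occur twice, so both are modes
ages = pd.Series([25, 26, 25, 23, 30, 26])
modes = ages.mode()

# Continuous-style data with no repeats: every value ties for most frequent,
# so mode() returns all of them and carries little information
heights = pd.Series([1.5, 2.5, 3.5])
no_repeat = heights.mode()

print(modes)
print(no_repeat)
```

mode returns a Series rather than a single number because a distribution can have more than one mode.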
Asia University
Health|Care|Innovation|Excellence