Lecture 7 Working With Pandas (1)

Uploaded by

Fatima Chaudhry

Application Development (CSF 510)

Working with Pandas


Lecture Contents
• Data Science Application Development Process
• Pandas
• Pandas DataFrame
• CSV Files
• UCI Machine Learning Repository
• Reading CSVs into Pandas DataFrame
• Some Basic Pandas Commands to Know your Data

Lecture 7 Working with Pandas 2


Data Science Application Development Process
The key stages in the data science application development process are:
1. Problem identification and business understanding: This involves understanding
the business needs and goals, and identifying the specific problem that the data
science application will solve.
2. Data collection and exploration: This involves gathering the relevant data from a
variety of sources and exploring the data to understand its characteristics and
patterns.
3. Data preparation and cleaning: This involves cleaning and transforming the data
to make it suitable for analysis.
4. Data modeling and analysis: This involves building and evaluating machine
learning models to solve the identified problem.
5. Model deployment and monitoring: This involves deploying the trained model to
production, and monitoring its performance to ensure that it is meeting the business
needs.



Example: Predicting Customer Churn
• Problem identification and business understanding: A telecommunications company wants to reduce
customer churn. Churn is when a customer cancels their subscription service. It can be costly for companies
to acquire new customers, so it is important to retain existing customers.
• Data collection and exploration: The telecommunications company collects data on its customers, such
as their demographics, usage patterns, and customer support interactions. The company also collects data
on its competitors' offerings. The company explores data to identify patterns and trends that may be
associated with customer churn. For example, the company may find that customers who use a certain
service less frequently are more likely to churn.
• Data preparation and cleaning: The company prepares the data for analysis by cleaning it to remove
errors and inconsistencies. The company may also transform the data to make it suitable for the chosen
machine learning algorithm.
• Data modeling and analysis: The company builds and evaluates machine learning models to predict
customer churn. The company may use a variety of machine learning algorithms, such as logistic
regression, decision trees, and random forests. The company evaluates the models on a held-out test set to
assess their accuracy. The company selects the model with the best performance on the test set.
• Model deployment and monitoring: The company deploys the trained model to production. The model is
used to predict the likelihood of customer churn. The company uses these predictions to identify customers
who are at risk of churning and to develop targeted interventions to retain these customers.
• The company monitors the performance of the model over time to ensure that it is still accurate. The
company may need to retrain the model as the data and the business environment change.
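
The exploration step in the example above can be sketched in pandas. This is only an illustration: the table, its column names (usage_level, churned) and all values are made up, not taken from a real telecommunications dataset.

```python
import pandas as pd

# Made-up toy data for illustration only; column names are hypothetical.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "usage_level": ["low", "low", "low", "high", "high", "high"],
    "churned": [1, 1, 0, 0, 0, 1],   # 1 = customer cancelled, 0 = customer stayed
})

# The mean of a 0/1 column is the fraction of customers who churned in each group.
churn_rate = customers.groupby("usage_level")["churned"].mean()
print(churn_rate)
```

In this toy table the low-usage group has the higher churn rate, which is the kind of pattern the exploration stage looks for before any model is built.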



What is Pandas?
• Pandas is a Python library used for data analysis and manipulation
• It provides the Series and DataFrame data structures
• Pandas can be used to:
• Load a CSV, Excel, or JSON file into a Series or DataFrame
• Perform necessary data cleaning
• Apply different data pre-processing techniques
• Perform statistical analysis
• Make data ready for a variety of machine learning and prediction tasks
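
A minimal sketch of the two data structures named above. The names and values are made up for illustration.

```python
import pandas as pd

# A Series is a one-dimensional labelled array.
ages = pd.Series([25, 32, 47], name="age")

# A DataFrame is a table; each column is a Series and may have its own data type.
people = pd.DataFrame({
    "name": ["Ali", "Sara", "Omar"],
    "age": [25, 32, 47],
    "subscribed": [True, False, True],
})

print(ages.mean())     # simple statistical analysis on a Series
print(people.dtypes)   # one data type per column
```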



Pandas DataFrame

• A DataFrame is a two-dimensional data structure whose columns can hold different (heterogeneous) data types.
• Pandas DataFrame is well suited to exploratory data analysis (EDA) and pre-processing tasks.
• Pandas DataFrame Features:
• Reading different file types such as Excel and CSV
• Default and user-defined indexing
• Easy data access based on indices
• Easy handling of missing values
• Easy data splitting, merging, concatenation, and integration
• Easy data sorting and aggregation
• Easy and efficient handling of time series data
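
A few of the features listed above can be sketched on a small table. The data, the column names, and the choice to fill the missing value with the column mean are assumptions made for this illustration only.

```python
import numpy as np
import pandas as pd

# Toy data; names and values are made up for illustration.
sales = pd.DataFrame(
    {"region": ["North", "South", "North", "South"],
     "amount": [100.0, np.nan, 300.0, 200.0]},
    index=["a", "b", "c", "d"],          # user-defined index
)

# Missing values: fill the gap with the column mean (one possible strategy).
filled = sales.fillna({"amount": sales["amount"].mean()})

# Sorting and aggregation.
by_amount = filled.sort_values("amount", ascending=False)
totals = filled.groupby("region")["amount"].sum()

# Merging with a second table on a shared column.
managers = pd.DataFrame({"region": ["North", "South"],
                         "manager": ["Hina", "Bilal"]})
merged = filled.merge(managers, on="region")

print(by_amount)
print(totals)
print(merged)
```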



CSV – Comma Separated Values

• A CSV file is a plain-text file
• It stores values separated by a comma (hence the name CSV) or by another delimiter such as a semicolon, colon, space, or tab
• Each line corresponds to a record (row) in a table or dataset
• Most SQL and NoSQL databases can export data in this format
• It is one of the most widely used formats to export, share, and import datasets
• A CSV file typically stores tabular data (numbers and text) in plain text, in
which case each line has the same number of fields.
• It can be viewed in a text editor such as Notepad or in a spreadsheet program such as Excel
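
To illustrate the point about delimiters, here is a small sketch that reads semicolon-separated text. The sample text is made up, and io.StringIO stands in for a file on disk so that the example is self-contained.

```python
import io
import pandas as pd

# Made-up sample; a real file would be passed by its path instead of StringIO.
semicolon_text = "age;job;balance\n30;teacher;1500\n41;engineer;2300\n"

# sep tells read_csv which delimiter separates the values (the default is a comma).
df = pd.read_csv(io.StringIO(semicolon_text), sep=";")
print(df)
```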



CSV- Bank Marketing Dataset



CSV- Bank Marketing Dataset (Excel View)



UCI Machine Learning Repository
• UCI – University of California, Irvine
• Maintains 600+ datasets as a service to the machine learning community
• Open access
• Related to various ML tasks
• Classification
• Regression
• Clustering
• Time series
• Text / NLP
• Example: Bank Marketing Dataset



Reading CSVs into Pandas DataFrame
• You need to import the pandas library
import pandas as pd
• Use pd.read_csv()
• Commonly used parameters:
• filepath_or_buffer (path of the file on your system)
• header (specifies the row that contains the column names)
• header=0 if the first row contains the column names (this is the default behaviour); if there are no column names, then header=None
• index_col
• Used to specify the index column
• If the first column is the index column, then index_col=0
• If no index column exists, then index_col=None (default value)
• The full list of parameters is extensive and can be consulted in the pandas documentation when required
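
The effect of header and index_col can be sketched as follows. The CSV text is made up, and io.StringIO is used in place of a file path so that the example runs on its own.

```python
import io
import pandas as pd

# Made-up CSV content for illustration.
csv_text = "id,name,score\n101,Ali,88\n102,Sara,92\n"

# Defaults: the first row supplies the column names, rows are numbered 0, 1, ...
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0: use the first column ("id") as the row index.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# header=None: treat the file as having no header row. Columns are numbered
# 0, 1, 2 and the original header line becomes the first data row.
no_header_df = pd.read_csv(io.StringIO(csv_text), header=None)

print(default_df)
print(indexed_df)
print(no_header_df)
```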



Some Basic Commands to Know your Data

Command          Description                                                                  Syntax
shape            Returns the number of rows and columns as a tuple (an attribute, no ())      df.shape
head()           Displays the first five rows of a dataset (by default)                       df.head()
tail()           Displays the last five rows of a dataset (by default)                        df.tail()
columns          Displays the column names of a DataFrame                                     df.columns
info()           Displays general information: data types, number of records,                 df.info()
                 number of columns per data type, etc.
describe()       Shows measures of central tendency and dispersion for numeric attributes     df.describe()
dtypes           Shows the data type of each column                                           df.dtypes
isnull().sum()   Shows the number of null values in each column                               df.isnull().sum()
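
The commands in the table can be tried on a small DataFrame. The data below is made up for illustration; with a real dataset, df would come from pd.read_csv().

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in "job" and one in "balance".
df = pd.DataFrame({
    "age": [30, 41, 25, 52, 38, 29],
    "job": ["teacher", "engineer", None, "doctor", "clerk", "engineer"],
    "balance": [1500.0, 2300.0, 800.0, np.nan, 1200.0, 950.0],
})

print(df.shape)            # attribute, so no parentheses
print(df.head())           # first five rows
print(df.tail(2))          # last two rows (the default is five)
print(df.columns)          # column names
df.info()                  # prints its summary directly
print(df.describe())       # statistics for the numeric columns
print(df.dtypes)           # data type of each column
print(df.isnull().sum())   # number of null values per column
```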



Indexing and Selecting Data with Pandas
• Slice and Dice
• The process of selecting a subset of rows and columns from the DataFrame.
• This can be done using a variety of methods, including:
• [ ] Operator
• loc[ ] - Label Indexing
• iloc[ ] - Integer (Position) Indexing
• Boolean Indexing
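
A short sketch of the four selection methods listed above. The DataFrame, its index labels (c1 to c4) and its values are made up for illustration.

```python
import pandas as pd

# Made-up data with a user-defined index.
df = pd.DataFrame(
    {"age": [30, 41, 25, 52],
     "job": ["teacher", "engineer", "clerk", "doctor"],
     "balance": [1500, 2300, 800, 4000]},
    index=["c1", "c2", "c3", "c4"],
)

ages = df["age"]                           # [] operator: one column, returns a Series
subset = df[["age", "balance"]]            # list of columns, returns a DataFrame
by_label = df.loc["c2", "job"]             # loc: row label, column label
label_slice = df.loc["c1":"c3", ["age"]]   # label slices include the end label
by_position = df.iloc[0, 2]                # iloc: row position 0, column position 2
pos_slice = df.iloc[0:2, 0:2]              # position slices exclude the end
rich = df[df["balance"] > 1000]            # boolean indexing: keep rows where True

print(label_slice)
print(pos_slice)
print(rich)
```

Note the difference between the two slices: loc includes the end label, while iloc follows the usual Python rule and excludes the end position.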



Resources / References / Acknowledgments
• https://www.machinelearningplus.com/pandas/pandas-read_csv-completed/
• Data Manipulation using Pandas (Sample Blog)
• https://www.geeksforgeeks.org/indexing-and-selecting-data-with-pandas/
