PPT for Assignment-3 (Final_Pandas_Lab)
Pandas
• Pandas is a Python library used for working with data sets.
• It has functions for analyzing, cleaning, exploring, and
manipulating data.
Why Use Pandas?
• Pandas allows us to analyze large data sets and draw conclusions
based on statistical theories.
• Pandas can clean messy data sets, and make them readable
and relevant.
• Relevant data is very important in data science.
What Can Pandas Do?
• Pandas gives you answers about the data, such as:
– Is there a correlation between two or more columns?
– What is the average value?
– What is the maximum value?
– What is the minimum value?
• Pandas can also delete rows that are not relevant or that
contain wrong values, such as empty or NULL values. This is
called cleaning the data.
• Pandas can load CSV and Excel files into a DataFrame.
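The questions above map onto a handful of pandas calls. The following is a minimal sketch on a small made-up table; the column names and values are assumptions for illustration, not data from the lab.

```python
import pandas as pd

# Hypothetical workout data (assumed for illustration); None marks a missing value.
df = pd.DataFrame({
    "duration": [60, 60, 45, 30],
    "calories": [400.0, 410.0, 300.0, None],
})

# Cleaning: drop rows that contain empty (NaN) values.
clean = df.dropna()

# Answers about the data.
corr = clean["duration"].corr(clean["calories"])  # correlation between two columns
mean_cal = clean["calories"].mean()               # average value
max_cal = clean["calories"].max()                 # maximum value
min_cal = clean["calories"].min()                 # minimum value

print(corr, mean_cal, max_cal, min_cal)
```

Note that dropna() returns a new DataFrame here, so the original df still contains the row with the missing value.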
Pandas Data Structures
• Data Structures: Pandas offers two primary data structures:
DataFrame and Series.
Pandas DataFrames
• A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional
array, or a table with rows and columns.
• A Pandas DataFrame is a two-dimensional, size-mutable, and
heterogeneous data structure (similar to a table in a relational
database or an Excel spreadsheet).
• Example:
• Create a simple Pandas DataFrame:
import pandas as pd  # Importing pandas library
# Keys correspond to columns and values correspond to rows
data = {
    'calories': [420, 380, 390],
    'duration': [50, 40, 45]
}
myvar = pd.DataFrame(data)
print(myvar)
DataFrame with one column
• X = df[['Length']]
DataFrame with multiple columns
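As a hedged illustration of both selections, the sketch below builds a small DataFrame and selects one column and then several columns. Only the 'Length' column name comes from the slides; the other column names and all values are assumptions.

```python
import pandas as pd

# Hypothetical song table (values assumed for illustration).
df = pd.DataFrame({
    "Length": [3.5, 4.1, 2.9],
    "Artist": ["A", "B", "C"],
    "Released": [1979, 1984, 1991],
})

one_col = df[["Length"]]             # double brackets: DataFrame with one column
two_cols = df[["Artist", "Length"]]  # list of names: DataFrame with multiple columns
as_series = df["Length"]             # single brackets: a Series, not a DataFrame

print(type(one_col).__name__, type(as_series).__name__)
```

The inner list is what makes the result a DataFrame; a single column name in single brackets returns a Series instead.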
Pandas DataFrames
• df.head(): View the first n rows of the DataFrame (default is 5 rows).
• df.tail(): View the last n rows of the DataFrame (default is 5 rows).
• df.info(): Provides a concise summary of the DataFrame, including
the number of non-null entries, column names, and data types.
• df.describe(): Returns descriptive statistics for the numeric columns of the DataFrame:
– count - The number of non-empty values.
– mean - The average (mean) value.
– std - The standard deviation.
– min - The minimum value.
– 25% - The 25th percentile.
– 50% - The 50th percentile (the median).
– 75% - The 75th percentile.
– max - The maximum value.
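The inspection methods above can be tried on any DataFrame. This is a small sketch using made-up numeric data (the column names and values are assumptions), not the lab's data set.

```python
import pandas as pd

# Ten rows of hypothetical numeric data.
df = pd.DataFrame({
    "calories": list(range(100, 200, 10)),
    "duration": list(range(10, 20)),
})

first = df.head()      # first 5 rows by default
last3 = df.tail(3)     # last 3 rows
df.info()              # prints column names, non-null counts, and dtypes
stats = df.describe()  # count, mean, std, min, 25%, 50%, 75%, max

print(stats)
```

describe() returns a DataFrame itself, so an individual statistic can be read with, for example, stats.loc["mean", "calories"].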
Pandas DataFrames
• Locate Row:
– As you can see from the result above, the DataFrame is like a table
with rows and columns.
– Pandas uses the loc attribute to return one or more specified row(s).
• Example
• Return row 0:
# refer to the row index:
print(df.loc[0])
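A self-contained version of the row lookup might look like the sketch below, which reuses the calories/duration values from the earlier DataFrame example. Passing a list of labels to return several rows is added here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})

row0 = df.loc[0]       # one label: the row comes back as a Series
rows = df.loc[[0, 1]]  # a list of labels: the rows come back as a DataFrame

print(row0)
print(rows)
```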
Pandas Series
• A one-dimensional labeled array, essentially a single column or row of
data.
• A Pandas Series is like a column in a table.
• It is a one-dimensional array holding data of any type.
• Example:
– Create a simple Pandas Series from a list:
import pandas as pd
a = [1, 7, 2]
data1 = pd.Series(a)
print(data1)
Labels
• If nothing else is specified, the values are labeled with their
index number. First value has index 0, second value has index 1
etc.
• This label can be used to access a specified value.
• With the index argument, you can name your own labels.
• Example
• Create your own labels:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index=["x", "y", "z"])
print(myvar)
Output:
x    1
y    7
z    2
dtype: int64
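To show how a label is used to access a value, here is a minimal sketch that builds the same list once with default labels and once with custom labels. The variable names default and labeled are chosen for illustration.

```python
import pandas as pd

a = [1, 7, 2]

default = pd.Series(a)                         # labels are 0, 1, 2
labeled = pd.Series(a, index=["x", "y", "z"])  # labels are x, y, z

first_value = default[0]   # access by the default index label
y_value = labeled["y"]     # access by a custom label

print(first_value, y_value)
```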
• df.loc['b':'e', 'Artist'] selects the 'Artist' column for the rows labeled 'b' through 'e' (both ends included).
Navigating DataFrame – loc and iloc
• Both are used for accessing rows and columns in a DataFrame.
Label-based Indexing (loc)
• Accesses rows and columns by their labels
• Includes both the start and end positions in slicing
• df.loc[row_labels, column_labels]
Integer location-based Indexing (iloc)
• Accesses rows and columns by their integer positions
• Excludes the end position in slicing
• Does not accept a Boolean Series as a mask (a plain Boolean list or array works)
• df.iloc[row_positions, column_positions]
df.loc[1, 'column_name'] # Accesses the row with label 1 and column 'column_name'
df.loc[1:3, 'col1':'col3'] # Accesses rows 1 to 3 and columns 'col1' to 'col3'
df.iloc[1, 0] # Accesses the second row (index 1) and first column (index 0)
df.iloc[1:3, 0:2] # Accesses rows at index 1 and 2, and columns at index 0 and 1
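The difference in slicing behavior is easiest to see side by side. The sketch below uses a made-up five-row DataFrame with the default integer index, so the same numbers serve as labels for loc and as positions for iloc; the values are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [10, 20, 30, 40, 50],
    "col2": [1, 2, 3, 4, 5],
    "col3": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# loc: label-based, the end of the slice is included.
by_label = df.loc[1:3, "col1":"col2"]  # rows 1, 2, 3 and columns col1, col2

# iloc: position-based, the end of the slice is excluded.
by_pos = df.iloc[1:3, 0:2]             # rows 1, 2 and columns at positions 0, 1

# Boolean selection: loc takes a Boolean Series directly;
# for iloc, convert the mask to a plain array first.
mask = df["col1"] > 20
with_loc = df.loc[mask]
with_iloc = df.iloc[mask.to_numpy()]

print(by_label.shape, by_pos.shape)
```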
Navigating Data Frame
• iloc exclusively uses integer positions for accessing data.
• This makes it particularly useful when dealing with data where labels
might be unknown or irrelevant.
• df.iloc[row number/slice]
• df.iloc[4], df.iloc[1:4], df.iloc[:]
• df.iloc[1:4, 5:8] (slicing rows and columns)
Dataframe Slicing
Navigating Data Frame
• df.iloc[4]: Selects the 5th row (index 4) from the
DataFrame df. It returns a single row as a Series.
• df.iloc[1:4]: Selects a slice of rows from index 1
to 3 (excluding index 4) from df. It returns multiple rows as a
DataFrame.
• df.iloc[:]: Selects all rows and columns from df. It is
essentially the same as df, returning the entire DataFrame.
• df.iloc[1:4, 5:8]: Selects rows from index 1 to 3
(excluding 4) and columns from index 5 to 7 (excluding 8). It
returns the specified subset as a DataFrame.
• df.iloc[:, 2]: Selects all rows (:) of the column at
index 2 (the 3rd column), returning the entire column as a Series.
• This is how to extract a whole column with .iloc without
targeting individual rows.
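The iloc commands described above can be checked on a small grid of numbers. This sketch assumes a made-up DataFrame with 6 rows and 8 columns, where each cell holds row * 10 + column so that the selected positions are easy to recognize.

```python
import pandas as pd

# 6 rows x 8 columns; the value in each cell is row * 10 + column.
df = pd.DataFrame(
    [[r * 10 + c for c in range(8)] for r in range(6)],
    columns=[f"c{c}" for c in range(8)],
)

row = df.iloc[4]           # 5th row, returned as a Series
rows = df.iloc[1:4]        # rows at positions 1, 2, 3, returned as a DataFrame
everything = df.iloc[:]    # all rows and columns
block = df.iloc[1:4, 5:8]  # rows 1-3 and columns 5-7
col = df.iloc[:, 2]        # all rows of the 3rd column, returned as a Series

print(block)
```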
Listing Unique Values
• Create a new DataFrame consisting of songs from the 1980s and after.
• When you use a Boolean Series as an index (e.g., df[boolean_series]), pandas returns a new
DataFrame containing only the rows where the Boolean Series is True.
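A possible sketch of both steps follows: listing unique values and filtering with a Boolean Series. The column names 'Artist' and 'Released' and the sample rows are assumptions for illustration; the lab's actual data set may use different names.

```python
import pandas as pd

# Hypothetical song table; 'Released' holds the release year.
df = pd.DataFrame({
    "Artist": ["A", "B", "C", "D"],
    "Released": [1976, 1980, 1982, 1980],
})

# Unique values of a column, in order of first appearance.
years = df["Released"].unique()

# Boolean Series: True where the song was released in 1980 or later.
mask = df["Released"] >= 1980

# Indexing with the Boolean Series keeps only the True rows.
df_80s = df[mask]

print(years)
print(df_80s)
```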
Pandas Read CSV
• A simple way to store big data sets is to use
CSV files (comma-separated values files).
• CSV files contain plain text and are a well-known
format that can be read by almost any tool, including
Pandas.
• In our examples we will be using a CSV file
called 'data.csv'.
Pandas Read CSV
• Example:
• Load the CSV into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())  # show only the first 5 rows
print(df.to_string())  # show all the rows
• The pd.read_csv() function is used to read the data from the
data.csv file.
• df.to_string() converts the entire DataFrame df into a string
representation, showing all rows and columns.
• If you print a large DataFrame with many rows without to_string(), Pandas will only
show the first 5 rows and the last 5 rows.
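To try read_csv() without a data.csv file on disk, the sketch below reads the same kind of content from an in-memory text buffer. The CSV text and its column names are made up for illustration; with a real file you would pass the file path instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'data.csv'.
csv_text = (
    "Duration,Pulse,Calories\n"
    "60,110,409.1\n"
    "60,117,479.0\n"
    "45,109,282.4\n"
)

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())       # first 5 rows (here, all 3)
print(df.to_string())  # every row and column as one string
```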
Delete a column from Dataset
• You can delete a column or feature from a dataset with
df.drop(labels, axis, index, columns, inplace).
Method 1: Dropping either rows or columns
– Labels = Index or column labels to drop
– Axis = Whether to drop labels from the index (0 or 'index') or
columns (1 or 'columns')
df.drop(labels=['Age'], axis=1, inplace=True)
Method 2: Dropping both rows and columns
– Index = row labels to drop
– Columns = column labels to drop
df.drop(index=[1], columns=['Salary'], inplace=True)
Delete a column from Dataset
• Row Selection: The first parameter, 1, selects the row with index label 1 (the second row under the default 0-based index).
• Axis Parameter: axis=0 specifies that you are dropping a row.
• inplace=True: Modifies the DataFrame in place; drop() then returns None.
• inplace=False (the default): Leaves the original DataFrame unchanged and returns a new DataFrame.
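The drop() variants and the effect of inplace can be compared in one short sketch. The 'Age' and 'Salary' column names follow the slide examples; the 'Name' column and all values are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy"],
    "Age": [30, 25, 41],
    "Salary": [50, 60, 70],
})

# Method 1: labels plus axis. inplace defaults to False, so df is unchanged.
no_age = df.drop(labels=["Age"], axis=1)

# Method 2: index and columns together drop a row and a column in one call.
trimmed = df.drop(index=[1], columns=["Salary"])

# inplace=True modifies df itself and returns None.
result = df.drop(1, axis=0, inplace=True)

print(no_age)
print(trimmed)
print(df)
```

Because inplace=True returns None, assigning its result back to df (df = df.drop(..., inplace=True)) would replace the DataFrame with None.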