Data Manipulation With Pandas - Yulei's Sandbox
Python
Pandas
DataAnalysis
Based on DataCamp's Data Manipulation with pandas course.
DataFrames
Introducing DataFrames
Inspecting a DataFrame
.head() returns the first few rows (the “head” of the DataFrame).
.info() shows information on each of the columns, such as the data type and
number of missing values.
.shape returns the number of rows and columns of the DataFrame.
.describe() calculates a few summary statistics for each column.
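A quick sketch of all four calls, on a hypothetical dogs DataFrame:
print(dogs.head())      # first few rows
dogs.info()             # prints dtypes and non-null counts (no need to wrap in print)
print(dogs.shape)       # (n_rows, n_cols) -- an attribute, not a method
print(dogs.describe())  # summary statistics for numeric columns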
Parts of a DataFrame
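The three components, on the same hypothetical dogs DataFrame:
print(dogs.values)   # the data as a 2D NumPy array
print(dogs.columns)  # column names
print(dogs.index)    # row labels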
Sorting rows
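A sketch with hypothetical column names:
dogs.sort_values('weight_kg')                                          # ascending by one column
dogs.sort_values(['weight_kg', 'height_cm'], ascending=[True, False])  # multiple columns, mixed order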
Subsetting columns
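For example (hypothetical columns):
dogs['name']                  # one column -> Series
dogs[['breed', 'height_cm']]  # several columns -> DataFrame (note the double brackets)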
Subsetting rows
Combining conditions: |, .isin()
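Boolean conditions combined with & and |, plus .isin() for membership tests (hypothetical columns):
dogs[dogs['height_cm'] > 60]
dogs[(dogs['breed'] == 'Labrador') | (dogs['breed'] == 'Poodle')]
dogs[dogs['breed'].isin(['Labrador', 'Poodle'])]  # equivalent to the | chain above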
New columns
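New columns are plain assignments (continuing the hypothetical dogs example):
dogs['height_m'] = dogs['height_cm'] / 100
dogs['bmi'] = dogs['weight_kg'] / dogs['height_m'] ** 2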
Combo-attack!
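Putting it together: subset, sort, and select in one pipeline (same hypothetical columns):
bmi_lt_100 = dogs[dogs['bmi'] < 100]
bmi_lt_100_height = bmi_lt_100.sort_values('height_cm', ascending=False)
print(bmi_lt_100_height[['name', 'height_cm', 'bmi']])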
Aggregating Data
Summary Statistics
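Typical calls, sketched on the sales data used below (column name assumed):
print(sales['weekly_sales'].mean())
print(sales['weekly_sales'].median())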
Summarizing dates
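Dates support min/max too, e.g. the oldest and most recent date (column name assumed):
print(sales['date'].min())
print(sales['date'].max())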
Efficient summaries
The .agg() method allows you to apply your own custom functions to a DataFrame, as
well as apply functions to more than one column of a DataFrame at once, making your
aggregations super efficient.
# A custom IQR (interquartile range) function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
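Used with .agg(), possibly alongside built-ins and across several columns at once (column names assumed):
print(sales[['temperature_c', 'fuel_price_usd_per_l', 'unemployment']].agg([iqr, 'median']))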
Cumulative statistics
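Cumulative methods return a value per row rather than a single aggregate (column names assumed):
sales['cum_weekly_sales'] = sales['weekly_sales'].cumsum()
sales['cum_max_sales'] = sales['weekly_sales'].cummax()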
Counting
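Counting usually means .value_counts(), optionally normalized to proportions (column name assumed):
print(sales['type'].value_counts(sort=True))
print(sales['type'].value_counts(normalize=True))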
Dropping duplicates
# Drop duplicate store/type combinations
store_types = sales.drop_duplicates(subset=['store','type'])
print(store_types.head())
# Subset the rows that are holiday weeks and drop duplicate dates
holiday_dates = sales[sales['is_holiday']==True].drop_duplicates('date')
# For each store type, aggregate weekly_sales: get min, max, mean, and median
sales_stats = sales.groupby('type')['weekly_sales'].agg([np.min, np.max, np.mean, np.median])
# Print sales_stats
print(sales_stats)
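# unemp_fuel_stats isn't built anywhere in these notes; presumably it follows the
# same pattern on two other columns, e.g. (column names assumed):
unemp_fuel_stats = sales.groupby('type')[['unemployment', 'fuel_price_usd_per_l']].agg([np.min, np.max, np.mean, np.median])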
# Print unemp_fuel_stats
print(unemp_fuel_stats)
Pivot tables
The .pivot_table() method is just an alternative to .groupby() .
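The printed result below presumably comes from a call like this (the default aggfunc is mean; the exact call isn't in these notes):
mean_sales_by_type = sales.pivot_table(values='weekly_sales', index='type')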
# Print mean_sales_by_type
print(mean_sales_by_type)
> weekly_sales
type
A 23674.667
B 25696.678
# Pivot for mean and median weekly_sales for each store type
mean_med_sales_by_type = sales.pivot_table(values='weekly_sales', index='type', aggfunc=[np.mean, np.median])
# Print mean_med_sales_by_type
print(mean_med_sales_by_type)
> mean median
weekly_sales weekly_sales
type
A 23674.667 11943.92
B 25696.678 13336.08
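The next result presumably adds is_holiday as the columns argument (again a sketch; the call isn't shown in the notes):
mean_sales_by_type_holiday = sales.pivot_table(values='weekly_sales', index='type', columns='is_holiday')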
# Print mean_sales_by_type_holiday
print(mean_sales_by_type_holiday)
> is_holiday False True
type
A 23768.584 590.045
B 25751.981 810.705
The .pivot_table() method has several useful arguments, including fill_value and margins.
# Print mean weekly_sales by department and type; fill missing values with 0
print(sales.pivot_table(values='weekly_sales', index='department', columns='type', fill_value=0))
# Print mean weekly_sales by department and type; fill missing values with 0s; sum all rows and cols
print(sales.pivot_table(values="weekly_sales", index="department", columns="type", fill_value=0, margins=True))
Explicit indexes
# Look at temperatures
print(temperatures)
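# temperatures_ind isn't defined in these notes; presumably city was set as the index:
temperatures_ind = temperatures.set_index('city')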
# Look at temperatures_ind
print(temperatures_ind)
You can only slice an index if the index is sorted (using .sort_index() ).
To slice at the outer level, first and last can be strings.
To slice at inner levels, first and last should be tuples.
If you pass a single slice to .loc[] , it will slice the rows.
# Try to subset rows from Lahore to Moscow (This will return nonsense.)
print(temperatures_srt.loc['Lahore':'Moscow'])
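The correct version passes full tuples, assuming the sorted index levels are (country, city):
# Subset rows from Pakistan, Lahore to Russia, Moscow
print(temperatures_srt.loc[('Pakistan', 'Lahore'):('Russia', 'Moscow')])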
Add the date column to the index, then use .loc[] to perform the subsetting. The important thing to remember is to keep your dates in ISO 8601 format, that is, yyyy-mm-dd.
# Use Boolean conditions to subset temperatures for rows in 2010 and 2011
temperatures_bool = temperatures[(temperatures["date"] >= '2010-01-01') & (temperatures["date"] <= '2011-12-31')]
print(temperatures_bool)
# Set date as an index
temperatures_ind = temperatures.set_index('date')
# Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011
print(temperatures_ind.loc['2010-08':'2011-02'])
Subsetting by row/column position is done using .iloc[], and like .loc[], it can take two arguments to let you subset by rows and columns.
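For example (positions are illustrative):
# Rows at positions 22-24 and the first two columns
print(temperatures.iloc[22:25, 0:2])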
You can access the components of a date (year, month and day) using code of the
form dataframe["column"].dt.component . For example, the month component is
dataframe["column"].dt.month , and the year component is
dataframe["column"].dt.year .
A pivot table is just a DataFrame with sorted indexes, so the .loc[] + slicing combination is often helpful.
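The two Series filtered below presumably come from a pivot of temperatures by city and year, roughly like this (value column and names assumed, using the year column created above):
temp_by_country_city_vs_year = temperatures.pivot_table(values='avg_temp_c', index=['country', 'city'], columns='year')
mean_temp_by_year = temp_by_country_city_vs_year.mean(axis='index')    # one value per year
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis='columns')  # one value per (country, city)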
# Filter for the year that had the highest mean temp
print(mean_temp_by_year[mean_temp_by_year==mean_temp_by_year.max()])
> year
2013 20.312
dtype: float64
# Filter for the city that had the lowest mean temp
print(mean_temp_by_city[mean_temp_by_city==mean_temp_by_city.min()])
> country city
China Harbin 4.877
dtype: float64
# Add a legend (assumes two price lines, conventional and organic, were already plotted with matplotlib as plt)
plt.legend(["conventional", "organic"])
Missing values
.isna(), .any()
# Print a summary that shows whether any value in each column is missing or not
print(avocados_2016.isna().any())
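# The plot shown below presumably comes from a bar chart of missing-value counts,
# e.g. (assumes matplotlib.pyplot is imported as plt):
avocados_2016.isna().sum().plot(kind='bar')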
# Show plot
plt.show()
.dropna()
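A sketch on the same assumed DataFrame:
# Remove rows with any missing values, then confirm nothing is missing
avocados_complete = avocados_2016.dropna()
print(avocados_complete.isna().any())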
Creating DataFrames
List of dictionaries
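Build a DataFrame row by row; the data here is hypothetical:
import pandas as pd

avocados_list = [
    {'date': '2019-11-03', 'small_sold': 10376832, 'large_sold': 7835071},
    {'date': '2019-11-10', 'small_sold': 10717154, 'large_sold': 8561348},
]
avocados_2019 = pd.DataFrame(avocados_list)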
Dictionary of lists
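Build it column by column instead (same hypothetical data shape):
avocados_dict = {
    'date': ['2019-11-17', '2019-12-01'],
    'small_sold': [10859987, 9291631],
    'large_sold': [7674135, 6238096],
}
avocados_2019 = pd.DataFrame(avocados_dict)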
CSV to DataFrame
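The notes don't show how airline_totals was loaded; presumably something like this (file name assumed):
airline_totals = pd.read_csv('airline_totals.csv')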
# Create new col, bumps_per_10k: no. of bumps per 10k passengers for each airline
airline_totals["bumps_per_10k"] = airline_totals["nb_bumped"] / airline_totals["total_passengers"] * 10000
# Print airline_totals
print(airline_totals)
DataFrame to CSV
# Create airline_totals_sorted
airline_totals_sorted = airline_totals.sort_values('bumps_per_10k', ascending=False)
# Print airline_totals_sorted
print(airline_totals_sorted)
# Save as airline_totals_sorted.csv
airline_totals_sorted.to_csv("airline_totals_sorted.csv")