
# [Pandas Data Manipulation] (Extended CheatSheet)

1. Importing and Exporting Data

● Import pandas: import pandas as pd


● Read CSV: df = pd.read_csv('file.csv')
● Read Excel: df = pd.read_excel('file.xlsx', sheet_name='Sheet1')
● Read JSON: df = pd.read_json('file.json')
● Read SQL query: df = pd.read_sql_query("SELECT * FROM table", connection)
● Read HTML table: df = pd.read_html('https://example.com/table.html')[0]
● Read clipboard data: df = pd.read_clipboard()
● Write to CSV: df.to_csv('output.csv', index=False)
● Write to Excel: df.to_excel('output.xlsx', index=False)
● Write to JSON: df.to_json('output.json', orient='records')
● Write to SQL: df.to_sql('table_name', connection, if_exists='replace')
● Write to clipboard: df.to_clipboard()
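
A minimal round-trip sketch putting a couple of the calls above together (the file name and columns are placeholders):

```python
import pandas as pd

# Build a tiny frame, write it to CSV, and read it back.
df = pd.DataFrame({"name": ["Ada", "Grace"], "score": [95, 88]})
df.to_csv("output.csv", index=False)   # index=False keeps row labels out of the file
df_back = pd.read_csv("output.csv")    # round-trips the same columns
print(df_back)
```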

2. Data Inspection

● Display first rows: df.head()


● Display last rows: df.tail()
● Display random sample: df.sample(n=5)
● Get dataframe info: df.info()
● Get dataframe statistics: df.describe()
● Get column names: df.columns
● Get data types: df.dtypes
● Get dimensions: df.shape
● Check for null values: df.isnull().sum()
● Get unique values: df['column'].unique()
● Get value counts: df['column'].value_counts()
● Get correlation matrix: df.corr(numeric_only=True)
● Get covariance matrix: df.cov(numeric_only=True)
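
A quick inspection pass on a toy frame, assuming the hypothetical columns below:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, None, 4], "b": ["x", "y", "y", "z"]})
print(df.shape)                # (4, 2): rows, columns
print(df.isnull().sum())       # column 'a' has one missing value
print(df["b"].value_counts())  # 'y' appears twice
print(df.describe())           # summary stats for numeric columns
```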

3. Data Selection

● Select single column: df['column']


● Select multiple columns: df[['column1', 'column2']]
● Select rows by index label (inclusive slice): df.loc[0:5]
● Select rows by condition: df[df['column'] > 5]
● Select rows and columns: df.loc[0:5, ['column1', 'column2']]
● Select by integer position: df.iloc[0:5, 0:2]
● Boolean indexing: df[df['column'].isin(['value1', 'value2'])]
● Query method: df.query('column > 5 and column2 == "value"')

● Select random rows: df.sample(n=10)
● Select every nth row: df.iloc[::n]
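
The selection idioms above, sketched on a small example frame (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF"], "sales": [10, 7, 3, 12]})
print(df[df["sales"] > 5])                     # boolean mask
print(df.loc[df["city"] == "NY", "sales"])     # rows by condition, one column
print(df.iloc[0:2, 0:1])                       # first two rows, first column
print(df.query('sales > 5 and city == "NY"'))  # same filter via query
```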

4. Data Cleaning

● Drop null values: df.dropna()


● Fill null values: df.fillna(value)
● Fill null with forward fill: df.ffill() (fillna(method='ffill') is deprecated)
● Replace values: df.replace(old_value, new_value)
● Remove duplicates: df.drop_duplicates()
● Reset index: df.reset_index(drop=True)
● Rename columns: df.rename(columns={'old_name': 'new_name'})
● Set column as index: df.set_index('column')
● Convert data types: df['column'] = df['column'].astype('int64')
● Handle outliers: df = df[(df['column'] > df['column'].quantile(0.05)) & (df['column'] < df['column'].quantile(0.95))]
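
A small cleaning sketch combining the steps above (assumes a numeric 'price' column, a placeholder name):

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, None, 10.0, 250.0]})
df = df.drop_duplicates()                       # drops the repeated 10.0 row
df["price"] = df["price"].fillna(df["price"].median())
low, high = df["price"].quantile([0.05, 0.95])
df = df[df["price"].between(low, high)]         # trim outliers by percentile
print(df.reset_index(drop=True))
```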

5. Data Transformation

● Apply function to column: df['new_column'] = df['column'].apply(lambda x: x*2)
● Apply function to multiple columns: df[['col1', 'col2']] = df[['col1', 'col2']].apply(lambda x: x*2)
● Map values: df['column'] = df['column'].map({'old1': 'new1', 'old2': 'new2'})
● Binning: df['binned'] = pd.cut(df['column'], bins=[0, 25, 50, 75, 100], labels=['Low', 'Medium', 'High', 'Very High'])
● One-hot encoding: pd.get_dummies(df, columns=['categorical_column'])
● Normalize data: (df - df.min()) / (df.max() - df.min())
● Standardize data: (df - df.mean()) / df.std()
● Log transformation (requires import numpy as np): np.log(df)
● Exponential transformation: np.exp(df)
● Square root transformation: np.sqrt(df)
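
A transformation sketch tying a few of these together (the 'score' column is a placeholder; NumPy handles the numeric transforms):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [10, 40, 60, 90]})
df["doubled"] = df["score"].apply(lambda x: x * 2)
df["band"] = pd.cut(df["score"], bins=[0, 25, 50, 75, 100],
                    labels=["Low", "Medium", "High", "Very High"])
df["norm"] = (df["score"] - df["score"].min()) / (df["score"].max() - df["score"].min())
df["log"] = np.log(df["score"])  # log transform via NumPy
print(df)
```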

6. String Operations

● Lowercase: df['column'].str.lower()
● Uppercase: df['column'].str.upper()
● Titlecase: df['column'].str.title()
● Strip whitespace: df['column'].str.strip()
● Replace substring: df['column'].str.replace('old', 'new')
● Split string: df['column'].str.split(',')
● Get string length: df['column'].str.len()

● Extract substring: df['column'].str[0:5]
● Pad string: df['column'].str.pad(10, fillchar='0')
● Check if contains: df['column'].str.contains('substring')
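
The .str accessor applies these string methods element-wise; a short sketch on made-up names:

```python
import pandas as pd

s = pd.Series(["  Alice Smith ", "bob JONES"])
print(s.str.strip().str.title())                  # trim, then Title Case
print(s.str.contains("Smith"))                    # boolean mask of matches
print(s.str.strip().str.split(" ", expand=True))  # split into separate columns
```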

7. DateTime Operations

● Convert to datetime: pd.to_datetime(df['date_column'])


● Extract year: df['date_column'].dt.year
● Extract month: df['date_column'].dt.month
● Extract day: df['date_column'].dt.day
● Extract weekday: df['date_column'].dt.dayofweek
● Extract week of year: df['date_column'].dt.isocalendar().week
● Extract quarter: df['date_column'].dt.quarter
● Add time delta: df['date_column'] + pd.Timedelta(days=1)
● Calculate time difference: (df['end_date'] - df['start_date']).dt.days
● Resample time series (month-end): df.resample('ME', on='date_column').mean()
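
A datetime sketch on placeholder dates; note that 'ME' (month-end) is the alias used by recent pandas, while older versions spell it 'M':

```python
import pandas as pd

df = pd.DataFrame({"date": ["2023-01-15", "2023-02-20", "2023-02-28"],
                   "value": [10, 20, 30]})
df["date"] = pd.to_datetime(df["date"])      # parse strings to datetime64
df["month"] = df["date"].dt.month
df["weekday"] = df["date"].dt.dayofweek      # Monday == 0
monthly = df.resample("ME", on="date")["value"].mean()  # month-end buckets
print(monthly)                               # Jan: 10.0, Feb: 25.0
```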

8. Grouping and Aggregation

● Group by single column: df.groupby('column').mean(numeric_only=True)
● Group by multiple columns: df.groupby(['column1', 'column2']).sum()
● Group by with multiple aggregations: df.groupby('column').agg({'column1': 'mean', 'column2': 'sum'})
● Group by with custom function: df.groupby('column').apply(lambda x: x['column2'].max() - x['column2'].min())
● Group by with size: df.groupby('column').size()
● Group by with filter: df.groupby('column').filter(lambda x: len(x) > 2)
● Group by with transform: df.groupby('column')['value'].transform('mean')
● Pivot table: pd.pivot_table(df, values='value', index='row', columns='column', aggfunc='mean')
● Cross-tabulation: pd.crosstab(df['column1'], df['column2'])
● Rolling window calculations: df['rolling_mean'] = df['column'].rolling(window=3).mean()
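
A grouping sketch on a toy frame; note how transform keeps the original row count while agg collapses each group:

```python
import pandas as pd

df = pd.DataFrame({"store": ["A", "A", "B", "B"],
                   "sales": [10, 20, 5, 15]})
print(df.groupby("store")["sales"].sum())          # one value per group
print(df.groupby("store")["sales"].agg(["mean", "max"]))
df["store_mean"] = df.groupby("store")["sales"].transform("mean")
print(df)                                          # same shape, group mean per row
```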

9. Merging and Combining Data

● Merge on column: pd.merge(df1, df2, on='key')


● Merge on index: pd.merge(df1, df2, left_index=True, right_index=True)
● Left join: pd.merge(df1, df2, how='left', on='key')
● Right join: pd.merge(df1, df2, how='right', on='key')
● Outer join: pd.merge(df1, df2, how='outer', on='key')
● Concatenate vertically: pd.concat([df1, df2], axis=0)
● Concatenate horizontally: pd.concat([df1, df2], axis=1)

● Combine first: df1.combine_first(df2)
● Update values: df1.update(df2)
● Merge asof (nearest match join): pd.merge_asof(df1, df2, on='date')
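
A join sketch on two small frames sharing a 'key' column (names are placeholders):

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "score": [20, 30, 40]})
print(pd.merge(left, right, on="key"))              # inner join: keys 2 and 3
print(pd.merge(left, right, on="key", how="left"))  # keeps key 1 with NaN score
print(pd.concat([left, left], axis=0, ignore_index=True))  # stack vertically
```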

10. Reshaping Data

● Melt dataframe: pd.melt(df, id_vars=['id'], value_vars=['column1', 'column2'])
● Pivot: df.pivot(index='date', columns='category', values='value')
● Stack: df.stack()
● Unstack: df.unstack()
● Wide to long: pd.wide_to_long(df, stubnames='column', i=['id'], j='variable')
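
Melt and pivot are near-inverses; a sketch on a hypothetical wide table of monthly sales:

```python
import pandas as pd

wide = pd.DataFrame({"id": [1, 2], "jan": [10, 30], "feb": [20, 40]})
long = pd.melt(wide, id_vars=["id"], var_name="month", value_name="sales")
print(long)                     # one row per (id, month) pair
back = long.pivot(index="id", columns="month", values="sales")
print(back)                     # recovers the wide layout (columns sorted)
```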

11. Advanced Indexing

● Multi-level indexing: df.set_index(['column1', 'column2'])


● Select from multi-index: df.loc[('level1', 'level2'), :]
● Cross-section of multi-index: df.xs('level1', level='column1')
● Swap levels: df.swaplevel('column1', 'column2')
● Sort index: df.sort_index()

12. Window Functions

● Rolling mean: df['rolling_mean'] = df['column'].rolling(window=3).mean()


● Expanding mean: df['expanding_mean'] = df['column'].expanding().mean()
● Shift values: df['previous'] = df['column'].shift(1)
● Lag difference: df['diff'] = df['column'].diff()
● Percent change: df['pct_change'] = df['column'].pct_change()

13. Time Series Operations

● Resample by frequency (month-end): df.resample('ME', on='date').mean()


● Offset dates: df.index + pd.offsets.MonthEnd(1)
● Date range: pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
● Business days: pd.bdate_range(start='2023-01-01', end='2023-12-31')
● Time zone conversion: df['date'].dt.tz_localize('UTC').dt.tz_convert('US/Eastern')

14. Categorical Data

● Create categorical: df['column'] = pd.Categorical(df['column'])


● Get categorical codes: df['column'].cat.codes

● Reorder categories: df['column'].cat.reorder_categories(['cat1', 'cat2', 'cat3'], ordered=True)
● Add new category: df['column'].cat.add_categories('new_category')
● Remove unused categories: df['column'].cat.remove_unused_categories()

15. Handling Missing Data

● Check for missing values: df.isnull().sum()


● Drop rows with any missing values: df.dropna()
● Drop columns with any missing values: df.dropna(axis=1)
● Fill missing with mean: df.fillna(df.mean(numeric_only=True))
● Fill missing with median: df.fillna(df.median(numeric_only=True))
● Fill missing with mode: df.fillna(df.mode().iloc[0])
● Fill missing with forward fill: df.ffill()
● Fill missing with backward fill: df.bfill()
● Interpolate missing values: df.interpolate()
● Replace inf values: df.replace([np.inf, -np.inf], np.nan)
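
A sketch of the main fill strategies on one gappy column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0, np.nan, 5.0]})
print(df["x"].ffill())                 # carry last valid value forward
print(df["x"].fillna(df["x"].mean()))  # constant fill with the column mean
print(df["x"].interpolate())           # linear interpolation gives 2.0 and 4.0
```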

16. Data Validation and Cleaning

● Remove duplicates: df.drop_duplicates(subset=['column'], keep='first')


● Check data types: df.dtypes
● Convert data types: df['column'] = df['column'].astype('int64')
● Handle mixed types: pd.to_numeric(df['column'], errors='coerce')
● Trim whitespace: df['column'] = df['column'].str.strip()
● Remove non-numeric characters: df['column'] = df['column'].str.replace(r'[^0-9]', '', regex=True)
● Replace values: df['column'].replace({'old': 'new'})
● Clip values to range: df['column'] = df['column'].clip(lower=0, upper=100)
● Round values: df['column'] = df['column'].round(2)
● Normalize column names: df.columns = df.columns.str.lower().str.replace(' ', '_')

17. Advanced Selection and Filtering

● Select by data type: df.select_dtypes(include=['int64', 'float64'])


● Filter with complex conditions: df[(df['column1'] > 5) & (df['column2'].isin(['A', 'B']))]
● Mask values: df['column'].mask(df['column'] < 0, 0)
● Where condition: df['column'].where(df['column'] > 0, 0)
● Filter with regex: df[df['column'].str.contains(r'^pattern$', regex=True)]

18. Performance Optimization

● Use inplace operations (saves a reassignment, though not always faster): df.dropna(inplace=True)
● Optimize data types: df = df.astype({'column1': 'int32', 'column2': 'category'})
● Use chunking for large files: pd.read_csv('large_file.csv', chunksize=10000)
● Use generators: (chunk for chunk in pd.read_csv('large_file.csv', chunksize=10000))
● Use SQL query with chunksize: pd.read_sql_query("SELECT * FROM large_table", engine, chunksize=10000)
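
Chunked reading keeps memory flat by streaming the file; a sketch assuming a placeholder 'large_file.csv' with a numeric 'value' column:

```python
import pandas as pd

# Each chunk is an ordinary DataFrame of up to 10,000 rows.
total = 0
for chunk in pd.read_csv("large_file.csv", chunksize=10_000):
    total += chunk["value"].sum()  # aggregate per chunk, combine at the end
print(total)
```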

19. Statistical Operations

● Describe numeric columns: df.describe()


● Describe categorical columns: df.describe(include=['object', 'category'])
● Calculate correlation: df.corr(numeric_only=True)
● Calculate covariance: df.cov(numeric_only=True)
● Calculate skewness: df.skew(numeric_only=True)
● Calculate kurtosis: df.kurtosis(numeric_only=True)
● Calculate percentiles: df.quantile([0.25, 0.5, 0.75], numeric_only=True)
● Calculate cumulative sum: df['cumsum'] = df['column'].cumsum()
● Calculate cumulative max: df['cummax'] = df['column'].cummax()
● Calculate cumulative min: df['cummin'] = df['column'].cummin()

20. Advanced Grouping and Aggregation

● Group by with multiple aggregations: df.groupby('category').agg({'column1': ['mean', 'median'], 'column2': ['min', 'max']})
● Group by with custom aggregation: df.groupby('category').agg({'column': lambda x: x.max() - x.min()})
● Group by with windowing: df.groupby('category')['value'].rolling(window=3).mean()
● Group by with expanding window: df.groupby('category')['value'].expanding().sum()
● Group by with unstack: df.groupby(['category1', 'category2'])['value'].mean().unstack()

21. Advanced Time Series

● Resample with custom aggregation: df.resample('ME', on='date').agg({'column1': 'mean', 'column2': 'sum'})
● Rolling window with custom function: df['column'].rolling(window=3).apply(lambda x: x.max() - x.min())
● Expanding window with custom function: df['column'].expanding().apply(lambda x: x.max() - x.min())

● Calculate year-over-year growth (for monthly data, 12 periods back): df['column'].pct_change(periods=12)
● Shift business days: df.shift(periods=1, freq='B')
● Rolling correlation: df['column1'].rolling(window=30).corr(df['column2'])
● Seasonal decompose: from statsmodels.tsa.seasonal import seasonal_decompose; result = seasonal_decompose(df['column'], model='additive')

22. Text and String Operations

● Extract regex pattern: df['column'].str.extract(r'(\d+)')


● Find all regex matches: df['column'].str.findall(r'\d+')
● Replace regex pattern: df['column'].str.replace(r'\d+', 'number', regex=True)
● Split string and expand: df['column'].str.split(',', expand=True)
● Concatenate strings: df['full_name'] = df['first_name'] + ' ' + df['last_name']
● Pad strings: df['column'].str.pad(10, side='left', fillchar='0')
● Remove accents: df['column'].str.normalize('NFKD').str.encode('ASCII', errors='ignore').str.decode('ASCII')
● Count occurrences: df['column'].str.count('pattern')

23. Advanced Merging and Joining

● Merge with indicator: pd.merge(df1, df2, on='key', how='outer', indicator=True)
● Merge on multiple keys: pd.merge(df1, df2, on=['key1', 'key2'])
● Merge with suffixes: pd.merge(df1, df2, on='key', suffixes=('_left', '_right'))
● Merge with validation: pd.merge(df1, df2, on='key', validate='one_to_one')
● Merge closest match: pd.merge_asof(df1, df2, on='date', direction='nearest')
● Merge with tolerance: pd.merge_asof(df1, df2, on='date', tolerance=pd.Timedelta('1d'))

24. Advanced Reshaping

● Pivot with multiple values: df.pivot(index='date', columns='category', values=['value1', 'value2'])
● Pivot with aggregation: df.pivot_table(values='value', index='row', columns='column', aggfunc='sum', fill_value=0)
● Melt with id vars: pd.melt(df, id_vars=['id', 'date'], var_name='variable', value_name='value')
● Crosstab with margins: pd.crosstab(df['row'], df['column'], margins=True)
● Explode list column: df.explode('list_column')

25. Advanced Indexing and Selection

● Boolean indexing with isin: df[df['column'].isin(['value1', 'value2'])]


● Select with query method: df.query('column1 > 5 and column2 == "value"')
● Select with eval method: df[df.eval('column1 > column2 + 5')]
● Indexing with loc and boolean array: df.loc[df['column'] > 5, 'other_column']
● Conditional labeling with numpy where: df['level'] = np.where(df['column'] > 5, 'High', 'Low')

26. Memory Optimization

● Reduce memory usage: df.select_dtypes(include=['int']).astype('int32')


● Use categorical data type: df['column'] = df['column'].astype('category')
● Use sparse data structures: df['sparse_column'] = pd.arrays.SparseArray(df['column'])
● Use datetime64[ns] for timestamps: df['date'] = pd.to_datetime(df['date'])

27. Advanced Aggregation

● Aggregate with named aggregation: df.groupby('category').agg(total=('value', 'sum'), average=('value', 'mean'))
● Weighted average: df.groupby('category').apply(lambda x: np.average(x['value'], weights=x['weight']))
● First and last aggregation: df.groupby('category').agg({'value': ['first', 'last']})
● Aggregate with Q1 and Q3: df.groupby('category').agg({'value': [lambda x: x.quantile(0.25), lambda x: x.quantile(0.75)]})
● Aggregate with custom function: df.groupby('category').agg({'value': lambda x: x.nlargest(3).mean()})
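
Named aggregation gives each output column an explicit name, including for lambdas; a sketch on placeholder data:

```python
import pandas as pd

df = pd.DataFrame({"category": ["a", "a", "b"], "value": [1, 3, 10]})
out = df.groupby("category").agg(
    total=("value", "sum"),                        # (source column, function)
    average=("value", "mean"),
    spread=("value", lambda x: x.max() - x.min()),
)
print(out)
```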

28. Advanced Time Series Analysis

● Autocorrelation: df['column'].autocorr(lag=1)
● Partial autocorrelation: from statsmodels.tsa.stattools import pacf; pacf(df['column'], nlags=10)
● Granger causality test: from statsmodels.tsa.stattools import grangercausalitytests; grangercausalitytests(df[['column1', 'column2']], maxlag=5)
● ARIMA model: from statsmodels.tsa.arima.model import ARIMA; model = ARIMA(df['column'], order=(1,1,1)).fit()

● Prophet forecasting (the fbprophet package is now named prophet): from prophet import Prophet; m = Prophet().fit(df[['ds', 'y']])

29. Advanced Visualization with Pandas

● Basic line plot: df.plot(x='date', y='value')


● Multiple line plot: df.plot(x='date', y=['value1', 'value2'])
● Scatter plot: df.plot.scatter(x='column1', y='column2')
● Bar plot: df.plot.bar(x='category', y='value')
● Histogram: df['column'].plot.hist(bins=20)
● Box plot: df.boxplot(column=['value1', 'value2'])
● Heatmap: pandas has no built-in heatmap plot; a gradient-styled table works in notebooks: df.pivot(index='row', columns='column', values='value').style.background_gradient()
● Pair plot: pd.plotting.scatter_matrix(df)
● Parallel coordinates: pd.plotting.parallel_coordinates(df, 'category')
● Andrews curves: pd.plotting.andrews_curves(df, 'category')

By: Waleed Mousa
