0% found this document useful (0 votes)
85 views

Dataframe in Pandas - Cheatsheet

Uploaded by

elvin aliyev
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
85 views

Dataframe in Pandas - Cheatsheet

Uploaded by

elvin aliyev
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 8

# [ Dataframe Operations With Pandas ] ( CheatSheet )

1. Creating and Reading Data

● Creating an Empty DataFrame: pd.DataFrame()


● Creating a DataFrame from a Dictionary: pd.DataFrame({'col1': [1, 2],
'col2': [3, 4]})
● Reading CSV File: pd.read_csv('file.csv')
● Reading Excel File: pd.read_excel('file.xlsx')
● Reading JSON File: pd.read_json('file.json')
● Reading SQL Query: pd.read_sql_query('SELECT * FROM table', connection)

2. Viewing and Inspecting Data

● Viewing the First Few Rows: df.head()


● Viewing the Last Few Rows: df.tail()
● Getting DataFrame Info: df.info()
● Getting Summary Statistics: df.describe()
● Displaying Column Names: df.columns
● Displaying Row Indices: df.index

3. Data Selection and Indexing

● Selecting a Single Column: df['column']


● Selecting Multiple Columns: df[['col1', 'col2']]
● Selecting Rows by Position: df.iloc[10:20]
● Selecting Rows by Index Label: df.loc['index1':'index2']
● Conditional Selection: df[df['column'] > 0]
● Setting Index: df.set_index('column')
● Resetting Index: df.reset_index()

4. Data Cleaning

● Dropping Columns: df.drop(columns=['col'])


● Dropping Rows: df.drop(index=['index'])
● Renaming Columns: df.rename(columns={'old': 'new'})
● Checking for Missing Values: df.isnull()
● Dropping Missing Values: df.dropna()
● Filling Missing Values: df.fillna(value)

By: Waleed Mousa


● Converting Data Types: df.astype({'col': 'int'})

5. Data Manipulation

● Applying Functions: df.apply(lambda x: x + 1)


● Mapping Values: df['column'].map({'a': 1, 'b': 2})
● Replacing Values: df.replace({'old': 'new'})
● Adding New Columns: df['new_col'] = df['col1'] + df['col2']
● Deleting Columns: del df['column']
● Concatenating DataFrames: pd.concat([df1, df2])
● Merging DataFrames: pd.merge(df1, df2, on='col')

6. Sorting and Ranking

● Sorting by an Index: df.sort_index()


● Sorting by a Column: df.sort_values(by='col')
● Ranking Data: df['col'].rank()

7. Unique Values, Value Counts, and Membership

● Getting Unique Values: df['col'].unique()


● Counting Unique Values: df['col'].nunique()
● Getting Value Counts: df['col'].value_counts()
● Checking Membership: df['col'].isin(['val1', 'val2'])

8. Grouping and Aggregation

● Grouping Data: df.groupby('col')


● Aggregate Functions: df.agg({'col1': 'sum', 'col2': 'mean'})
● Grouping and Aggregating: df.groupby('col').agg({'col1': 'sum', 'col2':
'mean'})
● Pivot Tables: df.pivot_table(values='D', index='A', columns='C')

9. Time Series Data

● Convert String to DateTime: pd.to_datetime(df['col'])


● Resampling Time Series Data: df.resample('M').mean()
● Shifting Dates and Times: df.shift(1)
● Window Functions: df.rolling(window=5).mean()

By: Waleed Mousa


10. Visualization

● Plotting Data: df.plot()


● Histograms: df['col'].hist()
● Box Plots: df.boxplot(column=['col1', 'col2'])
● Scatter Plots: df.plot.scatter(x='col1', y='col2')

11. File Writing and Output

● Writing to CSV: df.to_csv('file.csv')


● Writing to Excel: df.to_excel('file.xlsx')
● Writing to JSON: df.to_json('file.json')
● Writing to SQL Database: df.to_sql('table', connection)

12. Advanced DataFrame Operations

● MultiIndex / Hierarchical Index: df.set_index(['col1', 'col2'])


● Crosstab: pd.crosstab(df['col1'], df['col2'])
● Normalizing Data: (df - df.mean()) / df.std()
● Binning Data: pd.cut(df['col'], bins)

13. Missing Data and Interpolation

● Interpolating Missing Values: df.interpolate()


● Dropping Duplicate Rows: df.drop_duplicates()
● Replacing Outliers: df['col'][df['col'] > threshold] = new_value

14. Combining and Reshaping Data

● Stacking and Unstacking: df.stack(), df.unstack()


● Melting Data: pd.melt(df)
● Pivoting: df.pivot('row', 'col', 'values')
● Concatenating Along an Axis: pd.concat([df1, df2], axis=1)

15. Advanced String Operations

● String Methods: df['col'].str.upper()


● Regular Expressions: df['col'].str.extract('(pattern)', expand=True)
● String Replacement: df['col'].str.replace('old', 'new')

By: Waleed Mousa


16. Handling Large Data

● Chunking Large Files: pd.read_csv('file.csv', chunksize=1000)


● Dask for Parallel Computing: import dask.dataframe as dd;
dd.from_pandas(df, npartitions=10)
● Sampling Data: df.sample(frac=0.1)

17. Efficiency and Performance

● Query Method for Filtering: df.query('col > 0')


● Evaluating Expressions: df.eval('new_col = col1 + col2')
● Using Categorical Data: df['col'] = df['col'].astype('category')

18. Memory Management

● Reducing Memory Usage: df.astype('float32')


● Memory Usage of DataFrame: df.memory_usage()

19. Multi-Threading and Parallel Processing

● Parallel Apply with Dask: import dask.dataframe as dd; dd.from_pandas(df,


npartitions=10).map_partitions(lambda df: df.apply(func))

20. Dataframe Styling and Formatting

● Styling DataFrames: df.style.apply(highlight_func)


● Setting Display Format: pd.options.display.float_format = '{:.2f}'.format

21. Advanced Indexing and Slicing

● Index Slicing with loc and iloc: df.loc['row1':'row2', 'col1':'col2']


● Conditional Slicing: df.loc[df['col'] > 0]
● Indexing with isin: df[df['col'].isin([val1, val2])]

22. Data Integrity and Validation

● Verifying Integrity: df.validate_subset(['col1', 'col2'])


● Ensuring No NA Values: df.dropna(subset=['col1', 'col2'])

23. Advanced Merging and Joining

By: Waleed Mousa


● Inner Join: df1.merge(df2, on='col', how='inner')
● Outer Join: df1.merge(df2, on='col', how='outer')
● Left Join: df1.merge(df2, on='col', how='left')
● Right Join: df1.merge(df2, on='col', how='right')

24. Data Type Conversion and Management

● Converting Types: df['col'].astype('int')


● Handling Time Series Data Type: pd.to_datetime(df['date_col'])
● Converting to Category for Efficiency: df['col'].astype('category')

25. Saving and Serializing Dataframes

● Saving DataFrame to Pickle: df.to_pickle('df.pkl')


● Loading DataFrame from Pickle: pd.read_pickle('df.pkl')
● Saving to HDF5: df.to_hdf('data.h5', 'df')
● Loading from HDF5: pd.read_hdf('data.h5', 'df')

26. Working with External Databases

● Querying from SQL Database: pd.read_sql('SELECT * FROM table',


connection)
● Writing to SQL Database: df.to_sql('table', connection,
if_exists='replace')

27. Advanced Dataframe Features

● Using applymap for Elementwise Function: df.applymap(func)


● Expanding Data with explode: df.explode('list_col')
● Aggregating with Named Agg: df.groupby('col').agg(min_col=('col', 'min'),
max_col=('col', 'max'))
● Transforming Data with transform: df.groupby('col').transform(lambda x: x
- x.mean())

28. Multi-Level Indexing (Hierarchical Indexing)

● Creating MultiIndex from Tuples: pd.MultiIndex.from_tuples([('a', 1),


('a', 2)], names=['letter', 'number'])
● Setting MultiIndex in DataFrame: df.set_index(['Col1', 'Col2'])
● Sorting by MultiIndex: df.sort_index(level=0)

By: Waleed Mousa


● Index Slicing with MultiIndex: df.loc[('index1', 'subindex1')]
● Stacking and Unstacking with MultiIndex: df.stack(), df.unstack()

29. Advanced Grouping and Aggregation

● Custom Aggregation Functions: df.groupby('col').agg(custom_agg_function)


● Named Aggregation: df.groupby('col').agg(total=('col2', 'sum'),
average=('col2', 'mean'))
● Grouping with Different Functions per Column:
df.groupby('col').agg({'col1': 'sum', 'col2': 'mean'})
● Transform Function with Groupby: df.groupby('col').transform('mean')
● Filtering After GroupBy: df.groupby('col').filter(lambda x:
x['col2'].mean() > value)

30. Time Series and Date Handling

● Resampling Time Series Data: df.resample('M').mean()


● Shifting and Lagging Time Series Data: df.shift(1)
● Rolling Window Functions on Time Series: df.rolling(window=3).mean()
● Expanding Window Functions: df.expanding(2).sum()
● Custom Resampling of Time Series:
df.resample('3T').apply(custom_resampler)

31. Advanced Text and String Manipulation

● Vectorized String Operations: df['col'].str.upper()


● Extracting Substrings: df['col'].str.extract(r'(regex)')
● Replacing Text with Regular Expression: df['col'].str.replace(r'[abc]',
'X')
● Splitting and Expanding Strings: df['col'].str.split('_').str[0]
● Aggregating Strings: df.groupby('col')['text'].agg(' '.join)

32. Handling Missing and Duplicated Data

● Filling Missing Values with Interpolation: df.interpolate()


● Filling Missing Values with Backward or Forward Fill: df.bfill(),
df.ffill()
● Dropping Duplicates: df.drop_duplicates()
● Identifying Duplicate Data: df.duplicated()
● Counting Missing Values: df.isnull().sum()

By: Waleed Mousa


33. Pivot and Cross Tabulation

● Pivot Without Aggregation: df.pivot(index='date', columns='col',


values='val')
● Pivot Table with Multiple Aggregations: pd.pivot_table(df, values='D',
index=['A', 'B'], columns=['C'], aggfunc=[np.sum, np.mean])
● Crosstabulation of Two Factors: pd.crosstab(df['A'], df['B'])
● Normalizing Crosstab: pd.crosstab(df['A'], df['B'], normalize='index')

34. Styling and Display

● Styling DataFrame Output: df.style.applymap(color_negative_red)


● Conditional Formatting: df.style.apply(highlight_max, axis=0)
● Bar Charts in DataFrame Cells: df.style.bar(subset=['A', 'B'],
color='#d65f5f')
● Setting Global Display Options: pd.set_option('display.max_rows', 500)

35. Saving and Serializing

● Writing Data to a SQL Database: df.to_sql('table', conn, index=False,


if_exists='append')
● Reading Data from SQL Database: pd.read_sql('SELECT * FROM table', conn)
● Saving DataFrame as Markdown: df.to_markdown()
● Saving DataFrame as HTML: df.to_html()

36. Joins and Merges

● Merging with Different Join Types: pd.merge(df1, df2, on='key',


how='left/right/outer/inner')
● Joining on Index: df1.join(df2)
● Concatenating Along a Particular Axis: pd.concat([df1, df2], axis=1)
● Adding a Prefix or Suffix to Column Names: df.add_prefix('X_'),
df.add_suffix('_Y')

37. Visualization with Pandas

● Line Plot: df.plot()


● Bar Plot: df.plot.bar()
● Histogram: df.plot.hist()
● Box Plot: df.plot.box()

By: Waleed Mousa


38. Optimization and Performance

● Using Categories for Efficiency: df['col'] = df['col'].astype('category')


● Querying DataFrames: df.query('col > 0')
● Evaluating Expression: df.eval('new_col = col1 + col2')
● Parallelizing apply with Dask or Modin: import modin.pandas as pd;
df.apply(func)

39. Geospatial Data

● Working with Geospatial Data: import geopandas as gpd;


gpd.GeoDataFrame(df)
● Plotting Geospatial Data: gdf.plot()
● Spatial Joins: gpd.sjoin(gdf1, gdf2, op='within')

40. Advanced DataFrame Features

● Using Query Method for Complex Filtering: df.query('col > 0 & col < 10')
● Using eval for Efficient Operations: df.eval('col = col1 + col2')
● MultiIndex Slicing: df.xs(key='value', level='level2')

41. Handling Large Data

● Chunking Large Files for Reading: pd.read_csv('large_file.csv',


chunksize=10000)
● Using Dask for Large DataFrames: import dask.dataframe as dd;
dd.from_pandas(df, npartitions=10)
● Efficiently Combine Many Files: pd.concat((pd.read_csv(f) for f in files))

42. Data Cleaning at Scale

● Cleaning with replace: df.replace(to_replace="old_value",


value="new_value")
● Batch Removing Missing Data: df.dropna(thresh=2)

By: Waleed Mousa

You might also like