ELT Using Pandas

This cheat sheet provides a comprehensive guide on using Pandas for Extract, Load, and Transform (ELT) processes, covering data extraction, loading, transformation, cleaning, and analysis. It includes various functions and methods for handling different data formats, performing advanced transformations, and optimizing performance. Additionally, it addresses data integration, serialization, and automation of ETL workflows.

# ELT Using Pandas (CheatSheet)

1. Data Extraction

● Read CSV File: pd.read_csv('filename.csv')
● Read Excel File: pd.read_excel('filename.xlsx')
● Read JSON File: pd.read_json('filename.json')
● Read SQL Database: pd.read_sql(query, connection)
● Read HTML Table: pd.read_html('http://url')
● Read Parquet File: pd.read_parquet('filename.parquet')
● Read from Clipboard: pd.read_clipboard()
● Read from a Python Dictionary: pd.DataFrame.from_dict(dict)
● Read from Multiple Files: [pd.read_csv(f) for f in file_list]
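The extraction calls above all return a DataFrame. A minimal sketch, using an in-memory buffer in place of a real file (the column names are hypothetical):

```python
import io
import pandas as pd

# In practice you would pass a filename such as 'sales.csv';
# here a StringIO buffer stands in for the file.
csv_data = io.StringIO("id,amount\n1,10.5\n2,20.0\n3,7.25\n")
df = pd.read_csv(csv_data)

# Reading several files into one frame follows the same pattern:
# frames = [pd.read_csv(f) for f in file_list]
# combined = pd.concat(frames, ignore_index=True)

print(df.shape)  # (3, 2)
```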

2. Data Loading

● Write to CSV File: df.to_csv('filename.csv')
● Write to Excel File: df.to_excel('filename.xlsx')
● Write to JSON File: df.to_json('filename.json')
● Write to SQL Database: df.to_sql(table_name, connection)
● Write to Parquet File: df.to_parquet('filename.parquet')
● Write to HTML File: df.to_html('filename.html')
● Append to Existing File or Database: df.to_sql(table_name, connection, if_exists='append')
● Save to Python Pickle Format: df.to_pickle('filename.pkl')
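The write calls above pair naturally with the read calls from the previous section. A minimal round-trip sketch through a temporary file (path and column names are hypothetical):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "score": [1, 2]})

# Write to a temporary CSV; a real pipeline would use a fixed path.
path = os.path.join(tempfile.mkdtemp(), "out.csv")
df.to_csv(path, index=False)  # index=False avoids writing an extra index column

restored = pd.read_csv(path)
```

Without `index=False`, the row index is written as an unnamed first column and comes back as `Unnamed: 0` on re-read.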

3. Data Transformation

● Filtering Rows: df[df['column'] > value]
● Selecting Columns: df[['col1', 'col2']]
● Renaming Columns: df.rename(columns={'old_name': 'new_name'})
● Dropping Columns: df.drop(columns=['col1', 'col2'])
● Handling Missing Data: df.fillna(value) or df.dropna()
● Type Conversion: df.astype({'col': 'int32'})
● String Operations: df['col'].str.upper()
● Datetime Conversion: pd.to_datetime(df['col'])
● Sorting Data: df.sort_values(by='col')
● Grouping and Aggregation: df.groupby('col').sum()
● Pivot Tables: df.pivot_table(index='col1', values='col2', aggfunc='mean')

By: Waleed Mousa


● Merging DataFrames: pd.merge(df1, df2, on='col')
● Concatenating DataFrames: pd.concat([df1, df2])
● Joining DataFrames: df1.join(df2, on='col')
● Reshaping with Melt: pd.melt(df, id_vars=['col1'], value_vars=['col2'])
● Reshaping with Stack/Unstack: df.stack() or df.unstack()
● Creating Dummy Variables: pd.get_dummies(df['col'])
● Applying Functions: df.apply(lambda x: custom_function(x))
● Regular Expressions: df['col'].str.extract('(regex_pattern)')
● Handling Time Series Data: df.resample('D').mean()
● Rolling Window Calculations: df.rolling(window=5).mean()
● Conditional Logic: np.where(df['col'] > value, 'yes', 'no')
● Data Normalization: (df - df.mean()) / df.std()
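Several of the operations above chain together into a typical transform step. A minimal sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "LA", "NY", "SF"],
    "sales": [100, 200, 150, None],
})

cleaned = (
    df.dropna(subset=["sales"])       # handle missing data
      .astype({"sales": "int64"})     # type conversion
      .sort_values(by="sales")        # sorting
)
totals = cleaned.groupby("city")["sales"].sum()  # grouping and aggregation
```

Chaining with parentheses keeps each transformation on its own commented line without intermediate variables.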

4. Advanced Data Transformation

● Binning Numerical Data: pd.cut(df['col'], bins)
● Discretizing Numerical Data: pd.qcut(df['col'], q=4)
● Transforming with Map: df['col'].map(mapping_dict)
● Exploding List-Like Data: df.explode('list_col')
● Pivot Longer and Wider: df.pivot_longer() and df.pivot_wider() (require the pyjanitor library)
● Multi-Index Creation and Slicing: df.set_index(['col1', 'col2'])
● Cross-Tabulation: pd.crosstab(df['col1'], df['col2'])
● Aggregation with Custom Functions: df.groupby('col').agg(custom_agg_func)
● Correlation Matrix: df.corr()
● Data Standardization for Machine Learning: StandardScaler().fit_transform(df) (StandardScaler is from scikit-learn)
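Binning with `pd.cut` is a common first step before cross-tabulation or grouping. A minimal sketch (bin edges and labels are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"age": [5, 15, 25, 35, 45, 55]})

# Bin ages into labeled intervals; bins are right-inclusive by default.
df["group"] = pd.cut(df["age"], bins=[0, 18, 40, 100],
                     labels=["child", "adult", "senior"])
counts = df["group"].value_counts()
```

`pd.qcut` works the same way but chooses bin edges so each bin holds roughly the same number of rows.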

5. Data Cleaning

● Trimming Whitespace: df['col'].str.strip()
● Replacing Values: df.replace({'old_value': 'new_value'})
● Dropping Duplicates: df.drop_duplicates()
● Data Validation Checks: pd.testing.assert_frame_equal(df1, df2)
● Regular Interval Resampling for Time Series: df.resample('5T').mean() ('5T' means 5-minute intervals; newer pandas versions prefer '5min')
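Trimming and de-duplication usually go together, since stray whitespace can hide duplicate rows. A minimal sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["  alice ", "bob", "bob"]})

df["name"] = df["name"].str.strip()  # remove leading/trailing whitespace
df = df.drop_duplicates()            # then drop exact duplicate rows
```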

6. Exploratory Data Analysis

● Descriptive Statistics: df.describe()
● Histograms for Distribution: df['col'].hist(bins=20)
● Box Plots for Outliers: df.boxplot(column='col')
● Pair Plots for Relationships: sns.pairplot(df)
● Heatmap for Correlation Analysis: sns.heatmap(df.corr(), annot=True)
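The plotting calls above need matplotlib/seaborn; the numeric side of EDA can be inspected directly from `describe()`. A minimal sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})

# count, mean, std, min, quartiles, and max in one Series
stats = df["x"].describe()
mean = stats["mean"]
```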

7. Handling Large Data

● Chunking Large Data Files: pd.read_csv('large_file.csv', chunksize=10000)
● Memory Usage of DataFrame: df.memory_usage(deep=True)
● Optimizing Data Types: df.astype({'col': 'category'})
● Lazy Evaluation with Dask: dask.dataframe.from_pandas(df)
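With `chunksize`, `read_csv` yields an iterator of DataFrames instead of loading everything at once, so aggregates can be accumulated chunk by chunk. A minimal sketch using an in-memory buffer in place of a genuinely large file:

```python
import io
import pandas as pd

# A StringIO buffer stands in for 'large_file.csv' (hypothetical data).
big = io.StringIO("v\n" + "\n".join(str(i) for i in range(10)))

total = 0
for chunk in pd.read_csv(big, chunksize=4):  # 4-row chunks
    total += chunk["v"].sum()                # aggregate per chunk
```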

8. Data Anonymization

● Hashing for Anonymization: df['col'].apply(lambda x: hashlib.sha256(x.encode()).hexdigest())
● Randomized Data Perturbation: df['col'] + np.random.normal(0, 1, df.shape[0])
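The hashing approach above can be sketched end to end; SHA-256 is deterministic, so the same input always maps to the same digest (column values are hypothetical):

```python
import hashlib
import pandas as pd

df = pd.DataFrame({"email": ["a@x.com", "b@x.com"]})

# One-way hash: the original value cannot be recovered from the digest.
df["email_hash"] = df["email"].apply(
    lambda x: hashlib.sha256(x.encode()).hexdigest()
)
```

Note that determinism cuts both ways: an attacker with a candidate list can re-hash and match values, so salting may be needed for real anonymization.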

9. Text Data Specific Operations

● Word Count: df['text'].str.split().str.len()
● Text Cleaning (e.g., removing punctuation): df['text'].str.replace(r'[^\w\s]', '', regex=True)
● Term Frequency: df['text'].str.split().explode().value_counts()
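The cleaning and word-count steps above compose naturally: strip punctuation first, then split and count. A minimal sketch with hypothetical text:

```python
import pandas as pd

df = pd.DataFrame({"text": ["Hello, world!", "pandas is great"]})

# Remove anything that is not a word character or whitespace.
df["clean"] = df["text"].str.replace(r"[^\w\s]", "", regex=True)
df["words"] = df["clean"].str.split().str.len()  # words per row
```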

10. Visualization for EDA

● Bar Plots: df['col'].value_counts().plot(kind='bar')
● Line Plots: df.plot(kind='line', x='x_col', y='y_col')
● Scatter Plots: df.plot.scatter(x='x_col', y='y_col')
● KDE Plots for Density: df['col'].plot.kde()

11. Advanced Data Loading and Transformation

● Integrating with Web APIs: pd.DataFrame(requests.get(api_url).json()) (requires the requests library)
● Loading Data from Remote Sources: pd.read_csv(remote_file_url)
● Complex Data Transformations: df.pipe(custom_complex_transformation)
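`df.pipe` passes the DataFrame as the first argument to a custom function, which keeps multi-step transformations readable. A minimal sketch with a hypothetical helper:

```python
import pandas as pd

def add_total(frame, cols):
    # Hypothetical transformation: sum selected columns into a new column.
    frame = frame.copy()  # avoid mutating the caller's frame
    frame["total"] = frame[cols].sum(axis=1)
    return frame

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
result = df.pipe(add_total, cols=["a", "b"])  # chains like a method call
```

Because `pipe` returns its result, several custom steps can be chained: `df.pipe(step1).pipe(step2, arg=...)`.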

12. Feature Engineering

● Date Part Extraction: df['date_col'].dt.year, df['date_col'].dt.month, etc.
● Lag Features for Time Series: df['feature'].shift(periods=1)
● Rolling Features for Time Series: df['feature'].rolling(window=5).mean()
● Differential Features: df['feature'].diff(periods=1)
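Lag, rolling, and difference features all derive from the same series. A minimal sketch on a short daily series (values are hypothetical):

```python
import pandas as pd

s = pd.Series([10, 12, 15, 11],
              index=pd.date_range("2024-01-01", periods=4, freq="D"))

lag1 = s.shift(periods=1)   # previous day's value (first entry becomes NaN)
diff1 = s.diff(periods=1)   # day-over-day change
year = s.index.year         # date part extraction from the index
```

The leading NaN introduced by `shift`/`diff` usually needs to be dropped or filled before modeling.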

13. Data Integration

● Combining Multiple Data Sources: pd.concat([df1, df2], axis=0)
● Merging Data on Keys: pd.merge(df1, df2, on='key_column')
● Creating Database Connections for Extraction/Loading: sqlalchemy.create_engine(db_string)

14. Performance Optimization

● Parallel Processing with Swifter: df.swifter.apply(custom_function) (requires the swifter library)
● Optimizing DataFrames with Eval/Query: df.eval('new_col = col1 + col2')
● Categorical Data Optimization: df['cat_col'] = df['cat_col'].astype('category')
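The categorical conversion above pays off when a string column has few distinct values, because pandas stores one copy of each label plus small integer codes. A minimal sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"] * 100})

before = df["color"].memory_usage(deep=True)   # bytes as object dtype
df["color"] = df["color"].astype("category")   # dictionary-encode repeats
after = df["color"].memory_usage(deep=True)    # bytes as category dtype
```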

15. Error Handling and Data Quality

● Error Handling in Data Loading: try: df = pd.read_csv('file.csv') except FileNotFoundError: handle_error()
● Data Quality Checks: assert df['column'].notnull().all()

16. Data Serialization and Compression

● Saving DataFrames in Compressed Format: df.to_csv('file.csv.gz', compression='gzip')
● Reading Compressed Data: pd.read_csv('file.csv.gz', compression='gzip')
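The compressed write and read above form a lossless round trip. A minimal sketch through a temporary file (the path is hypothetical):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"a": range(5)})
path = os.path.join(tempfile.mkdtemp(), "data.csv.gz")

df.to_csv(path, index=False, compression="gzip")
restored = pd.read_csv(path, compression="gzip")
```

With a `.gz` extension, pandas can also infer the compression automatically, so the explicit `compression='gzip'` argument is optional on both sides.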

17. Using Pandas with Other Libraries for ETL/ELT

● Converting DataFrame to Spark DataFrame: spark.createDataFrame(df)
● Using Pandas with PySpark for Distributed Processing: spark_df = spark.read.csv('file.csv')
● Integration with NumPy for Mathematical Operations: np.log(df['numeric_column'])

18. Workflow Automation and Scripting

● Automating ETL Processes: schedule.every().day.at("10:30").do(etl_job)
● Running Pandas Operations in Scripts: python etl_script.py

19. Ensuring Data Consistency

● Data Type Validation: df['column'].dtype == 'expected_dtype'
● Consistency Checks Between DataFrames: pd.testing.assert_frame_equal(df1, df2)

20. Reporting and Documentation

● Generating Summary Reports: profile = ProfileReport(df) (from the ydata-profiling package, formerly pandas-profiling)

21. Database Specific Operations

● Querying Databases Directly: pd.read_sql_query('SELECT * FROM table', engine)
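`read_sql_query` accepts either a SQLAlchemy engine or a DBAPI connection. A minimal self-contained sketch using an in-memory SQLite database (table and column names are hypothetical):

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for a real database connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])

df = pd.read_sql_query("SELECT * FROM users ORDER BY id", conn)
```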
