This cheat sheet provides a comprehensive guide to using Pandas for Extract, Transform, Load (ETL) processes, covering data extraction, loading, transformation, cleaning, and analysis. It includes functions and methods for handling different data formats, performing advanced transformations, and optimizing performance. It also addresses data integration, serialization, and automation of ETL workflows.
1. Data Extraction
● Read CSV File: pd.read_csv('filename.csv')
● Read Excel File: pd.read_excel('filename.xlsx')
● Read JSON File: pd.read_json('filename.json')
● Read SQL Database: pd.read_sql(query, connection)
● Read HTML Tables: pd.read_html('http://url') (returns a list of DataFrames)
● Read Parquet File: pd.read_parquet('filename.parquet')
● Read from Clipboard: pd.read_clipboard()
● Read from a Python Dictionary: pd.DataFrame.from_dict(dict)
● Read from Multiple Files: [pd.read_csv(f) for f in file_list]
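For example, a minimal sketch that reads several CSV files and stacks them into one DataFrame (the 'data/*.csv' pattern is an illustrative assumption):

import glob
import pandas as pd

# Collect matching files, read each, and stack them into one DataFrame.
# The folder layout here is an assumption for illustration.
file_list = glob.glob('data/*.csv')
frames = [pd.read_csv(f) for f in file_list]
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)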
2. Data Loading
● Write to CSV File: df.to_csv('filename.csv')
● Write to Excel File: df.to_excel('filename.xlsx')
● Write to JSON File: df.to_json('filename.json')
● Write to SQL Database: df.to_sql(table_name, connection)
● Write to Parquet File: df.to_parquet('filename.parquet')
● Write to HTML File: df.to_html('filename.html')
● Append to Existing File or Database: df.to_sql(table_name, connection, if_exists='append')
● Save to Python Pickle Format: df.to_pickle('filename.pkl')
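A sketch of appending rows to a SQL table via SQLAlchemy; the SQLite path and the 'sales' table name are illustrative assumptions:

import pandas as pd
from sqlalchemy import create_engine

# Create a connection engine, then append without writing the index.
engine = create_engine('sqlite:///example.db')
df = pd.DataFrame({'region': ['north', 'south'], 'revenue': [100, 250]})
df.to_sql('sales', engine, if_exists='append', index=False)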
4. Advanced Data Transformation
● Discretizing Numerical Data: pd.qcut(df['col'], q=4)
● Transforming with Map: df['col'].map(mapping_dict)
● Exploding List-Like Data: df.explode('list_col')
● Pivot Longer and Wider: df.pivot_longer() and df.pivot_wider() (using the pyjanitor library; the pandas equivalents are df.melt() and df.pivot())
● Multi-Index Creation and Slicing: df.set_index(['col1', 'col2'])
● Cross-Tabulation: pd.crosstab(df['col1'], df['col2'])
● Aggregation with Custom Functions: df.groupby('col').agg(custom_agg_func)
● Correlation Matrix: df.corr(numeric_only=True)
● Data Standardization for Machine Learning: StandardScaler().fit_transform(df) (scikit-learn)
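A minimal sketch combining quantile binning with a custom group aggregation; the column names and data are assumptions for illustration:

import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                   'value': [1.0, 4.0, 2.0, 8.0]})

# Discretize into two quantile bins (pd.qcut needs enough distinct values).
df['bin'] = pd.qcut(df['value'], q=2, labels=['low', 'high'])

# Aggregate each group with a custom function (range = max - min).
value_range = df.groupby('group')['value'].agg(lambda s: s.max() - s.min())
print(value_range)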
5. Data Cleaning
● Trimming Whitespace: df['col'].str.strip()
● Replacing Values: df.replace({'old_value': 'new_value'})
● Dropping Duplicates: df.drop_duplicates()
● Data Validation Checks: pd.testing.assert_frame_equal(df1, df2)
● Regular Interval Resampling for Time Series: df.resample('5min').mean() ('5T' in older pandas versions)
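A small cleaning pass chaining the steps above; column names and values are illustrative:

import pandas as pd

df = pd.DataFrame({'city': [' NYC ', 'NYC', 'LA '],
                   'status': ['old_value', 'new_value', 'old_value']})

# Trim whitespace, normalize a value, then drop the resulting duplicates.
df['city'] = df['city'].str.strip()
df = df.replace({'old_value': 'new_value'}).drop_duplicates()
print(df)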
6. Exploratory Data Analysis
● Descriptive Statistics: df.describe()
● Histograms for Distribution: df['col'].hist(bins=20)
● Box Plots for Outliers: df.boxplot(column='col')
● Pair Plots for Relationships: sns.pairplot(df) (seaborn)
● Heatmap for Correlation Analysis: sns.heatmap(df.corr(), annot=True)
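A minimal EDA sketch on synthetic data; seaborn and matplotlib are assumed to be installed:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'x': range(10), 'y': [v * 2 + 1 for v in range(10)]})

# Summary statistics for every numeric column.
print(df.describe())

# Annotated correlation heatmap over numeric columns.
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()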
7. Handling Large Data
● Chunking Large Data Files: pd.read_csv('large_file.csv', chunksize=10000)
● Memory Usage of DataFrame: df.memory_usage(deep=True)
● Optimizing Data Types: df.astype({'col': 'category'})
● Lazy Evaluation with Dask: dask.dataframe.from_pandas(df, npartitions=4)
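A sketch of chunked processing: aggregating a large CSV without loading it all at once. The file name and the 'amount' column are assumptions:

import pandas as pd

# Each iteration yields a DataFrame of at most 10,000 rows.
totals = []
for chunk in pd.read_csv('large_file.csv', chunksize=10_000):
    totals.append(chunk['amount'].sum())
print(sum(totals))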
8. Data Anonymization
● Hashing for Anonymization: df['col'].apply(lambda x: hashlib.sha256(x.encode()).hexdigest())
● Randomized Data Perturbation: df['col'] + np.random.normal(0, 1, df.shape[0])
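A sketch of hashing a column for anonymization; adding a secret salt (an assumption here, not in the original one-liner) makes the hashes harder to reverse with a lookup table:

import hashlib
import pandas as pd

SALT = 'replace-with-a-secret'  # illustrative placeholder
df = pd.DataFrame({'email': ['a@example.com', 'b@example.com']})
df['email_hash'] = df['email'].apply(
    lambda x: hashlib.sha256((SALT + x).encode()).hexdigest())
print(df[['email_hash']])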
9. Text Data Specific Operations
● Word Count: df['text'].str.split().str.len()
● Text Cleaning (e.g., removing punctuation): df['text'].str.replace(r'[^\w\s]', '', regex=True)
● Term Frequency: df['text'].str.split().explode().value_counts()
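A minimal sketch chaining these text operations; the sample text is illustrative:

import pandas as pd

df = pd.DataFrame({'text': ['Hello, world!', 'hello again.']})

# Lowercase, strip punctuation, then count term frequencies.
clean = df['text'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
term_freq = clean.str.split().explode().value_counts()
print(term_freq)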
10. Visualization for EDA
● Bar Plots: df['col'].value_counts().plot(kind='bar')
● Line Plots: df.plot(kind='line', x='x_col', y='y_col')
● Scatter Plots: df.plot.scatter(x='x_col', y='y_col')
● KDE Plots for Density: df['col'].plot.kde()
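A quick sketch of a value-counts bar plot on illustrative data:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'grade': ['a', 'b', 'a', 'c', 'a']})

# Count category frequencies and plot them as bars.
df['grade'].value_counts().plot(kind='bar')
plt.show()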
11. Advanced Data Loading and Transformation
● Integrating with Web APIs: requests.get(api_url)
● Loading Data from Remote Sources: pd.read_csv(remote_file_url)
● Complex Data Transformations: df.pipe(custom_complex_transformation)
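A sketch of pulling JSON from a web API into a DataFrame; the URL is a placeholder, and the response is assumed to be a list of records:

import pandas as pd
import requests

response = requests.get('https://api.example.com/records', timeout=10)
response.raise_for_status()  # fail fast on HTTP errors

# A list of JSON objects maps directly to rows.
df = pd.DataFrame(response.json())
print(df.head())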
12. Feature Engineering
● Date Part Extraction: df['date_col'].dt.year, df['date_col'].dt.month, etc.
● Lag Features for Time Series: df['feature'].shift(periods=1)
● Rolling Features for Time Series: df['feature'].rolling(window=5).mean()
● Differential Features: df['feature'].diff(periods=1)
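A sketch building these features on a synthetic daily series; the column names are assumptions:

import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=6, freq='D'),
    'feature': [1.0, 2.0, 4.0, 3.0, 5.0, 6.0],
})

# Date part, one-step lag, 3-day rolling mean, and first difference.
df['year'] = df['date'].dt.year
df['lag_1'] = df['feature'].shift(periods=1)
df['rolling_mean_3'] = df['feature'].rolling(window=3).mean()
df['diff_1'] = df['feature'].diff(periods=1)
print(df)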
13. Data Integration
● Combining Multiple Data Sources: pd.concat([df1, df2], axis=0)
● Merging Data on Keys: pd.merge(df1, df2, on='key_column')
● Creating Database Connections for Extraction/Loading: sqlalchemy.create_engine(db_string)
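A minimal sketch of joining two sources on a shared key; the column names are illustrative assumptions:

import pandas as pd

customers = pd.DataFrame({'key_column': [1, 2], 'name': ['Ann', 'Bo']})
orders = pd.DataFrame({'key_column': [1, 1, 2], 'total': [10, 20, 15]})

# Left join keeps every customer, matched to zero or more orders.
merged = pd.merge(customers, orders, on='key_column', how='left')
print(merged)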
14. Performance Optimization
● Parallel Processing with Swifter: df.swifter.apply(custom_function)
● Optimizing DataFrames with Eval/Query: df.eval('new_col = col1 + col2')
● Categorical Data Optimization: df['cat_col'] = df['cat_col'].astype('category')
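A sketch showing the memory effect of the category dtype on a low-cardinality string column; the data is synthetic:

import pandas as pd

df = pd.DataFrame({'cat_col': ['red', 'green', 'blue'] * 10_000})

# Compare total memory before and after the dtype conversion.
before = df.memory_usage(deep=True).sum()
df['cat_col'] = df['cat_col'].astype('category')
after = df.memory_usage(deep=True).sum()
print(before, '->', after)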
15. Error Handling and Data Quality
● Error Handling in Data Loading: try: df = pd.read_csv('file.csv') except FileNotFoundError: handle_error()
● Data Quality Checks: assert df['column'].notnull().all()
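A sketch of guarded loading plus a simple quality assertion; the file name and 'column' are placeholders, and the empty-frame fallback is one possible policy:

import pandas as pd

try:
    df = pd.read_csv('file.csv')
except (FileNotFoundError, pd.errors.ParserError) as exc:
    print(f'load failed: {exc}')
    df = pd.DataFrame(columns=['column'])  # fall back to an empty frame

# Data-quality check: the required column must contain no nulls.
assert df['column'].notnull().all(), 'column contains nulls'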
16. Data Serialization and Compression
● Saving DataFrames in Compressed Format: df.to_csv('file.csv.gz', compression='gzip')
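A sketch of a compressed round-trip; pandas infers gzip from the '.gz' suffix, so the explicit argument is optional:

import pandas as pd

df = pd.DataFrame({'numeric_column': [1.0, 2.0, 3.0]})

# Write gzip-compressed CSV, then read it back (compression is inferred).
df.to_csv('file.csv.gz', index=False, compression='gzip')
restored = pd.read_csv('file.csv.gz')
print(restored)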
17. Integration with Other Libraries
● Converting DataFrame to Spark DataFrame: spark.createDataFrame(df)
● Using Pandas with PySpark for Distributed Processing: spark_df = spark.read.csv('file.csv')
● Integration with NumPy for Mathematical Operations: np.log(df['numeric_column'])
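A sketch of moving data between pandas and Spark; it assumes pyspark is installed and a local SparkSession can start:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('demo').getOrCreate()

# pandas -> Spark for distributed processing, then back to pandas.
pdf = pd.DataFrame({'numeric_column': [1.0, 2.0, 3.0]})
sdf = spark.createDataFrame(pdf)
back = sdf.toPandas()
print(back)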