Data Analysis

The document outlines a step-by-step guide for performing data analysis using Python libraries such as Pandas, Matplotlib, Seaborn, and Scikit-learn on a sales dataset. It covers essential tasks including data loading, exploration, cleaning, visualization, feature engineering, and building a predictive model for Total_Sales. The example demonstrates practical applications of data manipulation, visualization techniques, and machine learning model evaluation.

Uploaded by

Messih Grmay
Copyright
© All Rights Reserved

Data Analysis

Pandas, together with libraries such as Matplotlib, Seaborn, and Scikit-learn, is commonly
used to handle, analyze, and visualize datasets in Python. Below is a step-by-step example
of a data analysis workflow built on these tools.

Let's walk through an example of performing data analysis on a CSV dataset that contains
information about customer sales transactions.

Step 1: Install Necessary Libraries


bash
pip install pandas matplotlib seaborn scikit-learn

Step 2: Import Libraries


python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

Step 3: Load the Dataset

We'll load a sample dataset into a Pandas DataFrame using the read_csv function. Assume the
dataset is a CSV file named sales_data.csv, with the columns Date, Product, Price, Quantity,
Total_Sales, and Customer_ID.

python
# Load dataset
df = pd.read_csv('sales_data.csv')

# Display the first few rows of the dataset
# (print is needed when running as a script rather than in a notebook)
print(df.head())

Sample Data (sales_data.csv):

Date        Product  Price  Quantity  Total_Sales  Customer_ID
2021-01-01  Widget   25     2         50           101
2021-01-02  Gadget   15     3         45           102
2021-01-03  Widget   25     5         125          103
2021-01-04  Widget   25     3         75           101
2021-01-05  Gadget   15     4         60           102
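
If sales_data.csv is not at hand, the five sample rows can be read from an in-memory string instead. This is only a minimal sketch: the variable name sample_csv is introduced here for illustration, and the values are copied from the sample data.

```python
import io

import pandas as pd

# The five sample rows, written out as CSV text (illustrative stand-in for the file).
sample_csv = """Date,Product,Price,Quantity,Total_Sales,Customer_ID
2021-01-01,Widget,25,2,50,101
2021-01-02,Gadget,15,3,45,102
2021-01-03,Widget,25,5,125,103
2021-01-04,Widget,25,3,75,101
2021-01-05,Gadget,15,4,60,102
"""

# read_csv accepts any file-like object, so StringIO can stand in for the file.
df = pd.read_csv(io.StringIO(sample_csv))
print(df.head())
```

In these rows Total_Sales equals Price multiplied by Quantity, which is worth keeping in mind for the modeling step later on.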

Step 4: Basic Data Exploration

Before starting analysis, it’s important to explore and clean the data.

python
# Data summary and info
df.info()             # Prints data types and non-null counts (returns None)
print(df.describe())  # Summary statistics for numeric columns

# Check for missing values
print(df.isnull().sum())

# Convert 'Date' column to datetime type
df['Date'] = pd.to_datetime(df['Date'])

# Count duplicate rows, then drop them
print(df.duplicated().sum())
df = df.drop_duplicates()
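
Beyond the generic checks, it can help to test a rule specific to this dataset. The sketch below assumes that Total_Sales should always equal Price × Quantity, which holds for the sample rows but is not stated for the full dataset. The small frame and the name mismatch are made up for illustration.

```python
import pandas as pd

# Illustrative frame: the last row repeats the first,
# and the middle row has a total that does not match Price * Quantity.
df = pd.DataFrame({
    "Price": [25, 15, 25],
    "Quantity": [2, 3, 2],
    "Total_Sales": [50, 40, 50],
})

# Count fully duplicated rows before removing them.
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates()

# Flag rows where the recorded total differs from Price * Quantity.
mismatch = df[df["Price"] * df["Quantity"] != df["Total_Sales"]]

print(f"Duplicate rows removed: {n_duplicates}")
print(mismatch)
```

Rows flagged this way are candidates for correction or removal in the cleaning step.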

Step 5: Data Cleaning (if necessary)

In case there are missing or inconsistent values in the dataset, we can handle them:

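python
# Fill missing values (if any) with the column mean.
# Assigning the result back avoids the chained inplace fillna, which newer
# pandas versions warn about and which may not modify df.
df['Quantity'] = df['Quantity'].fillna(df['Quantity'].mean())

# Drop rows with a missing target variable ('Total_Sales')
df = df.dropna(subset=['Total_Sales'])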
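
The effect of these two cleaning steps is easiest to see on a small frame with deliberately missing values. The data below is made up for illustration. Whether the mean is a suitable fill value depends on the dataset; the median is a common alternative when quantities are skewed.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing Quantity and one missing Total_Sales.
df = pd.DataFrame({
    "Quantity": [2.0, np.nan, 4.0, 3.0],
    "Total_Sales": [50.0, 45.0, np.nan, 75.0],
})

# Mean of the non-missing quantities: (2 + 4 + 3) / 3 = 3.0
df["Quantity"] = df["Quantity"].fillna(df["Quantity"].mean())

# Rows without a target value cannot be used for training, so drop them.
df = df.dropna(subset=["Total_Sales"])

print(df)
```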

Step 6: Data Visualization

Data visualization helps to better understand trends, relationships, and distributions in the
dataset.

Example 1: Sales Distribution by Product


python
# Bar plot showing total sales for each product
product_sales = df.groupby('Product')['Total_Sales'].sum().sort_values()
product_sales.plot(kind='bar', color='skyblue')
plt.title('Total Sales by Product')
plt.xlabel('Product')
plt.ylabel('Total Sales')
plt.show()

Example 2: Scatter Plot for Price vs. Total Sales

python
# Scatter plot to analyze the relationship between Price and Total Sales
plt.figure(figsize=(8,6))
sns.scatterplot(x='Price', y='Total_Sales', data=df)
plt.title('Price vs Total Sales')
plt.show()

Example 3: Sales Trends Over Time

python
# Line plot to show sales trends over time
df_grouped = df.groupby('Date')['Total_Sales'].sum()
df_grouped.plot(kind='line', figsize=(10,6), color='green')
plt.title('Sales Trend Over Time')
plt.xlabel('Date')
plt.ylabel('Total Sales')
plt.show()
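
Before plotting, it can be useful to print the aggregated values that the bar and line charts are drawn from. The following is a minimal sketch using the five sample rows, hard-coded so that the block runs without the CSV file. The name daily_sales is introduced here for illustration.

```python
import pandas as pd

# The relevant columns of the five sample rows.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04", "2021-01-05",
    ]),
    "Product": ["Widget", "Gadget", "Widget", "Widget", "Gadget"],
    "Total_Sales": [50, 45, 125, 75, 60],
})

# Values behind the bar chart: one total per product, smallest first.
product_sales = df.groupby("Product")["Total_Sales"].sum().sort_values()
print(product_sales)  # Gadget 105, Widget 250

# Values behind the line chart: one total per day.
daily_sales = df.groupby("Date")["Total_Sales"].sum()
print(daily_sales)
```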

Step 7: Feature Engineering

In case you want to create new features or variables for predictive models:

python
# Extract year and month from 'Date'
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

# Feature engineering: Calculate profit assuming a 30% profit margin
df['Profit'] = df['Total_Sales'] * 0.30
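
Applied to two of the sample rows, these features look as follows. The DayOfWeek column is an extra illustration that is not part of the original steps, and the flat 30% margin remains an assumption.

```python
import pandas as pd

# Two of the sample rows, with 'Date' already converted to datetime.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-03"]),
    "Total_Sales": [50, 125],
})

df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month

# Extra illustrative feature: Monday = 0 ... Sunday = 6
df["DayOfWeek"] = df["Date"].dt.dayofweek

# Assumed flat 30% profit margin
df["Profit"] = df["Total_Sales"] * 0.30

print(df)
```

Note that Profit is computed directly from Total_Sales, so it should not be used as an input feature when Total_Sales is the prediction target.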

Step 8: Build a Simple Predictive Model (Example: Predicting Total Sales)

Let’s build a simple machine learning model to predict Total_Sales based on features like
Price, Quantity, and Product.
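
Product is a text column, so it has to be converted to numbers before a linear model can use it. Integer category codes are one option, but they imply an ordering between products that the model will treat as numeric. A common alternative, shown here as a sketch rather than as the method used in this example, is one-hot encoding with pd.get_dummies. The sample rows are hard-coded so the block runs on its own.

```python
import pandas as pd

# Feature columns of the five sample rows.
df = pd.DataFrame({
    "Product": ["Widget", "Gadget", "Widget", "Widget", "Gadget"],
    "Price": [25, 15, 25, 25, 15],
    "Quantity": [2, 3, 5, 3, 4],
})

# One indicator column per product; drop_first removes the column that is
# redundant when the model also fits an intercept.
X = pd.get_dummies(
    df[["Price", "Quantity", "Product"]],
    columns=["Product"],
    drop_first=True,
    dtype=int,
)

print(X.columns.tolist())  # ['Price', 'Quantity', 'Product_Widget']
```

With only two products the difference from integer codes is small, but it matters once there are three or more categories.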

1. Split Data into Training and Testing Sets


python
# Convert 'Product' into a numerical category.
# A separate column keeps the original product names for the saved dataset.
df['Product_Code'] = df['Product'].astype('category').cat.codes

# Features and target variable
X = df[['Price', 'Quantity', 'Product_Code']]
y = df['Total_Sales']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

2. Train the Model (Linear Regression Example)

python
# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

3. Evaluate the Model

python
# Calculate Mean Absolute Error (MAE) to evaluate the model
mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error: {mae}')
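
MAE is the average of the absolute differences between actual and predicted values. The sketch below computes it by hand for made-up numbers and compares the result with scikit-learn's mean_absolute_error. It also reports R², a metric that the original example does not use.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up actual and predicted totals, for illustration only.
y_true = np.array([50.0, 45.0, 125.0, 75.0])
y_hat = np.array([55.0, 40.0, 115.0, 75.0])

# Absolute errors are 5, 5, 10 and 0, so the mean is 20 / 4 = 5.0
mae_manual = np.mean(np.abs(y_true - y_hat))
mae_sklearn = mean_absolute_error(y_true, y_hat)

# R^2 measures the share of variance in y_true explained by the predictions.
r2 = r2_score(y_true, y_hat)

print(f"MAE (manual): {mae_manual}, MAE (sklearn): {mae_sklearn}")
print(f"R^2: {r2:.3f}")
```

MAE is expressed in the same units as Total_Sales, which makes it easy to interpret. On a dataset as small as the five sample rows, a 20% test split leaves a single test row, so any metric should be read with caution.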

Step 9: Save the Results

You can save the model or processed data for future use:

python
import pickle

# Save the processed dataset to a new CSV file
df.to_csv('processed_sales_data.csv', index=False)

# Save the trained model using pickle
with open('sales_prediction_model.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)
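
A saved model is only useful if it can be loaded again. The sketch below round-trips a tiny stand-in model through pickle in a temporary directory. The stand-in data and the names loaded_model and same are assumptions made for illustration. Only unpickle files from a trusted source, and ideally load them with the same scikit-learn version that wrote them.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model so the block runs on its own: y = 2x exactly.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
model = LinearRegression().fit(X, y)

# Save to a temporary directory, then load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sales_prediction_model.pkl")
    with open(path, "wb") as model_file:
        pickle.dump(model, model_file)
    with open(path, "rb") as model_file:
        loaded_model = pickle.load(model_file)

# The reloaded model should give the same predictions as the original.
same = bool(np.allclose(model.predict(X), loaded_model.predict(X)))
print(same)
```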

Example Summary:

In this example, we loaded a sales dataset, performed data exploration, cleaning, and
visualization, and then built a machine learning model to predict Total_Sales. Along the way,
we used:

• Pandas: for data manipulation and cleaning.
• Matplotlib and Seaborn: for data visualization (scatter plots, line plots, and bar charts).
• Scikit-learn: for machine learning, including data splitting, model training, and evaluation.

This is just a simple demonstration. In real-world scenarios, the data analysis process can involve
more complex transformations, more advanced machine learning models, and more sophisticated
visualizations.
