Jupyter Notebook is an interactive tool for writing and running code, mainly used for
data analysis, machine learning, and scientific research. It allows you to combine code, text, and
visualizations in one document.
Using Anaconda: Anaconda is a distribution that includes Jupyter Notebook along with
Python and other essential libraries.
To install:
○ Download and install Anaconda from https://www.anaconda.com/.
○ Open Anaconda Navigator and launch Jupyter Notebook.
Using pip:
If you prefer not to use Anaconda:
○ Ensure Python and pip are installed on your system.
○ Run the command: pip install notebook
○ Start Jupyter Notebook by typing jupyter notebook in your terminal or
command prompt.
Menus: Located at the top, the menus provide options for file management, editing,
viewing, running code, and more. Toolbar: Contains shortcuts for common tasks such as saving,
adding cells, and running code. Cells: The main building blocks of a notebook; cells can
contain code, Markdown, or raw text.
2. Essential Features and Shortcuts
Jupyter Notebook has several features and shortcuts to help you work more efficiently.
Code Cells are used to write and execute programming code. Outputs, such as printed
results or graphs, appear below the cell. Markdown Cells are for adding formatted text, links,
images, and equations using Markdown syntax. Raw Cells display text as-is without formatting
or execution.
Cell Navigation:
○ Enter: Edit the current cell.
○ Esc: Switch to command mode.
○ Arrow keys: Move between cells.
Cell Operations:
○ A: Insert a cell above.
○ B: Insert a cell below.
○ X: Cut the selected cell.
○ Z: Undo cell deletion.
Execution:
○ Shift + Enter: Run the current cell and move to the next.
○ Ctrl + Enter: Run the current cell and stay in place.
○ Alt + Enter: Run the current cell and insert a new cell below.
● Save your notebook using Ctrl + S or the save icon in the toolbar.
● Export notebooks as HTML, PDF, or other formats via the File > Download as menu.
● Share your notebook by uploading it to platforms like GitHub or Google Drive.
Markdown helps you format text easily. It's used in Jupyter Notebooks to add headings,
lists, links, images, and more.
Hyperlinks:
[Link Text](https://example.com)
Images:

Tables:
| Column1 | Column2 |
|---------|---------|
| Data 1  | Data 2  |
Data manipulation refers to the process of cleaning, transforming, and analyzing data. In
Python, this is typically done using libraries like Pandas and NumPy.
Pandas is a powerful Python library used for data manipulation and analysis. It provides
data structures like DataFrames that make it easy to work with structured data.
4.1 Loading Data into Pandas DataFrames (CSV, Excel, SQL, JSON)
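A minimal sketch of the common loaders; the file names, the SQLite database, and the sales table below are placeholders:
import pandas as pd
import sqlite3
df_csv = pd.read_csv("data.csv")        # CSV file
df_excel = pd.read_excel("data.xlsx")   # Excel file (typically requires openpyxl)
df_json = pd.read_json("data.json")     # JSON file
conn = sqlite3.connect("data.db")       # SQL: connect to a SQLite database
df_sql = pd.read_sql("SELECT * FROM sales", conn)  # Query into a DataFrame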
● Head and Tail: Use df.head() and df.tail() to view the first and last rows of the
DataFrame.
● Info: Use df.info() to get a summary of the DataFrame, including data types and
non-null counts.
● Describe: Use df.describe() for statistical summaries of numerical columns.
4.3 Cleaning Data: Handling Missing Values, Duplicates, and Outliers
Missing Values:
● Identify: df.isnull().sum()
● Fill: df.fillna(value)
● Drop: df.dropna()
Duplicates:
● Identify: df.duplicated()
● Remove: df.drop_duplicates()
Outliers:
● Detect: Use statistical methods like z-scores or IQR.
● Handle: Replace or remove outlier values (see the sketch below).
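A short sketch of the IQR approach, assuming a DataFrame df with a numeric column named 'value':
Q1 = df['value'].quantile(0.25)  # First quartile
Q3 = df['value'].quantile(0.75)  # Third quartile
IQR = Q3 - Q1                    # Interquartile range
mask = df['value'].between(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)
df_clean = df[mask]              # Keep only rows inside the whiskers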
Data visualization is crucial for presenting insights from data clearly. In Python, popular
libraries such as Matplotlib, Seaborn, and Plotly/Bokeh are used to create a variety of
visualizations. Below is a breakdown of the core features of these libraries.
Matplotlib is a widely used library in Python for creating static, animated, and
interactive visualizations.
8.1 Visualizing Distributions: Histograms, KDE, and Boxplots
9. Interactive Visualizations
Plotly and Bokeh are libraries used to create interactive visualizations, making it easier
to explore the data.
Plotly: A powerful library for creating interactive visualizations such as line charts, bar
charts, scatter plots, and more.
Example:
import plotly.express as px
import pandas as pd
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 14, 12, 18]})  # Sample data
fig = px.scatter(df, x="x", y="y", title="Interactive Plotly Chart")
fig.show()
Bokeh: Another interactive visualization library, great for creating complex
visualizations for web applications.
Example (a minimal sketch):
from bokeh.plotting import figure, show
p = figure(title="Interactive Bokeh Chart")
p.line([1, 2, 3, 4], [4, 6, 5, 8], line_width=2)  # Simple line glyph
show(p)
Dashboards combine multiple visualizations and controls (like sliders, dropdowns, and
buttons).
Widgets let users adjust filters or parameters to explore data in real time.
Example: A dashboard with a slider to update a graph dynamically.
9.3 Exporting Interactive Visualizations
You can save interactive visualizations as HTML files or embed them in web apps.
This allows others to interact with the visualizations without needing Python installed.
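For example, reusing the fig object from the Plotly example above:
fig.write_html("interactive_chart.html")  # Self-contained HTML; opens in any browser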
Data analysis techniques help to extract meaningful insights from data. Below are some
essential techniques used in data analysis, including Exploratory Data Analysis (EDA), Time
Series Analysis, and Statistical Analysis.
Exploratory Data Analysis (EDA) is the process of analyzing datasets to summarize their
main characteristics, often using statistical graphics and plots.
Analyze basic statistics like mean, median, standard deviation, and distributions.
Use tools like Pandas to calculate these metrics.
Example:
import pandas as pd
df = pd.read_csv("data.csv")
print(df.describe())
10.2 Detecting and Handling Outliers
Outliers can skew results and need attention.
Methods:
■ Visualization: Box plots and scatter plots.
■ Statistical: Use Z-scores or the IQR (Interquartile Range) method.
Outliers can be removed or transformed based on the analysis.
For time series data, resampling aggregates observations to a new frequency, for example monthly averages:
df.resample('M').mean()
11.3 Visualizing Trends, Seasonality, and Anomalies
Correlation: Measure how two variables move together (e.g., Pearson's correlation).
Regression: Predict the value of one variable based on another.
Confidence intervals give a range of values likely to include the true parameter.
Example: "We are 95% confident that the mean lies between X and Y."
Data preprocessing converts raw data into a clean, structured format, enhancing its
quality and making it suitable for analysis or modeling. It involves handling missing values,
correcting inconsistent formats, removing irrelevant features, and scaling or encoding variables.
These steps improve the accuracy and reliability of machine learning models.
Linear regression is used for predicting a continuous dependent variable based on one or
more independent variables. It fits a straight line (or a hyperplane for multiple variables) to the
data. Key metrics for evaluation include the mean squared error (MSE), root mean squared error (RMSE), and the coefficient of determination (R²).
Random Forests: Combine multiple decision trees (an ensemble method) to reduce
overfitting and improve accuracy. Features are randomly sampled for each tree,
increasing model diversity.
Automating reports with visualizations involves generating pre-defined plots and tables
that are dynamically updated. Libraries like Matplotlib and Plotly are often integrated
into workflows for this purpose.
Vaex and Modin offer fast, scalable dataframe operations. Vaex is optimized for
out-of-core processing, while Modin provides a pandas-like interface with parallel computation.
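As a sketch, Modin is typically adopted by swapping the import; the file name below is a placeholder:
import modin.pandas as pd  # Drop-in replacement for the pandas API
df = pd.read_csv("large_dataset.csv")  # Work is partitioned across CPU cores
print(df.describe())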
17.3 Introduction to Spark DataFrames in Jupyter
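A minimal PySpark sketch for a notebook session; the file name and the 'category' column are placeholders:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("JupyterSpark").getOrCreate()  # Local Spark session
sdf = spark.read.csv("large_dataset.csv", header=True, inferSchema=True)
sdf.printSchema()
sdf.groupBy("category").count().show()  # Example aggregation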
ARIMA models are used for forecasting based on time series data, incorporating trends
and seasonality. Prophet simplifies forecasting with intuitive parameters and handles missing
data effectively.
Logistic regression predicts probabilities for binary outcomes. It uses a sigmoid function
to map predictions to a [0, 1] range and is evaluated using metrics like AUC-ROC.
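A minimal sketch with scikit-learn, using synthetic data (all names here are illustrative):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=200, n_features=4, random_state=0)  # Synthetic binary data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]  # Sigmoid output in the [0, 1] range
print("AUC-ROC:", roc_auc_score(y_test, probs))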
GLMs extend linear models by allowing non-normal response distributions and link
functions. Common examples include Poisson regression for count data and logistic regression
for binary classification.
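A minimal Poisson regression sketch with statsmodels, on simulated count data:
import numpy as np
import statsmodels.api as sm
rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(100, 1))               # One predictor
y = rng.poisson(lam=np.exp(0.5 + 1.2 * X[:, 0]))   # Counts with a log-linear mean
model = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson()).fit()
print(model.summary())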
Introduction
This document provides a detailed analysis of retail sales trends and forecasts for the UK
retail sector for the years 2025 and 2026. It was prepared by the Centre for Retail Research
(CRR) on 28 November 2024, considering the economic and political landscape shaped by the
2024 General Election and subsequent policies.
Background
The Centre for Retail Research (CRR) has over 15 years of experience in economic
forecasting for retail businesses. Led by Prof. Joshua Bamfield, CRR adopts a dynamic model of
the economy, emphasizing human behavior’s role in shaping economic outcomes.
Current Retail Sales
The total volume of retail sales for 2023 was £427.267 billion, reflecting sales
through both physical stores and online platforms, excluding automotive fuel. Sales
volumes have been adjusted to eliminate inflationary effects, ensuring real-term
comparisons.
● 2022: Retail sales fell by 4.6%.
● 2023: A further decline of 2.8%.
● 2024: Preliminary data suggests a modest decline of 0.2%, with late-year
improvements offset by inflation and rising energy costs.
Retail sales are projected to experience further declines, though less severe than in
2022-2023.
Challenges:
Opportunities:
Conclusion
While the UK retail sector faces short-term challenges, strategic investments and
a focus on sustainability may drive long-term growth. However, the forecast underscores
the importance of adapting to economic and geopolitical shifts to ensure resilience in the
retail industry.
Background
1. Problem Statement
2. Data Overview
Dataset: The data was sourced from a public repository (e.g., Kaggle), containing
customer details such as:
Data Preprocessing:
3. Methodology
● Account length
● Total day minutes
● Total international calls
● Monthly charges
1. Random Forest (RF): An ensemble learning method using multiple decision
trees.
2. XGBoost: An optimized gradient-boosting framework for better performance.
3. Hybrid Model: A combination of RF and XGBoost using a voting mechanism to
unify predictions.
4. Results
● Random Forest:
○ Accuracy: 95.32%
○ TPR: 80.45%
○ TNR: 97.86%
● XGBoost:
○ Accuracy: 95.68%
○ TPR: 81.25%
○ TNR: 98.12%
● Hybrid Model:
○ Accuracy: 95.92%
○ TPR: 81.60%
○ TNR: 98.45%
5. Business Impact
6. Conclusion
The hybrid predictive model combining Random Forest and XGBoost offers a
robust solution for customer churn prediction. By leveraging machine learning,
telecommunications companies can optimize customer retention strategies, improve
profitability, and gain a competitive edge.
Future Work:
Further enhancements can include:
Problem Statement:
Dataset Description:
● Historical price data for multiple assets (e.g., stocks, bonds, ETFs, commodities).
● Key metrics such as daily returns, annualized returns, and volatility.
● Macro-economic indicators influencing financial markets (e.g., interest rates,
inflation rates).
Results:
Future Work:
This phase involves sourcing data from APIs, databases, or web scraping, followed by
handling missing values, outliers, and duplicates for clean datasets.
EDA identifies trends, correlations, and anomalies using visualizations like histograms,
scatter plots, and heatmaps, providing insights for model building.
Models are trained and validated, and insights are presented through dashboards or
reports, often accompanied by actionable recommendations.
Magic commands improve workflow efficiency. %timeit measures code execution time,
%matplotlib inline embeds plots, and %run executes external scripts.
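For instance (analysis_script.py is a placeholder):
%timeit sum(range(1000))   # Measure execution time of a statement
%matplotlib inline         # Render plots inside the notebook
%run analysis_script.py    # Execute an external script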
Customization involves adjusting notebook appearance with themes and layouts, making
them more visually appealing for presentations.
Ipywidgets enable interactive elements like sliders, dropdowns, and buttons, making data
analysis dynamic and user-friendly.
Input forms allow users to filter or adjust parameters in real-time, enhancing the
flexibility of analysis and visualization.
Widgets can be linked with plots to create interactive dashboards, providing a seamless
exploratory experience.
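A minimal sketch with ipywidgets; the data values are illustrative, and moving the slider reruns the function:
import ipywidgets as widgets
from ipywidgets import interact

@interact(threshold=widgets.IntSlider(min=0, max=100, value=50))
def show_above(threshold):
    data = [12, 45, 67, 23, 89, 54]  # Illustrative values
    print([x for x in data if x > threshold])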
Jupyter notebooks can be exported in various formats for sharing, using tools like
nbconvert or third-party plugins.
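For example, from a terminal (analysis.ipynb is a placeholder):
jupyter nbconvert --to html analysis.ipynb
jupyter nbconvert --to pdf analysis.ipynb   # PDF export requires a LaTeX installation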
Git and GitHub enable collaborative notebook development with version control,
facilitating teamwork and tracking changes.
JupyterHub allows hosting notebooks for teams on shared servers, while Google Colab
provides a cloud-based solution with free GPU support.
LAB
Beginner-Level Lab Exercises
1: Getting Started with Jupyter Notebook
Ex2: Write a markdown cell with formatted text, including bold, italics, and bullet points.
In a Markdown cell, write the following content:
# Simple Markdown Example
**Bold Text**
*Italic Text*
Here’s a list:
- First item
- Second item
- Third item
When you run the Markdown cell (by pressing Shift + Enter), it will display:
Here's a simple Python program to calculate the factorial of a number using a for loop:
Code:
number = int(input("Enter a number: "))  # Input: get a number from the user
factorial = 1
for i in range(1, number + 1):  # Loop: multiply integers from 1 to the number
    factorial *= i
print(f"The factorial of {number} is {factorial}")  # Output: display the result
How It Works:
○ Input: The user provides a number (e.g., 5).
○ Loop: The program multiplies all integers from 1 to the given number.
○ For 5, the loop computes 1 * 2 * 3 * 4 * 5.
Output:
The result is displayed as the factorial.
Ex4: Use a while loop to generate the Fibonacci sequence up to a given number.
Code:
max_value = int(input("Enter the maximum value for the Fibonacci sequence: "))  # Input
a, b = 0, 1  # Initialization: start the sequence with 0 and 1
while a <= max_value:  # Generate numbers until they exceed the maximum
    print(a)
    a, b = b, a + b
How It Works:
1. Input: The user provides the maximum value for the Fibonacci sequence
(e.g., 20).
2. Initialization: Start the sequence with a = 0 and b = 1.
3. While Loop: Continue generating the next Fibonacci number (a + b) until
it exceeds the maximum value.
Output:
Print each number in the sequence.
Exploring a Dataset
1. Demographics: Name, nationality, city, latitude, longitude, gender, and ethnic
group.
2. Academic Grades: English, Math, Science, and Language grades.
3. Ratings: Portfolio, cover letter, and recommendation letter ratings.
4. Age: Age of the student.
Missing Data:
• Missing values in numeric columns were replaced with the mean, and in
non-numeric columns, with the mode.
This dataset is designed to analyze student performance and identify patterns across
demographics and academic factors.
Ex5: Load a CSV file into a Pandas DataFrame and display the first 10 rows.
Code:
import pandas as pd  # Import pandas for working with tabular data
df = pd.read_csv('sample_dataset.csv')  # Load the CSV file into a DataFrame
print(df.head(10))  # Display the first 10 rows
Explanation:
1. import pandas as pd: This imports the Pandas library, which is essential for
working with tabular data like CSV files.
2. pd.read_csv(): Reads the CSV file into a DataFrame.
3. df.head(10): Displays the first 10 rows of the DataFrame. The head() method
allows you to specify the number of rows to view. If no number is passed, it
defaults to 5.
Output:
Ex6: Count the number of rows and columns, and display the column data types.
Code:
import pandas as pd  # Import pandas library for working with CSV files
df = pd.read_csv('sample_dataset.csv')  # Load the dataset
rows, columns = df.shape  # Tuple of (rows, columns)
print(f"Number of rows: {rows}")
print(f"Number of columns: {columns}")
print(df.dtypes)  # Data type of each column
Explanation:
1. df.shape: This returns a tuple containing the number of rows and columns in the
DataFrame.
2. print(f"Number of rows: {rows}"): Prints the number of rows using the first
element of the tuple.
3. print(f"Number of columns: {columns}"): Prints the number of columns using
the second element of the tuple.
4. df.dtypes: Displays the data type of each column in the DataFrame (e.g., int64,
float64, object).
Output:
Data Cleaning
Ex7: Identify and replace missing values in a dataset with:
The mean for numeric columns.
Code:
import pandas as pd
import numpy as np
print("Original Dataset:")
print(df)
Explanation:
1. Imports necessary libraries: pandas for data handling, numpy for handling missing
values.
2. Creates a dataset with some missing values (np.nan for Age and Salary).
3. Identifies missing values using isnull().sum() to count them.
4. Fills missing values in Age and Salary columns with their respective column
means using fillna().
5. Displays the updated dataset after filling the missing values.
This approach is used to handle missing data by replacing it with the mean of the column.
Output:
Code:
import pandas as pd
print("Original Dataset:")
print(df)
Output:
3: Visualization Basics
Code:
import matplotlib.pyplot as plt
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
temperature = [18, 20, 22, 21, 19, 23, 24]  # Sample temperatures (°C)
plt.plot(days, temperature, color='b', marker='o', linestyle='-')  # Blue line, circular markers
plt.title('Temperature Changes Over a Week')
plt.xlabel('Day')
plt.ylabel('Temperature (°C)')
plt.grid(True)
plt.tight_layout()
plt.show()
Explanation:
This code plots a line graph showing temperature changes over a week, with days
on the x-axis and temperature (°C) on the y-axis. The line is styled with blue color ('b'),
circular markers ('o'), and a solid line ('-') for clarity. Titles, labels, and gridlines are
added to enhance readability. Finally, the graph is displayed with plt.show(), ensuring a
clean layout with plt.tight_layout().
Output:
Ex10: Create a bar chart to compare the sales of three products in different regions.
Code:
import numpy as np
import matplotlib.pyplot as plt
regions = ["North", "South", "East", "West"]  # Data for the bar chart
products = ["Product A", "Product B", "Product C"]
sales = {
    "Product A": [200, 150, 300, 250],
    "Product B": [180, 130, 270, 220],
    "Product C": [210, 160, 310, 260]
}
x = np.arange(len(regions))  # Bar positions for each region
width = 0.25  # Width of each bar
for i, product in enumerate(products):
    plt.bar(x + i * width, sales[product], width, label=product)
plt.xticks(x + width, regions)
plt.xlabel('Region')
plt.ylabel('Sales')
plt.title('Product Sales by Region')
plt.legend()
plt.grid(axis='y')
plt.show()
Explanation:
This code creates a grouped bar chart to compare sales of three products (Product
A, B, and C) across four regions (North, South, East, West). It uses np.arange to set bar
positions and plt.bar to plot each product's sales with different colors. Labels, title, and
legend are added for clarity, and the y-axis grid is enabled for easier reading. Finally, the
chart is displayed with plt.show().
Output:
Using Seaborn for Advanced Visualizations
Ex11: Use Seaborn to create a histogram of a numeric column from a dataset.
Code:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({'Age': [22, 25, 27, 30, 31, 35, 38, 40, 42, 45, 48, 50]})  # Sample ages
sns.histplot(df['Age'], bins=8, kde=True, color='skyblue')  # Histogram with KDE overlay
plt.title('Distribution of Ages')
plt.xlabel('Age')
plt.ylabel('Count')
plt.grid(True)
plt.show()
Explanation:
This code creates a histogram to visualize the distribution of ages in a dataset
using Seaborn's histplot function, with 8 bins and a kernel density estimate (KDE)
overlay. It customizes the appearance with a skyblue color, labels, and a grid for better
readability. The chart is displayed with appropriate titles and axis labels using plt.show().
Output:
Ex12: Create a boxplot to visualize the distribution of sales by product category.
Code:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({
    'Category': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Sales': [200, 220, 250, 180, 210, 190, 300, 320, 310]
})  # Sample sales by product category
sns.boxplot(x='Category', y='Sales', data=df, palette='Set2')
plt.xlabel('Product Category')
plt.ylabel('Sales')
plt.title('Sales Distribution by Product Category')
plt.grid(axis='y')
plt.show()
Explanation:
This code creates a boxplot to visualize the distribution of sales across different
product categories (A, B, C) using Seaborn. It uses the "Set2" color palette for styling and
adds axis labels and a title for clarity. The grid is enabled on the y-axis for better
readability, and the plot is displayed with plt.show().
Output:
Intermediate-Level Exercises
4: Data Aggregation and Grouping
Code:
import pandas as pd
data = {'Department': ['HR', 'IT', 'HR', 'IT', 'Sales'],
        'Salary': [50000, 70000, 55000, 75000, 60000]}  # Sample dataset
df = pd.DataFrame(data)
total_salary_by_department = df.groupby('Department')['Salary'].sum()  # Calculate total salary by department
print(total_salary_by_department)
Explanation:
1. Dataset Creation: The dataset is defined as a dictionary and converted into a pandas
DataFrame for analysis.
2. Grouping: df.groupby('Department') groups the rows by department.
3. Aggregation: .sum() totals the salaries within each group.
4. Output: The total salary per department is printed.
Output:
Ex14: Group sales data by region and year, and calculate the average revenue.
Code:
import pandas as pd
data = {'Region': ['North', 'North', 'South', 'South'],
        'Year': [2022, 2023, 2022, 2023],
        'Revenue': [1000, 1200, 900, 950]}  # Sample sales data
df = pd.DataFrame(data)
avg_revenue = df.groupby(['Region', 'Year'])['Revenue'].mean()  # Average revenue per region and year
print(avg_revenue)
Output:
Advanced Operations
Ex15: Create a pivot table showing total sales for each product category in each region.
Code:
import pandas as pd
data = {'Region': ['North', 'North', 'South', 'South'],
        'Category': ['A', 'B', 'A', 'B'],
        'Sales': [100, 150, 200, 120]}  # Sample sales data
df = pd.DataFrame(data)
pivot = pd.pivot_table(df, values='Sales', index='Region', columns='Category', aggfunc='sum')
print(pivot)
Explanation:
1. Dataset: Sample sales records are converted into a pandas DataFrame.
2. Pivot Table: pd.pivot_table() aggregates Sales with aggfunc='sum', indexed by Region with one column per Category.
3. Output:
● The pivot table shows the total sales for each product category in each
region.
Output:
Ex16: Add a new column to calculate the percentage contribution of each product to total sales.
Code:
import pandas as pd
df = pd.DataFrame({'Product': ['A', 'B', 'C'], 'Sales': [200, 300, 500]})  # Sample data
total_sales = df['Sales'].sum()  # Total sales across all rows
df['Pct_Contribution'] = df['Sales'] / total_sales * 100  # Percentage of total sales
print(df)
Explanation:
Total Sales:
● The sum() function calculates the total sales across all rows.
Percentage Contribution:
● Each product's sales figure is divided by the total and multiplied by 100 to create the new column.
Output:
Creating a subplot with two charts — a line chart for sales trends over time and a bar chart for
monthly revenue:
Code:
import matplotlib.pyplot as plt
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
sales = [250, 300, 350, 400, 450, 500]  # Line chart data
revenue = [4000, 4500, 5000, 5500, 6000, 6500]  # Bar chart data
fig, axs = plt.subplots(2, 1)  # 2 rows, 1 column of plots
axs[0].plot(months, sales, marker='o')  # Line chart: sales trend
axs[0].set_title('Sales Trends Over Time')
axs[0].set_ylabel('Sales')
axs[0].grid(True)
axs[1].bar(months, revenue)  # Bar chart: monthly revenue
axs[1].set_title('Monthly Revenue')
axs[1].set_ylabel('Revenue')
plt.tight_layout()  # Ensure the subplots don't overlap
plt.show()
Explanation:
1. Data:
● months: Represents the x-axis labels.
● sales and revenue: Represent y-values for the line and bar charts,
respectively.
2. Subplots:
● plt.subplots(2, 1): Creates a figure with 2 rows and 1 column of
plots.
● axs[0]: Refers to the first subplot (line chart).
● axs[1]: Refers to the second subplot (bar chart).
3. Line Chart:
● Plots sales trends over time with plot().
● Includes a title, labels, and grid.
4. Bar Chart:
● Plots monthly revenue with bar().
5. plt.tight_layout():
● Ensures there’s no overlap between subplots.
Output:
● The first chart shows a line plot of sales trends over months.
● The second chart shows a bar chart of monthly revenue.
Ex18: Overlay a scatter plot on a line chart to show individual sales transactions along a
trend line.
To overlay a scatter plot on a line chart, we use the plot() function to draw the trend line and the
scatter() function to display individual data points. Below is the Python implementation:
Code:
import matplotlib.pyplot as plt
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']  # Data
sales_trend = [250, 300, 350, 400, 450, 500]  # Trend line data
sales_transactions = [260, 290, 340, 410, 460, 490]  # Individual sales transactions
plt.plot(months, sales_trend, color='blue', marker='o', label='Sales Trend')  # Trend line
plt.scatter(months, sales_transactions, color='red', s=100, label='Transactions')  # Data points
plt.xlabel('Month')
plt.ylabel('Sales')
plt.title('Sales Transactions Along the Trend Line')
plt.legend()
plt.grid(True)
plt.show()
Explanation:
1.Data:
● months: Represents the time period on the x-axis.
● sales_trend: Represents the overall sales trend over the months.
● sales_transactions: Represents specific sales transactions as individual
data points.
2. Visualization:
● The line chart is plotted using plt.plot() to show the overall sales trend.
● The scatter plot is overlaid using plt.scatter() to highlight individual sales
transactions.
3. Customization:
● The line is colored blue with circular markers for better visualization.
● Scatter points are colored red and slightly larger (s=100) for emphasis.
● Labels, legend, and grid are added to make the plot easy to interpret.
Output:
To visualize a correlation matrix with annotations, we use Seaborn’s heatmap() function. Below
is the implementation:
Code:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
data = {'A': [1, 2, 3, 4, 5],
        'B': [2, 4, 6, 8, 10],
        'C': [5, 3, 4, 2, 1]}  # Sample dataset
df = pd.DataFrame(data)
corr = df.corr()  # Compute the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5, cbar_kws={'shrink': 0.8})
plt.title('Correlation Matrix')
plt.show()
Explanation:
1. Data:
• A sample dataset is created using a dictionary and converted into a
pandas DataFrame.
• The correlation matrix is computed using df.corr().
2. Seaborn Heatmap:
• The sns.heatmap() function is used to visualize the correlation matrix.
• annot=True: Displays the correlation values within the cells.
• cmap="coolwarm": Uses a diverging colormap to highlight positive and
negative correlations.
• fmt=".2f": Limits the correlation values to 2 decimal places.
• linewidths=0.5: Adds grid lines between cells.
• cbar_kws={"shrink": 0.8}: Shrinks the color bar for better fit.
3. Customization:
• A figure size of 8x6 is set using plt.figure().
• The plot includes a title for clarity.
Output:
A stacked area chart visualizes cumulative data for multiple categories over a specific time
period. Below is the implementation:
Code:
import matplotlib.pyplot as plt
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
product_A = [100, 120, 140, 160, 180, 200]  # Sample sales per product
product_B = [80, 90, 110, 120, 140, 150]
product_C = [60, 70, 85, 95, 105, 120]
plt.stackplot(months, product_A, product_B, product_C,
              labels=['Product A', 'Product B', 'Product C'],
              colors=['skyblue', 'lightgreen', 'salmon'])  # Cumulative stacked areas
plt.xlabel('Month')
plt.ylabel('Sales')
plt.title('Cumulative Sales by Product')
plt.legend(loc='upper left')
plt.show()
Explanation:
1. Data:
● months: Represents the time period on the x-axis.
● product_A, product_B, product_C: Represent the sales of three products
over the months.
2. Visualization:
● plt.stackplot() is used to create the stacked area chart.
● Each product’s sales are stacked on top of the others to show cumulative
sales over time.
3. Customization:
● labels: Adds a legend to identify each product.
● colors: Specifies colors for each product’s area.
● The chart includes labels for the x-axis, y-axis, and a title.
Output:
Code:
import pandas as pd
df = pd.read_csv('sample_dataset.csv')  # Load data
df['date_column'] = pd.to_datetime(df['date_column'])  # Convert to datetime
df['year'] = df['date_column'].dt.year  # Extract year
df['month'] = df['date_column'].dt.month  # Extract month
df['day'] = df['date_column'].dt.day  # Extract day
print(df.head())  # View the changes
Explanation:
1. Load Data: Reads a CSV file into a DataFrame with pd.read_csv().
2. Convert Dates: Converts date_column to a datetime object using
pd.to_datetime().
3. Extract Components: Extracts year, month, and day into separate
columns with .dt.
4. View Changes: Displays the updated DataFrame with print(df.head()).
5. Reusable: Works for any dataset with a date column.
Output:
Ex22: Resample a time series dataset to calculate monthly averages.
Code:
import pandas as pd
df = pd.read_csv('sample_dataset.csv')  # Read the sample CSV file
df['date_column'] = pd.to_datetime(df['date_column'])  # Ensure proper datetime format
df.set_index('date_column', inplace=True)  # Index by date for time-based operations
monthly_avg = df['temperature'].resample('M').mean()  # Compute monthly averages
print(monthly_avg)
Explanation:
1. Load Data: Reads the CSV file into a DataFrame.
2. Convert Dates: pd.to_datetime() ensures a proper datetime format.
3. Index by Date: set_index() enables time-based operations.
4. Resample: .resample('M').mean() computes the average temperature for each month.
Output:
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dates = pd.date_range(start='2023-01-01', periods=60, freq='D')  # Sample date range
np.random.seed(0)
sales = np.random.randint(100, 200, size=len(dates))  # Sample daily sales
df = pd.DataFrame({'date': dates, 'sales': sales})
df['date'] = pd.to_datetime(df['date'])  # Convert to datetime
df.set_index('date', inplace=True)  # Set the date as the index
df['rolling_avg'] = df['sales'].rolling(window=7).mean()  # 7-day rolling mean
plt.plot(df.index, df['sales'], label='Daily Sales')
plt.plot(df.index, df['rolling_avg'], label='7-Day Rolling Average')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.grid(True)
plt.show()
Explanation:
1. Sample Data: A date range and corresponding sales data are generated.
2. Convert Dates: The date column is converted to a datetime format and set as the
DataFrame index.
3. Rolling Average: A 7-day rolling mean is calculated using
.rolling(window=7).mean().
4. Plotting: Both the daily sales and the rolling average are plotted with labels and
gridlines for clarity.
Output:
Ex24: Highlight weekends and holidays on a sales time series plot using custom markers.
Code:
import pandas as pd
import matplotlib.pyplot as plt
dates = pd.date_range(start='2023-01-01', periods=31, freq='D')  # Sample month of data
df = pd.DataFrame({'sales': range(100, 131)}, index=dates)
holidays = pd.to_datetime(['2023-01-02', '2023-01-16'])  # Example holiday dates
weekends = df.index[df.index.dayofweek >= 5]  # Saturdays and Sundays
plt.plot(df.index, df['sales'], label='Sales')
plt.scatter(weekends, df.loc[weekends, 'sales'], color='orange', marker='o', label='Weekends')
plt.scatter(holidays, df.loc[holidays, 'sales'], color='red', marker='x', s=100, label='Holidays')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.grid(True)
plt.show()
Explanation:
Weekends are identified with df.index.dayofweek >= 5 and holidays come from an explicit date list; both are overlaid on the sales line with plt.scatter() using distinct colors and markers.
Output:
Advanced-Level Exercises:
Advanced Cleaning
Ex25: Load a messy dataset with inconsistent date formats, standardize the date
format across the dataset, and split full names into separate "First Name" and "Last Name"
columns.
Code:
import pandas as pd
data = { 'Full Name': ['John Doe', 'Jane Smith', 'Robert Brown', 'Emily Davis'],
'Date': ['2023/01/05', '05-02-2022', 'March 3, 2021', '2020.04.15']} # Sample
messy dataset
df = pd.DataFrame(data)
def clean_data(df):  # Function to clean data
    df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
    df[['First Name', 'Last Name']] = df['Full Name'].str.split(' ', n=1, expand=True)
    df.drop(columns=['Full Name'], inplace=True)
    return df
df_cleaned = clean_data(df)
print("Cleaned Dataset:")
print(df_cleaned)
Explanation:
Create a sample dataset with inconsistent date formats and a single "Full Name"
column. Use pd.to_datetime() to convert all date formats into a consistent format
(default: yyyy-mm-dd). Use str.split() to split the "Full Name" column into "First Name"
and "Last Name". Drop the original "Full Name" column after splitting. Output the
cleaned dataset with standardized dates and separate name columns.
Output:
Ex26: Remove outliers from a numeric column using the IQR method.
Code:
import pandas as pd
import numpy as np
data_numeric = {'Values': [10, 15, 14, 102, 13, 12, 16, 17, 108, 11]}
df_numeric = pd.DataFrame(data_numeric)
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)  # First quartile (25th percentile)
    Q3 = df[column].quantile(0.75)  # Third quartile (75th percentile)
    IQR = Q3 - Q1  # Interquartile range
    df = df[(df[column] >= (Q1 - 1.5 * IQR)) & (df[column] <= (Q3 + 1.5 * IQR))]  # Filter outliers
    return df
df_no_outliers = remove_outliers(df_numeric, 'Values')
print("Data after outlier removal:")
print(df_no_outliers)
Explanation:
Compute the first (Q1) and third (Q3) quartiles and the interquartile range IQR = Q3 - Q1, then keep only values within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. Here the extreme values 102 and 108 fall outside that range and are removed.
Output:
Data Transformation
Ex27: Normalize numeric columns to a range of 0–1 using MinMaxScaler from sklearn.
Code:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
data_to_normalize = {'Feature1': [10, 20, 30, 40, 50], 'Feature2': [5, 10, 15, 20, 25]}
df_normalize = pd.DataFrame(data_to_normalize)
def normalize_data(df, columns):
    scaler = MinMaxScaler()  # Initialize MinMaxScaler
    df[columns] = scaler.fit_transform(df[columns])  # Apply MinMax scaling to the columns
    return df
df_normalized = normalize_data(df_normalize, ['Feature1', 'Feature2'])
print("Normalized Data:")
print(df_normalized)
Explanation:
MinMaxScaler rescales each column to the 0–1 range using (x - min) / (max - min), so the smallest value maps to 0 and the largest to 1.
Output:
Code:
Explanation:
Ex29: Load a dataset, split it into training and testing sets, and standardize numeric
features.
Code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
data_ml = {
'Feature1': [10, 20, 30, 40, 50, 60],
'Feature2': [5, 10, 15, 20, 25, 30],
'Target': [1, 0, 1, 0, 1, 0]
}
df_ml = pd.DataFrame(data_ml)
def preprocess_ml_data(df, target):
    X = df.drop(columns=[target])  # Separate features from the target
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # Split data
    scaler = StandardScaler()  # Initialize StandardScaler
    X_train = scaler.fit_transform(X_train)  # Standardize training data
    X_test = scaler.transform(X_test)  # Standardize testing data
    return X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = preprocess_ml_data(df_ml, 'Target')
print("Standardized Training Data:")
print(X_train)
print("Standardized Testing Data:")
print(X_test)
Explanation:
Create a dataset with features and a target variable for machine learning. Use
train_test_split() to split the data into training and testing sets. Use StandardScaler to
standardize the numeric features so they have a mean of 0 and a standard deviation of 1.
Output the standardized training and testing data for use in machine learning models.
Output:
Ex30: Train a linear regression model to predict house prices and display:
Model coefficients.
Mean squared error (MSE) on the test set.
Dataset:
size,bedrooms,age,price
1500,3,10,300000
1800,4,15,350000
2400,4,20,400000
3000,5,5,500000
3500,5,8,550000
2200,3,12,370000
2000,3,10,320000
Code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
data = pd.read_csv('house_prices.csv')
X = data[['size', 'bedrooms', 'age']]  # Define feature variables
y = data['price']  # Define the target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # Split data
model = LinearRegression()  # Train the model
model.fit(X_train, y_train)
y_pred = model.predict(X_test)  # Predictions
print("Model Coefficients:", model.coef_)  # Model coefficients
mse = mean_squared_error(y_test, y_pred)  # Mean squared error
print("Mean Squared Error:", mse)
Explanation:
Create a dataset with features (size, bedrooms, age) and target (price) for house
prices. Use train_test_split() to create training and testing sets. Train a Linear Regression
model using the training data (model.fit()). Use the trained model to predict house prices
on the testing set. Calculate the Mean Squared Error (MSE), then display the model
coefficients and the MSE to assess the model's predictive ability.
Output:
Clustering
Ex31: Use K-Means clustering to segment customers based on purchase behavior and
visualize the clusters using Seaborn.
Dataset:
purchase_freq,spending
5,200
10,500
2,150
15,600
7,400
3,250
8,450
20,1000
4,300
6,350
Code:
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
data = pd.read_csv('customer_data.csv')
X = data[['purchase_freq', 'spending']]# Feature selection (e.g., purchase behavior)
kmeans = KMeans(n_clusters=3, random_state=42)# Apply KMeans
data['Cluster'] = kmeans.fit_predict(X)
sns.scatterplot(x='purchase_freq', y='spending', hue='Cluster', data=data, palette='viridis')
plt.title('Customer Segmentation')# Plot clusters using Seaborn
plt.show()
Explanation:
KMeans with n_clusters=3 groups customers by purchase frequency and spending. fit_predict() assigns each row a cluster label, which the Seaborn scatter plot colors via hue='Cluster'.
Output:
Performance Optimization
Code:
import pandas as pd
import numpy as np
Explanation:
Ex33: Use Dask to process a large dataset that exceeds memory limits.
Dataset:
A,B
0.1,0.2
0.3,0.4
0.5,0.6
0.7,0.8
0.9,1.0
1.1,1.2
1.3,1.4
1.5,1.6
Code:
import dask.dataframe as dd
data = dd.read_csv('large_dataset.csv')# Load a large dataset with Dask
mean_value = data['B'].mean().compute()# Perform operations (eg:calculate mean
of a column)
print("Mean:", mean_value)
Explanation:
dd.read_csv() reads the file lazily in partitions, so datasets larger than memory can be processed. Operations build a task graph, and .compute() triggers the actual computation and returns the result.
Output:
Automating Reports
Ex34: Generate a summary report for a dataset using nbconvert to export the notebook as a PDF
or HTML file.
Code:
!jupyter nbconvert --to html report.ipynb # Export the notebook as HTML (report.ipynb is a placeholder name)
!jupyter nbconvert --to pdf report.ipynb # PDF export requires a LaTeX installation
Explanation:
Output:
Ex35: Analyze a dataset with fields like Order ID, Customer ID, Product, Quantity, Price,
and Region:
Identify the top 10 customers by revenue.
Calculate average sales per product category.
Visualize monthly sales trends using Matplotlib.
Code:
import pandas as pd
import matplotlib.pyplot as plt
data = {
'Order ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Customer ID': [101, 102, 103, 101, 104, 105, 102, 106, 107, 108],
'Product': ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'A', 'B', 'C'],
'Quantity': [5, 3, 4, 2, 5, 6, 4, 3, 2, 7],
'Price': [20, 30, 20, 15, 30, 15, 20, 20, 30, 15],
'Region': ['North', 'South', 'East', 'North', 'West', 'South', 'East', 'West', 'North',
'East'],
'Date': pd.to_datetime(['2023-01-15', '2023-02-10', '2023-02-20', '2023-03-05',
'2023-03-10', '2023-04-12', '2023-05-14', '2023-06-20', '2023-07-10',
'2023-08-02'])
}
df = pd.DataFrame(data)# Create a DataFrame
df['Revenue'] = df['Quantity'] * df['Price']# Calculate the Revenue (Quantity *
Price)
top_10_customers = df.groupby('Customer ID')['Revenue'].sum().nlargest(10)  # Top 10 customers by revenue
average_sales_per_product = df.groupby('Product')['Revenue'].mean()  # Average revenue per product
df['Month'] = df['Date'].dt.to_period('M')  # Extract the month from the date
monthly_sales = df.groupby('Month')['Revenue'].sum()  # Total revenue per month
print("Average Sales per Product:")
print(average_sales_per_product)
fig, axs = plt.subplots(2, 1, figsize=(8, 8))  # Subplots for the two charts
monthly_sales.plot(kind='line', marker='o', ax=axs[0], title='Monthly Sales Trend')
axs[0].set_ylabel('Revenue')
top_10_customers.plot(kind='bar', ax=axs[1], title='Top 10 Customers by Revenue')
axs[1].set_ylabel('Revenue')
plt.tight_layout()
plt.show()
Explanation:
A DataFrame simulating sales data (Order ID, Customer ID, Product, Quantity,
Price, and Region) is created. A new column Revenue is computed as
df['Quantity'] * df['Price'].
df.groupby('Customer ID')['Revenue'].sum().nlargest(10) groups the data by
customer and extracts the top 10 revenue-generating customers.
df.groupby('Product')['Revenue'].mean() computes average revenue per product
category. df['Date'].dt.to_period('M') extracts the month from the date column, and
df.groupby('Month')['Revenue'].sum() aggregates total revenue per month.
plt.subplots() creates subplots for the monthly sales trend (line plot) and the top 10
customers by revenue (bar chart), and plt.show() displays the plots.
Output:
Code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({
    'Customer ID': [1, 2, 3, 4, 5, 6],
    'Last Purchase Date': pd.to_datetime(['2023-12-01', '2023-06-15', '2023-11-20',
                                          '2023-01-10', '2023-09-05', '2023-03-22']),
    'Purchase Frequency': [12, 3, 8, 1, 6, 2],
    'Monetary Value': [500, 120, 350, 60, 280, 90]
})  # Sample RFM inputs
current_date = pd.to_datetime('2024-01-01')
df['Recency'] = (current_date - df['Last Purchase Date']).dt.days  # Days since last purchase
df['R'] = pd.qcut(df['Recency'], 3, labels=['High', 'Medium', 'Low'])  # Most recent = High
df['F'] = pd.qcut(df['Purchase Frequency'], 3, labels=['Low', 'Medium', 'High'])
df['M'] = pd.qcut(df['Monetary Value'], 3, labels=['Low', 'Medium', 'High'])
def label_segment(row):  # Combine R/F/M scores into a segment label
    if row['F'] == 'High' and row['M'] == 'High':
        return 'High-Value'
    if row['F'] == 'Low' and row['M'] == 'Low':
        return 'Low-Value'
    return 'Medium-Value'
df['Segment'] = df.apply(label_segment, axis=1)
counts = df['Segment'].value_counts()
fig, axs = plt.subplots(1, 2, figsize=(10, 4))
counts.plot(kind='bar', ax=axs[0], title='Customer Segment Distribution')  # Bar chart
counts.plot(kind='pie', ax=axs[1], autopct='%1.0f%%', title='Segment Proportions')  # Pie chart
plt.tight_layout()
plt.show()
Explanation:
A DataFrame with Customer ID, Last Purchase Date, Purchase Frequency, and
Monetary Value is created. pd.to_datetime() ensures correct date calculations.
(current_date - df['Last Purchase Date']). dt.days computes the number of days since the
last purchase.pd.qcut() splits data into quantiles (High, Medium, Low). Each customer
is assigned a combined RFM segment based on Recency, Frequency, and Monetary
values.A function assigns labels: "High-Value": Best customers, "Medium-Value":
Moderate customers, "Low-Value": Least active customers. A bar chart displays
customer segment distribution.A pie chart represents segment proportions.
Output:
Case Study: Time Series Forecasting
Ex37: Use ARIMA or Prophet to forecast monthly sales data for the next 12 months.
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from prophet import Prophet
from datetime import datetime, timedelta
date_rng = pd.date_range(start='2015-01-01', end='2023-12-01', freq='MS')  # Generate monthly dates
np.random.seed(42)
sales = np.random.randint(100, 500, size=len(date_rng))
df = pd.DataFrame({'ds': date_rng, 'y': sales})  # Prophet expects 'ds' and 'y' columns
plt.plot(df['ds'], df['y'])  # Visualize historical sales trends
plt.title('Historical Monthly Sales')
plt.show()
model_fit = ARIMA(df['y'], order=(5, 1, 0)).fit()  # Fit an ARIMA model
forecast = model_fit.forecast(steps=12)  # Predict the next 12 months
forecast_dates = pd.date_range(start=df['ds'].iloc[-1] + timedelta(days=30), periods=12, freq='MS')
arima_forecast = pd.DataFrame({'ds': forecast_dates, 'yhat': forecast.values})  # Forecast frame
print(arima_forecast)
m = Prophet()  # Train Prophet on the same data
m.fit(df)
future = m.make_future_dataframe(periods=12, freq='MS')
prophet_forecast = m.predict(future)
m.plot(prophet_forecast)
plt.show()
Explanation:
A date range (2015-2023) and corresponding sales values are created. plt.plot()
visualizes historical sales trends. ARIMA(df['y'], order=(5,1,0)) fits an ARIMA model
with specified parameters. .forecast(steps=12) predicts sales for the next 12 months.
forecast_dates = pd.date_range(start=df['ds'].iloc[-1] + timedelta(days=30),
periods=12, freq='MS') generates future dates. A DataFrame is created with predicted
sales values. Prophet().fit(df) initializes and trains the Prophet model. Future sales are
predicted and plotted.
Output:
Ex38: Evaluate the forecast model using metrics like RMSE and MAE, and plot the
forecast with confidence intervals.
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error
from statsmodels.tsa.arima.model import ARIMA
from datetime import datetime, timedelta
date_rng = pd.date_range(start='2020-01-01', periods=36, freq='MS')  # 36 months of data
np.random.seed(42)
sales = np.random.randint(100, 500, size=len(date_rng))
df = pd.DataFrame({'sales': sales}, index=date_rng)
train, test = df['sales'][:24], df['sales'][24:]  # 24 months train, 12 months test
model_fit = ARIMA(train, order=(5, 1, 0)).fit()  # Fit ARIMA on the training data
forecast_res = model_fit.get_forecast(steps=12)
forecast = forecast_res.predicted_mean
conf_int = forecast_res.conf_int(alpha=0.05)  # 95% confidence interval
rmse = np.sqrt(mean_squared_error(test, forecast))
mae = mean_absolute_error(test, forecast)
print("RMSE:", rmse)
print("MAE:", mae)
plt.plot(train.index, train, label='Training Data')
plt.plot(test.index, test, label='Test Data')
plt.plot(test.index, forecast, label='Forecast')
plt.fill_between(test.index, conf_int.iloc[:, 0], conf_int.iloc[:, 1], alpha=0.3)  # Uncertainty band
plt.legend()
plt.show()
Explanation:
Generate monthly sales data from January 2020 using pd.date_range() for dates
and np.random.randint() for sales values. Convert the data into a DataFrame with dates
as the index. Split the data into a training set (train with 24 months) and a test set (test
with 12 months). Use the ARIMA model with parameters (5, 1, 0) on the training data to
capture the time series pattern. Fit the model using model.fit(). Forecast the next 12
months using model_fit.forecast() with a 95% confidence interval. Calculate RMSE
(np.sqrt(mean_squared_error(test, forecast))) and MAE (mean_absolute_error(test,
forecast)) to evaluate model accuracy. Plot the training data, test data, and forecasted
values using plt.plot(). Display the confidence intervals with plt.fill_between() to show
forecast uncertainty. Print RMSE and MAE values. Display the forecast plot with
confidence intervals.
Output:
11: Advanced Visualization and Storytelling
Interactive Visualizations
Ex39: Create an interactive dashboard using Plotly and Dash to display key performance
indicators (KPIs) and trends.
Code:
import dash
from dash import dcc, html
from dash.dependencies import Input, Output
import plotly.graph_objects as go
import pandas as pd
kpi_data = {
"Month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
"Sales": [1000, 1500, 2000, 1800, 2200, 2400],
"Profit": [200, 300, 500, 400, 600, 700],
}# Sample Data
df = pd.DataFrame(kpi_data)
app = dash.Dash(__name__)  # Create the Dash app
app.layout = html.Div([
html.H1("Interactive KPI Dashboard", style={"textAlign": "center"}),
html.Div([
html.Div([
html.H3("Total Sales"),
html.P(id="total-sales", style={"fontSize": "24px", "color": "blue"}),
], style={"padding": "20px", "border": "1px solid black", "flex": 1}),
html.Div([
html.H3("Total Profit"),
html.P(id="total-profit", style={"fontSize": "24px", "color": "green"}),
], style={"padding": "20px", "border": "1px solid black", "flex": 1}),
], style={"display": "flex", "justifyContent": "space-around"}),
html.Div([
dcc.Graph(id="kpi-trends")
])
])
@app.callback(
[Output("total-sales", "children"), Output("total-profit", "children"),
Output("kpi-trends", "figure")],
[Input("kpi-trends", "id")]
)
def update_dashboard(_):
    total_sales = df["Sales"].sum()
    total_profit = df["Profit"].sum()
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df["Month"], y=df["Sales"], mode="lines+markers", name="Sales"))
    fig.add_trace(go.Scatter(x=df["Month"], y=df["Profit"], mode="lines+markers", name="Profit"))
    return f"{total_sales}", f"{total_profit}", fig

if __name__ == "__main__":
    app.run_server(debug=True)
Explanation:
● The update_dashboard callback dynamically calculates total sales and profit and
updates the trend graph using Plotly's Scatter plots.
● Flex layout is used to align KPIs side-by-side, and the graph displays trends over
months.
Output: