
Jupyter Notebook for Data Analytics and Data Analysis

Part 1: Introduction to Jupyter Notebook

Jupyter Notebook is an interactive tool for writing and running code, mainly used for
data analysis, machine learning, and scientific research. It allows you to combine code, text, and
visualizations in one document.

1. Getting Started with Jupyter Notebook

1.1 What is a Jupyter Notebook?

Jupyter Notebook is an open-source web-based interactive computing environment that


allows users to create and share documents containing live code, equations, visualizations, and
narrative text. It supports various programming languages, most commonly Python, and is
widely used for data analysis, machine learning, and scientific computing.

1.2 Installing Jupyter Notebook via Anaconda and pip

Using Anaconda: Anaconda is a distribution that includes Jupyter Notebook along with
Python and other essential libraries.

To install:
○​ Download and install Anaconda from https://www.anaconda.com/.
○​ Open Anaconda Navigator and launch Jupyter Notebook.

Using pip:
If you prefer not to use Anaconda:
○​ Ensure Python and pip are installed on your system.
○​ Run the command: pip install notebook
○​ Start Jupyter Notebook by typing jupyter notebook in your terminal or
command prompt.

1.3 Overview of the Jupyter Interface: Menus, Toolbar, and Cells

Menus: Located at the top, the menus provide options for file management, editing,
viewing, running code, and more. Toolbar: Contains shortcuts for common tasks such as saving,
adding cells, and running code. Cells: The main building blocks of a notebook; cells can
contain code, Markdown, or raw text.
2. Essential Features and Shortcuts
​ Jupyter Notebook has several features and shortcuts to help you work more efficiently.

2.1 Cell Types: Code, Markdown, and Raw Cells

Code Cells: Used to write and execute programming code. Outputs, such as printed
results or graphs, appear below the cell. Markdown Cells: For adding formatted text, links,
images, and equations using Markdown syntax. Raw Cells: Display text as-is without formatting
or execution.

2.2 Keyboard Shortcuts for Efficiency

Cell Navigation:
○​ Enter: Edit the current cell.
○​ Esc: Switch to command mode.
○​ Arrow keys: Move between cells.
Cell Operations:
○​ A: Insert a cell above.
○​ B: Insert a cell below.
○​ X: Cut the selected cell.
○​ Z: Undo cell deletion.
Execution:
○​ Shift + Enter: Run the current cell and move to the next.
○​ Ctrl + Enter: Run the current cell and stay in place.
○​ Alt + Enter: Run the current cell and insert a new cell below.

2.3 Saving, Exporting, and Sharing Notebooks

●​ Save your notebook using Ctrl + S or the save icon in the toolbar.
●​ Export notebooks as HTML, PDF, or other formats via the File > Download as menu.
●​ Share your notebook by uploading it to platforms like GitHub or Google Drive.

3. Markdown Basics for Documentation

Markdown helps you format text easily. It's used in Jupyter Notebooks to add headings,
lists, links, images, and more.

3.1 Text Formatting: Bold, Italics, Lists, and Headings

Bold: **bold text** → bold text


Italics: *italic text* → italic text
Lists:
●​ Ordered: 1. Item 1\n2. Item 2
●​ Unordered: - Item 1\n- Item 2
Headings: Use # for headings (e.g., # Heading 1, ## Heading 2).
3.2 Adding Hyperlinks, Images, and Tables

Hyperlinks:
[Link Text](https://example.com)
Images:
![Alt Text](image_url)
Tables:
| Column1 | Column2 |
|---------|---------|
| Data 1 | Data 2 |

3.3 Embedding Code Outputs and Equations


Code Outputs:
Include outputs below code cells by executing them.
Equations:
Use LaTeX syntax inside $ symbols. For example, $E = mc^2$ renders as the formatted equation E = mc².

Part 2: Data Manipulation with Python

Data manipulation refers to the process of cleaning, transforming, and analyzing data. In
Python, this is typically done using libraries like Pandas and NumPy.

4. Working with Pandas

Pandas is a powerful Python library used for data manipulation and analysis. It provides
data structures like DataFrames that make it easy to work with structured data.

4.1 Loading Data into Pandas DataFrames (CSV, Excel, SQL, JSON)

●​ CSV: Use pd.read_csv('file.csv') to load data from a CSV file.

●​ Excel: Use pd.read_excel('file.xlsx') to load data from an Excel file.

●​ SQL: Use pd.read_sql(query, connection) to load data from an SQL database.

●​ JSON: Use pd.read_json('file.json') to load data from a JSON file.

4.2 Exploring DataFrames: Head, Tail, Info, and Describe

●​ Head and Tail: Use df.head() and df.tail() to view the first and last rows of the
DataFrame.
●​ Info: Use df.info() to get a summary of the DataFrame, including data types and
non-null counts.
●​ Describe: Use df.describe() for statistical summaries of numerical columns.
4.3 Cleaning Data: Handling Missing Values, Duplicates, and Outliers

Missing Values:
●​ Identify: df.isnull().sum()
●​ Fill: df.fillna(value)
●​ Drop: df.dropna()
Duplicates:
●​ Identify: df.duplicated()
●​ Remove: df.drop_duplicates()
Outliers:
●​ Detect: Use statistical methods like z-scores or IQR.
●​ Handle: Replace or remove outlier values.

5. Advanced DataFrame Operations

Pandas offers powerful functionality to manipulate DataFrames, such as filtering, grouping, and merging data. These operations are essential for more advanced data analysis.

5.1 Filtering and Subsetting Data with Conditions

Use boolean indexing to filter rows:


df[df['column'] > value]
Combine conditions with & (and) or | (or):
df[(df['column1'] > value1) & (df['column2'] < value2)]

5.2 Grouping and Aggregating Data

Group data using df.groupby('column').


Apply aggregation functions like sum(), mean(), or count(). Example:
df.groupby('column')['value_column'].sum().

5.3 Merging, Joining, and Concatenating DataFrames

Merging: Use pd.merge(df1, df2, on='key_column') to merge two DataFrames.


Joining: Use df1.join(df2) for joining on index.
Concatenating: Use pd.concat([df1, df2]) to concatenate along a specific axis.
6. Data Transformation
Data transformation involves modifying your data to make it more suitable for analysis.
In Pandas, you can apply functions to data, reshape your data, and handle date/time information
easily.

6.1 Applying Functions to Rows and Columns

Apply a function to a column:


df['column'].apply(func)
Apply a function to rows:
df.apply(func, axis=1)

6.2 Pivot Tables and Reshaping Data with Melt

Pivot Tables: Use pd.pivot_table(df, values='value', index='row', columns='column').


Melt: Reshape wide-format data into long-format using pd.melt(df, id_vars=['id']).

6.3 Handling Date and Time Data with Pandas

Parse dates while loading data:


pd.read_csv('file.csv', parse_dates=['date_column'])
Extract date components:
df['column'].dt.year, df['column'].dt.month
Perform date arithmetic:
df['column'] + pd.Timedelta(days=10)

Part 3: Data Visualization

​ Data visualization is crucial for presenting insights from data clearly. In Python, popular
libraries such as Matplotlib, Seaborn, and Plotly/Bokeh are used to create a variety of
visualizations. Below is a breakdown of the core features of these libraries.

7. Basic Visualization with Matplotlib

Matplotlib is a widely used library in Python for creating static, animated, and
interactive visualizations.

7.1 Creating Line Charts, Bar Graphs, and Pie Charts

Line Charts: Use plt.plot(x, y) to create a line chart.


Bar Graphs: Use plt.bar(x, height) to create a bar graph.
Pie Charts: Use plt.pie(sizes, labels=labels) to create a pie chart.
7.2 Customizing Visualizations: Titles, Labels, and Legends

Add a title: plt.title('Title').


Label axes: plt.xlabel('X-axis'), plt.ylabel('Y-axis').
Add a legend: plt.legend().

7.3 Saving and Exporting Figures

Save a figure: plt.savefig('filename.png').


Specify format: plt.savefig('filename.pdf').

8. Advanced Visualizations with Seaborn


Seaborn builds on Matplotlib and provides a high-level interface for drawing attractive
and informative statistical graphics.


8.1 Visualizing Distributions: Histograms, KDE, and Boxplots

Histograms: Use sns.histplot(data, bins=10).


KDE (Kernel Density Estimate): Use sns.kdeplot(data).
Boxplots: Use sns.boxplot(x='category', y='value', data=df).

8.2 Pair Plots and Correlation Heatmaps

Pair Plots: Use sns.pairplot(df) to visualize pairwise relationships.


Correlation Heatmaps: Use sns.heatmap(df.corr(), annot=True).

8.3 Categorical Data Visualization

Use sns.countplot(x='category', data=df) for count plots.


Use sns.barplot(x='category', y='value', data=df) for bar plots.

9. Interactive Visualizations

Plotly and Bokeh are libraries used to create interactive visualizations, making it easier
to explore the data.

9.1 Introduction to Plotly and Bokeh

Plotly: A powerful library for creating interactive visualizations such as line charts, bar
charts, scatter plots, and more.
Example:
import plotly.express as px
fig = px.scatter(df, x="x", y="y", title="Interactive Plotly Chart")
fig.show()
Bokeh: Another interactive visualization library, great for creating complex
visualizations for web applications.

Example:

from bokeh.plotting import figure, show

p = figure(title="Interactive Bokeh Chart")

p.scatter(x='x', y='y', source=df)

show(p)

9.2 Creating Interactive Dashboards with Widgets

Dashboards combine multiple visualizations and controls (like sliders, dropdowns, and
buttons).
Widgets let users adjust filters or parameters to explore data in real time.
Example: A dashboard with a slider to update a graph dynamically.
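
As a minimal sketch (assuming ipywidgets is installed; the function and parameter names here are illustrative), a slider can re-draw a Matplotlib plot each time the user moves it:

import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact

x = np.linspace(0, 10, 200)

def update(freq=1.0):  # called again each time the slider changes
    plt.plot(x, np.sin(freq * x))
    plt.title(f"sin({freq:.1f}x)")
    plt.show()

interact(update, freq=(0.5, 5.0, 0.5))  # creates a slider from 0.5 to 5.0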

9.3 Exporting Interactive Visualizations

You can save interactive visualizations as HTML files or embed them in web apps.
This allows others to interact with the visualizations without needing Python installed.
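
For instance, a Plotly figure can be written to a standalone HTML file (a small sketch with made-up data; the file name is arbitrary):

import plotly.express as px

fig = px.scatter(x=[1, 2, 3], y=[4, 1, 9], title="Shareable chart")
fig.write_html("interactive_chart.html")  # self-contained HTML, opens in any browser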

Part 4: Data Analysis Techniques

Data analysis techniques help to extract meaningful insights from data. Below are some
essential techniques used in data analysis, including Exploratory Data Analysis (EDA), Time
Series Analysis, and Statistical Analysis.

10. Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is the process of analyzing datasets to summarize their
main characteristics, often using statistical graphics and plots.

10.1 Summary Statistics and Descriptive Analysis

Analyze basic statistics like mean, median, standard deviation, and distributions.
Use tools like Pandas to calculate these metrics.
Example:
import pandas as pd
df = pd.read_csv("data.csv")
print(df.describe())
10.2 Detecting and Handling Outliers
Outliers can skew results and need attention.
Methods:
■​ Visualization: Box plots and scatter plots.
■​ Statistical: Use Z-scores or the IQR (Interquartile Range) method.
Outliers can be removed or transformed based on the analysis.
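
A minimal sketch of the IQR method in Pandas (a small hypothetical column is used so the snippet runs on its own):

import pandas as pd

df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 95]})          # 95 is an obvious outlier
Q1, Q3 = df["value"].quantile(0.25), df["value"].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR                   # common 1.5 * IQR fences
outliers = df[(df["value"] < lower) | (df["value"] > upper)]    # inspect them
df_clean = df[df["value"].between(lower, upper)]                # or drop them
print(outliers)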

10.3 Analyzing Relationships Between Variables

Examine how variables influence each other.


Tools: Correlation heatmaps, scatter plots, and pair plots.
Example: Identify trends like how sales depend on advertising spend.

11. Time Series Analysis


Time series analysis focuses on data points collected or recorded at regular intervals over
time.

11.1 Preparing Time Series Data in Pandas

Format timestamps and set them as indices.


Example:
​df['date'] = pd.to_datetime(df['date']) ​
​df.set_index('date', inplace=True)

11.2 Resampling and Aggregating Time Series

Group data into specific time intervals (e.g., weekly, monthly).


Example:

df.resample('M').mean()

11.3 Visualizing Trends, Seasonality, and Anomalies

Use line plots to spot patterns over time.


Tools like Matplotlib or specialized libraries like statsmodels can highlight seasonality
and trends.
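
As a sketch, statsmodels' seasonal_decompose can split a monthly series into trend, seasonal, and residual components (synthetic data is used here for illustration):

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2020-01-01", periods=48, freq="MS")   # 4 years of monthly data
series = pd.Series(np.arange(48) + 10 * np.sin(np.arange(48) * 2 * np.pi / 12), index=idx)

result = seasonal_decompose(series, model="additive", period=12)
result.plot()  # stacked panels: observed, trend, seasonal, residual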

12. Statistical Analysis

Statistical analysis is used to test hypotheses, quantify relationships, and make predictions.

12.1 Hypothesis Testing: t-tests and ANOVA

t-tests: Compare means between two groups.


ANOVA: Compare means across multiple groups.
Example:
​ ​ from scipy.stats import ttest_ind ​
​t_stat, p_value = ttest_ind(group1, group2)

12.2 Correlation and Regression Analysis

Correlation: Measure how two variables move together (e.g., Pearson's correlation).
Regression: Predict the value of one variable based on another.
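
A short illustration with SciPy and scikit-learn, using small made-up numbers for advertising spend and sales:

import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

ad_spend = np.array([10, 20, 30, 40, 50]).reshape(-1, 1)   # hypothetical predictor
sales = np.array([25, 44, 68, 81, 105])                    # hypothetical response

r, p_value = pearsonr(ad_spend.ravel(), sales)             # strength of the linear relationship
model = LinearRegression().fit(ad_spend, sales)            # fit y = a*x + b
print(f"Pearson r = {r:.3f}, slope = {model.coef_[0]:.2f}")
print("Predicted sales at spend 60:", model.predict([[60]])[0])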

12.3 Building Confidence Intervals and Interpreting Results

Confidence intervals give a range of values likely to include the true parameter.
Example: "We are 95% confident that the mean lies between X and Y."

Part 5: Machine Learning Basics

13. Data Preprocessing for Machine Learning

Data preprocessing converts raw data into a clean, structured format, enhancing its
quality and making it suitable for analysis or modeling. It involves handling missing values,
correcting inconsistent formats, removing irrelevant features, and scaling or encoding variables.
These steps improve the accuracy and reliability of machine learning models.

13.1 Splitting Data into Training and Test Sets​



​ Splitting data ensures fair evaluation of a model’s performance. The dataset is typically
divided into training and test sets, with ratios like 80/20 or 70/30. The training set is used to build
the model, while the test set evaluates its performance on unseen data. Cross-validation further
enhances this process by splitting the data into multiple folds, training and testing on each fold to
ensure robustness. Stratification is crucial for imbalanced datasets to maintain proportional class
distributions in each split, especially for classification tasks.
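
A minimal scikit-learn sketch of a stratified 80/20 split (a synthetic dataset stands in for real features and labels):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=42)   # toy data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)                      # keep class ratios
print(X_train.shape, X_test.shape)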

13.2 Normalization and Standardization Techniques​



​ Normalization scales all feature values to a specific range, often [0, 1], making them
uniform. This technique is commonly applied in algorithms like k-Nearest Neighbors and neural
networks. Standardization, on the other hand, centers data around a mean of 0 with a standard
deviation of 1, which is especially beneficial for algorithms like SVMs and PCA. Both methods
prevent features with large magnitudes from dominating the learning process, ensuring fair
contributions from all variables.
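
For illustration, both techniques with scikit-learn on a small made-up feature matrix (in practice, fit the scaler on the training set only and reuse it on the test set):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])   # columns on very different scales
X_norm = MinMaxScaler().fit_transform(X)                   # normalization: each column in [0, 1]
X_std = StandardScaler().fit_transform(X)                  # standardization: mean 0, std 1
print(X_norm)
print(X_std)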
13.3 Encoding Categorical Data​

​ Machine learning algorithms require numerical input, necessitating the encoding of
categorical features.
●​ Label Encoding: Assigns a unique integer to each category but may introduce
ordinal relationships.
●​ One-Hot Encoding: Creates binary columns for each category, avoiding any
implicit ranking.
●​ Ordinal Encoding: Maintains a meaningful order for categories (e.g., small <
medium < large).
●​ Target Encoding: Replaces categories with their mean target value, ideal for
high-cardinality features but prone to overfitting if not used carefully. Proper
encoding ensures that categorical data is efficiently utilized without introducing
biases.
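
A short sketch of label and one-hot encoding with scikit-learn and Pandas (the 'size' and 'city' columns are hypothetical):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"size": ["small", "large", "medium", "small"],
                   "city": ["Paris", "Rome", "Paris", "Berlin"]})
df["size_code"] = LabelEncoder().fit_transform(df["size"])   # integer codes (no real order implied)
df = pd.get_dummies(df, columns=["city"])                    # one binary column per city
print(df)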

14. Supervised Learning Models


14.1 Building and Evaluating Linear Regression Models

Linear regression is used for predicting a continuous dependent variable based on one or
more independent variables. It fits a straight line (or hyperplane for multiple variables) to the
data. Key metrics for evaluation include:

●​ Mean Squared Error (MSE): Measures average squared differences between predictions and actual values.
●​ R-squared: Indicates the proportion of variance in the dependent variable
explained by the model.
●​ Adjusted R-squared: Considers the number of predictors, penalizing overfitting.​
Regularization techniques like Lasso and Ridge regression help address
overfitting by penalizing large coefficients, ensuring better generalization.
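
A compact sketch with scikit-learn on synthetic data, computing MSE and R-squared and comparing a Ridge-regularized fit:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                   # synthetic features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("R-squared:", r2_score(y_test, pred))

ridge = Ridge(alpha=1.0).fit(X_train, y_train)                  # penalizes large coefficients
print("Ridge R-squared:", ridge.score(X_test, y_test))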

14.2 Decision Trees and Random Forests

Decision Trees: Use hierarchical splitting based on feature thresholds to create interpretable models. However, they are prone to overfitting on training data.

Random Forests: Combine multiple decision trees (an ensemble method) to reduce
overfitting and improve accuracy. Features are randomly sampled for each tree,
increasing model diversity.

Feature Importance: Random forests provide feature importance scores, helping identify key predictors in the data.
14.3 Evaluating Classification Models with Precision, Recall, and F1 Score

Classification models are assessed using various metrics:

●​ Precision: The proportion of true positive predictions among all positive predictions (important for reducing false positives).
●​ Recall (Sensitivity): The proportion of true positives identified by the model
among all actual positives (important for minimizing false negatives).
●​ F1 Score: The harmonic mean of precision and recall, balancing the trade-off
between the two.​
Other metrics like the ROC-AUC score measure a model’s ability to distinguish
between classes, and confusion matrices provide a detailed view of classification
errors.
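
These metrics are available directly in scikit-learn; a sketch with toy labels and predictions:

from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # hypothetical actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # hypothetical model predictions

print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1 score:", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))    # rows = actual, columns = predicted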

15. Unsupervised Learning


15.1 K-Means Clustering and Visualization
K-Means is a popular clustering algorithm that partitions data into a pre-specified number
of clusters (k) by minimizing intra-cluster variance. The algorithm iteratively:
●​ Assigns each data point to the nearest cluster centroid.
●​ Recomputes centroids based on the current cluster members.
●​ Stops when centroids stabilize or a maximum number of iterations is reached.

Visualization: Dimensionality reduction techniques like PCA (Principal Component Analysis) can project high-dimensional clusters into 2D or 3D for visualization.
Choosing Optimal Clusters: The elbow method evaluates cluster coherence by plotting the
within-cluster sum of squares (WCSS) for different k values and selecting the point where the
WCSS curve bends ("elbow").
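
A minimal sketch of K-Means and the elbow method with scikit-learn, using synthetic blobs in place of real data:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)   # synthetic clusters

wcss = []
for k in range(1, 9):                                          # try several values of k
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)                                   # within-cluster sum of squares

plt.plot(range(1, 9), wcss, marker="o")                        # look for the bend ("elbow")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS")
plt.show()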

15.2 Dimensionality Reduction with PCA


Principal Component Analysis (PCA) reduces the number of features while preserving
most of the data's variance. It achieves this by:
●​ Computing principal components, which are linear combinations of the original
features.
●​ Ordering components by the variance they explain, retaining only the top ones.
Benefits of PCA:
●​ Improves Efficiency: Reduces computational cost for high-dimensional datasets.
●​ Aids Visualization: Projects complex datasets into 2D or 3D for intuitive analysis.
●​ Noise Reduction: Removes redundant features, enhancing model performance.
PCA is widely used before clustering or for exploratory data analysis.
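
As a short example, scikit-learn's PCA can project the standardized Iris dataset onto two components:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)   # standardize before PCA
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)                            # 4 features projected onto 2 components
print(pca.explained_variance_ratio_)                   # variance captured by each component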
15.3 Customer Segmentation Using RFM Analysis
RFM (Recency, Frequency, Monetary) analysis is a technique for segmenting customers
based on their purchasing behavior:
●​ Recency: How recently a customer made a purchase.
●​ Frequency: How often a customer makes purchases.
●​ Monetary Value: The total amount a customer has spent.
Steps for RFM Analysis:
1.​ Assign RFM scores to each customer based on percentile rankings.
2.​ Segment customers into groups (e.g., high-value, at-risk, inactive) using
clustering or thresholds.
3.​ Use insights to tailor marketing strategies, predict churn, or prioritize customer
retention.
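
A hedged Pandas sketch of step 1, assuming a hypothetical transactions table with customer_id, invoice_date, and amount columns:

import pandas as pd

tx = pd.DataFrame({                                     # hypothetical transactions
    "customer_id": [1, 1, 2, 3, 3, 3],
    "invoice_date": pd.to_datetime(["2024-01-05", "2024-03-01", "2024-02-10",
                                    "2024-01-20", "2024-02-15", "2024-03-10"]),
    "amount": [50, 30, 200, 20, 25, 30],
})

now = tx["invoice_date"].max() + pd.Timedelta(days=1)
rfm = tx.groupby("customer_id").agg(
    recency=("invoice_date", lambda d: (now - d.max()).days),   # days since last purchase
    frequency=("invoice_date", "count"),                        # number of purchases
    monetary=("amount", "sum"),                                 # total spend
)
print(rfm)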

Part 6: Advanced Applications

16. Automating Data Workflows

16.1 Scheduling Notebooks with Papermill and nbconvert

Papermill allows parameterized execution of Jupyter notebooks, while nbconvert enables exporting notebooks into various formats. Together, they help schedule and automate workflows like daily reports.
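
A sketch of the Papermill API (the notebook names and parameters below are hypothetical); the executed copy can then be converted with, for example, jupyter nbconvert --to html:

import papermill as pm

pm.execute_notebook(
    "daily_report_template.ipynb",      # input notebook with a "parameters" cell
    "daily_report_output.ipynb",        # executed copy written to disk
    parameters={"report_date": "2024-01-31", "region": "North"},
)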

16.2 Parameterizing Notebooks for Reusability

Parameterization in notebooks allows dynamic inputs, enabling reuse with different datasets or scenarios. Tools like Papermill simplify this process.

16.3 Creating Automated Reports with Visualizations

Automating reports with visualizations involves generating pre-defined plots and tables
that are dynamically updated. Libraries like Matplotlib and Plotly are often integrated
into workflows for this purpose.

17. Big Data and Distributed Processing


17.1 Working with Large Datasets Using Dask

Dask extends pandas to handle large datasets by parallelizing computations. It operates on chunks of data, enabling analysis beyond memory constraints.
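
A minimal Dask sketch (the CSV pattern and column names are hypothetical); operations stay lazy until .compute() is called:

import dask.dataframe as dd

ddf = dd.read_csv("sales-2024-*.csv")               # read many files as partitioned chunks
result = ddf.groupby("region")["revenue"].mean()    # lazy, pandas-like operations
print(result.compute())                             # triggers the parallel computation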

17.2 Processing Dataframes with Vaex and Modin

Vaex and Modin offer fast, scalable dataframe operations. Vaex is optimized for
out-of-core processing, while Modin provides a pandas-like interface with parallel computation.
17.3 Introduction to Spark DataFrames in Jupyter

Spark DataFrames allow distributed processing of large datasets in Jupyter notebooks. PySpark, the Python API for Apache Spark, enables scalable data transformations and analysis.
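
A short PySpark sketch (the file and column names are hypothetical), runnable where PySpark is installed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jupyter-demo").getOrCreate()
sdf = spark.read.csv("sales.csv", header=True, inferSchema=True)   # distributed DataFrame
sdf.groupBy("region").count().show()                               # aggregation runs across partitions
spark.stop()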

18. Advanced Statistical Modeling


18.1 ARIMA and Prophet for Time Series Forecasting

ARIMA models are used for forecasting based on time series data, incorporating trends
and seasonality. Prophet simplifies forecasting with intuitive parameters and handles missing
data effectively.
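
A sketch of an ARIMA fit with statsmodels on a synthetic monthly series (the order (1, 1, 1) is only an example):

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

idx = pd.date_range("2020-01-01", periods=36, freq="MS")
series = pd.Series(100 + np.cumsum(np.random.default_rng(0).normal(size=36)), index=idx)

model = ARIMA(series, order=(1, 1, 1)).fit()   # (p, d, q): AR terms, differencing, MA terms
print(model.forecast(steps=6))                 # forecast the next six months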

18.2 Logistic Regression for Classification Problems

Logistic regression predicts probabilities for binary outcomes. It uses a sigmoid function
to map predictions to a [0, 1] range and is evaluated using metrics like AUC-ROC.
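
A compact scikit-learn sketch with AUC-ROC evaluation on synthetic data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]             # sigmoid output: probability of class 1
print("AUC-ROC:", roc_auc_score(y_test, probs))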

18.3 Building Generalized Linear Models (GLMs)

GLMs extend linear models by allowing non-normal response distributions and link
functions. Common examples include Poisson regression for count data and logistic regression
for binary classification.

Part 7: Real-World Projects

19. Case Studies


19.1 Retail Sales Analysis: Trend Analysis and Forecasting

Introduction

This document provides a detailed analysis of retail sales trends and forecasts for the UK
retail sector for the years 2025 and 2026. It was prepared by the Centre for Retail Research
(CRR) on 28 November 2024, considering the economic and political landscape shaped by the
2024 General Election and subsequent policies.

Background

The Centre for Retail Research (CRR) has over 15 years of experience in economic
forecasting for retail businesses. Led by Prof. Joshua Bamfield, CRR adopts a dynamic model of
the economy, emphasizing human behavior’s role in shaping economic outcomes.
Current Retail Sales

The total volume of retail sales for 2023 was £427.267 billion, reflecting sales
through both physical stores and online platforms, excluding automotive fuel. Sales
volumes have been adjusted to eliminate inflationary effects, ensuring real-term
comparisons.

Recent Trends (2022-2024)

●​ 2022:
Retail sales fell by -4.6%.
●​ 2023: Continued decline with a -2.8% decrease.
●​ 2024: Preliminary data suggests a modest decline of -0.2%, with late-year
improvements offset by inflation and rising energy costs.

Forecast for 2025-2026

Retail sales are projected to experience further declines, though less severe than in
2022-2023.

●​ 2025: Sales decline of -2.1%.


●​ 2026: Sales decline of -2.5%.

Consumer caution, increased taxation, and inflationary pressures are expected to


contribute to reduced spending, despite improving consumer confidence since early 2024.

Key Influences on Retail Sales

1.​ Government Policy:


○​ The Labour Government’s October 2024 Budget introduced a £40 billion
tax increase, raising employer National Insurance Contributions and
minimum wages.
○​ Infrastructure investments, including plans to build 1.5 million homes,
may yield long-term benefits but have limited short-term retail impact.

2.​ Economic Conditions:


○​ Consumer Confidence: Improved from -49 in September 2022 to -21 by
November 2024.
○​ Inflation: Reduced to 2% by mid-2024 but projected to rise due to
budgetary policies.

3.​ Global Factors:


○​ Geopolitical tensions, such as the Ukraine war, continue to affect energy
and food prices.
○​ Potential trade tariffs under the Trump administration could impact
exports and consumer prices.
Retail Sector Performance (2018-2026)

Volume Sales Growth (Real Terms):

●​ 2018-2019: Modest growth of +2.2% and +2.8%.


●​ 2020-2021: Mixed performance, with growth rebounding to +3.9% in 2021 after a
pandemic-induced slowdown.
●​ 2022-2023: Sharp declines of -4.6% and -2.8%.
●​ 2024-2026: Forecasted declines of -0.2% (2024), -2.1% (2025), and -2.5% (2026).

Challenges and Opportunities

Challenges:

●​ Rising costs due to tax increases and inflation.


●​ Potential job losses in retail and hospitality sectors
(≥100,000).
●​ Geopolitical uncertainties, including trade tariffs and ongoing conflicts.

Opportunities:

●​ Infrastructure projects and net-zero initiatives may foster long-term


growth.
●​ Improved consumer confidence could support gradual recovery.

Conclusion

While the UK retail sector faces short-term challenges, strategic investments and
a focus on sustainability may drive long-term growth. However, the forecast underscores
the importance of adapting to economic and geopolitical shifts to ensure resilience in the
retail industry.

19.2 Customer Churn Prediction Using Machine Learning Models

Background

Customer churn is a critical concern for businesses, particularly in the


telecommunications sector, where competition is intense. Churn prediction models aim to
identify customers who are likely to stop using a service, enabling businesses to take
proactive measures to retain them. This case study demonstrates the application of
machine learning models, including Random Forest (RF), XGBoost, and a hybrid
ensemble model, to predict customer churn in a real-world telecommunications setting.

1. Problem Statement

Telecommunication companies lose significant revenue annually due to customer


churn. Identifying churn-prone customers in advance can help companies offer tailored
retention strategies. This case study uses a hybrid machine learning model to predict
customer churn accurately, leveraging a dataset from a telecommunications provider.

2. Data Overview

Dataset: The data was sourced from a public repository (e.g., Kaggle), containing
customer details such as:

●​ Customer demographics: Gender, tenure, and service type.


●​ Service usage: Data usage, call minutes, and plan subscriptions.
●​ Billing information: Monthly charges and payment methods.
●​ Churn status: Whether a customer churned or not (binary label).

Data Preprocessing:

1.​ Removal of irrelevant features like customer ID and phone number.


2.​ Conversion of categorical features to numerical values (e.g., Yes/No to 1/0).
3.​ Normalization of numerical data to ensure uniform scaling.
4.​ Splitting the dataset into training (75%) and testing (25%) subsets.

3. Methodology

The predictive model development involved three stages:

Stage 1: Feature Engineering

Features were selected using Pearson's Correlation Coefficient, ensuring only


the most relevant variables were included in the model. Example features included:

●​ Account length
●​ Total day minutes
●​ Total international calls
●​ Monthly charges

Stage 2: Model Development

Three machine learning models were developed and evaluated:

1.​ Random Forest (RF): An ensemble learning method using multiple decision
trees.
2.​ XGBoost: An optimized gradient-boosting framework for better performance.
3.​ Hybrid Model: A combination of RF and XGBoost using a voting mechanism to
unify predictions.

Stage 3: Performance Evaluation

Models were evaluated based on the following metrics:


●​ Accuracy: Overall correctness of predictions.
●​ True Positive Rate (TPR): Correct identification of churners.
●​ True Negative Rate (TNR): Correct identification of non-churners.

4. Results

●​ Random Forest:
○​ Accuracy: 95.32%
○​ TPR: 80.45%
○​ TNR: 97.86%
●​ XGBoost:
○​ Accuracy: 95.68%
○​ TPR: 81.25%
○​ TNR: 98.12%
●​ Hybrid Model:
○​ Accuracy: 95.92%
○​ TPR: 81.60%
○​ TNR: 98.45%

The hybrid model outperformed individual models, demonstrating the benefits of


combining their strengths.

5. Business Impact

Implementing the hybrid model in a telecommunications business can


significantly enhance customer retention efforts. Key benefits include:

●​ Early Identification: Flagging potential churners allows for timely interventions,


such as offering discounts or personalized services.
●​ Cost Efficiency: Retaining existing customers is more cost-effective than
acquiring new ones.
●​ Revenue Growth: A 1% improvement in retention can lead to substantial
revenue gains.

6. Conclusion

The hybrid predictive model combining Random Forest and XGBoost offers a
robust solution for customer churn prediction. By leveraging machine learning,
telecommunications companies can optimize customer retention strategies, improve
profitability, and gain a competitive edge.

Future Work:​
Further enhancements can include:

●​ Integration of deep learning models for better feature extraction.


●​ Real-time churn prediction using streaming data.
●​ Expansion to other industries with similar churn challenges, such as banking or
retail.

19.3 Financial Data Analysis: Portfolio Optimization

Case Study: Optimal Asset Allocation for Maximizing Returns

Introduction: Financial markets are complex and unpredictable, making investment


decisions a challenging task for individuals and institutions alike. Portfolio optimization
involves strategically allocating assets to maximize returns while minimizing risks,
adhering to an investor's risk tolerance and financial goals. This case study explores
portfolio optimization using machine learning and statistical techniques to enhance
investment outcomes.

Problem Statement:

The primary objective is to create an optimized investment portfolio by analyzing


historical financial data. The portfolio should aim to:

1.​ Maximize returns for a given level of risk.


2.​ Minimize risk while achieving a target return.
3.​ Ensure diversification across various asset classes to reduce unsystematic risk.

Dataset Description:

The dataset used for this case study consists of:

●​ Historical price data for multiple assets (e.g., stocks, bonds, ETFs, commodities).
●​ Key metrics such as daily returns, annualized returns, and volatility.
●​ Macro-economic indicators influencing financial markets (e.g., interest rates,
inflation rates).

Tools and Libraries:

●​ Python with libraries such as NumPy, pandas, matplotlib, seaborn.


●​ Machine learning libraries: scikit-learn, PyPortfolioOpt.
●​ Optimization techniques: Monte Carlo simulation, Sharpe Ratio maximization,
Efficient Frontier.
Methodology:

1.​ Data Preprocessing:​

○​ Import and clean financial data, ensuring no missing or anomalous entries.


○​ Compute daily and annualized returns and assess historical volatility.
2.​ Exploratory Data Analysis (EDA):​

○​ Visualize historical trends, returns, and correlations between assets.


○​ Identify patterns or anomalies that may affect optimization.
3.​ Portfolio Construction:​

○​ Calculate expected returns and the covariance matrix of asset returns.


○​ Define constraints (e.g., no short selling, weight limits on assets).
4.​ Optimization Techniques:​

○​ Efficient Frontier Analysis:


■​ Identify portfolios offering the highest return for a given risk level.
■​ Visualize risk-return trade-offs to guide investor decisions.
○​ Sharpe Ratio Maximization:
■​ Determine the portfolio with the highest risk-adjusted returns.
○​ Monte Carlo Simulations:
■​ Generate random portfolio weights to explore potential outcomes.
5.​ Validation:​

○​ Backtest the optimized portfolio on out-of-sample data.


○​ Compare performance metrics such as returns, Sharpe Ratio, and
drawdowns against benchmarks (e.g., S&P 500).

Results:

The optimized portfolio achieved the following outcomes:

●​ Expected Annualized Return: 12.3%.


●​ Annualized Volatility (Risk): 8.7%.
●​ Sharpe Ratio: 1.41 (indicating superior risk-adjusted performance).
●​ Diversification across asset classes reduced unsystematic risk significantly.

The analysis demonstrated that machine learning techniques and statistical


optimization could enhance portfolio performance, providing a balance between risk and
return.
Conclusion:

Portfolio optimization is a crucial aspect of financial data analysis, enabling


investors to make data-driven decisions. By leveraging machine learning algorithms,
statistical methods, and financial theories, investors can construct portfolios tailored to
their risk-return preferences. This case study highlights the importance of diversification,
efficient resource allocation, and continuous monitoring for successful investment
outcomes.

Future Work:

1.​ Incorporate real-time data streams for dynamic portfolio adjustments.


2.​ Explore advanced machine learning models, such as reinforcement learning, for
adaptive optimization.
3.​ Integrate ESG (Environmental, Social, Governance) factors for sustainable
investing.

20. End-to-End Data Analysis Project


20.1 Data Collection and Cleaning

This phase involves sourcing data from APIs, databases, or web scraping, followed by
handling missing values, outliers, and duplicates for clean datasets.

20.2 Exploratory Analysis and Visualization

EDA identifies trends, correlations, and anomalies using visualizations like histograms,
scatter plots, and heatmaps, providing insights for model building.

20.3 Predictive Modeling and Insights Presentation

Models are trained and validated, and insights are presented through dashboards or
reports, often accompanied by actionable recommendations.

Part 8: Advanced Features and Extensions

21. Jupyter Extensions and Magic Commands


21.1 Enabling and Using Jupyter Extensions (nbextensions)

Extensions enhance productivity by adding features like a table of contents, variable inspector, and code folding. They are managed using Jupyter's extensions configurator.
21.2 Useful Magic Commands: %timeit, %matplotlib inline, and %run

Magic commands improve workflow efficiency. %timeit measures code execution time,
%matplotlib inline embeds plots, and %run executes external scripts.
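
For example, in a code cell (the script name in the last line is hypothetical):

%timeit sum(range(1_000))     # reports average execution time over many runs
%matplotlib inline            # render Matplotlib figures directly below the cell
%run analysis_helpers.py      # execute an external script in the notebook's namespace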

21.3 Customizing Notebooks with Themes and Layouts

Customization involves adjusting notebook appearance with themes and layouts, making
them more visually appealing for presentations.

22. Interactive Widgets in Jupyter


22.1 Building Interactive Widgets with ipywidgets

Ipywidgets enable interactive elements like sliders, dropdowns, and buttons, making data
analysis dynamic and user-friendly.

22.2 Creating Input Forms for Dynamic Analysis

Input forms allow users to filter or adjust parameters in real-time, enhancing the
flexibility of analysis and visualization.

22.3 Integrating Widgets with Visualizations

Widgets can be linked with plots to create interactive dashboards, providing a seamless
exploratory experience.

23. Collaborating and Sharing Notebooks


23.1 Exporting Notebooks to HTML, PDF, and Slides

Jupyter notebooks can be exported in various formats for sharing, using tools like
nbconvert or third-party plugins.

23.2 Sharing and Version Control with Git and GitHub

Git and GitHub enable collaborative notebook development with version control,
facilitating teamwork and tracking changes.

23.3 Hosting Notebooks on JupyterHub or Google Colab

JupyterHub allows hosting notebooks for teams on shared servers, while Google Colab
provides a cloud-based solution with free GPU support.
LAB
Beginner-Level Lab Exercises
1: Getting Started with Jupyter Notebook

Exploring the Interface


Ex1: Create a new notebook and perform the following: Rename the notebook. Create text,
markdown, and code cells.

Step 1: Open Jupyter Notebook


​ 1.​ Launch Jupyter Notebook from your terminal or Anaconda Navigator. It
will open in your default browser.
​ 2.​ Navigate to the folder where you want to create your notebook.

Step 2: Create a New Notebook


​ 1.​ Click on the “New” button on the top-right corner.
​ 2.​ Select “Python 3 (ipykernel)”. A new notebook will open with the default
name Untitled.

Step 3: Rename the Notebook


​ 1.​ At the top of the notebook, click on the current name (Untitled).
​ 2.​ A pop-up will appear. Type the new name (e.g., Lab_Exercise_1) and
click Rename.

Step 4: Create Cells


Jupyter Notebook supports different types of cells. Here’s how to create and modify
them:

​ 1.​ Create a Text Cell:


​ •​ Select a cell and change its type to “Raw NBConvert” from the dropdown
in the toolbar.
​ •​ Enter plain text, like:

​ 2.​ Create a Markdown Cell:


​ •​ Select a cell and change its type to “Markdown” from the dropdown.
​ •​ Write Markdown content, such as:
# Markdown Cell
This is a **Markdown cell**, which allows *formatted text*.

​ 3.​ Create a Code Cell:


​ •​ Select a cell (default type is Code).
​ •​ Enter Python code, for example:

​ 4.​ Run the Cells:


​ •​ Press Shift + Enter to execute the content of any cell.

Ex2: Write a markdown cell with formatted text, including bold, italics, and bullet points.
In a Markdown cell, write the following content:
# Simple Markdown Example
**Bold Text**
*Italic Text*
Here’s a list:
- First item
- Second item
- Third item

When you run the Markdown cell (by pressing Shift + Enter), it will display:

Basic Python in Jupyter


Ex3: Write a Python program to calculate the factorial of a number using a for loop.

Code:

Here’s a simple Python program to calculate the factorial of a number using a for
loop:

# Program to calculate factorial using a for loop

number = int(input("Enter a number: ")) # Input: Get a number from the user

factorial = 1 # Initialize factorial to 1

for i in range(1, number + 1): # Calculate factorial using a for loop
    factorial *= i

print(f"The factorial of {number} is {factorial}") # Output: Display the factorial

How It Works:
○​ Input: The user provides a number (e.g., 5).
○​ Loop: The program multiplies all integers from 1 to the given number.
○​ For 5, the loop computes 1 * 2 * 3 * 4 * 5.

Output:
The result is displayed as the factorial.

Ex4: Use a while loop to generate the Fibonacci sequence up to a given number.

Code:

Here’s a simple Python program to generate the Fibonacci sequence up to a given


number using a while loop:

max_value = int(input("Enter the maximum value for the Fibonacci sequence: "))
# Input: Get the maximum value for the Fibonacci sequence

a, b = 0, 1 # Initialize the first two numbers in the sequence

print("Fibonacci sequence up to", max_value, "is:") # Display the sequence

while a <= max_value:
    print(a, end=" ")
    a, b = b, a + b

How It Works:

1.​ Input: The user provides the maximum value for the Fibonacci sequence
(e.g., 20).
2.​ Initialization: Start the sequence with a = 0 and b = 1.
3.​ While Loop: Continue generating the next Fibonacci number (a + b) until
it exceeds the maximum value.
Output:
Print each number in the sequence.

2: Data Manipulation with Pandas

Exploring a Dataset

Dataset Overview: Student Dataset


The Student Performance Dataset contains information about students, including
demographics, academic grades, and ratings of their portfolios, cover letters, and
recommendation letters. It has 16 columns and [insert number] rows.
Key Features:

1.​ Demographics: Name, nationality, city, latitude, longitude, gender, and ethnic
group.
2.​ Academic Grades: English, Math, Science, and Language grades.
3.​ Ratings: Portfolio, cover letter, and recommendation letter ratings.
4.​ Age: Age of the student.

Missing Data:

​ •​ Ethnic Group: 307 missing values.

​ •​ Missing values in numeric columns were replaced with the mean, and in
non-numeric columns, with the mode.
This dataset is designed to analyze student performance and identify patterns across
demographics and academic factors.
Ex5: Load a CSV file into a Pandas DataFrame and display the first 10 rows.

Code:

import pandas as pd # Importing pandas library to handle CSV file

# Load the CSV file into a DataFrame


df = pd.read_csv('student-dataset.csv') # Reading the CSV file

# Display the first 10 rows of the DataFrame


print(df.head(10)) # Using head() method to show the top 10 rows

Explanation:

​ 1.import pandas as pd: This imports the Pandas library, which is essential for
working with tabular data like CSV files.

​ 2.pd.read_csv('/path/to/csv'): Reads the CSV file into a Pandas DataFrame.


Replace '/path/to/csv' with the file path.

​ 3.df.head(10): Displays the first 10 rows of the DataFrame. The head() method
allows you to specify the number of rows to view. If no number is passed, it
defaults to 5.

Output:

Ex6: Count the number of rows and columns, and display the column data types.

Code:

import pandas as pd # Import pandas library for working with CSV files

# Load the CSV file into a DataFrame


df = pd.read_csv('student-dataset.csv') # Reading the uploaded CSV file
# Count the number of rows and columns in the DataFrame
rows, columns = df.shape # shape attribute returns a tuple (rows, columns)
print(f"Number of rows: {rows}") # Display the number of rows
print(f"Number of columns: {columns}") # Display the number of columns

# Display the data types of each column


print("Column data types:")
print(df.dtypes) # dtypes attribute shows the data type of each column

Explanation:
1.​ df.shape: This returns a tuple containing the number of rows and columns in the
DataFrame.
2.​ print(f"Number of rows: {rows}"): Prints the number of rows using the first
element of the tuple.
3.​ print(f"Number of columns: {columns}"): Prints the number of columns using
the second element of the tuple.
4.​ df.dtypes: Displays the data type of each column in the DataFrame (e.g., int64,
float64, object).

Output:
Data Cleaning
Ex7: Identify and replace missing values in a dataset with:
The mean for numeric columns.

Code:

import pandas as pd
import numpy as np

data = { # Sample dataset with missing values


"Name": ["Alice", "Bob", "Charlie", "Daisy"],
"Age": [25, np.nan, 30, 35],
"Salary": [50000, 60000, np.nan, 80000]
}
df = pd.DataFrame(data)

print("Original Dataset:")
print(df)

missing_values = df.isnull().sum() # Identify missing values


print("\nMissing Values Count:")
print(missing_values)

df["Age"].fillna(df["Age"].mean(), inplace=True) # Replace missing values with


the mean for numeric columns
df["Salary"].fillna(df["Salary"].mean(), inplace=True)

print("\nDataset After Replacing Missing Values:")


print(df)

Explanation:
1.​ Imports necessary libraries: pandas for data handling, numpy for handling missing
values.
2.​ Creates a dataset with some missing values (np.nan for Age and Salary).
3.​ Identifies missing values using isnull().sum() to count them.
4.​ Fills missing values in Age and Salary columns with their respective column
means using fillna().
5.​ Displays the updated dataset after filling the missing values.

This approach is used to handle missing data by replacing it with the mean of the column.
Output:

Ex8: Rename columns to more descriptive names (e.g., "temp" → "Temperature").

Code:

import pandas as pd

data = { # Sample dataset


"temp": [72, 75, 78],
"hum": [45, 50, 55],
"precip": [0.1, 0.0, 0.2]
}
df = pd.DataFrame(data)

print("Original Dataset:")
print(df)

df.rename(columns={ # Rename columns to more descriptive names


"temp": "Temperature",
"hum": "Humidity",
"precip": "Precipitation"
}, inplace=True)

print("\nDataset with Renamed Columns:")


print(df)
Explanation:
1.​ Create a dataset: The data dictionary contains columns for temperature
(temp), humidity (hum), and precipitation (precip).
2.​ Create a DataFrame: The df DataFrame is created from the data
dictionary.
3.​ Rename columns: The rename() method is used to change the column
names to more descriptive ones (Temperature, Humidity, Precipitation).
4.​ Display the updated DataFrame: The DataFrame with the renamed
columns is printed.

Output:

3: Visualization Basics

Basic Visualization with Matplotlib


Ex9: Plot a line graph showing temperature changes over a week.

Code:

import matplotlib.pyplot as plt

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday",


"Sunday"]
temperatures = [30, 32, 31, 29, 28, 27, 30] # Example data for temperature
changes over a week

plt.figure(figsize=(10, 6)) # Plotting the line graph


plt.plot(days, temperatures, marker='o', color='b', linestyle='-', linewidth=2,
markersize=8)
plt.title("Temperature Changes Over a Week", fontsize=16) # Adding titles and
labels
plt.xlabel("Days of the Week", fontsize=12)
plt.ylabel("Temperature (°C)", fontsize=12)
plt.grid(True, linestyle='--', alpha=0.6)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

plt.tight_layout() # Display the graph


plt.show()

Explanation:
This code plots a line graph showing temperature changes over a week, with days
on the x-axis and temperature (°C) on the y-axis. The line is styled with blue color ('b'),
circular markers ('o'), and a solid line ('-') for clarity. Titles, labels, and gridlines are
added to enhance readability. Finally, the graph is displayed with plt.show(), ensuring a
clean layout with plt.tight_layout().

Output:

Ex10: Create a bar chart to compare the sales of three products in different regions.

Code:

import matplotlib.pyplot as plt


import numpy as np

regions = ["North", "South", "East", "West"] # Data for the bar chart
products = ["Product A", "Product B", "Product C"]
sales = {
"Product A": [200, 150, 300, 250],
"Product B": [180, 130, 270, 220],
"Product C": [210, 160, 310, 260]
}

x = np.arange(len(regions)) # Setting up bar positions


bar_width = 0.25

plt.figure(figsize=(10, 6)) # Creating the bar chart


plt.bar(x - bar_width, sales["Product A"], width=bar_width, label="Product A",
color="blue")
plt.bar(x, sales["Product B"], width=bar_width, label="Product B",
color="green")
plt.bar(x + bar_width, sales["Product C"], width=bar_width, label="Product C",
color="orange")

plt.xlabel("Regions", fontsize=12) # Adding labels, title, and legend


plt.ylabel("Sales (Units)", fontsize=12)
plt.title("Sales Comparison of Products Across Regions", fontsize=16)
plt.xticks(x, regions, fontsize=10)
plt.legend(fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.6)

plt.tight_layout() # Display the bar chart


plt.show()

Explanation:​
​ This code creates a grouped bar chart to compare sales of three products (Product
A, B, and C) across four regions (North, South, East, West). It uses np.arange to set bar
positions and plt.bar to plot each product's sales with different colors. Labels, title, and
legend are added for clarity, and the y-axis grid is enabled for easier reading. Finally, the
chart is displayed with plt.show().

Output:
Using Seaborn for Advanced Visualizations
Ex11: Use Seaborn to create a histogram of a numeric column from a dataset.

Code:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

data = { # Example dataset


"Age": [22, 25, 29, 34, 28, 32, 35, 40, 30, 27, 31, 33, 36, 38, 29, 24, 26, 28, 37,
39]
}
df = pd.DataFrame(data)

plt.figure(figsize=(8, 5)) # Creating the histogram


sns.histplot(df["Age"], bins=8, kde=True, color="skyblue")

plt.title("Distribution of Age", fontsize=16) # Adding labels and title


plt.xlabel("Age", fontsize=12)
plt.ylabel("Frequency", fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.6)

plt.tight_layout() # Display the histogram


plt.show()

Explanation:
This code creates a histogram to visualize the distribution of ages in a dataset
using Seaborn's histplot function, with 8 bins and a kernel density estimate (KDE)
overlay. It customizes the appearance with a skyblue color, labels, and a grid for better
readability. The chart is displayed with appropriate titles and axis labels using plt.show().

Output:
Ex12: Create a boxplot to visualize the distribution of sales by product category.

Code:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

data = { # Example dataset


"Product Category": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "A", "B",
"C"],
"Sales": [200, 220, 250, 180, 190, 170, 300, 310, 320, 240, 200, 330]
}
df = pd.DataFrame(data)

plt.figure(figsize=(8, 5))# Creating the boxplot


sns.boxplot(x="Product Category", y="Sales", data=df, palette="Set2")

plt.title("Sales Distribution by Product Category", fontsize=16) # Adding labels


and title
plt.xlabel("Product Category", fontsize=12)
plt.ylabel("Sales (Units)", fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.6)

plt.tight_layout()# Display the boxplot


plt.show()

Explanation:
This code creates a boxplot to visualize the distribution of sales across different
product categories (A, B, C) using Seaborn. It uses the "Set2" color palette for styling and
adds axis labels and a title for clarity. The grid is enabled on the y-axis for better
readability, and the plot is displayed with plt.show().

Output:
Intermediate-Level Exercises
4: Data Aggregation and Grouping

Grouping and Aggregation with Pandas


Ex13: Load a dataset of employee salaries and:
Calculate the total salary by department.
Find the highest salary in each job role.

Code:

import pandas as pd

data = { # Sample dataset


'EmployeeID': [1, 2, 3, 4, 5, 6],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
'Department': ['HR', 'IT', 'HR', 'IT', 'Finance', 'Finance'],
'JobRole': ['Manager', 'Developer', 'Assistant', 'Developer', 'Analyst', 'Manager'],
'Salary': [60000, 80000, 50000, 85000, 75000, 90000],
}

df = pd.DataFrame(data) # Create DataFrame

total_salary_by_department = df.groupby('Department')['Salary'].sum() # Calculate total salary by department

highest_salary_by_jobrole = df.groupby('JobRole')['Salary'].max() # Find the highest salary in each job role

print("Total Salary by Department:") # Display the results


print(total_salary_by_department)
print("\nHighest Salary by Job Role:")
print(highest_salary_by_jobrole)

Explanation:

1.Dataset Creation:
The dataset is defined as a dictionary and converted into a pandas
DataFrame for analysis.

2.Total Salary by Department:

●​ groupby('Department')['Salary'].sum() groups the data by Department and


calculates the sum of the Salary for each department.

3.Highest Salary by Job Role:


●​ groupby('JobRole')['Salary'].max() groups the data by JobRole and finds
the maximum Salary for each role.

4.Output:

●​ Results are displayed using the print function.

Output:

Ex14: Group sales data by region and year, and calculate the average revenue.

Code:

import pandas as pd

data = { # Sample sales data


'Region': ['North', 'North', 'South', 'South', 'East', 'East', 'West', 'West'],
'Year': [2020, 2021, 2020, 2021, 2020, 2021, 2020, 2021],
'Revenue': [100000, 120000, 80000, 95000, 70000, 75000, 90000, 85000]
}

df = pd.DataFrame(data) # Create a DataFrame

average_revenue = df.groupby(['Region', 'Year'])['Revenue'].mean() # Group by Region and Year, then calculate the average revenue

print("Average Revenue by Region and Year:") # Display the result


print(average_revenue)
Explanation:

1.​ Dataset Creation:


○​ The data is defined in a dictionary format with columns: Region,
Year, and Revenue.
○​ A pandas DataFrame is created from the dictionary.
2.​ Grouping and Aggregation:
○​ groupby(['Region', 'Year']): Groups the data by Region and Year.
○​ ['Revenue'].mean(): Calculates the average revenue for each group.
3.​ Output:
○​ The result is a pandas Series with a multi-level index (Region and
Year) and the calculated average revenue.

Output:

Advanced Operations
Ex15: Create a pivot table showing total sales for each product category in each region.

Code:

import pandas as pd

data = {# Sample sales data


'Region': ['North', 'North', 'South', 'South', 'East', 'East', 'West', 'West'],
'ProductCategory': ['Electronics', 'Furniture', 'Electronics', 'Furniture',
'Electronics', 'Furniture', 'Electronics', 'Furniture'],
'Sales': [50000, 30000, 40000, 35000, 45000, 25000, 48000, 28000]
}
df = pd.DataFrame(data)# Create a DataFrame

pivot_table = pd.pivot_table(df, values='Sales', index='Region',
    columns='ProductCategory', aggfunc='sum', fill_value=0) # Create a pivot table

print("Pivot Table: Total Sales by Product Category in Each Region")# Display


the pivot table
print(pivot_table)

Explanation:

1.Dataset:

●​ Includes columns for Region, ProductCategory, and Sales.

2.Pivot Table:

●​ pd.pivot_table() creates a pivot table:


○​ values='Sales': The column to aggregate.
○​ index='Region': Rows will represent regions.
○​ columns='ProductCategory': Columns will represent product
categories.
○​ aggfunc='sum': Aggregates sales data using the sum.
○​ fill_value=0: Fills missing values with 0.

3.Output:

●​ The pivot table shows the total sales for each product category in each
region.

Output:​

Ex16: Add a new column to calculate the percentage contribution of each product to total sales.

Code:
import pandas as pd

data = {# Sample sales data


'Region': ['North', 'North', 'South', 'South', 'East', 'East', 'West', 'West'],
'ProductCategory': ['Electronics', 'Furniture', 'Electronics', 'Furniture',
'Electronics', 'Furniture', 'Electronics', 'Furniture'],
'Sales': [50000, 30000, 40000, 35000, 45000, 25000, 48000, 28000]
}

df = pd.DataFrame(data) # Create a DataFrame

total_sales = df['Sales'].sum() # Calculate total sales

df['PercentageContribution'] = (df['Sales'] / total_sales) * 100 # Add a new column for percentage contribution

print("Sales Data with Percentage Contribution:") # Display the updated


DataFrame
print(df)

Explanation:

Total Sales:

●​ The sum() function calculates the total sales across all rows.

Percentage Contribution:

●​ Each product's percentage contribution is calculated


●​ The result is added as a new column to the DataFrame.

Output:

●​ The DataFrame now includes an additional column for the percentage


contribution.
5: Advanced Visualization

Combining Multiple Plots


Ex17: Create a subplot with two charts:
A line chart showing sales trends over time.
A bar chart showing monthly revenue for the same period.

Creating a subplot with two charts — a line chart for sales trends over time and a bar chart for
monthly revenue:

Code:

import matplotlib.pyplot as plt


months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
sales = [250, 300, 350, 400, 450, 500]
revenue = [1000, 1200, 1500, 1600, 1800, 2000]

fig, axs = plt.subplots(2, 1, figsize=(8, 6)) # Create a figure with 2 rows, 1 column of subplots

axs[0].plot(months, sales, marker='o', color='blue', label='Sales') # Line chart for sales trends
axs[0].set_title("Sales Trends Over Time")
axs[0].set_xlabel("Months")
axs[0].set_ylabel("Sales")
axs[0].legend()
axs[0].grid(True)

axs[1].bar(months, revenue, color='green', label='Revenue') # Bar chart for monthly revenue
axs[1].set_title("Monthly Revenue")
axs[1].set_xlabel("Months")
axs[1].set_ylabel("Revenue")
axs[1].legend()

plt.tight_layout()# Adjust layout for better spacing

plt.show()# Display the plots

Explanation:

​ 1.​ Data:
●​ months: Represents the x-axis labels.
●​ sales and revenue: Represent y-values for the line and bar charts,
respectively.
​ 2.​ Subplots:
●​ plt.subplots(2, 1): Creates a figure with 2 rows and 1 column of
plots.
●​ axs[0]: Refers to the first subplot (line chart).
●​ axs[1]: Refers to the second subplot (bar chart).
​ 3.​ Line Chart:
●​ Plots sales trends over time with plot().
●​ Includes a title, labels, and grid.
​ 4.​ Bar Chart:
●​ Plots monthly revenue with bar().
​ 5.​ plt.tight_layout():
●​ Ensures there’s no overlap between subplots.

Output:

●​ The first chart shows a line plot of sales trends over months.
●​ The second chart shows a bar chart of monthly revenue.

Ex18: Overlay a scatter plot on a line chart to show individual sales transactions along a
trend line.

To overlay a scatter plot on a line chart, we use the plot() function to draw the trend line and the
scatter() function to display individual data points. Below is the Python implementation:

Code:
import matplotlib.pyplot as plt
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'] # Data
sales_trend = [250, 300, 350, 400, 450, 500] # Trend line data
sales_transactions = [260, 290, 340, 410, 460, 490] # Individual sales transactions

plt.figure(figsize=(8, 5))# Create the plot

plt.plot(months, sales_trend, color='blue', marker='o', label='Sales Trend') # Line


chart for sales trend

plt.scatter(months, sales_transactions, color='red', label='Individual Transactions',


s=100) # Scatter plot for individual sales transactions

plt.title("Sales Trend with Individual Transactions") # Add titles and labels


plt.xlabel("Months")
plt.ylabel("Sales")
plt.legend()

plt.grid(True) # Add grid for better visualization

plt.show() # Show the plot

Explanation:

​ 1.Data:
●​ months: Represents the time period on the x-axis.
●​ sales_trend: Represents the overall sales trend over the months.
●​ sales_transactions: Represents specific sales transactions as individual
data points.
​ 2. Visualization:
●​ The line chart is plotted using plt.plot() to show the overall sales trend.
●​ The scatter plot is overlaid using plt.scatter() to highlight individual sales
transactions.
​ 3. Customization:
●​ The line is colored blue with circular markers for better visualization.
●​ Scatter points are colored red and slightly larger (s=100) for emphasis.
●​ Labels, legend, and grid are added to make the plot easy to interpret.

Output:

The resulting plot consists of:


​ • A blue line representing the overall sales trend.
​ • Red scatter points representing specific sales transactions.
Customizing Visualizations

Ex19: Customize a Seaborn heatmap to display a correlation matrix with annotations.

To visualize a correlation matrix with annotations, we use Seaborn’s heatmap() function. Below
is the implementation:

Code:

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Sample data
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 4, 3, 2, 1],
    'C': [2, 3, 4, 5, 6],
    'D': [10, 9, 8, 7, 6]
}

df = pd.DataFrame(data)  # Create a DataFrame

correlation_matrix = df.corr()  # Compute the correlation matrix

plt.figure(figsize=(8, 6))  # Plot the heatmap
sns.heatmap(
    correlation_matrix,
    annot=True,               # Add annotations
    cmap="coolwarm",          # Use a diverging colormap
    fmt=".2f",                # Format the correlation values
    linewidths=0.5,           # Add grid lines between cells
    cbar_kws={"shrink": 0.8}  # Customize the color bar
)

plt.title("Correlation Matrix Heatmap", fontsize=16)  # Add title

plt.show()  # Show the plot

Explanation:

​ 1. Data:
​ ​ • A sample dataset is created using a dictionary and converted into a
pandas DataFrame.
​ ​ • The correlation matrix is computed using df.corr().
​ 2. Seaborn Heatmap:
​ • The sns.heatmap() function is used to visualize the correlation matrix.
​ • annot=True: Displays the correlation values within the cells.
​ • cmap="coolwarm": Uses a diverging colormap to highlight positive and
negative correlations.
​ • fmt=".2f": Limits the correlation values to 2 decimal places.
​ • linewidths=0.5: Adds grid lines between cells.
​ • cbar_kws={"shrink": 0.8}: Shrinks the color bar for better fit.
​ 3. Customization:
​ • A figure size of 8x6 is set using plt.figure().
​ • The plot includes a title for clarity.

Output:

The heatmap displays:


●​ A matrix where the cells show the correlation values between pairs of
variables.
●​ Positive correlations are represented in shades of red, and negative
correlations are in shades of blue.
●​ Annotated values within each cell for easy interpretation.
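
Because a correlation matrix is symmetric, an optional refinement is to hide the redundant upper triangle using the mask argument of sns.heatmap(). A brief sketch, assuming the correlation_matrix computed above:

import numpy as np

# Mask the upper triangle so each correlation is shown only once
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
sns.heatmap(correlation_matrix, mask=mask, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Lower-Triangle Correlation Heatmap")
plt.show()
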
Ex20: Use Matplotlib to create a stacked area chart showing cumulative sales for multiple
products.

A stacked area chart visualizes cumulative data for multiple categories over a specific time
period. Below is the implementation:

Code:

import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
product_A = [100, 150, 200, 250, 300, 350]
product_B = [80, 120, 160, 200, 240, 280]
product_C = [50, 70, 90, 110, 130, 150]

plt.figure(figsize=(8, 5))  # Create a stacked area chart
plt.stackplot(months, product_A, product_B, product_C,
              labels=['Product A', 'Product B', 'Product C'],
              colors=['#ff9999', '#66b3ff', '#99ff99'])

# Add labels, legend, and title
plt.xlabel("Months")
plt.ylabel("Cumulative Sales")
plt.title("Cumulative Sales for Multiple Products")
plt.legend(loc='upper left')

plt.show()  # Show the chart


Explanation:

1.Data:
●​ months: Represents the time period on the x-axis.
●​ product_A, product_B, product_C: Represent the sales of three products
over the months.
2. Visualization:
●​ plt.stackplot() is used to create the stacked area chart.
●​ Each product’s sales are stacked on top of the others to show cumulative
sales over time.
3. Customization:
●​ labels: Adds a legend to identify each product.
●​ colors: Specifies colors for each product’s area.
●​ The chart includes labels for the x-axis, y-axis, and a title.

Output:

The stacked area chart displays:


●​ Monthly cumulative sales for three products.
●​ A clear visual representation of how each product contributes to total sales
over time.
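
If the goal is to compare shares rather than absolute totals, the same data can be rescaled to percentages before stacking. A small optional sketch, reusing the months and product lists above (the percentage conversion is the only new step):

import numpy as np

# Convert each month's sales to a percentage of that month's total
totals = np.array(product_A) + np.array(product_B) + np.array(product_C)
share_A = np.array(product_A) / totals * 100
share_B = np.array(product_B) / totals * 100
share_C = np.array(product_C) / totals * 100

plt.stackplot(months, share_A, share_B, share_C,
              labels=['Product A', 'Product B', 'Product C'])
plt.ylabel("Share of Sales (%)")
plt.legend(loc='upper left')
plt.show()
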
6: Time Series Analysis

Time Series Data Handling


Ex21: Load a dataset with date columns, and:
Convert the date column to a Pandas datetime object.
Extract the year, month, and day into separate columns.

Code:

import pandas as pd

df = pd.read_csv('sample_dataset.csv')

# Convert the 'date_column' to a Pandas datetime object
df['date_column'] = pd.to_datetime(df['date_column'])

# Extract the year, month, and day into separate columns
df['year'] = df['date_column'].dt.year
df['month'] = df['date_column'].dt.month
df['day'] = df['date_column'].dt.day

print(df.head())  # Display the updated DataFrame

Explanation:

1.​ Load Data: Reads a CSV file into a DataFrame with pd.read_csv().
2.​ Convert Dates: Converts date_column to a datetime object using
pd.to_datetime().
3.​ Extract Components: Extracts year, month, and day into separate
columns with .dt.
4.​ View Changes: Displays the updated DataFrame with print(df.head()).
5.​ Reusable: Works for any dataset with a date column.
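
The .dt accessor also exposes other convenient components, such as the weekday name and the quarter. A small optional sketch, assuming the same df (the new column names are illustrative):

# Additional date components via the .dt accessor
df['weekday'] = df['date_column'].dt.day_name()  # e.g., 'Monday'
df['quarter'] = df['date_column'].dt.quarter     # 1 to 4
print(df[['date_column', 'weekday', 'quarter']].head())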

Output:
Ex22: Resample a time series dataset to calculate monthly averages.

Code:

import pandas as pd

df = pd.read_csv('sample_dataset.csv')  # Read the sample CSV file
df['date_column'] = pd.to_datetime(df['date_column'])  # Ensure proper datetime format
df.set_index('date_column', inplace=True)  # Index by date for time-based operations
monthly_avg = df['temperature'].resample('M').mean()  # Compute monthly averages

print(monthly_avg)  # Display the monthly averages

Explanation:

1.​ Convert Dates: Ensure the date_column is a Pandas datetime object.


2.​ Set Index: Set the date_column as the DataFrame index for resampling.
3.​ Resample Monthly: Use resample('M') to group data by month and
calculate the mean for the temperature column.
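
resample() accepts other frequency strings and multiple aggregations as well. An optional sketch, assuming the same date-indexed df and its temperature column:

# Weekly mean and maximum of the same column in one pass
weekly_stats = df['temperature'].resample('W').agg(['mean', 'max'])
print(weekly_stats.head())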

Output:

Visualizing Time Series


Ex23: Plot a time series of daily sales data with a rolling 7-day average overlay.

Code:

import pandas as pd
import matplotlib.pyplot as plt

# Sample daily sales data
data = {
    "date": pd.date_range(start="2025-01-01", periods=30, freq="D"),
    "sales": [100, 120, 130, 140, 150, 110, 115, 130, 140, 145,
              150, 155, 160, 170, 175, 180, 190, 195, 200, 205,
              210, 220, 230, 235, 240, 250, 255, 260, 270, 280]
}

df = pd.DataFrame(data)  # Create a DataFrame

# Convert the 'date' column to datetime format and set it as the index
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

# Calculate the rolling 7-day average
df['rolling_avg'] = df['sales'].rolling(window=7).mean()

# Plot the time series and rolling average
plt.figure(figsize=(10, 6))
plt.plot(df.index, df['sales'], label='Daily Sales', color='blue')
plt.plot(df.index, df['rolling_avg'], label='7-Day Rolling Average', color='orange',
         linewidth=2)
plt.title('Daily Sales with 7-Day Rolling Average')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.grid(True)
plt.show()

Explanation:

1.​ Sample Data: A date range and corresponding sales data are generated.
2.​ Convert Dates: The date column is converted to a datetime format and set as the
DataFrame index.
3.​ Rolling Average: A 7-day rolling mean is calculated using
.rolling(window=7).mean().
4.​ Plotting: Both the daily sales and the rolling average are plotted with labels and
gridlines for clarity.
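
The first six rolling values are NaN because a full 7-day window is not yet available; if partial windows are acceptable, min_periods can relax this. A small optional sketch using the same df (the column name is illustrative):

# Allow partial windows at the start instead of NaN values
df['rolling_avg_partial'] = df['sales'].rolling(window=7, min_periods=1).mean()
print(df[['sales', 'rolling_avg', 'rolling_avg_partial']].head(8))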

Output:
Ex24: Highlight weekends and holidays on a sales time series plot using custom markers.

Code:

import pandas as pd
import matplotlib.pyplot as plt

# Sample daily sales data
data = {
    "date": pd.date_range(start="2025-01-01", periods=30, freq="D"),
    "sales": [100, 120, 130, 140, 150, 110, 115, 130, 140, 145,
              150, 155, 160, 170, 175, 180, 190, 195, 200, 205,
              210, 220, 230, 235, 240, 250, 255, 260, 270, 280]
}

holidays = ["2025-01-06", "2025-01-15"]  # List of holidays (custom dates)

df = pd.DataFrame(data)  # Create a DataFrame

# Convert the 'date' column to datetime and set it as the index
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

# Identify weekends and holidays
df['is_weekend'] = df.index.weekday >= 5  # Saturday (5) and Sunday (6)
df['is_holiday'] = df.index.isin(pd.to_datetime(holidays))

plt.figure(figsize=(12, 6))  # Plot the time series
plt.plot(df.index, df['sales'], label='Daily Sales', color='blue')

# Highlight weekends
plt.scatter(
    df.index[df['is_weekend']],
    df['sales'][df['is_weekend']],
    color='orange', label='Weekends', s=100, marker='o'
)

# Highlight holidays
plt.scatter(
    df.index[df['is_holiday']],
    df['sales'][df['is_holiday']],
    color='red', label='Holidays', s=150, marker='*'
)

# Customize the plot
plt.title('Sales Time Series with Weekends and Holidays Highlighted')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.grid(True)
plt.show()

Explanation:

Generate Data: Create a DataFrame with date and sales columns.


Define Holidays: Specify custom holiday dates in the holidays list.
Identify Weekends and Holidays:

●​ weekday >= 5 checks for Saturdays and Sundays.


●​ isin() checks if a date is in the list of holidays.

Plot Sales Data:

●​ Plot the sales time series as a line.


●​ Use scatter to overlay weekend and holiday markers with different styles.

Customize Appearance: Add titles, labels, legends, and gridlines.
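
An alternative to scatter markers is to shade weekend dates as vertical bands with axvspan(). A brief optional sketch using the same df:

# Shade each weekend day as a light band behind the sales line
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['sales'], color='blue', label='Daily Sales')
for day in df.index[df['is_weekend']]:
    plt.axvspan(day, day + pd.Timedelta(days=1), color='orange', alpha=0.2)
plt.legend()
plt.show()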

Output:

Advanced-Level Exercises:

7: Data Cleaning and Transformation

Advanced Cleaning

Ex25: Load a messy dataset with inconsistent date formats, standardize the date format
across the dataset, and split full names into separate "First Name" and "Last Name"
columns.
Code:

import pandas as pd

# Sample messy dataset
data = {'Full Name': ['John Doe', 'Jane Smith', 'Robert Brown', 'Emily Davis'],
        'Date': ['2023/01/05', '05-02-2022', 'March 3, 2021', '2020.04.15']}
df = pd.DataFrame(data)

def clean_data(df):  # Function to clean the data
    # format='mixed' (pandas 2.0+) parses each value individually; older versions infer per value by default
    df['Date'] = pd.to_datetime(df['Date'], format='mixed', errors='coerce')
    df[['First Name', 'Last Name']] = df['Full Name'].str.split(' ', n=1, expand=True)
    df.drop(columns=['Full Name'], inplace=True)
    return df

df_cleaned = clean_data(df)
print("Cleaned Dataset:")
print(df_cleaned)

Explanation:

Create a sample dataset with inconsistent date formats and a single "Full Name"
column. Use pd.to_datetime() (with format='mixed' on pandas 2.0+) to convert all date
formats into a consistent format (default: yyyy-mm-dd). Use str.split() to split the "Full
Name" column into "First Name" and "Last Name". Drop the original "Full Name" column
after splitting. Output the cleaned dataset with standardized dates and separate name columns.

Output:

Ex26: Remove outliers from a numeric column using the IQR method.

Code:

import pandas as pd
import numpy as np

data_numeric = {'Values': [10, 15, 14, 102, 13, 12, 16, 17, 108, 11]}
df_numeric = pd.DataFrame(data_numeric)

def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)  # First quartile (25th percentile)
    Q3 = df[column].quantile(0.75)  # Third quartile (75th percentile)
    IQR = Q3 - Q1                   # Interquartile range
    # Keep only rows inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
    df = df[(df[column] >= (Q1 - 1.5 * IQR)) & (df[column] <= (Q3 + 1.5 * IQR))]
    return df

df_no_outliers = remove_outliers(df_numeric, 'Values')
print("Data after outlier removal:")
print(df_no_outliers)

Explanation:

Create a dataset containing some extreme values (outliers) in a numeric column.
Calculate the first quartile (Q1) and third quartile (Q3) using quantile(). Compute the
interquartile range (IQR) as Q3 - Q1. Filter out rows whose values lie outside the range
[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]. Output the cleaned dataset after removing outliers.
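
Instead of dropping outlier rows, they can also be capped at the IQR fences (sometimes called winsorizing) with clip(). A small optional sketch reusing the same quartile logic on df_numeric:

Q1 = df_numeric['Values'].quantile(0.25)
Q3 = df_numeric['Values'].quantile(0.75)
IQR = Q3 - Q1

# Cap extreme values at the IQR fences rather than removing the rows
df_capped = df_numeric.copy()
df_capped['Values'] = df_capped['Values'].clip(lower=Q1 - 1.5 * IQR, upper=Q3 + 1.5 * IQR)
print(df_capped)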

Output:

Data Transformation

Ex27: Normalize numeric columns to a range of 0–1 using MinMaxScaler from sklearn.

Code:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data_to_normalize = {'Feature1': [10, 20, 30, 40, 50], 'Feature2': [5, 10, 15, 20, 25]}
df_normalize = pd.DataFrame(data_to_normalize)

def normalize_data(df, columns):
    scaler = MinMaxScaler()                          # Initialize MinMaxScaler
    df[columns] = scaler.fit_transform(df[columns])  # Apply min-max scaling to the columns
    return df

df_normalized = normalize_data(df_normalize, ['Feature1', 'Feature2'])
print("Normalized Data:")
print(df_normalized)

Explanation:

Create a dataset with numeric features that need to be normalized. Use MinMaxScaler
from sklearn to scale the data between 0 and 1. Apply the scaling to the selected columns
and return the transformed data. Output the normalized dataset, where all values are
scaled between 0 and 1.
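
MinMaxScaler applies the formula x_scaled = (x - min) / (max - min) to each column. For intuition, the same result can be reproduced directly in Pandas; a small sketch using a fresh copy of the sample data:

# Manual min-max scaling, equivalent to MinMaxScaler's formula
raw = pd.DataFrame(data_to_normalize)
manual = (raw - raw.min()) / (raw.max() - raw.min())
print(manual)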

Output:

Ex28: Encode categorical variables using one-hot encoding.

Code:

import pandas as pd

data_categorical = {'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue']}
df_categorical = pd.DataFrame(data_categorical)

def encode_categorical(df, columns):
    return pd.get_dummies(df, columns=columns)  # Convert categories into binary indicator columns

df_encoded = encode_categorical(df_categorical, ['Color'])
print("One-Hot Encoded Data:")
print(df_encoded)

Explanation:

Create a dataset with categorical data (e.g., colors). Use pd.get_dummies() to convert the
categorical variable into binary columns representing each category. Output the dataset
with the categorical data replaced by binary (one-hot encoded) variables.
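
For models that are sensitive to collinearity, one dummy column per variable is often dropped; pd.get_dummies() supports this with drop_first=True. An optional sketch on the same data:

# Drop the first category to avoid perfectly collinear dummy columns
df_encoded_reduced = pd.get_dummies(df_categorical, columns=['Color'], drop_first=True)
print(df_encoded_reduced)
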
Output:

8: Machine Learning Basics

Preprocessing for Machine Learning

Ex29: Load a dataset, split it into training and testing sets, and standardize numeric
features.

Code:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data_ml = {
    'Feature1': [10, 20, 30, 40, 50, 60],
    'Feature2': [5, 10, 15, 20, 25, 30],
    'Target': [1, 0, 1, 0, 1, 0]
}
df_ml = pd.DataFrame(data_ml)

def preprocess_ml_data(df, target):
    X = df.drop(columns=[target])  # Separate features from the target
    y = df[target]
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    scaler = StandardScaler()                # Initialize StandardScaler
    X_train = scaler.fit_transform(X_train)  # Standardize training data
    X_test = scaler.transform(X_test)        # Standardize testing data
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = preprocess_ml_data(df_ml, 'Target')
print("Standardized Training Data:")
print(X_train)
print("Standardized Testing Data:")
print(X_test)
Explanation:

Create a dataset with features and a target variable for machine learning. Use
train_test_split() to split the data into training and testing sets. Use StandardScaler to
standardize the numeric features so they have a mean of 0 and a standard deviation of 1.
Output the standardized training and testing data for use in machine learning models.
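
StandardScaler implements z = (x - mean) / std, with the mean and standard deviation learned from the training data only, which is why fit_transform() is applied to X_train and plain transform() to X_test. A tiny sketch of the equivalent manual calculation, for intuition only:

import numpy as np

x = np.array([10, 20, 30, 40, 50], dtype=float)
z = (x - x.mean()) / x.std()  # the same formula StandardScaler applies per column
print(z)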

Output:

Ex30: Train a linear regression model to predict house prices and display:
Model coefficients.
Mean squared error (MSE) on the test set.

Dataset:

size,bedrooms,age,price
1500,3,10,300000
1800,4,15,350000
2400,4,20,400000
3000,5,5,500000
3500,5,8,550000
2200,3,12,370000
2000,3,10,320000

Code:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

data = pd.read_csv('house_prices.csv')
X = data[['size', 'bedrooms', 'age']]  # Define feature variables
y = data['price']                      # Define target variable

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()  # Train the model
model.fit(X_train, y_train)

y_pred = model.predict(X_test)  # Predictions

print("Model Coefficients:", model.coef_)  # Model coefficients
mse = mean_squared_error(y_test, y_pred)   # Mean squared error
print("Mean Squared Error:", mse)

Explanation:

Create a dataset with features (size, bedrooms, age) and target (price) for house prices.
Use train_test_split() to create training and testing sets. Train a linear regression model
on the training data (model.fit()), then use the trained model to predict house prices on
the testing set. Output the model coefficients and calculate the mean squared error (MSE)
to assess the model's predictive ability.
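
Beyond the coefficients and MSE, the fitted intercept and the R^2 score are often reported as well. A short optional sketch, assuming the model, y_test, and y_pred objects from the code above:

from sklearn.metrics import r2_score

print("Intercept:", model.intercept_)                # Predicted price when all features are zero
print("R^2 on test set:", r2_score(y_test, y_pred))  # Proportion of variance explained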

Output:

Clustering

Ex31: Use K-Means clustering to segment customers based on purchase behavior and
visualize the clusters using Seaborn.

Dataset:

purchase_freq,spending
5,200
10,500
2,150
15,600
7,400
3,250
8,450
20,1000
4,300
6,350

Code:

import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

data = pd.read_csv('customer_data.csv')
X = data[['purchase_freq', 'spending']]  # Feature selection (purchase behavior)

kmeans = KMeans(n_clusters=3, random_state=42)  # Apply K-Means
data['Cluster'] = kmeans.fit_predict(X)

# Plot the clusters using Seaborn
sns.scatterplot(x='purchase_freq', y='spending', hue='Cluster', data=data, palette='viridis')
plt.title('Customer Segmentation')
plt.show()

Explanation:

The dataset is loaded using pd.read_csv('customer_data.csv'). It contains customer-related
data, including purchase_freq (frequency of purchases) and spending (amount spent). The
relevant features (purchase_freq, spending) are selected for clustering.
KMeans(n_clusters=3, random_state=42) initializes the K-Means clustering model with 3
clusters, and .fit_predict(X) trains the model and assigns a cluster label (0, 1, or 2) to each
data point. A scatter plot is created using sns.scatterplot(), with data points color-coded by
their assigned cluster, and plt.show() displays the clustered data visually.
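
The choice of 3 clusters is an assumption; a common check is the elbow method, which plots the K-Means inertia (within-cluster sum of squares) for a range of k values and looks for the bend. A brief optional sketch using the same X:

# Elbow method: inertia for k = 1..6
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(range(1, 7), inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()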

Output:

9: Optimization and Automation

Performance Optimization

Ex32: Optimize a slow-running notebook using vectorized Pandas operations instead of loops.

Code:

import pandas as pd
import numpy as np

# Dataset with one million rows
data = pd.DataFrame({'A': np.random.rand(1000000), 'B': np.random.rand(1000000)})

# Slow operation using a loop
result = []
for i in range(len(data)):
    result.append(data['A'][i] + data['B'][i])

# Optimized operation using vectorization
data['C'] = data['A'] + data['B']

Explanation:

A DataFrame with columns A and B, containing random values, is created. The slow
version uses a for loop that iterates through every row, adds each value in column A to
the corresponding value in column B, and stores the results in a list. The optimized
version replaces the loop with the Pandas vectorized operation data['A'] + data['B'],
which performs the addition in one step and reduces execution time significantly.
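
To verify the speedup on a given machine, both approaches can be timed with time.perf_counter(); a small sketch (exact numbers will vary by hardware):

import time

start = time.perf_counter()
result = [data['A'][i] + data['B'][i] for i in range(len(data))]  # loop-style addition
loop_time = time.perf_counter() - start

start = time.perf_counter()
data['C'] = data['A'] + data['B']  # vectorized addition
vector_time = time.perf_counter() - start

print(f"Loop: {loop_time:.2f} s, Vectorized: {vector_time:.4f} s")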

Ex33: Use Dask to process a large dataset that exceeds memory limits.

Dataset:

A,B
0.1,0.2
0.3,0.4
0.5,0.6
0.7,0.8
0.9,1.0
1.1,1.2
1.3,1.4
1.5,1.6​

Code:

import dask.dataframe as dd

data = dd.read_csv('large_dataset.csv')  # Load a large dataset with Dask
mean_value = data['B'].mean().compute()  # Perform an operation (e.g., calculate the mean of a column)
print("Mean:", mean_value)

Explanation:

dd.read_csv('large_dataset.csv') loads a dataset that may be too large for memory. Unlike
Pandas, Dask processes the data in chunks (partitions). data['B'].mean() builds the
computation but does not run it immediately, because Dask uses lazy evaluation;
.compute() triggers the actual computation when the result is needed.
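
Other familiar Pandas-style operations work the same way on the Dask DataFrame and stay lazy until computed; a small optional sketch reusing the data object above (the variable names are illustrative):

# Additional lazy operations on the same Dask DataFrame
total_a = data['A'].sum()        # builds a task graph, does not run yet
high_b = data[data['B'] > 0.5]   # lazy row filter
print("Sum of A:", total_a.compute())     # runs the computation in chunks
print("Rows with B > 0.5:", len(high_b))  # len() also triggers computation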

Output:

Automating Reports

Ex34: Generate a summary report for a dataset using nbconvert to export the notebook as a PDF
or HTML file.

Code:

!jupyter nbconvert --to html my_notebook.ipynb

Explanation:

The command jupyter nbconvert --to html my_notebook.ipynb is run in a terminal or
command prompt (or in a notebook cell, prefixed with !). It converts a Jupyter Notebook
(.ipynb) into an HTML report; using --to pdf instead produces a PDF, which requires a
LaTeX installation. The tool extracts all content (code, Markdown, and outputs) and
formats it into the chosen file type.
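
A few commonly used variations of the same command (flag availability depends on the installed nbconvert version, and --to pdf requires a LaTeX installation):

!jupyter nbconvert --to pdf my_notebook.ipynb              # Export as PDF instead of HTML
!jupyter nbconvert --to html --execute my_notebook.ipynb   # Re-run all cells before exporting
!jupyter nbconvert --to html --no-input my_notebook.ipynb  # Hide code cells, keep outputs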

Output:

10: Real-World Applications

Case Study: Sales Analysis

Ex35: Analyze a dataset with fields like Order ID, Customer ID, Product, Quantity, Price,
and Region:
Identify the top 10 customers by revenue.
Calculate average sales per product category.
Visualize monthly sales trends using Matplotlib.

Code:

import pandas as pd
import matplotlib.pyplot as plt
data = {
'Order ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Customer ID': [101, 102, 103, 101, 104, 105, 102, 106, 107, 108],
'Product': ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'A', 'B', 'C'],
'Quantity': [5, 3, 4, 2, 5, 6, 4, 3, 2, 7],
'Price': [20, 30, 20, 15, 30, 15, 20, 20, 30, 15],
'Region': ['North', 'South', 'East', 'North', 'West', 'South', 'East', 'West', 'North',
'East'],
'Date': pd.to_datetime(['2023-01-15', '2023-02-10', '2023-02-20', '2023-03-05',
'2023-03-10', '2023-04-12', '2023-05-14', '2023-06-20', '2023-07-10',
'2023-08-02'])
}
df = pd.DataFrame(data)  # Create a DataFrame
df['Revenue'] = df['Quantity'] * df['Price']  # Calculate the revenue (Quantity * Price)
top_10_customers = df.groupby('Customer ID')['Revenue'].sum().nlargest(10)
average_sales_per_product = df.groupby('Product')['Revenue'].mean()

df['Month'] = df['Date'].dt.to_period('M') # Extract month from date


monthly_sales = df.groupby('Month')['Revenue'].sum()
plt.figure(figsize=(10, 6))# Plotting the results
plt.subplot(2, 1, 1)# Monthly sales trend
monthly_sales.plot(kind='line', marker='o', color='b')
plt.title('Monthly Sales Trend')
plt.xlabel('Month')
plt.ylabel('Total Sales')
plt.subplot(2, 1, 2)# Top 10 customers by revenue
top_10_customers.plot(kind='bar', color='g')
plt.title('Top 10 Customers by Revenue')
plt.xlabel('Customer ID')
plt.ylabel('Total Revenue')
plt.tight_layout()
plt.show()
print("Top 10 Customers by Revenue:")# Output results
print(top_10_customers)
print("\nAverage Sales per Product Category:")
print(average_sales_per_product)

Explanation:

A DataFrame simulating sales data (Order ID, Customer ID, Product, Quantity, Price,
Region, and Date) is created. A new column Revenue is computed as df['Quantity'] *
df['Price']. df.groupby('Customer ID')['Revenue'].sum().nlargest(10) groups the data by
customer and extracts the top 10 revenue-generating customers, while
df.groupby('Product')['Revenue'].mean() computes the average revenue per product
category. df['Date'].dt.to_period('M') extracts the month from the date column, and
df.groupby('Month')['Revenue'].sum() aggregates total revenue per month. plt.subplot()
creates subplots for the monthly sales trend (line plot) and the top 10 customers by
revenue (bar chart), and plt.show() displays the plots.

Output:



Case Study: Customer Segmentation

Ex36: Perform RFM (Recency, Frequency, Monetary) analysis on a customer dataset:


Segment customers into "High-Value," "Medium-Value," and "Low-Value" groups.
Visualize customer segments using bar and pie charts.

Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Sample dataset
data = {'Customer ID': [101, 102, 103, 104, 105, 106],
        'Last Purchase Date': ['2023-12-01', '2023-11-25', '2023-12-10', '2023-10-15',
                               '2023-12-05', '2023-11-30'],
        'Purchase Frequency': [5, 2, 7, 3, 6, 4],
        'Monetary Value': [200, 300, 150, 400, 250, 100]}

df = pd.DataFrame(data)  # Convert to DataFrame
df['Last Purchase Date'] = pd.to_datetime(df['Last Purchase Date'])  # Convert to datetime

# Calculate RFM metrics
current_date = pd.to_datetime('2023-12-31')
df['Recency'] = (current_date - df['Last Purchase Date']).dt.days  # Recency: days since last buy
# Frequency and Monetary are already available in the dataset

# Assign quantiles to create RFM score groups
recency_quantiles = pd.qcut(df['Recency'], 3, labels=["High", "Medium", "Low"])
frequency_quantiles = pd.qcut(df['Purchase Frequency'], 3, labels=["Low", "Medium", "High"])
monetary_quantiles = pd.qcut(df['Monetary Value'], 3, labels=["Low", "Medium", "High"])

# Create RFM segments
df['Recency Segment'] = recency_quantiles
df['Frequency Segment'] = frequency_quantiles
df['Monetary Segment'] = monetary_quantiles

# Combine to create the RFM group (convert to string first)
df['RFM Segment'] = (df['Recency Segment'].astype(str) +
                     df['Frequency Segment'].astype(str) +
                     df['Monetary Segment'].astype(str))

def segment_customer(rfm):  # Segment customers based on RFM score
    if rfm == 'HighHighHigh':
        return 'High-Value'
    elif rfm == 'HighHighMedium' or rfm == 'HighMediumHigh':
        return 'Medium-Value'
    else:
        return 'Low-Value'

df['Customer Segment'] = df['RFM Segment'].apply(segment_customer)

# Visualize customer segments
segment_counts = df['Customer Segment'].value_counts()

# Bar chart for customer segments
plt.figure(figsize=(8, 6))
plt.bar(segment_counts.index, segment_counts.values, color=['green', 'blue', 'red'])
plt.title('Customer Segments by Value')
plt.xlabel('Customer Segment')
plt.ylabel('Count')
plt.show()

# Pie chart for customer segments
plt.figure(figsize=(6, 6))
plt.pie(segment_counts, labels=segment_counts.index, autopct='%1.1f%%',
        colors=['green', 'blue', 'red'])
plt.title('Customer Segments Distribution')
plt.show()

Explanation:

A DataFrame with Customer ID, Last Purchase Date, Purchase Frequency, and Monetary
Value is created. pd.to_datetime() ensures correct date calculations, and
(current_date - df['Last Purchase Date']).dt.days computes the number of days since the
last purchase. pd.qcut() splits each metric into quantiles (High, Medium, Low), and every
customer receives a combined RFM segment based on the Recency, Frequency, and
Monetary values. A function then assigns labels: "High-Value" for the best customers,
"Medium-Value" for moderate customers, and "Low-Value" for the least active customers.
A bar chart displays the customer segment distribution, and a pie chart represents the
segment proportions.

Output:


Case Study: Time Series Forecasting

Ex37: Use ARIMA or Prophet to forecast monthly sales data for the next 12 months.

Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from prophet import Prophet
from datetime import timedelta

# Generate monthly data
date_rng = pd.date_range(start='2015-01-01', end='2023-12-01', freq='MS')
np.random.seed(42)
sales = np.random.randint(100, 500, size=(len(date_rng)))

df = pd.DataFrame({'ds': date_rng, 'y': sales})  # Create DataFrame

# Plot the sales data
plt.figure(figsize=(10, 5))
plt.plot(df['ds'], df['y'], marker='o', linestyle='-')
plt.title('Monthly Sales Data')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.xticks(rotation=45)
plt.grid()
plt.show()

# Fit an ARIMA model with order (p, d, q)
order = (5, 1, 0)
model_arima = ARIMA(df['y'], order=order)
arima_result = model_arima.fit()

# Forecast the next 12 months with ARIMA
forecast_arima = arima_result.forecast(steps=12)
forecast_dates = pd.date_range(start=df['ds'].iloc[-1] + timedelta(days=30), periods=12, freq='MS')
arima_forecast_df = pd.DataFrame({'ds': forecast_dates, 'y': forecast_arima})  # Convert ARIMA forecast to DataFrame

# Initialize and fit the Prophet model
prophet_model = Prophet()
prophet_model.fit(df)

# Create a future DataFrame for the next 12 months and predict
future = prophet_model.make_future_dataframe(periods=12, freq='MS')
forecast_prophet = prophet_model.predict(future)

# Plot the Prophet forecast
prophet_model.plot(forecast_prophet)
plt.title("Prophet Forecast")
plt.show()

# Show the forecasts for the next 12 months
print("ARIMA Forecast:\n", arima_forecast_df)
print("\nProphet Forecast:\n", forecast_prophet[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail(12))

Explanation:

A date range (2015-2023) and corresponding sales values are created. plt.plot()
visualizes historical sales trends. ARIMA(df['y'], order=(5,1,0)) fits an ARIMA model
with specified parameters. .forecast(steps=12) predicts sales for the next 12 months.
forecast_dates = pd.date_range(start=df['ds'].iloc[-1] + timedelta(days=30),
periods=12, freq='MS') generates future dates. A DataFrame is created with predicted
sales values. Prophet().fit(df) initializes and trains the Prophet model. Future sales are
predicted and plotted.

Output:


Ex38: Evaluate the forecast model using metrics like RMSE and MAE, and plot the
forecast with confidence intervals.

Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error
from statsmodels.tsa.arima.model import ARIMA

# Generate sample monthly data
dates = pd.date_range(start="2020-01-01", periods=36, freq="M")
sales = np.random.randint(100, 200, size=(36,))

data = pd.DataFrame({"Date": dates, "Sales": sales})  # Convert to DataFrame
data.set_index("Date", inplace=True)

train = data[:-12]  # Training data (first 24 months)
test = data[-12:]   # Test data (last 12 months)

model = ARIMA(train, order=(5, 1, 0))  # Fit ARIMA(p, d, q) model
model_fit = model.fit()

forecast = model_fit.forecast(steps=12)  # Forecast the next 12 months

# Compute RMSE and MAE
rmse = np.sqrt(mean_squared_error(test, forecast))
mae = mean_absolute_error(test, forecast)
print(f"RMSE: {rmse}")
print(f"MAE: {mae}")

# Plot the forecast
plt.figure(figsize=(10, 6))
plt.plot(train.index, train["Sales"], label="Training Data", color="blue")
plt.plot(test.index, test["Sales"], label="Test Data", color="red")
plt.plot(test.index, forecast, label="Forecast", color="green")

# Plot 95% confidence intervals
conf_int = model_fit.get_forecast(steps=12).conf_int(alpha=0.05)
plt.fill_between(test.index, conf_int.iloc[:, 0], conf_int.iloc[:, 1], color='gray', alpha=0.3)

plt.title("Sales Forecast with ARIMA")
plt.xlabel("Date")
plt.ylabel("Sales")
plt.legend()
plt.show()

Explanation:​

Generate monthly sales data from January 2020 using pd.date_range() for dates
and np.random.randint() for sales values. Convert the data into a DataFrame with dates
as the index. Split the data into a training set (train with 24 months) and a test set (test
with 12 months). Use the ARIMA model with parameters (5, 1, 0) on the training data to
capture the time series pattern. Fit the model using model.fit(). Forecast the next 12
months using model_fit.forecast(); the 95% confidence interval is obtained separately
with get_forecast().conf_int(alpha=0.05). Calculate RMSE
(np.sqrt(mean_squared_error(test, forecast))) and MAE (mean_absolute_error(test,
forecast)) to evaluate model accuracy. Plot the training data, test data, and forecasted
values using plt.plot(). Display the confidence intervals with plt.fill_between() to show
forecast uncertainty. Print RMSE and MAE values. Display the forecast plot with
confidence intervals.
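
For reference, the two metrics reduce to simple formulas: RMSE = sqrt(mean((y - y_hat)^2)) and MAE = mean(|y - y_hat|). A tiny NumPy sketch with made-up numbers that mirrors what the sklearn functions compute:

import numpy as np

y_true = np.array([120, 135, 150])
y_pred = np.array([110, 140, 145])

rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))  # root mean squared error
mae_manual = np.mean(np.abs(y_true - y_pred))           # mean absolute error
print(rmse_manual, mae_manual)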

Output:
11: Advanced Visualization and Storytelling

Interactive Visualizations

Ex39: Create an interactive dashboard using Plotly and Dash to display key performance
indicators (KPIs) and trends.

Code:
import dash
from dash import dcc, html
from dash.dependencies import Input, Output
import plotly.graph_objects as go
import pandas as pd

# Sample data
kpi_data = {
    "Month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "Sales": [1000, 1500, 2000, 1800, 2200, 2400],
    "Profit": [200, 300, 500, 400, 600, 700],
}

df = pd.DataFrame(kpi_data)

app = dash.Dash(__name__)  # Initialize the Dash app

app.layout = html.Div([
    html.H1("Interactive KPI Dashboard", style={"textAlign": "center"}),

    html.Div([
        html.Div([
            html.H3("Total Sales"),
            html.P(id="total-sales", style={"fontSize": "24px", "color": "blue"}),
        ], style={"padding": "20px", "border": "1px solid black", "flex": 1}),

        html.Div([
            html.H3("Total Profit"),
            html.P(id="total-profit", style={"fontSize": "24px", "color": "green"}),
        ], style={"padding": "20px", "border": "1px solid black", "flex": 1}),
    ], style={"display": "flex", "justifyContent": "space-around"}),

    html.Div([
        dcc.Graph(id="kpi-trends")
    ])
])

@app.callback(
    [Output("total-sales", "children"), Output("total-profit", "children"),
     Output("kpi-trends", "figure")],
    [Input("kpi-trends", "id")]
)
def update_dashboard(_):
    total_sales = df["Sales"].sum()
    total_profit = df["Profit"].sum()

    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df["Month"], y=df["Sales"],
                             mode="lines+markers", name="Sales"))
    fig.add_trace(go.Scatter(x=df["Month"], y=df["Profit"],
                             mode="lines+markers", name="Profit"))

    fig.update_layout(title="Sales and Profit Trends", xaxis_title="Month",
                      yaxis_title="Value")

    return f"${total_sales}", f"${total_profit}", fig

if __name__ == "__main__":
    app.run_server(debug=True)

Explanation:

●​ The update_dashboard callback dynamically calculates total sales and profit and
updates the trend graph using Plotly's Scatter plots.
●​ Flex layout is used to align KPIs side-by-side, and the graph displays trends over
months.

Output:
