Question Bank - DEV

UNIT I EXPLORATORY DATA ANALYSIS

2 Marks
1. What is Exploratory Data Analysis (EDA)? (Nov Dec 2022)
EDA is a statistical approach used to analyze and summarize the main characteristics of a
dataset. It involves techniques like visualization and summary statistics to understand data
patterns and detect anomalies. For instance, using histograms and box plots to examine the
distribution of salaries in a dataset helps identify trends and outliers. EDA is crucial before
applying complex models to ensure data is clean and relevant.
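As a minimal sketch of such a first look (assuming a pandas DataFrame df with a numeric 'Salary' column):
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2)   # df is an existing DataFrame (assumed)
ax1.hist(df['Salary'], bins=20)        # shape of the salary distribution
ax2.boxplot(df['Salary'])              # median, quartiles, and outliers at a glance
plt.show()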

2. Why is EDA significant in data science?


EDA is crucial because it allows data scientists to understand the underlying structure of the
data, uncover relationships, and detect issues like missing values or outliers. For example,
plotting scatter plots of employee salary vs. age can reveal if there are clusters or trends,
which helps in feature selection and preprocessing before model training.

3. What are the main steps involved in EDA?


The main steps in EDA include data collection, data cleaning, data transformation, and data
visualization. For example, after collecting data on employee performance, you clean it by
handling missing values, transform it by normalizing salary ranges, and then use
visualization tools like histograms to explore salary distribution.
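A compressed sketch of these steps (file and column names are illustrative assumptions):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('employees.csv')       # collect: load the raw data (file name assumed)
df = df.dropna(subset=['Salary'])       # clean: drop rows with missing salaries
df['Salary_norm'] = (df['Salary'] - df['Salary'].min()) / (df['Salary'].max() - df['Salary'].min())  # transform: min-max normalize
plt.hist(df['Salary_norm'], bins=20)    # visualize: salary distribution
plt.show()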

4. How does EDA differ from classical and Bayesian analysis?


EDA focuses on summarizing and visualizing data without prior assumptions. Classical
analysis involves hypothesis testing and inferential statistics based on predefined models.
Bayesian analysis incorporates prior beliefs with data to update probabilities. For example,
EDA might use scatter plots to explore relationships, while Bayesian methods use probability
distributions to make predictions.

5. What software tools are commonly used for EDA?


Common EDA tools include Python libraries like Pandas, Matplotlib, and Seaborn, and
software like R and Tableau. For instance, Python’s Pandas library is used for data
manipulation, while Matplotlib and Seaborn are used for creating visualizations like line plots
and heatmaps.

6. What is a line chart, and when should it be used?


A line chart displays data points connected by lines, showing trends over time. It is useful for
visualizing changes in a variable over a continuous range. For example, a line chart of
monthly sales figures helps track performance trends and seasonal effects.
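A minimal sketch with assumed monthly figures:
import matplotlib.pyplot as plt

months = list(range(1, 13))
sales = [12, 15, 14, 18, 21, 25, 24, 22, 19, 17, 16, 20]   # assumed sample data
plt.plot(months, sales)
plt.xlabel('Month')
plt.ylabel('Sales')
plt.show()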

7. How does a bar chart differ from a line chart?


A bar chart represents categorical data with rectangular bars, where the length of each bar
indicates the value of the category. In contrast, a line chart shows trends over time with
points connected by lines. For instance, a bar chart might display sales by product category,
while a line chart shows sales trends over several months.
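For example, with assumed category totals:
import matplotlib.pyplot as plt

categories = ['Electronics', 'Clothing', 'Groceries']   # assumed sample data
sales = [250, 180, 320]
plt.bar(categories, sales)
plt.ylabel('Sales')
plt.show()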

8. What is the purpose of a scatter plot?


A scatter plot visualizes the relationship between two continuous variables, showing how
one variable changes with respect to another. For example, a scatter plot of employee age
vs. salary can reveal if older employees tend to have higher salaries.
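A minimal sketch (assuming a DataFrame df with 'Age' and 'Salary' columns):
import matplotlib.pyplot as plt

plt.scatter(df['Age'], df['Salary'])   # df assumed to exist
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()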

9. What does a bubble chart represent?


A bubble chart is an extension of a scatter plot where data points are represented as
bubbles. The size of the bubble represents a third variable. For example, plotting employees’
salaries (y-axis), experience (x-axis), and bubble size for job level can show how job level
correlates with salary and experience.
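In Matplotlib, a bubble chart is a scatter plot whose marker size argument s varies per point; a sketch with assumed column names:
import matplotlib.pyplot as plt

plt.scatter(df['Experience'], df['Salary'], s=df['JobLevel'] * 40, alpha=0.5)   # bubble size from job level
plt.xlabel('Experience (years)')
plt.ylabel('Salary')
plt.show()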

10. What is an area plot, and how is it used?


An area plot displays data with the area between the line and the x-axis filled with color. It is
used to show cumulative totals over time. For example, an area plot showing accumulated
sales revenue over months can highlight total growth.
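One way to draw this with Matplotlib is plt.fill_between (cumulative figures assumed):
import matplotlib.pyplot as plt

months = range(1, 13)
revenue = [10, 22, 35, 50, 68, 85, 100, 118, 133, 150, 170, 195]   # assumed cumulative totals
plt.fill_between(months, revenue)
plt.xlabel('Month')
plt.ylabel('Cumulative revenue')
plt.show()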

11. How do pie charts help in data analysis?


Pie charts represent proportions of a whole as slices of a circle. They are useful for showing
the relative size of categories. For instance, a pie chart showing the distribution of
employees across different departments helps visualize which department has the most
employees.
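A minimal sketch with assumed department headcounts:
import matplotlib.pyplot as plt

labels = ['Sales', 'Engineering', 'HR', 'Finance']   # assumed departments
counts = [40, 25, 20, 15]
plt.pie(counts, labels=labels, autopct='%1.0f%%')    # show each slice's percentage
plt.show()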

12. What information does a polar chart convey?


A polar chart displays data in a circular layout, with values represented as points along radial
lines. It is used for cyclical data, such as seasonal trends. For example, a polar chart
showing monthly sales can reveal seasonal variations throughout the year.
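A minimal sketch using Matplotlib's polar projection (monthly sales assumed):
import numpy as np
import matplotlib.pyplot as plt

sales = [12, 14, 13, 17, 21, 25, 26, 24, 20, 18, 15, 13]        # assumed monthly sales
theta = np.linspace(0, 2 * np.pi, len(sales), endpoint=False)   # one angle per month
ax = plt.subplot(projection='polar')
ax.plot(theta, sales)
plt.show()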

13. What is the role of a histogram in EDA?


A histogram shows the distribution of a continuous variable by dividing data into bins and
plotting the frequency of values within each bin. For example, a histogram of employee ages
can reveal the age distribution within a company, helping identify age-related trends.
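For example (assuming a DataFrame df with an 'Age' column):
import matplotlib.pyplot as plt

plt.hist(df['Age'], bins=10)   # df assumed to hold employee records
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()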

14. What is the purpose of a lollipop chart?


A lollipop chart is similar to a bar chart but uses lines with markers at the end to represent
data values. It provides a clear view of differences between categories. For example, a
lollipop chart comparing the salaries of employees across different departments can make it
easier to see variations.
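Matplotlib has no dedicated lollipop function, but plt.stem gives the same effect; a sketch with assumed data:
import matplotlib.pyplot as plt

departments = ['Sales', 'Engineering', 'HR']   # assumed sample data
avg_salary = [52000, 68000, 48000]
plt.stem(range(len(departments)), avg_salary)
plt.xticks(range(len(departments)), departments)
plt.ylabel('Average salary')
plt.show()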

15. How do you perform data merging in Python?


Data merging combines two datasets based on a common key. In Python, this is done using
the merge() function in Pandas. For example, merging employee data with department data
on a common DepartmentID helps create a comprehensive dataset for analysis.
merged_df = pd.merge(employee_df, department_df, on='DepartmentID')

16. What is data reshaping, and how is it done?


Data reshaping involves altering the structure of a dataset, such as pivoting or unpivoting
data. In Pandas, this is done using functions like pivot() and melt(). For example, reshaping
sales data from a long format to a wide format using pivot() makes it easier to analyze sales
by region.
reshaped_df = df.pivot(index='Date', columns='Region', values='Sales')

17. How do pivot tables help in data analysis?


Pivot tables summarize data by organizing it into a multi-dimensional table. They aggregate
data based on different dimensions. For example, creating a pivot table to summarize total
sales by region and month helps analyze performance across different periods and
locations.
pivot_table = pd.pivot_table(df, values='Sales', index='Region', columns='Month', aggfunc='sum')

18. What are cross-tabulations, and how are they useful? (Nov Dec 2022)
Cross-tabulations summarize the relationship between two categorical variables by creating
a contingency table. They help analyze how variables interact. For example, a
cross-tabulation of employee department and gender shows the distribution of genders
within each department.
cross_tab = pd.crosstab(df['Department'], df['Gender'])

19. What are groupby mechanics in pandas?


Groupby mechanics involve splitting data into groups based on a key, applying a function to
each group, and then combining the results. For example, grouping employee data by
department and calculating the mean salary for each department provides insights into
departmental salary structures.
grouped = df.groupby('Department')['Salary'].mean()

20. How do you perform data aggregation using pandas?


Data aggregation combines multiple aggregation functions on grouped data to summarize
information. In pandas, this is achieved using the agg() method. For instance, aggregating
employee data by department to calculate both average salary and total years at the
company can be done as follows:
aggregated = df.groupby('Department').agg({'Salary': 'mean', 'YearsAtCompany': 'sum'})

16 Marks
1. What is the primary purpose of EDA? What are the differences between EDA and
classical and Bayesian analysis? Discuss in detail. (Nov Dec 2022)
Introduction to EDA
● Definition and significance of Exploratory Data Analysis (2 marks)
Primary Purpose of EDA
● Overview of objectives such as understanding data structure, detecting patterns, and
identifying anomalies (2 marks)
Classical Analysis in EDA
● Techniques used in classical EDA (e.g., summary statistics, visualization) (2 marks)
Bayesian Analysis in EDA
● Introduction to Bayesian methods in EDA (2 marks)
Comparison Between Classical and Bayesian Analysis
● Philosophical differences: Frequentist vs. Bayesian perspective (2 marks)
Application Differences
● Practical differences in applying EDA with classical vs. Bayesian methods (2 marks)
Advantages and Limitations
● Strengths and weaknesses of classical and Bayesian approaches in EDA (2 marks)
Conclusion
● Summary of key points and the importance of selecting the right approach based on
the context (2 marks)

2. Explain various transformation techniques in EDA. (Nov Dec 2022)


Loading the CSV File (1 mark)
● Demonstrating the ability to load a CSV file into a pandas DataFrame.
Data Transformation (2 marks)
● Creating a new feature (Revenue_per_Unit) to demonstrate understanding of feature
engineering.
Data Cleansing: Handling Duplicates (2 marks)
● Detecting and removing duplicate rows to ensure data accuracy.
Data Cleansing: Correcting Errors (1 mark)
● Correcting a data entry error (e.g., fixing a spelling mistake in a categorical variable).
Date Conversion (2 marks)
● Converting a string date to a datetime object and extracting components (year,
month, day).
Handling Missing Values (2 marks)
● Checking for NaN values and applying methods to handle missing data (removing
rows, filling with mean).
Applying Descriptive Statistics (2 marks)
● Summarizing the dataset using descriptive statistics to gain insights into the data
distribution.
Data Refactoring: Renaming Columns (1 mark)
● Renaming columns for clarity or consistency in the dataset.
Data Refactoring: Changing Data Types (1 mark)
● Demonstrating the ability to change the data type of a column.
Dropping Unnecessary Columns (1 mark)
● Removing irrelevant columns to focus on meaningful data.
Refactoring Timezones (1 mark)
● Converting datetime values to a consistent timezone for accurate time-based
analysis.
3. Explain the Fundamentals of Exploratory Data Analysis (EDA)
1. Definition and Purpose of EDA (4 marks)
o Define EDA and explain its primary purpose in data science.
2. Significance of EDA in Data Science (4 marks)
o Discuss why EDA is significant in understanding and preparing data for
further analysis.
3. Steps Involved in EDA (4 marks)
o Outline the key steps typically involved in an EDA process (e.g., data
cleaning, visualization, summarization).
4. Comparison with Classical and Bayesian Analysis (4 marks)
o Compare EDA with classical statistical analysis and Bayesian analysis in
terms of approach and usage.
4. Discuss the Role of Visual Aids in EDA
1. Importance of Visualization in EDA (4 marks)
o Explain why visualization is crucial in EDA and how it helps in understanding
data.
2. Types of Visual Aids Used in EDA (6 marks)
o Describe different types of visual aids used in EDA, such as histograms, box
plots, scatter plots, and pair plots.
3. Application of Visual Aids in Identifying Patterns and Outliers (6 marks)
o Provide examples of how visual aids can be used to identify patterns, trends,
and outliers in datasets.
5. Explain Grouping and Aggregation Techniques in EDA
1. Concept of Grouping Datasets (4 marks)
o Define and explain the concept of grouping datasets and why it is useful in
EDA.
2. Methods of Data Aggregation (6 marks)
o Describe different methods of data aggregation and provide examples of their
application in EDA.
3. Use of Pivot Tables and Cross-Tabulations (6 marks)
o Explain how pivot tables and cross-tabulations can be used in EDA to
summarize and analyze data, with examples.
6. You are a data scientist at a retail company analyzing the sales data for the past year. The
dataset includes information such as product ID, sales volume, sales revenue, store location,
and date. Explain how you would use EDA to understand the sales patterns in different
regions and identify potential factors contributing to high or low sales.
Marks Distribution:
Initial Data Exploration and Cleaning (4 marks)
o Describe how you would clean the data (e.g., handling missing values,
outliers).
Visualizing Sales Trends (4 marks)
o Explain which visual aids you would use to visualize sales trends over time
and across different regions.
Identifying Key Factors Influencing Sales (4 marks)
● Discuss how EDA could help identify key factors such as product type, pricing, or
store location that influence sales.
Actionable Insights for the Business (4 marks)
● Describe the insights you would provide to the business based on your EDA findings
and how these could inform decision-making.
7. A telecom company wants to segment its customer base to target specific groups for
marketing campaigns. The dataset includes customer demographics, usage patterns, and
churn data. Describe how you would use EDA to segment customers and identify which
segments are most likely to churn.
1. Data Exploration and Cleaning (3 marks)
o Explain how you would prepare the dataset for analysis, including dealing
with missing data and anomalies.
2. Customer Segmentation Using EDA (5 marks)
o Discuss the EDA techniques you would use to identify distinct customer
segments based on usage patterns and demographics.
3. Visualizing Customer Segments (4 marks)
o Describe the types of visual aids you would use to represent different
customer segments and their characteristics.
4. Identifying High-Risk Segments (4 marks)
o Explain how you would use EDA to identify customer segments that are most
likely to churn and suggest strategies to retain them.

UNIT II VISUALIZING USING MATPLOTLIB

2 Marks
1. What is the basic syntax for importing Matplotlib in Python?
To import Matplotlib, you use:
import matplotlib.pyplot as plt
This imports the pyplot module from Matplotlib, which provides functions for creating plots.
2. How do you create a simple line plot in Matplotlib?
Use the plt.plot() function to create a line plot:
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Line Plot')
plt.show()
This plots y as a function of x.
3. What is the difference between plt.plot() and plt.scatter() for scatter plots?
plt.plot() draws line plots by default but can mimic a scatter plot by setting a marker and omitting the line style (e.g., plt.plot(x, y, 'o')).
plt.scatter() is designed specifically for scatter plots:
plt.scatter(x, y, color='blue', label='Data Points')
plt.scatter() offers per-point control of marker size and color, making it more flexible for customizing scatter plots.
4. How do you add error bars to a plot in Matplotlib?
Use plt.errorbar() to add error bars:
plt.errorbar(x, y, yerr=0.1, fmt='o')
Here, yerr specifies the vertical error for each point.
5. How can you create a density plot in Matplotlib?
Use plt.hist() with density=True:
plt.hist(data, bins=30, density=True, alpha=0.6, color='g')
This normalizes the histogram so that the bar areas sum to 1, approximating a density estimate. For a smooth curve, a kernel density estimate (e.g., Seaborn's sns.kdeplot) can be used instead.
6. What is the purpose of histograms in data visualization?
Histograms show the distribution of a single variable. They divide the data into bins and
count the number of observations in each bin:
plt.hist(data, bins=20)
This displays the frequency distribution of data.
7. How do you add a legend to a Matplotlib plot?
Use plt.legend() and provide labels:
plt.plot(x, y, label='Line')
plt.legend()
Legends help identify different data series.
8. How do you customize colors in a Matplotlib plot?
Use the color argument:
plt.plot(x, y, color='red')
You can also use color codes or names.
9. How do you create multiple subplots in Matplotlib?
Use plt.subplots():
fig, axs = plt.subplots(2, 2)
axs[0, 0].plot(x, y)
axs[0, 1].scatter(x, y)
This creates a 2x2 grid of subplots.
10. How do you add text to a Matplotlib plot?
Use plt.text():
plt.text(x, y, 'Label', fontsize=12)
This adds text at the specified (x, y) position.
11. How do you customize plot styles globally in Matplotlib?
Use plt.rcParams to set default styles:
plt.rcParams['lines.color'] = 'blue'
This changes the default line color for all plots.
12. How do you create a 3D plot in Matplotlib?
Use the mplot3d toolkit (importing Axes3D registers the '3d' projection):
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot(x, y, z)
This creates a 3D line plot.
13. How do you create a basic map using Basemap?
Use the Basemap class to create a map projection:
from mpl_toolkits.basemap import Basemap
m = Basemap(projection='ortho', lat_0=50, lon_0=-100)   # avoid shadowing Python's built-in map()
m.drawcoastlines()
This sets up an orthographic projection and draws coastlines.
14. How does Seaborn differ from Matplotlib in terms of ease of use?
Seaborn provides a high-level interface and integrates directly with pandas DataFrames,
making it easier to create complex statistical plots with less code.
15. What types of plots can you create with Seaborn?
Seaborn supports various plots including:
● Scatter plots (sns.scatterplot)
● Line plots (sns.lineplot)
● Histograms (sns.histplot)
● Box plots (sns.boxplot)
● Pair plots (sns.pairplot)
16. How do you create a scatter plot in Seaborn?
Use sns.scatterplot():
sns.scatterplot(data=df, x='x_column', y='y_column')
This creates a scatter plot with the specified columns.
17. How can you create a histogram in Seaborn?
Use sns.histplot():
sns.histplot(data=df['column'])
This visualizes the distribution of data in a column.
18. How do you create a box plot in Seaborn?
Use sns.boxplot():
sns.boxplot(data=df, x='category_column', y='value_column')
This shows the distribution and potential outliers for the values.
19. What does a pair plot in Seaborn show?
sns.pairplot() creates a matrix of scatter plots for each pair of numeric variables in a DataFrame, helping
visualize relationships:
sns.pairplot(df)
20. How can you use Seaborn to analyze marathon finishing times?
You can use various Seaborn plots:
● Scatter Plot: To explore the relationship between age and finishing time.
● Histogram: To examine the distribution of finishing times.
● Box Plot: To compare finishing times across different genders.
sns.scatterplot(data=df, x='age', y='finishing_time')
sns.histplot(df['finishing_time'])
sns.boxplot(data=df, x='gender', y='finishing_time')

16 Marks
1. How do you plot a line on a scatter plot in Python? Illustrate with code. (Nov Dec 2022)
1. Import Libraries (2 marks):
● Import matplotlib.pyplot and numpy.
2. Generate Data (2 marks):
● Create sample data for scatter plot (x, y).
● Create data for the line plot (x_line, y_line).
3. Create Scatter Plot (2 marks):
● Use plt.scatter() to plot the scatter data.
4. Overlay the Line (2 marks):
● Use plt.plot() to add the line to the scatter plot.
5. Add Labels and Legend (4 marks):
● Label the x-axis and y-axis.
● Add a title to the plot.
● Include a legend to differentiate between scatter and line plots.
6. Show Plot (2 marks):
● Use plt.show() to display the final plot.
7. Code Quality and Clarity (2 marks):
● Ensure code is well-organized and clear.
● Properly comment the code to explain each section (a worked sketch follows).
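A minimal worked sketch covering the steps above (data generated purely for illustration):
import matplotlib.pyplot as plt
import numpy as np

# Sample data: noisy points around a straight line (assumed for illustration)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 2, size=x.size)

# Line to overlay on the scatter
x_line = np.linspace(0, 10, 100)
y_line = 2 * x_line + 1

plt.scatter(x, y, label='Data points')                       # scatter layer
plt.plot(x_line, y_line, color='red', label='Trend line')    # line layer
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Overlaid on a Scatter Plot')
plt.legend()
plt.show()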

2. Discuss how Seaborn helps to visualize statistical relationships. Illustrate with
code and an example. (Nov Dec 2022)
1. Introduction to Seaborn for Statistical Visualization (2 Marks)
2. Visualizing Distributions (4 Marks)
3. Exploring Relationships between Variables (4 Marks)
4. Comparing Distributions Across Categories (4 Marks)
5. Visualizing Pairwise Relationships (2 Marks)
6. Analyzing Relationships with Seaborn in Practice (4 Marks)

3. Write Python code to import Matplotlib and create a simple line plot. Annotate the plot with
axes labels, a title, and a grid.
● Code for importing Matplotlib (2 marks)
● Code for creating a simple line plot (4 marks)
● Adding axes labels and title (4 marks)
● Adding a grid to the plot (3 marks)
● Explanation of each step (3 marks)

4. Write Python code to create a scatter plot and demonstrate how to visualize errors in data
points using error bars.
● Code for creating a scatter plot (5 marks)
● Adding error bars to the plot (5 marks)
● Explanation of error bar parameters and their role (3 marks)
● Plot labeling and customization (3 marks)

5. Explain and demonstrate with Python code how to create density plots and contour plots
using Matplotlib.
● Explanation of density plots (3 marks)
● Code for creating a density plot (4 marks)
● Explanation of contour plots (3 marks)
● Code for creating a contour plot (4 marks)
● Visual comparison between the two plots (2 marks)

6. Describe how to create histograms in Matplotlib and explain the role of legends in
visualizing multiple datasets. Write code to plot two datasets with histograms and legends.
● Explanation of histograms and their importance (3 marks)
● Code for creating a histogram for one dataset (4 marks)
● Plotting and comparing two datasets in a histogram (3 marks)
● Adding a legend to differentiate the datasets (3 marks)
● Explanation of histogram bins and legends (3 marks)
7. Write Python code to create a figure with two subplots: one showing a line plot and the
other showing a scatter plot. Customize both plots by adding text annotations and adjusting
colors.
● Code for creating subplots (2x1 grid) (4 marks)
● Line plot with annotations and customization (4 marks)
● Scatter plot with annotations and color customization (4 marks)
● Explanation of subplots and customization options (4 marks)
8. Explain how to plot geographic data using Basemap in Matplotlib. Additionally, show how
Seaborn can be used to visualize relationships in a dataset. Write code for both
visualizations.
● Explanation of Basemap and geographic data visualization (3 marks)
● Code for creating a simple map using Basemap (5 marks)
● Explanation of Seaborn and its advantages over Matplotlib (3 marks)
● Code for visualizing relationships with Seaborn (scatterplot or heatmap) (5 marks)
9. Imagine you are analyzing stock prices of two companies over a 10-day period. Write
Python code to plot the stock prices using Matplotlib's simple line plot. Customize the plot
with proper labels, titles, and a legend to differentiate between the two companies.
● Loading or defining stock price data (2 marks)
● Plotting the stock prices with a line plot (4 marks)
● Customizing the plot with labels and title (3 marks)
● Adding a legend to differentiate the companies (3 marks)
● Explanation of how line plots help in stock market analysis (4 marks)
10. Assume you're tasked with analyzing temperature variations in a city over a month,
where each day's average temperature has a margin of error. Write Python code to create a
scatter plot of daily temperatures, incorporating error bars for uncertainty.
● Loading or defining temperature data with errors (3 marks)
● Plotting a scatter plot of temperatures (4 marks)
● Adding error bars to represent uncertainty (4 marks)
● Customizing the scatter plot (colors, labels, title) (3 marks)
● Explanation of the role of error bars in real-time weather analysis (2 marks)

11. You are studying population density in different regions of a city. Using the coordinates of
people’s locations, create a density plot and a contour plot to visualize population
distribution. Provide a comparison between both visualizations.
● Loading or defining population coordinates data (3 marks)
● Creating a density plot for population distribution (4 marks)
● Creating a contour plot for population distribution (4 marks)
● Customization of both plots (title, labels, color) (3 marks)
● Comparison between density and contour plots in population analysis (2 marks)

12. A company wants to analyze the distribution of sales for two different products over the
last year. Using histograms, visualize the sales data of both products on the same plot, and
add a legend to distinguish them.
● Loading or defining sales data for two products (3 marks)
● Creating a histogram for one product (3 marks)
● Overlaying the histogram for the second product (3 marks)
● Adding a legend to distinguish between the two products (3 marks)
● Explanation of how histograms help in sales analysis (4 marks)

13. Imagine you are monitoring weather data in real-time. Create a figure with two subplots:
one subplot for daily temperature variations (line plot) and the other for daily humidity
variations (scatter plot). Add text annotations to highlight the highest and lowest values in
each plot.
● Loading or defining temperature and humidity data (3 marks)
● Creating subplots (2x1 grid layout) (4 marks)
● Line plot for temperature variations with annotations (3 marks)
● Scatter plot for humidity variations with annotations (3 marks)
● Explanation of how subplots can enhance real-time weather monitoring (3 marks)
14. Assume you are tracking earthquake data (latitude, longitude, and magnitude) globally.
Write Python code to plot earthquake locations on a world map using Basemap. Customize
the plot by coloring points based on earthquake magnitude. Additionally, use Seaborn to
create a heatmap showing the frequency of earthquakes by region.
● Loading or defining earthquake data (coordinates and magnitude) (3 marks)
● Plotting earthquake locations on a world map using Basemap (4 marks)
● Customizing the plot with colors based on magnitude (4 marks)
● Creating a heatmap using Seaborn to show earthquake frequency by region (3
marks)
● Explanation of the significance of geographic data in earthquake monitoring (2
marks)

UNIT III UNIVARIATE ANALYSIS

2 Marks
1. What is a distribution in data analysis?
A distribution describes how data values are spread or clustered over a range. It shows the
frequency or probability of different values, helping analysts understand patterns and
variability. Common types include normal, uniform, and skewed distributions. Understanding
distributions is critical in identifying data trends, anomalies, and making inferences about a
dataset's underlying structure.
2. How do you define a variable in data analysis?

A variable is a characteristic or attribute that can take on different values in a dataset.
Variables can be continuous (like height or temperature) or categorical (like gender or
country). In data analysis, variables are central as they represent the features or metrics
being analyzed to draw insights or build models.
3. What are numerical summaries in data analysis?

Numerical summaries are statistical metrics that describe key properties of a dataset. These
include measures of central tendency (mean, median, mode) and measures of spread
(range, variance, standard deviation). They help provide a concise overview of the dataset,
allowing analysts to understand its overall distribution and variability.
4. How is the mean calculated in a dataset?

The mean is calculated by summing all the values in a dataset and dividing by the total
number of values. It provides the average value and is useful for understanding the central
tendency of the data. However, it is sensitive to outliers, which may skew the mean.
5. List three measures of central tendency.
o Mean: The arithmetic average of the dataset.
o Median: The middle value in a sorted dataset.
o Mode: The most frequent value in the dataset.
These measures provide insights into where most data points lie.
6. What is the purpose of standard deviation?

Standard deviation measures the amount of variation or dispersion in a dataset. A low
standard deviation indicates that data points are close to the mean, while a high standard
deviation shows that data points are spread out. It's a key measure for assessing risk and
variability in data.
7. How do you standardize a variable?

Standardizing a variable involves transforming it so that it has a mean of 0 and a standard
deviation of 1. This is done by subtracting the mean from each value and dividing by the
standard deviation. Standardization allows for comparison between variables on different
scales.
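In pandas this is a one-liner; a sketch assuming a DataFrame df with a numeric 'Salary' column:
# Assuming df is a pandas DataFrame with a numeric 'Salary' column
df['Salary_std'] = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std()   # mean 0, std 1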
8. What is scaling in data analysis?

Scaling is the process of adjusting the range of values in a dataset. Techniques like min-max
scaling and standardization are used to bring data within a specific range, often between 0
and 1. Scaling is essential in machine learning models that are sensitive to different value
ranges, such as gradient-based algorithms.
9. List two methods of scaling data.
o Min-Max Scaling: Transforms values to a 0-1 range.
o Z-Score Standardization: Converts data to a standard normal distribution with mean 0 and standard deviation 1.
Both methods are commonly used to prepare data for algorithms, as sketched below.
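A minimal sketch of both methods, assuming a pandas DataFrame df with a numeric 'Salary' column:
df['Salary_minmax'] = (df['Salary'] - df['Salary'].min()) / (df['Salary'].max() - df['Salary'].min())   # values in [0, 1]
df['Salary_zscore'] = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std()                          # mean 0, std 1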


10. What is inequality in data distribution?

Inequality in data distribution refers to an uneven spread of a quantity across observations,
where a small share of the observations accounts for a disproportionately large share of the
total. Inequality is often measured using metrics like the Gini coefficient or the Lorenz curve,
which quantify disparities in data or economic indicators.
11. How is inequality measured in datasets?

Inequality is measured using tools like the Gini coefficient, which assesses the degree of
inequality in a distribution. The Lorenz curve is another visual method to show inequality,
typically used in economics to show income or wealth distribution. A higher Gini coefficient
indicates greater inequality.
12. List two common inequality measures.
● Gini Coefficient: Quantifies inequality in a distribution.
● Lorenz Curve: Graphically represents inequality by showing the cumulative distribution of a variable, often used for wealth.
These measures help assess disparities within a dataset.
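A sketch of one common way to compute the Gini coefficient from raw values (sample incomes assumed):
import numpy as np

def gini(values):
    # Gini via the sorted-values formula: G = 2*sum(i*x_i)/(n*sum(x)) - (n+1)/n
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)
    return 2 * np.sum(ranks * x) / (n * np.sum(x)) - (n + 1) / n

incomes = [20000, 35000, 50000, 80000, 200000]   # assumed sample data
print(gini(incomes))   # 0 = perfect equality; values near 1 = extreme inequality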
13. What is smoothing in time series analysis?
Smoothing refers to techniques that reduce noise and highlight trends in time series data.
Methods like moving averages or exponential smoothing smooth fluctuations and provide a
clearer view of long-term trends, making it easier to analyze underlying patterns.
14. How does a moving average smooth time series data?

A moving average smooths time series data by averaging a fixed number of consecutive
data points to reduce short-term fluctuations. It helps reveal underlying trends by eliminating
random noise, making it useful for trend analysis and forecasting in time series data.
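With pandas, a rolling mean applies this directly; a sketch with assumed values:
import pandas as pd

s = pd.Series([12, 15, 14, 18, 21, 25, 24, 22, 19, 17, 16, 20])   # assumed monthly values
smoothed = s.rolling(window=3).mean()   # 3-period moving average; first two values are NaN
print(smoothed)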
15. List two smoothing techniques in time series.
● Moving Average: Averages data over a set period to smooth out fluctuations.
● Exponential Smoothing: Weights recent data more heavily to better capture trends.
These methods help reduce noise and highlight trends in time series analysis.
16. What is the difference between variance and standard deviation?

Variance measures the average squared deviation from the mean, while standard deviation
is the square root of the variance. Standard deviation is more interpretable since it has the
same unit as the data, while variance is in squared units, making it harder to interpret.
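A quick numeric check (sample data assumed; NumPy's defaults use the population formula):
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # assumed sample data
print(np.var(data))   # 4.0, in squared units
print(np.std(data))   # 2.0, the square root of the variance, in the data's units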
17. How do you interpret a high standard deviation in a dataset?

A high standard deviation indicates that the data points are spread out over a wide range,
showing significant variability in the dataset. It suggests that the dataset has large
fluctuations and is less consistent. In contrast, a low standard deviation suggests that data
points are clustered near the mean.
18. What is the purpose of a time series plot?

A time series plot visually represents data points over time, allowing for the analysis of
trends, seasonality, and patterns. It helps identify short-term fluctuations and long-term
trends, making it valuable for forecasting and understanding temporal changes in data.
19. How do you calculate a z-score?

A z-score is calculated by subtracting the mean from the data point and dividing by the
standard deviation. It indicates how many standard deviations a data point is from the mean,
helping identify outliers and standardize different variables for comparison.
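A minimal sketch with assumed data:
import numpy as np

data = np.array([10, 12, 11, 13, 40])   # assumed sample; 40 looks like an outlier
z = (data - data.mean()) / data.std()
print(z)                                 # points with |z| above roughly 2-3 are flagged as outliers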
20. List three types of time series components.
● Trend: The long-term direction of the data.
● Seasonality: Repeated patterns over time, usually within a year.
● Noise: Random fluctuations that don’t follow any pattern.
These components are crucial for time series decomposition and analysis.
16 Marks

1. Explain the concept of distribution and how it applies to real-world data, such
as income distribution in a population.
o Definition and explanation of distribution (4 marks)
o Application to real-world scenarios (income distribution) (6 marks)
o Visual representation of distribution using histograms or density plots (6
marks)
2. Describe numerical summaries of level and spread, and compute these for a
given dataset of exam scores.
o Definition of level (mean, median, mode) (4 marks)
o Explanation of spread (range, variance, standard deviation) (4 marks)
o Computation of these summaries for a dataset (8 marks)
3. What is standardization? Explain its importance in machine learning, and apply
it to a dataset with varying scales of variables.
o Explanation of standardization and its process (4 marks)
o Importance in machine learning (4 marks)
o Code or steps to standardize a dataset (8 marks)
4. Discuss the concept of inequality with reference to income distribution.
Explain how the Gini coefficient is calculated.
o Definition and explanation of inequality (4 marks)
o Application to income distribution (4 marks)
o Explanation of Gini coefficient calculation (8 marks)
5. Explain the process of smoothing in time series data and its importance.
Perform smoothing using a moving average on a given dataset.
o Explanation of smoothing and its importance (4 marks)
o Steps to apply moving averages (4 marks)
o Smoothing a sample dataset and showing results (8 marks)
6. Differentiate between scaling and standardizing. Demonstrate both methods on
a dataset with wide-ranging values.
o Explanation of scaling (4 marks)
o Explanation of standardizing (4 marks)
o Applying both techniques on a dataset (8 marks)
7. Analyze a time series dataset and explain how to identify trends and
seasonality. Apply a smoothing technique to highlight these patterns.
o Explanation of trends and seasonality (4 marks)
o Identifying these components in a time series (4 marks)
o Smoothing the dataset to highlight patterns (8 marks)
8. What is a z-score, and how is it used to detect outliers in a dataset? Calculate
the z-scores for a dataset and identify the outliers.
o Explanation of z-scores (4 marks)
o Application of z-scores for outlier detection (4 marks)
o Calculation and identification of outliers in a dataset (8 marks)
9. Explain the importance of variance and standard deviation in data analysis.
Calculate these measures for a given dataset and interpret the results.
o Definition and importance of variance (4 marks)
o Definition and importance of standard deviation (4 marks)
o Calculation and interpretation for a sample dataset (8 marks)
10. Discuss the importance of scaling data in machine learning. Perform min-max
scaling on a dataset and explain the effect on the range of the data.
● Explanation of scaling and its importance (4 marks)
● Steps for performing min-max scaling (4 marks)
● Applying min-max scaling on a dataset and observing changes (8 marks)

Scenario-based Questions


1. You have data representing the monthly income of 1,000 individuals from different
regions. Analyze the distribution of income and summarize its key numerical characteristics
such as mean, median, variance, and standard deviation.
● Loading or defining income data (3 marks)
● Creating and visualizing the distribution of income (histogram or boxplot) (4 marks)
● Calculating and interpreting the mean, median, variance, and standard deviation (5
marks)
● Explanation of how these numerical summaries describe income variability (4 marks)

2. A dataset contains test scores of 500 students from different schools. To compare
performances fairly, you are asked to scale and standardize the scores. Write Python code
to perform this task and explain the significance of scaling in this context.
● Loading or defining the test score data (2 marks)
● Explanation of the need for scaling and standardizing the data (4 marks)
● Scaling the test scores (min-max scaling) (3 marks)
● Standardizing the test scores (z-score normalization) (3 marks)
● Comparison of the results before and after scaling/standardizing (4 marks)

3. You are studying the economic inequality of a country by analyzing the income of its
citizens. Using the Gini coefficient as a measure of inequality, calculate the Gini coefficient
from the given income data and explain what the result implies about the level of inequality.
● Explanation of the Gini coefficient and its role in measuring inequality (3 marks)
● Loading or defining income data for a population (2 marks)
● Code for calculating the Gini coefficient (6 marks)
● Interpreting the result and its implications for economic inequality (5 marks)

4. You are given the daily sales data of a retail store for the past year. Apply time series
smoothing techniques to this data to reduce short-term fluctuations and make the long-term
trend clearer. Discuss the choice of smoothing method and its impact on forecasting future
sales.
● Loading or defining sales time series data (2 marks)
● Explanation of the need for smoothing in time series forecasting (3 marks)
● Applying a moving average for smoothing the data (4 marks)
● Plotting the smoothed sales data (3 marks)
● Explaining how smoothing affects sales forecasting (4 marks)

5. In a healthcare study, you have the data of patients' cholesterol levels. Investigate the
distribution of cholesterol levels and describe it using appropriate visualizations and
summary statistics (mean, median, mode, variance). Comment on whether the data is
skewed or normally distributed and the potential implications for the healthcare study.
● Loading or defining cholesterol level data (2 marks)
● Creating a histogram or boxplot to visualize the distribution (4 marks)
● Calculating and interpreting mean, median, mode, and variance (5 marks)
● Explanation of skewness or normality of the data and its healthcare implications (5
marks)
