Question bank DEV.docx
2 Marks
1. What is Exploratory Data Analysis (EDA)? (Nov Dec 2022)
EDA is a statistical approach used to analyze and summarize the main characteristics of a
dataset. It involves techniques like visualization and summary statistics to understand data
patterns and detect anomalies. For instance, using histograms and box plots to examine the
distribution of salaries in a dataset helps identify trends and outliers. EDA is crucial before
applying complex models to ensure data is clean and relevant.
18. What are cross-tabulations, and how are they useful? (Nov Dec 2022)
Cross-tabulations summarize the relationship between two categorical variables by creating
a contingency table. They help analyze how variables interact. For example, a
cross-tabulation of employee department and gender shows the distribution of genders
within each department.
cross_tab = pd.crosstab(df['Department'], df['Gender'])
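A minimal runnable sketch of the snippet above (the employee data here is hypothetical, for illustration only):

```python
import pandas as pd

# Hypothetical employee records
df = pd.DataFrame({
    "Department": ["HR", "HR", "IT", "IT", "IT", "Sales"],
    "Gender":     ["F",  "M",  "F",  "M",  "M",  "F"],
})

# Contingency table: gender counts within each department
cross_tab = pd.crosstab(df["Department"], df["Gender"])
print(cross_tab)
```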
16 Marks
1. What is the primary purpose of EDA? What are the differences between EDA with
classical and Bayesian analysis? Discuss it in detail. (Nov Dec 2022)
Introduction to EDA
● Definition and significance of Exploratory Data Analysis (2 marks)
Primary Purpose of EDA
● Overview of objectives such as understanding data structure, detecting patterns, and
identifying anomalies (2 marks)
Classical Analysis in EDA
● Techniques used in classical EDA (e.g., summary statistics, visualization) (2 marks)
Bayesian Analysis in EDA
● Introduction to Bayesian methods in EDA (2 marks)
Comparison Between Classical and Bayesian Analysis
● Philosophical differences: Frequentist vs. Bayesian perspective (2 marks)
Application Differences
● Practical differences in applying EDA with classical vs. Bayesian methods (2 marks)
Advantages and Limitations
● Strengths and weaknesses of classical and Bayesian approaches in EDA (2 marks)
Conclusion
● Summary of key points and the importance of selecting the right approach based on
the context (2 marks)
2 Marks
1. What is the basic syntax for importing Matplotlib in Python?
To import Matplotlib, you use:
import matplotlib.pyplot as plt
This imports the pyplot module from Matplotlib, which provides functions for creating plots.
2. How do you create a simple line plot in Matplotlib?
Use the plt.plot() function to create a line plot:
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.show()
16 marks
1. How to plot a line on a scatter plot in Python? Illustrate with code. (Nov Dec 2022)
1. Import Libraries (2 marks):
● Import matplotlib.pyplot and numpy.
2. Generate Data (2 marks):
● Create sample data for scatter plot (x, y).
● Create data for the line plot (x_line, y_line).
3. Create Scatter Plot (2 marks):
● Use plt.scatter() to plot the scatter data.
4. Overlay the Line (2 marks):
● Use plt.plot() to add the line to the scatter plot.
5. Add Labels and Legend (4 marks):
● Label the x-axis and y-axis.
● Add a title to the plot.
● Include a legend to differentiate between scatter and line plots.
6. Show Plot (2 marks):
● Use plt.show() to display the final plot.
7. Code Quality and Clarity (2 marks):
● Ensure code is well-organized and clear.
● Properly comment the code to explain each section.
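The steps above can be sketched as follows (the sample data and output file name are hypothetical; the Agg backend is selected so the script runs headless — use plt.show() in an interactive session):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for headless execution
import matplotlib.pyplot as plt
import numpy as np

# Sample scatter data: noisy points around the line y = 2x + 1
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 2, size=x.size)

# Line data covering the same x-range
x_line = np.linspace(0, 10, 100)
y_line = 2 * x_line + 1

plt.scatter(x, y, label="Data points")                      # scatter layer
plt.plot(x_line, y_line, color="red", label="Trend line")   # line overlay
plt.xlabel("x")
plt.ylabel("y")
plt.title("Line overlaid on a scatter plot")
plt.legend()
plt.savefig("scatter_line.png")
```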
2. Discuss how Seaborn helps to visualize statistical relationships. Illustrate with
code and an example. (Nov Dec 2022)
1. Introduction to Seaborn for Statistical Visualization (2 Marks)
2. Visualizing Distributions (4 Marks)
3. Exploring Relationships between Variables (4 Marks)
4. Comparing Distributions Across Categories (4 Marks)
5. Visualizing Pairwise Relationships (2 Marks)
6. Analyzing Relationships with Seaborn in Practice (4 Marks)
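One possible sketch for the "exploring relationships" part (the advertising-spend dataset below is invented for illustration; seaborn must be installed):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical dataset: advertising spend vs. sales across two regions
rng = np.random.default_rng(1)
spend = rng.uniform(10, 100, 80)
sales = 3 * spend + rng.normal(0, 20, 80)
region = rng.choice(["North", "South"], 80)
df = pd.DataFrame({"spend": spend, "sales": sales, "region": region})

# Relationship between two numeric variables, split by a category
ax = sns.scatterplot(data=df, x="spend", y="sales", hue="region")

# Linear trend with a confidence band, drawn on the same axes
sns.regplot(data=df, x="spend", y="sales", scatter=False, ax=ax)

ax.figure.savefig("seaborn_relationship.png")
```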
3. Write Python code to import Matplotlib and create a simple line plot. Annotate the plot with
axes labels, a title, and a grid.
● Code for importing Matplotlib (2 marks)
● Code for creating a simple line plot (4 marks)
● Adding axes labels and title (4 marks)
● Adding a grid to the plot (3 marks)
● Explanation of each step (3 marks)
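A compact sketch covering each bullet (the data y = x² is illustrative; Agg backend is used so the script runs headless):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 10, 0.5)
y = x ** 2                         # sample data: y = x squared

fig, ax = plt.subplots()
ax.plot(x, y)                      # simple line plot
ax.set_xlabel("x")                 # axes labels
ax.set_ylabel("x squared")
ax.set_title("Simple line plot")   # title
ax.grid(True)                      # grid
fig.savefig("line_plot.png")
```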
4. Write Python code to create a scatter plot and demonstrate how to visualize errors in data
points using error bars.
● Code for creating a scatter plot (5 marks)
● Adding error bars to the plot (5 marks)
● Explanation of error bar parameters and their role (3 marks)
● Plot labeling and customization (3 marks)
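A minimal error-bar sketch (measurement values and the fixed ±0.5 uncertainty are hypothetical):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical measurements with a fixed uncertainty of ±0.5 per point
x = np.arange(1, 11)
rng = np.random.default_rng(2)
y = x + rng.normal(0, 0.5, size=x.size)
yerr = np.full(x.size, 0.5)

fig, ax = plt.subplots()
# fmt="o" draws markers only; capsize adds caps to the error bars
ax.errorbar(x, y, yerr=yerr, fmt="o", capsize=4, label="measurement ± error")
ax.set_xlabel("Measurement index")
ax.set_ylabel("Value")
ax.set_title("Scatter plot with error bars")
ax.legend()
fig.savefig("errorbars.png")
```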
5. Explain and demonstrate with Python code how to create density plots and contour plots
using Matplotlib.
● Explanation of density plots (3 marks)
● Code for creating a density plot (4 marks)
● Explanation of contour plots (3 marks)
● Code for creating a contour plot (4 marks)
● Visual comparison between the two plots (2 marks)
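One way to sketch both plots side by side (sample Gaussian data; hist2d stands in for a density plot, which is a common Matplotlib approach):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

# 2-D Gaussian sample for the density plot (illustrative data)
rng = np.random.default_rng(3)
pts = rng.normal(0, 1, size=(1000, 2))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Density plot: 2-D histogram of point density
ax1.hist2d(pts[:, 0], pts[:, 1], bins=30, cmap="Blues")
ax1.set_title("Density plot (hist2d)")

# Contour plot: level curves of z = f(x, y) on a grid
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
Z = np.exp(-(X**2 + Y**2))
cs = ax2.contour(X, Y, Z, levels=8)
ax2.clabel(cs, inline=True, fontsize=8)
ax2.set_title("Contour plot")

fig.savefig("density_contour.png")
```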
6. Describe how to create histograms in Matplotlib and explain the role of legends in
visualizing multiple datasets. Write code to plot two datasets with histograms and legends.
● Explanation of histograms and their importance (3 marks)
● Code for creating a histogram for one dataset (4 marks)
● Plotting and comparing two datasets in a histogram (3 marks)
● Adding a legend to differentiate the datasets (3 marks)
● Explanation of histogram bins and legends (3 marks)
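A sketch of two overlaid histograms with a legend (the two normal samples are invented for illustration):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

# Two hypothetical datasets drawn from different normal distributions
rng = np.random.default_rng(4)
data_a = rng.normal(50, 10, 500)
data_b = rng.normal(70, 15, 500)

fig, ax = plt.subplots()
# alpha < 1 keeps overlapping bars visible; shared bins make them comparable
bins = np.linspace(0, 120, 30)
ax.hist(data_a, bins=bins, alpha=0.6, label="Dataset A")
ax.hist(data_b, bins=bins, alpha=0.6, label="Dataset B")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Two datasets on one histogram")
ax.legend()  # the legend maps each color to its dataset
fig.savefig("histograms.png")
```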
7. Write Python code to create a figure with two subplots: one showing a line plot and the
other showing a scatter plot. Customize both plots by adding text annotations and adjusting
colors.
● Code for creating subplots (2x1 grid) (4 marks)
● Line plot with annotations and customization (4 marks)
● Scatter plot with annotations and color customization (4 marks)
● Explanation of subplots and customization options (4 marks)
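A possible sketch of the 2x1 subplot layout with annotations (sine/cosine data, annotation positions, and colors are all illustrative choices):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)
y_line = np.sin(x)
rng = np.random.default_rng(5)
y_scatter = np.cos(x) + rng.normal(0, 0.1, x.size)

# 2x1 grid: top axes for the line plot, bottom for the scatter plot
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(6, 8))

ax1.plot(x, y_line, color="steelblue")
ax1.annotate("peak", xy=(np.pi / 2, 1.0),
             xytext=(2.5, 0.8), arrowprops={"arrowstyle": "->"})
ax1.set_title("Line plot")

ax2.scatter(x, y_scatter, color="darkorange")
ax2.annotate("start", xy=(x[0], y_scatter[0]),
             xytext=(1.0, 1.2), arrowprops={"arrowstyle": "->"})
ax2.set_title("Scatter plot")

fig.tight_layout()
fig.savefig("subplots.png")
```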
8. Explain how to plot geographic data using Basemap in Matplotlib. Additionally, show how
Seaborn can be used to visualize relationships in a dataset. Write code for both
visualizations.
● Explanation of Basemap and geographic data visualization (3 marks)
● Code for creating a simple map using Basemap (5 marks)
● Explanation of Seaborn and its advantages over Matplotlib (3 marks)
● Code for visualizing relationships with Seaborn (scatterplot or heatmap) (5 marks)
9. Imagine you are analyzing stock prices of two companies over a 10-day period. Write
Python code to plot the stock prices using Matplotlib's simple line plot. Customize the plot
with proper labels, titles, and a legend to differentiate between the two companies.
● Loading or defining stock price data (2 marks)
● Plotting the stock prices with a line plot (4 marks)
● Customizing the plot with labels and title (3 marks)
● Adding a legend to differentiate the companies (3 marks)
● Explanation of how line plots help in stock market analysis (4 marks)
10. Assume you're tasked with analyzing temperature variations in a city over a month,
where each day's average temperature has a margin of error. Write Python code to create a
scatter plot of daily temperatures, incorporating error bars for uncertainty.
● Loading or defining temperature data with errors (3 marks)
● Plotting a scatter plot of temperatures (4 marks)
● Adding error bars to represent uncertainty (4 marks)
● Customizing the scatter plot (colors, labels, title) (3 marks)
● Explanation of the role of error bars in real-time weather analysis (2 marks)
11. You are studying population density in different regions of a city. Using the coordinates of
people’s locations, create a density plot and a contour plot to visualize population
distribution. Provide a comparison between both visualizations.
● Loading or defining population coordinates data (3 marks)
● Creating a density plot for population distribution (4 marks)
● Creating a contour plot for population distribution (4 marks)
● Customization of both plots (title, labels, color) (3 marks)
● Comparison between density and contour plots in population analysis (2 marks)
12. A company wants to analyze the distribution of sales for two different products over the
last year. Using histograms, visualize the sales data of both products on the same plot, and
add a legend to distinguish them.
● Loading or defining sales data for two products (3 marks)
● Creating a histogram for one product (3 marks)
● Overlaying the histogram for the second product (3 marks)
● Adding a legend to distinguish between the two products (3 marks)
● Explanation of how histograms help in sales analysis (4 marks)
13. Imagine you are monitoring weather data in real-time. Create a figure with two subplots:
one subplot for daily temperature variations (line plot) and the other for daily humidity
variations (scatter plot). Add text annotations to highlight the highest and lowest values in
each plot.
● Loading or defining temperature and humidity data (3 marks)
● Creating subplots (2x1 grid layout) (4 marks)
● Line plot for temperature variations with annotations (3 marks)
● Scatter plot for humidity variations with annotations (3 marks)
● Explanation of how subplots can enhance real-time weather monitoring (3 marks)
14. Assume you are tracking earthquake data (latitude, longitude, and magnitude) globally.
Write Python code to plot earthquake locations on a world map using Basemap. Customize
the plot by coloring points based on earthquake magnitude. Additionally, use Seaborn to
create a heatmap showing the frequency of earthquakes by region.
● Loading or defining earthquake data (coordinates and magnitude) (3 marks)
● Plotting earthquake locations on a world map using Basemap (4 marks)
● Customizing the plot with colors based on magnitude (4 marks)
● Creating a heatmap using Seaborn to show earthquake frequency by region (3
marks)
● Explanation of the significance of geographic data in earthquake monitoring (2
marks)
2 marks
1. What is a distribution in data analysis?
A distribution describes how data values are spread or clustered over a range. It shows the
frequency or probability of different values, helping analysts understand patterns and
variability. Common types include normal, uniform, and skewed distributions. Understanding
distributions is critical for identifying trends and anomalies and for making inferences about a
dataset's underlying structure.
2. What are numerical summaries in data analysis?
Numerical summaries are statistical metrics that describe key properties of a dataset. These
include measures of central tendency (mean, median, mode) and measures of spread
(range, variance, standard deviation). They help provide a concise overview of the dataset,
allowing analysts to understand its overall distribution and variability.
4. How is the mean calculated in a dataset?
The mean is calculated by summing all the values in a dataset and dividing by the total
number of values. It provides the average value and is useful for understanding the central
tendency of the data. However, it is sensitive to outliers, which may skew the mean.
5. List three measures of central tendency.
o Mean: The arithmetic average of the dataset.
o Median: The middle value in a sorted dataset.
o Mode: The most frequent value in the dataset.
These measures provide insights into where most data points lie.
6. What is scaling in data analysis?
Scaling is the process of adjusting the range of values in a dataset. Techniques like min-max
scaling and standardization are used to bring data within a specific range, often between 0
and 1. Scaling is essential in machine learning models that are sensitive to different value
ranges, such as gradient-based algorithms.
9. List two methods of scaling data.
o Min-Max Scaling: Transforms values to a 0-1 range.
o Z-Score Standardization: Converts data to a standard normal distribution with mean
0 and standard deviation 1.
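Both methods in a small NumPy sketch (the sample values are hypothetical):

```python
import numpy as np

data = np.array([12.0, 18.0, 25.0, 40.0, 55.0])  # sample values

# Min-max scaling: map values onto [0, 1]
minmax = (data - data.min()) / (data.max() - data.min())

# Z-score standardization: mean 0, standard deviation 1
zscores = (data - data.mean()) / data.std()

print(minmax)   # smallest value maps to 0.0, largest to 1.0
print(zscores)  # mean of the z-scores is ~0
```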
10. What is inequality in data distribution?
Inequality in data distribution refers to the uneven spread of data points, where certain
values occur more frequently than others. It is often measured using metrics like the
Gini coefficient or the Lorenz curve, which quantify disparities in data or economic indicators.
11. How is inequality measured in datasets?
Inequality is measured using tools like the Gini coefficient, which assesses the degree of
inequality in a distribution. The Lorenz curve is another visual method to show inequality,
typically used in economics to show income or wealth distribution. A higher Gini coefficient
indicates greater inequality.
12. List two common inequality measures.
● Gini Coefficient: Quantifies inequality in a distribution.
● Lorenz Curve: Graphically represents inequality by showing the cumulative
distribution of a variable, often used for wealth. These measures help assess disparities
within a dataset.
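The Gini coefficient can be sketched via the mean absolute difference formula (the income values below are illustrative):

```python
import numpy as np

def gini(values):
    """Gini coefficient via the mean absolute difference formula."""
    x = np.asarray(values, dtype=float)
    n = x.size
    # Sum of pairwise absolute differences, normalized by 2 * n^2 * mean
    diffs = np.abs(x[:, None] - x[None, :]).sum()
    return diffs / (2 * n * n * x.mean())

# Perfect equality gives 0; concentration in one person pushes it toward 1
print(gini([10, 10, 10, 10]))   # 0.0
print(gini([0, 0, 0, 100]))     # 0.75
```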
13. What is smoothing in time series analysis?
Smoothing refers to techniques that reduce noise and highlight trends in time series data.
Methods like moving averages or exponential smoothing dampen short-term fluctuations and
provide a clearer view of long-term trends, making underlying patterns easier to analyze.
14. How does a moving average smooth time series data?
A moving average smooths time series data by averaging a fixed number of consecutive
data points to reduce short-term fluctuations. It helps reveal underlying trends by eliminating
random noise, making it useful for trend analysis and forecasting in time series data.
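A moving average sketched with a uniform convolution kernel (the series values are made up for illustration):

```python
import numpy as np

def moving_average(series, window):
    """Simple moving average using a uniform convolution kernel."""
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the window fully overlaps
    return np.convolve(series, kernel, mode="valid")

data = np.array([3, 5, 4, 6, 8, 7, 9], dtype=float)
print(moving_average(data, 3))  # [4. 5. 6. 7. 8.]
```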
15. List two smoothing techniques in time series.
● Moving Average: Averages data over a set period to smooth out fluctuations.
● Exponential Smoothing: Weights recent data more heavily to better capture trends.
These methods help reduce noise and highlight trends in time series analysis.
16. What is the difference between variance and standard deviation?
Variance measures the average squared deviation from the mean, while standard deviation
is the square root of the variance. Standard deviation is more interpretable since it has the
same unit as the data, while variance is in squared units, making it harder to interpret.
17. How do you interpret a high standard deviation in a dataset?
A high standard deviation indicates that the data points are spread out over a wide range,
showing significant variability in the dataset. It suggests that the dataset has large
fluctuations and is less consistent. In contrast, a low standard deviation suggests that data
points are clustered near the mean.
18. What is the purpose of a time series plot?
A time series plot visually represents data points over time, allowing for the analysis of
trends, seasonality, and patterns. It helps identify short-term fluctuations and long-term
trends, making it valuable for forecasting and understanding temporal changes in data.
19. How do you calculate a z-score?
A z-score is calculated by subtracting the mean from the data point and dividing by the
standard deviation. It indicates how many standard deviations a data point is from the mean,
helping identify outliers and standardize different variables for comparison.
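The calculation in a small sketch (sample data with one deliberately extreme value; the |z| > 2 cutoff is a common rule of thumb, not a fixed standard):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 10, 13, 12, 50], dtype=float)

mean = data.mean()
std = data.std()
z = (data - mean) / std          # z-score: distance from mean in std units

# Flag points more than 2 standard deviations from the mean
outliers = data[np.abs(z) > 2]
print(z.round(2))
print(outliers)                  # only the extreme value 50 is flagged
```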
20. List three types of time series components.
● Trend: The long-term direction of the data.
● Seasonality: Repeated patterns over time, usually within a year.
● Noise: Random fluctuations that don’t follow any pattern.
These components are crucial for time series decomposition and analysis.
16 marks
1. Explain the concept of distribution and how it applies to real-world data, such
as income distribution in a population.
o Definition and explanation of distribution (4 marks)
o Application to real-world scenarios (income distribution) (6 marks)
o Visual representation of distribution using histograms or density plots (6
marks)
2. Describe numerical summaries of level and spread, and compute these for a
given dataset of exam scores.
o Definition of level (mean, median, mode) (4 marks)
o Explanation of spread (range, variance, standard deviation) (4 marks)
o Computation of these summaries for a dataset (8 marks)
3. What is standardization? Explain its importance in machine learning, and apply
it to a dataset with varying scales of variables.
o Explanation of standardization and its process (4 marks)
o Importance in machine learning (4 marks)
o Code or steps to standardize a dataset (8 marks)
4. Discuss the concept of inequality with reference to income distribution.
Explain how the Gini coefficient is calculated.
o Definition and explanation of inequality (4 marks)
o Application to income distribution (4 marks)
o Explanation of Gini coefficient calculation (8 marks)
5. Explain the process of smoothing in time series data and its importance.
Perform smoothing using a moving average on a given dataset.
o Explanation of smoothing and its importance (4 marks)
o Steps to apply moving averages (4 marks)
o Smoothing a sample dataset and showing results (8 marks)
6. Differentiate between scaling and standardizing. Demonstrate both methods on
a dataset with wide-ranging values.
o Explanation of scaling (4 marks)
o Explanation of standardizing (4 marks)
o Applying both techniques on a dataset (8 marks)
7. Analyze a time series dataset and explain how to identify trends and
seasonality. Apply a smoothing technique to highlight these patterns.
o Explanation of trends and seasonality (4 marks)
o Identifying these components in a time series (4 marks)
o Smoothing the dataset to highlight patterns (8 marks)
8. What is a z-score, and how is it used to detect outliers in a dataset? Calculate
the z-scores for a dataset and identify the outliers.
o Explanation of z-scores (4 marks)
o Application of z-scores for outlier detection (4 marks)
o Calculation and identification of outliers in a dataset (8 marks)
9. Explain the importance of variance and standard deviation in data analysis.
Calculate these measures for a given dataset and interpret the results.
o Definition and importance of variance (4 marks)
o Definition and importance of standard deviation (4 marks)
o Calculation and interpretation for a sample dataset (8 marks)
10. Discuss the importance of scaling data in machine learning. Perform min-max
scaling on a dataset and explain the effect on the range of the data.
● Explanation of scaling and its importance (4 marks)
● Steps for performing min-max scaling (4 marks)
● Applying min-max scaling on a dataset and observing changes (8 marks)
2. A dataset contains test scores of 500 students from different schools. To compare
performances fairly, you are asked to scale and standardize the scores. Write Python code
to perform this task and explain the significance of scaling in this context.
● Loading or defining the test score data (2 marks)
● Explanation of the need for scaling and standardizing the data (4 marks)
● Scaling the test scores (min-max scaling) (3 marks)
● Standardizing the test scores (z-score normalization) (3 marks)
● Comparison of the results before and after scaling/standardizing (4 marks)
3. You are studying the economic inequality of a country by analyzing the income of its
citizens. Using the Gini coefficient as a measure of inequality, calculate the Gini coefficient
from the given income data and explain what the result implies about the level of inequality.
● Explanation of the Gini coefficient and its role in measuring inequality (3 marks)
● Loading or defining income data for a population (2 marks)
● Code for calculating the Gini coefficient (6 marks)
● Interpreting the result and its implications for economic inequality (5 marks)
4. You are given the daily sales data of a retail store for the past year. Apply time series
smoothing techniques to this data to reduce short-term fluctuations and make the long-term
trend clearer. Discuss the choice of smoothing method and its impact on forecasting future
sales.
● Loading or defining sales time series data (2 marks)
● Explanation of the need for smoothing in time series forecasting (3 marks)
● Applying a moving average for smoothing the data (4 marks)
● Plotting the smoothed sales data (3 marks)
● Explaining how smoothing affects sales forecasting (4 marks)
5. In a healthcare study, you have the data of patients' cholesterol levels. Investigate the
distribution of cholesterol levels and describe it using appropriate visualizations and
summary statistics (mean, median, mode, variance). Comment on whether the data is
skewed or normally distributed and the potential implications for the healthcare study.
● Loading or defining cholesterol level data (2 marks)
● Creating a histogram or boxplot to visualize the distribution (4 marks)
● Calculating and interpreting mean, median, mode, and variance (5 marks)
● Explanation of skewness or normality of the data and its healthcare implications (5
marks)