DSBDA - Mini Project Report
Dhankawadi, Pune
A PROJECT REPORT ON
Covid Vaccine Statewise Analysis
SUBMITTED BY
Omkar Shinde (31480)
Amey Wadgaonkar (31492)
Problem Statement:
Use the covid_vaccine_statewise.csv dataset and perform the following analytics on it:
https://www.kaggle.com/sudalairajkumar/covid19-in-india?select=covid_vaccine_statewise.csv
● Describe the dataset
● Number of persons statewise vaccinated for the first dose in India
● Number of persons statewise vaccinated for the second dose in India
● Number of males vaccinated
● Number of females vaccinated
Objectives:
● To describe the dataset
● To preprocess the given dataset
Theory:
Libraries:
● Pandas - Pandas is a Python library for data manipulation and analysis that is widely used in
machine learning (ML) workflows. Its two core data structures, DataFrame and Series, simplify
preprocessing and cleaning datasets. It supports selection, filtering, merging, and transformation of
data, which helps in handling missing values and outliers and in feature engineering. Pandas works
well alongside other libraries such as NumPy and scikit-learn, and it can read and write many file
formats, including CSV. This makes it suitable for exploratory data analysis, data preprocessing,
and feature extraction.
● NumPy - NumPy is a fundamental library for numerical computation in Python and is widely used in
machine learning (ML) applications. Its array-oriented programming model allows efficient
manipulation of large multi-dimensional arrays and matrices, which are central to many ML
algorithms. NumPy's mathematical functions operate on whole arrays at once (vectorized
operations), which is usually much faster than looping over elements in Python. Tasks such as data
preprocessing, feature extraction, and model evaluation benefit from these capabilities. NumPy
arrays are also the common data format used by libraries such as Pandas, scikit-learn, and
TensorFlow, which makes exchanging data between them straightforward.
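As a brief illustration of the vectorized operations and the NumPy/Pandas interplay described above, here is a minimal sketch. The dose counts and the state labels "A", "B", "C" are made-up values for illustration, not figures from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical dose counts for three states (illustrative values only).
first_dose = np.array([120, 80, 200])
second_dose = np.array([60, 40, 50])

# Vectorized arithmetic: applied element-wise, with no explicit Python loop.
total_doses = first_dose + second_dose
second_dose_share = second_dose / total_doses

# A Pandas Series wraps an array and adds labels; to_numpy() converts back.
series = pd.Series(total_doses, index=["A", "B", "C"])
as_array = series.to_numpy()
```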
Methods used:
● read_csv() - The read_csv() function takes a path to a CSV file and reads the data into a Pandas
DataFrame object.
● describe() - The describe() method returns summary statistics for the DataFrame. For numerical
columns, the summary contains the following for each column:
count - the number of non-missing values; mean - the average value; std - the standard deviation;
min and max - the smallest and largest values; 25%, 50%, 75% - the quartiles.
● groupby() and sum() - DataFrame.groupby() groups rows by one or more columns and returns a
DataFrameGroupBy object. The operation follows a split-apply-combine pattern: the data is split
into groups, a function is applied to each group, and the results are combined. Calling sum() on
the grouped object computes the sum of each column's values within each group, so
DataFrame.groupby().sum() can be used to aggregate large amounts of data per group.
DataFrame.sum() returns the sum of the values along the requested axis. With the default axis=0
(the index axis), it adds all the values in each column and returns a Series containing one total
per column.
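The three methods above can be sketched together as follows. This is a minimal example on a small in-memory CSV rather than the real file; the column names "State", "First Dose Administered", and "Second Dose Administered" and all values are assumptions made for illustration and may not match the actual dataset.

```python
import io
import pandas as pd

# Toy stand-in for covid_vaccine_statewise.csv.
# Column names and values are assumptions, not taken from the real file.
csv_text = """State,First Dose Administered,Second Dose Administered
Maharashtra,100,40
Maharashtra,150,60
Kerala,80,30
Kerala,,20
"""

# read_csv accepts a file path or a file-like object; the empty field becomes NaN.
df = pd.read_csv(io.StringIO(csv_text))

# describe() summarizes the numeric columns (count, mean, std, min, quartiles, max).
summary = df.describe()

# groupby() splits the rows by state; sum() adds up each column per state.
state_totals = df.groupby("State").sum()

# DataFrame.sum(axis=0) returns one total per column as a Series.
column_totals = df[["First Dose Administered", "Second Dose Administered"]].sum(axis=0)
```

Note that sum() skips missing values by default. If the real file reports cumulative totals for each date, taking the latest row per state may be more appropriate than summing across dates; that depends on how the dataset is structured.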
System Architecture:
Methodology:
1. Data collection: Gather the relevant data from reliable sources.
2. Data loading: Load the data into a suitable data structure (e.g., DataFrame) using a programming
language like Python.
3. Data overview: Get an initial understanding of the dataset by examining its structure, dimensions, and
basic statistical summaries.
4. Data cleaning: Handle missing values, duplicates, and outliers in the dataset to ensure data quality.
5. Data visualization: Create visual representations such as histograms, scatter plots, and box plots to
understand patterns, relationships, and distributions in the data.
6. Feature engineering: Extract or transform features to derive new meaningful variables that can
enhance the analysis and model performance.
7. Statistical analysis: Apply statistical techniques to uncover insights, correlations, and associations
within the data.
8. Data segmentation: Group and explore the data based on different criteria (e.g., demographics, time
periods) to uncover patterns and differences.
9. Conclusion and reporting: Summarize findings, draw conclusions, and present the results of the
exploratory data analysis (EDA) process in a clear and concise manner.
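A few of the steps above (overview, cleaning, feature engineering, and segmentation) can be sketched in Pandas as follows. The DataFrame, its column names "State", "Male", and "Female", and the choice to fill missing counts with 0 are all assumptions made for illustration; the real dataset may need different handling.

```python
import pandas as pd

# Hypothetical records; columns and values are illustrative assumptions.
df = pd.DataFrame({
    "State": ["Goa", "Goa", "Goa", "Punjab", "Punjab"],
    "Male": [10, 10, 5, 20, None],
    "Female": [8, 8, 7, 15, 5],
})

# Step 3 (overview): dimensions and number of missing values per column.
n_rows, n_cols = df.shape
missing_per_column = df.isna().sum()

# Step 4 (cleaning): drop exact duplicate rows, then fill missing counts with 0.
# Filling with 0 is an assumption; other strategies may suit the real data better.
clean = df.drop_duplicates().fillna({"Male": 0})

# Step 6 (feature engineering): derive a total from the existing columns.
clean = clean.assign(Total=clean["Male"] + clean["Female"])

# Step 8 (segmentation): aggregate the counts per state.
by_state = clean.groupby("State")[["Male", "Female", "Total"]].sum()
```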
Results:
Conclusion: In this project, we analyzed the COVID-19 vaccination data in India using the
"covid_vaccine_statewise.csv" dataset. We explored the number of individuals vaccinated for the first
and second doses across different states. Additionally, we determined the number of males and females
vaccinated in the country. The insights gained from this analysis can aid in understanding the progress
of vaccination efforts in India and help in formulating effective strategies to combat the COVID-19
pandemic.