
SCTR's Pune Institute of Computer Technology

Dhankawadi, Pune

A PROJECT REPORT ON
Covid Vaccine Statewise Analysis

SUBMITTED BY
Omkar Shinde (31480)
Amey Wadgaonkar (31492)

Under the guidance of


Prof. Rutuja Kulkarni

DEPARTMENT OF COMPUTER ENGINEERING


Academic Year 2023-24
Title:
Mini-Project: Exploratory data analysis of the COVID-19 vaccination data of India using the given dataset.

Problem Statement:
Use the covid_vaccine_statewise.csv dataset and perform the following analytics on it:
https://www.kaggle.com/sudalairajkumar/covid19-in-india?select=covid_vaccine_statewise.csv
● Describe the dataset
● Number of persons vaccinated state-wise for the first dose in India
● Number of persons vaccinated state-wise for the second dose in India
● Number of males vaccinated
● Number of females vaccinated

Objectives:
● To describe the dataset
● To preprocess the given dataset

Theory:
Libraries:
● Pandas - Pandas is a powerful Python library that plays a crucial role in data analysis and machine learning (ML) workflows by providing efficient data manipulation and analysis capabilities. With its intuitive data structures, DataFrames and Series, Pandas simplifies preprocessing and cleaning datasets for ML tasks. It offers functionality for data selection, filtering, merging, and transformation, allowing users to handle missing values, outliers, and feature engineering effectively. Pandas integrates seamlessly with other libraries such as NumPy and scikit-learn, enabling smooth data interchange and model building, and it supports reading and writing data in a variety of file formats, which makes it convenient to work with diverse data sources. Whether for exploratory data analysis, data preprocessing, or feature extraction, Pandas provides a versatile and efficient toolkit that significantly enhances productivity.
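As an illustration of the operations mentioned above, here is a minimal, self-contained sketch on toy data (the state and dose figures are made up for illustration, not taken from the project dataset):

import pandas as pd

# Selection, filtering and missing-value handling on a toy DataFrame
doses = pd.DataFrame({
    "state": ["Maharashtra", "Kerala", "Goa"],
    "first_dose": [100, None, 40],
})

print(doses.fillna(0))                  # replace missing values with 0
print(doses[doses["first_dose"] > 50])  # boolean filtering keeps Maharashtra only

# Merging with a second toy table
population = pd.DataFrame({"state": ["Maharashtra", "Kerala"],
                           "population_millions": [112, 33]})
print(doses.merge(population, on="state", how="left"))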
● NumPy - NumPy, a fundamental library for numerical computation in Python, is widely used in
machine learning (ML) applications. Its array-oriented programming paradigm allows for efficient
manipulation and processing of large multi-dimensional arrays and matrices, which are central to many
ML algorithms. NumPy's extensive collection of mathematical functions enables quick and vectorized
operations on arrays, improving computational performance significantly. ML tasks such as data
preprocessing, feature extraction, and model evaluation benefit from NumPy's capabilities in handling
numerical data. The seamless integration of NumPy with other ML libraries like Pandas, scikit-learn,
and TensorFlow ensures smooth data interchange and compatibility.
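A small sketch of the vectorized operations described above, again on made-up numbers:

import numpy as np

# Vectorized arithmetic replaces explicit Python loops
first_dose = np.array([120.0, 80.0, 45.0])
second_dose = np.array([60.0, 50.0, 20.0])

total = first_dose + second_dose      # element-wise addition
coverage = second_dose / first_dose   # element-wise ratio
print(total)
print(coverage)
print(total.mean(), total.sum())      # aggregate statistics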

● Scikit-learn (sklearn) - Scikit-learn, a widely used machine learning library in Python, provides a comprehensive set of tools for various aspects of ML workflows. Its extensive collection of algorithms and utilities
covers a broad range of tasks, including classification, regression, clustering, dimensionality reduction,
and model selection. With scikit-learn, ML practitioners can easily implement and experiment with
different algorithms and models without having to build everything from scratch. The library offers a
consistent and user-friendly API, making it straightforward to preprocess and transform data, split
datasets for training and testing, and evaluate model performance using various metrics. scikit-learn
also includes modules for feature extraction, feature selection, and hyperparameter tuning, enabling
researchers to fine-tune models for optimal performance.
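Scikit-learn is not central to this EDA, but a minimal sketch of its consistent API (a train/test split followed by a fitted preprocessing transformer) on purely illustrative data looks like this:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Purely illustrative feature matrix and labels
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Consistent API: split the data, fit a transformer on the training part,
# then apply it to both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler().fit(X_train)
print(scaler.transform(X_train).shape, scaler.transform(X_test).shape)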

Methods used:

● read_csv() - The pandas.read_csv() function takes a path to a CSV file and reads the data into a Pandas DataFrame object.
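For example, the dataset can be loaded as follows (a minimal sketch; the file is assumed to be the CSV downloaded from Kaggle, placed in the working directory):

import pandas as pd

# Load the state-wise vaccination data
df = pd.read_csv("covid_vaccine_statewise.csv")

print(df.shape)   # number of rows and columns
print(df.head())  # first five records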

● describe() - The describe() method returns a statistical summary of the data in the DataFrame. For each numerical column it reports the count of non-null values, the mean, the standard deviation, the minimum and maximum, and the 25%, 50% and 75% percentiles.
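A minimal usage sketch, assuming the DataFrame df loaded above:

# Statistical summary of every numeric column
print(df.describe())

# Include non-numeric columns (such as the State name) as well
print(df.describe(include="all"))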

● groupby() and sum() - DataFrame.groupby().sum() groups rows based on one or more columns and applies the sum() aggregate function. groupby() returns a DataFrameGroupBy object, whose sum() method calculates the sum of a given column for each group. dataframe.groupby() involves a combination of splitting the object, applying a function, and combining the results, which makes it possible to group large amounts of data and compute operations such as sum() on these groups.

Pandas dataframe.sum() returns the sum of the values along the requested axis. If the input is the index axis, it adds all the values in each column and returns a Series containing the column-wise sums.
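The sketch below applies these methods to the analytics listed in the problem statement. The column names (State, First Dose Administered, Second Dose Administered, Male(Individuals Vaccinated), Female(Individuals Vaccinated)) are assumed from the Kaggle file and may differ between versions; because the figures in that file are cumulative per date, the sketch aggregates with max() per state, while sum() as described above applies to non-cumulative data.

# State-wise totals for the first and second dose
# (column names assumed; values are cumulative per date, so max() gives the latest total)
first_dose = df.groupby("State")["First Dose Administered"].max()
second_dose = df.groupby("State")["Second Dose Administered"].max()
print(first_dose.sort_values(ascending=False).head(10))
print(second_dose.sort_values(ascending=False).head(10))

# Males and females vaccinated across the country (the all-India rows)
india = df[df["State"] == "India"]
print("Males vaccinated:", india["Male(Individuals Vaccinated)"].max())
print("Females vaccinated:", india["Female(Individuals Vaccinated)"].max())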
System Architecture:

Methodology:
1. Data collection: Gather the relevant data from reliable sources.
2. Data loading: Load the data into a suitable data structure (e.g., DataFrame) using a programming
language like Python.
3. Data overview: Get an initial understanding of the dataset by examining its structure, dimensions, and
basic statistical summaries.
4. Data cleaning: Handle missing values, duplicates, and outliers in the dataset to ensure data quality.
5. Data visualization: Create visual representations such as histograms, scatter plots, and box plots to
understand patterns, relationships, and distributions in the data (see the sketch after this list).
6. Feature engineering: Extract or transform features to derive new meaningful variables that can
enhance the analysis and model performance.
7. Statistical analysis: Apply statistical techniques to uncover insights, correlations, and associations
within the data.
8. Data segmentation: Group and explore the data based on different criteria (e.g., demographics, time
periods) to uncover patterns and differences.
9. Conclusion and reporting: Summarize findings, draw conclusions, and present the results of the EDA
process in a clear and concise manner.
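As referenced in the data-visualization step, a minimal sketch of steps 2-5 under the same column-name assumptions as before:

import pandas as pd
import matplotlib.pyplot as plt

# Step 2: load the data
df = pd.read_csv("covid_vaccine_statewise.csv")

# Step 3: overview of structure and dimensions
print(df.shape)
print(df.info())

# Step 4: basic cleaning - drop rows without a state name, fill numeric gaps with 0
df = df.dropna(subset=["State"]).fillna(0)

# Step 5: visualize the ten states with the most first doses administered
# (the all-India aggregate rows are excluded; column names are assumed)
states = df[df["State"] != "India"]
top10 = (states.groupby("State")["First Dose Administered"]
               .max()
               .sort_values(ascending=False)
               .head(10))
top10.plot(kind="barh", title="Top 10 states by first doses administered")
plt.xlabel("First doses administered")
plt.tight_layout()
plt.show()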
Results:
Conclusion: In this project, we analyzed the COVID-19 vaccination data in India using the
"covid_vaccine_statewise.csv" dataset. We explored the number of individuals vaccinated for the first
and second doses across different states. Additionally, we determined the number of males and females
vaccinated in the country. The insights gained from this analysis can aid in understanding the progress
of vaccination efforts in India and help in formulating effective strategies to combat the COVID-19
pandemic.
