Report on Data Analysis
An Internship Report
at
TECHNEX IIT(BHU)
Submitted by
SAMIR TAMANG
Regd.No.: 20781A05H5
in partial fulfillment for the award of the degree
of
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
ACKNOWLEDGEMENT
I would like to express my special thanks of gratitude to my internship training guide Mr. Nanda
Kumar Agrawal, Head of the Department Dr. P. Jyotheeswari, as well as our Principal Dr. M. Mohan Babu,
who gave me the golden opportunity to do this wonderful internship at “TECHNEX IIT(BHU)”, which
provided me an opportunity to explore new horizons.
I sincerely express my gratitude towards “TECHNEX IIT(BHU)” for providing this opportunity.
I would like to thank Mr. Hareram Singh, Internship Coordinator, Department of CSE, for his
support and advice.
Secondly, I would also like to thank my parents and friends who helped me a lot in finalizing this
report within the limited time frame.
IV-CSE’B’
INDEX
1. Introduction to Python
   1.1 About Python
2. Data Science
   Data Collection: data cleaning, data transformation, feature scaling, handling missing data
   Data Exploration and Preprocessing: histograms, box plots, scatter plots, pair plots, heatmaps
   Data Visualization: Matplotlib, Seaborn, Plotly, Tableau, ggplot2 (R)
   Text Analysis
4. INTRODUCTION
   4.1 A taste of machine learning
   4.2 Future of machine learning
5. TECHNOLOGY LEARNT
   5.1 Introduction to AI & ML
   5.2 Definition of Machine Learning
   5.3 Machine Learning Algorithms
8. REFERENCES
9. CONCLUSION
WEEKLY OVERVIEW OF INTERNSHIP ACTIVITIES
Training Objective: To enhance the learner's knowledge and to build models that make good predictions.
Student's Work Assignment: A minor project and a major project were assigned during the training.
2. COMPANY PROFILE
YBI Foundation is a Delhi-based not-for-profit EdTech company that aims to enable the youth to
grow in the world of emerging technologies. They offer a mix of online and offline approaches to
bring new skills, education, and technologies to students, academicians, and practitioners. They
believe in the learn-anywhere-and-anytime approach to reach out to learners. The platform
provides free online instructor-led classes for students to excel in data science, business analytics,
machine learning, cloud computing, and big data. They aim to focus on innovation, creativity, and a
technology-driven approach, and keep themselves in sync with present industry requirements. They
endeavour to support learners in achieving the highest possible goals in their academics and
professions.
7) Bitwise Operators:
Bitwise operators are used to perform bit-level calculations on integers.
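For example, a short sketch of the common bitwise operators applied to two small integers:
a = 10          # binary 1010
b = 4           # binary 0100
print(a & b)    # AND -> 0
print(a | b)    # OR -> 14
print(a ^ b)    # XOR -> 14
print(~a)       # NOT -> -11
print(a << 2)   # left shift -> 40
print(a >> 2)   # right shift -> 2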
2.2 Variables and Data Types
Variable
1) A variable is a container used to store a value; the value can be of any data type.
2) Multiple variables can be initialized on the same line by separating them with commas.
3) Variables can be reinitialized, reassigned, and redeclared.
Example: w1 = (12, 13, 14, 15, 56, 13)
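For example, a small illustration of these points (the values are chosen arbitrarily):
x = 10                      # a variable storing an integer
name, age = "Asha", 21      # multiple variables initialized in one line
x = "ten"                   # the same variable reassigned to a string value
print(x, name, age)         # O/P: ten Asha 21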
Data Science
Data science is a multidisciplinary field that involves the use of scientific methods, processes,
algorithms, and systems to extract insights and knowledge from structured and unstructured data. It
combines expertise from various domains such as statistics, computer science, mathematics, and
domain-specific knowledge to analyze and interpret complex data sets.
Data Collection:
Data collection is a crucial step in the data science process, as it involves gathering
relevant information from various sources to analyse and derive insights. Here are
key aspects of data collection in data science:
1. Define Objectives and Scope:
Clearly define the goals and objectives of the data science project.
Determine the scope of the project, including the specific questions you aim to answer.
2. Identify Data Sources:
Determine where the relevant data is located. Sources may include:
Databases (SQL or NoSQL databases)
APIs (Application Programming Interfaces)
Web scraping
Sensor data
Log files
3. Data Extraction:
Extract the relevant data from the identified sources.
Transform the data into a format suitable for analysis, addressing issues such as missing
values and outliers.
4. Data Integration:
If the data is spread across multiple sources, integrate it into a single, unified dataset.
Ensure consistency in terms of data types, formats, and units (a short pandas sketch of these steps follows).
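As an illustration of the extraction and integration steps above, a minimal pandas sketch (the file names and column names here are assumptions for illustration, not the actual project sources):
import pandas as pd

# Extract: read data from two hypothetical sources
sales = pd.read_csv("sales.csv")            # e.g., columns: customer_id, amount
customers = pd.read_csv("customers.csv")    # e.g., columns: customer_id, region

# Transform: handle missing values in the extracted data
sales["amount"] = sales["amount"].fillna(sales["amount"].mean())

# Integrate: combine the two sources into a single, unified dataset
data = pd.merge(sales, customers, on="customer_id", how="inner")
print(data.head())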
Exploratory Data Analysis (EDA) is a critical phase in the data science process that involves visually
and statistically exploring a dataset to gain a deeper understanding of its patterns, characteristics, and
relationships. EDA helps in formulating hypotheses, identifying patterns, and guiding subsequent steps
in the data analysis process. Here are key aspects of Exploratory Data Analysis:
1. Descriptive Statistics:
Calculate basic summary statistics such as mean, median, mode, range, standard
deviation, and quartiles.
Understand the distribution of numerical features.
2. Data Visualization:
Create visual representations of the data using charts and graphs.
Common visualization tools include histograms, box plots, scatter plots, line plots, and
bar charts.
Use tools like Matplotlib, Seaborn, and Plotly in Python for visualization.
3. Univariate Analysis:
Analyse individual variables in isolation.
Examine the distribution of each variable to identify outliers and patterns.
4. Bivariate Analysis:
Explore relationships between pairs of variables.
Use scatter plots, line plots, and correlation matrices to understand how variables
interact.
5. Multivariate Analysis:
Analyse relationships involving three or more variables simultaneously.
Techniques include 3D plots, heatmaps, and pair plots.
6. Handling Missing Data:
Identify and handle missing values appropriately.
Understand the impact of missing data on analysis and decision-making.
7. Outlier Detection:
Identify and examine outliers that may significantly impact the analysis.
Evaluate whether outliers are errors or genuine data points.
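A minimal pandas/Seaborn sketch of these EDA steps, assuming a dataset file and a numeric column named "price" (both are placeholders):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("dataset.csv")        # placeholder file name

print(df.describe())                   # descriptive statistics: mean, std, quartiles
print(df.isnull().sum())               # missing values per column

sns.histplot(df["price"])              # univariate analysis: distribution of one column
plt.show()
sns.boxplot(x=df["price"])             # outlier detection with a box plot
plt.show()
sns.heatmap(df.corr(numeric_only=True), annot=True)   # multivariate analysis: correlation heatmap
plt.show()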
Data Visualization:
Data visualization is the representation of data in graphical or visual formats, making it easier to
understand patterns, trends, and insights. Effective data visualization is a crucial aspect of data analysis
and communication.
1. Bar Chart:
Bar charts are used to compare two or more categories of data. Each category is represented by a bar,
and the length of the bar is proportional to the value of the category. Bar charts are a good choice for
showing data that is categorical or discrete.
Fig: bar chart
2. Line Chart:
Line graphs, or line charts, are a simple but effective staple for representing time-series data. They
are visually similar to scatterplots but represent data points separated by time intervals with
segments joined by a line
3. Pie Chart: Pie charts represent a single variable, broken down into percentages or proportions. A short plotting sketch of these chart types follows.
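A short Matplotlib sketch of the three chart types above, using made-up sample values:
import matplotlib.pyplot as plt

categories = ["A", "B", "C"]
values = [10, 25, 15]
plt.bar(categories, values)            # bar chart: compare categories
plt.show()

years = [2019, 2020, 2021, 2022]
sales = [100, 120, 90, 150]
plt.plot(years, sales)                 # line chart: time-series data
plt.show()

plt.pie(values, labels=categories, autopct="%1.1f%%")   # pie chart: proportions of one variable
plt.show()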
Machine Learning
Machine learning (ML) is a subfield of artificial intelligence (AI) that focuses on the development of
algorithms and statistical models that enable computers to perform tasks without explicit
programming. The primary goal of machine learning is to allow computers to learn from data and
improve their performance over time. Here are key concepts and aspects of machine learning:
1. Types of Machine Learning:
Supervised Learning: The algorithm is trained on a labeled dataset, where each input is
associated with the correct output. The model generalizes patterns from the training data
to make predictions on new, unseen data.
Unsupervised Learning: The algorithm is given data without explicit labels. The model
explores patterns and structures in the data, such as clustering similar data points or
reducing dimensionality.
Reinforcement Learning: The algorithm learns by interacting with an environment. It
receives feedback in the form of rewards or penalties, allowing it to learn optimal
strategies to achieve a goal.
Supervised learning algorithms include Regression and Classification.
Regression:
1) It is a predictive modelling technique which investigates the relationship between a dependent
variable and one or more independent variables.
2) The dependent variable is continuous in nature, e.g., Sales, Weight, Profit, Revenue, Price, Distance,
Magnitude, Height, etc.
y = dependent variable/output
x = independent variable/input(s)
Linear Regression:
1)It is a regression model that estimates the relationship between one independent variable and one
dependent variable using a straight line.
2) It has an equation of the form y = ax + b or y = mx + c
Where
x = independent variable/ input feature/input attribute/input column
y = dependent variable / output feature/target attribute/ output column
a/m = slope or coefficient or weight or how much we expect y to change as x changes
b/c = intercept / constant / bias
In this graph, x = Time spent Studying, y = Marks obtained. The orange dots are the corresponding data
points. The blue line is the best-fit line for linear regression (y = mx + c).
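A minimal scikit-learn sketch of fitting such a line; the study-time and marks values below are made up for illustration, not taken from the figure:
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)   # hours spent studying (independent variable)
y = np.array([35, 45, 50, 65, 70])             # marks obtained (dependent variable)

model = LinearRegression()
model.fit(x, y)                                # finds the best-fit line y = mx + c

print("slope (m):", model.coef_[0])            # expected change in marks per extra hour
print("intercept (c):", model.intercept_)
print("prediction for 6 hours:", model.predict([[6]])[0])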
Clustering:
Clustering is a method of grouping objects into clusters such that objects with the most
similarities remain in one group and have few or no similarities with the objects of another
group. Cluster analysis finds the commonalities between the data objects and categorizes
them as per the presence and absence of those commonalities.
> The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts
the new case into the category that is most similar to the available categories.
> The K-NN algorithm stores all the available data and classifies a new data point based on similarity.
This means that when new data appears, it can easily be classified into a well-suited category using the
K-NN algorithm.
> The K-NN algorithm can be used for Regression as well as Classification, but it is mostly used for
Classification problems (a short sketch follows).
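A short scikit-learn sketch of K-NN classification, using the built-in Iris dataset purely as a stand-in example:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)   # K = 5 nearest neighbours
knn.fit(X_train, y_train)                   # K-NN simply stores the training data

y_pred = knn.predict(X_test)                # new points are classified by similarity
print("Accuracy:", accuracy_score(y_test, y_pred))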
Text Mining:
Definition:
Text Mining (also known as Text Analytics) is the process of deriving meaningful information and
patterns from large volumes of unstructured text data. It involves extracting valuable insights
and knowledge from textual information.
Key Concepts and Tasks:
1. Text Preprocessing: Cleaning and transforming raw text data for analysis.
2. Information Retrieval: Extracting relevant information from a large corpus of documents.
3. Topic Modelling: Identifying topics present in a collection of documents.
4. Clustering: Grouping similar documents or text snippets together.
5. Text Classification: Assigning predefined categories or labels to documents.
6. Pattern Recognition: Identifying patterns and trends in textual data.
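As a small illustration of the text preprocessing step, a minimal NLTK sketch (the sample sentence is made up; newer NLTK versions may also require the "punkt_tab" resource):
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download("punkt")        # tokenizer models
nltk.download("stopwords")    # common English stop words

text = "Text mining extracts useful patterns from large volumes of unstructured text."
tokens = word_tokenize(text.lower())                    # tokenization
filtered = [t for t in tokens
            if t.isalpha() and t not in stopwords.words("english")]   # drop punctuation and stop words
print(filtered)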
Database:
A database is an organized collection of structured information or data that is stored electronically
in a computer system. Databases are designed to efficiently manage, store, and retrieve data. They
are a critical component in various applications, enabling the storage and manipulation of vast
amounts of information. Key concepts related to databases include:
1. Database Management System (DBMS):
A software system that provides an interface to interact with the database.
Manages the storage, retrieval, and organization of data.
2. Relational Database:
Organizes data into tables with rows and columns.
Uses relationships between tables to establish connections and maintain data integrity.
3. NoSQL Database:
A type of database that does not strictly adhere to the traditional relational model.
Suited for handling unstructured or semi-structured data and provides flexible schema
design.
4. Key-Value Store:
A NoSQL database model where each data item is stored as a key-value pair.
5. Document Store:
A NoSQL database model that stores data in semi-structured documents (e.g., JSON or
XML).
6. Graph Database:
A database designed for handling data with complex relationships, using a graph
structure.
7. ACID Properties:
A set of properties (Atomicity, Consistency, Isolation, Durability) that guarantee the
reliability of database transactions.
1. SELECT Statement:
Retrieves data from one or more tables.
SELECT column1, column2 FROM table WHERE condition;
2. INSERT Statement:
Adds new records to a table.
INSERT INTO table (column1, column2) VALUES (value1, value2);
3. UPDATE Statement:
Modifies existing records in a table.
UPDATE table SET column1 = value1 WHERE condition;
4. DELETE Statement:
Removes records from a table.
DELETE FROM table WHERE condition;
5. CREATE TABLE Statement:
Defines a new table with its structure.
CREATE TABLE table (column1 datatype1, column2 datatype2, ...);
6. JOIN Operation:
Combines rows from two or more tables based on a related column.
SELECT * FROM table1 INNER JOIN table2 ON table1.column = table2.column;
7. INDEX:
Improves the speed of data retrieval operations on a database table.
CREATE INDEX index_name ON table (column);
8. Normalization:
The process of organizing data in a database to reduce redundancy and
improve data integrity.
PROJECT WORK
During the six-week internship period, two projects were assigned:
1. Minor Project
2. Major Project
Minor Project:
For the minor project, I worked on “Car price prediction” and “SMS spam classifier”.
Car price prediction models typically involve the use of various modules or components to analyze and
predict the prices of cars. The specific modules used can vary depending on the complexity of the model
and the data available. Here are some common modules or steps that may be involved in building a car
price prediction model:
1. Data Collection and Preprocessing:
Data Collection Module: Collecting relevant data on cars, including features such as
make, model, year, mileage, fuel type, engine size, etc.
Data Cleaning and Preprocessing Module: Handling missing data, removing outliers, and
preparing the data for analysis.
2. Exploratory Data Analysis (EDA):
Descriptive Statistics Module: Analyzing summary statistics, distributions, and other
descriptive measures of the data.
Data Visualization Module: Creating visualizations to understand the relationships
between different features and the target variable (car prices).
3. Model Building:
Model Selection Module: Choosing an appropriate machine learning algorithm for
regression tasks (e.g., linear regression, decision trees, random forests, gradient
boosting). A condensed sketch of such a pipeline follows this list.
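The sketch below condenses these modules into a few lines; the file name and column names are assumptions made for illustration, not the actual project dataset:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Data collection / loading (hypothetical file and columns)
cars = pd.read_csv("car_data.csv")    # e.g., columns: year, mileage, fuel_type, price

# Preprocessing: drop rows with missing values and encode the categorical column
cars = cars.dropna()
cars = pd.get_dummies(cars, columns=["fuel_type"], drop_first=True)

# Model building: a simple regression baseline
X = cars.drop("price", axis=1)
y = cars["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on test data:", r2_score(y_test, model.predict(X_test)))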
SMS spam prediction models typically involve several modules to process and analyse text data
for classifying messages as spam or not spam. Below are common modules or steps involved in
building an SMS spam prediction model:
1. Feature Extraction:
Tokenization Module: Breaking down the text into individual words or tokens.
To build an SMS spam classifier, you'll likely use various libraries and frameworks that provide tools for
data preprocessing, feature extraction, machine learning, and evaluation. Here are some commonly
used libraries in Python for building SMS spam classifiers:
NLTK (Natural Language Toolkit):
NLTK is a powerful library for natural language processing. It provides tools for tokenization, stemming,
lemmatization, and other text processing tasks.
import nltk
Scikit-learn:
Scikit-learn is a popular machine learning library that includes tools for data preprocessing, feature
extraction, and implementing machine learning algorithms.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
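Putting these pieces together, a minimal sketch of the classifier; the dataset path and column names are assumptions, and the real project data may differ:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical dataset with columns "label" (ham/spam) and "message"
sms = pd.read_csv("spam.csv")

X_train, X_test, y_train, y_test = train_test_split(
    sms["message"], sms["label"], test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer(stop_words="english")   # feature extraction from raw text
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = MultinomialNB()                              # Naive Bayes classifier
model.fit(X_train_vec, y_train)

y_pred = model.predict(X_test_vec)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))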
2. Major Project
For the major project, I worked on the topic “Online Insurance Analysis in Python”.
Online insurance analysis involves the examination and assessment of various aspects related to the
digital landscape of the insurance industry. As insurance services increasingly migrate to online
platforms, thorough analysis becomes crucial for optimizing operations, enhancing customer
experiences, and ensuring the security and efficiency of digital processes.
1. Pandas:
Pandas is a powerful data manipulation library that provides data structures for efficient
data analysis. It is commonly used for tasks such as cleaning, transforming, and
exploring datasets.
import pandas as pd
2. NumPy:
NumPy is a fundamental package for scientific computing in Python. It provides support
for large, multi-dimensional arrays and matrices, along with mathematical functions to
operate on these arrays.
import numpy as np
3. Scikit-learn:
Scikit-learn is a machine learning library that provides simple and efficient tools for data
analysis and modeling, including modules for classification, regression, clustering, and
more.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
4. Matplotlib and Seaborn:
Matplotlib and Seaborn are popular libraries for data visualization. They are used to
create various types of plots and charts to better understand and communicate data
insights.
import matplotlib.pyplot as plt
import seaborn as sns
5. TensorFlow or PyTorch:
For more advanced machine learning tasks, such as deep learning, libraries like
TensorFlow or PyTorch can be used.
import tensorflow as tf
6. Statsmodels:
Statsmodels is a library for estimating and testing statistical models. It is useful for tasks
such as regression analysis.
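It is commonly imported as:
import statsmodels.api as sm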
7. Streamlit: Used to serve the analysis as an interactive web dashboard.
Input/output dataset and screenshots
Home.py
import streamlit as st
import pandas as pd
import plotly.express as px
from streamlit_option_menu import option_menu
from numerize.numerize import numerize
from query import *
import time

st.set_page_config(page_title="Dashboard", page_icon="🌍", layout="wide")
st.subheader("🚮 Insurance Details Analysis")
st.markdown("##")

# fetch all rows of the insurance table (see Query.py)
result = view_all_data()
df = pd.DataFrame(result, columns=["Policy", "Expiry", "Location", "State", "Region",
                                   "Investment", "Construction", "BusinessType",
                                   "Earthquake", "Flood", "Rating", "id"])

# switcher: sidebar filters for region, location and construction type
st.sidebar.header("Please filter")
region = st.sidebar.multiselect(
    "Select Region",
    options=df["Region"].unique(),
    default=df["Region"].unique(),
)
location = st.sidebar.multiselect(
    "Select Location",
    options=df["Location"].unique(),
    default=df["Location"].unique(),
)
construction = st.sidebar.multiselect(
    "Select Construction",
    options=df["Construction"].unique(),
    default=df["Construction"].unique(),
)
df_selection = df.query(
    "Region==@region & Location==@location & Construction==@construction"
)

def Home():
    with st.expander("Tabular"):
        showData = st.multiselect('Filter: ', df_selection.columns, default=[])
        st.write(df_selection[showData])

    # summary metrics of the filtered data
    # (aggregations assumed from the metric labels shown below)
    total_investment = float(df_selection["Investment"].sum())
    investment_mode = float(df_selection["Investment"].mode()[0])
    investment_mean = float(df_selection["Investment"].mean())
    investment_median = float(df_selection["Investment"].median())
    rating = float(df_selection["Rating"].sum())

    total1, total2, total3, total4, total5 = st.columns(5, gap='large')
    with total1:
        st.info('Total Investment', icon="📌")
        st.metric(label="sum TZS", value=f"{total_investment:,.0f}")
    with total2:
        st.info('Most Frequent', icon="📌")
        st.metric(label="mode TZS", value=f"{investment_mode:,.0f}")
    with total3:
        st.info('Average', icon="📌")
        st.metric(label="average TZS", value=f"{investment_mean:,.0f}")
    with total4:
        st.info('Central Earning', icon="📌")
        st.metric(label="median TZS", value=f"{investment_median:,.0f}")
    with total5:
        st.info('Rating', icon="📌")
        st.metric(label="Rating", value=numerize(rating),
                  help=f""" Total Rating: {rating} """)
    st.markdown("""_ _ _""")

# graphs
def graphs():
    #total_investment=int(df_selection["Investment"]).sum()
    #averageRating=int(round(df_selection["Rating"]).mean(),2)

    # chart construction was not reproduced in this report;
    # simple grouped bar and line charts of Investment are assumed here
    investment_by_business = df_selection.groupby("BusinessType").count()[["Investment"]]
    fig_investment = px.bar(
        investment_by_business,
        x="Investment",
        y=investment_by_business.index,
        orientation="h",
        title="Investment by Business Type",
    )
    fig_investment.update_layout(
        plot_bgcolor="rgba(0,0,0,0)",
        xaxis=dict(showgrid=False),
    )

    investment_by_state = df_selection.groupby("State").count()[["Investment"]]
    fig_state = px.line(
        investment_by_state,
        x=investment_by_state.index,
        y="Investment",
        title="Investment by State",
    )

    left, right = st.columns(2)
    left.plotly_chart(fig_state, use_container_width=True)
    right.plotly_chart(fig_investment, use_container_width=True)

def Progressbar():
    st.markdown("""<style>.stProgress > div > div > div > div
                { background-image: linear-gradient(to right, #99ff99, #FFFF00)}</style>""",
                unsafe_allow_html=True)
    target = 3000000000
    current = df_selection["Investment"].sum()
    percent = round((current / target * 100))
    mybar = st.progress(0)

    if percent > 100:
        st.subheader("Target done !")
    else:
        st.write("You have ", percent, "% of ", format(target, 'd'), " TZS")
        for percent_complete in range(percent):
            time.sleep(0.1)
            mybar.progress(percent_complete + 1, text="Target Percentage")

def sideBar():
    with st.sidebar:
        selected = option_menu(
            menu_title="Main Menu",
            options=["Home", "Progress"],
            icons=["house", "eye"],
            menu_icon="cast",
            default_index=0,
        )
    if selected == "Home":
        st.subheader(f"Page: {selected}")
        Home()
        graphs()
    if selected == "Progress":
        st.subheader(f"Page: {selected}")
        Progressbar()
        graphs()

sideBar()

# theme: hide Streamlit's default menu, footer and header
hide_st_style = """
<style>
#MainMenu {visibility: hidden;}
footer {visibility: hidden;}
header {visibility: hidden;}
</style>
"""
st.markdown(hide_st_style, unsafe_allow_html=True)
Query.py
import mysql.connector
import streamlit as st

conn = mysql.connector.connect(
    host="localhost",
    port="3306",
    user="root",
    passwd="",
    db="mydbs"
)
c = conn.cursor()

# fetch
def view_all_data():
    c.execute('select * from insurance order by id asc')
    data = c.fetchall()
    return data
Output:
3. CASE STUDY
In this 21st century the usage of smartphones has increased exponentially, and so has the use of
social media. Concerning future prospects, learning machine learning and data science is strongly
recommended for all individuals. Nowadays machine learning and data science have become part of
almost every organization. Machine learning is a part of Artificial Intelligence, and the whole world is
moving towards AI; machine learning and data science are used for prediction in real-life applications.
Why machine learning and data science? Because they are:
Time effective
Able to perform many tasks automatically, with little manual work
Easy to use for prediction and classification
Used in different home appliances with good performance
Used in education, healthcare, security, autonomous vehicles, stock prediction, etc.
4. CONCLUSION
Machine learning is more useful nowadays as an increasing number of people spend time on
social media. Data science uses different machine learning algorithms to predict and classify
data sets. Data science and machine learning represent transformative forces that are reshaping the
landscape of industries, decision-making processes, and technological advancements. The fusion of these
disciplines has led to unprecedented capabilities in extracting valuable insights from vast and complex
datasets, paving the way for innovation and efficiency across diverse sectors.
Looking ahead, the future of data science and machine learning holds promises of even greater
advancements. As technology evolves, these fields are poised to tackle increasingly complex challenges,
driving innovation and providing solutions to societal issues.
REFERENCES
https://www.ybifoundation.org/#/home
https://www.mygreatlearning.com/blog/what-is-data-science/
https://cloud.google.com/data-science