
MAULANA AZAD NATIONAL INSTITUTE OF

TECHNOLOGY, BHOPAL

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Scholar No. : 211112249


Section : CSE-2
Semester : 5th
Session : 2023-24

INTERNSHIP / INDUSTRIAL TRAINING REPORT


DATA SCIENCE

Submitted by: Parth Rajput

Submitted to:
TABLE OF CONTENTS

Internship Certificate …………………………………


Acknowledgement…………………………………….
Declaration………………………………………………
Abstract …………………………………………………..
Introduction
1.1 Introduction………………………………………
1.2 Objective ………………………………………..
1.3 Project Work Description………………………….
Description
2.1 Data Process …………………………………………..
2.2 Python……………………………………………….
2.3 Python Libraries ………………………………………
2.4 Data Interpretation and Statistics …………………………..
2.5 Project Description ……………………………………….
Tools and Technology Used …………………….
Application and Its Outcome
4.1 Project Description …………………………………….
4.2 Project Code …………………………………………………….
4.3 Project Outcome …………………………………………………
Conclusion…………………………………………..
References…………………………………………..
ACKNOWLEDGEMENT

The work in this report is the outcome of continuous effort over a
period of time and drew intellectual support from Technophilia Solutions
and other sources. I would like to express my profound gratitude
and indebtedness to Technophilia Solutions, which helped me in the
completion of the training. I am thankful to the Technophilia Solutions
associates for teaching and assisting me in making the training
successful.

Parth Rajput
211112249
5th Sem
CSE – 2
DECLARATION

I hereby certify that the work presented in the report
entitled “Data Science”, in fulfilment of the requirement for
completion of the summer industrial training in the Department of
Computer Science of Maulana Azad National Institute of
Technology, Bhopal, is an authentic record of my own work and of
the project carried out during the industrial training this summer.

PARTH RAJPUT
211112249
5th Sem
CSE-2
ABSTRACT

Data Science :

Data Science is a multi-disciplinary subject that uses mathematics,
statistics, and computer science to study and evaluate data. The key
objective of Data Science is to extract valuable information for use in
strategic decision making, product development, trend analysis, and
forecasting.
Data Science concepts and processes are mostly derived from data
engineering, statistics, programming, social engineering, data
warehousing, machine learning, and natural language processing. The
key techniques in use are data mining, big data analysis, data extraction
and data retrieval.
Data science is the field of study that combines domain expertise,
programming skills, and knowledge of mathematics and statistics to
extract meaningful insights from data. Data science practitioners apply
machine learning algorithms to numbers, text, images, video, audio, and
more to produce artificial intelligence (AI) systems to perform tasks
that ordinarily require human intelligence. In turn, these systems
generate insights which analysts and business users can translate into
tangible business value.
INTRODUCTION TO THE WORK
 Objective:
 To explore, sort, and analyse large volumes of data from various sources in order to draw conclusions that optimize business processes and support decision making.
 Examples include machine maintenance (predictive maintenance) and applications in marketing and sales, such as sales forecasting based on weather.
 To provide meaningful, cleaned data to machine learning algorithms so that they can be implemented properly.

During my industrial training at Technophilia Solutions I developed and worked on two projects, based mainly on data cleaning, data exploration, and different ways of extracting data.

The projects I worked on are:
1) Data extraction project using web scraping:
In this project I extracted and retrieved the desired data by web scraping the IMDB website, generating the desired output without using an API.

2) Titanic data cleaning and use of the prepared data in a basic machine learning algorithm:
In this project I processed the Titanic data and performed the data cleaning needed so that the survival of a passenger can be predicted by the machine. This project involved the following work during the training (a minimal code sketch of this workflow follows the list below):

Titanic: Machine Learning from Disaster


Predict survival on the Titanic

 Defining the problem statement
 Collecting the data
 Exploratory data analysis
 Feature engineering
 Modelling
 Testing
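
Below is a minimal, illustrative sketch of this workflow in Python. It assumes the standard Kaggle Titanic train.csv with columns such as Survived, Pclass, Sex, Age, SibSp, Parch, Fare, and Embarked; the exact file, features, and imputation choices used during the training may have differed.

import pandas as pd

# Load the raw Titanic training data (assumed file name)
df = pd.read_csv("train.csv")

# Data cleaning: fill missing values with simple defaults
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Encode categorical features as numbers for the model
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# Feature engineering: size of the family travelling together
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

features = ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize"]
X, y = df[features], df["Survived"]
print(X.head())
print(y.head())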

This report is a description of my 8-week internship carried out
as a compulsory component of the course. The following chapters give
details of the tools and technology used and an overview of the
above-mentioned projects. Afterwards, I explain specific technical
details about my main tasks. Finally, a conclusion is drawn from the
experience.
Brief Description of Modules/Study
1) DATA SCIENCE PROCESS:

1. The first step of this process is setting a research goal. The
main purpose here is making sure all the stakeholders
understand the what, how, and why of the project.
2. The second phase is data retrieval. You want to have data
available for analysis, so this step includes finding suitable
data and getting access to the data from the data owner. The
result is data in its raw form, which probably needs polishing
and transformation before it becomes usable.
3. Now that you have the raw data, it’s time to prepare it. This
includes transforming the data from a raw form into data
that’s directly usable in your models. To achieve this, you’ll
detect and correct different kinds of errors in the data,
combine data from different data sources, and transform it. If
you have successfully completed this step, you can progress
to data visualization and modeling.
4. The fourth step is data exploration. The goal of this step is to
gain a deep understanding of the data. You’ll look for
patterns, correlations, and deviations based on visual and
descriptive techniques. The insights you gain from this phase
will enable you to start modeling.

5. Finally, we get to the most interesting part: model building (often
referred to simply as “data modeling”). It is now that you attempt
to gain the insights or make the predictions stated in your project
charter. Now is the time to bring out the heavy guns, but remember
that research has taught us that often (but not always) a
combination of simple models tends to outperform one complicated
model. If you’ve done this phase right, you’re almost done.

6. The last step of the data science process is presenting your
results and automating the analysis, if needed. One goal of a
project is to change a process and/or make better decisions.
You may still need to convince the business that your
findings will indeed change the business process as expected.
This is where you can shine in your influencer role. The
importance of this step is more apparent in projects on a
strategic and tactical level. Certain projects require you to
perform the business process over and over again, so
automating the project will save time.

2) PYTHON FOR DATA SCIENCE


Introduction to Python, Understanding Operators, Variables and Data
Types, Conditional Statements, Looping Constructs, Functions, Data
Structure, Lists, Dictionaries, Understanding Standard Libraries in Python,
reading a CSV File in Python, Data Frames and basic operations with Data
Frames, Indexing Data Frame.
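
The short example below illustrates a few of these topics, namely reading a CSV file into a DataFrame, basic operations, and indexing. The file name movies.csv and the rating column are placeholders, not files from the training.

import pandas as pd

df = pd.read_csv("movies.csv")     # read a CSV file into a DataFrame

print(df.head())                   # first five rows
print(df.shape)                    # (number of rows, number of columns)
print(df["rating"].mean())         # select a column and aggregate it

print(df.loc[0, "rating"])         # label-based indexing: row 0, column "rating"
print(df.iloc[0:3, 0:2])           # position-based indexing: rows 0-2, columns 0-1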
3) PYTHON LIBRARIES:
i) Numpy
ii) Pandas
iii) Matplotlib
iv) sklearn (KNN – MODEL)
v) BeautifulSoup

4) UNDERSTANDING THE STATISTICS FOR DATA SCIENCE


Introduction to Statistics, Measures of Central Tendency, Understanding the
spread of data, Data Distribution, Introduction to Probability, Probabilities
of Discrete and Continuous Variables, Normal Distribution, Introduction to
Inferential Statistics, Understanding the Confidence Interval and margin of
error, Hypothesis Testing, Various Tests, Correlation.
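
As a small worked illustration of a few of these concepts (central tendency, spread, and a confidence interval), the following snippet uses invented sample data; it is not taken from the project.

import numpy as np
from scipy import stats

sample = np.array([68, 72, 75, 70, 69, 74, 71, 73, 76, 70])

mean = sample.mean()                    # measure of central tendency
std = sample.std(ddof=1)                # sample standard deviation (spread)
sem = stats.sem(sample)                 # standard error of the mean
ci = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)

print(f"mean={mean:.2f}, std={std:.2f}, 95% confidence interval={ci}")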

Project Developed : “WEBSCRAPING-IMDB”

Dataset: gfg1.csv file downloaded from the Kaggle website.

In this project we perform data extraction and data collection so that
the data can be used for further processing.

We can extract data in various ways, which include:

1) Using an API (Application Programming Interface), where we
request the data from the website's API, which might involve
some charges.
2) Using web scraping. Web scraping is the process of
extracting data from websites. It involves using a program or
script to access a web page, retrieve the HTML content of the
page, and then parse and extract specific information from
that HTML, such as text, images, links, or structured data.
Web scraping is a valuable technique for collecting data from
websites.

In this project we downloaded the gfg1.csv file, which contains
IMDB Movie_IDs, and then used the web scraping method of data
extraction to extract data from the corresponding pages of the IMDB
website. Here we pass the Movie_ID of a movie as input; the HTML
extracted from the web page is then parsed using Beautiful Soup and
the desired information is extracted, which includes “Series_name”,
“Series_rating”, and “Series_genre”. The corresponding output graph
is then plotted using the Matplotlib library.
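
A condensed sketch of this scraping workflow is shown below. The Movie_ID column name, the use of IMDB's embedded JSON-LD block, and the plotted fields are assumptions based on the tools described in this report (BeautifulSoup, json.loads, Matplotlib); the exact selectors used during the training may differ, and IMDB's page structure can change over time.

import json

import matplotlib.pyplot as plt
import pandas as pd
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0"}  # IMDB rejects requests without a browser-like UA

# Movie IDs (e.g. "tt0944947") from the downloaded CSV; column name assumed
movie_ids = pd.read_csv("gfg1.csv")["Movie_ID"].head(5)

names, ratings = [], []
for movie_id in movie_ids:
    page = requests.get(f"https://www.imdb.com/title/{movie_id}/", headers=HEADERS)
    soup = BeautifulSoup(page.text, "html.parser")
    # IMDB title pages embed structured data in a JSON-LD <script> tag
    data = json.loads(soup.find("script", type="application/ld+json").string)
    names.append(data["name"])
    ratings.append(float(data["aggregateRating"]["ratingValue"]))

# Plot the extracted ratings with Matplotlib
plt.bar(names, ratings)
plt.ylabel("IMDB rating")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()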

Tools and Technology used


1) Language used: Python
Python has become the language of choice for data scientists,
and for good reason. Its versatility and readability make it an ideal
tool for the entire data science workflow. With libraries such as
NumPy and Pandas, Python simplifies the manipulation and
analysis of data, providing data structures like arrays and
DataFrames. For visualization, Matplotlib and Seaborn offer
powerful tools to create insightful graphs and plots. When it comes
to machine learning, Scikit-learn provides a comprehensive set of
tools for implementing algorithms for classification, regression,
clustering, and more. Deep learning enthusiasts turn to TensorFlow
and PyTorch, which have gained widespread adoption for building
and training neural networks.
2) Jupyter Notebook:
Jupyter Notebook is an open-source web application that provides
an interactive environment for creating and sharing documents that
contain live code, equations, visualizations, and narrative text. It is
widely used in data science, scientific research, and education. With
Jupyter Notebook, you can write and execute code in various
programming languages, including Python, R, and Julia, within a
web-based interface. The notebook format allows users to combine
code cells (where they write and run code), markdown cells (for
text and explanations), and output cells (for displaying results) in a
single document. Jupyter Notebook is highly popular in data
analysis and machine learning due to its ability to create
reproducible and easily shareable data-driven narratives.

3). Numpy:
NumPy, short for "Numerical Python," is a fundamental open-
source library in Python for numerical and scientific computing. It
provides support for creating and manipulating arrays and matrices
of data, along with a wide range of mathematical functions to
operate on these arrays efficiently. NumPy is an essential tool for
tasks involving numerical computations, data analysis, and
scientific research. It's particularly valuable for its speed and
memory efficiency, making it a foundation for many other scientific
Python libraries.
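
A tiny example of the array creation and vectorised operations described above:

import numpy as np

ratings = np.array([8.8, 9.2, 7.5, 8.1, 9.0])

print(ratings.mean())         # average of the array
print(ratings.std())          # spread of the values
print(ratings * 10)           # element-wise (vectorised) arithmetic
print(ratings[ratings > 8])   # boolean indexing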
4) Pandas:
Pandas is a popular open-source Python library for data manipulation
and analysis. It provides easy-to-use data structures and functions for
working with structured and tabular data, making it a valuable tool for
data scientists, analysts, and researchers. Pandas introduces two
primary data structures: the DataFrame, which is a two-dimensional
table-like data structure with rows and columns, and the Series, which
is a one-dimensional array-like structure.
Pandas simplifies various data operations, including data cleaning,
transformation, filtering, aggregation, and exploration. It allows users
to import data from a variety of sources, such as CSV files, Excel
spreadsheets, databases, and more, and then perform data wrangling
and analysis tasks with ease.
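
The snippet below shows a few of the operations named above (handling missing values, filtering, and aggregation) on a small DataFrame constructed inline so the example is self-contained; the data is invented.

import pandas as pd

df = pd.DataFrame({
    "genre": ["Drama", "Drama", "Comedy", "Comedy"],
    "rating": [8.5, None, 7.2, 6.9],
})

df["rating"] = df["rating"].fillna(df["rating"].mean())   # data cleaning
print(df[df["rating"] > 7])                               # filtering
print(df.groupby("genre")["rating"].mean())               # aggregation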

5) MATPLOTLIB :
Matplotlib is a popular open-source Python library for creating static,
animated, and interactive visualizations and plots. It offers a wide
range of tools and functions for generating high-quality charts,
graphs, and figures, making it an essential tool for data visualization
in fields such as data analysis, scientific research, and engineering.
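
A minimal Matplotlib example in the spirit of the rating plots produced in the project; the titles and values here are invented for illustration.

import matplotlib.pyplot as plt

titles = ["Movie A", "Movie B", "Movie C"]
ratings = [8.8, 7.4, 9.1]

plt.bar(titles, ratings)                 # simple bar chart
plt.xlabel("Title")
plt.ylabel("IMDB rating")
plt.title("Sample rating comparison")
plt.show()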

6) json.loads:
‘json.loads’ is a Python method that stands for "JSON load string."
It is part of the json module in Python and is used to parse and
convert a JSON-formatted string into a Python data structure,
typically a dictionary, list, or a combination of both, depending on
the JSON content.

The primary use of json.loads is to deserialize JSON data, which
means converting data from its serialized JSON representation (a
string) back into a format that Python can work with as native data
types. This is particularly useful when you need to work with data
retrieved from external sources, such as web APIs, as many APIs
provide data in JSON format.
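
A brief example of json.loads turning a JSON string into native Python objects; the JSON content is invented.

import json

raw = '{"name": "Example Series", "rating": 9.5, "genres": ["Crime", "Drama"]}'
data = json.loads(raw)

print(type(data))         # <class 'dict'>
print(data["name"])       # "Example Series"
print(data["genres"][0])  # nested list element: "Crime"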

7) Beautiful Soup:
 It is implemented using the bs4 library available in Python.
Beautiful Soup is a popular Python library for web scraping and
parsing HTML and XML documents. It provides a convenient way
to extract specific information from web pages.
Beautiful Soup makes it easier to work with web data by providing
functions and methods to (a short example follows the list below):

 Parse HTML and XML: Beautiful Soup can take an HTML or
XML document and parse it into a structured tree-like data
structure that can be traversed and searched.
 Navigate the Document: It allows you to traverse the document's
structure by moving up and down the element tree, accessing
elements and their attributes.
 Search and Filter: You can search for specific tags, attributes, or
text content within the document, making it simple to locate and
extract the data you need.
 Modify and Manipulate: Beautiful Soup enables you to modify
the document, add or delete elements, and update attributes as
needed.
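
The short example below demonstrates the first three of these operations on an invented HTML snippet:

from bs4 import BeautifulSoup

html = """
<html><body>
  <h1 class="title">Example Movie</h1>
  <span class="rating">8.7</span>
  <a href="/genre/drama">Drama</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")           # parse the document

print(soup.h1.text)                                 # navigate to the first <h1>
print(soup.find("span", class_="rating").text)      # search by tag and class
for link in soup.find_all("a"):                     # find every <a> tag
    print(link["href"])                             # read an attribute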

8) Implementation of a basic Machine Learning
Algorithm (KNN) using the Sklearn Library
Scikit-Learn, often abbreviated as sklearn, is a widely used open-
source machine learning library for the Python programming
language. It provides a comprehensive and user-friendly set of tools
for various machine learning tasks, including classification,
regression, clustering, dimensionality reduction, model selection,
and preprocessing of data. Scikit-Learn is built on top of other
popular Python libraries, such as NumPy, SciPy, and Matplotlib,
making it an integral part of the Python machine learning
ecosystem.
K-Nearest Neighbors (KNN) is a simple and intuitive machine
learning algorithm used for classification and regression tasks. In
KNN, an object is classified by a majority vote of its k nearest
neighbors from the training dataset. The "k" in KNN represents the
number of nearest neighbors that are considered when making a
prediction.
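
A minimal scikit-learn sketch of this idea, in the spirit of the Titanic project, is given below: a train/test split, a KNeighborsClassifier fit, and an accuracy score. It assumes the same Kaggle train.csv as the earlier cleaning sketch; the feature set and the choice of k = 5 are only illustrative.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("train.csv")
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

X = df[["Pclass", "Sex", "Age", "Fare"]]
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# KNN is distance based, so scaling the features usually helps
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)     # k nearest neighbours
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))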

APPLICATION AND ITS OUTCOME
Here I implemented the above-mentioned technology and tools
using Jupyter Notebook, and I have demonstrated their application in
my project “Web Scraping – IMDB”.
CODE:
OUTPUT :

a) When the input is 5 Movie_IDs:

b) When the input is 10 Movie_IDs:
CONCLUSION

The Data Science summer training program has been an invaluable
learning experience for me. Over the course of this summer, I have
delved into the multifaceted world of data science, covering topics
ranging from data collection and cleaning to advanced machine
learning techniques. This training has equipped me with a broad set of
skills and knowledge that are essential for a data scientist.

Throughout the program, I had the opportunity to work on real-world
projects and apply my learning to solve practical problems.
This hands-on experience has been instrumental in cementing my
understanding and boosting my confidence in tackling data-related
challenges.

I have not only gained proficiency in programming languages like
Python and tools such as Jupyter Notebook, but I have also honed
my skills in data analysis, data visualization, and statistical modeling.
The exposure to industry-relevant tools and techniques, including
libraries like Pandas, NumPy, Matplotlib, and scikit-learn, has been
particularly beneficial.

Moreover, the program’s emphasis on understanding the ethical
considerations and responsible use of data has provided me with a
well-rounded perspective on data science. I am now better equipped
to approach data-driven decision-making with a strong ethical
foundation. In addition to the technical skills acquired, I have also
developed soft skills such as problem-solving, critical thinking, and
effective communication. These skills are essential for presenting
findings and insights to stakeholders in a clear and understandable
manner.

As I move forward in my journey as an aspiring data scientist, the
knowledge and experience gained during this summer training will
undoubtedly serve as a strong foundation. I am excited to continue
exploring the vast field of data science and applying my skills to
make a positive impact in various domains, whether it be business,
healthcare, finance, or any other field that benefits from data-driven
insights.

I would like to express my gratitude to the trainers and mentors who
have guided me throughout this program, providing their expertise,
support, and encouragement. I also appreciate the opportunity to
collaborate with my fellow trainees, as the exchange of ideas and
experiences has enriched my learning.

In conclusion, this Data Science summer training has been an
enriching and empowering experience, and I am enthusiastic about
the future possibilities and contributions I can make to the field of
data science.
REFERENCES-

 http://www.w3schools.com
 http://www.wikipidea.com
 http://www.w3.org/standards/semanticweb/
 http://www.technophilia.com
 http://www.kaggle.com
 http://www.javapoint.com
 http://www.Googledatasetsearch.com

******
