Internship Report on
DATA SCIENCE AND MACHINE LEARNING
BY:
Srishti Kashyap
(2135051238)
Under the guidance of:
Dr. Alok Yadav
Department of Computer
Science Engineering,
GURU TEGH BAHADUR POLYTECHNIC INSTITUTE,
G-8 AREA, RAJOURI GARDEN,
NEW DELHI-110064
ACKNOWLEDGEMENT
I would like to express my gratitude to the people who were part of my report, directly or indirectly, and who gave unending support right from the stage the idea was conceived. It gives me great pleasure to have an opportunity to acknowledge and to express gratitude to those who were associated with me during my internship at YBI Foundation. I take this opportunity to thank the industrial training coordinator and the H.O.D. of the Computer Science and Engineering department. I am highly indebted to my project guide, Dr. Alok Yadav (Training Instructor), for his guidance and words of wisdom. He always showed me the right direction during the course of this project work. I am duly thankful to him for teaching me, referring me to various blocks, providing work and permitting me to undergo training of a duration of 6 weeks.
DECLARATION
I hereby declare that the projects done by me at YBI Foundation, based on Data Science and Machine Learning, and submitted by me, are a record of bona fide project work completed during internship training. I further declare that the work reported in this project has not been submitted anywhere else and is not copied from anywhere.
Chapter-1
Introduction and Literature Survey
Introduction
The name machine learning was coined in 1959 by Arthur Samuel. Tom M.
Mitchell provided a widely quoted, more formal definition of the algorithms
studied in the machine learning field: "A computer program is said to learn from
experience E with respect to some class of tasks T and performance measure P if
its performance at tasks in T, as measured by P, improves with experience E."
This follows Alan Turing's proposal in his paper "Computing Machinery and
Intelligence", in which the question "Can machines think?" is replaced with the
question "Can machines do what we (as thinking entities) can do?". In Turing's
proposal the characteristics that could be possessed by a thinking machine and the
various implications in constructing one are exposed.
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
4. Semi-Supervised Learning
Supervised Learning
Supervised learning is the type of machine learning in which machines are trained using well "labelled" training data, and on the basis of that data, machines predict the output. Labelled data means that some of the input data is already tagged with the correct output.
In the real world, supervised learning can be used for risk assessment, image classification, fraud detection, spam filtering, etc.
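As a minimal sketch (using scikit-learn's built-in Iris dataset purely for illustration), a supervised model can be trained on labelled examples as follows:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Labelled data: every flower measurement is already tagged with its correct species.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The model learns the mapping from inputs to labels, then predicts outputs for unseen data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on data the model has not seen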
Unsupervised Learning
Unsupervised learning is a machine learning technique in which models are not supervised using a training dataset. Instead, the models themselves find the hidden patterns and insights from the given data. It can be compared to the learning which takes place in the human brain while learning new things.
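As a minimal sketch, k-means clustering from scikit-learn can group a few made-up points without any labels:

import numpy as np
from sklearn.cluster import KMeans

# No labels are provided; the model has to find the grouping on its own.
X = np.array([[1.0, 1.2], [0.9, 1.1], [8.0, 8.2], [8.1, 7.9]])   # made-up points
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignment discovered without supervision
print(kmeans.cluster_centers_)   # centre of each discovered cluster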
Reinforcement Learning
Reinforcement learning is a learning method in which an agent interacts with its environment by producing actions and discovers errors or rewards. Trial-and-error search and delayed reward are the most relevant characteristics of reinforcement learning. This method allows machines and software agents to automatically determine the ideal behavior within a specific context in order to maximize performance. Simple reward feedback is required for the agent to learn which action is best.
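The following is a minimal, illustrative sketch of tabular Q-learning; the states, actions, rewards and transitions are all made-up assumptions chosen only to show the trial-and-error update and delayed reward:

import numpy as np

# Hypothetical environment: 3 states, 2 actions, hand-made rewards and transitions.
rewards = np.array([[0, 1], [0, 2], [5, 0]])
next_state = np.array([[1, 2], [0, 2], [0, 1]])
Q = np.zeros((3, 2))                      # value of each (state, action) pair
alpha, gamma, epsilon = 0.1, 0.9, 0.2     # learning rate, discount, exploration rate

rng = np.random.default_rng(0)
state = 0
for _ in range(1000):
    # epsilon-greedy: occasionally explore, otherwise pick the best known action
    action = rng.integers(2) if rng.random() < epsilon else int(np.argmax(Q[state]))
    reward, new_state = rewards[state, action], next_state[state, action]
    # delayed reward: move Q towards reward plus the discounted best future value
    Q[state, action] += alpha * (reward + gamma * Q[new_state].max() - Q[state, action])
    state = new_state

print(Q)   # learned action values for every state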
Semi-Supervised Learning
Semi-supervised learning falls somewhere in between supervised and unsupervised learning, since it uses both labeled and unlabeled data for training – typically a small amount of labeled data and a large amount of unlabeled data. The systems that use this method are able to considerably improve learning accuracy. Usually, semi-supervised learning is chosen when the acquired labeled data requires skilled and relevant resources in order to train on it / learn from it, whereas acquiring unlabeled data generally does not require additional resources.
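A minimal sketch using scikit-learn's SelfTrainingClassifier, where unlabeled samples are marked with -1; the data is made up and a logistic regression is used as the base classifier:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# A small amount of labeled data (0 and 1) and a larger amount of unlabeled data (-1).
X = np.array([[0.0], [0.1], [0.2], [0.3], [0.7], [0.8], [0.9], [1.0]])
y = np.array([0, -1, -1, -1, -1, -1, -1, 1])

# The base classifier is trained on the labeled points and then iteratively
# assigns labels to the unlabeled points it is most confident about.
model = SelfTrainingClassifier(LogisticRegression())
model.fit(X, y)
print(model.predict([[0.25], [0.85]]))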
Literature Survey
Human beings, at this moment, are the most intelligent and advanced species on earth because they can think, evaluate and solve complex problems. On the other side, AI is still in its initial stage and has not surpassed human intelligence in many aspects. The question, then, is: what is the need to make machines learn? The most suitable reason for doing this is “to make decisions, based on data, with efficiency and scale”.
Lately, organizations have been investing heavily in newer technologies like Artificial Intelligence, Machine Learning and Deep Learning to get the key information from data, perform several real-world tasks and solve problems. We can call these data-driven decisions taken by machines, particularly to automate the process. These data-driven decisions can be used, instead of programming logic, in problems that cannot be programmed inherently. The fact is that we cannot do without human intelligence, but the other aspect is that we all need to solve real-world problems with efficiency at a huge scale. That is why the need for machine learning arises.
However, machine learning still faces several challenges:
Quality of data: Having good-quality data for ML algorithms is one of the biggest challenges. Use of low-quality data leads to problems related to data preprocessing and feature extraction.
Time-consuming tasks: Another challenge faced by ML models is the consumption of time, especially for data acquisition, feature extraction and retrieval.
Lack of specialist persons: As ML technology is still in its infancy stage, the availability of expert resources is a tough job.
No clear objective for formulating business problems: Having no clear objective and well-defined goal for business problems is another key challenge for ML, because this technology is not that mature yet.
Issue of overfitting and underfitting: If the model is overfitting or underfitting, it cannot represent the problem well.
Curse of dimensionality: Another challenge an ML model faces is too many features in the data points. This can be a real hindrance.
Difficulty in deployment: The complexity of an ML model makes it quite difficult to deploy in real life.
Features of Python
Interpreted
In Python there are no separate compilation and execution steps like in C/C++. It directly runs the program from the source code. Internally, Python converts the source code into an intermediate form called bytecode, which is then translated into the native language of the specific computer to run it.
Platform Independent
Python programs can be developed and executed on multiple operating system platforms. Python can be used on Linux, Windows, Macintosh, Solaris and many more.
Multi-Paradigm
Python is a multi-paradigm programming language. Object-oriented programming and structured programming are fully supported, and many of its features support functional programming and aspect-oriented programming.
Simple
Python is a very simple language. It is very easy to learn as it is closer to the English language. In Python, more emphasis is on the solution to the problem rather than on the syntax.
Python also offers a rich set of libraries for machine learning:
• Scikit-learn for handling basic ML algorithms like clustering, linear and logistic regression, classification, and others.
• Pandas for high-level data structures and analysis. It allows merging and filtering of data, as well as gathering it from other external sources like Excel, for instance (see the short sketch after this list).
• Keras for deep learning. It allows fast calculations and prototyping, as it uses the GPU in addition to the CPU of the computer.
• TensorFlow for working with deep learning by setting up, training, and utilizing artificial neural networks with massive datasets.
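A minimal sketch of the pandas operations mentioned above; the table names, columns and values are made up for illustration:

import pandas as pd

students = pd.DataFrame({"id": [1, 2, 3], "name": ["Asha", "Ravi", "Meena"]})
marks = pd.DataFrame({"id": [1, 2, 3], "marks": [78, 45, 91]})

merged = pd.merge(students, marks, on="id")    # merging two tables on a common key
passed = merged[merged["marks"] >= 50]         # filtering rows by a condition
print(passed)

# Gathering data from an external Excel file (file name is hypothetical,
# and reading .xlsx files requires the openpyxl package):
# df = pd.read_excel("students.xlsx")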
3. Flexibility-
Python for machine learning is a great choice, as this language is very
flexible:
It offers an option to choose either OOP or scripting.
There is also no need to recompile the source code; developers can implement any changes and quickly see the results.
Programmers can combine Python and other languages to reach their goals.
5. Community Support-
It’s always very helpful when there’s strong community support built around the
programming language. Python is an open-source language, which means that there is a wealth of resources open to programmers, from beginners to pros. A lot of Python documentation is available online, as well as in Python communities and forums, where programmers and machine learning developers discuss errors, solve problems, and help each other out. The Python programming language is absolutely free, as is the variety of useful libraries and tools.
Data Preprocessing
Machine Learning algorithms do not work so well with raw data. Before we can feed such data to an ML algorithm, we must preprocess it, that is, apply some transformations to it. With data preprocessing, we convert raw data into a clean dataset. To perform data preprocessing, there are 7 techniques –
1. Rescaling Data –
For data with attributes of varying scales, we can rescale attributes to possess
the same scale. We rescale attributes into the range 0 to 1 and call it
normalization. We use the MinMaxScaler class from scikit-learn. This gives us values between 0 and 1.
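A minimal sketch of rescaling with MinMaxScaler, using made-up values:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0, 200.0], [15.0, 400.0], [20.0, 800.0]])
rescaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)
print(rescaled)   # every attribute now lies between 0 and 1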
2. Standardizing Data –
Standardization refers to shifting the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance). It is useful to standardize attributes for a model that relies on the distribution of attributes, such as Gaussian processes.
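A minimal sketch of standardization with scikit-learn's StandardScaler, using made-up values:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 100.0], [2.0, 110.0], [3.0, 130.0]])
standardized = StandardScaler().fit_transform(X)
print(standardized.mean(axis=0))   # approximately zero mean per attribute
print(standardized.std(axis=0))    # unit standard deviation per attribute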
3. Normalizing Data –
This is used to rescale each row of data to have a length of 1. It is mainly useful for sparse datasets where we have lots of zeros. We can rescale the data with the help of the Normalizer class of the scikit-learn Python library.
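A minimal sketch with the Normalizer class, using made-up rows:

import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[4.0, 0.0, 3.0], [0.0, 6.0, 8.0]])
normalized = Normalizer(norm="l2").fit_transform(X)   # each row rescaled to length 1
print(normalized)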
4. Binarizing Data –
This is the technique with the help of which we can make our data binary.
We can use a binary threshold for making our data binary. The values
above that threshold value will be converted to 1 and below that threshold
will be converted to 0. For example, if we choose threshold value = 0.5,
then the dataset value above it will become 1 and below this will become 0.
That is why we can call it binarizing the data or thresholding the data. This
technique is useful when we have probabilities in our dataset and want to
convert them into crisp values.
We can binarize the data with the help of the Binarizer class of the scikit-learn Python library.
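A minimal sketch with the Binarizer class, using the threshold of 0.5 from the example above and made-up probabilities:

import numpy as np
from sklearn.preprocessing import Binarizer

probs = np.array([[0.1, 0.6], [0.5, 0.9]])
binary = Binarizer(threshold=0.5).fit_transform(probs)   # values above 0.5 become 1, the rest 0
print(binary)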
5. Mean Removal-
We can remove the mean from each feature to center it on zero.
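As a minimal sketch (values made up), this can be done with scikit-learn's scale helper with standard-deviation scaling turned off:

import numpy as np
from sklearn.preprocessing import scale

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
centered = scale(X, with_mean=True, with_std=False)   # subtract each column's mean
print(centered.mean(axis=0))                          # approximately zero for every feature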
7. Label Encoding –
Some labels can be words or numbers. Usually, training data is labelled with
words to make it readable. Label encoding converts word labels into numbers
to let algorithms work on them.
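A minimal sketch with scikit-learn's LabelEncoder, using made-up word labels:

from sklearn.preprocessing import LabelEncoder

labels = ["spam", "ham", "spam", "ham", "spam"]
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)          # word labels converted to numbers
print(list(encoded), list(encoder.classes_))     # e.g. [1, 0, 1, 0, 1] and ['ham', 'spam']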
Machine Learning Algorithms
There are many types of Machine Learning algorithms specific to different use cases. As we work with datasets, a machine learning algorithm works in two stages. We usually split the data around 20%–80% between the testing and training stages. Under supervised learning, we split a dataset into training data and test data in Python ML. Following are the algorithms of Python Machine Learning –
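A minimal sketch of the 80/20 train/test split described above, using scikit-learn's train_test_split on made-up data:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)     # made-up feature matrix (10 samples)
y = np.arange(10)                    # made-up targets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)   # (8, 2) for training, (2, 2) for testing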
1. Linear Regression-
Linear regression may be defined as the statistical model that analyzes the linear relationship between a dependent variable and a given set of independent variables. A linear relationship between variables means that when the value of one or more independent variables changes (increases or decreases), the value of the dependent variable will also change accordingly (increase or decrease).
Mathematically, this relationship can be represented with the following equation:
Y = mX + b
where Y is the dependent variable we are trying to predict, X is the independent variable we are using to make predictions, m is the slope of the regression line, which represents the effect X has on Y, and b is a constant known as the intercept.
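As a minimal sketch, the equation Y = mX + b can be fitted with scikit-learn's LinearRegression; the data points below are made up:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # independent variable
y = np.array([3.1, 5.0, 7.2, 8.9])           # dependent variable
model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)      # estimated slope m and intercept b
print(model.predict([[5.0]]))                # prediction for a new value of X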
Chapter 4
Project Report
Objective-
Classification model to predict students' marks
Dataset description-
Dataset Source –
https://www.kaggle.com/datasets/yasserh/student-marks-dataset
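As a minimal sketch, the dataset can be loaded and inspected with pandas once the CSV has been downloaded from the Kaggle link above; the file name Student_Marks.csv is an assumption and may differ:

import pandas as pd

# File name assumed from the Kaggle dataset page; adjust the path if it differs.
df = pd.read_csv("Student_Marks.csv")
print(df.shape)       # number of rows and columns
print(df.head())      # first few records
print(df.describe())  # basic summary statistics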