Unit 1 Notes
[Introduction to Machine Learning: What is Machine Learning? Why Use Machine Learning?, Types of Machine Learning Systems, Main Challenges of Machine Learning, Applications of Machine Learning. Why Python, scikit-learn, Essential Libraries and Tools.]
Model-based learning:
Another way to generalize from a set of examples is to build a model of these examples and then
use that model to make predictions. This is called model-based learning (Figure 1-16).
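As a minimal sketch of this idea (the numbers below are made up for illustration, and scikit-learn's LinearRegression is just one possible choice of model), we can build a model from a few examples and then use it to predict a new value:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training examples: inputs and their target values
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])

# Build a model of these examples ...
model = LinearRegression()
model.fit(X, y)

# ... and use that model to make a prediction for new data
print(model.predict([[5.0]]))  # expected to be close to 10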
MAIN CHALLENGES OF MACHINE LEARNING:
Poor Quality of Data
Data plays a significant role in the machine learning process. One of the significant issues that
machine learning professionals face is the absence of good-quality data. Unclean and noisy data
can make the whole process extremely exhausting, and we do not want our algorithm to make
inaccurate or faulty predictions; hence, the quality of the data is essential to the quality of the
output. Therefore, we need to ensure that data preprocessing, which includes removing outliers,
filtering missing values, and removing unwanted features, is done carefully.
Underfitting of Training Data
Underfitting occurs when a model is unable to establish an accurate relationship between the input
and output variables. It is like trying to fit into undersized jeans: the model is too simple to capture
the underlying relationship. To overcome this issue, we can increase the model's complexity, add
more relevant features, increase the training time, and reduce noise in the data.
Overfitting of Training Data
Overfitting occurs when a machine learning model fits its training data too closely, including its
noise and bias, which negatively affects its performance on new data. It is like trying to fit into
oversized jeans. Unfortunately, this is one of the significant issues faced by machine learning
professionals: an algorithm trained on noisy or biased data will suffer in overall performance.
Let's understand this with the help of an example. Consider a model trained to differentiate between
a cat, a rabbit, a dog, and a tiger, where the training data contains 1,000 cats, 1,000 dogs, 1,000
tigers, and 4,000 rabbits. Then there is a considerable probability that the model will identify a cat
as a rabbit. In this example, we had a vast amount of data, but it was biased; hence the predictions
were negatively affected.
Machine Learning Is a Complex Process
The machine learning industry is young and continuously changing, and rapid hit-and-trial
experiments are being carried out. Because the process keeps transforming, there are high chances
of error, which makes learning complex. It includes analyzing the data, removing data bias,
training the model, applying complex mathematical calculations, and much more. Hence it is a
really complicated process, which is another big challenge for machine learning professionals.
Lack of Training Data
The most important task in the machine learning process is to train the model on enough data to
achieve accurate output. Too little training data will produce inaccurate or overly biased
predictions. Let us understand this with the help of an example. Training a machine learning
algorithm is similar to teaching a child. One day you decide to explain to a child how to distinguish
between an apple and a watermelon: you show him both and point out the differences in their color,
shape, and taste, and soon he will be able to tell them apart. A machine learning algorithm, on the
other hand, needs a lot of data to make the same distinction; for complex problems it may even
require millions of examples. Therefore, we need to ensure that machine learning algorithms are
trained with sufficient amounts of data.
Slow Implementation
This is one of the common issues faced by machine learning professionals. Machine learning
models are highly efficient at providing accurate results, but producing those results can take a
tremendous amount of time: slow programs, data overload, and excessive requirements all add
delay. Furthermore, the models require constant monitoring and maintenance to deliver the best
output.
Imperfections in the Algorithm When Data Grows
So you have found quality data, trained the model well, and the predictions are precise and
accurate. But there is a twist: the model may become useless in the future as the data grows. The
best model of the present may become inaccurate in the coming future and require further
adjustment, so regular monitoring and maintenance are needed to keep the algorithm working.
This is one of the most exhausting issues faced by machine learning professionals.
APPLICATIONS OF MACHINE LEARNING:
Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to
identify objects, persons, places, and more in digital images. A popular use case of image
recognition and face detection is the automatic friend tagging suggestion:
Facebook provides a feature of automatic friend tagging suggestions. Whenever we upload a photo
with our Facebook friends, we automatically get a tagging suggestion with names, and the
technology behind this is machine learning's face detection and recognition algorithms.
Speech Recognition:
While using Google, we get an option to "Search by voice"; this comes under speech recognition,
which is a popular application of machine learning.
Speech recognition is the process of converting voice instructions into text, and it is also known as
"speech to text" or "computer speech recognition." At present, machine learning algorithms
are widely used in speech recognition applications. Google Assistant, Siri, Cortana, and Alexa
use speech recognition technology to follow voice instructions.
Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path
with the shortest route and predicts the traffic conditions.
It predicts the traffic conditions, such as whether traffic is clear, slow-moving, or heavily
congested, in two ways:
Real-time location of the vehicle from the Google Maps app and sensors
Average time taken on past days at the same time
Everyone who uses Google Maps is helping to make the app better: it takes information from the
user and sends it back to its database to improve performance.
Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies, such as
Amazon and Netflix, for product recommendations to the user. Whenever we search for a product
on Amazon, we start getting advertisements for that same product while surfing the internet in the
same browser, and this is because of machine learning.
Google understands the user's interests using various machine learning algorithms and suggests
products according to those interests.
Similarly, when we use Netflix, we find recommendations for series, movies, and more, and this
is also done with the help of machine learning.
Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars, where machine
learning plays a significant role. Tesla, the most popular car manufacturing company, is working
on self-driving cars, using machine learning methods to train the car models to detect people and
objects while driving.
Email Spam and Malware Filtering:
Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We
always receive important mail in our inbox, marked with the important symbol, and spam emails
in our spam box; the technology behind this is machine learning. Below are some spam filters used
by Gmail:
Content Filter
Header filter
General blacklists filter
Rules-based filters
Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve
Bayes classifier are used for email spam filtering and malware detection.
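As an illustrative sketch only (the messages and labels below are invented, and this is not Gmail's actual pipeline), a Naïve Bayes spam classifier can be built with scikit-learn roughly like this:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training messages and their labels (1 = spam, 0 = not spam)
messages = ["win a free prize now", "meeting at 10 am tomorrow",
            "free money click here", "lunch with the project team"]
labels = [1, 0, 1, 0]

# Convert the text into word-count features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# Train a Naive Bayes classifier on the word counts
classifier = MultinomialNB()
classifier.fit(X, labels)

# Classify a new message: 1 means it looks like spam
new_message = vectorizer.transform(["claim your free prize"])
print(classifier.predict(new_message))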
Virtual Personal Assistant:
We have various virtual personal assistants, such as Google Assistant, Alexa, Cortana, and Siri. As
the name suggests, they help us find information using our voice instructions. These assistants can
help us in various ways just through voice instructions, such as playing music, calling someone,
opening an email, scheduling an appointment, and so on.
These virtual assistants use machine learning algorithms as an important part of their operation.
They record our voice instructions, send them to the server over the cloud, decode them using ML
algorithms, and act accordingly.
Online Fraud Detection:
Machine learning is making our online transactions safe and secure by detecting fraudulent
transactions. Whenever we perform an online transaction, there are various ways a fraudulent
transaction can take place, such as fake accounts, fake IDs, and money being stolen in the middle
of a transaction. To detect this, a feed-forward neural network helps us by checking whether a
transaction is genuine or fraudulent.
For each genuine transaction, the output is converted into hash values, and these values become
the input for the next round. Each genuine transaction follows a specific pattern, which changes
for a fraudulent transaction; hence, the system detects it and makes our online transactions more
secure.
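As a rough, hypothetical sketch (the transaction features and labels below are invented, and real fraud-detection systems are far more elaborate), a feed-forward neural network classifier from scikit-learn could be trained to flag transactions:

from sklearn.neural_network import MLPClassifier

# Hypothetical transaction features: [amount, hour of day, is_new_account]
X = [[20.0, 14, 0], [3000.0, 3, 1], [45.0, 18, 0], [2500.0, 2, 1]]
y = [0, 1, 0, 1]  # 0 = genuine, 1 = fraudulent

# A small feed-forward (multi-layer perceptron) network
clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
clf.fit(X, y)

# Check whether a new transaction looks genuine or fraudulent
print(clf.predict([[2800.0, 4, 1]]))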
Stock Market Trading:
Machine learning is widely used in stock market trading. In the stock market, there is always a risk
of ups and downs in share prices, so machine learning's long short-term memory (LSTM) neural
network is used for the prediction of stock market trends.
Medical Diagnosis:
In medical science, machine learning is used for disease diagnosis. With it, medical technology
is growing very fast and is able to build 3D models that can predict the exact position of lesions in
the brain. It helps in finding brain tumors and other brain-related diseases easily.
Automatic Language Translation:
Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all;
machine learning helps us by converting the text into languages we know. Google's GNMT
(Google Neural Machine Translation) provides this feature: it is a neural machine translation
system that translates text into a familiar language, and this is called automatic translation.
The technology behind automatic translation is a sequence-to-sequence learning algorithm, which
is used together with image recognition to translate text from one language to another.
WHY PYTHON:
Python combines the power of general-purpose programming languages with the ease of use of
domain-specific scripting languages like MATLAB or R. Python has libraries for data loading,
visualization, statistics, natural language processing, image processing, and more. This vast
toolbox provides data scientists with a large array of general- and special-purpose functionality.
One of the main advantages of using Python is the ability to interact directly with the code, using
a terminal or other tools like the Jupyter Notebook.
As a general-purpose programming language, Python also allows for the creation of complex
graphical user interfaces (GUIs) and web services, and for integration into existing systems.
SCIKIT-LEARN:
scikit-learn is an open source project, meaning that it is free to use and distribute, and anyone can
easily obtain the source code to see what is going on behind the scenes. The scikit-learn project is
constantly being developed and improved, and it has a very active user community. It contains a
number of state-of-the-art machine learning algorithms, as well as comprehensive documentation
about each algorithm.
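As a brief, illustrative example of this workflow (using scikit-learn's built-in iris dataset; the snippet is a sketch rather than part of the original notes), most estimators in the library follow the same fit/predict pattern:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small built-in dataset and split it into training and test sets
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# Fit a k-nearest-neighbors classifier and evaluate it on the test set
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print("Test set accuracy: {:.2f}".format(knn.score(X_test, y_test)))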
Installing scikit-learn
scikit-learn depends on two other Python packages, NumPy and SciPy. For plotting and
interactive development, you should also install matplotlib, IPython, and the Jupyter Notebook.
Anaconda
A Python distribution made for large-scale data processing, predictive analytics, and scientific
computing. Anaconda comes with NumPy, SciPy, matplotlib, pandas, IPython, Jupyter Notebook,
and scikit-learn. Available on Mac OS, Windows, and Linux, it is a very convenient solution and
is the one we suggest for people without an existing installation of the scientific Python packages.
Anaconda now also includes the commercial Intel MKL library for free. Using MKL (which is
done automatically when Anaconda is installed) can give significant speed improvements for many
algorithms in scikit-learn.
Enthought Canopy
Another Python distribution for scientific computing. This comes with NumPy, SciPy, matplotlib,
pandas, and IPython, but the free version does not come with scikit-learn. If you are part of an
academic, degree-granting institution, you can request an academic license and get free access to
the paid subscription version of Enthought Canopy. Enthought Canopy is available for Python
2.7.x, and works on Mac OS, Windows, and Linux.
Python(x,y)
A free Python distribution for scientific computing, specifically for Windows. Python(x,y) comes
with NumPy, SciPy, matplotlib, pandas, IPython, and scikit-learn.
If you already have a Python installation set up, you can use pip to install all of these packages:
$ pip install numpy scipy matplotlib ipython scikit-learn pandas
Jupyter Notebook
The Jupyter Notebook is an interactive environment for running code in the browser. It is a great
tool for exploratory data analysis and is widely used by data scientists. While the Jupyter Notebook
supports many programming languages, we only need the Python support.
NumPy
NumPy is one of the fundamental packages for scientific computing in Python. It contains
functionality for multidimensional arrays, high-level mathematical functions such as linear algebra
operations and the Fourier transform, and pseudorandom number generators.
In scikit-learn, the NumPy array is the fundamental data structure. scikit-learn takes in data in the
form of NumPy arrays. Any data you’re using will have to be converted to a NumPy array. The
core functionality of NumPy is the ndarray class, a multidimensional (n-dimensional) array. All
elements of the array must be of the same type. A NumPy array looks like this:
import numpy as np
# Create a 2D array with two rows and three columns
x = np.array([[1, 2, 3], [4, 5, 6]])
print("x:\n{}".format(x))
Output:
x:
[[1 2 3]
[4 5 6]]
SciPy
SciPy is a collection of functions for scientific computing in Python. It provides, among other
functionality, advanced linear algebra routines, mathematical function optimization, signal
processing, special mathematical functions, and statistical distributions. scikit-learn draws from
SciPy’s collection of functions for implementing its algorithms. The most important part of SciPy
for us is scipy.sparse: this provides sparse matrices, which are another representation that is used
for data in scikit-learn. Sparse matrices are used whenever we want to store a 2D array that contains
mostly zeros:
eye = np.eye(4)
print("NumPy array:\n{}".format(eye))
Output:
NumPy array:
[[ 1. 0. 0. 0.]
[ 0. 1. 0. 0.]
[ 0. 0. 1. 0.]
[ 0. 0. 0. 1.]]
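As a minimal follow-up sketch, the dense identity matrix above can be converted into a SciPy sparse matrix in CSR format, which stores only the nonzero entries and their positions:

from scipy import sparse

# Convert the dense NumPy array into a sparse CSR matrix
sparse_matrix = sparse.csr_matrix(eye)
print("SciPy sparse CSR matrix:\n{}".format(sparse_matrix))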
matplotlib
matplotlib is the primary scientific plotting library in Python. It provides functions for making
publication-quality visualizations such as line charts, histograms, scatter plots, and so on.
Visualizing your data and different aspects of your analysis can give you important insights, and
we will be using matplotlib for all our visualizations. When working inside the Jupyter Notebook,
you can show figures directly in the browser by using the %matplotlib notebook and %matplotlib
inline commands.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# Generate a sequence of numbers from -10 to 10 with 100 steps in between
x = np.linspace(-10, 10, 100)
y = np.sin(x)
# The plot function makes a line chart of one array against another
plt.plot(x, y, marker="x")
pandas
pandas is a Python library for data wrangling and analysis. It is built around a data structure called
the DataFrame that is modeled after the R DataFrame. Simply put, a pandas DataFrame is a table,
similar to an Excel spreadsheet. pandas provides a great range of methods to modify and operate
on this table; in particular, it allows SQL-like queries and joins of tables. In contrast to NumPy,
which requires that all entries in an array be of the same type, pandas allows each column to have
a separate type (for example, integers, dates, floating-point numbers, and strings). Another
valuable tool provided by pandas is its ability to ingest from a great variety of file formats and
databases, like SQL, Excel files, and comma-separated values (CSV) files.
import pandas as pd
# Example data (values chosen only for illustration)
data = {"Name": ["John", "Anna", "Peter"], "Age": [24, 13, 53]}
data_pandas = pd.DataFrame(data)
display(data_pandas)
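As a quick usage sketch (continuing with the hypothetical data defined above), pandas supports SQL-like selection of rows:

# Select all rows whose Age column is greater than 20
print(data_pandas[data_pandas.Age > 20])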
mglearn
mglearn is a library of utility functions. If you see a call to mglearn in the code, it is usually a way
to make a pretty picture quickly, or to get our hands on some interesting data.
IMPORTANT QUESTIONS:
1. Why is Python preferred for Machine Learning and what are the essential libraries and tools used in
Python for Machine Learning?
2. What are the different types of Machine Learning systems and their key characteristics?
3. What are some popular applications of Machine Learning and how do they work?
4. What are the main challenges of machine learning?
5. Explain the essential libraries and tools in Python.