UNIT-5
Syllabus: Importance of Machine learning in Data Science
Machine Learning: Introduction to machine learning, applications for machine learning in data science, Python tools used in machine learning, the modeling process, types of machine learning.
Introduction to machine learning:
What is Machine Learning: In the real world, we are surrounded by humans, who can learn from their experiences, and by computers or machines, which work on our instructions. But can a machine also learn from experience or past data, as a human does? This is where machine learning comes in.
Features of Machine Learning: Machine learning uses data to detect patterns in a given dataset. It can learn from past data and improve automatically. It is a data-driven technology. Machine learning is similar to data mining, as it also deals with huge amounts of data.
Need for Machine Learning: The demand for machine learning is steadily rising, because it can perform tasks that are too complex for a person to implement directly.
N.KOTESWARA RAO, NEC, GUDUR. Page 1
Humans are limited by their inability to process vast amounts of data manually; we therefore need computer systems, and this is where machine learning simplifies our lives. We train machine learning algorithms by providing them with a large amount of data and letting them explore the data, build models, and predict the required output automatically. Using machine learning can save both time and money.
The importance of machine learning is easily understood from its use cases. Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestion by Facebook, and so on. Top companies such as Netflix and Amazon have built machine learning models that use vast amounts of data to analyse user interest and recommend products accordingly.
The following key points show the importance of machine learning:
1. Rapid increase in the production of data
2. Solving complex problems that are difficult for a human
3. Decision making in various sectors, including finance
4. Finding hidden patterns and extracting useful information from data
Applications for machine learning in data science:
Machine learning is a buzzword in the technology world right now, and for good reason: it represents a major step forward in how computers can learn. The following are applications of machine learning:
1. Traffic Alerts
2. Social Media
3. Transportation and Commuting
4. Product Recommendations
5. Virtual Personal Assistants
6. Self Driving Cars
7. Dynamic Pricing
8. Google Translate
9. Online Video Streaming
10. Fraud Detection
1. Traffic Alerts (Maps): Google Maps is probably the app we use most whenever we go out and need help with directions and traffic. On a recent expressway trip to another city, Maps reported: “Despite the heavy traffic, you are on the fastest route”. How does it know that? It combines data from people currently using the service, historic data for that route collected over time, and a few techniques acquired from other companies. Everyone using Maps provides their location, average speed, and route, which helps Google collect massive amounts of traffic data, predict upcoming traffic, and adjust your route accordingly.
2. Social Media (Facebook): One of the most common applications of machine learning is automatic friend tagging suggestions on Facebook or any other social media platform. Facebook uses face detection and image recognition to find a face that matches its database and then suggests that we tag that person. Facebook's deep learning project DeepFace is responsible for recognising faces and identifying which person is in the picture. It also provides alt tags (alternative text) for images already uploaded to Facebook. For example, if we inspect an image on Facebook, its alt tag contains a description of the picture.
3. Transportation and Commuting (Uber): If you have used an app to book a cab, you are already using machine learning to an extent. The app provides a personalised experience: it automatically detects your location and offers options to go home, to the office, or to any other frequent place based on your history and patterns. It uses a machine learning algorithm layered on top of historic trip data to make more accurate ETA predictions. With the implementation of machine learning, Uber reportedly saw a 26% improvement in accuracy for delivery and pickup.
4. Product Recommendations: Suppose you check an item on Amazon but do not buy it then and there. The next day, you are watching videos on YouTube and suddenly see an ad for the same item. You switch to Facebook and see the same ad there too. How does this happen? It happens because Google tracks your search history and recommends ads based on it. This is one of the most striking applications of machine learning. In fact, 35% of Amazon's revenue is reportedly generated by product recommendations.
5. Virtual Personal Assistants: As the name suggests, virtual personal assistants help find useful information when asked via text or voice. The major applications of machine learning here are:
1. Speech Recognition
2. Speech to Text Conversion
3. Natural Language Processing
4. Text to Speech Conversion
All you need to do is ask a simple question such as “What is my schedule for tomorrow?” or “Show my upcoming flights”. To answer, the personal assistant searches for information or recalls your related queries. Personal assistants are now also being used in chatbots, which are being implemented in food ordering apps, online training websites, and commuting apps.
6. Self Driving Cars: This is one of the most striking applications of machine learning, and people are already using it. Machine learning plays a very important role in self-driving cars, and you have probably heard of Tesla, a leader in this business. Its artificial intelligence is driven by hardware from the manufacturer NVIDIA and is based on an unsupervised learning algorithm. NVIDIA stated that they did not train their model to detect people or any object as such. The model works on deep learning and crowdsources data from all of its vehicles and their drivers. It uses internal and external sensors which are a
part of IoT. According to data gathered by McKinsey, automotive data will hold a tremendous value of $750 billion.
7. Dynamic Pricing: Setting the right price for a good or service is an old problem in economic theory. There is a vast number of pricing strategies, depending on the objective sought. Be it a movie ticket, a plane ticket, or a cab fare, everything is dynamically priced. In recent years, artificial intelligence has enabled pricing solutions to track buying trends and determine more competitive product prices. How does Uber determine the price of your ride? Uber's biggest use of machine learning is surge pricing, driven by a machine learning model nicknamed “Geosurge”. If you are running late for a meeting and need to book an Uber in a crowded area, be ready to pay twice the normal fare. Likewise for flights: if you travel during the festive season, prices may well be twice the usual price.
8. Google Translate: Remember the time when you travelled to a new place and found it difficult to communicate with the locals or to find local spots because everything was written in a different language? Those days are gone. Google's GNMT (Google Neural Machine Translation) is a neural machine translation system that works on thousands of languages and dictionaries and uses natural language processing to provide the most accurate translation of any sentence or word. Since the tone of the words also matters, it uses other techniques such as POS tagging, NER (named entity recognition), and chunking. It is one of the best and most widely used applications of machine learning.
9. Online Video Streaming (Netflix): With over 100 million subscribers, Netflix is without doubt the leader of the online streaming world. Netflix's speedy rise took the whole movie industry aback, forcing it to ask, “How on earth could one single website take on Hollywood?” The answer is machine learning.
The Netflix algorithm constantly gathers massive amounts of data about users' activities, such as:
1. When you pause, rewind, or fast forward
2. What day you watch content (TV shows on weekdays and movies on weekends)
3. The date and time you watch
4. When you pause and leave content (and whether you ever come back)
5. The ratings given (about 4 million per day)
6. Searches (about 3 million per day)
7. Browsing and scrolling behaviour, and a lot more
Netflix collects this data for each subscriber and feeds it to its recommender system and many other machine learning applications. That is why it has such a high customer retention rate.
10. Fraud Detection: Experts predicted that online credit card fraud would soar to a whopping $32 billion in 2020. That is more than the profit made by Coca-Cola and JP Morgan Chase combined, which is something to worry about. Fraud detection is one of the most necessary applications of machine learning. The number of transactions has increased due to a plethora of payment channels: credit and debit cards, smartphones, numerous wallets, UPI, and more. At the same time, criminals have become adept at finding loopholes. Whenever a customer carries out a transaction, the machine learning model thoroughly examines their profile, searching for suspicious patterns. In machine learning, problems like fraud detection are usually framed as classification problems.
Python tools used in machine learning:
Python has an overwhelming number of packages that can be used in a machine learning setting. The Python machine learning ecosystem can be divided into three main types of packages, as shown in the figure below. The first type of package is mainly used for simple tasks and when the data fits into memory. The second type is used to optimize your code when you have finished prototyping and run into speed or memory issues. The third type is specific to using Python with big data technologies.
Figure. Overview of Python packages used during the machine-learning phase
Packages for working with data in memory: When prototyping, the following packages can get you started by providing advanced functionality with a few lines of code:
SciPy is a library that integrates fundamental packages often used in scientific computing, such as NumPy, matplotlib, Pandas, and SymPy.
NumPy gives you access to powerful array functions and linear algebra functions.
Matplotlib is a popular 2D plotting package with some 3D functionality.
Pandas is a high-performance but easy-to-use data-wrangling package. It introduces dataframes to Python, a type of in-memory data table. The concept should sound familiar to regular users of R.
SymPy is a package used for symbolic mathematics and computer algebra.
StatsModels is a package for statistical methods and algorithms.
Scikit-learn is a library filled with machine learning algorithms.
RPy2 allows you to call R functions from within Python. R is a popular open source statistics program.
NLTK (Natural Language Toolkit) is a Python toolkit with a focus on text analytics.
These libraries are good to get started with, but once you decide to run a certain Python program at frequent intervals, performance comes into play.
Optimizing operations: Once your application moves into production, the libraries listed here can help you deliver the speed you need. Sometimes this involves connecting to big data infrastructures such as Hadoop and Spark.
Numba and NumbaPro: These use just-in-time compilation to speed up applications written directly in Python with a few annotations. NumbaPro also allows you to use the power of your graphics processing unit (GPU).
PyCUDA: This allows you to write code that will be executed on the GPU instead of your CPU and is therefore ideal for calculation-heavy applications. It works best with problems that lend themselves to being parallelized and need little input compared to the number of required computing cycles. An example is studying the robustness of your predictions by calculating thousands of different outcomes based on a single start state.
Cython, or C for Python: This brings the C programming language to Python. C is a lower-level language, so the code is closer to what the computer eventually executes (machine code). The closer code is to bits and bytes, the faster it executes. A computer is also faster when it knows the type of a variable in advance (called static typing).
Python wasn't designed to do this, and Cython helps you to overcome this shortfall.
Blaze: Blaze gives you data structures that can be bigger than your computer's main memory, enabling you to work with large data sets.
Dispy and IPCluster: These packages allow you to write code that can be distributed over a cluster of computers.
PP: Python is executed as a single process by default. With the help of PP you can parallelize computations on a single machine or over clusters.
Pydoop and Hadoopy: These connect Python to Hadoop, a common big data framework.
PySpark: This connects Python and Spark, an in-memory big data framework.
The modeling process:
The modeling phase consists of four steps:
1. Feature engineering and model selection
2. Training the model
3. Model validation and selection
4. Applying the trained model to unseen data
Before you find a good model, you will probably iterate among the first three steps. The last step isn't always present, because sometimes the goal isn't prediction but explanation (root cause analysis). For instance, you might want to find out the causes of species' extinctions but not necessarily predict which one is next in line to leave our planet.
It's possible to chain or combine multiple techniques. When you chain multiple models, the output of the first model becomes an input for the second model. When you combine multiple models, you train them independently and combine their results. This last technique is also known as ensemble learning.
A model consists of constructs of information called features or predictors and a target or response variable. The best models are those that accurately represent reality, preferably while staying concise and interpretable.
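The difference between chaining and combining models can be sketched in a few lines of plain Python. Everything here is a hypothetical illustration: the three hand-written rule "models", their thresholds, and the (height_cm, weight_kg) feature pair are assumptions, and real models would be trained on data rather than written by hand.

```python
from collections import Counter

# Three deliberately simple, hypothetical "models". Each maps a
# (height_cm, weight_kg) pair to a label. Thresholds are made up.
def model_height(x):
    return "dog" if x[0] > 40 else "cat"

def model_weight(x):
    return "dog" if x[1] > 8 else "cat"

def model_ratio(x):
    return "dog" if x[1] / x[0] > 0.2 else "cat"

def ensemble_predict(models, x):
    """Combine independent models by majority vote (ensemble learning)."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

def chained_predict(x):
    """Chain two models: the output of the first is the input of the second."""
    size_score = 0.5 * x[0] + 2.0 * x[1]         # model 1: features -> score
    return "dog" if size_score > 35 else "cat"   # model 2: score -> label

models = [model_height, model_weight, model_ratio]
sample = (45, 7)  # a tall but light animal
print("ensemble:", ensemble_predict(models, sample))
print("chained: ", chained_predict(sample))
```

With an odd number of voters and two classes, the majority vote can never tie, which is why three models are used in this sketch.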
To build models that accurately represent reality, feature engineering is the most important and arguably most interesting part of modeling.
1. Engineering features and selecting a model: With feature engineering, you must come up with and create possible predictors for the model. This is one of the most important steps in the process because a model recombines these features to achieve its predictions. Often you may need to consult an expert or the appropriate literature to come up with meaningful features. When the initial features are created, a model can be trained on the data.
2. Training your model: With the right predictors in place and a modeling technique in mind, you can progress to model training. In this phase you present to your model data from which it can learn. The most common modeling techniques have industry-ready implementations in almost every programming language, including Python. These enable you to train your models by executing a few lines of code. For more state-of-the-art data science techniques, you'll probably end up doing heavy mathematical calculations and implementing them with modern computer science techniques. Once a model is trained, it's time to test whether it can be extrapolated to reality: model validation.
3. Validating a model: Data science has many modeling techniques, and the question is which one is the right one to use. A good model has two properties: it has good predictive power and it generalizes well to data it hasn't seen. To achieve this you define an error measure (how wrong the model is) and a validation strategy. Two common error measures in machine learning are the classification error rate for classification problems and the mean squared error for regression problems. The classification error rate is the percentage of observations in the test data set that your model mislabeled; lower is better. The mean squared error measures
how big the average error of your prediction is. Squaring each error before averaging has two consequences: a wrong prediction in one direction cannot cancel out a faulty prediction in the other direction, and bigger errors get even more weight than they otherwise would.
Many validation strategies exist, including the following common ones:
Holdout: Divide your data into a training set with X% of the observations and keep the rest as a holdout data set (a data set that's never used for model creation). This is the most common technique.
K-folds cross validation: This strategy divides the data set into k parts and uses each part one time as a test data set while using the others as a training data set. This has the advantage that you use all the data available in the data set.
Leave-1 out: This approach is the same as k-folds but with k equal to the number of observations. You always leave one observation out and train on the rest of the data. This is used only on small data sets, so it's more valuable to people evaluating laboratory experiments than to big data analysts.
Validation is extremely important because it determines whether your model works in real-life conditions. Once you have constructed a good model, you can (optionally) use it to predict the future.
4. Predicting new observations: If you have implemented the first three steps successfully, you now have a performant model that generalizes to unseen data. The process of applying your model to new data is called model scoring. In fact, model scoring is something you implicitly did during validation, only now you don't know the correct outcome. Model scoring involves two steps. First, you prepare a data set that has features exactly as defined by your model. This boils down to repeating the data preparation you did in step one of the modeling process, but for a new data set. Then you apply the model to this new data set, and this results in a prediction.
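The error measures and validation strategies above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the model is a simple least-squares line, the function names and the small data set are made up for the example, and in practice a library such as Scikit-learn would normally do this work. Leave-one-out appears here as k-fold cross validation with k set to the number of observations.

```python
def fit_line(xs, ys):
    """Train: least-squares fit of y = a + b*x. Returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences (regression error measure)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def classification_error_rate(y_true, y_pred):
    """Fraction of observations that were mislabeled."""
    return sum(1 for t, p in zip(y_true, y_pred) if t != p) / len(y_true)

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx); every index is tested exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, folds[i]

def cross_validated_mse(xs, ys, k):
    """Train on k-1 folds, score on the remaining fold, average the scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        a, b = fit_line([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        preds = [a + b * xs[j] for j in test_idx]
        scores.append(mean_squared_error([ys[j] for j in test_idx], preds))
    return sum(scores) / len(scores)

# Hypothetical data: roughly y = 2x with some noise.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
print("4-fold MSE:       ", cross_validated_mse(xs, ys, k=4))
print("leave-one-out MSE:", cross_validated_mse(xs, ys, k=len(xs)))
print("error rate:       ", classification_error_rate(
    ["spam", "ham", "spam"], ["spam", "spam", "spam"]))
```

The folds are built by striding through the indices rather than shuffling, which keeps the sketch deterministic; real data is usually shuffled before splitting.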
Types of machine learning:
Machine learning is a subset of AI that enables a machine to learn automatically from data, improve its performance from past experience, and make predictions. Machine learning comprises a set of algorithms that work on huge amounts of data. Data is fed to these algorithms to train them, and on the basis of that training they build a model and perform a specific task. These ML algorithms help to solve different business problems such as regression, classification, forecasting, clustering, and association.
Based on the methods and way of learning, machine learning is divided into four main types:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
1. Supervised Machine Learning: As its name suggests, supervised machine learning is based on supervision. In the supervised learning technique, we train the machine using a "labelled" dataset, and based on that training, the machine predicts the output. Here, labelled data means that the inputs are already mapped to the correct output. More precisely, we first train the machine with inputs and their corresponding outputs, and then ask it to predict the output for a test dataset.
Let's understand supervised learning with an example. Suppose we have an input dataset of cat and dog images. First, we train the machine to understand the images using features such as the shape and size of the tail, the shape of the eyes, colour, and height (dogs are taller, cats are smaller). After training, we input a picture of a cat and ask the machine to identify the object and predict the output. The trained machine checks all the features of the object, such as height, shape, colour, eyes, ears, and tail, finds that it is a cat, and puts it in the Cat category. This is how the machine identifies objects in supervised learning.
The main goal of the supervised learning technique is to map the input variable (x) to the output variable (y). Some real-world applications of supervised learning are risk assessment, fraud detection, and spam filtering.
Categories of Supervised Machine Learning: Supervised machine learning can be classified into two types of problems:
1. Classification
2. Regression
a) Classification: Classification algorithms are used to solve classification problems, in which the output variable is categorical, such as "Yes" or "No", Male or Female, Red or Blue. The classification algorithms predict the categories present in the dataset.
Some real-world examples of classification are spam detection and email filtering.
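The cat and dog example above can be turned into a small supervised classifier. The sketch below uses a 1-nearest-neighbour rule, chosen only because it fits in a few lines; it is not one of the algorithms this unit lists. The training measurements (height in cm, weight in kg) are invented for illustration.

```python
import math

# Hypothetical labelled training data: (height_cm, weight_kg) -> label.
training_data = [
    ((23.0, 4.0), "cat"),
    ((25.0, 4.5), "cat"),
    ((24.0, 3.8), "cat"),
    ((55.0, 22.0), "dog"),
    ((60.0, 30.0), "dog"),
    ((48.0, 18.0), "dog"),
]

def predict(features, labelled_examples):
    """Return the label of the closest training example (1-nearest neighbour)."""
    nearest = min(labelled_examples, key=lambda ex: math.dist(features, ex[0]))
    return nearest[1]

# "Test" inputs the model has not seen during training.
print(predict((26.0, 5.0), training_data))
print(predict((52.0, 20.0), training_data))
```

The labels in `training_data` are what makes this supervised: the classifier only ever returns a category that some labelled example already carries.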
Some popular classification algorithms are given below:
1. Random Forest Algorithm
2. Decision Tree Algorithm
3. Logistic Regression Algorithm
4. Support Vector Machine Algorithm
b) Regression: Regression algorithms are used to solve regression problems, in which the output variable is continuous and is modelled as a function (linear in the simplest case) of the input variables. They are used to predict continuous output variables, such as market trends or weather measurements. Some popular regression algorithms are given below:
1. Simple Linear Regression Algorithm
2. Multivariate Regression Algorithm
3. Decision Tree Algorithm
4. Lasso Regression
Advantages and Disadvantages of Supervised Learning:
Advantages:
1. Since supervised learning works with a labelled dataset, we have an exact idea of the classes of objects.
2. These algorithms help predict the output on the basis of prior experience.
Disadvantages:
1. These algorithms are not able to solve complex tasks.
2. They may predict the wrong output if the test data differs from the training data.
3. Training the algorithm requires a lot of computation time.
2. Unsupervised Machine Learning: Unsupervised learning differs from the supervised learning technique; as its name suggests, there is no need for supervision. In unsupervised machine learning, the machine is trained using an unlabelled dataset and predicts the output without any supervision.
In unsupervised learning, the models are trained with data that is neither classified nor labelled, and the model acts on that data without any supervision. The main aim of an unsupervised learning algorithm is to group or categorise the unsorted dataset according to similarities, patterns, and differences. Machines are instructed to find the hidden patterns in the input dataset.
Let's take an example to understand it more precisely. Suppose there is a basket of fruit images, and we input it into the machine learning model. The images are totally unknown to the model, and the task of the machine is to find the patterns and categories of the objects. The machine discovers patterns and differences, such as differences in colour and shape, and predicts the output when it is tested with the test dataset.
Categories of Unsupervised Machine Learning: Unsupervised learning can be further classified into two types:
1. Clustering
2. Association
1) Clustering: The clustering technique is used when we want to find the inherent groups in the data. It groups objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of other groups. An example of clustering is grouping customers by their purchasing behaviour. Some popular clustering algorithms and related unsupervised techniques are given below:
1. K-Means Clustering algorithm
2. Mean-shift algorithm
3. DBSCAN Algorithm
4. Principal Component Analysis (strictly a dimensionality-reduction technique, often used alongside clustering)
5. Independent Component Analysis (likewise a decomposition technique rather than a clustering algorithm)
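As a rough sketch of how clustering groups customers by purchasing behaviour, the code below runs K-Means on one-dimensional data in plain Python. The monthly spending figures, the function name `k_means_1d`, and the choice to start the two centroids at the smallest and largest value are assumptions made for the example.

```python
def k_means_1d(values, centroids, iterations=10):
    """K-Means (Lloyd's algorithm) on 1-D data, starting from given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters

# Hypothetical monthly spend per customer. No labels are provided.
monthly_spend = [12, 15, 14, 10, 95, 102, 98, 110]
centroids, clusters = k_means_1d(
    monthly_spend, centroids=[min(monthly_spend), max(monthly_spend)])
print("centroids:", centroids)
print("clusters: ", clusters)
```

No label such as "low spender" or "high spender" is ever given to the algorithm; the two groups emerge from the distances alone, and naming them is left to the analyst.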
2) Association: Association rule learning is an unsupervised learning technique that finds interesting relations among variables within a large dataset. The main aim of this learning algorithm is to find the dependency of one data item on another and to map those variables accordingly, for example so that a business can generate maximum profit. It is mainly applied in market basket analysis, web usage mining, continuous production, etc. Some popular association rule learning algorithms are the Apriori algorithm, Eclat, and the FP-growth algorithm.
Advantages and Disadvantages of Unsupervised Learning Algorithm
Advantages:
1. These algorithms can be used for more complicated tasks than supervised ones because they work on unlabelled datasets.
2. Unsupervised algorithms are preferable for various tasks because an unlabelled dataset is easier to obtain than a labelled one.
Disadvantages:
1. The output of an unsupervised algorithm can be less accurate because the dataset is not labelled and the algorithm is not trained with the exact output in advance.
2. Working with unsupervised learning is more difficult because the unlabelled dataset is not mapped to any output.
3. Semi-Supervised Learning: Semi-supervised learning is a type of machine learning that lies between supervised and unsupervised machine learning. It represents the intermediate ground between supervised learning (with labelled training data) and unsupervised learning (with no labelled training data) and uses a combination of labelled and unlabelled datasets during the training period. Although semi-supervised learning is the middle ground between supervised and unsupervised learning and operates on data that contains a few
labels, it mostly consists of unlabeled data. Labels are costly to obtain, so in practice an organisation may have only a few of them. Semi-supervised learning differs from supervised and unsupervised learning, which are defined by the presence and absence of labels respectively.
The concept of semi-supervised learning was introduced to overcome the drawbacks of supervised and unsupervised learning algorithms. Its main aim is to use all the available data effectively, rather than only the labelled data as in supervised learning. Initially, similar data is clustered with an unsupervised learning algorithm, and the clusters then help to turn the unlabeled data into labelled data. This matters because labelled data is comparatively more expensive to acquire than unlabeled data.
We can picture these algorithms with an example. Supervised learning is where a student is under the supervision of an instructor at home and at college. If that student analyses the same concept alone, without any help from the instructor, it is unsupervised learning. Under semi-supervised learning, the student first studies a concept under the guidance of an instructor at college and then revises it alone.
Advantages and disadvantages of Semi-supervised Learning
Advantages:
1. The algorithm is simple and easy to understand.
2. It is highly efficient.
3. It is used to overcome the drawbacks of supervised and unsupervised learning algorithms.
Disadvantages:
1. Iteration results may not be stable.
2. These algorithms cannot be applied to network-level data.
3. Accuracy is low.
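One simple way to picture how a few labels can spread to unlabelled data is the pseudo-labelling sketch below. It is an assumption-laden illustration rather than a standard library routine: the points are one-dimensional, each distinct value is used as a dictionary key, and the function name `self_train` and the labels "low" and "high" are invented for the example.

```python
def self_train(labelled, unlabelled):
    """Pseudo-label 1-D points: repeatedly give the unlabelled point that is
    closest to any labelled point the label of that neighbour, so labels
    spread outward through dense regions of the data."""
    labelled = dict(labelled)    # value -> label (values assumed distinct)
    remaining = list(unlabelled)
    while remaining:
        # Find the (unlabelled, labelled) pair with the smallest distance.
        point, neighbour = min(
            ((u, l) for u in remaining for l in labelled),
            key=lambda pair: abs(pair[0] - pair[1]),
        )
        labelled[point] = labelled[neighbour]
        remaining.remove(point)
    return labelled

# Two expensive hand-made labels and six cheap unlabelled observations.
seed_labels = {1.0: "low", 9.0: "high"}
unlabelled_points = [1.5, 2.2, 3.0, 7.0, 8.1, 8.6]

result = self_train(seed_labels, unlabelled_points)
for value in sorted(result):
    print(value, result[value])
```

Only two points were labelled by hand, yet all eight end up with a label. The instability mentioned among the disadvantages shows up here too: one wrong early pseudo-label would be copied to every point labelled from it afterwards.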
4. Reinforcement Learning: Reinforcement learning works on a feedback-based process in which an AI agent (a software component) automatically explores its surroundings by trial and error: taking actions, learning from experience, and improving its performance. The agent is rewarded for each good action and punished for each bad action; hence the goal of a reinforcement learning agent is to maximize the rewards. In reinforcement learning there is no labelled data as in supervised learning, and agents learn from their experiences only.
The reinforcement learning process is similar to human learning; for example, a child learns various things through experience in day-to-day life. An example of reinforcement learning is playing a game, where the game is the environment, the moves of the agent at each step define states, and the goal of the agent is to get a high score. The agent receives feedback in the form of punishments and rewards.
Due to its way of working, reinforcement learning is employed in fields such as game theory, operations research, information theory, and multi-agent systems.
A reinforcement learning problem can be formalized as a Markov Decision Process (MDP). In an MDP, the agent constantly interacts with the environment and performs actions; after each action, the environment responds and generates a new state.
Categories of Reinforcement Learning
Reinforcement learning is categorized mainly into two types of methods:
Positive Reinforcement Learning: Positive reinforcement increases the tendency that the required behaviour will occur again by adding something. It strengthens the behaviour of the agent and impacts it positively.
Negative Reinforcement Learning: Negative reinforcement works in the opposite way to positive RL. It increases the tendency that the specific behaviour will occur again by avoiding a negative condition.
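The agent, environment, state, action, and reward of an MDP can be made concrete with a tiny Q-learning sketch. Q-learning is one common reinforcement learning algorithm, picked here for illustration. The five-cell corridor, the reward of 1 for reaching the last cell, and the values of `alpha`, `gamma`, and `episodes` are all assumptions made for this example.

```python
import random

N_STATES = 5            # corridor cells 0..4; cell 4 is the goal
GOAL = N_STATES - 1
ACTIONS = (-1, +1)      # action index 0 = step left, 1 = step right

def step(state, action_index):
    """Environment: apply the move and return (next_state, reward)."""
    next_state = min(max(state + ACTIONS[action_index], 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn a table q[state][action] of expected discounted rewards."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            action = rng.randrange(2)   # trial and error: random moves
            next_state, reward = step(state, action)
            # Move the estimate toward reward + discounted best future value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = q_learning()
# Greedy policy: in each non-goal cell, pick the action with the larger value.
policy = [max((0, 1), key=lambda a: q_table[s][a]) for s in range(GOAL)]
print("policy (0 = left, 1 = right):", policy)
```

The agent is never told that moving right is correct. It only sees a reward when it reaches the goal, and the discount factor `gamma` makes cells closer to the goal more valuable, so the learned policy ends up stepping right in every cell.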
Real-world Use cases of Reinforcement Learning
1. Video Games: RL algorithms are very popular in gaming applications, where they are used to reach super-human performance. Well-known game-playing systems built with RL algorithms are AlphaGo and AlphaGo Zero.
2. Resource Management: The "Resource Management with Deep Reinforcement Learning" paper showed how to use RL to automatically learn to allocate and schedule computer resources for waiting jobs in order to minimize average job slowdown.
3. Robotics: RL is widely used in robotics applications. Robots are used in industrial and manufacturing settings, and reinforcement learning makes these robots more capable. Different industries have a vision of building intelligent robots using AI and machine learning technology.
4. Text Mining: Text mining, one of the major applications of NLP, is now being implemented with the help of reinforcement learning by the company Salesforce.
Advantages and Disadvantages of Reinforcement Learning
Advantages:
1. It helps in solving complex real-world problems that are difficult to solve with general techniques.
2. The learning model of RL is similar to human learning; hence very accurate results can be obtained.
3. It helps in achieving long-term results.
Disadvantages:
1. RL algorithms are not preferred for simple problems.
2. RL algorithms require huge amounts of data and computation.
3. Too much reinforcement learning can lead to an overload of states, which can weaken the results.