
MCA520A
Introduction to Artificial Intelligence, Machine Learning and Data Science

Unit-1
Introduction to Artificial Intelligence

‭1. Definition and Scope of Artificial Intelligence‬

Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. It involves the study and development of intelligent agents capable of perceiving their environment and taking actions to maximize their chances of success.
‭The scope of AI encompasses various cognitive functions such as understanding‬
‭natural language, reasoning, problem-solving, learning from experience, and adapting‬
‭to new situations.‬

Some daily-life applications of AI include chatbots, Google Assistant, facial recognition in mobile phones, social media applications, spam mail detection, etc.

‭2. Historical Background and Milestones in AI Development‬

Artificial Intelligence (AI) stands at the forefront of technological advancements today, but its roots trace back through a fascinating history marked by significant milestones and breakthroughs. From early conceptualizations to modern applications, AI has evolved into a transformative force shaping various aspects of society.
1950s: British mathematician Alan Turing proposed a test to determine a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. This seminal idea laid foundational principles for AI research.

1960s-1970s: Symbolic AI emerged with systems capable of manipulating symbols and using logical reasoning (e.g., expert systems).

1980s-1990s: Knowledge-based systems gained popularity, focusing on encoding expert knowledge into systems. Machine learning approaches began to gain traction.

2000s-2010s: The rise of big data fueled advancements in machine learning, especially with neural networks and deep learning, achieving breakthroughs in tasks like image and speech recognition.

‭3. Various Branches of AI‬

Symbolic AI: Involves the use of algorithms to manipulate symbols based on predefined rules. Expert systems, which emulate the decision-making ability of a human expert, are a notable application.
An expert system is a computer program designed to solve complex problems and to provide decision-making ability like a human expert. It does this by extracting knowledge from its knowledge base, using reasoning and inference rules according to the user's queries.
The system helps in decision-making for complex problems using both facts and heuristics, like a human expert. It is called an expert system because it contains the expert knowledge of a specific domain and can solve complex problems within that particular domain. These systems are designed for a specific domain, such as medicine, science, etc.

Statistical AI: Focuses on developing algorithms that can learn from and make predictions or decisions based on data. Machine learning techniques such as supervised learning, unsupervised learning, and reinforcement learning fall under this category.
Statistical AI models include linear regression (for trend prediction), logistic regression (binary classification), decision trees (hierarchical decision-making), SVMs (high-dimensional classification), naive Bayes (text classification), KNN (similarity-based learning), and neural networks (complex tasks like image/speech recognition). To use them: understand the problem and data, select a suitable model, prepare and split data into training/testing sets, train the model, evaluate on the test set, and optimize as needed.

Other Branches: Natural language processing (NLP) enables computers to understand and generate human language, while computer vision allows machines to interpret and understand visual information.

‭4. Applications of AI in Different Fields‬

1. Healthcare:
‭Medical Imaging and Diagnostics: AI aids in interpreting medical images like X-rays‬
‭and MRIs, improving accuracy and speed of diagnosis.‬
‭Personalized Medicine: AI analyzes patient data to tailor treatment plans based on‬
‭individual genetic profiles and medical histories.‬
‭Virtual Health Assistants: AI-powered chatbots and virtual agents provide patient‬
‭support, appointment scheduling, and medical advice.‬
‭Predictive Analytics: AI predicts patient outcomes and identifies at-risk individuals,‬
‭aiding in early intervention and preventive care.‬
‭Administrative Efficiency: AI automates tasks such as medical coding, scheduling, and‬
‭billing, improving operational efficiency.‬
‭Drug Discovery and Development: AI accelerates drug discovery processes and‬
‭predicts molecular interactions for new treatments.‬

2. Finance
Algorithmic Trading: AI analyzes large datasets and market trends to execute trades autonomously and optimize investment strategies.
Fraud Detection: AI algorithms identify unusual patterns in transactions to detect and prevent fraudulent activities in real time.
‭Credit Scoring and Risk Assessment: AI evaluates creditworthiness by analyzing‬
‭financial data and behavioral patterns, improving accuracy in risk assessment.‬
‭Customer Service and Chatbots: AI-powered chatbots provide personalized customer‬
‭support, assist with inquiries, and manage financial transactions.‬
‭Robo-Advisors: AI algorithms recommend investment portfolios based on individual risk‬
‭profiles and financial goals, providing automated wealth management solutions.‬
‭Sentiment Analysis: AI analyzes news, social media, and other textual data to gauge‬
‭market sentiment and predict market movements.‬

3. Gaming
‭AI techniques are employed to create realistic game environments, develop intelligent‬
‭non-player characters (NPCs), and enhance player experience through procedural‬
‭content generation and adaptive gameplay.‬

‭5. Ethical Considerations and social impact of AI‬

Ethical Issues: AI systems can exhibit biases learned from training data, leading to unfair treatment or decisions. Moreover, the automation of jobs raises concerns about unemployment and the need for retraining the workforce.
‭Societal Impact: AI-driven automation has the potential to improve productivity and‬
‭create new job opportunities in emerging fields such as AI engineering and data‬
‭science. However, it also requires careful management to ensure that societal benefits‬
‭are equitably distributed and that ethical guidelines protect individual rights and privacy.‬
‭Unit 2‬
‭Fundamentals of Machine Learning‬

‭1. Introduction to Machine Learning (ML)‬

Machine Learning is a subfield of artificial intelligence (AI) that focuses on developing algorithms and techniques that enable computers to learn from and make predictions or decisions based on data. Unlike traditional programming, where rules are explicitly defined, ML algorithms learn patterns and relationships from data to improve their performance over time.

‭Importance of Machine Learning:‬

● ML enables computers to handle complex tasks that are difficult to program explicitly.
● It powers various applications such as recommendation systems, image and speech recognition, medical diagnostics, and autonomous driving.

‭2. Types of Machine Learning‬

Machine Learning can be broadly categorized into three main types based on the nature of the learning process and the availability of labeled data:

‭●‬ ‭Supervised Learning:‬


‭○‬ ‭In supervised learning, the algorithm learns from labeled data, where each‬
‭example is paired with a target label.‬

For example, consider a scenario where you have to build an image classifier to differentiate between cats and dogs. If you feed labeled images of dogs and cats to the algorithm, the machine learns to distinguish a dog from a cat using these labeled images. When we input new dog or cat images that it has never seen before, it uses what it has learned to predict whether the image shows a dog or a cat. This is how supervised learning works; this particular task is image classification.
○ Other examples: predicting house prices based on features like square footage and number of bedrooms, classifying images into categories, etc.

There are two main categories of supervised learning, described below:
1. Classification 2. Regression

‭1. Classification‬

Classification deals with predicting categorical target variables, which represent discrete classes or labels. For instance, classifying emails as spam or not spam, or predicting whether a patient has a high risk of heart disease. Classification algorithms learn to map the input features to one of the predefined classes.

‭Some classification algorithms:‬

Naive Bayes
‭Decision Tree‬
‭Support Vector Machine‬
‭Random Forest‬
‭K-Nearest Neighbors (KNN)‬

‭2. Regression‬

Regression, on the other hand, deals with predicting continuous target variables, which represent numerical values. For example, predicting the price of a house based on its size, location, and amenities, or forecasting the sales of a product. Regression algorithms learn to map the input features to a continuous numerical value.

‭Some regression algorithms:‬

Linear Regression
‭Polynomial Regression‬
‭Ridge Regression‬
‭Lasso Regression‬
‭●‬ ‭Unsupervised Learning:‬
‭○‬ ‭Unsupervised learning involves learning patterns from unlabeled data.‬
‭○‬ ‭Example: Clustering similar documents together based on their content.‬

Example: Consider that you have a dataset containing information about purchases made from a shop. Through clustering, the algorithm can group customers with similar purchasing behavior, revealing customer segments without predefined labels. This type of information can help businesses target customers as well as identify outliers.

‭●‬ ‭Reinforcement Learning:‬


○ Reinforcement learning is a learning method in which an agent interacts with the environment by producing actions and discovering errors.
○ Trial, error, and delayed reward are the most relevant characteristics of reinforcement learning. In this technique, the model keeps improving its performance using reward feedback to learn the desired behavior or pattern. These algorithms are typically specific to a particular problem.
○ Examples are Google's self-driving car and AlphaGo, where a bot competes with humans and even itself to become a better and better Go player. Each time new data is fed in, the system learns and adds it to its knowledge (training data). So the more it learns, the better trained, and hence more experienced, it becomes.
‭3. Basic Concepts in Machine Learning‬

‭●‬ ‭Features and Labels:‬


‭○‬ ‭Features‬‭(or predictors) are individual measurable properties or‬
‭characteristics of the phenomenon being observed.‬
‭○‬ ‭Labels‬‭(or targets) are the outcomes or predictions‬‭that the model aims to‬
‭predict or classify.‬
‭●‬ ‭Training Data:‬
‭○‬ ‭Training data is the dataset used to train the machine learning model. It‬
‭consists of input-output pairs (features-labels) used to teach the model‬
‭patterns and relationships.‬

‭4. Popular Machine Learning Algorithms‬

‭Here are some widely used machine learning algorithms across different types:‬

‭●‬ ‭Linear Regression:‬


‭○‬ ‭Used for predicting a continuous value based on a linear relationship‬
‭between input features and the target variable.‬
‭○‬ ‭Example: Predicting house prices based on square footage.‬

When there is only one independent feature, it is known as Simple Linear Regression, and when there is more than one feature, it is known as Multiple Linear Regression.

Simple Linear Regression:

This is the simplest form of linear regression, and it involves only one independent variable and one dependent variable. The equation for simple linear regression is:
y = β0 + β1*X
Where,
Y is the dependent variable
X is the independent variable
β0 is the intercept
β1 is the slope

‭Multiple Linear Regression‬

This involves more than one independent variable and one dependent variable. The equation for multiple linear regression is:
y = β0 + β1*X1 + β2*X2 + … + βn*Xn
Where
Y is the dependent variable
X1, X2, …, Xn are the independent variables
β0 is the intercept
β1, β2, …, βn are the slopes

Best fit line:
The goal of the algorithm is to find the best-fit line equation that can predict the values based on the independent variables.
In regression, a set of records with X and Y values is available, and these values are used to learn a function; if you then want to predict Y for a new, unseen X, this learned function can be used.
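As a rough illustration, the minimal sketch below fits a simple linear regression to recover the intercept (β0) and slope (β1); it assumes scikit-learn and NumPy are installed, and the square-footage/price numbers are invented for demonstration.

# A minimal sketch: fitting a simple linear regression (assumes scikit-learn is installed)
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: X = square footage, y = house price (hypothetical numbers)
X = np.array([[800], [1000], [1200], [1500], [1800]])
y = np.array([100, 125, 150, 185, 220])

model = LinearRegression()
model.fit(X, y)                      # learns the intercept (β0) and slope (β1)

print("Intercept (β0):", model.intercept_)
print("Slope (β1):", model.coef_[0])
print("Prediction for 1600 sq ft:", model.predict([[1600]])[0])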
‭●‬ ‭Logistic Regression:‬

Used for binary classification problems where the output is a probability value between 0 and 1. Example: predicting whether an email is spam or not.
For example, suppose we have two classes, Class 0 and Class 1. If the value of the logistic function for an input is greater than 0.5 (the threshold value), then it belongs to Class 1; otherwise it belongs to Class 0. It is referred to as regression because it is an extension of linear regression, but it is mainly used for classification problems.

Key points:
=> Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome must be a categorical or discrete value.

=> It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact value as 0 or 1, it gives probabilistic values which lie between 0 and 1.

‭=>In Logistic regression, instead of fitting a regression line, we fit an “S” shaped logistic‬
‭function, which predicts two maximum values (0 or 1).‬
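The short sketch below shows how this looks in code: it assumes scikit-learn is available, the two "email" features are invented, and predict_proba exposes the S-shaped probability while predict applies the 0.5 threshold.

# A minimal sketch: logistic regression for spam classification (assumes scikit-learn; the feature values are invented)
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [number of links, number of spam keywords]; label 1 = spam, 0 = not spam
X = np.array([[8, 5], [1, 0], [6, 7], [0, 1], [9, 9], [2, 0]])
y = np.array([1, 0, 1, 0, 1, 0])

clf = LogisticRegression()
clf.fit(X, y)

print(clf.predict_proba([[7, 4]]))   # probability of each class (between 0 and 1)
print(clf.predict([[7, 4]]))         # class label after applying the 0.5 threshold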
‭●‬ ‭Decision Trees:‬

A versatile algorithm that can perform both classification and regression tasks by recursively splitting the data into subsets based on features.
‭A decision tree is a flowchart-like structure used to make decisions or‬
‭predictions. It consists of nodes representing decisions or tests on‬
‭attributes, branches representing the outcome of these decisions, and leaf‬
‭nodes representing final outcomes or predictions. Each internal node‬
‭corresponds to a test on an attribute, each branch corresponds to the‬
‭result of the test, and each leaf node corresponds to a class label or a‬
‭continuous value.‬

Example: Predicting whether a customer will purchase a product based on demographic data.

‭Structure of a Decision Tree:‬

Root Node: Represents the entire dataset and the initial decision to be made.
‭Internal Nodes: Represent decisions or tests on attributes. Each internal‬
‭node has one or more branches.‬
‭Branches: Represent the outcome of a decision or test, leading to another‬
‭node.‬
‭Leaf Nodes: Represent the final decision or prediction. No further splits‬
‭occur at these nodes.‬

‭How Decision Trees Work?‬

The process of creating a decision tree involves:
Selecting the Best Attribute: Using a metric like Gini impurity, entropy, or information gain, the best attribute to split the data is selected.
‭Splitting the Dataset: The dataset is split into subsets based on the‬
‭selected attribute.‬
‭Repeating the Process: The process is repeated recursively for each‬
‭subset, creating a new internal node or leaf node until a stopping criterion‬
‭is met (e.g., all instances in a node belong to the same class or a‬
‭predefined depth is reached).‬
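As a rough sketch of these steps in practice, the example below trains a small decision tree with scikit-learn (assumed installed); the demographic feature values, the Gini criterion choice, and the depth limit are illustrative assumptions.

# A minimal sketch: training a decision tree classifier (assumes scikit-learn; the demographic data is hypothetical)
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: [age, income in thousands]; label 1 = purchased, 0 = did not purchase
X = [[25, 30], [40, 80], [35, 60], [22, 20], [50, 90], [28, 40]]
y = [0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(criterion="gini", max_depth=3)  # split by Gini impurity, stop at depth 3
tree.fit(X, y)

print(tree.predict([[30, 55]]))   # predicted class for a new customer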

‭●‬ ‭k-Nearest Neighbors (k-NN):‬


‭○‬ ‭A simple algorithm that classifies new data points based on majority vote‬
‭of their neighbors.‬
‭○‬ ‭Example: Classifying a flower species based on its measurements by‬
‭comparing it to the measurements of its nearest neighbors.‬

Intuition Behind KNN Algorithm:

If we plot the training points on a graph, we may be able to locate some clusters or groups. Now, given an unclassified point, we can assign it to a group by observing what group its nearest neighbors belong to. This means a point close to a cluster of points classified as 'Red' has a higher probability of getting classified as 'Red'.
‭Distance Metrics Used in KNN Algorithm‬

As we know, the KNN algorithm helps us identify the nearest points or groups for a query point. But to determine the closest groups or nearest points, we need some distance metric. For this purpose, we use the following distance metrics: Euclidean distance, Manhattan distance, Minkowski distance.

‭Workings of KNN algorithm:‬

Step 1: Selecting the optimal value of K
K represents the number of nearest neighbors that need to be considered while making a prediction.
Step 2: Calculating distance
To measure the similarity between the target and training data points, Euclidean distance is used. The distance is calculated between each data point in the dataset and the target point.
Step 3: Finding Nearest Neighbors
The k data points with the smallest distances to the target point are the nearest neighbors.
Step 4: Voting for Classification or Taking the Average for Regression
In a classification problem, the class label is determined by majority voting. The class with the most occurrences among the neighbors becomes the predicted class for the target data point.
In a regression problem, the prediction is calculated by taking the average of the target values of the K nearest neighbors. The calculated average value becomes the predicted output for the target data point.
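The minimal sketch below mirrors these four steps with scikit-learn (assumed installed); the flower measurements and species names are invented stand-ins for a real dataset.

# A minimal sketch: k-nearest neighbors classification (assumes scikit-learn; the flower measurements are invented)
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical features: [petal length, petal width]; labels are flower species
X = [[1.4, 0.2], [1.3, 0.2], [4.7, 1.4], [4.5, 1.5], [6.0, 2.5], [5.9, 2.1]]
y = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")  # Step 1: choose K; Step 2: the distance metric
knn.fit(X, y)

# Steps 3-4: find the 3 closest training points and take a majority vote
print(knn.predict([[4.8, 1.6]]))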

‭5. Evaluation Metrics for Machine Learning Models‬

‭●‬ ‭Accuracy:‬‭Proportion of correctly predicted instances‬‭among the total instances.‬

● Precision: Precision is a measure of a model's performance that tells you how many of the positive predictions made by the model are actually correct. It is calculated as the number of true positive predictions divided by the number of true positive and false positive predictions.

● Recall (Sensitivity): Proportion of true positive predictions among all actual positive instances.
● F1-score: Harmonic mean of precision and recall, providing a balanced measure between them.
F1 score = 2 * (precision * recall) / (precision + recall)

Note: Lower recall with higher precision can still give good accuracy, but the model then misses a large number of positive instances. The higher the F1 score, the better the performance.
‭Note:‬
‭True Positives: It is the case where we predicted Yes and the real output was also‬
‭Yes.‬
‭True Negatives: It is the case where we predicted No and the real output was also‬
‭No.‬
‭False Positives: It is the case where we predicted Yes but it was actually No.‬
‭False Negatives: It is the case where we predicted No but it was actually Yes.‬
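A small sketch of computing these metrics (assuming scikit-learn is installed; the true/predicted label vectors are invented):

# A minimal sketch: computing accuracy, precision, recall, and F1 (assumes scikit-learn; the labels are invented)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))    # TP / (TP + FP)
print("Recall:", recall_score(y_true, y_pred))           # TP / (TP + FN)
print("F1 score:", f1_score(y_true, y_pred))             # harmonic mean of precision and recall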
‭Unit-3‬
‭Machine Learning Techniques‬

‭1. Data Preprocessing Techniques‬

a. Handling Missing Data
Missing values are data points that are absent for a specific variable in a dataset. They can be represented in various ways, such as blank cells, null values, or special symbols like "NA" or "unknown." These missing data points pose a significant challenge in data analysis and can lead to inaccurate or biased results.

I‭mportance: Missing data is common in real-world datasets and can adversely affect‬
‭model performance if not handled properly.‬

‭Why Is Data Missing From the Dataset?‬

Data can be missing for many reasons, such as technical issues, human errors, privacy concerns, data processing issues, or the nature of the variable itself. Understanding the cause of missing data helps choose appropriate handling strategies and ensure the quality of your analysis.
‭It’s important to understand the reasons behind missing data:‬

‭●‬ I‭dentifying the type of missing data: Is it Missing Completely at Random (MCAR),‬
‭Missing at Random (MAR), or Missing Not at Random (MNAR)?‬
‭●‬ ‭Evaluating the impact of missing data: Is the missingness causing bias or‬
‭affecting the analysis?‬
‭●‬ ‭Choosing appropriate handling strategies: Different techniques are suitable for‬
‭different types of missing data.‬

Useful pandas functions and their descriptions:

.isnull(): Identifies missing values in a Series or DataFrame.
.notnull(): Checks for non-missing values in a pandas Series or DataFrame. It returns a boolean Series or DataFrame, where True indicates non-missing values and False indicates missing values.
.info(): Displays information about the DataFrame, including data types, memory usage, and presence of missing values.
.isna(): Similar to isnull(); returns True for missing values and False for non-missing values.
dropna(): Drops rows or columns containing missing values based on custom criteria.
fillna(): Fills missing values with specific values, means, medians, or other calculated values.
replace(): Replaces specific values with other values, facilitating data correction and standardization.
drop_duplicates(): Removes duplicate rows based on specified columns.
unique(): Finds unique values in a Series or DataFrame.

‭Techniques:‬

Deletion: Remove rows or columns with missing data (simplest, but can lead to loss of valuable information).

Imputation: Replace missing values with a statistical estimate (mean, median, mode) or use predictive methods like K-Nearest Neighbors (KNN) imputation.

Advanced Techniques: Use algorithms like IterativeImputer or MICE (Multivariate Imputation by Chained Equations) for more complex missing data patterns.
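A minimal sketch of inspecting and handling missing values with pandas (the small DataFrame below is made up for illustration):

# A minimal sketch: detecting and handling missing values with pandas
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [24, np.nan, 31, 29], "City": ["Delhi", "Pune", None, "Agra"]})

print(df.isnull())                 # True where a value is missing
print(df.isnull().sum())           # count of missing values per column

dropped = df.dropna()                           # deletion: remove rows with any missing value
imputed = df.fillna({"Age": df["Age"].mean(),   # imputation: fill the numeric column with its mean
                     "City": "Unknown"})        # fill the categorical column with a placeholder
print(dropped)
print(imputed)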

‭b. Feature Scaling‬

Feature Scaling is a technique to standardize the independent features present in the data to a fixed range. It is performed during data pre-processing to handle highly varying magnitudes, values, or units. If feature scaling is not done, a machine learning algorithm tends to treat features with larger values as more important and features with smaller values as less important, regardless of the unit of the values.
‭Why feature Scaling:‬

● Scaling guarantees that all features are on a comparable scale and have comparable ranges. This process is known as feature normalization.

● Algorithm performance improvement: When the features are scaled, several machine learning methods, including gradient descent-based algorithms, distance-based algorithms (such as k-nearest neighbours), and support vector machines, perform better or converge more quickly.

● Preventing numerical instability: Numerical instability can be prevented by avoiding significant scale disparities between features. Examples include distance calculations or matrix operations, where features with radically differing scales can result in numerical overflow or underflow problems.

● Scaling features ensures that each feature is given the same consideration during the learning process. Without scaling, larger-scale features could dominate the learning, producing skewed outcomes.

‭Techniques:‬

Standardization: This method of scaling is based on the central tendency and variance of the data.

First, we calculate the mean and standard deviation of the data we would like to standardize. Then we subtract the mean value from each entry and divide the result by the standard deviation.

This gives us data with a mean equal to zero and a standard deviation equal to 1.

Normalization: Scale features to a range, typically [0, 1] (e.g., using MinMaxScaler). We subtract the minimum value of the feature from each entry and then divide the result by the difference between the maximum and the minimum value.

Robust Scaling: Scale features using statistics robust to outliers, such as the median and interquartile range (e.g., using RobustScaler).
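The sketch below applies the three scalers to a tiny invented feature matrix (assuming scikit-learn and NumPy are installed):

# A minimal sketch: standardization, min-max normalization, and robust scaling
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 1000.0]])

standardized = StandardScaler().fit_transform(X)   # (x - mean) / std, per column
normalized = MinMaxScaler().fit_transform(X)       # (x - min) / (max - min), per column
robust = RobustScaler().fit_transform(X)           # uses median and IQR, less sensitive to the 1000 outlier

print(standardized)
print(normalized)
print(robust)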

‭c. Feature Encoding‬

Better encoding leads to a better model, and most algorithms cannot handle categorical variables unless they are converted into numerical values.

Purpose: Convert categorical variables into numerical representations suitable for modeling algorithms.

Categorical features are generally divided into 3 types:
A. Binary: Either/or
Examples: {Yes, No}, {True, False}
B. Ordinal: Specific ordered groups.
Examples: {low, medium, high}, {cold, hot, lava hot}
C. Nominal: Unordered groups.
Examples: {cat, dog, tiger}, {pizza, burger, coke}

‭Techniques:‬

Label Encoding: Label encoding is a technique used to convert categorical columns into numerical ones so that they can be fitted by machine learning models which only take numerical data. It is an important pre-processing step in a machine-learning project.

Suppose we have a column Height in some dataset with the elements Tall, Medium, and Short. To convert this categorical column into a numerical column, we apply label encoding to it. After applying label encoding, the Height column is converted into a numerical column with elements 0, 1, and 2, where 0 is the label for Tall, 1 is the label for Medium, and 2 is the label for Short.
Height (original)    Height (label encoded)
Tall                 0
Medium               1
Short                2

One-Hot Encoding: Create binary columns for each category (suitable for nominal categorical variables).

In one-hot encoding, a categorical feature such as gender is expanded into separate columns for the Male and Female labels. So wherever there is a Male, the value will be 1 in the Male column and 0 in the Female column, and vice versa.

Let's understand with an example: consider data where fruits, their corresponding categorical values, and prices are given.

Fruit     Categorical value of fruit     Price
apple     1                              5
mango     2                              10
apple     1                              15
orange    3                              20

The output after applying one-hot encoding on the data is given as follows:

apple     mango     orange     Price
1         0         0          5
0         1         0          10
1         0         0          15
0         0         1          20

Target Encoding: Encode categories based on the target variable's mean or other statistics (useful for high-cardinality categorical variables).

For a binary classifier, the simplest way to do that is by calculating the probability p(t = 1 | x = ci), in which t denotes the target, x is the input, and ci is the i-th category. In Bayesian statistics, this is considered the posterior probability of t = 1 given that the input was the category ci.
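The sketch below demonstrates the three encodings on a tiny invented fruit dataset (assuming pandas and scikit-learn are installed; the Bought column is a made-up binary target used only for the target-mean encoding):

# A minimal sketch: label, one-hot, and target (mean) encoding
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Fruit": ["apple", "mango", "apple", "orange"],
                   "Bought": [1, 0, 1, 0]})

# Label encoding: one integer per category
df["Fruit_label"] = LabelEncoder().fit_transform(df["Fruit"])

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["Fruit"], prefix="Fruit")

# Target (mean) encoding: replace each category with the mean of the target for that category
df["Fruit_target_enc"] = df.groupby("Fruit")["Bought"].transform("mean")

print(df)
print(one_hot)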

‭2. Model Selection and Hyperparameter Tuning‬

‭a. Model Selection‬

Process: Evaluate and compare different machine learning models to identify the best performer for the given task.
‭Techniques:‬

Train-Validation-Test Split: Divide data into training, validation, and test sets for model evaluation.
Parameters vs Hyperparameters: Parameters of a model are generated by the model itself during training or learning; examples are the weights of an ML model or neural network. Hyperparameters, by contrast, are manually fixed by us before the training phase; examples are the number of epochs, batch size, number of layers in a neural network, activation function, etc. Hyperparameters are adjustable settings that can be tuned to obtain an optimal model.

I‭n machine learning, training, validation, and test data sets are used for different‬
‭purposes to evaluate the performance of algorithms that learn from data and make‬
‭predictions:‬

‭Training data‬

The largest subset of data, used to train the model by adjusting its parameters. This helps the model learn underlying patterns in the data. The training set should not be too small, or the model won't have enough data to learn.

‭Validation data‬

Used to evaluate the model during the training phase to fine-tune its parameters and select the best-performing model. The validation set helps improve model performance by predicting responses for observations in the data set. If there are multiple models to select from, the validation set can help with model selection. Otherwise, it might be redundant and can be omitted.

‭Test data‬

Used to evaluate the final model's performance on completely unseen data after the model has been trained and validated. The test set helps approximate the model's unbiased accuracy in the real world.

Metrics: Use appropriate metrics (accuracy, precision, recall, F1-score, etc.) for evaluation based on the problem type (classification, regression).
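A minimal sketch of producing the three splits with scikit-learn (assumed installed); the arrays and the 60/20/20 proportions are placeholder choices:

# A minimal sketch: creating train, validation, and test splits
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # 50 samples, 2 features (dummy data)
y = np.random.randint(0, 2, 50)     # dummy binary labels

# First hold out 20% as the test set, then carve a validation set out of the remainder
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # roughly 60% / 20% / 20%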
‭b. Hyperparameter Tuning‬

A Machine Learning model is defined as a mathematical model with several parameters that need to be learned from the data. By training a model with existing data, we can fit the model parameters.
However, there is another kind of parameter, known as hyperparameters, that cannot be directly learned from the regular training process. They are usually fixed before the actual training process begins. These parameters express important properties of the model such as its complexity or how fast it should learn.

‭Purpose: Optimize model performance by adjusting hyperparameters.‬

‭Techniques:‬

Grid Search: Exhaustively search through a manually specified subset of hyperparameters. For example, suppose we want to set two hyperparameters, C and Alpha, of a logistic regression classifier, each with its own set of candidate values. The grid search technique will construct many versions of the model with all possible combinations of hyperparameters and return the best one.
For instance, with C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4], the combination C = 0.3 and Alpha = 0.2 might give the highest performance score (say, 0.726) and would therefore be selected.

Random Search: Randomly sample hyperparameter combinations from a defined space.

Bayesian Optimization: Sequential model-based optimization that uses results from past iterations to guide the search for optimal hyperparameters.
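A rough sketch of grid search and random search with scikit-learn (assumed installed); the dummy data, the candidate C values, and the 5-fold setting are illustrative assumptions:

# A minimal sketch: hyperparameter tuning with grid search and random search
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X = np.random.rand(100, 3)          # dummy features
y = np.random.randint(0, 2, 100)    # dummy binary labels

param_grid = {"C": [0.1, 0.2, 0.3, 0.4, 0.5]}   # candidate values for the C hyperparameter

grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)   # tries every combination with 5-fold CV
grid.fit(X, y)
print("Best C from grid search:", grid.best_params_, "score:", grid.best_score_)

rand = RandomizedSearchCV(LogisticRegression(), param_grid, n_iter=3, cv=5, random_state=0)  # samples combinations
rand.fit(X, y)
print("Best C from random search:", rand.best_params_)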

‭3. Cross-Validation Techniques‬

In machine learning, we cannot simply fit the model on the training data and claim that it will work accurately on real data. We must ensure that our model has learned the correct patterns from the data and is not picking up too much noise. For this purpose, we use the cross-validation technique.

Purpose: Evaluate model performance while maximizing data utilization and minimizing overfitting.

Note: cross-validation is performed on the training dataset.

‭Techniques:‬

K-Fold Cross-Validation: In K-Fold Cross-Validation, we split the dataset into k subsets (known as folds), then train on k-1 of the subsets and leave one subset out for evaluation of the trained model. We iterate k times, with a different subset reserved for testing each time.

As an example of the training and evaluation subsets generated in k-fold cross-validation, suppose we have a total of 25 instances and k = 5. In the first iteration we use the first 20 percent of the data for evaluation and the remaining 80 percent for training (instances [1-5] for testing and [6-25] for training), while in the second iteration we use the second subset of 20 percent for evaluation and the remaining four subsets for training (instances [6-10] for testing and [1-5 and 11-25] for training), and so on.

Stratified Cross-Validation: Maintain the percentage of samples for each class in each fold to ensure representative training and validation sets.

‭●‬ T ‭ he dataset is divided into k folds while maintaining the proportion of classes in‬
‭each fold.‬
‭●‬ ‭During each iteration, one-fold is used for testing, and the remaining folds are‬
‭used for training.‬
‭●‬ ‭The process is repeated k times, with each fold serving as the test set exactly‬
‭once.‬

Leave-One-Out Cross-Validation (LOOCV): Use each sample as a validation set once, particularly useful for small datasets.
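The sketch below runs the three cross-validation schemes with scikit-learn (assumed installed); the 25 dummy instances and the alternating labels are invented so that stratification has balanced classes to work with:

# A minimal sketch: k-fold, stratified k-fold, and leave-one-out cross-validation
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold, LeaveOneOut

X = np.random.rand(25, 2)           # 25 dummy instances
y = np.arange(25) % 2               # alternating 0/1 labels, so classes are balanced

model = LogisticRegression()

kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5))            # plain k-fold
strat_scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5))  # preserves class proportions per fold
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())                  # one sample held out per iteration

print("K-Fold mean accuracy:", kfold_scores.mean())
print("Stratified mean accuracy:", strat_scores.mean())
print("LOOCV mean accuracy:", loo_scores.mean())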

‭4. Ensemble Methods‬


‭a. Bagging‬

Definition: Bootstrap Aggregating involves training multiple models independently and combining their predictions.

I‭mplementation Steps of Bagging‬


‭Step 1: Multiple subsets are created from the original data set with equal tuples,‬
‭selecting observations with replacement.‬
‭Step 2: A base model is created on each of these subsets.‬
‭Step 3: Each model is learned in parallel with each training set and independent of each‬
‭other.‬
‭Step 4: The final predictions are determined by combining the predictions from all the‬
‭models.‬

Example: Random Forest algorithm, which uses bagging to train decision trees on random subsets of the data and aggregates their predictions.

‭b. Boosting‬

Definition: Sequentially train models where each subsequent model corrects errors made by the previous one.

1. Initialise the dataset and assign equal weight to each data point.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weights of the wrongly classified data points and decrease the weights of correctly classified data points, then normalize the weights of all data points.
4. If the required results are obtained, go to step 5; else, go to step 2.
5. End

‭Example: Gradient Boosting Machines (GBM), XGBoost, AdaBoost.‬

‭c. Random Forests‬

Definition: Ensemble learning method that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
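A rough sketch of a bagging ensemble (random forest) and a boosting ensemble (AdaBoost) with scikit-learn (assumed installed); the data and the estimator counts are placeholder choices:

# A minimal sketch: bagging (random forest) and boosting classifiers
import numpy as np
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

X = np.random.rand(100, 4)          # dummy features
y = np.random.randint(0, 2, 100)    # dummy binary labels

# Bagging: many decision trees trained on bootstrap samples, predictions combined by majority vote
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Boosting: models trained sequentially, each focusing on the points the previous ones got wrong
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X, y)

print(forest.predict(X[:3]))
print(boost.predict(X[:3]))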
‭5. Introduction to Deep Learning and Neural Networks‬

Deep learning is the branch of machine learning that is based on artificial neural network architectures. An artificial neural network (ANN) uses layers of interconnected nodes called neurons that work together to process and learn from the input data.

‭Components:‬

Neural Networks: Basic building blocks comprising layers of interconnected nodes (neurons).

‭Example: ANN(Artificial Neural Networks):‬

Artificial neural networks are built on the principles of the structure and operation of human neurons. They are also known as neural networks or neural nets. An artificial neural network's input layer, which is the first layer, receives input from external sources and passes it on to the hidden layer, which is the second layer. Each neuron in the hidden layer gets information from the neurons in the previous layer, computes the weighted total, and then transfers it to the neurons in the next layer. These connections are weighted, which means that the influence of each input from the preceding layer is scaled by giving it a distinct weight. These weights are then adjusted during the training process to enhance the performance of the model.

Activation Functions: Functions applied to node outputs to introduce non-linearity (e.g., ReLU, Sigmoid, Tanh).

Training: Techniques like backpropagation and gradient descent to optimize network weights.
Unit 4
‭Introduction to Data Science‬

‭1. What is Data Science and Why it is Important?‬

Definition:
Data is widely considered a crucial resource in organizations across every industry. Data Science can be described in simple terms as a field of work that deals with the management and processing of data using statistical methods, artificial intelligence, and other tools, in partnership with domain specialists. Data Science draws on concepts and methods from different fields, including mathematics, computer science, and information theory, to interpret large amounts of data.

‭Importance:‬

Business Insights: Helps organizations make informed decisions based on data-driven insights.
‭Scientific Discoveries: Facilitates discovery in various fields by analyzing large datasets.‬
‭Personalization: Enables personalized experiences in products and services.‬
‭Predictive Capabilities: Predicts future trends and behaviors.‬
‭2. Role of Data Scientist and Skills Required‬

Role:
A Data Scientist is responsible for analyzing, interpreting, and deriving actionable insights from complex data sets.

Skills Required:
‭Programming: Proficiency in languages like Python, R, or SQL.‬
‭Statistics and Mathematics: Understanding of statistical methods and mathematical‬
‭concepts.‬
‭Machine Learning: Knowledge of algorithms for predictive modeling and pattern‬
‭recognition.‬
‭Data Wrangling: Cleaning, transforming, and preparing data for analysis.‬
‭Data Visualization: Communicating insights through charts, graphs, and dashboards.‬
‭Domain Knowledge: Understanding of the industry or field in which data is being‬
‭analyzed.‬

‭3. Data Acquisition‬

Sources of Data:
‭Internal Sources: Data generated within an organization (e.g., databases, CRM‬
‭systems).‬
‭External Sources: Data obtained from third-party providers, APIs, social media, etc.‬
‭Public Datasets: Available from government agencies, research institutions, etc.‬

Data Formats:
‭Structured Data: Organized in a predefined format (e.g., databases, spreadsheets).‬
‭Unstructured Data: Not organized in a predefined manner (e.g., text documents,‬
‭images, videos).‬

‭4. Data Cleaning‬

Data cleaning, also known as data cleansing or data preprocessing, is a crucial step in
‭the data science pipeline that involves identifying and correcting or removing errors,‬
‭inconsistencies, and inaccuracies in the data to improve its quality and usability. Data‬
‭cleaning is essential because raw data is often noisy, incomplete, and inconsistent,‬
‭which can negatively impact the accuracy and reliability of the insights derived from it.‬
Process:
‭Handling Missing Values: Imputation techniques or removal.‬
‭Handling Outliers: Identifying and treating outliers appropriately.‬
‭Normalization and Standardization: Scaling numerical data.‬
‭Data Formatting: Ensuring data is in a consistent format.‬

‭5. Exploratory Data Analysis (EDA)‬

Exploratory Data Analysis (EDA) is a crucial initial step in data science projects. It involves analyzing and visualizing data to understand its key characteristics, uncover patterns, locate outliers, and identify relationships between variables. EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or modeling.

‭Why Exploratory Data Analysis is Important?‬

Exploratory Data Analysis (EDA) is important for several reasons, especially in the context of data science and statistical modeling. Here are some of the key reasons why EDA is a critical step in the data analysis process:

● Understanding Data Structures: EDA helps in getting familiar with the dataset, understanding the number of features, the type of data in each feature, and the distribution of data points.

‭●‬ I‭dentifying Patterns and Relationships: Through visualizations and statistical‬


‭summaries, EDA can reveal hidden patterns and intrinsic relationships between‬
‭variables.‬

● Detecting Anomalies and Outliers: EDA is essential for identifying errors or unusual data points that may adversely affect the results of your analysis.

● Testing Assumptions: Many statistical models assume that data follow a certain distribution or that variables are independent. EDA involves checking these assumptions. If the assumptions do not hold, the conclusions drawn from the model could be invalid.
‭●‬ I‭nforming Feature Selection and Engineering: Insights gained from EDA can‬
‭inform which features are most relevant to include in a model and how to‬
‭transform them (scaling, encoding) to improve model performance.‬

● Optimizing Model Design: By understanding the data's characteristics, analysts can choose appropriate modeling techniques, decide on the complexity of the model, and better tune model parameters.

● Facilitating Data Cleaning: EDA helps in spotting missing values and errors in the data, which are critical to address before further analysis to improve data quality and integrity.

‭Key aspects of EDA include:‬

Distribution of Data, Graphical Representations, Outlier Detection, Correlation Analysis, Handling Missing Values, Summary Statistics, Testing Assumptions.
Statistical Analysis

1. Descriptive Statistics
‭Definition:‬
‭Descriptive statistics are used to describe and summarize the features of a dataset.‬
‭They provide simple summaries about the sample and the measures.‬

Measures:
‭Mean: Average of all values in a dataset, sensitive to outliers.‬
‭Median: Middle value of a dataset when arranged in ascending order, less sensitive to‬
‭outliers.‬
‭Mode: Most frequent value in a dataset.‬
‭Range: Difference between the maximum and minimum values.‬
‭Variance: Measure of the spread of data points around the mean.‬
‭Standard Deviation: Square root of the variance, indicating the average deviation from‬
‭the mean.‬

2. Inferential Statistics
‭Definition:‬
‭Inferential statistics use data from a sample to make inferences or generalizations about‬
‭a larger population.‬
Techniques:
‭Hypothesis Testing: Evaluates the likelihood that a result is due to chance.‬
‭Null Hypothesis (H0): Statement of no effect or no difference.‬
‭Alternative Hypothesis (H1): Statement to be tested.‬
‭Significance Level (α): Threshold for rejecting the null hypothesis (typically 0.05).‬
‭Confidence Intervals: Range of values within which the true population parameter is‬
‭estimated to lie.‬
‭Correlation Analysis: Measures the strength and direction of the linear relationship‬
‭between two variables (Pearson correlation coefficient).‬
‭Regression Analysis: Predicts the value of one variable based on the value of another‬
‭(linear regression, logistic regression, etc.).‬
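As a rough illustration of these techniques, the sketch below runs a two-sample t-test and a Pearson correlation (assuming NumPy and SciPy are installed; the sample values are invented):

# A minimal sketch: hypothesis testing and correlation analysis
import numpy as np
from scipy import stats

group_a = np.array([5.1, 4.9, 5.4, 5.0, 5.2])   # e.g., scores under condition A (hypothetical)
group_b = np.array([4.6, 4.8, 4.5, 4.7, 4.9])   # e.g., scores under condition B (hypothetical)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("t =", t_stat, ", p =", p_value)           # reject the null hypothesis at α = 0.05 if p < 0.05

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
r, p = stats.pearsonr(x, y)                      # strength and direction of the linear relationship
print("Pearson r =", r)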

Data Visualization Techniques

Data visualization is crucial for exploring and communicating patterns, trends, and insights from data. Here are some key techniques and their applications:

1. Scatter Plots
‭Definition:‬
‭A scatter plot is a graph that displays values for two variables as points on a Cartesian‬
‭plane. Each point represents the value of one variable corresponding to the value of the‬
‭other.‬

Applications:
‭Relationship Exploration: Visualize relationships and correlations between variables.‬
‭Outlier Detection: Identify outliers and anomalies in data.‬
Trend Identification: Spot trends such as clusters or patterns in data points.
‭Example:‬
‭In a dataset of student scores vs. study hours, a scatter plot can show whether there's a‬
‭correlation between hours studied and exam scores.‬

2. Line Charts
‭Definition:‬
‭A line chart displays data points connected by straight line segments. It is particularly‬
‭useful for showing trends over time or ordered categories.‬
‭Applications:‬

Time Series Analysis: Track changes in data over time (e.g., stock prices, temperature
‭trends).‬
‭Comparison: Compare trends in multiple datasets (e.g., sales performance across‬
‭different regions).‬
‭Example:‬
‭Showing the growth of a company's revenue over the past five years using a line chart.‬

3. Histograms
Definition:
A histogram is a graphical representation of the distribution of numerical data. It consists of bars that show the frequency of data points within defined intervals (bins).

Applications:
‭Distribution Analysis: Understand the shape, center, and spread of data.‬
‭Identifying Skewness: Determine whether data is symmetric or skewed.‬
‭Data Preprocessing: Assess data quality and potential outliers.‬
‭Example:‬
‭Visualizing the distribution of ages in a population to understand the demographic‬
‭profile.‬

4. Bar Charts
Definition:

A bar chart uses rectangular bars to represent categorical data. The length or height of each bar corresponds to the frequency, count, or percentage of the categories.
‭Applications:‬

Comparison: Compare quantities or values across different categories.
Ranking: Rank categories based on their values.
Part-to-Whole Relationships: Show how each category contributes to the total.
Example:

Comparing the sales performance of different product categories in a retail store over a month using a bar chart.
‭5. Pie Charts‬
‭Definition:‬

A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportions. The arc length of each slice is proportional to the quantity it represents.
‭Applications:‬

Proportional Representation: Show the contribution of each category to a whole.
Percentage Breakdown: Display parts of a whole as percentages.
‭Example:‬

Showing the distribution of expenses (e.g., rent, utilities, groceries) in a household budget using a pie chart.

Tools for Data Visualization
Python Libraries: Matplotlib, Seaborn, Plotly, Bokeh.
R Packages: ggplot2, lattice, plotly.
Business Intelligence Tools: Tableau, Power BI, QlikView.

‭6. Introduction to Libraries/Tools‬

1. NumPy
‭Definition:‬
‭NumPy (Numerical Python) is a library for the Python programming language that‬
‭supports large, multi-dimensional arrays and matrices, along with a collection of‬
‭mathematical functions to operate on these arrays.‬

‭Key Features:‬
‭●‬ ‭N-Dimensional Arrays: Core data structure is ndarray.‬
‭●‬ ‭Mathematical Functions: Functions for linear algebra, statistics, and‬
‭mathematical operations.‬
‭●‬ ‭Broadcasting: Support for arithmetic operations on arrays of different shapes.‬

I‭nstallation:‬
‭pip install numpy‬

‭Basic Usage:‬

1. Importing NumPy:
‭import numpy as np‬
‭2. Creating Arrays:‬
‭# Creating a 1D array‬
arr1 = np.array([1, 2, 3, 4, 5])
‭print(arr1)‬
‭# Creating a 2D array‬
‭arr2 = np.array([[1, 2, 3], [4, 5, 6]])‬
‭print(arr2)‬
‭3. Array Operations:‬
‭# Array addition‬
‭arr_sum = arr1 + 10‬
‭print(arr_sum)‬
‭# Matrix multiplication‬
‭arr_mult = np.dot(arr2, arr2.T)‬
‭print(arr_mult)‬
‭4. Array Statistics:‬
‭# Mean, Median, Standard Deviation‬
‭mean_val = np.mean(arr1)‬
‭median_val = np.median(arr1)‬
‭std_dev = np.std(arr1)‬

‭print(f"Mean: {mean_val}, Median: {median_val}, Std Dev: {std_dev}")‬

2. Pandas
‭Definition:‬
‭Pandas is a data manipulation and analysis library for Python. It provides data‬
‭structures and functions needed to work on structured data seamlessly.‬

‭Key Features:‬
‭●‬ ‭DataFrames: 2D labeled data structure with columns of potentially different types.‬
‭●‬ ‭Series: 1D labeled array capable of holding any data type.‬

I‭nstallation:‬
‭pip install pandas‬

‭Basic Usage:‬

1. Importing Pandas:
‭import pandas as pd‬
‭2. Creating DataFrames:‬
‭# Creating a DataFrame‬
‭data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22]}‬
df = pd.DataFrame(data)
‭print(df)‬
‭3. DataFrame Operations:‬
‭# Adding a new column‬
‭df['Gender'] = ['F', 'M', 'M']‬
‭print(df)‬

# Descriptive Statistics
‭print(df.describe())‬
‭# Data Selection‬
‭print(df['Name'])‬ ‭# Selecting a column‬
‭print(df.iloc[0])‬ ‭# Selecting a row by index‬
‭4. Data Cleaning:‬
‭# Handling Missing Values‬
‭df_missing = df.copy()‬
‭df_missing.loc[1, 'Age'] = None‬ ‭# Introduce a missing‬‭value‬
‭df_cleaned = df_missing.fillna(df_missing['Age'].mean()) # Fill missing values with‬
‭mean‬
‭print(df_cleaned)‬

3. Matplotlib
‭Definition:‬
‭Matplotlib is a plotting library for Python and is widely used for creating static, animated,‬
‭and interactive visualizations.‬

Key Features:
‭Flexibility: Wide range of plot types.‬
‭Customization: Extensive customization options for plots.‬

I‭nstallation:‬
‭pip install matplotlib‬

Basic Usage:
‭1. Importing Matplotlib:‬
‭import matplotlib.pyplot as plt‬
‭2. Creating Plots:‬
‭Line Plot:‬
‭# Line Plot‬
‭x = [1, 2, 3, 4, 5]‬
y‭ = [1, 4, 9, 16, 25]‬
‭plt.plot(x, y)‬
‭plt.title('Line Plot')‬
‭plt.xlabel('x-axis')‬
‭plt.ylabel('y-axis')‬
‭plt.show()‬
‭Bar Plot:‬
‭# Bar Plot‬
‭categories = ['A', 'B', 'C']‬
‭values = [10, 15, 7]‬
‭plt.bar(categories, values)‬
‭plt.title('Bar Plot')‬
‭plt.xlabel('Categories')‬
‭plt.ylabel('Values')‬
‭plt.show()‬

Histogram:
‭# Histogram‬
‭data = np.random.randn(1000) # Generate 1000 random data points‬
‭plt.hist(data, bins=30)‬
‭plt.title('Histogram')‬
‭plt.xlabel('Value')‬
‭plt.ylabel('Frequency')‬
‭plt.show()‬

4. Seaborn
‭Definition:‬
‭Seaborn is a Python visualization library based on Matplotlib that provides a high-level‬
‭interface for drawing attractive and informative statistical graphics.‬

Key Features:
‭High-Level Interface: Easier syntax for complex plots.‬
‭Integrated Data Analysis: Built-in functions for statistical plotting.‬
‭Installation:‬
‭pip install seaborn‬

Basic Usage:
1. Importing Seaborn:
import seaborn as sns
‭2. Creating Plots:‬

Scatter Plot:
‭# Scatter Plot‬
‭tips = sns.load_dataset('tips')‬
‭sns.scatterplot(x='total_bill', y='tip', data=tips)‬
‭plt.title('Scatter Plot of Total Bill vs Tip')‬
‭plt.show()‬
‭Box Plot:‬
‭# Box Plot‬
‭sns.boxplot(x='day', y='total_bill', data=tips)‬
‭plt.title('Box Plot of Total Bill by Day')‬
‭plt.show()‬
‭Heatmap:‬
‭# Heatmap‬
corr = tips.corr(numeric_only=True)  # correlations over the numeric columns only
‭sns.heatmap(corr, annot=True, cmap='coolwarm')‬
‭plt.title('Heatmap of Correlations')‬
‭plt.show()‬
‭Unit 5‬
‭Advanced Topics and Applications‬

1. Support Vector Machines (SVM)
‭Definition:‬
‭Support Vector Machines (SVM) are supervised learning algorithms used for‬
‭classification and regression tasks. SVM aims to find the hyperplane that best separates‬
‭different classes in the feature space.‬

Key Concepts:
‭Hyperplane: A decision boundary that separates different classes.‬
‭Support Vectors: Data points that are closest to the hyperplane and influence its‬
‭position.‬
‭Margin: The distance between the hyperplane and the support vectors.‬
‭Types of SVM:‬

‭Linear SVM: Finds a linear hyperplane to separate classes.‬


Non-Linear SVM: Uses kernels to transform the feature space into higher dimensions to find a non-linear decision boundary.

How does SVM work?

One reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two classes.

So we choose the hyperplane whose distance to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane (hard margin). Now consider a scenario in which a point from one class (say, a blue ball) lies inside the region of the other class (the red balls). How does SVM classify the data then? The blue ball inside the boundary of the red ones is simply an outlier of the blue class. The SVM algorithm can ignore such outliers and still find the hyperplane that maximizes the margin, so SVM is robust to outliers.
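A minimal sketch of a linear and a kernel SVM with scikit-learn (assumed installed); the 2D points, the labels, and the C value are invented for demonstration:

# A minimal sketch: linear and kernel SVM classifiers
from sklearn.svm import SVC

# Two hypothetical classes in a 2D feature space
X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

linear_svm = SVC(kernel="linear", C=1.0)   # finds the maximum-margin separating hyperplane
linear_svm.fit(X, y)
print("Support vectors:", linear_svm.support_vectors_)

rbf_svm = SVC(kernel="rbf")                # kernel trick for non-linear decision boundaries
rbf_svm.fit(X, y)
print(rbf_svm.predict([[3, 2], [7, 6]]))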
2. Neural Networks
‭Definition:‬
‭Neural Networks are computational models inspired by the human brain's network of‬
‭neurons. They consist of layers of interconnected nodes (neurons) that transform input‬
‭data into output predictions.‬

Key Concepts:
‭Neurons: Basic units that receive inputs, apply weights, and pass the result through an‬
‭activation function.‬
‭Activation Functions: Functions that introduce non-linearity (e.g., Sigmoid, ReLU, Tanh).‬

Common Architectures:
‭Feedforward Neural Networks (FNN): Data moves in one direction, from input to output.‬
‭Multi-Layer Perceptrons (MLP): FNNs with one or more hidden layers.‬

‭Let’s understand with an example of how a neural network works:‬

‭ onsider a neural network for email classification. The input layer takes features like‬
C
‭email content, sender information, and subject. These inputs, multiplied by adjusted‬
‭weights, pass through hidden layers. The network, through training, learns to recognize‬
‭patterns indicating whether an email is spam or not. The output layer, with a binary‬
‭activation function, predicts whether the email is spam (1) or not (0). As the network‬
‭iteratively refines its weights through backpropagation, it becomes adept at‬
‭distinguishing between spam and legitimate emails, showcasing the practicality of‬
‭neural networks in real-world applications like email filtering.‬
Neural networks are complex systems that mimic some features of the functioning of the human brain. A network is composed of an input layer, one or more hidden layers, and an output layer, each made up of coupled artificial neurons. The two stages of the basic process are called forward propagation and backpropagation.

‭Forward Propagation‬
‭●‬ ‭Input Layer: Each feature in the input layer is represented by a node on the‬
‭network, which receives input data.‬
‭●‬ ‭Weights and Connections: The weight of each neuronal connection indicates‬
‭how strong the connection is. Throughout training, these weights are changed.‬
‭●‬ ‭Hidden Layers: Each hidden layer neuron processes inputs by multiplying them‬
‭by weights, adding them up, and then passing them through an activation‬
‭function. By doing this, non-linearity is introduced, enabling the network to‬
‭recognize intricate patterns.‬
‭●‬ ‭Output: The final result is produced by repeating the process until the output‬
‭layer is reached.‬

‭Backpropagation:‬
‭●‬ ‭Loss Calculation: The network’s output is evaluated against the real goal values,‬
‭and a loss function is used to compute the difference. For a regression problem,‬
‭the Mean Squared Error (MSE) is commonly used as the cost function.‬
‭●‬ G ‭ radient Descent: Gradient descent is then used by the network to reduce the‬
‭loss. To lower the inaccuracy, weights are changed based on the derivative of the‬
‭loss with respect to each weight.‬
‭●‬ ‭Adjusting weights: The weights are adjusted at each connection by applying this‬
‭iterative process, or backpropagation, backward across the network.‬
‭●‬ ‭Training: During training with different data samples, the entire process of‬
‭forward propagation, loss calculation, and backpropagation is done iteratively,‬
‭enabling the network to adapt and learn patterns from the data.‬
● Activation Functions: Non-linearity is introduced into the model by activation functions like the rectified linear unit (ReLU) or sigmoid. Their decision on whether to "fire" a neuron is based on the whole weighted input (a small numerical sketch of forward propagation and a weight update follows this list).
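The sketch below makes the forward-propagation and gradient-descent steps concrete for a single neuron trained on mean squared error; it uses only NumPy, and the toy data, learning rate, and layer size are all illustrative assumptions rather than anything from the notes above.

# A minimal sketch: one neuron, forward propagation and a gradient-descent weight update (NumPy only; toy data)
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.5, 1.0], [1.5, 2.0], [3.0, 0.5]])   # 3 samples, 2 features (invented)
y = np.array([0.0, 1.0, 1.0])                         # target values

rng = np.random.default_rng(0)
w = rng.normal(size=2)      # weights
b = 0.0                     # bias
lr = 0.5                    # learning rate

for epoch in range(100):
    y_hat = sigmoid(X @ w + b)                        # forward propagation: weighted sum, then activation
    loss = np.mean((y_hat - y) ** 2)                  # loss: mean squared error
    grad_out = 2 * (y_hat - y) * y_hat * (1 - y_hat) / len(y)   # backpropagation: chain rule through the sigmoid
    w -= lr * (X.T @ grad_out)                        # gradient descent: adjust weights against the gradient
    b -= lr * grad_out.sum()

print("final loss:", loss)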

‭Types of Neural Networks:‬

Some commonly used types of neural networks include:

Feedforward Networks: A feedforward neural network is a simple artificial neural network architecture in which data moves from input to output in a single direction. It has input, hidden, and output layers; feedback loops are absent. Its straightforward architecture makes it appropriate for a number of applications, such as regression and pattern recognition.

Multilayer Perceptron (MLP): MLP is a type of feedforward neural network with three or more layers, including an input layer, one or more hidden layers, and an output layer. It uses nonlinear activation functions.

Convolutional Neural Network (CNN): A Convolutional Neural Network (CNN) is a specialized artificial neural network designed for image processing. It employs convolutional layers to automatically learn hierarchical features from input images, enabling effective image recognition and classification. CNNs have revolutionized computer vision and are pivotal in tasks like object detection and image analysis.

Recurrent Neural Network (RNN): A Recurrent Neural Network (RNN) is an artificial neural network type intended for sequential data processing. It is appropriate for applications where contextual dependencies are critical, such as time series prediction and natural language processing, since it makes use of feedback loops, which enable information to persist within the network.
Long Short-Term Memory (LSTM): LSTM is a type of RNN that is designed to overcome the vanishing gradient problem in training RNNs. It uses memory cells and gates to selectively read, write, and erase information.

‭3. Convolutional Neural Networks (CNNs)‬

Definition:
Convolutional Neural Networks (CNNs) are specialized neural networks designed for processing structured grid data, like images.

Key Concepts:
Convolutions: Operations that apply a filter to an image to create feature maps.
Pooling Layers: Reduce the spatial dimensions of feature maps (e.g., Max Pooling, Average Pooling).
Fully Connected Layers: Layers where each neuron is connected to every neuron in the previous layer.

1. Convolutional Layer
Function:
The convolutional layer is the core building block of a CNN. It applies a set of filters (kernels) to the input image to produce feature maps. Each filter detects specific features like edges, textures, or patterns.

‭How It Works:‬

Filters: Small matrices (e.g., 3x3 or 5x5) that slide over the input image. Each filter detects different features.
‭Convolution Operation: The filter multiplies its values with the pixel values of the image‬
‭and sums the results to produce a single output value. This operation is performed‬
‭across the entire image.‬
‭Mathematical Operation:‬

‭Feature Map=Image∗Filter‬
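To make the convolution operation concrete, here is a minimal NumPy sketch; the function name, the toy image, and the edge-detecting filter are illustrative, and real CNN libraries implement this far more efficiently.

import numpy as np

def conv2d(image, kernel):
    # Valid 2D cross-correlation, as used in CNN convolutional layers
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    feature_map = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            region = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(region * kernel)   # multiply and sum
    return feature_map

# Example: a vertical-edge filter applied to a toy 5x5 image
image = np.array([[0, 0, 10, 10, 10]] * 5, dtype=float)
edge_filter = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]], dtype=float)
print(conv2d(image, edge_filter))   # large-magnitude responses where the vertical edge lies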

2. Activation Function (ReLU)
Function:

The ReLU (Rectified Linear Unit) activation function introduces non-linearity into the model. It replaces all negative pixel values with zero.

Mathematical Operation:

ReLU(x) = max(0, x)

3. Pooling Layer
Function:

Pooling layers reduce the spatial dimensions of feature maps, decreasing the number of parameters and computation required, and helping to avoid overfitting.

Types of Pooling:

Max Pooling: Takes the maximum value from a feature map segment.
Average Pooling: Computes the average value from a feature map segment.

Mathematical Operation (Max Pooling):
Output = Max(Region)
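A minimal NumPy sketch of 2x2 max pooling follows; it assumes the feature-map dimensions divide evenly by the stride, and the function name and values are illustrative.

import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    # 2x2 max pooling with stride 2
    h, w = feature_map.shape
    oh, ow = h // stride, w // stride
    pooled = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            region = feature_map[i * stride:i * stride + size, j * stride:j * stride + size]
            pooled[i, j] = region.max()   # keep the maximum value of each segment
    return pooled

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [3, 4, 1, 8]], dtype=float)
print(max_pool2d(fm))   # [[6. 4.], [7. 9.]]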

4. Flattening Layer
Function:

The flattening layer converts the 2D matrix into a 1D vector. This step is necessary to feed the output into fully connected layers.

Mathematical Operation:

The 2D matrix is transformed into a single long vector.

5. Fully Connected Layer
Function:

The fully connected layer (Dense layer) performs the final classification or regression tasks. Every neuron in this layer is connected to every neuron in the previous layer.

Mathematical Operation:

y = Wx + b

where:
x is the input vector,
W is the weight matrix,
b is the bias term,
y is the output vector.
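Putting the five layer types together, the following is a minimal Keras sketch of a small image classifier, assuming TensorFlow is installed; the input shape, filter counts, and number of classes are illustrative.

# Conv2D + ReLU -> MaxPooling -> Flatten -> Dense -> output, as described above.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),  # convolution + ReLU
    layers.MaxPooling2D((2, 2)),                                            # pooling layer
    layers.Flatten(),                                                       # flattening layer
    layers.Dense(64, activation="relu"),                                    # fully connected layer
    layers.Dense(10, activation="softmax"),                                 # output layer (10 classes)
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()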
4. Recurrent Neural Networks (RNNs)
‭Definition:‬
‭Recurrent Neural Networks (RNNs) are a type of artificial neural network designed to‬
‭handle sequential data. Unlike traditional feedforward neural networks, RNNs have‬
‭connections that form directed cycles, allowing information to persist across time steps.‬

Applications:
RNNs are used in various applications such as:

Time Series Prediction: Forecasting stock prices, weather, etc.
Natural Language Processing: Language modeling, sentiment analysis.
Speech Recognition: Transcribing spoken language to text.
Machine Translation: Translating text from one language to another.

2. How RNNs Work
‭RNNs process sequences of data by maintaining a state that carries information about‬
‭previous time steps. Here’s a step-by-step explanation of the process:‬

a. Recurrent Structure
RNNs have a structure that includes a feedback loop, allowing the network to use information from previous time steps.
Description:

Input Vector x_t: The data at time step t.
Hidden State h_t: The output of the hidden layer that carries information from previous time steps.
Recurrent Connection: The hidden state h_t is used as input for the next time step.

Mathematical Representation: For a given time step t, the RNN performs the following operations:

Update Hidden State:
h_t = tanh(W_h · h_{t-1} + W_x · x_t + b_h)
W_h: Recurrent weight matrix
W_x: Input weight matrix
b_h: Bias term

Generate Output:
y_t = W_y · h_t + b_y
W_y: Output weight matrix
b_y: Bias term
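To make the recurrence concrete, here is a minimal NumPy sketch that applies the hidden-state update and output equations over a short sequence; the dimensions and randomly initialized weights are illustrative.

import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 2

W_x = rng.normal(size=(hidden_size, input_size))   # input weight matrix
W_h = rng.normal(size=(hidden_size, hidden_size))  # recurrent weight matrix
W_y = rng.normal(size=(output_size, hidden_size))  # output weight matrix
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

sequence = rng.normal(size=(5, input_size))  # 5 time steps of 3 features each
h_t = np.zeros(hidden_size)                  # initial hidden state

for x_t in sequence:
    h_t = np.tanh(W_h @ h_t + W_x @ x_t + b_h)  # update hidden state
    y_t = W_y @ h_t + b_y                       # generate output for this time step
    print(y_t)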

b. Backpropagation Through Time (BPTT)
‭To train RNNs, we use Backpropagation Through Time (BPTT), which is an extension of‬
‭the backpropagation algorithm for sequence data.‬
‭Explanation:‬

Unroll the RNN: Expand the RNN into a chain of layers corresponding to each time step.
‭Compute Gradients: Calculate the gradients for each layer over the entire sequence.‬
‭Update Weights: Adjust the weights using the computed gradients.‬

5. Natural Language Processing

What is NLP?
‭Definition:‬
‭Natural Language Processing (NLP) is a field of artificial intelligence (AI) that focuses‬
‭on enabling computers to understand, interpret, and generate human language. It‬
‭combines computational linguistics, machine learning, and computer science to facilitate‬
‭interactions between humans and machines through natural language.‬

Purpose:
‭NLP aims to bridge the gap between human communication and computer‬
‭understanding, making it possible for machines to process and analyze large amounts‬
‭of natural language data.‬


Key Techniques in NLP

Text Classification
What: Categorizes text into predefined categories.
Examples: Spam detection, sentiment analysis.
Technique: TF-IDF, Naive Bayes, SVM (a minimal sketch appears after this list of techniques).

Sentiment Analysis
What: Determines the sentiment expressed in text.
Examples: Analyzing product reviews, social media sentiment.
Technique: Rule-based systems, machine learning, deep learning.

Machine Translation
What: Translates text from one language to another.
Examples: Google Translate, language learning apps.
Technique: Statistical Machine Translation, Neural Machine Translation.

Named Entity Recognition (NER)
What: Identifies and classifies entities in text.
Examples: Extracting names, dates, and locations from documents.
Technique: Rule-based methods, machine learning, deep learning.

Question Answering
What: Provides answers to questions posed in natural language.
Examples: Virtual assistants, customer support.
Technique: Retrieval-based systems, generative models.

Speech Recognition
What: Converts spoken language into text.
Examples: Voice assistants, transcription services.
Technique: Acoustic models, language models, deep learning.

Chatbots and Virtual Assistants
What: Simulate human conversation for user interactions.
Examples: Customer service bots, personal assistants.
Technique: Rule-based systems, AI-driven conversation models.

Text Summarization
What: Creates a concise summary of a longer text.
Examples: Summarizing news articles, executive summaries.
Technique: Extractive summarization, abstractive summarization.

Information Retrieval
What: Searches for relevant information from large datasets.
Examples: Search engines, document retrieval.
Technique: Vector space models, ranking algorithms.

Text Generation
What: Generates coherent and contextually relevant text.
Examples: Content creation, creative writing.
Technique: Language models, generative models.
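As referenced under Text Classification, here is a minimal scikit-learn sketch that combines TF-IDF features with a Naive Bayes classifier for spam detection; the tiny training set is purely illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative labeled examples (far too few for real use)
train_texts = ["win a free prize now", "limited offer, claim your reward",
               "meeting rescheduled to monday", "please review the attached report"]
train_labels = ["spam", "spam", "ham", "ham"]

# TF-IDF turns text into weighted term vectors; Naive Bayes classifies them
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["claim your free reward", "see you at the meeting"]))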

Real-World Applications
‭Customer Support‬
‭Example: Chatbots answering customer queries.‬
‭Benefit: 24/7 support and cost efficiency.‬
‭Language Translation‬
‭Example: Google Translate for multilingual communication.‬
‭Benefit: Breaking language barriers globally.‬
‭Sentiment Analysis‬
‭Example: Analyzing Twitter posts for brand sentiment.‬
‭Benefit: Understanding customer opinions and market trends.‬
‭Healthcare‬
‭Example: Extracting information from medical records.‬
‭Benefit: Improving patient care and management.‬
‭Education‬
‭Example: Language learning apps and automated tutoring.‬
‭Benefit: Enhancing learning experiences.‬
‭Future Trends‬
‭Advanced Models: Development of more sophisticated models like GPT-4.‬
‭Multimodal Approaches: Combining text with other data types (images, videos).‬
‭Ethical NLP: Addressing biases and ensuring fairness in AI applications.‬

Conclusion
‭NLP is a dynamic and rapidly evolving field that leverages techniques from AI and‬
‭machine learning to process and analyze human language. Its applications are diverse‬
‭and impactful, from enhancing customer service to enabling real-time translation and‬
‭generating human-like text. As technology advances, NLP will continue to transform‬
‭how we interact with machines and process language data.‬
‭Introduction to Big Data Technologies: Hadoop, Spark, and More‬

1. Overview of Big Data

‭Definition:‬
‭Big Data refers to large volumes of data that are too complex to be processed using‬
‭traditional data management tools. It encompasses vast datasets that can be analyzed‬
‭for insights and decision-making, characterized by the 3Vs:‬

Volume: The sheer amount of data generated.
‭Velocity: The speed at which data is generated and processed.‬
‭Variety: The different types and sources of data.‬

Purpose:
‭The aim of Big Data technologies is to efficiently store, manage, and analyze massive‬
‭datasets to extract valuable insights, drive decisions, and create innovative solutions.‬

2. Key Big Data Technologies

‭a. Apache Hadoop‬
‭What is Hadoop?‬
‭Apache Hadoop is an open-source framework for storing and processing large datasets‬
‭across a distributed cluster of computers. It is designed to scale from a single server to‬
‭thousands of machines.‬

‭Components of Hadoop:‬

‭Hadoop Distributed File System (HDFS):‬

Function: A distributed file system designed to run on commodity hardware.
‭Feature: Stores data across multiple machines and ensures high availability and fault‬
‭tolerance.‬
‭Architecture: Data is split into blocks and replicated across different nodes in the cluster.‬

MapReduce:
Function: A programming model for processing large data sets with a distributed algorithm.
Components:
Mapper: Processes input data and generates key-value pairs.
Reducer: Aggregates and processes the results of the Mapper.
(A word-count sketch in the Hadoop Streaming style appears at the end of this Hadoop subsection.)

‭YARN (Yet Another Resource Negotiator):‬

Function: Manages resources and job scheduling across the Hadoop cluster.
‭Components:‬
‭ResourceManager: Manages resources across the cluster.‬
‭NodeManager: Manages resources and tasks on individual nodes.‬

‭Use Cases:‬

Data Storage: Store and manage large datasets from various sources.
‭Data Processing: Analyze large-scale data for patterns and insights.‬
‭Data Integration: Combine data from different sources for a unified analysis.‬
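To illustrate the Mapper and Reducer roles described above, here is a minimal word-count job in the Hadoop Streaming style, where plain Python scripts read from standard input and write tab-separated key-value pairs to standard output; the file names are illustrative. In a typical setup these scripts are passed to the Hadoop Streaming jar as the -mapper and -reducer options; locally, `cat input.txt | python mapper.py | sort | python reducer.py` simulates the same flow.

# mapper.py -- emits (word, 1) pairs for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- sums the counts for each word (input arrives sorted by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")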

b. Apache Spark
‭What is Spark?‬
‭Apache Spark is an open-source unified analytics engine for large-scale data‬
‭processing. It provides fast, in-memory data processing capabilities and supports‬
‭various workloads like batch processing, streaming, and machine learning.‬

‭Components of Spark:‬

‭Spark Core:‬

Function: The foundation of Spark, providing essential functionalities like task scheduling and fault tolerance.
‭Features:‬
‭Resilient Distributed Datasets (RDDs): Immutable collections of objects that can be‬
‭processed in parallel.‬
‭DataFrames: A higher-level abstraction for working with structured data.‬
‭Spark SQL:‬

Function: Provides a programming interface for working with structured and semi-structured data; a minimal PySpark sketch appears after the use cases below.
‭Features: Supports querying data through SQL as well as DataFrame and Dataset‬
‭APIs.‬

‭Spark Streaming:‬

Function: Enables processing of real-time data streams.
‭Features: Supports processing data from sources like Kafka and Flume.‬

Use Cases:
‭Data Processing: High-performance processing of large datasets.‬
‭Real-Time Analytics: Analyzing data as it is generated.‬
‭Machine Learning: Building and deploying machine learning models.‬
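The following is a minimal PySpark sketch, assuming PySpark is installed, that touches the pieces described above: an RDD-style word count through the SparkContext, and a DataFrame registered as a temporary view and queried with Spark SQL. The data is illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intro-example").getOrCreate()

# RDD-style word count
lines = spark.sparkContext.parallelize(["big data tools", "big data at scale"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())

# DataFrame + Spark SQL query
df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()

spark.stop()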

3. Case Studies and Real-World Applications

‭a. Case Study: Netflix‬
‭Overview:‬
‭Netflix uses Big Data technologies to recommend movies and TV shows to users. It‬
‭leverages Apache Spark for data processing and Hadoop for data storage.‬

‭Approach:‬

Data Collection: Collects user activity data and viewing preferences.
‭Data Analysis: Analyzes data to provide personalized recommendations.‬
‭Outcome: Improved user engagement and satisfaction.‬

‭Benefits:‬

‭Personalized Recommendations: Suggests content based on user preferences.‬


Enhanced User Experience: Tailors content to individual tastes.
‭b. Case Study: LinkedIn‬
‭Overview:‬
‭LinkedIn uses Hadoop for managing user data and Spark for real-time analytics to‬
‭provide job recommendations and improve the user experience.‬

‭Approach:‬

Data Collection: Collects data on job applications, user profiles, and interactions.
‭Data Analysis: Analyzes data to improve job matching algorithms.‬
‭Outcome: More relevant job recommendations and improved user engagement.‬

‭Benefits:‬

I‭mproved Job Matching: Provides better job recommendations.‬


‭Real-Time Insights: Analyzes user interactions and trends.‬
‭c. Case Study: Amazon‬
‭Overview:‬
‭Amazon uses Hadoop and Spark to handle its vast amounts of transactional data and to‬
‭optimize its supply chain.‬

‭Approach:‬

Data Collection: Collects data on purchases, reviews, and inventory.
‭Data Analysis: Analyzes data to forecast demand and optimize inventory.‬
‭Outcome: More efficient supply chain management and personalized shopping‬
‭experiences.‬

‭Benefits:‬

Optimized Inventory: Better demand forecasting and inventory management.
‭Personalized Shopping Experience: Recommendations based on user behavior.‬
‭4. Future Trends and Career Prospects‬
‭a. Future Trends‬
‭Increased Adoption of Cloud Solutions:‬
Trend: More organizations are moving to cloud-based Big Data solutions like AWS, Azure, and Google Cloud.
‭Example: Amazon EMR for Hadoop and Spark.‬
‭Enhanced Real-Time Analytics:‬

Trend: Growing use of real-time data processing technologies.
‭Example: Apache Flink for advanced stream processing.‬
‭Integration with Machine Learning and AI:‬

Trend: Combining Big Data technologies with machine learning and AI for advanced analytics.
‭Example: Databricks platform for integrated analytics.‬

b. Career Prospects
‭Job Roles:‬

Big Data Engineer: Designs and builds Big Data systems.
‭Data Scientist: Analyzes data to extract insights.‬
‭Data Analyst: Interprets data and generates reports.‬
‭Skills Needed:‬

Programming Languages: Python, Java, Scala.
‭Tools and Frameworks: Hadoop, Spark, Kafka.‬
‭Mathematics and Statistics: Understanding of statistical models and data analysis‬
‭techniques.‬

Conclusion
‭Big Data technologies like Hadoop and Spark are crucial for managing and analyzing‬
‭massive datasets. They offer tools for scalable data storage, efficient processing, and‬
‭advanced analytics. Real-world applications span various domains, from entertainment‬
‭to e-commerce, demonstrating the impact of Big Data on modern businesses.‬
