Notes - Introduction to AI, ML, DS
Introduction to Artificial Intelligence, Machine Learning and Data Science
Unit-1
Introduction to Artificial Intelligence
Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that
are programmed to think and learn like humans. It involves the study and development
of intelligent agents capable of perceiving their environment and taking actions to
maximize their chances of success.
The scope of AI encompasses various cognitive functions such as understanding
natural language, reasoning, problem-solving, learning from experience, and adapting
to new situations.
Some daily-life applications of AI include chatbots, Google Assistant, facial recognition in mobile phones, social media applications, spam mail detection, etc.
2000s-2010s: The rise of big data fueled advancements in machine learning, especially
with neural networks and deep learning, achieving breakthroughs in tasks like image
and speech recognition.
1. Healthcare:
Medical Imaging and Diagnostics: AI aids in interpreting medical images like X-rays
and MRIs, improving accuracy and speed of diagnosis.
Personalized Medicine: AI analyzes patient data to tailor treatment plans based on
individual genetic profiles and medical histories.
Virtual Health Assistants: AI-powered chatbots and virtual agents provide patient
support, appointment scheduling, and medical advice.
Predictive Analytics: AI predicts patient outcomes and identifies at-risk individuals,
aiding in early intervention and preventive care.
Administrative Efficiency: AI automates tasks such as medical coding, scheduling, and
billing, improving operational efficiency.
Drug Discovery and Development: AI accelerates drug discovery processes and
predicts molecular interactions for new treatments.
2. Finance
Algorithmic Trading: AI analyzes large datasets and market trends to execute trades
autonomously and optimize investment strategies.
Fraud Detection: AI algorithms identify unusual patterns in transactions to detect and
prevent fraudulent activities in real-time.
Credit Scoring and Risk Assessment: AI evaluates creditworthiness by analyzing
financial data and behavioral patterns, improving accuracy in risk assessment.
Customer Service and Chatbots: AI-powered chatbots provide personalized customer
support, assist with inquiries, and manage financial transactions.
Robo-Advisors: AI algorithms recommend investment portfolios based on individual risk
profiles and financial goals, providing automated wealth management solutions.
Sentiment Analysis: AI analyzes news, social media, and other textual data to gauge
market sentiment and predict market movements.
3. Gaming
AI techniques are employed to create realistic game environments, develop intelligent
non-player characters (NPCs), and enhance player experience through procedural
content generation and adaptive gameplay.
Ethical Issues: AI systems can exhibit biases learned from training data, leading to
unfair treatment or decisions. Moreover, the automation of jobs raises concerns about
unemployment and the need for retraining the workforce.
Societal Impact: AI-driven automation has the potential to improve productivity and
create new job opportunities in emerging fields such as AI engineering and data
science. However, it also requires careful management to ensure that societal benefits
are equitably distributed and that ethical guidelines protect individual rights and privacy.
Unit 2
Fundamentals of Machine Learning
● ML enables computers to handle complex tasks that are difficult to program
explicitly.
● It powers various applications such as recommendation systems, image and
speech recognition, medical diagnostics, and autonomous driving.
Machine Learning can be broadly categorized into three main types based on the nature
of the learning process and the availability of labeled data:
There are two main categories of supervised learning, mentioned below:
1. Classification 2. Regression
1. Classification
Naive Bayes
Decision Tree
Support Vector Machine
Random Forest
K-Nearest Neighbors (KNN)
2. Regression
Linear Regression
Polynomial Regression
Ridge Regression
Lasso Regression
● Unsupervised Learning:
○ Unsupervised learning involves learning patterns from unlabeled data.
○ Example: Clustering similar documents together based on their content.
Here are some widely used machine learning algorithms across different types:
Linear Regression:
This is the simplest form of linear regression, and it involves only one
independent variable and one dependent variable. The equation for simple
linear regression is:
y=β0+β1*X
Where,
Y is the dependent variable
X is the independent variable
β0 is the intercept
β1 is the slope
This involves more than one independent variable and one dependent
variable. The equation for multiple linear regression is:
y = β0 + β1*X1 + β2*X2 + ……… + βn*Xn
Where
Y is the dependent variable
X1, X2, …, Xn are the independent variables
β0 is the intercept
β1, β2, …, βn are the slopes
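As an illustration (not part of the original notes), the coefficients can be estimated with scikit-learn; the numbers below are made up:

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: two independent variables (X1, X2) and one dependent variable y
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]])
y = np.array([5, 4, 11, 10, 17])

model = LinearRegression()
model.fit(X, y)

print("Intercept (beta0):", model.intercept_)
print("Slopes (beta1, beta2):", model.coef_)
print("Prediction for X1=6, X2=7:", model.predict([[6, 7]]))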
Used for binary classification problems where the output is a probability value between
0 and 1. Example: Predicting whether an email is spam or not.
For example, we have two classes Class 0 and Class 1 if the value of the logistic
function for an input is greater than 0.5 (threshold value) then it belongs to Class 1
otherwise it belongs to Class 0. It’s referred to as regression because it is the
extension of linear regression but is mainly used for classification problems.
Key points:
=>Logistic regression predicts the output of a categorical dependent variable.
Therefore, the outcome must be a categorical or discrete value.
=>It can be either Yes or No, 0 or 1, true or False, etc. but instead of giving the exact
value as 0 and 1, it gives the probabilistic values which lie between 0 and 1.
=>In Logistic regression, instead of fitting a regression line, we fit an “S” shaped logistic
function, which predicts two maximum values (0 or 1).
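A minimal sketch of the 0.5-threshold idea using scikit-learn (assumed here; the notes do not name a library) on a tiny made-up dataset:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one feature value per sample, class label 0 (not spam) or 1 (spam)
X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# The model outputs a probability between 0 and 1; values above the
# 0.5 threshold are assigned to Class 1, otherwise Class 0.
probs = clf.predict_proba([[1.5], [3.5]])[:, 1]
print("P(class 1):", probs)
print("Predicted classes:", (probs > 0.5).astype(int))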
● Decision Trees:
Root Node: Represents the entire dataset and the initial decision to be
made.
Internal Nodes: Represent decisions or tests on attributes. Each internal
node has one or more branches.
Branches: Represent the outcome of a decision or test, leading to another
node.
Leaf Nodes: Represent the final decision or prediction. No further splits
occur at these nodes.
As we know, the KNN algorithm helps us identify the nearest points or the groups for a query point. But to determine the closest groups or the nearest points for a query point we need some metric. For this purpose, we use the following distance metrics: Euclidean Distance, Manhattan Distance, Minkowski Distance.
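A small NumPy sketch of the three distance metrics for a single query point and one stored point (coordinates are made up):

import numpy as np

query = np.array([1.0, 2.0])
point = np.array([4.0, 6.0])

# Euclidean distance: straight-line distance
euclidean = np.sqrt(np.sum((query - point) ** 2))

# Manhattan distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(query - point))

# Minkowski distance of order p (p=1 gives Manhattan, p=2 gives Euclidean)
p = 3
minkowski = np.sum(np.abs(query - point) ** p) ** (1 / p)

print(euclidean, manhattan, minkowski)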
● Precision: Precision is a measure of a model’s performance that tells you how many of the positive predictions made by
model’s performance that tells you how many of the positive predictions made by
the model are actually correct. It is calculated as the number of true positive
predictions divided by the number of true positive and false positive predictions.
● Recall (Sensitivity): Proportion of true positive predictions among all actual
positive instances.
● F1-score: Harmonic mean of precision and recall, providing a balanced measure
between them.
F1 score = 2*(1/((1/precision)+(1/recall)))
Note: Higher precision with lower recall makes the positive predictions very reliable, but the model misses a large number of positive instances. The higher the F1 score, the better the performance.
Note:
True Positives: It is the case where we predicted Yes and the real output was also
Yes.
True Negatives: It is the case where we predicted No and the real output was also
No.
False Positives: It is the case where we predicted Yes but it was actually No.
False Negatives: It is the case where we predicted No but it was actually Yes.
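These definitions can be turned into a quick calculation; the confusion-matrix counts below are made up for illustration:

# Made-up confusion-matrix counts
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * (1 / ((1 / precision) + (1 / recall)))  # same as 2*precision*recall/(precision+recall)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1 score:  {f1:.2f}")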
Unit-3
Machine Learning Techniques
Importance: Missing data is common in real-world datasets and can adversely affect
model performance if not handled properly.
Data can be missing for many reasons like technical issues, human errors, privacy
concerns, data processing issues, or the nature of the variable itself. Understanding the
cause of missing data helps choose appropriate handling strategies and ensure the
quality of your analysis.
It’s important to understand the reasons behind missing data:
● Identifying the type of missing data: Is it Missing Completely at Random (MCAR),
Missing at Random (MAR), or Missing Not at Random (MNAR)?
● Evaluating the impact of missing data: Is the missingness causing bias or
affecting the analysis?
● Choosing appropriate handling strategies: Different techniques are suitable for
different types of missing data.
Function            Description
drop_duplicates()   Removes duplicate rows based on specified columns.
Techniques:
Deletion: Remove rows or columns with missing data (simplest but can lead to loss of
valuable information).
Imputation: Replace missing values with a statistical estimate (mean, median, mode) or
use predictive methods like K-Nearest Neighbors (KNN) imputation.
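A minimal pandas sketch of both techniques on a small made-up DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [24, np.nan, 22]})

# Deletion: drop rows that contain missing values
df_dropped = df.dropna()

# Imputation: replace missing values with the column mean
df_imputed = df.copy()
df_imputed['Age'] = df_imputed['Age'].fillna(df_imputed['Age'].mean())

print(df_dropped)
print(df_imputed)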
● Scaling guarantees that all features are on a comparable scale and have
comparable ranges. This process is known as feature normalization.
● Algorithm performance improvement: When the features are scaled, several machine learning methods, including gradient descent-based algorithms, distance-based algorithms (such as k-nearest neighbours), and support vector machines, perform better or converge more quickly.
● Preventing numerical instability: Numerical instability can be prevented by
avoiding significant scale disparities between features. Examples include
distance calculations or matrix operations, where having features with radically
differing scales can result in numerical overflow or underflow problems.
● Scaling features ensures that each feature is given the same consideration during the learning process. Without scaling, larger-scale features could dominate the learning, producing skewed outcomes.
Techniques:
First, we should calculate the mean and standard deviation of the data we would like to normalize. Then we subtract the mean value from each entry and divide the result by the standard deviation.
This rescales the data to have a mean equal to zero and a standard deviation equal to 1 (it shifts and scales the values but does not change the shape of their distribution).
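The same procedure written out with NumPy on made-up values:

import numpy as np

data = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

mean = data.mean()
std = data.std()

# Subtract the mean from each entry and divide by the standard deviation
standardized = (data - mean) / std

print(standardized)           # values centred around 0
print(standardized.mean())    # approximately 0
print(standardized.std())     # approximately 1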
Better encoding leads to a better model, and most algorithms cannot handle the
categorical variables unless they are converted into a numerical value.
Techniques:
Suppose we have a column Height in some dataset that has elements as Tall, Medium,
and short. To convert this categorical column into a numerical column we will apply label
encoding to this column. After applying label encoding, the Height column is converted
into a numerical column having elements 0,1, and 2 where 0 is the label for tall, 1 is the
label for medium, and 2 is the label for short height.
Height (original)   Height (encoded)
Tall                0
Medium              1
Short               2
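A short pandas sketch that reproduces the mapping in the table above (the explicit dictionary keeps the Tall=0, Medium=1, Short=2 assignment):

import pandas as pd

df = pd.DataFrame({'Height': ['Tall', 'Medium', 'Short', 'Tall']})

# Label encoding with the same mapping as the table
mapping = {'Tall': 0, 'Medium': 1, 'Short': 2}
df['Height_encoded'] = df['Height'].map(mapping)

print(df)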
In One Hot Encoding, the categorical parameters will prepare separate columns for both
Male and Female labels. So, wherever there is a Male, the value will be 1 in the Male
column and 0 in the Female column, and vice-versa.
Let’s understand with an example: Consider the data where fruits, their corresponding
categorical values, and prices are given.
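A minimal one-hot encoding sketch with pandas; the fruit names and prices below are made-up illustration values:

import pandas as pd

df = pd.DataFrame({'Fruit': ['Apple', 'Mango', 'Apple', 'Orange'],
                   'Price': [100, 120, 100, 80]})

# One-hot encoding: one 0/1 column per category
df_encoded = pd.get_dummies(df, columns=['Fruit'])

print(df_encoded)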
Target Encoding: For a binary classifier, the simplest way to encode a category is by calculating the probability p(t = 1
| x = ci) in which t denotes the target, x is the input and ci is the i-th category. In
Bayesian statistics, this is considered the posterior probability of t=1 given the input was
the category ci.
Process: Evaluate and compare different machine learning models to identify the best
performer for the given task.
Techniques:
Train-Validation-Test Split: Divide data into training, validation, and test sets for model
evaluation.
Parameters vs Hyperparameters: Parameters of a model are learned by the model itself during training. Examples are the weights of an ML model or neural network. Hyperparameters, by contrast, are fixed manually by us before the training phase. Examples: number of epochs, batch size, number of layers in a neural network, activation function, etc. Hyperparameters are adjustable settings that can be tuned to obtain an optimal model.
In machine learning, training, validation, and test data sets are used for different
purposes to evaluate the performance of algorithms that learn from data and make
predictions:
Training data
The largest subset of data used to train the model by adjusting its parameters. This
helps the model learn underlying patterns in the data. The training set should not be too
small, or the model won't have enough data to learn.
Validation data
Used to evaluate the model during the training phase to fine-tune its parameters and
select the best-performing model. The validation set helps improve model performance
by predicting responses for observations in the data set. If there are multiple models to
select from, the validation set can help with model selection. Otherwise, it might be
redundant and can be omitted.
Test data
Used to evaluate the final model's performance on completely unseen data after the model has been trained and validated. The test set can help approximate the model's unbiased accuracy in the real world.
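One common way to obtain the three subsets is two successive splits with scikit-learn (assumed library; X and y are made-up data):

import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 4 features
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

# First split off the test set (20%), then split the remainder into
# training (60% overall) and validation (20% overall).
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20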
Metrics: Use appropriate metrics (accuracy, precision, recall, F1-score, etc.) for
evaluation based on the problem type (classification, regression).
b. Hyperparameter Tuning
Techniques:
Bayesian Optimization: Sequential model-based optimization that uses results from past
iterations to guide the search for optimal hyperparameters.
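Bayesian optimization itself needs an extra library, so as a simpler illustration of hyperparameter tuning the sketch below uses exhaustive grid search (scikit-learn's GridSearchCV) instead; the dataset and grid values are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hyperparameters to try; grid search evaluates every combination with cross-validation
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best cross-validated score:", search.best_score_)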
In machine learning, we cannot simply fit the model on the training data and claim that it will work accurately on real data. For this, we must make sure that our model has learned the correct patterns from the data and is not picking up too much noise. For this purpose, we use the cross-validation technique, whose process is outlined below.
Purpose: Evaluate model performance while maximizing data utilization and minimizing
overfitting.
Techniques:
K-Fold Cross-Validation: In K-Fold Cross-Validation, we split the dataset into k subsets (known as folds), then train on k-1 of the folds and leave one fold out for evaluating the trained model. We iterate k times, with a different fold reserved for testing each time.
The diagram below shows an example of the training and evaluation subsets generated in k-fold cross-validation. Here, we have a total of 25 instances. In the first iteration we use the first 20 percent of the data for evaluation and the remaining 80 percent for training (instances [1-5] for testing and [6-25] for training), while in the second iteration we use the second subset of 20 percent for evaluation and the remaining data for training (instances [6-10] for testing and [1-5] and [11-25] for training), and so on.
● The dataset is divided into k folds while maintaining the proportion of classes in
each fold.
● During each iteration, one-fold is used for testing, and the remaining folds are
used for training.
● The process is repeated k times, with each fold serving as the test set exactly
once.
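A short scikit-learn sketch of stratified k-fold cross-validation as described above (built-in Iris dataset, 5 folds):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5 folds; class proportions are preserved in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())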
Example: Random Forest algorithm, which uses bagging to train decision trees on
random subsets of the data and aggregates their predictions.
b. Boosting
Definition: Sequentially train models where each subsequent model corrects errors
made by the previous one.
1. Initialise the dataset and assign equal weight to each data point.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weight of the wrongly classified data points and decrease the
weights of correctly classified data points. And then normalize the weights of all
data points.
4. if (got required results): Goto step 5
Else: Goto step 2
5. End
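The reweighting loop above is the idea behind AdaBoost; a minimal sketch using scikit-learn's AdaBoostClassifier on a synthetic dataset:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each new weak learner focuses on the samples the previous ones misclassified
boost = AdaBoostClassifier(n_estimators=50, random_state=42)
boost.fit(X_train, y_train)

print("Test accuracy:", boost.score(X_test, y_test))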
Definition: Ensemble learning method that constructs a multitude of decision trees at
training time and outputs the class that is the mode of the classes (classification) or
mean prediction (regression) of the individual trees.
5. Introduction to Deep Learning and Neural Networks
Deep learning is the branch of machine learning that is based on artificial neural network architectures. An artificial neural network (ANN) uses layers of interconnected nodes called neurons that work together to process and learn from the input data.
Components:
Artificial neural networks are built on the principles of the structure and operation of human neurons; they are also known as neural networks or neural nets. An artificial neural
network’s input layer, which is the first layer, receives input from external sources and
passes it on to the hidden layer, which is the second layer. Each neuron in the hidden
layer gets information from the neurons in the previous layer, computes the weighted
total, and then transfers it to the neurons in the next layer. These connections are
weighted, which means that the impacts of the inputs from the preceding layer are more
or less optimized by giving each input a distinct weight. These weights are then
adjusted during the training process to enhance the performance of the model.
Definition:
Data is widely considered a crucial resource in different organizations across every industry. Data Science can be described in simple terms as a separate field of work that deals with the management and processing of data using statistical methods, artificial intelligence, and other tools in partnership with domain specialists. Pursuing Data Science encompasses concepts and methods derived from different fields, including Mathematics, Computer Science, and Information Theory, to interpret large data.
Importance:
Role:
Data Scientist is responsible for analyzing, interpreting, and deriving actionable insights
from complex data sets.
Skills Required:
Programming: Proficiency in languages like Python, R, or SQL.
Statistics and Mathematics: Understanding of statistical methods and mathematical
concepts.
Machine Learning: Knowledge of algorithms for predictive modeling and pattern
recognition.
Data Wrangling: Cleaning, transforming, and preparing data for analysis.
Data Visualization: Communicating insights through charts, graphs, and dashboards.
Domain Knowledge: Understanding of the industry or field in which data is being
analyzed.
Sources of Data:
Internal Sources: Data generated within an organization (e.g., databases, CRM
systems).
External Sources: Data obtained from third-party providers, APIs, social media, etc.
Public Datasets: Available from government agencies, research institutions, etc.
Data Formats:
Structured Data: Organized in a predefined format (e.g., databases, spreadsheets).
Unstructured Data: Not organized in a predefined manner (e.g., text documents,
images, videos).
Data cleaning, also known as data cleansing or data preprocessing, is a crucial step in
the data science pipeline that involves identifying and correcting or removing errors,
inconsistencies, and inaccuracies in the data to improve its quality and usability. Data
cleaning is essential because raw data is often noisy, incomplete, and inconsistent,
which can negatively impact the accuracy and reliability of the insights derived from it.
Process:
Handling Missing Values: Imputation techniques or removal.
Handling Outliers: Identifying and treating outliers appropriately.
Normalization and Standardization: Scaling numerical data.
Data Formatting: Ensuring data is in a consistent format.
Exploratory Data Analysis (EDA) is a crucial initial step in data science projects. It involves analyzing and visualizing data to understand its key characteristics, uncover patterns, locate outliers, and identify relationships between variables. EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or modeling.
Exploratory Data Analysis (EDA) is important for several reasons, especially in the
context of data science and statistical modeling. Here are some of the key reasons why
EDA is a critical step in the data analysis process:
● Understanding Data Structures: EDA helps in getting familiar with the dataset,
understanding the number of features, the type of data in each feature, and the
distribution of data points.
● Detecting Anomalies and Outliers: EDA is essential for identifying errors or
unusual data points that may adversely affect the results of your analysis.
● Testing Assumptions: Many statistical models assume that data follow a certain
distribution or that variables are independent. EDA involves checking these
assumptions. If the assumptions do not hold, the conclusions drawn from the
model could be invalid.
● Informing Feature Selection and Engineering: Insights gained from EDA can
inform which features are most relevant to include in a model and how to
transform them (scaling, encoding) to improve model performance.
● Optimizing Model Design: By understanding the data’s characteristics, analysts
can choose appropriate modeling techniques, decide on the complexity of the
model, and better tune model parameters.
● Facilitating Data Cleaning: EDA helps in spotting missing values and errors in the
data, which are critical to address before further analysis to improve data quality
and integrity.
1. Descriptive Statistics
Definition:
Descriptive statistics are used to describe and summarize the features of a dataset.
They provide simple summaries about the sample and the measures.
Measures:
Mean: Average of all values in a dataset, sensitive to outliers.
Median: Middle value of a dataset when arranged in ascending order, less sensitive to
outliers.
Mode: Most frequent value in a dataset.
Range: Difference between the maximum and minimum values.
Variance: Measure of the spread of data points around the mean.
Standard Deviation: Square root of the variance, indicating the average deviation from
the mean.
2. Inferential Statistics
Definition:
Inferential statistics use data from a sample to make inferences or generalizations about
a larger population.
Techniques:
Hypothesis Testing: Evaluates the likelihood that a result is due to chance.
Null Hypothesis (H0): Statement of no effect or no difference.
Alternative Hypothesis (H1): Statement to be tested.
Significance Level (α): Threshold for rejecting the null hypothesis (typically 0.05).
Confidence Intervals: Range of values within which the true population parameter is
estimated to lie.
Correlation Analysis: Measures the strength and direction of the linear relationship
between two variables (Pearson correlation coefficient).
Regression Analysis: Predicts the value of one variable based on the value of another
(linear regression, logistic regression, etc.).
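A small SciPy sketch of a two-sample hypothesis test and a Pearson correlation; all sample values are made up:

import numpy as np
from scipy import stats

group_a = np.array([5.1, 4.9, 5.4, 5.0, 5.2])
group_b = np.array([5.8, 6.1, 5.9, 6.0, 5.7])

# Hypothesis test: is the difference in means likely due to chance?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("t =", t_stat, "p =", p_value)
if p_value < 0.05:          # significance level alpha = 0.05
    print("Reject the null hypothesis (H0)")

# Correlation analysis: strength and direction of a linear relationship
hours = np.array([1, 2, 3, 4, 5])
scores = np.array([52, 58, 65, 70, 78])
r, p = stats.pearsonr(hours, scores)
print("Pearson r =", r)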
1. Scatter Plots
Definition:
A scatter plot is a graph that displays values for two variables as points on a Cartesian
plane. Each point represents the value of one variable corresponding to the value of the
other.
Applications:
Relationship Exploration: Visualize relationships and correlations between variables.
Outlier Detection: Identify outliers and anomalies in data.
Trend Identification: Spot trends such as clusters or patterns in data points.
Example:
In a dataset of student scores vs. study hours, a scatter plot can show whether there's a
correlation between hours studied and exam scores.
2. Line Charts
Definition:
A line chart displays data points connected by straight line segments. It is particularly
useful for showing trends over time or ordered categories.
Applications:
Time Series Analysis: Track changes in data over time (e.g., stock prices, temperature
trends).
Comparison: Compare trends in multiple datasets (e.g., sales performance across
different regions).
Example:
Showing the growth of a company's revenue over the past five years using a line chart.
3. Histograms
Definition:
A histogram is a graphical representation of the distribution of numerical data. It
consists of bars that show the frequency of data points within defined intervals (bins).
Applications:
Distribution Analysis: Understand the shape, center, and spread of data.
Identifying Skewness: Determine whether data is symmetric or skewed.
Data Preprocessing: Assess data quality and potential outliers.
Example:
Visualizing the distribution of ages in a population to understand the demographic
profile.
4. Bar Charts
Definition:
A bar chart uses rectangular bars to represent categorical data. The length or height of
each bar corresponds to the frequency, count, or percentage of the categories.
Applications:
5. Pie Charts
Definition:
A pie chart is a circular statistical graphic divided into slices to illustrate numerical
proportions. The arc length of each slice is proportional to the quantity it represents.
Applications:
1. NumPy
Definition:
NumPy (Numerical Python) is a library for the Python programming language that
supports large, multi-dimensional arrays and matrices, along with a collection of
mathematical functions to operate on these arrays.
Key Features:
● N-Dimensional Arrays: Core data structure is ndarray.
● Mathematical Functions: Functions for linear algebra, statistics, and
mathematical operations.
● Broadcasting: Support for arithmetic operations on arrays of different shapes.
Installation:
pip install numpy
Basic Usage:
1. Importing NumPy:
import numpy as np
2. Creating Arrays:
# Creating a 1D array
arr1 = np.array([1, 2, 3, 4, 5])
print(arr1)
# Creating a 2D array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2)
3. Array Operations:
# Array addition
arr_sum = arr1 + 10
print(arr_sum)
# Matrix multiplication
arr_mult = np.dot(arr2, arr2.T)
print(arr_mult)
4. Array Statistics:
# Mean, Median, Standard Deviation
mean_val = np.mean(arr1)
median_val = np.median(arr1)
std_dev = np.std(arr1)
2. Pandas
Definition:
Pandas is a data manipulation and analysis library for Python. It provides data
structures and functions needed to work on structured data seamlessly.
Key Features:
● DataFrames: 2D labeled data structure with columns of potentially different types.
● Series: 1D labeled array capable of holding any data type.
Installation:
pip install pandas
Basic Usage:
1. Importing Pandas:
import pandas as pd
2. Creating DataFrames:
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22]}
df = pd.DataFrame(data)
print(df)
3. DataFrame Operations:
# Adding a new column
df['Gender'] = ['F', 'M', 'M']
print(df)
# Descriptive Statistics
print(df.describe())
# Data Selection
print(df['Name']) # Selecting a column
print(df.iloc[0]) # Selecting a row by index
4. Data Cleaning:
# Handling Missing Values
df_missing = df.copy()
df_missing.loc[1, 'Age'] = None  # Introduce a missing value
df_cleaned = df_missing.fillna(df_missing['Age'].mean())  # Fill missing values with the mean
print(df_cleaned)
3. Matplotlib
Definition:
Matplotlib is a plotting library for Python and is widely used for creating static, animated,
and interactive visualizations.
Key Features:
Flexibility: Wide range of plot types.
Customization: Extensive customization options for plots.
Installation:
pip install matplotlib
Basic Usage:
1. Importing Matplotlib:
import matplotlib.pyplot as plt
2. Creating Plots:
Line Plot:
# Line Plot
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y)
plt.title('Line Plot')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.show()
Bar Plot:
# Bar Plot
categories = ['A', 'B', 'C']
values = [10, 15, 7]
plt.bar(categories, values)
plt.title('Bar Plot')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.show()
Histogram:
# Histogram
import numpy as np  # NumPy is needed here for the random sample
data = np.random.randn(1000)  # Generate 1000 random data points
plt.hist(data, bins=30)
plt.title('Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
4. Seaborn
Definition:
Seaborn is a Python visualization library based on Matplotlib that provides a high-level
interface for drawing attractive and informative statistical graphics.
Key Features:
High-Level Interface: Easier syntax for complex plots.
Integrated Data Analysis: Built-in functions for statistical plotting.
Installation:
pip install seaborn
Basic Usage:
1. Importing Seaborn:
import seaborn as sns
2. Creating Plots:
Scatter Plot:
# Scatter Plot
tips = sns.load_dataset('tips')
sns.scatterplot(x='total_bill', y='tip', data=tips)
plt.title('Scatter Plot of Total Bill vs Tip')
plt.show()
Box Plot:
# Box Plot
sns.boxplot(x='day', y='total_bill', data=tips)
plt.title('Box Plot of Total Bill by Day')
plt.show()
Heatmap:
# Heatmap (correlations of the numeric columns only)
corr = tips.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Heatmap of Correlations')
plt.show()
Unit 5
Advanced Topics and Applications
1. Support Vector Machines (SVM)
Key Concepts:
Hyperplane: A decision boundary that separates different classes.
Support Vectors: Data points that are closest to the hyperplane and influence its
position.
Margin: The distance between the hyperplane and the support vectors.
Types of SVM:
So we choose the hyperplane whose distance from it to the nearest data point on each
side is maximized. If such a hyperplane exists it is known as the maximum-margin
hyperplane/hard margin. So from the above figure, we choose L2. Let’s consider a
scenario like shown below
Here we have one blue ball in the boundary of the red balls. So how does SVM classify
the data? It’s simple! The blue ball in the boundary of red ones is an outlier of blue balls.
The SVM algorithm has the characteristics to ignore the outlier and finds the best
hyperplane that maximizes the margin. SVM is robust to outliers.
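A minimal scikit-learn SVM sketch (the notes give no code, so the dataset and parameters here are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# C controls the softness of the margin: lower C tolerates more outliers
clf = SVC(kernel='linear', C=1.0)
clf.fit(X_train, y_train)

print("Number of support vectors per class:", clf.n_support_)
print("Test accuracy:", clf.score(X_test, y_test))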
2. Neural Networks
Definition:
Neural Networks are computational models inspired by the human brain's network of
neurons. They consist of layers of interconnected nodes (neurons) that transform input
data into output predictions.
Key Concepts:
Neurons: Basic units that receive inputs, apply weights, and pass the result through an
activation function.
Activation Functions: Functions that introduce non-linearity (e.g., Sigmoid, ReLU, Tanh).
Common Architectures:
Feedforward Neural Networks (FNN): Data moves in one direction, from input to output.
Multi-Layer Perceptrons (MLP): FNNs with one or more hidden layers.
Consider a neural network for email classification. The input layer takes features like
email content, sender information, and subject. These inputs, multiplied by adjusted
weights, pass through hidden layers. The network, through training, learns to recognize
patterns indicating whether an email is spam or not. The output layer, with a binary
activation function, predicts whether the email is spam (1) or not (0). As the network
iteratively refines its weights through backpropagation, it becomes adept at
distinguishing between spam and legitimate emails, showcasing the practicality of
neural networks in real-world applications like email filtering.
Neural networks are complex systems that mimic some features of the functioning of the human brain. A network is composed of an input layer, one or more hidden layers, and an output layer made up of coupled artificial neurons. The two stages of the basic training process are forward propagation and backpropagation.
Forward Propagation
● Input Layer: Each feature in the input layer is represented by a node on the
network, which receives input data.
● Weights and Connections: The weight of each neuronal connection indicates
how strong the connection is. Throughout training, these weights are changed.
● Hidden Layers: Each hidden layer neuron processes inputs by multiplying them
by weights, adding them up, and then passing them through an activation
function. By doing this, non-linearity is introduced, enabling the network to
recognize intricate patterns.
● Output: The final result is produced by repeating the process until the output
layer is reached.
Backpropagation:
● Loss Calculation: The network’s output is evaluated against the real goal values,
and a loss function is used to compute the difference. For a regression problem,
the Mean Squared Error (MSE) is commonly used as the cost function.
● G radient Descent: Gradient descent is then used by the network to reduce the
loss. To lower the inaccuracy, weights are changed based on the derivative of the
loss with respect to each weight.
● Adjusting weights: The weights are adjusted at each connection by applying this
iterative process, or backpropagation, backward across the network.
● Training: During training with different data samples, the entire process of
forward propagation, loss calculation, and backpropagation is done iteratively,
enabling the network to adapt and learn patterns from the data.
● Activation Functions: Model non-linearity is introduced by activation functions like
the rectified linear unit (ReLU) or sigmoid. Their decision on whether to “fire” a
neuron is based on the whole weighted input.
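A tiny NumPy sketch of a single forward pass through one hidden layer; the weights and inputs are random illustration values, and no backpropagation/training is shown:

import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.random(3)            # input layer: 3 features

W1 = rng.random((4, 3))      # weights input -> hidden (4 hidden neurons)
b1 = np.zeros(4)
W2 = rng.random((1, 4))      # weights hidden -> output (1 output neuron)
b2 = np.zeros(1)

# Forward propagation: weighted sums followed by activation functions
hidden = relu(W1 @ x + b1)
output = sigmoid(W2 @ hidden + b2)

print("Network output (e.g. spam probability):", output)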
3. Convolutional Neural Networks (CNNs)
Definition:
Convolutional Neural Networks (CNNs) are specialized neural networks designed for
processing structured grid data, like images.
Key Concepts:
Convolutions: Operations that apply a filter to an image to create feature maps.
Pooling Layers: Reduce the spatial dimensions of feature maps (e.g., Max Pooling,
Average Pooling).
Fully Connected Layers: Layers where each neuron is connected to every neuron in the
previous layer.
1. Convolutional Layer
Function:
The convolutional layer is the core building block of a CNN. It applies a set of filters
(kernels) to the input image to produce feature maps. Each filter detects specific
features like edges, textures, or patterns.
How It Works:
Filters: Small matrices (e.g., 3x3 or 5x5) that slide over the input image. Each filter
detects different features.
Convolution Operation: The filter multiplies its values with the pixel values of the image
and sums the results to produce a single output value. This operation is performed
across the entire image.
Mathematical Operation:
Feature Map=Image∗Filter
2. ReLU (Activation) Layer
Function:
The ReLU (Rectified Linear Unit) activation function introduces non-linearity into the
model. It replaces all negative pixel values with zero.
Mathematical Operation:
ReLU(𝑥)=max(0,𝑥)
3. Pooling Layer
Function:
Pooling layers reduce the spatial dimensions of feature maps, decreasing the number of
parameters and computation required, and helping to avoid overfitting.
Types of Pooling: Max Pooling, Average Pooling.
4. Flattening Layer
Function:
The flattening layer converts the 2D matrix into a 1D vector. This step is necessary to
feed the output into fully connected layers.
Mathematical Operation: the 2D feature maps are reshaped into a single 1D vector.
5. Fully Connected Layer
Function:
The fully connected layer (Dense layer) performs the final classification or regression
tasks. Every neuron in this layer is connected to every neuron in the previous layer.
Mathematical Operation:
y = Wx + b
Where:
x is the input vector,
W is the weight matrix,
b is the bias term,
y is the output vector.
4. Recurrent Neural Networks (RNNs)
Definition:
Recurrent Neural Networks (RNNs) are a type of artificial neural network designed to
handle sequential data. Unlike traditional feedforward neural networks, RNNs have
connections that form directed cycles, allowing information to persist across time steps.
Applications:
RNNs are used in various applications involving sequential data, such as language modeling, machine translation, speech recognition, and time-series forecasting.
a. Recurrent Structure
RNNs have a structure that includes a feedback loop, allowing the network to use
information from previous time steps.
Description:
Input Vector x_t: The data at time step t.
Hidden State h_t: The output of the hidden layer that carries information from previous time steps.
Recurrent Connection: The hidden state h_t is used as input for the next time step.
Mathematical Representation: For a given time step t, the RNN (in its standard formulation) performs the following operations:
h_t = f(W_xh * x_t + W_hh * h_(t-1) + b_h)
y_t = W_hy * h_t + b_y
where f is an activation function such as tanh.
Unroll the RNN: Expand the RNN into a chain of layers corresponding to each time
step.
Compute Gradients: Calculate the gradients for each layer over the entire sequence.
Update Weights: Adjust the weights using the computed gradients.
What is NLP?
Definition:
Natural Language Processing (NLP) is a field of artificial intelligence (AI) that focuses
on enabling computers to understand, interpret, and generate human language. It
combines computational linguistics, machine learning, and computer science to facilitate
interactions between humans and machines through natural language.
Purpose:
NLP aims to bridge the gap between human communication and computer
understanding, making it possible for machines to process and analyze large amounts
of natural language data.
Key Techniques in NLP
Text Classification
What: Categorizes text into predefined categories.
Examples: Spam detection, sentiment analysis.
Technique: TF-IDF, Naive Bayes, SVM.
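A compact sketch of the TF-IDF + Naive Bayes approach named above, using scikit-learn; the example sentences and labels are made up:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting at 10 am tomorrow",
         "free offer click here", "project report attached"]
labels = [1, 0, 1, 0]        # 1 = spam, 0 = not spam

# TF-IDF turns text into numeric features; Naive Bayes classifies them
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free prize inside"]))   # expected: [1]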
Sentiment Analysis
What: Determines the sentiment expressed in text.
Examples: Analyzing product reviews, social media sentiment.
Technique: Rule-based systems, machine learning, deep learning.
Machine Translation
What: Translates text from one language to another.
Examples: Google Translate, language learning apps.
Technique: Statistical Machine Translation, Neural Machine Translation.
Question Answering
What: Provides answers to questions posed in natural language.
Examples: Virtual assistants, customer support.
Technique: Retrieval-based systems, generative models.
Speech Recognition
What: Converts spoken language into text.
Examples: Voice assistants, transcription services.
Technique: Acoustic models, language models, deep learning.
Text Summarization
What: Creates a concise summary of a longer text.
Examples: Summarizing news articles, executive summaries.
Technique: Extractive summarization, abstractive summarization.
Information Retrieval
What: Searches for relevant information from large datasets.
Examples: Search engines, document retrieval.
Technique: Vector space models, ranking algorithms.
Text Generation
What: Generates coherent and contextually relevant text.
Examples: Content creation, creative writing.
Technique: Language models, generative models.
Real-World Applications
Customer Support
Example: Chatbots answering customer queries.
Benefit: 24/7 support and cost efficiency.
Language Translation
Example: Google Translate for multilingual communication.
Benefit: Breaking language barriers globally.
Sentiment Analysis
Example: Analyzing Twitter posts for brand sentiment.
Benefit: Understanding customer opinions and market trends.
Healthcare
Example: Extracting information from medical records.
Benefit: Improving patient care and management.
Education
Example: Language learning apps and automated tutoring.
Benefit: Enhancing learning experiences.
Future Trends
Advanced Models: Development of more sophisticated models like GPT-4.
Multimodal Approaches: Combining text with other data types (images, videos).
Ethical NLP: Addressing biases and ensuring fairness in AI applications.
Conclusion
NLP is a dynamic and rapidly evolving field that leverages techniques from AI and
machine learning to process and analyze human language. Its applications are diverse
and impactful, from enhancing customer service to enabling real-time translation and
generating human-like text. As technology advances, NLP will continue to transform
how we interact with machines and process language data.
Introduction to Big Data Technologies: Hadoop, Spark, and More
Purpose:
The aim of Big Data technologies is to efficiently store, manage, and analyze massive
datasets to extract valuable insights, drive decisions, and create innovative solutions.
Components of Hadoop:
MapReduce:
Function: A programming model for processing large data sets with a distributed
algorithm.
Components:
Mapper: Processes input data and generates key-value pairs.
Reducer: Aggregates and processes the results of the Mapper.
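A plain-Python sketch of the Mapper/Reducer idea on a word-count task; it mimics the programming model only and does not run on a Hadoop cluster:

from collections import defaultdict

documents = ["big data needs big tools", "spark and hadoop process big data"]

# Mapper: emit (key, value) pairs, here (word, 1)
def mapper(text):
    for word in text.split():
        yield word, 1

# Shuffle + Reducer: aggregate all values that share the same key
counts = defaultdict(int)
for doc in documents:
    for word, one in mapper(doc):
        counts[word] += one

print(dict(counts))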
YARN (Yet Another Resource Negotiator):
Function: Manages resources and job scheduling across the Hadoop cluster.
Components:
ResourceManager: Manages resources across the cluster.
NodeManager: Manages resources and tasks on individual nodes.
Use Cases:
Data Storage: Store and manage large datasets from various sources.
Data Processing: Analyze large-scale data for patterns and insights.
Data Integration: Combine data from different sources for a unified analysis.
b. Apache Spark
What is Spark?
Apache Spark is an open-source unified analytics engine for large-scale data
processing. It provides fast, in-memory data processing capabilities and supports
various workloads like batch processing, streaming, and machine learning.
Components of Spark:
Spark Core: The underlying execution engine responsible for task scheduling, memory management, and basic I/O across the cluster.
Spark Streaming: Extends Spark to process real-time streams of data.
Use Cases:
Data Processing: High-performance processing of large datasets.
Real-Time Analytics: Analyzing data as it is generated.
Machine Learning: Building and deploying machine learning models.
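A minimal PySpark sketch of in-memory DataFrame processing, assuming PySpark is installed; the file name sales.csv and the column region are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SalesByRegion").getOrCreate()

# Hypothetical input file and column names, for illustration only
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Distributed, in-memory aggregation
df.groupBy("region").count().show()

spark.stop()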
Approach:
ata Collection: Collects data on job applications, user profiles, and interactions.
D
Data Analysis: Analyzes data to improve job matching algorithms.
Outcome: More relevant job recommendations and improved user engagement.
Trend: Combining Big Data technologies with machine learning and AI for advanced
analytics.
Example: Databricks platform for integrated analytics.
b. Career Prospects
Job Roles:
Conclusion
Big Data technologies like Hadoop and Spark are crucial for managing and analyzing
massive datasets. They offer tools for scalable data storage, efficient processing, and
advanced analytics. Real-world applications span various domains, from entertainment
to e-commerce, demonstrating the impact of Big Data on modern businesses.