Data Analytics Compendium BITeSys 2024
"Information is the oil of the 21st century, and analytics is the combustion engine”
- Peter Sondergaard
1. Foundational Knowledge:
o Basics of statistics and data analysis.
o Key concepts in machine learning (ML) and artificial intelligence (AI),
including supervised and unsupervised learning, deep learning, and neural
networks.
2. Advanced Topics:
o Exploration of large language models (LLMs) and generative pre-trained
transformers (GPT).
o Latest trends and developments in data analytics and AI.
Use this compendium as a starting point and seek out additional resources to expand your
knowledge. Feel free to reach out to us with any questions. We are here to support you
throughout your learning journey.
Your ability to analyse and interpret data will distinguish you in the professional world. This
compendium is designed to help you succeed in your placement process.
Best regards,
Team bITeSys
Outline
1. Why Statistics?
2. Statistical Methods
3. Types of Statistics: Descriptive and Inferential Statistics
4. Data Sources and Types of Datasets
5. Attributes of Datasets
Statistics plays a pivotal role in modern analytical decision-making processes, driven by
several significant developments:
• Revolution of Internet and Social Networks: The explosion of internet usage and
social media platforms has led to an unprecedented amount of data being generated.
Platforms like Facebook, Twitter, and Instagram produce massive amounts of user-
generated content daily. Each post, comment, like, and share contributes to a growing
dataset that can provide deep insights into human behaviour and social trends.
• Mobile Phones and Electronic Devices: The proliferation of smartphones and other
electronic devices contributes significantly to data generation. Every interaction,
search, and transaction produces data. For example, GPS data from mobile phones can
track movement patterns, while app usage data reveals user preferences and habits.
• Insight Discovery: Organizations harness this data to uncover patterns and trends.
These insights help in improving profitability, understanding customer expectations,
and appropriately pricing products. For example, e-commerce companies analyse
browsing and purchasing behaviour to recommend products, while social media
platforms use data to personalize user feeds. This strategic use of data allows
companies to gain a competitive advantage in the marketplace.
• Massive Data Processing: Enhanced computing capabilities now allow for the
processing and analysis of large datasets that were previously unmanageable. This
includes advancements in both hardware (e.g., GPUs, TPUs) and software (e.g.,
distributed computing frameworks like Hadoop and Spark) that enable complex
calculations to be performed quickly.
• Sophisticated Algorithms: The development of faster and more efficient algorithms
has significantly improved problem-solving capabilities. Algorithms can now handle
large volumes of data and provide more accurate predictions and analyses. For
example, machine learning algorithms like deep learning can process vast amounts of
unstructured data, such as images and text, to identify patterns and make predictions.
Big Data
Big Data refers to datasets that cannot be managed, processed, or analysed with traditional
software or algorithms within a reasonable timeframe. Big Data is commonly characterized
by its volume, velocity, and variety.
Examples:
• Walmart: Handles over one million purchase transactions per hour, generating
massive amounts of transactional data. This data is used to optimize inventory
management, improve supply chain efficiency, and personalize marketing efforts.
• Facebook: Processes more than 250 million picture uploads per day, showcasing the
volume and variety of data. Analysing this data helps Facebook improve user
experience through targeted ads, personalized content, and enhanced security
measures.
Classification:
• Purpose: Segments customers into groups based on key characteristics. This helps in
targeting specific customer segments with tailored marketing strategies.
• Applications:
o Customer Segmentation: Organizations can segment customers into Long
Term Customers, Medium Term Customers, and Brand Switchers. This
segmentation helps in designing loyalty programs, targeted promotions, and
personalized communications.
o Buyers and Non-Buyers: Classification models can differentiate between
customers who are likely to make a purchase and those who are not. This
helps in optimizing marketing spend by focusing efforts on high-potential
customers.
• Benefits: Helps professionals understand customer behaviour, allowing them to better
position their products and brands. For example, a company could develop different
marketing strategies for long-term loyal customers versus occasional buyers. By
understanding the characteristics of each segment, businesses can tailor their offerings
and communications to better meet customer needs.
Pattern Recognition:
• Purpose: Reveals hidden patterns in data that might not be immediately obvious.
• Techniques:
o Histogram: Visualizes the distribution of data. For example, a histogram of
customer incomes might reveal a bell curve or skewed distribution, providing
insights into income inequality and purchasing power.
o Box Plot: Identifies outliers and provides a summary of the data distribution.
Box plots are useful for comparing distributions across different groups or
time periods.
o Scatter Plot: Captures relationships between two variables, such as age and
expenditure. Scatter plots can help identify correlations and trends that inform
business decisions.
• Benefits: Visual analytics provide clear insights that can be leveraged by retail
professionals. For instance, recognizing spending patterns among different age groups
can inform targeted promotions. By visualizing data, businesses can quickly identify
and act on opportunities and challenges.
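To make these techniques concrete, here is a minimal sketch (not from the compendium itself) showing how a histogram, box plot, and scatter plot could be produced with Python and matplotlib; the customer ages and expenditures are invented illustrative values.

```python
# A minimal, illustrative sketch using matplotlib; the data values are invented.
import matplotlib.pyplot as plt

ages = [22, 25, 31, 34, 38, 41, 45, 47, 52, 58, 63, 67]               # hypothetical customer ages
spend = [120, 150, 200, 210, 260, 280, 300, 310, 330, 340, 360, 390]  # hypothetical monthly expenditure

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].hist(ages, bins=5)        # histogram: distribution of ages
axes[0].set_title("Histogram of ages")

axes[1].boxplot(spend)            # box plot: spread and potential outliers
axes[1].set_title("Box plot of expenditure")

axes[2].scatter(ages, spend)      # scatter plot: relationship between age and expenditure
axes[2].set_title("Age vs. expenditure")

plt.tight_layout()
plt.show()
```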
Association:
Predictive Modelling:
Types of Statistics
Descriptive Statistics: Methods for summarizing and describing the data at hand, for example
through tables, charts, and summary measures such as the mean and standard deviation.
Inferential Statistics: Methods for drawing conclusions about a larger population from a
sample of data, for example through estimation and hypothesis testing.
Data Sources:
Types of Datasets:
Attributes of Datasets
• Quality: Accuracy and reliability of the data. High-quality data is free from errors
and inconsistencies, providing a solid foundation for analysis.
• Relevance: The importance of the data in relation to the problem being analysed.
Relevant data addresses specific research questions and objectives.
• Timeliness: How up to date the data is. Timely data reflects current conditions and
trends, ensuring that analysis and decisions are based on the latest information.
• Completeness: The extent to which all required data is present. Complete data
includes all necessary variables and observations, reducing the risk of bias and
missing information.
• Consistency: The uniformity of the data across different sources. Consistent data
maintains the same formats and definitions, facilitating integration and comparison
across datasets.
Outline
1. Raw Data
2. Frequency Distribution - Histograms
3. Cumulative Frequency Distribution
4. Measures of Central Tendency
5. Mean, Median, Mode
6. Measures of Dispersion
7. Range, IQR, Standard Deviation, Coefficient of Variation
8. Normal Distribution, Chebyshev Rule
9. Five Number Summary, Boxplots, QQ Plots, Quantile Plot, Scatter Plot
10. Scatter Plot Matrix
11. Correlation Analysis
When analysts encounter a plethora of data that initially seems nonsensical, they seek
methods to classify and organize this data to convey meaningful insights. The objective is to
transform raw data into information that aids in drawing accurate conclusions.
Raw Data
Raw Data represents numbers and facts in their original format as collected. This data needs
to be processed and converted into information for effective decision-making.
Frequency Distribution
Frequency Distribution is a summarized table where raw data is arranged into classes with
corresponding frequencies. This technique classifies raw data into usable information and is
widely used in descriptive statistics.
Histogram
A histogram is a graphical display of a frequency distribution in which the classes are shown
on the horizontal axis and their frequencies are represented by the heights of adjacent bars.
Cumulative Frequency Distribution
A Cumulative Frequency Distribution shows how many observations fall below the upper
boundary of each class.
Measures of Central Tendency
Central tendency measures describe the centre point of a data set. Common measures include
the mean, median, and mode.
Arithmetic Mean:
The mean is the sum of all observations divided by the number of observations.
Median:
The median is the middle value when data is arranged in ascending order. It divides the data
set into two equal parts.
Mode:
The mode is the value that occurs most frequently in a data set.
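A short sketch of how these three measures can be computed with Python's built-in statistics module (the sample values below are invented for illustration):

```python
# Illustrative only; the data values are invented.
import statistics

data = [12, 15, 15, 18, 20, 22, 22, 22, 30]

print(statistics.mean(data))    # arithmetic mean: sum of values divided by the number of values
print(statistics.median(data))  # middle value of the sorted data
print(statistics.mode(data))    # most frequently occurring value (22)
```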
Measures of Dispersion
Dispersion measures indicate the spread of data around the central tendency. They help in
understanding the variability within a data set.
Range:
The range is the difference between the maximum and minimum values in a data set.
Interquartile Range (IQR):
The IQR is the range of the middle 50% of observations, calculated as the difference between
the third quartile (Q3) and the first quartile (Q1).
Standard Deviation and Variance:
Standard deviation measures the average deviation of each data point from the mean.
Variance is the square of the standard deviation.
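The dispersion measures can be sketched in the same spirit with NumPy (data invented; note that NumPy computes the population standard deviation and variance by default, so ddof=1 is passed for the sample versions):

```python
# Illustrative only; the data values are invented.
import numpy as np

data = np.array([12, 15, 15, 18, 20, 22, 22, 22, 30])

data_range = data.max() - data.min()          # range: maximum minus minimum
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                 # interquartile range: Q3 - Q1
std = data.std(ddof=1)                        # sample standard deviation
var = data.var(ddof=1)                        # sample variance = standard deviation squared
cv = std / data.mean()                        # coefficient of variation: std relative to the mean

print(data_range, iqr, std, var, cv)
```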
Empirical Rule:
In a bell-shaped (approximately normal) distribution, about 68% of observations fall within
one standard deviation of the mean, about 95% fall within two standard deviations, and about
99.7% fall within three standard deviations.
Boxplot:
A boxplot graphically displays the five-number summary and shows the distribution shape,
spread, and potential outliers.
Scatter Plot:
A scatter plot shows the relationship between two variables, helping to identify clusters,
outliers, and correlations.
Scatter Plot Matrix:
A scatter plot matrix displays multiple scatter plots for pairs of variables, providing a
comprehensive view of relationships within the data set.
Correlation Analysis:
Scatter plots and correlation coefficients (r) quantify the strength and direction of the
relationship between numeric variables.
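As a sketch, the correlation coefficients and a scatter plot matrix could be obtained with pandas; the small DataFrame here is invented purely for illustration.

```python
# Illustrative only; the data values are invented.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "age":         [22, 31, 38, 45, 52, 63],
    "income":      [28, 42, 55, 61, 70, 66],
    "expenditure": [12, 18, 25, 27, 33, 30],
})

print(df.corr())                 # pairwise Pearson correlation coefficients (r)

pd.plotting.scatter_matrix(df)   # scatter plot matrix for every pair of variables
plt.show()
```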
Outline
1. Data Cleaning
2. Data Handling
3. Data Wrangling
4. Data Repositories
5. Types of data repositories
6. Types of charts and data visualisation graphs
Data Cleaning:
This involves identifying and correcting errors, inconsistencies, duplication, missing values,
and outliers in the dataset. Techniques include:
Removing duplicate records: Ensures data uniqueness.
Imputation: Filling in missing values using statistical methods or machine learning
algorithms.
Formatting corrections: Standardizing date formats, correcting typos, etc.
Handling outliers: Identifying and addressing data points that significantly deviate from the
norm.
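A minimal pandas sketch of these cleaning steps; the column names and values are hypothetical, and the outlier rule shown (1.5 × IQR) is just one common choice.

```python
# Illustrative pandas sketch; column names and values are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "signup_date": ["2024-01-05", "2024-01-05", "2024-01-05", "2024-02-10", None],
    "spend":       [120.0, 95.0, 95.0, None, 20000.0],
})

df = df.drop_duplicates()                               # remove duplicate records
df["spend"] = df["spend"].fillna(df["spend"].median())  # impute missing values with the median
df["signup_date"] = pd.to_datetime(df["signup_date"])   # standardize dates to a single datetime type

# flag outliers that deviate strongly from the norm (1.5 x IQR rule of thumb)
q1, q3 = df["spend"].quantile([0.25, 0.75])
iqr = q3 - q1
df["spend_outlier"] = (df["spend"] < q1 - 1.5 * iqr) | (df["spend"] > q3 + 1.5 * iqr)

print(df)
```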
Data Handling:
This encompasses the overall management of data from collection to storage, including:
Data Collection: Gathering data from various sources such as databases, web scraping, and
APIs.
Data Storage: Using databases, data lakes, and warehouses to store large datasets efficiently.
Data Security: Ensuring data privacy and protection through encryption, access controls, and
compliance with regulations.
Data Wrangling:
This is the process of transforming and mapping data from one "raw" form into a format that
is more appropriate and valuable for analysis or machine learning. It is also known as data
munging, data cleaning, or data remediation. Techniques include:
Data Transformation: Converting data into a suitable format for analysis, such as
normalization, scaling, and encoding categorical variables.
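A short sketch of these transformation steps using only pandas; the column names, values, and chosen transformations are illustrative assumptions, not a prescribed recipe.

```python
# Illustrative sketch; column names and values are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "income": [28000, 42000, 55000, 61000],
    "age":    [22, 35, 47, 58],
    "city":   ["Mumbai", "Delhi", "Mumbai", "Pune"],
})

# Scaling (standardization): transform income to zero mean and unit variance
df["income_scaled"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Normalization: rescale age to the [0, 1] range
df["age_norm"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())

# Encoding categorical variables: turn city into numeric indicator (dummy) columns
df = pd.get_dummies(df, columns=["city"])

print(df)
```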
Data Repositories
A data repository is a database infrastructure that collects, manages, and stores datasets. A
data repository is also known as a data archive or data library.
Why do we need Data repositories?
1. Centralized data management: Ability to store all the data in one central location,
making it easier to access, manage and analyse.
2. Data Consistency: Consistency and accuracy of data across different departments
and systems.
3. Improved Decision Making: Comprehensive data enables informed decisions
regarding risk management, customer service, product development and more.
4. Regulatory Compliance: A single source of truth aids in auditing and reporting
purposes.
5. Cost Efficiency: Consolidating data reduces costs for data storage, maintenance,
and integration.
Pie Charts: Show parts or percentages of a whole. Useful for proportional data when you
want to illustrate the proportion of each category in the dataset. Ideal when there are fewer
than 7-8 categories; otherwise the chart may lose clarity.
Bar Charts: A bar chart visually represents data using rectangular bars or columns. Here, the
length of each bar corresponds proportionally to its value. It is used for comparing quantities
of different categories by showing their relative sizes. A horizontal bar chart is preferable
when the category labels are lengthy.
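For reference, a minimal matplotlib sketch of a pie chart and a horizontal bar chart; the category names and values are invented.

```python
# Illustrative only; category names and values are invented.
import matplotlib.pyplot as plt

categories = ["Electronics", "Clothing", "Groceries", "Home & Kitchen"]
sales = [35, 25, 30, 10]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.pie(sales, labels=categories, autopct="%1.0f%%")   # pie chart: share of each category in the whole
ax1.set_title("Sales share by category")

ax2.barh(categories, sales)                            # horizontal bars suit lengthy category labels
ax2.set_title("Sales by category")

plt.tight_layout()
plt.show()
```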
Outline
Artificial Intelligence (AI) is a broad field of computer science focused on creating systems
capable of performing tasks that typically require human intelligence. These tasks include
reasoning, learning, problem-solving, perception, language understanding, and more. AI aims
to create machines that can mimic human cognitive functions.
Key Concepts:
• General AI: AI systems that possess the ability to perform any intellectual task that a
human can do. This remains largely theoretical.
• Narrow AI: AI systems designed for specific tasks, such as speech recognition,
image recognition, and language translation. This is where most current AI
applications lie.
Machine Learning (ML) is a subset of AI in which systems learn patterns from data rather
than being explicitly programmed for each task.
Key Concepts:
• Supervised Learning: Algorithms are trained on labelled data, meaning the input
data is paired with the correct output. The model learns to map inputs to outputs.
• Unsupervised Learning: Algorithms are trained on unlabelled data and must find
hidden patterns or intrinsic structures within the input data.
• Reinforcement Learning: Algorithms learn by interacting with an environment and
receiving rewards or penalties based on their actions.
Deep Learning (DL) is a subset of machine learning that utilizes neural networks with many
layers (hence "deep") to model complex patterns in large amounts of data. These neural
networks, inspired by the human brain, are capable of learning hierarchical representations of
data, which makes them particularly effective for tasks like image and speech recognition.
Key Concepts:
Overview
• Data as Input: Machine learning models take data as input to find patterns.
• Finding Patterns: The goal is to identify and summarize patterns in a mathematically
precise way.
• Automating Model Building: Machine learning automates the process of model
building, making it more efficient and scalable.
Challenges in Data
• Overfitting: When a model captures the noise along with the information, it is
overfitting. Overfitting leads to poor prediction performance on new, unseen data.
• Underfitting: When a model fails to capture all the relevant information, it is
underfitting. Underfitting also results in poor prediction performance.
1. Supervised Learning:
• Definition: Building a mathematical model using data that contains both inputs and
desired outputs (ground truth).
• Examples:
o Image classification (e.g., determining if an image contains a horse).
o Loan default prediction.
o Employee turnover prediction.
• Evaluation: Model performance can be evaluated by comparing predictions to the
actual desired outputs.
2. Unsupervised Learning:
• Definition: Building a mathematical model using data that contains only inputs and
no desired outputs.
• Purpose: To find structure in the data, such as grouping or clustering data points to
discover patterns.
• Example: An advertising platform segments the population into groups with similar
demographics and purchasing habits, aiding in targeted advertising.
• Evaluation: Since no labels are provided, there is no straightforward way to compare
model performance.
Supervised Learning is a type of machine learning where the model is trained using labelled
data. This means the training dataset includes both the input features (what we use to make
predictions) and the output labels (the actual outcomes). The goal of supervised learning is to
learn a mapping from inputs to outputs, allowing the model to make accurate predictions on
new, unseen data.
Key Concepts
1. Inputs (Features): The variables or attributes that are used to make predictions. For
example, in predicting house prices, features might include the size of the house, the
number of bedrooms, and the location.
2. Outputs (Labels): The target variable or outcome that the model aims to predict.
Continuing with the house price example, the output would be the actual price of the
house.
3. Training Phase: The process where the model learns the relationship between inputs
and outputs by being exposed to the training data.
4. Prediction Phase: After training, the model uses the learned relationships to predict
the output for new, unseen inputs.
1. Regression: Predicts a continuous numerical value, such as a house price or a temperature.
2. Classification: Predicts a discrete category or class label, such as spam versus not spam.
1. Image Classification:
2. Spam Detection:
3. Sentiment Analysis:
4. Fraud Detection:
5. Medical Diagnosis:
1. Data Collection:
o Collect a dataset that includes both the inputs (features) and the corresponding
outputs (labels).
2. Data Preparation:
o Clean and preprocess the data to make it suitable for training. This might
include handling missing values, normalizing the data, and converting
categorical variables into numerical ones.
1. Overfitting:
o Occurs when the model learns not only the underlying patterns but also the
noise in the training data, leading to poor performance on new data.
2. Underfitting:
o Happens when the model is too simple to capture the underlying patterns in
the data, resulting in poor performance on both training and test data.
3. Data Quality:
o The quality of the training data significantly impacts the model’s performance.
High-quality, labelled data is crucial.
4. Computational Resources:
o Training complex models on large datasets requires significant computational
power.
5. Bias-Variance Trade-off:
o Balancing the complexity of the model to minimize both bias (error due to
overly simplistic models) and variance (error due to overly complex models)
is a key challenge.
Outline
Linear Regression is one of the most fundamental and widely used techniques in supervised
learning. It is used to model the relationship between a dependent variable (target) and one or
more independent variables (features) by fitting a linear equation to the observed data. The
goal is to predict the target variable based on the values of the features.
In linear regression, the relationship between the dependent variable Y and the independent
variable(s) X is modelled by a linear equation:
Y = β0 + β1X + ϵ
• Y: Dependent variable (what you are trying to predict)
• X: Independent variable (the feature used for prediction)
• β0: Intercept (the value of Y when X is 0)
• β1: Slope (the change in Y for a one-unit change in X)
• ϵ: Error term (the difference between the predicted and actual values)
Key Concepts:
• Intercept (β0 ): Represents the starting point of the line when X is zero.
• Slope (β1 ): Indicates how much Y changes for a unit change in X.
• Error Term (ϵ): Captures the variation in Y that cannot be explained by the linear
relationship with X.
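As a hedged sketch of how the intercept and slope can be estimated from data for a single feature, here is a least-squares fit with NumPy; the X and Y values are invented.

```python
# Illustrative only; X and Y values are invented.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # independent variable
Y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])     # dependent variable

beta1, beta0 = np.polyfit(X, Y, deg=1)       # least-squares estimates of slope and intercept
Y_pred = beta0 + beta1 * X                   # fitted line: Y = beta0 + beta1 * X
residuals = Y - Y_pred                       # the error term: variation not explained by the line

print(f"intercept beta0 = {beta0:.2f}, slope beta1 = {beta1:.2f}")
```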
1. House Price Prediction:
• Usage: Estimating the price of a house based on features like size, number of
bedrooms, and location.
2. Sales Forecasting:
• Usage: Predicting future sales figures based on historical sales data and market
trends.
3. Risk Management:
4. Medical Outcomes:
• Usage: Predicting patient outcomes (e.g., blood pressure) based on medical history
and lifestyle factors.
5. Market Analysis:
1. Data Collection:
• Gather a dataset that includes both the dependent variable and independent variables.
2. Data Preparation:
• Clean and preprocess the data, handle missing values, and ensure the data is suitable
for analysis.
3. Exploratory Data Analysis:
• Analyse the data to understand its structure, identify patterns, and check for
relationships between variables.
4. Model Training:
6. Prediction:
1. R-squared (R²):
• Represents the proportion of the variance for the dependent variable that's explained
by the independent variables.
• R² ranges from 0 to 1, with higher values indicating better model performance.
2. Root Mean Squared Error (RMSE):
• The square root of the average of the squared differences between predictions and
actual observations.
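A sketch of computing both metrics with scikit-learn; the actual and predicted values below are invented.

```python
# Illustrative only; actual and predicted values are invented.
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])   # observed values
y_pred = np.array([2.8, 5.4, 7.1, 9.3, 10.6])   # model predictions

r2 = r2_score(y_true, y_pred)                        # proportion of variance explained
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # root mean squared error

print(f"R^2 = {r2:.3f}, RMSE = {rmse:.3f}")
```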
1. Assumption Violations:
• Linear regression assumes linearity, independent errors, constant variance, and
normally distributed residuals; when these assumptions are violated, the results can be
misleading.
2. Outliers:
• Outliers can significantly affect the parameters of the linear regression model, leading
to biased predictions.
3. Multicollinearity:
• When independent variables are highly correlated with each other, the estimated
coefficients become unstable and difficult to interpret.
4. Overfitting:
• If the model is too complex, it may fit the training data too closely, capturing noise
rather than the underlying pattern.
5. Underfitting:
• If the model is too simple, it may not capture the underlying pattern in the data,
leading to poor predictive performance.
Outline
Unsupervised Learning is a type of machine learning where the model is trained using data
that consists only of input features and no corresponding output labels. The goal is to identify
patterns, structures, or relationships in the data without any prior knowledge of the results.
Key Concepts
1. Inputs (Features): The variables or attributes that are used to find patterns in the
data. Unlike supervised learning, there are no predefined labels or outcomes.
2. Clustering: The process of grouping similar data points together based on their
features. The aim is to maximize intra-cluster similarity and minimize inter-cluster
similarity.
3. Dimensionality Reduction: The process of reducing the number of random variables
under consideration, by obtaining a set of principal variables.
4. Association Rule Learning: The process of discovering interesting relations between
variables in large databases.
1. Clustering:
Clustering techniques group data points into clusters based on their similarities.
• Examples:
o Customer segmentation in marketing.
o Grouping similar documents for topic modelling.
Common Algorithms:
• K-Means Clustering
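A hedged sketch of K-Means customer segmentation with scikit-learn; the data points and the choice of two clusters are arbitrary illustrative assumptions.

```python
# Illustrative only; the data and the choice of 2 clusters are arbitrary.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# hypothetical customers described by (annual income, spending score)
X = np.array([[15, 39], [16, 81], [17, 6], [18, 77],
              [60, 40], [62, 42], [64, 38], [66, 45]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)                        # cluster assignment for each customer
print(kmeans.cluster_centers_)               # centre of each cluster
print(silhouette_score(X, kmeans.labels_))   # how well separated the clusters are (closer to 1 is better)
```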
2. Dimensionality Reduction:
Dimensionality reduction techniques reduce the number of features in the data while retaining
its essential characteristics.
• Examples:
o Reducing the complexity of data for visualization.
o Simplifying data for improved model performance.
Common Algorithms:
• Principal Component Analysis (PCA)
3. Association Rule Learning:
Association rule learning techniques discover interesting relationships between variables in
large datasets.
• Examples:
o Market basket analysis in retail.
o Recommender systems.
Common Algorithms:
• Apriori Algorithm
• Eclat Algorithm
• FP-Growth Algorithm
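To make the core measures behind these algorithms concrete, here is a tiny plain-Python sketch of support and confidence on invented transactions; it is not the Apriori algorithm itself, only the measures such algorithms build on.

```python
# Illustrative only; the transactions are invented.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))       # support of {bread, milk} = 3/5
print(confidence({"bread"}, {"milk"}))  # confidence of the rule bread -> milk = 3/4
```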
1. Customer Segmentation:
2. Anomaly Detection:
• Usage: Identifying unusual data points that do not fit the general pattern.
3. Data Visualization:
• Usage: Reducing the dimensionality of data for better visualization and interpretation.
• Example: Using PCA to visualize high-dimensional gene expression data in a 2D
plot.
4. Recommender Systems:
5. Topic Modelling:
6. Image Compression:
• Usage: Reducing the size of image files while preserving important information.
• Example: Using autoencoders to compress and reconstruct images with minimal loss
of quality.
7. Genomics:
1. Data Collection:
2. Data Preparation:
• Clean and preprocess the data, handle missing values, and normalize the data.
3. Exploratory Data Analysis:
• Analyse the data to understand its structure and identify any patterns or anomalies.
4. Model Training:
5. Model Evaluation:
• Evaluate the model’s performance using metrics specific to the chosen technique
(e.g., silhouette score for clustering).
6. Interpretation:
1. No Ground Truth:
• Because no labels are provided, there is no objective way to verify whether the
discovered structure is meaningful or correct.
2. Algorithm Selection:
• Selecting the appropriate algorithm for a specific task requires expertise and
experimentation.
3. Choosing the Number of Clusters:
• Deciding the optimal number of clusters in clustering algorithms can be difficult and
often requires domain knowledge.
4. High Dimensionality:
• High-dimensional data can complicate the analysis and may require dimensionality
reduction techniques to simplify.
5. Interpretability:
6. Computational Complexity:
Outline
Deep Learning is a subset of machine learning that uses neural networks with many layers to
analyse various kinds of data. These layers of networks enable computers to perform tasks
like image recognition, language translation, and playing games. The "deep" in deep learning
refers to the number of layers through which the data is transformed.
Deep learning models are inspired by the human brain and are designed to simulate how we
learn and process information. This allows machines to perform complex tasks by
understanding intricate patterns in data, which traditional machine learning models might
struggle with.
Neurons
Neurons are the basic units of a neural network. Each neuron receives inputs, processes them,
and produces an output. Think of neurons as tiny decision-makers that look at the data they
receive and decide whether to pass it on to the next layer or not.
Layers
1. Input Layer: This is where the network receives the raw data. For example, if you
are feeding in images, the input layer would receive pixel values.
2. Hidden Layers: These are intermediate layers where neurons process the inputs from
the previous layer. A deep learning model can have many hidden layers, which helps
it learn more complex patterns.
3. Output Layer: This layer produces the network's final result, such as class
probabilities for a classification task or a predicted value for a regression task.
Activation Functions
Activation functions determine whether a neuron should be activated or not. They introduce
non-linearity into the network, which allows it to learn more complex patterns. Common
activation functions include:
1. ReLU (Rectified Linear Unit): Outputs the input directly if it is positive; otherwise,
it outputs zero. This helps the network deal with the problem of vanishing gradients,
where gradients (used to update the model) become too small for effective learning.
2. Sigmoid: Outputs a value between 0 and 1, which is useful for binary classification
tasks.
3. Tanh (Hyperbolic Tangent): Outputs values between -1 and 1 and is often used in
hidden layers.
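A small NumPy sketch of the three activation functions described above:

```python
# Simple NumPy implementations of the activation functions described above.
import numpy as np

def relu(x):
    return np.maximum(0, x)      # passes positive inputs through unchanged, outputs zero otherwise

def sigmoid(x):
    return 1 / (1 + np.exp(-x))  # squashes inputs into the range (0, 1)

def tanh(x):
    return np.tanh(x)            # squashes inputs into the range (-1, 1)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), sigmoid(x), tanh(x), sep="\n")
```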
1. Input Data:
o The process begins with input data being fed into the network. This could be
images, text, or any other type of data.
2. Forward Propagation:
o The data passes through the layers of the network. Each neuron in a layer
processes the data, applies weights (which determine the importance of
inputs), adds a bias (a constant value to adjust the output), and then applies an
activation function to produce an output.
o This process repeats as the data moves through each layer, transforming and
combining the information to learn complex features and patterns.
3. Output Generation:
o The final layer produces the output. For a classification task, this could be the
probability of different classes (e.g., cat vs. dog). For a regression task, it
might be a predicted value (e.g., house price).
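A minimal sketch of forward propagation through one hidden layer; the weights, biases, and input values are invented, whereas a real network would learn the weights and biases during training.

```python
# Illustrative forward pass through a tiny network; all numbers are invented.
import numpy as np

def relu(x):
    return np.maximum(0, x)

x = np.array([0.5, -1.2, 3.0])        # input data (3 features)

W1 = np.array([[0.2, -0.4, 0.1],      # weights of the hidden layer (2 neurons x 3 inputs)
               [0.7,  0.3, -0.5]])
b1 = np.array([0.1, -0.2])            # biases of the hidden layer

W2 = np.array([[0.6, -0.8]])          # weights of the output layer (1 neuron x 2 hidden neurons)
b2 = np.array([0.05])

h = relu(W1 @ x + b1)                 # hidden layer: apply weights, add bias, then activation
y = W2 @ h + b2                       # output layer: e.g. a single predicted value for regression

print(y)
```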
Linear regression is a fundamental technique often used to introduce the concepts of neural
networks and machine learning. Here’s how it works and an example of its application:
Purpose: Linear regression aims to model the relationship between a dependent variable
(target) and one or more independent variables (features) by fitting a linear equation to the
observed data.
Example Scenario: Predicting house prices based on features like size, number of rooms,
and location.
1. Input Data:
o Collect data on house prices and their corresponding features (size, number of
rooms, location).
2. Model Representation:
o Represent the relationship between the house price (Y) and its features (X1,
X2, X3) using a linear equation of the form Y = w1X1 + w2X2 + w3X3 + b,
where the w values are weights and b is the bias.
Example Data:
• Assume we have data for three houses with the following features:
o House 1: Size = 1500 sq. ft, Rooms = 3, Location = 2 (coded value)
o House 2: Size = 2000 sq. ft, Rooms = 4, Location = 3 (coded value)
Model Training:
• The linear regression model learns the weights and bias from the data:
Making Predictions:
• For a new house with Size = 1800 sq. ft, Rooms = 3, Location = 2:
This simplified example illustrates how linear regression can be used to make predictions
based on the relationship learned from the data.
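Since the specific learned weights are not reproduced above, here is a hedged scikit-learn sketch of the same idea; the house data and the resulting weights are invented, not the compendium's figures.

```python
# Illustrative only; the house data and resulting weights are invented.
import numpy as np
from sklearn.linear_model import LinearRegression

# features: [size in sq. ft, number of rooms, coded location]
X = np.array([[1500, 3, 2],
              [2000, 4, 3],
              [1200, 2, 1],
              [1800, 3, 2],
              [2400, 5, 3]])
y = np.array([300_000, 420_000, 220_000, 350_000, 500_000])  # hypothetical prices

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # learned weights (one per feature) and bias

new_house = np.array([[1800, 3, 2]])   # Size = 1800 sq. ft, Rooms = 3, Location = 2
print(model.predict(new_house))        # predicted price for the new house
```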
1. Data:
o High-quality, labelled data is crucial for training neural networks. The more
data, the better the network can learn.
2. Computational Power:
o Training deep networks requires significant computational resources, typically
involving GPUs or specialized hardware like TPUs.
3. Frameworks and Libraries:
o Popular frameworks such as TensorFlow, PyTorch, and Keras provide tools
and functions to build and train neural networks efficiently.
4. Hyperparameters:
o These are settings that must be specified before training begins, such as the
number of layers, number of neurons per layer, learning rate, and batch size.
5. Training Time:
o Deep networks can take a long time to train, depending on the complexity of
the model and the size of the data.
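To illustrate what defining such a network looks like in one of these frameworks, here is a minimal Keras sketch; the layer sizes, learning rate, and input shape are arbitrary illustrative choices, not recommendations.

```python
# Minimal Keras sketch; the architecture and hyperparameters are arbitrary illustrative choices.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),            # input layer: 10 features
    tf.keras.layers.Dense(32, activation="relu"),  # hidden layer with ReLU activation
    tf.keras.layers.Dense(16, activation="relu"),  # second hidden layer
    tf.keras.layers.Dense(1),                      # output layer: a single predicted value
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # learning rate: a hyperparameter
              loss="mse")

model.summary()
# training would then look like: model.fit(X_train, y_train, epochs=20, batch_size=32)
```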
Outline
Large Language Models (LLMs) are advanced AI systems designed to process and generate
human language. They are trained on vast amounts of text data and can understand and
produce text that is coherent and contextually relevant. These models have revolutionized the
field of natural language processing (NLP) by enabling machines to perform tasks that
require a deep understanding of language.
Generative Pre-trained Transformer (GPT) models are a prominent family of LLMs whose
name reflects their three defining characteristics.
Key Features:
• Generative: GPT can generate text that continues from a given prompt, making it
useful for tasks like writing essays, composing emails, and creating dialogue.
• Pre-trained: GPT is initially trained on a large corpus of text data, learning the
nuances of language without any specific task in mind.
• Transformer Architecture: This architecture enables GPT to handle long-range
dependencies in text, making it highly effective at understanding and generating
language.
1. Pre-training Phase
During the pre-training phase, GPT is exposed to a massive dataset containing diverse text
from the internet. The model learns to predict the next word in a sentence, given the previous
words in that sentence.
2. Fine-tuning Phase
After pre-training, GPT is fine-tuned on a more specific dataset, often with human feedback,
to adjust its performance for tasks. This phase helps the model specialize in tasks like
question answering, summarization, or sentiment analysis.
3. Input Processing
When GPT receives an input, it processes the text through its layers. Each layer consists of
neurons that transform the input data. The model pays attention to different parts of the input
text using mechanisms called attention heads. This helps GPT understand which words are
important and how they relate to each other.
• Tokenization: The input text is broken down into tokens (words or subwords).
• Embedding: Tokens are converted into numerical vectors that the model can process.
• Attention Mechanism: The model uses self-attention to focus on relevant parts of the
text, enhancing its understanding of context and relationships.
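For intuition, here is a small NumPy sketch of the scaled dot-product self-attention computation at the heart of the transformer; the token vectors and projection matrices are random stand-ins for the embeddings and weights a trained model would have learned.

```python
# Illustrative scaled dot-product self-attention; the "embeddings" are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))          # 4 tokens, each an 8-dimensional embedding

Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))  # projection matrices (random here)
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv           # queries, keys, values

scores = Q @ K.T / np.sqrt(K.shape[-1])   # how strongly each token attends to every other token
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax over each row
output = weights @ V                      # each token becomes a weighted mix of the values

print(weights.round(2))                   # attention weights: each row sums to 1
```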
4. Generating Output
Based on the processed input, GPT generates a relevant and coherent response. It uses the
learned patterns and knowledge from the training phases to produce text that fits the context
of the input.
• Decoding: The model generates the output token by token, each time considering the
previous tokens.
• Beam Search: A technique that helps in generating the most likely sequence of
words.
• Output: The final generated text is assembled from the individual tokens.
Example:
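As a hedged illustration, assuming the Hugging Face transformers library and the publicly available gpt2 model (neither of which is prescribed by this compendium), generating text from a prompt might look like this:

```python
# Illustrative sketch using the Hugging Face transformers library and the public gpt2 model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Data analytics matters because"
result = generator(prompt, max_new_tokens=40)   # the model continues the prompt token by token

print(result[0]["generated_text"])
```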
Applications of GPT
2. Content Creation:
• Usage: GPT can assist writers by generating ideas, drafting articles, or even writing
entire pieces based on prompts.
• Example: A marketer uses GPT to generate a blog post about the benefits of a new
product.
3. Translation:
• Usage: Translating text from one language to another while preserving the meaning
and context.
• Example: GPT translates an English document into Spanish for a global audience.
5. Personal Assistants:
• Usage: Virtual assistants like Siri and Alexa use GPT to understand and respond to
user commands.
• Example: A user asks their virtual assistant to set a reminder for a meeting, and it
schedules the reminder accordingly.
1. Bias:
• LLMs like GPT can inadvertently learn and reproduce biases present in the training
data. It is crucial to continuously monitor and mitigate these biases to ensure fair and
ethical use.
2. Data Privacy:
• Ensuring that the data used for training and the interactions with the model respect
user privacy is vital. Users should be informed about how their data is used and
protected.
3. Resource Intensive:
4. Interpretability:
5. Ethical Use:
• It is essential to use LLMs responsibly, ensuring they are not employed for malicious
purposes, such as generating fake news or harmful content.
Puzzle: How do you measure exactly 4 Liters using only a 7-liter jar and a 5-liter jar, with no
markings to measure intermediate amounts?
Solution:
1. Fill the 7-liter jar completely.
2. Pour from the 7-liter jar into the 5-liter jar until the 5-liter jar is full, leaving 2 liters in the
7-liter jar.
3. Empty the 5-liter jar and pour the 2 liters from the 7-liter jar into it.
4. Fill the 7-liter jar again and pour from it into the 5-liter jar until the 5-liter jar is full; this
takes 3 liters.
5. Exactly 4 liters now remain in the 7-liter jar.
Puzzle: You have two ropes and a lighter. Each rope takes exactly one hour to burn, but they
burn at inconsistent rates along their length. How can you measure exactly 45 minutes?
Solution:
1. Light both ends of the first rope and one end of the second rope simultaneously.
2. The first rope will burn completely in 30 minutes because it is burning from both ends.
3. When the first rope is completely burned, light the other end of the second rope.
4. The second rope will take another 15 minutes to burn completely from both ends.
5. In total, this process will measure exactly 45 minutes.
Puzzle: You are in a room with three light switches, all of which are off. Each switch
controls one of three light bulbs in another room. You cannot see the bulbs from where you
are. How can you determine which switch controls which bulb if you can only enter the room
with the bulbs once?
Solution:
1. Turn on the first switch and leave it on for several minutes, then turn it off.
2. Turn on the second switch and leave it on.
3. Enter the room with the bulbs: the bulb that is lit is controlled by the second switch, the
bulb that is off but warm is controlled by the first switch, and the bulb that is off and cold is
controlled by the third switch.
Puzzle: A farmer needs to transport a wolf, a goat, and a cabbage across a river using a boat.
The boat can only carry the farmer and one other item. If left alone, the wolf will eat the goat,
and the goat will eat the cabbage. How can the farmer get all three across the river safely?
Solution:
1. Take the goat across the river and leave it on the other side.
2. Go back alone and take the wolf across the river.
3. Leave the wolf on the other side but take the goat back with you.
4. Leave the goat on the starting side and take the cabbage across the river.
5. Leave the cabbage with the wolf on the other side and go back alone.
6. Finally, take the goat across the river.
Puzzle: Four people need to cross a narrow bridge at night. They have only one torch, and
the bridge is too dangerous to cross without it. The bridge can hold a maximum of two people
at a time. The four people take 1, 2, 7, and 10 minutes respectively to cross. When two people
cross together, they must move at the slower person's pace. How can they all get across the
bridge in the least amount of time?
Solution:
1. The 1-minute and 2-minute people cross together (2 minutes).
2. The 1-minute person returns with the torch (1 minute).
3. The 7-minute and 10-minute people cross together (10 minutes).
4. The 2-minute person returns with the torch (2 minutes).
5. The 1-minute and 2-minute people cross together again (2 minutes).
Total time: 2 + 1 + 10 + 2 + 2 = 17 minutes.