Data Analytics Compendium BITeSys 2024


Overview

"Information is the oil of the 21st century, and analytics is the combustion engine”
- Peter Sondergaard

Dear PGP 2024-26,

In today's dynamic world, data is the cornerstone of informed decision-making. No matter your career path, mastering data handling, analysis, and modelling is essential for generating insights and driving success.

How to Use This Compendium

This compendium is your guide to prepare for placement interviews. It covers:

1. Foundational Knowledge:
o Basics of statistics and data analysis.
o Key concepts in machine learning (ML) and artificial intelligence (AI),
including supervised and unsupervised learning, deep learning, and neural
networks.
2. Advanced Topics:
o Exploration of large language models (LLMs) and generative pre-trained
transformers (GPT).
o Latest trends and developments in data analytics and AI.

Guidance and Support

Use this compendium as a starting point and seek out additional resources to expand your
knowledge. Feel free to reach out to us with any questions. We are here to support you
throughout your learning journey.

Your ability to analyse and interpret data will distinguish you in the professional world. This
compendium is designed to help you succeed in your placement process.

Best regards,

Team bITeSys



Contents

1. Statistical Learning: Introduction to Statistics


2. Statistical Learning: Measures of Central Tendency,
Dispersion, and Correlation
3. Data Processing and Visualisation techniques
4. Understanding AI, Machine Learning, and Deep Learning
5. Understanding Machine Learning: Can We Learn About
the World Using Data?
6. Supervised Learning: Understanding and Applications
7. Linear Regression: An Introduction
8. Unsupervised Learning: An Introduction
9. Deep Learning and Neural Networks: An In-Depth
Overview
10. Large Language Models (LLMs) and GPT: A Detailed Overview
11. Additional Resources
12. Key Interview Questions for Data Analytics Preparation
13. Logic Puzzles for Analytics Interviews



1. Statistical Learning: Introduction to Statistics

Outline

1. Why Statistics?
2. Statistical Methods
3. Types of Statistics: Descriptive and Inferential Statistics
4. Data Sources and Types of Datasets
5. Attributes of Datasets

Why Is Statistics So Important?

Statistics play a pivotal role in modern analytical decision-making processes, driven by three
significant events:

Event 1: Technological Developments

• Revolution of Internet and Social Networks: The explosion of internet usage and
social media platforms has led to an unprecedented amount of data being generated.
Platforms like Facebook, Twitter, and Instagram produce massive amounts of user-
generated content daily. Each post, comment, like, and share contributes to a growing
dataset that can provide deep insights into human behaviour and social trends.
• Mobile Phones and Electronic Devices: The proliferation of smartphones and other
electronic devices contributes significantly to data generation. Every interaction,
search, and transaction produces data. For example, GPS data from mobile phones can
track movement patterns, while app usage data reveals user preferences and habits.
• Insight Discovery: Organizations harness this data to uncover patterns and trends.
These insights help in improving profitability, understanding customer expectations,
and appropriately pricing products. For example, e-commerce companies analyse
browsing and purchasing behaviour to recommend products, while social media
platforms use data to personalize user feeds. This strategic use of data allows
companies to gain a competitive advantage in the marketplace.

Event 2: Advances in Computing Power

• Massive Data Processing: Enhanced computing capabilities now allow for the
processing and analysis of large datasets that were previously unmanageable. This
includes advancements in both hardware (e.g., GPUs, TPUs) and software (e.g.,
distributed computing frameworks like Hadoop and Spark) that enable complex
calculations to be performed quickly.
• Sophisticated Algorithms: The development of faster and more efficient algorithms
has significantly improved problem-solving capabilities. Algorithms can now handle
large volumes of data and provide more accurate predictions and analyses. For
example, machine learning algorithms like deep learning can process vast amounts of
unstructured data, such as images and text, to identify patterns and make predictions.



• Data Visualization: Improved data visualization techniques have bolstered Business
Intelligence (BI) and Artificial Intelligence (AI) efforts. Visualization tools help in
understanding complex data through graphical representations, making it easier to
identify trends and patterns. Tools like Tableau, Power BI, and D3.js allow users to
create interactive and dynamic visualizations that facilitate better decision-making.

Event 3: Data Storage and Computing Innovations

• Large Data Storage: Advances in storage technology, such as cloud storage solutions, enable the handling of vast amounts of data. Organizations can now store
and retrieve large datasets with ease. Technologies like Amazon S3, Google Cloud
Storage, and Microsoft Azure offer scalable storage solutions that can grow with an
organization’s needs.
• Parallel and Cloud Computing: These technologies, coupled with improved
computer hardware, allow for solving large-scale problems more efficiently. Parallel
computing enables multiple processes to be executed simultaneously, while cloud
computing provides scalable resources on demand. For example, platforms like
Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure
offer services that support parallel processing and distributed computing.
• Efficiency in Problem Solving: These innovations ensure that large-scale problems
are solved faster than ever before without sacrificing accuracy or performance. This is
particularly important in fields like genomics, climate modelling, and financial
forecasting, where timely and accurate analysis of large datasets is critical.

Big Data

Big Data refers to datasets that cannot be managed, processed, or analysed with traditional
software or algorithms within a reasonable timeframe. The concept of Big Data is defined by
the following characteristics:

• Volume: The sheer amount of data being generated is enormous.


• Velocity: The speed at which data is generated and processed.
• Variety: The different types of data, which can be structured (databases), semi-
structured (XML files), or unstructured (text, images, videos).
• Value: The potential insights derived from the data. Big Data analytics can uncover
valuable information that can lead to better decision-making.
• Veracity: The accuracy and trustworthiness of the data. Ensuring data quality is
crucial for reliable analysis.

Examples:

• Walmart: Handles over one million purchase transactions per hour, generating
massive amounts of transactional data. This data is used to optimize inventory
management, improve supply chain efficiency, and personalize marketing efforts.
• Facebook: Processes more than 250 million picture uploads per day, showcasing the
volume and variety of data. Analysing this data helps Facebook improve user
experience through targeted ads, personalized content, and enhanced security
measures.



Statistical Methods

Classification:

• Purpose: Segments customers into groups based on key characteristics. This helps in
targeting specific customer segments with tailored marketing strategies.
• Applications:
o Customer Segmentation: Organizations can segment customers into Long
Term Customers, Medium Term Customers, and Brand Switchers. This
segmentation helps in designing loyalty programs, targeted promotions, and
personalized communications.
o Buyers and Non-Buyers: Classification models can differentiate between
customers who are likely to make a purchase and those who are not. This
helps in optimizing marketing spend by focusing efforts on high-potential
customers.
• Benefits: Helps professionals understand customer behaviour, allowing them to better
position their products and brands. For example, a company could develop different
marketing strategies for long-term loyal customers versus occasional buyers. By
understanding the characteristics of each segment, businesses can tailor their offerings
and communications to better meet customer needs.

Pattern Recognition:

• Purpose: Reveals hidden patterns in data that might not be immediately obvious.
• Techniques:
o Histogram: Visualizes the distribution of data. For example, a histogram of
customer incomes might reveal a bell curve or skewed distribution, providing
insights into income inequality and purchasing power.
o Box Plot: Identifies outliers and provides a summary of the data distribution.
Box plots are useful for comparing distributions across different groups or
time periods.
o Scatter Plot: Captures relationships between two variables, such as age and
expenditure. Scatter plots can help identify correlations and trends that inform
business decisions.
• Benefits: Visual analytics provide clear insights that can be leveraged by retail
professionals. For instance, recognizing spending patterns among different age groups
can inform targeted promotions. By visualizing data, businesses can quickly identify
and act on opportunities and challenges.

Association:

• Purpose: Determines relationships between items, identifying which items are frequently bought together.
• Applications:
o Market Basket Analysis: Identifies probabilities of items being bought
together. For example, customers buying coffee might also buy bread. This
information can be used to design product bundles and promotions that
increase sales.



• Benefits: Helps in store layout decisions, bundling items, and planning discounts and
promotions. For example, placing related items next to each other in a store can
increase sales. Understanding associations between products can also inform
inventory management and supply chain decisions, ensuring that frequently bought
together items are always in stock.

Predictive Modelling:

• Purpose: Predicts future outcomes and facilitates customer segmentation.


• Techniques:
o Regression: Predicts expenditure based on input variables like income, age,
and gender. Regression models can identify key drivers of behavior and
quantify their impact.
o Advanced Models: Logistic Regression and Neural Networks predict target
variables and classify consumers (e.g., buyers vs. non-buyers, defaulters vs.
non-defaulters). These models can handle complex relationships and
interactions between variables, providing more accurate predictions.
• Benefits: Identifies and targets the most profitable customers, aiding in strategic
decision-making. For example, predicting which customers are likely to default on a
loan can help in risk management. Predictive models can also be used to forecast
demand, optimize pricing, and improve customer retention.

Types of Statistics

Descriptive Statistics:

• Purpose: Summarizes and organizes data so it can be easily understood.


• Techniques:
o Measures of Central Tendency: Mean, median, and mode describe the centre
of the data. These measures provide a snapshot of the data's typical values.
o Measures of Variability: Range, variance, and standard deviation indicate the
spread of the data. Understanding variability helps in assessing the consistency
and reliability of the data.
o Visualization: Charts, graphs, and tables help present data clearly.
Visualizations make it easier to communicate findings and support decision-
making.

Inferential Statistics:

• Purpose: Makes inferences and predictions about a population based on a sample of data.
• Techniques:
o Hypothesis Testing: Determines the likelihood that a hypothesis about a data
set is true. Hypothesis tests help in making data-driven decisions by assessing
the strength of evidence against a null hypothesis.
o Confidence Intervals: Provides a range of values that is likely to contain the
population parameter. Confidence intervals quantify the uncertainty associated
with sample estimates, providing a measure of precision.



o Regression Analysis: Examines the relationship between variables to make
predictions. Regression models can identify key factors that influence
outcomes and estimate their effects.

Data Sources and Types of Datasets

Data Sources:

• Primary Data: Collected directly from first-hand experience. Examples include surveys, experiments, and observations. Primary data is tailored to specific research
questions but can be time-consuming and expensive to collect.
• Secondary Data: Collected from existing sources like books, articles, and databases.
Examples include government reports, academic studies, and commercial datasets.
Secondary data is readily available and cost-effective but may not perfectly match the
research needs.

Types of Datasets:

• Structured Data: Organized in rows and columns (e.g., spreadsheets, SQL databases). Structured data is easy to analyse using traditional statistical methods and
tools.
• Unstructured Data: Not organized in a pre-defined manner (e.g., text documents,
images, videos). Unstructured data requires advanced techniques, such as natural
language processing (NLP) and image recognition, to extract meaningful insights.
• Semi-Structured Data: Contains both structured and unstructured elements (e.g.,
JSON, XML files). Semi-structured data bridges the gap between structured and
unstructured data, allowing for more flexible and comprehensive analysis.

Attributes of Datasets

• Quality: Accuracy and reliability of the data. High-quality data is free from errors
and inconsistencies, providing a solid foundation for analysis.
• Relevance: The importance of the data in relation to the problem being analysed.
Relevant data addresses specific research questions and objectives.
• Timeliness: How up to date the data is. Timely data reflects current conditions and
trends, ensuring that analysis and decisions are based on the latest information.
• Completeness: The extent to which all required data is present. Complete data
includes all necessary variables and observations, reducing the risk of bias and
missing information.
• Consistency: The uniformity of the data across different sources. Consistent data
maintains the same formats and definitions, facilitating integration and comparison
across datasets.



2. Statistical Learning: Measures of Central Tendency, Dispersion, and
Correlation

Outline

1. Raw Data
2. Frequency Distribution - Histograms
3. Cumulative Frequency Distribution
4. Measures of Central Tendency
5. Mean, Median, Mode
6. Measures of Dispersion
7. Range, IQR, Standard Deviation, Coefficient of Variation
8. Normal Distribution, Chebyshev Rule
9. Five Number Summary, Boxplots, QQ Plots, Quantile Plot, Scatter Plot
10. Scatter Plot Matrix
11. Correlation Analysis

Data versus Information

When analysts encounter a plethora of data that initially seems nonsensical, they seek
methods to classify and organize this data to convey meaningful insights. The objective is to
transform raw data into information that aids in drawing accurate conclusions.

Raw Data

Raw Data represents numbers and facts in their original format as collected. This data needs
to be processed and converted into information for effective decision-making.

Frequency Distribution

Frequency Distribution is a summarized table where raw data is arranged into classes with
corresponding frequencies. This technique classifies raw data into usable information and is
widely used in descriptive statistics.

Histogram

A Histogram (or frequency histogram) is a graphical representation of a frequency distribution. The X-axis represents the classes, and the Y-axis represents the frequencies as
bars. It visually depicts the pattern of distribution for the measured characteristic.



Cumulative Frequency Distribution

A Cumulative Frequency Distribution shows how many observations fall below the upper
boundary of each class.
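
A short pandas sketch (the exam scores are made-up values, and a class width of 10 is an assumption for illustration) of how raw data is arranged into a frequency distribution and a cumulative frequency distribution:

```python
import pandas as pd

# Made-up raw data: 20 exam scores
scores = [45, 52, 58, 61, 63, 65, 67, 70, 71, 72,
          74, 75, 77, 79, 81, 84, 86, 88, 92, 97]

# Arrange the raw data into classes of width 10: [40, 50), [50, 60), ..., [90, 100)
classes = pd.Series(pd.cut(scores, bins=range(40, 101, 10), right=False))

freq = classes.value_counts().sort_index()   # frequency distribution
cum_freq = freq.cumsum()                     # cumulative frequency distribution

print(pd.DataFrame({"frequency": freq, "cumulative frequency": cum_freq}))
```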

Measures of Central Tendency

Central tendency measures describe the centre point of a data set. Common measures include
the mean, median, and mode.

Arithmetic Mean:

The mean is the sum of all observations divided by the number of observations.

Median:

The median is the middle value when data is arranged in ascending order. It divides the data
set into two equal parts.

Mode:

The mode is the value that occurs most frequently in a data set.

Comparison of Mean, Median, Mode

• Mean: the arithmetic average of all observations. Affected by extreme values: Yes. Amenable to further algebraic treatment: Yes.
• Median: the middle value in the ordered data set. Affected by extreme values: No. Amenable to further algebraic treatment: No.
• Mode: the most frequently occurring value. Affected by extreme values: No. Amenable to further algebraic treatment: No.



Measures of Dispersion

Dispersion measures indicate the spread of data around the central tendency. They help in
understanding the variability within a data set.

Range:

The range is the difference between the maximum and minimum values in a data set.

Inter-Quartile Range (IQR):

The IQR is the range of the middle 50% of observations, calculated as the difference between
the third quartile (Q3) and the first quartile (Q1).

Standard Deviation and Variance:

Standard deviation measures the typical deviation of data points from the mean, computed as the square root of the average squared deviation. Variance is the square of the standard deviation.

Coefficient of Variation (CV):

CV is the ratio of the standard deviation to the mean, expressed as a percentage.
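
These measures can be computed in a few lines of Python; below is a minimal sketch using the standard library and numpy on made-up values:

```python
import statistics as st
import numpy as np

data = [12, 15, 15, 18, 21, 24, 24, 24, 30, 35]   # made-up observations

mean   = st.mean(data)        # arithmetic mean
median = st.median(data)      # middle value
mode   = st.mode(data)        # most frequent value

arr = np.array(data, dtype=float)
data_range = arr.max() - arr.min()                # range
q1, q3 = np.percentile(arr, [25, 75])
iqr = q3 - q1                                     # inter-quartile range
std = arr.std(ddof=1)                             # sample standard deviation
cv  = std / mean * 100                            # coefficient of variation, in %

print(mean, median, mode, data_range, iqr, round(std, 2), round(cv, 1))
```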

The Empirical Rule and Chebyshev's Rule

Empirical Rule:

In a bell-shaped distribution:

• Approximately 68% of data is within 1 standard deviation of the mean.


• Approximately 95% of data is within 2 standard deviations of the mean.
• Approximately 99.7% of data is within 3 standard deviations of the mean.



Chebyshev's Rule:

Regardless of the shape of the data distribution, at least 1 - 1/k^2 of the values fall within k standard deviations of the mean, for any k > 1. For k = 2, this guarantees at least 75% of the values.
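
A small simulation sketch (assuming numpy; exact figures vary from run to run) contrasts the empirical rule with Chebyshev's guarantee for k = 2:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=100_000)        # bell-shaped data

within_2sd = np.mean(np.abs(x - x.mean()) <= 2 * x.std())
print(f"Fraction within 2 standard deviations: {within_2sd:.3f}")  # ~0.95 (empirical rule)
print(f"Chebyshev lower bound for k = 2: {1 - 1 / 2**2:.2f}")       # 0.75, for any distribution
```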

Five Number Summary and Boxplots

Five Number Summary:

The five-number summary consists of the minimum, the first quartile (Q1), the median, the third quartile (Q3), and the maximum.

Boxplot:

A boxplot graphically displays the five-number summary and shows the distribution shape,
spread, and potential outliers.



Scatter Plot Matrix

Scatter Plot:

A scatter plot shows the relationship between two variables, helping to identify clusters,
outliers, and correlations.

Scatter Plot Matrix:

A scatter plot matrix displays multiple scatter plots for pairs of variables, providing a
comprehensive view of relationships within the data set.



Correlation Analysis

Correlation Analysis for Nominal Data:

The Chi-Square test assesses the association between categorical variables.

Correlation Analysis for Numeric Data:

Scatter plots and correlation coefficients (r) quantify the strength and direction of the
relationship between numeric variables.
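
Both kinds of correlation analysis can be sketched with scipy; the contingency table and the age/expenditure values below are assumed, illustrative data:

```python
import numpy as np
from scipy.stats import chi2_contingency, pearsonr

# Nominal data: assumed 2x2 contingency table (e.g., gender vs. product preference)
table = np.array([[30, 20],
                  [15, 35]])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"Chi-square = {chi2:.2f}, p-value = {p_value:.4f}")

# Numeric data: Pearson correlation between two assumed variables
age         = [22, 25, 30, 35, 40, 45, 50]
expenditure = [1800, 2100, 2600, 3000, 3300, 3500, 3900]
r, p = pearsonr(age, expenditure)
print(f"Pearson r = {r:.2f} (p-value = {p:.4f})")
```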



3. Data Processing and Visualisation techniques

Outline

1. Data Cleaning
2. Data Handling
3. Data Wrangling
4. Data Repositories
5. Types of data repositories
6. Types of charts and data visualisation graphs

Data Cleaning:
This involves identifying and correcting errors, inconsistencies, duplication, missing values, and outliers in the dataset. Techniques include:
• Removing duplicate records: Ensures data uniqueness.
• Imputation: Filling in missing values using statistical methods or machine learning algorithms.
• Formatting corrections: Standardizing date formats, correcting typos, etc.
• Handling outliers: Identifying and addressing data points that significantly deviate from the norm.
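
A brief pandas sketch of these cleaning steps; the column names, values, and the 99th-percentile capping rule are assumptions chosen purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "signup_date": ["2024-01-05", "2024-02-10", "2024-02-10", "2024-03-10", None],
    "spend":       [120.0, 95.5, 95.5, None, 20000.0],
})

df = df.drop_duplicates()                               # remove duplicate records
df["spend"] = df["spend"].fillna(df["spend"].median())  # impute missing values with the median
df["signup_date"] = pd.to_datetime(df["signup_date"],   # standardize date formats
                                   errors="coerce")
cap = df["spend"].quantile(0.99)                        # handle outliers: cap extreme values
df["spend"] = df["spend"].clip(upper=cap)

print(df)
```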

Data Handling:
This encompasses the overall management of data from collection to storage, including:
• Data Collection: Gathering data from various sources such as databases, web scraping, and APIs.
• Data Storage: Using databases, data lakes, and warehouses to store large datasets efficiently.
• Data Security: Ensuring data privacy and protection through encryption, access controls, and compliance with regulations.

Data Wrangling:
This is the process of transforming and mapping data from one "raw" form into another format that is more appropriate and valuable for analysis or machine learning. It is also known as data munging, data cleaning, or data remediation. Techniques include:
• Data Transformation: Converting data into a suitable format for analysis, such as normalization, scaling, and encoding categorical variables.
• Data Integration: Combining data from different sources to create a unified dataset.
• Data Reduction: Reducing data volume while maintaining its integrity, using techniques like principal component analysis (PCA) and feature selection.
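
A compact sketch of the three wrangling steps with pandas and scikit-learn; the tables, column names, and the choice of MinMax scaling are illustrative assumptions:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

orders = pd.DataFrame({"customer_id": [1, 2, 3],
                       "amount": [250.0, 40.0, 980.0],
                       "channel": ["web", "store", "web"]})
profiles = pd.DataFrame({"customer_id": [1, 2, 3],
                         "age": [34, 51, 27]})

# Transformation: scale a numeric column and encode a categorical one
orders["amount_scaled"] = MinMaxScaler().fit_transform(orders[["amount"]]).ravel()
orders = pd.get_dummies(orders, columns=["channel"])

# Integration: combine the two sources into a unified dataset
data = orders.merge(profiles, on="customer_id")

# Reduction: compress the numeric features into two principal components
features = data.drop(columns=["customer_id"]).astype(float)
reduced = PCA(n_components=2).fit_transform(features)
print(reduced.shape)   # (3, 2)
```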

Data Repositories
A data repository is the database infrastructure that collects, manages, and stores data sets. It is also known as a data archive or a data library.
Why do we need Data repositories?
1. Centralized data management: Ability to store all the data in one central location,
making it easier to access, manage and analyse.
2. Data Consistency: Consistency and accuracy of data across different departments
and systems.
3. Improved Decision Making: Comprehensive data enables informed decisions
regarding risk management, customer service, product development and more.
4. Regulatory Compliance: A single source of truth aids in auditing and reporting
purposes.
5. Cost Efficiency: Consolidating data reduces costs in terms of data storage, maintenance, and integration.

Types of data repository


• Relational Databases: Such as MySQL, PostgreSQL, which store structured data in
tables with predefined schemas.
• NoSQL Databases: Such as MongoDB, Cassandra, which store unstructured or semi-
structured data and are scalable for large data volumes.
• Data Warehouses: A data repository that stores structured historical data. With the help of an ETL (extract, transform, load) engine, data is transferred to the data warehouse from different data sources such as transactional databases, data banks, the web, log files, and data lakes. E.g., Amazon Redshift, Google BigQuery.
• Data Lakes: A data lake is a central repository that can store any kind of raw data, including structured, semi-structured, unstructured, and binary data coming from different sources. Raw data in data lakes can be the perfect grounds for conducting all kinds of analytical research that would not be possible with only a traditional data warehouse. E.g., Hadoop, Azure Data Lake.
• Data Mart: A subset of a data warehouse that is focused on a particular subject, department, or business area. A data mart makes specific data available to a defined group of users, helping them access insights quickly without searching an entire data warehouse.
• Data Cube: A multi-dimensional data structure used to summarize and retrieve information from a large set of complex data (for example, sales by product, region, and time).



Comparison of data repository types:

• Data Warehouse: Supports structured data. Purpose: historical data analytics, BI reporting, and data visualisation. Data quality: curated data (ready for use). Data sources: relational databases, transactional systems, and internal and external corporate systems.
• Data Lake: Supports structured, semi-structured, unstructured, and binary data. Purpose: big data analytics, advanced data analytics, data discovery, data storage, and data archiving. Data quality: raw data (not ready for use). Data sources: relational and non-relational databases, IoT devices, social media, images, etc.
• Data Mart: Supports highly structured data. Purpose: department-focused data analysis. Data quality: highly curated data. Data sources: relational databases, the data warehouse, and internal and external corporate systems.
• Data Cube: Supports highly structured data. Purpose: OLAP and insight generation. Data quality: highly curated data. Data sources: relational databases, transactional systems, and other operational databases.

Types of Charts and Data Visualization Graphs


There are multiple chart types available for visualising data; understanding them helps you choose the visualisation that best suits your dataset.

Pie Charts: Show parts or percentages as a part of a whole. Useful for proportional data when you want to illustrate the proportion of each category in the dataset. Ideal to use if you have fewer than 7-8 categories; otherwise the chart may lose clarity.
Bar Charts: A bar chart visually represents data using rectangular bars or columns. Here, the
length of each bar corresponds proportionally to its value. It is used for comparing quantities
of different categories by showing their relative sizes. A horizontal bar chart is better to use
when the category labels are lengthy.



Line Charts: Ideal for displaying trends and patterns over time, such as project timelines, production cycles, or population growth.
Scatter Plots: It shows the relationship between two variables. It is best for identifying
trends, correlations or potential clusters in the data.
Heatmaps: Heatmaps display data intensity using colour across a grid, and are commonly used to show the relationship between two variables across two dimensions.
Box Plots: Display the distribution of data based on a five-number summary. Good for comparing distributions and identifying outliers.
Histograms: A histogram visualizes the frequency distribution of a single variable.
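
A few of these chart types can be produced with matplotlib in a line or two each; the data values below are made up purely for illustration:

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales  = [120, 150, 90, 180]

plt.figure(); plt.bar(months, sales); plt.title("Bar chart: sales by month")
plt.figure(); plt.plot(months, sales, marker="o"); plt.title("Line chart: trend over time")
plt.figure(); plt.scatter([25, 32, 41, 55], [1.2, 1.8, 2.4, 3.1]); plt.title("Scatter plot: age vs. spend")
plt.figure(); plt.hist([3, 5, 5, 6, 7, 7, 7, 8, 9, 10], bins=5); plt.title("Histogram: frequency distribution")
plt.show()
```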



4. Understanding AI, Machine Learning, and Deep Learning

Outline

1. Artificial Intelligence (AI)
2. Machine Learning (ML)
3. Deep Learning (DL)
4. How They Relate to Each Other

Artificial Intelligence (AI)

Artificial Intelligence (AI) is a broad field of computer science focused on creating systems
capable of performing tasks that typically require human intelligence. These tasks include
reasoning, learning, problem-solving, perception, language understanding, and more. AI aims
to create machines that can mimic human cognitive functions.

Key Concepts:

• General AI: AI systems that possess the ability to perform any intellectual task that a
human can do. This remains largely theoretical.
• Narrow AI: AI systems designed for specific tasks, such as speech recognition,
image recognition, and language translation. This is where most current AI
applications lie.

Machine Learning (ML)

Machine Learning (ML) is a subset of AI that focuses on the development of algorithms and statistical models that enable computers to learn from and make predictions or decisions
based on data. Instead of being explicitly programmed to perform a task, ML algorithms use
data to identify patterns and improve their performance over time.

Key Concepts:

• Supervised Learning: Algorithms are trained on labelled data, meaning the input
data is paired with the correct output. The model learns to map inputs to outputs.
• Unsupervised Learning: Algorithms are trained on unlabelled data and must find
hidden patterns or intrinsic structures within the input data.
• Reinforcement Learning: Algorithms learn by interacting with an environment and
receiving rewards or penalties based on their actions.



Deep Learning (DL)

Deep Learning (DL) is a subset of machine learning that utilizes neural networks with many
layers (hence "deep") to model complex patterns in large amounts of data. These neural
networks, inspired by the human brain, are capable of learning hierarchical representations of
data, which makes them particularly effective for tasks like image and speech recognition.

Key Concepts:

• Neural Networks: Composed of layers of interconnected neurons, where each neuron processes inputs and passes the output to the next layer.
• Convolutional Neural Networks (CNNs): Specialized neural networks for
processing grid-like data such as images.
• Recurrent Neural Networks (RNNs): Specialized neural networks for sequential
data, such as time series or natural language.

How They Relate to Each Other

1. AI as the Umbrella Term:


o AI is the overarching field that encompasses all efforts to create machines that
can perform tasks requiring human intelligence. It includes a wide range of
techniques and technologies, from rule-based systems to advanced neural
networks.
2. Machine Learning within AI:
o Machine Learning is a subset of AI that focuses on algorithms and models that
can learn from data. ML represents a significant advancement in AI by
moving away from hard-coded rules and towards data-driven decision making.
o Example: An AI system for fraud detection might use ML algorithms to
analyse transaction data and identify suspicious patterns.
3. Deep Learning within Machine Learning:
o Deep Learning is a further specialization within ML that uses neural networks
with many layers to handle vast amounts of data and model complex patterns.
DL has been particularly successful in areas such as image and speech
recognition due to its ability to learn intricate representations of data.
o Example: An ML system for image recognition might use a deep learning
model like a CNN to classify images based on the objects they contain.



5. Understanding Machine Learning: Can We Learn About the World
Using Data?

Overview

1. Introduction to Model Building from Data


2. Challenges in Data
3. Overfitting vs. Underfitting
4. Machine Learning Tasks
o Supervised Learning
o Unsupervised Learning
5. Tools and Techniques

Introduction to Model Building from Data

Model Building from Data:

• Data as Input: Machine learning models take data as input to find patterns.
• Finding Patterns: The goal is to identify and summarize patterns in a mathematically
precise way.
• Automating Model Building: Machine learning automates the process of model
building, making it more efficient and scalable.

Challenges in Data

Data Contains Noise:

• Information + Noise: Data typically comprises both useful information and irrelevant noise.
• Challenge: The main challenge is to distil the information content while filtering out
the noise.
• Train and Test Approach: Machine learning employs a train and test approach to
address this challenge.

Overfitting vs. Underfitting

• Overfitting: When a model captures the noise along with the information, it is
overfitting. Overfitting leads to poor prediction performance on new, unseen data.
• Underfitting: When a model fails to capture all the relevant information, it is
underfitting. Underfitting also results in poor prediction performance.



• Ideal Model: The best model strikes a balance by capturing all relevant information
and ignoring the noise, ensuring good performance on testing data.

Machine Learning Tasks

1. Supervised Learning:

• Definition: Building a mathematical model using data that contains both inputs and
desired outputs (ground truth).
• Examples:
o Image classification (e.g., determining if an image contains a horse).
o Loan default prediction.
o Employee turnover prediction.
• Evaluation: Model performance can be evaluated by comparing predictions to the
actual desired outputs.

2. Unsupervised Learning:

• Definition: Building a mathematical model using data that contains only inputs and
no desired outputs.
• Purpose: To find structure in the data, such as grouping or clustering data points to
discover patterns.
• Example: An advertising platform segments the population into groups with similar
demographics and purchasing habits, aiding in targeted advertising.
• Evaluation: Since no labels are provided, there is no straightforward way to compare
model performance.

Tools and Techniques

Supervised Learning:

• Regression: Desired output is a continuous number.


o Example: Predicting house prices based on features like size and location.
• Classification: Desired output is a category.
o Example: Determining if an email is spam or not.

Unsupervised Learning:

• Clustering: Grouping data points based on similarity.


o Example: Customer segmentation for marketing purposes.
• Dimensionality Reduction: Compressing data to reduce the number of features.
o Example: Principal Component Analysis (PCA) to reduce the dimensions of a
dataset.
• Association Rule Learning: Discovering interesting relations between variables.
o Example: Market basket analysis to identify products frequently bought
together.



6. Supervised Learning: Understanding and Applications

Introduction to Supervised Learning

Supervised Learning is a type of machine learning where the model is trained using labeled
data. This means the training dataset includes both the input features (what we use to make
predictions) and the output labels (the actual outcomes). The goal of supervised learning is to
learn a mapping from inputs to outputs, allowing the model to make accurate predictions on
new, unseen data.

Key Concepts

1. Inputs (Features): The variables or attributes that are used to make predictions. For
example, in predicting house prices, features might include the size of the house, the
number of bedrooms, and the location.
2. Outputs (Labels): The target variable or outcome that the model aims to predict.
Continuing with the house price example, the output would be the actual price of the
house.
3. Training Phase: The process where the model learns the relationship between inputs
and outputs by being exposed to the training data.
4. Prediction Phase: After training, the model uses the learned relationships to predict
the output for new, unseen inputs.

Types of Supervised Learning

1. Regression:

• Purpose: Used when the output variable is continuous.


• Examples:
o Predicting house prices based on features like size and location.
o Forecasting sales figures based on historical data.

2. Classification:

• Purpose: Used when the output variable is categorical.


• Examples:
o Determining if an email is spam or not spam.
o Classifying images of animals into categories like cats and dogs.



Applications of Supervised Learning

1. Image Classification:

• Usage: Identifying objects in images.


• Example: Determining if an image contains a specific object, such as a cat or a dog.
This is widely used in social media platforms for tagging photos.

2. Spam Detection:

• Usage: Classifying emails as spam or not spam.


• Example: Email services like Gmail use supervised learning to filter out spam emails
from your inbox.

3. Sentiment Analysis:

• Usage: Analysing text data to determine the sentiment expressed.


• Example: Analysing customer reviews to gauge whether they are positive, negative,
or neutral. Companies use this to understand customer satisfaction and improve their
products.

4. Fraud Detection:

• Usage: Identifying fraudulent activities.


• Example: Banks use supervised learning to detect fraudulent transactions by
analysing patterns in transaction data.

5. Medical Diagnosis:

• Usage: Predicting diseases based on patient data.


• Example: Predicting whether a patient has a particular disease based on symptoms
and medical history. This helps doctors make more accurate diagnoses.

6. Stock Price Prediction:

• Usage: Forecasting future stock prices.


• Example: Financial analysts use supervised learning to predict stock prices based on
historical data and market indicators.

How Supervised Learning Works

1. Data Collection:
o Collect a dataset that includes both the inputs (features) and the corresponding
outputs (labels).
2. Data Preparation:
o Clean and preprocess the data to make it suitable for training. This might
include handling missing values, normalizing the data, and converting
categorical variables into numerical ones.



3. Model Training:
o Use the prepared data to train a machine learning model. The model learns the
relationship between the features and the labels during this phase.
4. Model Evaluation:
o Evaluate the model’s performance using a separate validation dataset. This
helps ensure that the model generalizes well to new data.
5. Prediction:
o Use the trained model to make predictions on new, unseen data.
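
Putting the five steps above together, here is a minimal scikit-learn sketch on its built-in Iris dataset; the choice of logistic regression and an 80/20 split are illustrative assumptions, not the only options:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1-2. Data collection and preparation: features X and labels y
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Model training
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 4. Model evaluation on held-out data
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 5. Prediction on a new, unseen observation
print("Predicted class:", model.predict([[5.1, 3.5, 1.4, 0.2]]))
```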

Challenges in Supervised Learning

1. Overfitting:
o Occurs when the model learns not only the underlying patterns but also the
noise in the training data, leading to poor performance on new data.
2. Underfitting:
o Happens when the model is too simple to capture the underlying patterns in
the data, resulting in poor performance on both training and test data.
3. Data Quality:
o The quality of the training data significantly impacts the model’s performance.
High-quality, labelled data is crucial.
4. Computational Resources:
o Training complex models on large datasets requires significant computational
power.
5. Bias-Variance Trade-off:
o Balancing the complexity of the model to minimize both bias (error due to
overly simplistic models) and variance (error due to overly complex models)
is a key challenge.



7. Linear Regression: An Introduction

Outline

1. Introduction to Linear Regression


2. Understanding the Linear Regression Model
3. Applications of Linear Regression
4. Steps in Building a Linear Regression Model
5. Evaluation of Linear Regression Models
6. Challenges in Linear Regression

Introduction to Linear Regression

Linear Regression is one of the most fundamental and widely used techniques in supervised
learning. It is used to model the relationship between a dependent variable (target) and one or
more independent variables (features) by fitting a linear equation to the observed data. The
goal is to predict the target variable based on the values of the features.

Understanding the Linear Regression Model

In linear regression, the relationship between the dependent variable Y and the independent
variable(s) X is modelled by a linear equation:

Y = β0 + β1X + ϵ

where:
• Y: Dependent variable (what you are trying to predict)
• X: Independent variable (the feature used for prediction)
• β0: Intercept (the value of Y when X is 0)
• β1: Slope (the change in Y for a one-unit change in X)
• ϵ: Error term (the difference between the predicted and actual values)

Key Concepts:

• Intercept (β0 ): Represents the starting point of the line when X is zero.
• Slope (β1 ): Indicates how much Y changes for a unit change in X.
• Error Term (ϵ): Captures the variation in Y that cannot be explained by the linear
relationship with X.
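
As a sketch, the intercept β0 and slope β1 can be estimated by least squares with numpy; the X and Y values below are assumed, illustrative data:

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5, 6], dtype=float)     # independent variable
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.8, 12.2])     # dependent variable

beta1, beta0 = np.polyfit(X, Y, deg=1)            # slope and intercept via least squares
Y_hat = beta0 + beta1 * X                         # fitted values
residuals = Y - Y_hat                             # the error term for each observation

print(f"Y = {beta0:.2f} + {beta1:.2f} * X")
```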



Applications of Linear Regression

1. Predicting House Prices:

• Usage: Estimating the price of a house based on features like size, number of
bedrooms, and location.

2. Sales Forecasting:

• Usage: Predicting future sales figures based on historical sales data and market
trends.

3. Risk Management:

• Usage: Assessing financial risks by predicting the probability of default based on factors like credit score and income.

4. Medical Outcomes:

• Usage: Predicting patient outcomes (e.g., blood pressure) based on medical history
and lifestyle factors.

5. Market Analysis:

• Usage: Estimating the impact of advertising spend on sales revenue.

Steps in Building a Linear Regression Model

1. Data Collection:

• Gather a dataset that includes both the dependent variable and independent variables.

2. Data Preparation:

• Clean and preprocess the data, handle missing values, and ensure the data is suitable
for analysis.

3. Exploratory Data Analysis (EDA):

• Analyse the data to understand its structure, identify patterns, and check for
relationships between variables.

4. Model Training:

• Split the data into training and testing sets.


• Use the training data to fit the linear regression model, estimating the parameters (β0
and β1).



5. Model Evaluation:

• Evaluate the model’s performance using the testing data.


• Check the goodness of fit, residuals, and other metrics to ensure the model is accurate.

6. Prediction:

• Use the trained model to make predictions on new, unseen data.

Evaluation of Linear Regression Models

1. R-squared (R²):

• Represents the proportion of the variance for the dependent variable that's explained
by the independent variables.
• R^2 ranges from 0 to 1, with higher values indicating better model performance.

2. Mean Absolute Error (MAE):

• Measures the average magnitude of the errors in a set of predictions, without considering their direction.

3. Mean Squared Error (MSE):

• Measures the average of the squares of the errors.

4. Root Mean Squared Error (RMSE):

• The square root of the average of squared differences between prediction and actual
observation.
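
All four metrics can be computed with scikit-learn and numpy; the actual and predicted values below are assumed for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

y_true = np.array([300, 250, 410, 380, 290])   # actual values (assumed)
y_pred = np.array([310, 240, 400, 395, 270])   # model predictions (assumed)

r2   = r2_score(y_true, y_pred)
mae  = mean_absolute_error(y_true, y_pred)
mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

print(f"R² = {r2:.3f}, MAE = {mae:.1f}, MSE = {mse:.1f}, RMSE = {rmse:.1f}")
```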

Challenges in Linear Regression

1. Assumption Violations:



• Linear regression assumes a linear relationship between the dependent and
independent variables. If this assumption is violated, the model may not perform well.

2. Outliers:

• Outliers can significantly affect the parameters of the linear regression model, leading
to biased predictions.

3. Multicollinearity:

• When independent variables are highly correlated, it becomes difficult to determine the individual effect of each variable on the dependent variable.

4. Overfitting:

• If the model is too complex, it may fit the training data too closely, capturing noise
rather than the underlying pattern.

5. Underfitting:

• If the model is too simple, it may not capture the underlying pattern in the data,
leading to poor predictive performance.



8. Unsupervised Learning: An Introduction

Outline

1. Introduction to Unsupervised Learning


2. Types of Unsupervised Learning
3. Applications of Unsupervised Learning
4. Steps in Building Unsupervised Learning Models
5. Challenges in Unsupervised Learning

Introduction to Unsupervised Learning

Unsupervised Learning is a type of machine learning where the model is trained using data
that consists only of input features and no corresponding output labels. The goal is to identify
patterns, structures, or relationships in the data without any prior knowledge of the results.

Key Concepts

1. Inputs (Features): The variables or attributes that are used to find patterns in the
data. Unlike supervised learning, there are no predefined labels or outcomes.
2. Clustering: The process of grouping similar data points together based on their
features. The aim is to maximize intra-cluster similarity and minimize inter-cluster
similarity.
3. Dimensionality Reduction: The process of reducing the number of random variables
under consideration, by obtaining a set of principal variables.
4. Association Rule Learning: The process of discovering interesting relations between
variables in large databases.

Types of Unsupervised Learning

1. Clustering:

Clustering techniques group data points into clusters based on their similarities.

• Examples:
o Customer segmentation in marketing.
o Grouping similar documents for topic modelling.

Common Algorithms:

• K-Means Clustering
• Hierarchical Clustering
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• Gaussian Mixture Models (GMM)
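
A minimal K-Means sketch with scikit-learn, using assumed two-feature customer data (annual spend and monthly visits):

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed customer data: [annual_spend, visits_per_month]
X = np.array([[200, 2], [220, 3], [250, 2],      # low spenders
              [900, 9], [950, 10], [1000, 8]])   # high spenders

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster labels:", kmeans.labels_)
print("Cluster centres:", kmeans.cluster_centers_)
```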

2. Dimensionality Reduction:

Dimensionality reduction techniques reduce the number of features in the data while retaining
its essential characteristics.

• Examples:
o Reducing the complexity of data for visualization.
o Simplifying data for improved model performance.

Common Algorithms:

• Principal Component Analysis (PCA)


• t-Distributed Stochastic Neighbour Embedding (t-SNE)
• Uniform Manifold Approximation and Projection (UMAP)
• Linear Discriminant Analysis (LDA)
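
A corresponding PCA sketch, projecting scikit-learn's built-in Iris data from four features down to two components for visualization:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)          # 150 samples, 4 features

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)                # compressed representation

print(X_2d.shape)                          # (150, 2)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```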

3. Association Rule Learning:

Association rule learning techniques identify interesting relations between variables in large datasets.

• Examples:
o Market basket analysis in retail.
o Recommender systems.

Common Algorithms:

• Apriori Algorithm
• Eclat Algorithm
• FP-Growth Algorithm

Applications of Unsupervised Learning

1. Customer Segmentation:

• Usage: Grouping customers based on purchasing behaviour and demographics to tailor marketing strategies.
• Example: An online retailer uses clustering to identify customer segments for
personalized marketing campaigns.

2. Anomaly Detection:

• Usage: Identifying unusual data points that do not fit the general pattern.



• Example: Detecting fraudulent transactions in banking by identifying outliers in
transaction data.

3. Data Visualization:

• Usage: Reducing the dimensionality of data for better visualization and interpretation.
• Example: Using PCA to visualize high-dimensional gene expression data in a 2D
plot.

4. Recommender Systems:

• Usage: Finding associations between items to recommend products or content to users.
• Example: Amazon uses association rule learning to suggest products frequently
bought together.

5. Topic Modelling:

• Usage: Identifying topics within a large collection of documents.


• Example: Grouping news articles by topics to provide a summary of daily news.

6. Image Compression:

• Usage: Reducing the size of image files while preserving important information.
• Example: Using autoencoders to compress and reconstruct images with minimal loss
of quality.

7. Genomics:

• Usage: Analysing genetic data to identify patterns and associations.


• Example: Clustering genes with similar expression patterns to understand their
functions.

Steps in Building Unsupervised Learning Models

1. Data Collection:

• Gather a dataset that includes the features you want to analyse.

2. Data Preparation:

• Clean and preprocess the data, handle missing values, and normalize the data.

3. Exploratory Data Analysis (EDA):

• Analyse the data to understand its structure and identify any patterns or anomalies.

4. Model Training:



• Choose an appropriate unsupervised learning algorithm and apply it to the data.

5. Model Evaluation:

• Evaluate the model’s performance using metrics specific to the chosen technique
(e.g., silhouette score for clustering).

6. Interpretation:

• Interpret the results to gain insights and make informed decisions.

Challenges in Unsupervised Learning

1. No Ground Truth:

• Without predefined labels, it is challenging to evaluate the model’s performance objectively.

2. Choosing the Right Algorithm:

• Selecting the appropriate algorithm for a specific task requires expertise and
experimentation.

3. Determining the Number of Clusters:

• Deciding the optimal number of clusters in clustering algorithms can be difficult and
often requires domain knowledge.

4. High Dimensionality:

• High-dimensional data can complicate the analysis and may require dimensionality
reduction techniques to simplify.

5. Interpretability:

• The results of unsupervised learning can be difficult to interpret, especially with complex models like t-SNE or UMAP.

6. Computational Complexity:

• Some unsupervised learning algorithms can be computationally intensive, especially on large datasets.



9. Deep Learning and Neural Networks: An In-Depth Overview

Outline

1. Introduction to Deep Learning


2. Components of a Neural Network
o Neurons
o Layers
o Activation Functions
3. How a Neural Network Functions
4. Example of Linear Regression
5. Requirements for a Neural Network

Introduction to Deep Learning

Deep Learning is a subset of machine learning that uses neural networks with many layers to
analyse various kinds of data. These layers of networks enable computers to perform tasks
like image recognition, language translation, and playing games. The "deep" in deep learning
refers to the number of layers through which the data is transformed.

Deep learning models are inspired by the human brain and are designed to simulate how we
learn and process information. This allows machines to perform complex tasks by
understanding intricate patterns in data, which traditional machine learning models might
struggle with.

Components of a Neural Network

Neurons

Neurons are the basic units of a neural network. Each neuron receives inputs, processes them,
and produces an output. Think of neurons as tiny decision-makers that look at the data they
receive and decide whether to pass it on to the next layer or not.

Layers

Neurons are organized into layers, each serving a different purpose:

1. Input Layer: This is where the network receives the raw data. For example, if you
are feeding in images, the input layer would receive pixel values.
2. Hidden Layers: These are intermediate layers where neurons process the inputs from
the previous layer. A deep learning model can have many hidden layers, which helps
it learn more complex patterns.



3. Output Layer: This layer provides the final output of the network. For example, it
might give the probability that an image contains a cat.

Activation Functions

Activation functions determine whether a neuron should be activated or not. They introduce
non-linearity into the network, which allows it to learn more complex patterns. Common
activation functions include:

1. ReLU (Rectified Linear Unit): Outputs the input directly if it is positive; otherwise,
it outputs zero. This helps the network deal with the problem of vanishing gradients,
where gradients (used to update the model) become too small for effective learning.
2. Sigmoid: Outputs a value between 0 and 1, which is useful for binary classification
tasks.
3. Tanh (Hyperbolic Tangent): Outputs values between -1 and 1 and is often used in
hidden layers.
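
The three activation functions can be written directly in numpy; a minimal sketch:

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: passes positive values through, zeroes out the rest."""
    return np.maximum(0, x)

def sigmoid(x):
    """Squashes any input into the range (0, 1)."""
    return 1 / (1 + np.exp(-x))

def tanh(x):
    """Squashes any input into the range (-1, 1)."""
    return np.tanh(x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z), sigmoid(z).round(2), tanh(z).round(2))
```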

How a Neural Network Functions

1. Input Data:
o The process begins with input data being fed into the network. This could be
images, text, or any other type of data.
2. Forward Propagation:
o The data passes through the layers of the network. Each neuron in a layer
processes the data, applies weights (which determine the importance of
inputs), adds a bias (a constant value to adjust the output), and then applies an
activation function to produce an output.
o This process repeats as the data moves through each layer, transforming and
combining the information to learn complex features and patterns.
3. Output Generation:
o The final layer produces the output. For a classification task, this could be the
probability of different classes (e.g., cat vs. dog). For a regression task, it
might be a predicted value (e.g., house price).



4. Learning (Training):
o The network learns by comparing its output to the actual target (known as
ground truth) and adjusting its weights and biases to minimize the difference
(error).
o This is typically done using an optimization algorithm like gradient descent,
which updates the weights and biases in small steps to reduce the error.
5. Backpropagation:
o This is the process of updating the weights and biases in the network. The
error is calculated and then propagated back through the network, adjusting
the weights and biases to minimize the error.
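
A toy forward pass through a single hidden layer, written in numpy, makes steps 1-3 concrete; the inputs and weights here are random placeholders, not a trained network:

```python
import numpy as np

rng = np.random.default_rng(42)

x = rng.normal(size=4)                            # step 1: input data with 4 features

W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)     # hidden layer: 3 neurons
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)     # output layer: 1 neuron

h = np.maximum(0, W1 @ x + b1)                    # step 2: weights + bias + ReLU activation
y_hat = 1 / (1 + np.exp(-(W2 @ h + b2)))          # step 3: sigmoid output, e.g. P(class = 1)

print("Network output:", y_hat)
# Steps 4-5 (learning) would compare y_hat to the ground truth and use
# backpropagation with gradient descent to adjust W1, b1, W2, b2.
```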

Example of Linear Regression

Linear regression is a fundamental technique often used to introduce the concepts of neural
networks and machine learning. Here’s how it works and an example of its application:

Purpose: Linear regression aims to model the relationship between a dependent variable
(target) and one or more independent variables (features) by fitting a linear equation to the
observed data.

Example Scenario: Predicting house prices based on features like size, number of rooms,
and location.

Steps in Linear Regression:

1. Input Data:
o Collect data on house prices and their corresponding features (size, number of
rooms, location).
2. Model Representation:
o Represent the relationship between the house price (Y) and its features (X1, X2, X3) using a linear equation:

Y = w1X1 + w2X2 + w3X3 + b

o Here, w1, w2, w3 are the weights (coefficients) for each feature, and b is the
bias (intercept).
3. Training the Model:
o Use historical data to find the best values for the weights and bias that
minimize the difference between the predicted and actual house prices.
o This is done by minimizing the error (e.g., mean squared error) between the
predicted prices and the actual prices in the training data.
4. Making Predictions:
o Once trained, the model can predict house prices for new data points by
plugging in the feature values into the linear equation.

Example Data:

• Assume we have data for three houses with the following features:
o House 1: Size = 1500 sq. ft, Rooms = 3, Location = 2 (coded value)
o House 2: Size = 2000 sq. ft, Rooms = 4, Location = 3 (coded value)
o House 3: Size = 1200 sq. ft, Rooms = 2, Location = 1 (coded value)
• Corresponding house prices are $300,000, $400,000, and $250,000 respectively.

Model Training:

• The linear regression model learns the weights (w1, w2, w3) and bias b from this data by minimizing the prediction error.

Making Predictions:

• For a new house with Size = 1800 sq. ft, Rooms = 3, Location = 2, the learned weights and bias are plugged into the equation.

• The predicted price for this house would be $230,150.

This simplified example illustrates how linear regression can be used to make predictions
based on the relationship learned from the data.

Requirements for a Neural Network

1. Data:
o High-quality, labelled data is crucial for training neural networks. The more
data, the better the network can learn.
2. Computational Power:
o Training deep networks requires significant computational resources, typically
involving GPUs or specialized hardware like TPUs.
3. Frameworks and Libraries:
o Popular frameworks such as TensorFlow, PyTorch, and Keras provide tools
and functions to build and train neural networks efficiently.
4. Hyperparameters:
o These are settings that must be specified before training begins, such as the
number of layers, number of neurons per layer, learning rate, and batch size.
5. Training Time:
o Deep networks can take a long time to train, depending on the complexity of
the model and the size of the data.



10. Large Language Models (LLMs) and GPT: A Detailed Overview

Outline

1. Introduction to Large Language Models (LLMs)


2. Understanding GPT (Generative Pre-trained Transformer)
3. How GPT Works
4. Applications of GPT
5. Challenges and Considerations

Introduction to Large Language Models (LLMs)

Large Language Models (LLMs) are advanced AI systems designed to process and generate
human language. They are trained on vast amounts of text data and can understand and
produce text that is coherent and contextually relevant. These models have revolutionized the
field of natural language processing (NLP) by enabling machines to perform tasks that
require a deep understanding of language.

Understanding GPT (Generative Pre-trained Transformer)

GPT (Generative Pre-trained Transformer) is a specific type of LLM developed by OpenAI. It is based on the Transformer architecture, which has become the foundation for
many state-of-the-art NLP models. GPT models are trained in two main stages: pre-training
and fine-tuning.

Key Features:

• Generative: GPT can generate text that continues from a given prompt, making it
useful for tasks like writing essays, composing emails, and creating dialogue.
• Pre-trained: GPT is initially trained on a large corpus of text data, learning the
nuances of language without any specific task in mind.
• Transformer Architecture: This architecture enables GPT to handle long-range
dependencies in text, making it highly effective at understanding and generating
language.

How GPT Works

1. Pre-training Phase

During the pre-training phase, GPT is exposed to a massive dataset containing diverse text
from the internet. The model learns to predict the next word in a sentence, given the previous words. This task, known as language modelling, helps the model understand grammar, facts
about the world, and some level of reasoning.

• Objective: Predict the next word in a sentence.
• Data: A large and diverse corpus of text data from the internet.
• Learning: The model learns to generate coherent text by understanding the context
and relationships between words.
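As a toy illustration of this next-word objective (GPT itself uses a large neural network rather than the word counts below), consider a tiny bigram model that predicts the most likely continuation of a word:

# Toy next-word predictor built from bigram counts; it only illustrates the
# "predict the next word" objective and is not how GPT is actually trained.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigram_counts = defaultdict(Counter)
for prev_word, next_word in zip(corpus, corpus[1:]):
    bigram_counts[prev_word][next_word] += 1

def predict_next(word):
    """Return the continuation seen most often after the given word."""
    candidates = bigram_counts.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("sat"))   # -> 'on'
print(predict_next("the"))   # -> 'cat' (ties are broken by insertion order)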

2. Fine-tuning Phase

After pre-training, GPT is fine-tuned on a more specific dataset, often with human feedback,
to adjust its performance for tasks. This phase helps the model specialize in tasks like
question answering, summarization, or sentiment analysis.

• Objective: Improve performance on specific tasks.
• Data: Task-specific datasets with labelled examples.
• Learning: The model refines its knowledge and adapts to the nuances of the task.
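The fine-tuning of GPT itself happens inside OpenAI's systems, so as a stand-in the sketch below fine-tunes a small open-source pre-trained transformer on a toy sentiment task using the Hugging Face libraries; the base model name, the four example sentences, and the training settings are all assumptions for illustration.

# A rough fine-tuning sketch with Hugging Face transformers/datasets (assumed
# installed); the base model, toy dataset and settings are illustrative only.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # a small stand-in, not GPT itself
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

raw = Dataset.from_dict({
    "text": ["great product", "terrible service", "loved it", "awful experience"],
    "label": [1, 0, 1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=32)

train_ds = raw.map(tokenize, batched=True)

args = TrainingArguments(output_dir="finetune-demo", num_train_epochs=1,
                         per_device_train_batch_size=2, learning_rate=5e-5)

Trainer(model=model, args=args, train_dataset=train_ds).train()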

3. Input Processing

When GPT receives an input, it processes the text through its layers. Each layer consists of
neurons that transform the input data. The model pays attention to different parts of the input
text using mechanisms called attention heads. This helps GPT understand which words are
important and how they relate to each other.

• Tokenization: The input text is broken down into tokens (words or subwords).
• Embedding: Tokens are converted into numerical vectors that the model can process.
• Attention Mechanism: The model uses self-attention to focus on relevant parts of the
text, enhancing its understanding of context and relationships.
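The attention step can also be shown numerically. The sketch below computes single-head scaled dot-product attention over three made-up token vectors; real models additionally learn separate query, key and value projection matrices and run many heads in parallel.

# Single-head scaled dot-product attention on made-up embeddings; real models
# use learned Q/K/V projections, many heads and far larger dimensions.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Three tokens, each embedded as a 4-dimensional vector (values are illustrative)
embeddings = np.array([
    [1.0, 0.0, 1.0, 0.0],   # token 1
    [0.0, 1.0, 0.0, 1.0],   # token 2
    [1.0, 1.0, 0.0, 0.0],   # token 3
])

# For simplicity the embeddings serve directly as queries, keys and values
Q, K, V = embeddings, embeddings, embeddings
d_k = K.shape[-1]

scores = Q @ K.T / np.sqrt(d_k)      # how strongly each token attends to every other
weights = softmax(scores, axis=-1)   # attention weights: each row sums to 1
output = weights @ V                 # context-aware representation of each token

print(np.round(weights, 2))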

4. Generating Output

Based on the processed input, GPT generates a relevant and coherent response. It uses the
learned patterns and knowledge from the training phases to produce text that fits the context
of the input.

• Decoding: The model generates the output token by token, each time considering the
previous tokens.
• Beam Search: A technique that helps in generating the most likely sequence of
words.
• Output: The final generated text is assembled from the individual tokens.

Example:

• Input: "What is the capital of France?"


• Output: "The capital of France is Paris."

Applications of GPT

1. Customer Support:

• Usage: Automated chatbots use GPT to handle customer inquiries, provide information, and resolve issues.
• Example: A customer asks about the status of their order, and the GPT-powered
chatbot provides the latest tracking information.
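For concreteness, a support chatbot along these lines could call a hosted GPT model through an API. The sketch below assumes the openai Python client (v1 style) with an API key in the environment; the model name and the order details are placeholders, not a prescribed integration.

# Hypothetical customer-support call to a hosted GPT model; assumes the openai
# Python package is installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment

order_status = "Order #12345: shipped on 12 July, expected delivery 15 July."  # placeholder data

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system",
         "content": "You are a support assistant. Answer using only the order data provided."},
        {"role": "user",
         "content": f"Where is my order? Context: {order_status}"},
    ],
)

print(response.choices[0].message.content)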

2. Content Creation:

• Usage: GPT can assist writers by generating ideas, drafting articles, or even writing
entire pieces based on prompts.
• Example: A marketer uses GPT to generate a blog post about the benefits of a new
product.

3. Translation:

• Usage: Translating text from one language to another while preserving the meaning
and context.
• Example: GPT translates an English document into Spanish for a global audience.

4. Education and Training:

• Usage: Creating interactive learning materials, tutoring, and answering questions.
• Example: A student uses GPT to get explanations on complex topics in their
coursework.

5. Personal Assistants:

• Usage: Virtual assistants (similar in style to Siri and Alexa) can use models like GPT to understand and respond to
user commands.
• Example: A user asks their virtual assistant to set a reminder for a meeting, and it
schedules the reminder accordingly.

Challenges and Considerations

1. Bias and Fairness:

• LLMs like GPT can inadvertently learn and reproduce biases present in the training
data. It is crucial to continuously monitor and mitigate these biases to ensure fair and
ethical use.

2. Data Privacy:

• Ensuring that the data used for training and the interactions with the model respect
user privacy is vital. Users should be informed about how their data is used and
protected.

3. Resource Intensive:

• Training and running LLMs require significant computational resources, making them
expensive to develop and deploy.

4. Interpretability:

• Understanding why a model like GPT generates a particular response can be challenging. Enhancing the interpretability of these models is an ongoing area of research.

5. Ethical Use:

• It is essential to use LLMs responsibly, ensuring they are not employed for malicious
purposes, such as generating fake news or harmful content.



11.Additional Resources
Online Courses
• Coursera - Data Science Specialization - Offered by Johns Hopkins University
• edX - MicroMasters Program in Data Science - Offered by University of California, San Diego
• Udacity - Machine Learning Engineer Nanodegree

Books
• "Data Science for Business" by Foster Provost and Tom Fawcett
• "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien
Géron
• "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville

Blogs and Websites


• Towards Data Science
• KDnuggets
• Analytics Vidhya

Research Papers and Journals


• Journal of Machine Learning Research
• arXiv.org - Machine Learning Section
• IEEE Transactions on Neural Networks and Learning Systems

Community and Forums


• Kaggle
• Reddit - r/datascience
• Stack Overflow

Tutorials and Guides


• Google's Machine Learning Crash Course
• Fast.ai - Practical Deep Learning for Coders
• DataCamp - Data Science and Machine Learning Courses



12.Key Interview Questions for Data Analytics Preparation

1. What is marketing mix modelling?
o Answer: Marketing Mix Modelling uses predictive modelling to quantify the impact
of various marketing activities on any Key Performance Indicator (KPI) such as sales,
revenues, number of customers, number of installs, etc. With such insights,
companies can allocate marketing budgets optimally for maximum ROI.
2. Have you heard of Artificial Intelligence and automation? What is the difference
between those in your perspective?
o Answer: Automation executes predefined tasks, reducing manual intervention and
enhancing efficiency. AI, incorporating machine learning and advanced algorithms,
learns from data, adapts, and makes decisions without explicit programming.
3. Explain the working of APIs with a practical example.
o Answer: Imagine you’re a customer at a restaurant. The waiter (the API) functions as
an intermediary between customers like you (the user) and the kitchen (web server).
You tell the waiter your order (API call), and the waiter requests it from the kitchen.
Finally, the waiter will provide you with what you ordered. In this metaphor, the
waiter is an abstraction of the API. Similarly, an API abstracts the web server,
allowing applications to request data which is then displayed to the user.
4. Questions tailored to previous work experience on the lines of ML, Statistical
Modelling, Data Visualization tools, SQL.
5. Questions on the lines of Agile, Kanban, and SDLC Models.
6. What is the difference between OLAP and OLTP?
o Answer: OLAP systems are tailored for in-depth analysis and reporting, efficiently
handling large volumes of historical data for business analysts and decision-makers.
OLTP databases are optimized for transactional processing, supporting real-time
online operations like sales recording and inventory management. OLTP systems
prioritize data integrity and fast transaction processing through normalized data
structures.
7. What is ERP?
o Answer: ERP stands for Enterprise Resource Planning. It's an integrated software
suite that organizations use to manage and streamline their core business processes,
including finance, human resources, supply chain management, manufacturing,
inventory management, and customer relationship management (CRM). ERP systems
provide a centralized platform for real-time data collection, storage, and analysis,
enabling better decision-making, improved efficiency, and enhanced collaboration
across the organization.
8. What is SQL?
o Answer: SQL stands for Structured Query Language. It is the standard language used
to maintain relational databases and perform various data manipulation operations.
9. What are the subsets of SQL?
o Answer:
▪ Data Definition Language (DDL): Defines data structure with commands
like CREATE, ALTER, DROP, etc.
▪ Data Manipulation Language (DML): Manipulates existing data with
commands like SELECT, UPDATE, INSERT, etc.
▪ Data Control Language (DCL): Controls access to data with commands
like GRANT and REVOKE.
▪ Transaction Control Language (TCL): Manages transaction operations
with commands like COMMIT, ROLLBACK, SET TRANSACTION,
SAVEPOINT, etc.
10. What is DBMS?
o Answer: DBMS stands for Database Management System. It is software that
provides an interface between the database and the end-user, managing data, the
database engine, and the database schema to facilitate data organization and
manipulation.
11. What is RDBMS?
o Answer: RDBMS stands for Relational Database Management System. It is a DBMS
based on a relational model, which stores data in tables and links those tables using
relational operators.
12. What is a primary key?
o Answer: A primary key is a field or combination of fields that uniquely identify each
record in a table. It cannot be null or empty and ensures unique values in a column. A
table can have only one primary key.
13. What is a foreign key?
o Answer: A foreign key is a field in one table that references the primary key of another table. It links the two tables and enforces referential integrity, so that every foreign-key value corresponds to an existing row in the referenced table.
14. What is an LLM (Large Language Model)?
o Answer: LLMs are advanced AI models trained on vast amounts of text data to
understand, generate, and interact with human language. They are used for various
NLP tasks, such as text generation, translation, and summarization.
15. What is GPT (Generative Pre-trained Transformer)?
o Answer: GPT is a type of LLM developed by OpenAI. It uses the Transformer
architecture to generate human-like text based on given prompts. GPT is pre-trained
on diverse text data and fine-tuned for specific tasks, making it versatile for
applications like chatbots, content creation, and more.
16. Explain the CRISP-DM process.
o Answer: CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It
is a widely used methodology for data mining projects, which includes six phases:
Business Understanding, Data Understanding, Data Preparation, Modelling,
Evaluation, and Deployment.
17. How do you handle missing data in a dataset?
o Answer: Missing data can be handled by various methods, such as:
▪ Imputation: Filling missing values with mean, median, mode, or using
predictive models.
▪ Deletion: Removing rows or columns with missing values, if the dataset is
large enough.
▪ Using algorithms: That can handle missing values inherently, like certain
machine learning algorithms.
18. What is A/B testing and how is it used?
o Answer: A/B testing is a statistical method used to compare two versions of a
variable to determine which one performs better. It is commonly used in marketing to
test changes to a web page or app against the current design and measure the impact
on user behaviour.
19. Explain the concept of a confusion matrix.
o Answer: A confusion matrix is a table used to evaluate the performance of a
classification algorithm. It summarizes the results by comparing the actual versus
predicted classifications and includes metrics such as True Positives, False Positives,
True Negatives, and False Negatives (a short code sketch appears after this list).
20. What is the difference between supervised and unsupervised learning?
o Answer: Supervised learning involves training a model on labelled data, where the
output is known. Unsupervised learning, on the other hand, deals with unlabelled data
and the model tries to identify patterns and relationships within the data.
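As a follow-up to question 19 above, the short sketch below computes a confusion matrix with scikit-learn; the two label vectors are made up purely for illustration.

# Confusion matrix for a toy binary classification result (made-up labels).
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # -> TN=3, FP=1, FN=1, TP=3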



13.Logic Puzzles for Analytics Interviews

1. Measuring 4 Liters from a 7-Liter and a 5-Liter Jar

Puzzle: How do you measure exactly 4 liters using only a 7-liter jar and a 5-liter jar, with no
markings to measure intermediate amounts?

Solution:

1. Fill the 7-liter jar completely.
2. Pour the water from the 7-liter jar into the 5-liter jar until the 5-liter jar is full. This leaves you
with 2 liters in the 7-liter jar.
3. Empty the 5-liter jar.
4. Pour the 2 liters from the 7-liter jar into the 5-liter jar.
5. Fill the 7-liter jar again.
6. Pour the water from the 7-liter jar into the 5-liter jar until the 5-liter jar is full. Since the 5-liter
jar already has 2 liters, you can only add 3 more liters.
7. This leaves you with exactly 4 liters in the 7-liter jar.
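In an interview the step-by-step reasoning above is what matters, but the solution can also be checked programmatically; the sketch below runs a breadth-first search over (7-liter, 5-liter) jug states and recovers the same sequence of moves.

# Breadth-first search over (7-liter jar, 5-liter jar) states to confirm that
# exactly 4 liters can be measured; a programmatic check of the steps above.
from collections import deque

CAP_A, CAP_B, TARGET = 7, 5, 4

def next_states(a, b):
    pour_ab = min(a, CAP_B - b)   # amount movable from the 7-liter to the 5-liter jar
    pour_ba = min(b, CAP_A - a)   # amount movable from the 5-liter to the 7-liter jar
    return {
        (CAP_A, b), (a, CAP_B),        # fill either jar
        (0, b), (a, 0),                # empty either jar
        (a - pour_ab, b + pour_ab),    # pour 7-liter into 5-liter
        (a + pour_ba, b - pour_ba),    # pour 5-liter into 7-liter
    }

def shortest_solution():
    start = (0, 0)
    queue, parent = deque([start]), {start: None}
    while queue:
        state = queue.popleft()
        if TARGET in state:            # one jar holds exactly 4 liters
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in next_states(*state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)

print(shortest_solution())
# -> [(0, 0), (7, 0), (2, 5), (2, 0), (0, 2), (7, 2), (4, 5)]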

2. The Two Rope Problem

Puzzle: You have two ropes and a lighter. Each rope takes exactly one hour to burn, but they
burn at inconsistent rates along their length. How can you measure exactly 45 minutes?

Solution:

1. Light both ends of the first rope and one end of the second rope simultaneously.
2. The first rope will burn completely in 30 minutes because it is burning from both ends.
3. When the first rope is completely burned, light the other end of the second rope.
4. The second rope will take another 15 minutes to burn completely from both ends.
5. In total, this process will measure exactly 45 minutes.

3. The Three Switches Puzzle

Puzzle: You are in a room with three light switches, all of which are off. Each switch
controls one of three light bulbs in another room. You cannot see the bulbs from where you
are. How can you determine which switch controls which bulb if you can only enter the room
with the bulbs once?

Solution:

1. Turn on the first switch and leave it on for a few minutes.
2. Turn off the first switch and turn on the second switch.
3. Immediately go to the room with the bulbs.
4. The bulb that is off but warm is controlled by the first switch.
5. The bulb that is on is controlled by the second switch.
6. The bulb that is off and cold is controlled by the third switch.

4. The Wolf, Goat, and Cabbage Puzzle

Puzzle: A farmer needs to transport a wolf, a goat, and a cabbage across a river using a boat.
The boat can only carry the farmer and one other item. If left alone, the wolf will eat the goat,
and the goat will eat the cabbage. How can the farmer get all three across the river safely?

Solution:

1. Take the goat across the river and leave it on the other side.
2. Go back alone and take the wolf across the river.
3. Leave the wolf on the other side but take the goat back with you.
4. Leave the goat on the starting side and take the cabbage across the river.
5. Leave the cabbage with the wolf on the other side and go back alone.
6. Finally, take the goat across the river.

5. The Bridge and Torch Problem

Puzzle: Four people need to cross a narrow bridge at night. They have only one torch, and
the bridge is too dangerous to cross without it. The bridge can hold a maximum of two people
at a time. The four people take 1, 2, 7, and 10 minutes respectively to cross. When two people
cross together, they must move at the slower person's pace. How can they all get across the
bridge in the least amount of time?

Solution:

1. The 1-minute and 2-minute persons cross together (2 minutes).
2. The 1-minute person returns with the torch (1 minute, total 3 minutes).
3. The 7-minute and 10-minute persons cross together (10 minutes, total 13 minutes).
4. The 2-minute person returns with the torch (2 minutes, total 15 minutes).
5. The 1-minute and 2-minute persons cross together again (2 minutes, total 17 minutes).
