Data Analytics Notes
Descriptive Statistics:-
Whenever we deal with data, whether it is a small sample or stored in huge databases, statistics is the key that helps us analyze it and draw insightful conclusions, so that we can understand the whole dataset without going through every individual record. In this article, we will learn about Descriptive Statistics and how we can actually use it as a tool to explore the data we have.
Mean
It is the sum of the observations divided by the total number of observations; in other words, it is the average of the data.
x̄ = Σx / n
where,
x = Observations
n = number of terms
Mode
It is the value that has the highest frequency in the given data set. The data set may have no mode if all data points occur with the same frequency. We can also have more than one mode if two or more data points share the same highest frequency.
Median
It is the middle value of the data set. It splits the data into two halves. If the number of
elements in the data set is odd then the center element is the median and if it is even
then the median would be the average of two central elements.
The common measures of spread are:
Range
Variance
Standard deviation
Range
The range describes the difference between the largest and smallest data point in our
data set. The bigger the range, the more the spread of data and vice versa.
Range = Largest data value – smallest data value
Variance
It is defined as the average squared deviation from the mean. It is calculated by finding the difference between every data point and the mean, squaring these differences, adding them all up, and then dividing by the number of data points in the data set.
σ² = Σ(x − μ)² / N
where,
x -> Observation under consideration
N -> number of terms
μ -> Mean
Standard Deviation
It is defined as the square root of the variance. It is calculated by finding the mean, subtracting the mean from each observation, squaring the results, adding all of these squared values, dividing by the number of terms, and finally taking the square root.
σ = √( Σ(x − μ)² / N )
where,
x = Observation under consideration
N = number of terms
μ = Mean
Example 1:
Exam Scores. Suppose you have the following scores of 20 students on an exam:
85, 90, 75, 92, 88, 79, 83, 95, 87, 91, 78, 86, 89, 94, 82, 80, 84, 93, 88, 81
Mean: Add up all the scores and divide by the number of scores. Mean = (85 + 90 + 75 + 92 + 88 + 79 + 83 + 95 + 87 + 91 + 78 + 86 + 89 + 94 + 82 + 80 + 84 + 93 + 88 + 81) / 20 = 1720 / 20 = 86
Median: Arrange the scores in ascending order and find the middle value. With 20 scores (an even count), the median is the average of the two central values: Median = (86 + 87) / 2 = 86.5
Range: Calculate the difference between the highest and lowest scores. Range = 95 - 75 = 20
Variance: Calculate the average of the squared differences from the mean. Variance = [(85 - 86)^2 + (90 - 86)^2 + ... + (81 - 86)^2] / 20 = 614 / 20 = 30.7
Standard Deviation: Take the square root of the variance. Standard Deviation = √30.7 ≈ 5.54
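These figures can be checked with a quick Python sketch using only the standard library's statistics module (population variance, since we divide by N):

import statistics

scores = [85, 90, 75, 92, 88, 79, 83, 95, 87, 91,
          78, 86, 89, 94, 82, 80, 84, 93, 88, 81]

mean = statistics.mean(scores)            # 86
median = statistics.median(scores)        # 86.5 (average of the two middle values)
modes = statistics.multimode(scores)      # [88] -> 88 is the only repeated score
value_range = max(scores) - min(scores)   # 95 - 75 = 20
variance = statistics.pvariance(scores)   # population variance, divides by N -> 30.7
std_dev = statistics.pstdev(scores)       # square root of the variance -> ~5.54

print(mean, median, modes, value_range, variance, round(std_dev, 2))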
Probability Distribution
What is Probability Distribution?
In a probability distribution, a Random Variable's outcome is uncertain; the observation of an outcome is known as a Realization. A Random Variable is a function that maps the Sample Space into a real number space, known as the State Space. Random Variables can be Discrete or Continuous.
Probability Distribution Definition
The Probability Distribution of a Random Variable (X) shows how the probabilities of the events are distributed over the different values of the Random Variable. When all values of a Random Variable are aligned on a graph, the values of their probabilities generate a shape. The probability distribution has several properties (for example, Expected Value and Variance) that can be measured. It should be kept in mind that the probability of every outcome is greater than or equal to zero and that the sum of the probabilities of all the events is equal to 1.
A Probability Distribution basically describes the likelihood of each of the possible outcomes of a random experiment or event.
Random Variables
Random Variable is a real-valued function whose domain is the sample space of the
random experiment. It is represented as X(sample space) = Real number.
We need to learn the concept of Random Variables because we are often not only interested in the probability of an event itself but also in the number of times it occurs in the random experiment. The importance of random variables can be better understood by the following example:
Why do we need Random Variables?
Let’s take an example of the coin flips. We’ll start with flipping a coin and finding out
the probability. We’ll use H for ‘heads’ and T for ‘tails’.
So now we flip our coin 3 times, and we want to answer some questions.
1. What is the probability of getting exactly 3 heads?
2. What is the probability of getting less than 3 heads?
3. What is the probability of getting more than 1 head?
Then our general way of writing this would be:
1. P(getting exactly 3 heads when we flip a coin 3 times), and similarly for the other questions.
In a different scenario, suppose we are tossing two dice, and we are interested in
knowing the probability of getting two numbers such that their sum is 6.
So, in both of these cases, we first need to know the number of times the desired event occurs, i.e. a Random Variable X defined on the sample space, which is then used to compute the probability P(X) of the event. Hence, Random Variables come to our rescue. First, let's define what a random variable is mathematically.
A random variable is a real valued function whose domain is the sample space of a
random experiment.
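As a quick illustration (not from the original notes), the three coin-flip questions above can be answered by enumerating the 8 equally likely outcomes of 3 flips and treating X as the number of heads:

from itertools import product

outcomes = list(product("HT", repeat=3))          # ('H','H','H'), ('H','H','T'), ...
heads = [seq.count("H") for seq in outcomes]      # X = number of heads in each outcome

p_exactly_3 = sum(x == 3 for x in heads) / len(outcomes)    # 1/8
p_less_than_3 = sum(x < 3 for x in heads) / len(outcomes)   # 7/8
p_more_than_1 = sum(x > 1 for x in heads) / len(outcomes)   # 4/8 = 1/2

print(p_exactly_3, p_less_than_3, p_more_than_1)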
A Discrete Random Variable can take only a finite (or countably infinite) number of values. To further understand this, let's see some examples of discrete random variables:
1. X = {sum of the outcomes when two dice are rolled}. Here, X can only take values from {2, 3, 4, …, 11, 12}.
2. X = {Number of Heads in 100 coin tosses}. Here, X can take only integer values
from [0,100].
A Continuous Random Variable can take infinitely many values in a continuous range. Let's see an example of a dart game.
Suppose we have a dart game in which we throw a dart that can land anywhere between [-1, 1] on the x-axis. If we define our random variable X as the x-coordinate of the position of the dart, X can take any value in [-1, 1]. There are infinitely many possible values that X can take (X = 0.1, 0.001, 0.01, -0.2, 0.2112121, and so on).
The probability distribution of a discrete random variable can be listed as a table of events and their probabilities:
Event    Probability
x1       P(X = x1)
x2       P(X = x2)
x3       P(X = x3)
The Probability Function of a discrete random variable X is the function p(x) satisfying
p(x) = P(X = x)
It should be noted here that each value of P(X = x) is greater than or equal to zero and the sum of all P(X = x) is equal to 1.
The Expected Value (mean) of a discrete random variable is E[X] = Σ x · P(X = x), i.e. it is the weighted average of all values which X can take, weighted by the probability of each value.
To see this more intuitively, consider two random variables whose probability distributions have almost the same 'mean'. Does that mean they are equal? No. To fully describe the properties/behavior of a random variable, we need something more. We need to look at the dispersion of the probability distributions: one of them may be concentrated near a single value, while the other is very spread out. So we need a metric to measure the dispersion in the distribution.
For a discrete random variable, the mean and the variance are obtained from the probability distribution as
E[X] = Σ x · P(X = x)
E[X²] = Σ x² · P(X = x)
Thus, Var(X) = E[X²] – (E[X])²
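As a worked sketch (using the sum of two fair dice from the earlier example, since the original numbers are not preserved here), E[X], E[X²] and Var(X) can be computed directly from the probability distribution:

from itertools import product
from collections import Counter

# Build P(X = x) for X = sum of two fair dice (36 equally likely outcomes).
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {x: c / 36 for x, c in counts.items()}

assert abs(sum(pmf.values()) - 1) < 1e-12          # the probabilities sum to 1

e_x = sum(x * p for x, p in pmf.items())           # E[X]   = 7.0
e_x2 = sum(x * x * p for x, p in pmf.items())      # E[X^2] = 54.83...
var_x = e_x2 - e_x ** 2                            # Var(X) = 35/6 ≈ 5.83

print(e_x, e_x2, var_x)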
**********************************************************************
Example: An urn contains 8 red balls and 10 black balls. We draw six balls from the
urn successively. You have to tell whether or not the trials of drawing balls are
Bernoulli trials when after each draw, the ball drawn is:
1. replaced
2. not replaced in the urn.
Answer:
1. We know that the number of trials is finite. When drawing is done with replacement, the probability of success (say, a red ball) is p = 8/18, which is the same for all six trials. So, drawing balls with replacement gives Bernoulli trials.
2. If drawing is done without replacement, the probability of success (i.e., a red ball) in the first trial is 8/18; in the 2nd trial it is 7/17 if the first ball drawn is red, or 8/17 if the first ball drawn is black, and so on. Clearly, the probabilities of success are not the same for all the trials. Therefore, the trials are not Bernoulli trials.
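A small simulation sketch (an illustration, not part of the original example) shows why the draws without replacement are not Bernoulli trials: the success probability of a later draw depends on what happened earlier.

import random

def draw_sequence(with_replacement):
    urn = ["red"] * 8 + ["black"] * 10
    sequence = []
    for _ in range(6):
        ball = random.choice(urn)
        sequence.append(ball)
        if not with_replacement:
            urn.remove(ball)               # the composition of the urn changes
    return sequence

def p_second_red_given_first_red(with_replacement, trials=100_000):
    hits = total = 0
    for _ in range(trials):
        seq = draw_sequence(with_replacement)
        if seq[0] == "red":
            total += 1
            hits += seq[1] == "red"
    return hits / total

print(round(p_second_red_given_first_red(True), 3))    # ~0.444 = 8/18, unchanged
print(round(p_second_red_given_first_red(False), 3))   # ~0.412 = 7/17, changed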
Binomial Distribution
It is the distribution of a random variable that represents the number of successes in "n" successive independent trials of a Bernoulli experiment. It is used in a plethora of instances, including the number of heads in "n" coin flips, and so on.
Let P and Q denote the success and failure of a Bernoulli trial respectively. Suppose we are interested in finding the different ways in which we get exactly 1 success in six trials.
Clearly, six cases are available as listed below:
PQQQQQ, QPQQQQ, QQPQQQ, QQQPQQ, QQQQPQ, QQQQQP
Example: A fair coin is tossed 10 times. What is the probability of getting (i) exactly 6 heads, and (ii) at least 6 heads?
Here, n = 10 and p = 1/2.
So, P(X = x) = nCx p^x (1-p)^(n-x), x = 0, 1, 2, …, n
P(X = x) = 10Cx (1/2)^x (1/2)^(10-x) = 10Cx (1/2)^10
(i) P(X = 6) = 10C6 (1/2)^10 = 210/1024 = 105/512 ≈ 0.205
(ii) P(at least 6 heads) = P(X >= 6) = P(X = 6) + P(X = 7) + P(X = 8) + P(X = 9) + P(X = 10)
= (10C6 + 10C7 + 10C8 + 10C9 + 10C10) (1/2)^10 = (210 + 120 + 45 + 10 + 1) / 1024 = 386/1024 = 193/512 ≈ 0.377
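Both probabilities can be verified with a minimal sketch (standard library only) using math.comb:

from math import comb

n, p = 10, 0.5

def binom_pmf(x):
    # P(X = x) = nCx * p^x * (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

p_exactly_6 = binom_pmf(6)                                   # 210/1024 = 105/512 ≈ 0.205
p_at_least_6 = sum(binom_pmf(x) for x in range(6, n + 1))    # 386/1024 = 193/512 ≈ 0.377

print(round(p_exactly_6, 4), round(p_at_least_6, 4))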
Sources of Big Data
Following are some examples of sources that generate Big Data:
o Social networking sites: Facebook, Google, and LinkedIn generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.
o E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge volumes of logs from which users' buying trends can be traced.
o Weather stations: Weather stations and satellites give very large volumes of data, which are stored and processed to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and accordingly publish their plans, and for this they store the data of their millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
In recent years, Big Data was defined by the "3Vs", but now there are "6Vs" of Big Data, which are also termed the characteristics of Big Data, as follows:
1. Volume:
The name 'Big Data' itself refers to a size which is enormous; Volume means a huge amount of data.
To determine the value of data, the size of the data plays a very crucial role. Whether particular data can actually be considered Big Data or not depends upon its volume.
Hence, while dealing with Big Data it is necessary to consider the characteristic 'Volume'.
Example: In the year 2016, the estimated global mobile traffic was 6.2 Exabytes (6.2 billion GB) per month, and it was estimated that by the year 2020 there would be almost 40,000 Exabytes of data.
2. Velocity:
Velocity refers to the high speed of accumulation of data.
In Big Data, data flows in at high velocity from sources like machines, networks, social media, mobile phones, etc.
There is a massive and continuous flow of data. This determines the potential of the data, i.e. how fast the data is generated and processed to meet the demands.
Sampling data can help in dealing with issues like velocity.
Example: More than 3.5 billion searches are made on Google per day. Also, the number of Facebook users increases by about 22% (approx.) year on year.
Big Data Drivers
The term big data drivers refers to the key factors that contribute to the growth and adoption of big data technologies and practices. These drivers are the forces or trends that cause the volume, variety, and velocity of data to grow. Some of these drivers are:
4. Real-Time and Predictive Analytics: The demand for real-time data processing and
predictive insights pushes the need for big data solutions that can handle rapid data flow and
complex analysis.
5. Cloud Computing: The availability of cloud services enables organizations to store, scale,
and analyze large datasets without heavy upfront infrastructure investment.
6. Social and Economic Factors: Changing consumer behaviors and regulatory requirements
(e.g., GDPR) also contribute to the growing importance of big data in businesses,
governments, and industries.
Big Data and BI Tools
1. Helical Insight
Helical Insight is like a magic wand for your data: it helps you turn messy numbers into clear, easy-to-understand insights. Highlighted below are some of the prominent features of the Helical Insight BI product:
Self-service interface for creating reports, dashboards, infographics and map-based analytics
Plenty of visualization options with drill-down, drill-through and inter-panel communication options
NLP (GenAI) based data analysis (under development)
2. Apache Spark
Apache Spark is a unified analytics engine known for its speed and ease of use. It extends the MapReduce model to efficiently support more types of computation. Key features include support for interactive queries and stream processing.
3. Druid
Interactive querying: Executes queries with low latency, even over large
datasets.
Mobile Business Intelligence (Mobile BI) refers to the delivery and access of business intelligence
(BI) data on mobile devices like smartphones and tablets. It enables users to monitor business
performance, view reports, dashboards, and analytics, and make data-driven decisions on the go.
Mobile BI enhances accessibility and real-time decision-making by providing data insights
wherever and whenever needed.
Tools of Mobile BI
Several tools and platforms are designed to support Mobile BI, offering capabilities like data
visualization, reporting, and analytics. Some popular Mobile BI tools include:
Tableau Mobile:
Lets users view and interact with Tableau dashboards and visualizations from smartphones and tablets.
Power BI (Microsoft):
Mobile version allows for creating and viewing reports and dashboards on mobile.
Integrates seamlessly with other Microsoft services like Azure and Excel.
Domo:
Cloud-based BI platform whose mobile apps give access to dashboards, alerts, and collaboration features on the go.
Sisense Mobile:
Provides access to Sisense dashboards and analytics on mobile devices.
Challenges of Mobile BI
Data Security:
Business data delivered to smartphones and tablets must be protected against device loss, theft, and unauthorized access.
Screen Size:
Presenting complex data and reports on small mobile screens can be challenging.
Offline Access:
Ensuring data synchronization and caching for offline use is a significant challenge.
Performance:
Ensuring fast and smooth interaction with large datasets requires optimized mobile applications.
Usability:
Complex BI features may need simplification for mobile users without compromising functionality.
Integration:
Mobile BI platforms must integrate with various data sources and enterprise systems.
Compatibility:
Ensuring compatibility across different mobile operating systems (iOS, Android) can add complexity.
Device Limitations:
Mobile devices have limited battery life and may face connectivity issues.
Crowdsourcing Analytics in Big Data refers to the process of leveraging a large group of people,
often through an open call on the internet, to analyze large datasets or solve complex analytical
problems. It combines the principles of crowdsourcing—where tasks are distributed to a wide
audience—and analytics, which involves interpreting and deriving insights from data. In the context
of Big Data, this approach is particularly useful for handling the vast, complex, and unstructured
datasets that require significant human input and diverse perspectives to process effectively.
Human-In-The-Loop:
In many cases, automated algorithms struggle with certain tasks (e.g., image recognition, sentiment
analysis, and data labeling). Crowdsourcing allows humans to step in where AI and machine
learning models fall short.
Distributed Workforces:
Tasks like data cleaning, labeling, or even feature identification can be distributed to a large,
diverse group of people across the globe, enabling parallel processing of large datasets.
Collective Intelligence:
Crowdsourcing takes advantage of the diverse knowledge and expertise of a large pool of
individuals. This can lead to more creative, accurate, or comprehensive insights compared to
relying solely on algorithms or a small team of data scientists.
Scalability and Speed:
Since the workload is distributed among many participants, it allows for faster processing of large datasets, often much quicker than what a single team could achieve.
Cost-Effective:
Distributing analysis and labeling tasks to a crowd is often cheaper than hiring a dedicated, full-time team for the same volume of work.
Applications of Crowdsourcing Analytics
Data Labeling:
Datasets, especially for machine learning models, often need to be labeled (e.g., images, texts). Crowdsourcing platforms can assign these labeling tasks to many individuals to create high-quality, labeled data.
Sentiment Analysis:
For companies analyzing social media data or customer feedback, crowdsourcing can help interpret
sentiments in a more nuanced way than automated sentiment analysis tools.
Pattern Recognition:
In cases where visual pattern recognition (such as identifying objects in satellite images) is
required, crowdsourcing can leverage human perception, which is often better at identifying
patterns in complex, unstructured data.
Competitions and Innovation:
Crowdsourcing analytics platforms (like Kaggle) allow individuals and teams to tackle complex big data problems, often in competitions, leading to innovative approaches and insights.
Data Cleaning:
Tasks such as spotting duplicates, correcting obvious errors, and verifying records can be distributed across many participants to improve data quality.
Challenges of Crowdsourcing Analytics
Quality Control:
Since the work is distributed to many individuals with varying levels of expertise, ensuring the
accuracy and quality of results can be challenging. Multiple checks or consensus mechanisms are
often required to validate the output.
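As a simple sketch of such a consensus mechanism (the item names and labels below are hypothetical), each item is labeled by several workers, the majority label is kept, and low-agreement items are flagged for review:

from collections import Counter

# Hypothetical worker labels collected for three items.
labels = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
    "img_003": ["cat", "dog", "bird"],
}

for item, votes in labels.items():
    winner, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    flag = "" if agreement >= 2 / 3 else "  <-- low agreement, send for review"
    print(f"{item}: {winner} (agreement {agreement:.0%}){flag}")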
Bias and Consistency:
Crowd participants may introduce their own biases or may interpret data inconsistently. Careful instructions and sample tasks are necessary to standardize how tasks are performed.
Privacy and Security:
When sensitive or proprietary data is involved, sharing data with a large group of participants raises privacy and security concerns. Anonymizing data or restricting access can mitigate some risks.
Task Complexity:
Some analytical tasks might be too complex to easily distribute to non-experts. In such cases,
crowdsourcing might not be an effective solution compared to more specialized teams or AI-driven
approaches.
Participant Engagement:
Maintaining engagement and ensuring that participants remain motivated can be difficult, especially for repetitive or tedious tasks. Incentive structures are crucial for sustained participation.
Examples of Crowdsourcing Platforms
Kaggle:
A platform for data science competitions where crowdsourcing analytics is applied to solve
complex data problems.
Amazon Mechanical Turk:
A crowdsourcing marketplace that allows businesses to distribute small tasks (like data labeling and classification) to a large number of workers.
Zooniverse:
A platform that enables crowdsourcing for scientific research, such as identifying galaxies,
classifying animals in images, and analyzing historical documents.
Conclusion
Crowdsourcing analytics is a powerful tool in the Big Data space, offering a way to distribute
complex tasks to a broad audience, resulting in faster and often more innovative solutions.
However, challenges such as ensuring data quality and managing biases must be carefully
addressed for effective implementation.
Data processing means the processing of data, i.e. converting it from one format into another. Data is very useful, and when it is well presented it becomes informative and usable. A data processing system is also referred to as an information system. It is also right to say that data processing is the process of converting data into information and vice versa.
Processing data definition involves defining and managing the structure, characteristics, and specifications of data within an organization.
Processed data definition typically refers to the refined and finalized specifications and attributes associated with data after it has undergone various processing steps.
The data processing process involves a series of stages to transform raw data into meaningful information. Here are the six fundamental stages of the data processing process:
1. Collection
The process begins with the collection of raw data from various sources. This stage establishes the foundation for subsequent processing, ensuring a comprehensive pool of data relevant to the intended analysis. Sources could include surveys, sensors, databases, or any other means of gathering relevant information.
2. Preparation
The collected raw data is cleaned, organized, and checked for errors, duplicates, and missing values so that it is in a suitable form for the following stages.
3. Input
During the data input stage, the prepared data is entered into a computer system. This can be achieved through manual entry or automated methods, depending on the nature of the data and the systems in place.
4. Data Processing
The core of data processing involves manipulating and analyzing the prepared data. Operations
such as sorting, summarizing, calculating, and aggregating are performed to extract meaningful
insights and patterns.
5. Data Output
The results of data processing are presented in a comprehensible format during the data output
stage. This could include reports, charts, graphs, or other visual representations that facilitate
understanding and decision-making based on the analyzed data.
6. Data Storage
The final stage entails storing the processed data for future reference and analysis. This is crucial
for maintaining a historical record, enabling efficient retrieval, and supporting ongoing or future
data-related initiatives. Proper data storage ensures the longevity and accessibility of valuable
information.
Mechanical data processing involves the use of machines, like punch cards or mechanical
calculators, to handle data. It represents an intermediate stage between manual and electronic
processing, offering increased efficiency over manual methods but lacking the speed and
sophistication of electronic systems. This method was prominent before the widespread adoption of
computers.
Electronic data processing leverages computers and digital technology to perform data-related
tasks. It has revolutionized the field by significantly enhancing processing speed, accuracy, and
capacity. Electronic data processing encompasses various techniques, including batch processing,
real-time processing, and online processing, making it a cornerstone of modern information
management and analysis.
Manual Data Processing
In this type, data is processed by humans without the use of machines or electronic devices. It involves tasks such as manual calculations, sorting, and recording, making it a time-consuming process.
Mechanical Data Processing
This type utilizes mechanical devices, such as punch cards or mechanical calculators, to process data. While more efficient than manual processing, it lacks the speed and capabilities of electronic methods.
Batch processing involves grouping data into batches and processing them together at a scheduled
time. It is suitable for non-time-sensitive tasks and is efficient for large-scale data processing.
Real-time processing deals with data immediately as it is generated. It is crucial for time-sensitive
applications, providing instant responses and updates, often seen in applications like financial
transactions and monitoring systems.
Online Data Processing (OLTP) involves processing data directly while it is being collected. It is
interactive and supports concurrent transactions, making it suitable for applications that require
simultaneous user interaction and data updates.
Automatic Data Processing (ADP) refers to the use of computers and software to automate data
processing tasks. It encompasses various methods, including batch processing and real-time
processing, to efficiently handle large volumes of data with minimal human intervention.
Advantages: highly efficient, time-saving, high speed, and it reduces errors.
Disadvantage: wastage of memory.
Data Integration
There are mainly two major approaches for data integration: the "tight coupling approach" and the "loose coupling approach".
Tight Coupling:
This approach involves creating a centralized repository or data warehouse to store the integrated
data. The data is extracted from various sources, transformed and loaded into a data warehouse.
Data is integrated in a tightly coupled manner, meaning that the data is integrated at a high level,
such as at the level of the entire dataset or schema. This approach is also known as data
warehousing, and it enables data consistency and integrity, but it can be inflexible and difficult to
change or update.
Here, a data warehouse is treated as an information retrieval component.
In this coupling, data is combined from different sources into a single physical location
through the process of ETL – Extraction, Transformation, and Loading.
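A minimal, self-contained sketch of this ETL flow in Python (the file name, column names, and SQLite "warehouse" below are hypothetical, purely for illustration):

import csv
import sqlite3

def extract(path):
    # Extraction: read raw rows from a source file (e.g. columns: name, amount).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transformation: standardize names and convert amounts to numbers.
    return [(r["name"].strip().title(), float(r["amount"])) for r in rows]

def load(rows, db="warehouse.db"):
    # Loading: write the cleaned rows into the central warehouse table.
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("source_sales.csv")))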
Loose Coupling:
This approach involves integrating data at the lowest level, such as at the level of individual data
elements or records. Data is integrated in a loosely coupled manner, meaning that the data is
integrated at a low level, and it allows data to be integrated without having to create a central
repository or data warehouse. This approach is also known as data federation, and it enables data
flexibility and easy updates, but it can be difficult to maintain consistency and integrity across
multiple data sources.
Here, an interface is provided that takes the query from the user, transforms it in a way the
source database can understand, and then sends the query directly to the source databases to
obtain the result.
And the data only remains in the actual source databases.
Issues in Data Integration:
The main issues are schema integration and object matching (the entity identification problem), redundancy, and the detection and resolution of data value conflicts.
Data Extraction
Data extraction is defined as the process of retrieving data from various sources. This step in the
data handling process involves gathering and converting different forms of data into a more usable
or accessible format. The primary goal of data extraction is to collect data from disparate sources
for further processing, analysis, or storage in a centralized location.
In essence, data extraction is a critical first step in the data journey, setting the stage for value
creation through data analysis and interpretation. By efficiently extracting relevant data,
organizations can unlock a wealth of opportunities for innovation, efficiency, and competitive
advantage.
Depending on the nature and format of the data, different extraction methods are employed to
efficiently retrieve valuable insights.
Each type of data extraction presents its own set of challenges and opportunities.
The extract, transform, and load (ETL) process is a cornerstone of data warehousing and business
intelligence. It involves extracting data from various sources, transforming it into a format suitable
for analysis, and loading it into a destination system, such as a data warehouse. Data extraction is
the first and arguably most critical step in this process, as it involves identifying and retrieving
relevant data from internal and external sources.
Data extraction fits into the ETL process as the foundational phase that determines the quality and
usability of the data being fed into the subsequent stages. Without effective data extraction, the
transform and load phases cannot perform optimally, potentially compromising the integrity and
value of the final dataset. This stage sets the tone for the efficiency of the entire ETL pipeline,
highlighting the importance of employing robust data extraction techniques and tools.
From a logical and physical standpoint, the projected amount of data to be extracted and the stage in
the ETL process (initial load or data maintenance) may also influence how to extract. Essentially,
you must decide how to conceptually and physically extract data.
Full Extraction
The data is pulled in full from the source system. There is no need to keep track of changes to the data source, because the extraction reflects all of the information currently stored in the source system.
The source data will be delivered as-is, with no additional logical information (such as timestamps) needed on the source site. An export file of a specific table and a remote SQL query scanning the entire source table are two examples of full extraction.
Incremental Extraction
Only data that has changed since a particular occurrence in the past will be extracted at a given
time. This event could be the end of the extraction process or a more complex business event such
as the last day of a fiscal period's bookings. To detect this delta change, there must be a way to
identify all the changed information since this precise time event.
This information can be provided by the source data itself, such as an application column indicating
the last-changed timestamp, or by a changing table in which a separate mechanism keeps track of
the modifications in addition to the originating transactions. Using the latter option, in most
situations, entails adding extraction logic to the source system.
As part of the extraction process, many data warehouses do not apply any change-capture
algorithms. Instead, full tables from source systems are extracted to the data warehouse or staging
area, and these tables are compared to a previous source system extract to detect the changed data.
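A small sketch (with a hypothetical orders table and last_updated column) contrasting the two logical methods: a full extraction pulls every row, while an incremental extraction pulls only rows changed since the last successful run:

import sqlite3

def full_extract(con):
    # Full extraction: every row, regardless of when it changed.
    return con.execute("SELECT * FROM orders").fetchall()

def incremental_extract(con, last_run_ts):
    # Incremental extraction: only rows changed since the previous run.
    return con.execute(
        "SELECT * FROM orders WHERE last_updated > ?", (last_run_ts,)
    ).fetchall()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, amount REAL, last_updated TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10.0, "2024-01-01"), (2, 25.5, "2024-03-15")])

print(len(full_extract(con)))                         # 2 rows
print(len(incremental_extract(con, "2024-02-01")))    # 1 row changed after Feb 1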
Physically extracting the data can be done in two ways, depending on the chosen logical extraction
method and the source site's capabilities and limits. The data can be extracted online from the
source system or offline from a database. An offline structure like this could already exist or be
created by an extraction routine.
Online Extraction
The information is taken directly from the source system. The extraction procedure can link directly
to the source system to access the source tables or connect to an intermediate system to store the
data in a predefined format (for example, snapshot logs or change tables). It's worth noting that the
intermediary system doesn't have to be physically distinct from the source system.
It would be best to evaluate whether the distributed transactions use source objects or prepared
source objects when using online extractions.
Offline Extraction
The data is staged intentionally outside the source system rather than extracted straight from it. The
data was either created by an extraction method or already had a structure (redo logs, archive logs,
or transportable tablespaces).
Such an offline structure can be one of the following:
Flat files: Files in a predefined, generic format. Additional information about the source object is required for further processing.
Dump files: An Oracle-specific format; information about the containing objects is included.
Redo and archive logs: The information is contained in a separate, supplemental dump file.
Transportable tablespaces: Tablespaces that can be moved between databases.
Data Transformation in Data Mining
Data transformation converts data into forms that are more appropriate for mining. Some of the strategies used include:
4. Attribute Construction: New attributes are constructed from the given set of attributes and applied to assist the mining process. This simplifies the original data and makes the mining more efficient (illustrated in the sketch after this list).
6. Normalization: Data normalization involves converting all data variables into a given range.
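A minimal sketch (with hypothetical attributes) illustrating attribute construction (a new "area" attribute derived from existing ones) and min-max normalization of that attribute into the range [0, 1]:

records = [
    {"length": 2.0, "width": 3.0},
    {"length": 4.0, "width": 5.0},
    {"length": 6.0, "width": 1.0},
]

# Attribute construction: derive a new attribute from the existing ones.
for r in records:
    r["area"] = r["length"] * r["width"]

# Min-max normalization: x' = (x - min) / (max - min) maps values into [0, 1].
values = [r["area"] for r in records]
lo, hi = min(values), max(values)
for r in records:
    r["area_norm"] = (r["area"] - lo) / (hi - lo)

print(records)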
Advantages of Data Transformation in Data Mining
1. Improves Data Quality: Data transformation helps to improve the quality of data by
removing errors, inconsistencies, and missing values.
2. Facilitates Data Integration: Data transformation enables the integration of data from
multiple sources, which can improve the accuracy and completeness of the data.
3. Improves Data Analysis: Data transformation helps to prepare the data for analysis and
modeling by normalizing, reducing dimensionality, and discretizing the data.
4. Increases Data Security: Data transformation can be used to mask sensitive data, or to
remove sensitive information from the data, which can help to increase data security.
5. Enhances Data Mining Algorithm Performance: Data transformation can improve the
performance of data mining algorithms by reducing the dimensionality of the data and scaling
the data to a common range of values.
Disadvantages of Data Transformation in Data Mining
1. Time-consuming: Data transformation can be a time-consuming process, especially when
dealing with large datasets.
2. Complexity: Data transformation can be a complex process, requiring specialized skills
and knowledge to implement and interpret the results.
3. Data Loss: Data transformation can result in data loss, such as when discretizing
continuous data, or when removing attributes or features from the data.
4. Biased transformation: Data transformation can result in bias, if the data is not properly
understood or used.
5. High cost: Data transformation can be an expensive process, requiring significant
investments in hardware, software, and personnel.
6. Overfitting: Data transformation can lead to overfitting, which is a common problem in
machine learning where a model learns the detail and noise in the training data to the extent
that it negatively impacts the performance of the model on new unseen data.