Unit-1.1-Introduction To Big Data
Ms. Yashi Rastogi
Assistant Professor,
School of Computing
DIT University
Benefits of Big Data
Big Data analytics tools can predict outcomes accurately, allowing businesses and
organizations to make better decisions while simultaneously optimizing their
operational efficiency and reducing risk.
Big Data provides insights into customer pain points and allows companies to
improve their products and services.
Big Data analytics helps companies generate more sales leads, which naturally
boosts revenue.
Big Data insights let you study customer behavior, understand customer trends,
and provide a highly personalized experience to customers.
Applications of Big Data
Healthcare
With the help of predictive analytics, medical professionals are able to provide personalized
healthcare services to individual patients. Apart from that, fitness wearables, telemedicine,
remote monitoring – all powered by Big Data and AI – are helping change lives for the better.
Academia
Education is no longer limited to the physical bounds of the classroom; there are numerous
online educational courses to learn from. Academic institutions are investing in digital
courses powered by Big Data technologies to aid the all-round development of budding
learners.
Banking
The banking sector relies on Big Data for fraud detection. Big Data tools can efficiently
detect fraudulent activity in real time, such as misuse of credit/debit cards, tampering
with inspection trails, and suspicious alterations to customer records.
Manufacturing
According to the TCS Global Trend Study, the most significant benefits of Big Data in
manufacturing are improved supply strategies and product quality. In the manufacturing
sector, Big Data helps create a transparent infrastructure and predict the uncertainties
and inefficiencies that can adversely affect the business.
IT
One of the largest users of Big Data, IT companies around the world are using Big Data
to optimize their functioning, enhance employee productivity, and minimize risks in
business operations. By combining Big Data technologies with ML and AI, the IT sector
is continually powering innovation to find solutions even for the most complex of
problems.
Transportation
Big Data Analytics holds immense value for the transportation industry. In countries
across the world, both private and government-run transportation companies use Big
Data technologies to optimize route planning, control traffic, manage road congestion,
and improve services.
Retail
Big Data has changed the way of working in traditional brick and mortar retail stores.
Over the years, retailers have collected vast amounts of data from local demographic
surveys, POS scanners, RFID, customer loyalty cards, store inventory, and so on. Now,
they’ve started to leverage this data to create personalized customer experiences,
boost sales, increase revenue, and deliver outstanding customer service.
Retailers even use smart sensors and Wi-Fi to track customer movement, identify the
most frequented aisles, and measure how long customers linger in each aisle, among
other things.
Big Data Case studies
Walmart leverages Big Data and Data Mining to create personalized
product recommendations for its customers. With the help of these two
emerging technologies, Walmart can uncover valuable patterns showing
the most frequently bought products, most popular products, and even the
most popular product bundles (products that complement each other and
are usually purchased together).
Based on these insights, Walmart creates attractive and customized
recommendations for individual users. By effectively implementing Data
Mining techniques, the retail giant has successfully increased the
conversion rates and improved its customer service substantially.
Furthermore, Walmart uses Hadoop and NoSQL technologies to allow
customers to access real-time data accumulated from disparate sources.
Uber is one of the major cab service providers in the world. It leverages
customer data to track and identify the most popular and most used
services by the users. Once this data is collected, Uber uses data
analytics to analyze the usage patterns of customers and determine
which services should be given more emphasis and importance.
Apart from this, Uber uses Big Data in another unique way: it closely studies the
demand and supply of its services and changes cab fares accordingly. This is its
surge pricing mechanism, which works roughly as follows: if you book a cab from a
crowded location at a time of high demand, Uber may charge you double the normal
fare.
Today, Netflix has become so vast that it is even creating unique content
for users. Data is the secret ingredient that fuels both its recommendation
engines and new content decisions. The most pivotal data points used by
Netflix include titles that users watch, user ratings, genres preferred, and
how often users stop the playback, to name a few. Hadoop, Hive, and Pig
are the three core components of the data infrastructure used by Netflix.
Characteristics of Big Data
Big data is characterized by the three Vs: volume, velocity, and variety. Additionally, some
definitions also include other characteristics such as veracity, value, and variability.
1. Volume: Refers to the sheer size of the data generated or collected. Big data involves massive
amounts of data that exceed the capacity of traditional database systems.
2. Velocity: Describes the speed at which data is generated, processed, and analyzed. Big data often
involves real-time or near-real-time data processing to extract meaningful insights quickly.
3. Variety: Encompasses the diverse types of data, including structured, semi-structured, and
unstructured data. Big data includes text, images, videos, social media posts, sensor data, and more.
4. Veracity: Relates to the quality and accuracy of the data. Big data sources may include imperfect or
uncertain data, and dealing with the reliability of the information becomes a significant challenge.
5. Value: The ultimate goal of big data is to derive value and actionable insights from the data.
Analyzing and interpreting large datasets should lead to meaningful business decisions, improved
processes, and innovation.
6. Variability: Refers to the inconsistency of the data flow. Data may come in different formats, structures,
and from various sources, making it challenging to handle and analyze.
7. Volatility: Describes the temporal nature of data. Big data can be dynamic, with patterns and trends
changing over time. The ability to adapt and analyze data changes is crucial in a big data
environment.
Types of Big Data
Following are the types of Big Data:
Structured
Unstructured
Semi-structured
According to Merrill Lynch, 80–90% of business data is either unstructured or semi-
structured.
Gartner also estimates that unstructured data constitutes 80% of the whole enterprise
data.
Structured Data
Any data that can be stored, accessed, and processed in a fixed format
is termed 'structured' data.
(For scale: 10²¹ bytes = 1 zettabyte, i.e., one billion terabytes.)
Data stored in a relational database management system is one
example of structured data.
Structured data is data in an organized form (e.g., in rows and
columns) that can be easily used by a computer program.
Relationships exist between entities of data, such as classes and their
objects.
Data stored in databases is an example of structured data.
(Figures omitted: sources of structured data; structured vs. semi-structured data; structured data retrieval.)
Structured Data Example
An ‘Employee’ table in a database is an example of Structured Data.
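To make this concrete, here is a minimal sketch of an 'Employee' table using Python's built-in sqlite3 module. The schema and the inserted rows are illustrative assumptions, not part of the original slide.

```python
import sqlite3

# A minimal sketch of structured data: an 'Employee' table with a fixed
# schema in a relational database (columns and rows are assumptions).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE Employee ("
    "emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL)"
)
cur.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?)",
    [(1, "Anil Dubey", "Finance", 55000.0),
     (2, "Sikha", "HR", 48000.0)],
)
# Because every row follows the same fixed format, the data can be
# queried directly with SQL.
for row in cur.execute("SELECT name, dept FROM Employee WHERE salary > 50000"):
    print(row)  # -> ('Anil Dubey', 'Finance')
conn.close()
```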
Unstructured Data
Any data with unknown form or structure is classified as unstructured
data. Typical examples include text documents, images, videos, and the
results of a web search.
Semi-structured Data
Semi-structured data contains elements of both forms. Personal records
represented in an XML file are a classic example:
<rec><name>Anil Dubey</name><sex>Male</sex><age>33</age></rec>
<rec><name>Rohit Rastogi</name><sex>Male</sex><age>43</age></rec>
<rec><name>Sikha</name><sex>Female</sex><age>31</age></rec>
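As a small illustration, the snippet below parses these records with Python's standard xml.etree.ElementTree module; the enclosing <recs> root element is added here only so the fragment is well-formed XML.

```python
import xml.etree.ElementTree as ET

# The records from the slide, wrapped in a root element (an addition
# needed to make the fragment well-formed XML).
xml_data = """<recs>
<rec><name>Anil Dubey</name><sex>Male</sex><age>33</age></rec>
<rec><name>Rohit Rastogi</name><sex>Male</sex><age>43</age></rec>
<rec><name>Sikha</name><sex>Female</sex><age>31</age></rec>
</recs>"""

root = ET.fromstring(xml_data)
for rec in root.findall("rec"):
    # Tags give the data some structure, but there is no rigid schema:
    # a record could add or drop fields without breaking the format.
    print(rec.findtext("name"), rec.findtext("sex"), rec.findtext("age"))
```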
Analytic Processes and Tools
Analytic processes and tools refer to a set of methods and techniques used in data
analysis to uncover insights and patterns in large amounts of data. The goal of these
processes and tools is to turn raw data into useful information that can be used to inform
decision-making, evaluate performance, and identify trends and relationships.
The four types of analytics each answer a different question:
• Descriptive analytics: What has happened?
• Diagnostic analytics: Why did it happen?
• Predictive analytics: What will happen?
• Prescriptive analytics: What is the solution?
Diagnostic Analytics
This process involves using data to identify the root cause of a problem.
This analytics is characterized by techniques such as drill-down, data
mining and data discovery.
Organizations go for diagnostic analytics as it gives in-depth insights
into a particular problem.
Use Case: An e-commerce company’s report shows that their sales
have gone down, although customers are adding products to their carts.
This can be due to various reasons like the form didn’t load correctly,
the shipping fee is too high, or there are not enough payment options
available. This is where you can use diagnostic analytics to find the
reason.
Prescriptive Analytics
This process involves using data and models to recommend actions and
decisions.
Prescriptive analytics works with both descriptive and predictive
analytics.
Most of the time, prescriptive analytics relies on machine learning and
artificial intelligence.
Business rules, algorithms and computational modelling procedures are
used in prescriptive analytics.
Use Case: Prescriptive analytics can be used to maximize an airline’s
profit. This type of analytics is used to build an algorithm that will
automatically adjust the flight fares based on numerous factors,
including customer demand, weather, destination, holiday seasons, and
oil prices.
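A toy sketch of this idea in Python is shown below. The factors, thresholds, and multipliers are invented for illustration only; they are not any airline's actual pricing logic.

```python
# A toy, rule-based sketch of prescriptive fare adjustment.
# All factors, thresholds, and multipliers are illustrative assumptions.
def adjust_fare(base_fare: float, demand: float, capacity: float,
                is_holiday: bool, fuel_price_index: float) -> float:
    multiplier = 1.0
    if capacity > 0 and demand / capacity > 0.8:  # high demand vs. supply
        multiplier += 0.5
    if is_holiday:                                # holiday-season peak
        multiplier += 0.2
    # Pass part of above-baseline fuel costs through to the fare.
    multiplier += 0.1 * max(0.0, fuel_price_index - 1.0)
    return round(base_fare * multiplier, 2)

print(adjust_fare(100.0, demand=95, capacity=100,
                  is_holiday=True, fuel_price_index=1.3))  # -> 173.0
```

In practice, the hand-written rules above would be replaced or tuned by machine learning models trained on historical demand data.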
The Lifecycle Phases of Big Data Analytics
• Stage 1 - Business case evaluation - The Big Data analytics lifecycle begins with a
business case, which defines the reason and goal behind the analysis.
• Stage 2 - Identification of data - Here, a broad variety of data sources are identified.
• Stage 3 - Data filtering - All of the identified data from the previous stage is filtered here
to remove corrupt data.
• Stage 4 - Data extraction - Data that is not compatible with the tool is extracted and then
transformed into a compatible form.
• Stage 5 - Data aggregation - In this stage, data with the same fields across different
datasets are integrated.
• Stage 6 - Data analysis - Data is evaluated using analytical and statistical tools to
discover useful information.
• Stage 7 - Visualization of data - With tools like Tableau, Power BI, and QlikView, Big Data
analysts can produce graphic visualizations of the analysis.
• Stage 8 - Final analysis result - This is the last step of the Big Data analytics lifecycle,
where the final results of the analysis are made available to business stakeholders who will
take action.
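To see how the middle stages fit together, here is a minimal pandas sketch on a toy in-memory dataset; the column names and values are illustrative assumptions.

```python
import pandas as pd

# Stage 2: identification of data - two sources (in-memory for this sketch).
sales = pd.DataFrame({"customer_id": [1, 2, 2, None],
                      "amount": [120.0, 80.0, 200.0, 50.0]})
customers = pd.DataFrame({"customer_id": [1, 2],
                          "region": ["North", "South"]})

# Stage 3: data filtering - remove corrupt records (missing customer_id).
sales = sales.dropna(subset=["customer_id"])

# Stage 5: data aggregation - integrate datasets on a shared field.
# (Stage 4, extraction/transformation, is trivial here because the data
# is already in a compatible form.)
merged = sales.merge(customers, on="customer_id")

# Stage 6: data analysis - a simple statistical summary per region.
summary = merged.groupby("region")["amount"].agg(["count", "mean", "sum"])
print(summary)  # Stages 7-8 would visualize and share these results.
```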
Big Data Analytics Tools
Here are some of the key big data analytics tools :
• Hadoop - helps in storing and analyzing data
• MongoDB - used on datasets that change frequently
• Talend - used for data integration and management
• Cassandra - a distributed database used to handle chunks of data
• Spark - used for real-time processing and analyzing large amounts of data
• Storm - an open-source, distributed real-time computation system
• Kafka - a distributed streaming platform that is used for fault-tolerant
storage
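As a brief illustration of one of these tools, here is a minimal PySpark sketch; it assumes a working Spark installation, and the file path and column names are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# A minimal Spark sketch; "sales.csv" and its columns are assumptions.
spark = SparkSession.builder.appName("big-data-demo").getOrCreate()

# Spark reads and processes the file in parallel across the cluster.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# A simple distributed aggregation: total sales amount per region.
df.groupBy("region").agg(F.sum("amount").alias("total")).show()

spark.stop()
```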
Analysis vs Reporting
Analysis and reporting are related but distinct activities that are commonly used in
business and other organizations.
Analysis refers to the process of examining and interpreting data in order to draw
conclusions, identify trends and make informed decisions. This involves the use of
statistical and mathematical techniques to evaluate data and identify patterns,
relationships and other meaningful insights.
Reporting, on the other hand, refers to the process of presenting the results of an
analysis in a clear and concise manner. Reports may include charts, graphs, tables, and
written summaries that present the findings of an analysis in an easily accessible format.
Reports are usually presented to stakeholders, such as managers, clients or customers and
are used to support decision-making, evaluate performance and measure progress.
In summary, analysis is the process of evaluating data to make informed decisions, while
reporting is the process of presenting the results of an analysis to stakeholders. Both are
critical to effective decision-making and to the effective communication of data and results.
Statistical Concepts: Sampling Distributions
The distribution of a statistic (such as the mean, median, or proportion) computed
from multiple random samples drawn from the same population.
• Sampling Distribution of the Mean: When you take multiple samples from a
population and calculate the mean of each sample, the distribution of those
sample means forms the sampling distribution of the mean.
• Central Limit Theorem (CLT): This theorem states that, for a sufficiently large
sample size, the sampling distribution of the sample mean will be approximately
normally distributed, regardless of the original population distribution. This is
crucial for making inferences about population parameters.
• Standard Error: The standard deviation of the sampling distribution is called the
standard error. It quantifies the variability of the sample mean from the true
population mean.
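The snippet below is a minimal NumPy simulation of these ideas: it draws repeated samples from a deliberately non-normal population and shows that the sample means are approximately normal, with spread close to the theoretical standard error.

```python
import numpy as np

rng = np.random.default_rng(42)
# Population: exponential, i.e., clearly non-normal.
population = rng.exponential(scale=2.0, size=100_000)

n = 50           # sample size
reps = 10_000    # number of repeated samples
sample_means = np.array([rng.choice(population, size=n).mean()
                         for _ in range(reps)])

# By the CLT, the sample means are approximately normal with
# mean ~ population mean and std ~ population std / sqrt(n).
print("population mean:         ", population.mean())
print("mean of sample means:    ", sample_means.mean())
print("empirical standard error:", sample_means.std())
print("theoretical standard err:", population.std() / np.sqrt(n))
```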
Statistical Concepts: Re-Sampling
Techniques used to repeatedly sample from a dataset to estimate the sampling distribution of a
statistic. Re-sampling methods are especially useful when theoretical distributions are complex or
unknown.
Techniques
• Bootstrapping:
• Process: Draw multiple samples with replacement from the original dataset. Each bootstrap sample is
the same size as the original dataset but may contain repeated observations.
• Use: Estimate the sampling distribution of a statistic (e.g., mean, median) and calculate confidence
intervals and standard errors.
• Example: To estimate the confidence interval for the mean income, you create thousands of bootstrap
samples, compute the mean for each, and then analyze the distribution of these means.
• Jackknife:
• Process: Systematically leave out one observation at a time from the dataset and recalculate the
statistic of interest.
• Use: Assess the variability and bias of a statistic by examining how the estimate changes when individual
data points are omitted.
• Example: To estimate the variance of the mean income, calculate the mean for each subset of the data
obtained by omitting one observation at a time, and analyze the variability.
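Both techniques can be written in a few lines of NumPy; in the sketch below the 'income' values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
incomes = np.array([32, 41, 38, 55, 47, 36, 62, 44, 39, 51], dtype=float)

# Bootstrapping: resample WITH replacement, same size as the original,
# and use the distribution of resampled means for a confidence interval.
boot_means = np.array([rng.choice(incomes, size=incomes.size, replace=True).mean()
                       for _ in range(10_000)])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap 95% CI for the mean income: ({lo:.1f}, {hi:.1f})")

# Jackknife: leave one observation out at a time and recompute the mean.
n = incomes.size
jack_means = np.array([np.delete(incomes, i).mean() for i in range(n)])
jack_var = (n - 1) / n * np.sum((jack_means - jack_means.mean()) ** 2)
print(f"jackknife estimate of the variance of the mean: {jack_var:.3f}")
```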
Statistical Concepts: Statistical Inference
The process of drawing conclusions about a population based on sample data. It includes estimation and
hypothesis testing.
Components
• Estimation:
• Point Estimation: Provides a single value estimate for a population parameter (e.g., sample mean as an estimate
of the population mean).
• Interval Estimation: Provides a range within which the population parameter is expected to lie with a certain
level of confidence (e.g., 95% confidence interval).
• Example: In big data analytics, you might use sample data to estimate the average customer spending and
provide a confidence interval around that estimate.
• Hypothesis Testing:
• Null Hypothesis (H₀): A statement of no effect or no difference. It is tested against an alternative hypothesis
(H₁).
• P-Value: The probability of observing the sample data, or something more extreme, assuming the null hypothesis
is true. A small p-value indicates strong evidence against the null hypothesis.
• Type I Error (α): Rejecting the null hypothesis when it is true (false positive).
• Type II Error (β): Failing to reject the null hypothesis when it is false (false negative).
• Example: In a big data context, hypothesis testing might be used to determine whether a new marketing strategy
leads to a statistically significant change in sales.
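A minimal SciPy sketch of both components follows; the 'before'/'after' sales figures are simulated, illustrative data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Illustrative assumption: weekly sales before and after a new strategy.
before = rng.normal(loc=100, scale=10, size=200)
after = rng.normal(loc=103, scale=10, size=200)

# Interval estimation: 95% confidence interval for the mean of 'after'.
mean, sem = after.mean(), stats.sem(after)
ci = stats.t.interval(0.95, df=after.size - 1, loc=mean, scale=sem)
print(f"point estimate: {mean:.2f}, 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")

# Hypothesis testing: H0 = the two means are equal (two-sample t-test).
t_stat, p_value = stats.ttest_ind(before, after)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g., < 0.05) is evidence against H0.
```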
Statistical Concepts: Prediction Error
The discrepancy between the actual values and the values predicted by a model. It measures the accuracy
of the model’s predictions.
Metrics
• Mean Absolute Error (MAE):
• Formula: $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert$
• Use: Measures the average magnitude of errors in predictions, without considering their direction.
• Mean Squared Error (MSE):
• Formula: $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
• Use: Measures the average of the squares of the errors, giving more weight to larger errors.
• Root Mean Squared Error (RMSE):
• Formula: $\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$
• Use: Provides error metrics in the same units as the data, making it easier to interpret.
• R-Squared ($R^2$):
• Formula: $R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$
• Use: Represents the proportion of variance in the dependent variable that is predictable from the independent
variables. Higher values indicate better model fit.
Here $y_i$ are the actual values, $\hat{y}_i$ the predicted values, $\bar{y}$ the mean of the actual values, and $n$ the number of observations.
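Each metric is a one-liner in NumPy; the actual and predicted values below are illustrative assumptions.

```python
import numpy as np

y = np.array([3.0, 5.0, 2.5, 7.0])      # actual values (assumed)
yhat = np.array([2.5, 5.0, 4.0, 8.0])   # model predictions (assumed)

errors = y - yhat
mae = np.mean(np.abs(errors))                               # MAE
mse = np.mean(errors ** 2)                                  # MSE
rmse = np.sqrt(mse)                                         # RMSE
r2 = 1 - np.sum(errors ** 2) / np.sum((y - y.mean()) ** 2)  # R-squared

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}")
```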
THANK YOU