Unit-1.1: Introduction to Big Data

Introduction to Big Data
Ms. Yashi Rastogi
Assistant Professor, School of Computing
DIT University
Basic: Data
 The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
 Data can be defined as figures or facts that can be stored in or used by a computer.
Basic: Big Data
 Data that is very large in size is called Big Data.
 Normally we work with data of MB (Word docs, Excel sheets) or at most GB (movies, code) in size, but data on the order of petabytes (10^15 bytes) is called Big Data.
 It is stated that almost 90% of today's data has been generated in the past 3 years.
 Big Data is a collection of data that is huge in volume, yet growing exponentially with time.
 It is data of such large size and complexity that no traditional data management tool can store or process it efficiently.
 "Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." (Gartner)
Example of Big Data
 The New York Stock Exchange is an example of Big Data: it generates about one terabyte of new trade data per day.
 Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated by photo and video uploads, message exchanges, posted comments, etc.
 A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
Sources of Big Data
 Social networking sites: Facebook, Google, and LinkedIn all generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.
 E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge volumes of logs from which users' buying trends can be traced.
 Weather stations: Weather stations and satellites produce very large volumes of data, which are stored and processed to forecast the weather.
 Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly; for this they store the data of their millions of users.
 Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
Advantages of Big Data
 Big Data analytics tools can predict outcomes accurately, allowing businesses and organizations to make better decisions while simultaneously optimizing their operational efficiency and reducing risk.
 Big Data provides insights into customer pain points and allows companies to improve their products and services.
 Big Data analytics can help companies generate more sales leads, which naturally means a boost in revenue.
 Big Data insights let you learn customer behavior, understand customer trends, and provide a highly personalized experience to each customer.
Applications of Big Data
Healthcare
 With the help of predictive analytics, medical professionals can provide personalized healthcare services to individual patients. Apart from that, fitness wearables, telemedicine, and remote monitoring – all powered by Big Data and AI – are helping change lives for the better.
Academia
 Education is no longer limited to the physical bounds of the classroom – there are numerous online educational courses to learn from. Academic institutions are investing in digital courses powered by Big Data technologies to aid the all-round development of budding learners.
Banking
 The banking sector relies on Big Data for fraud detection. Big Data tools can efficiently detect fraudulent acts in real time, such as misuse of credit/debit cards, archival of inspection tracks, faulty alteration of customer stats, etc.
Manufacturing
 According to the TCS Global Trend Study, the most significant benefit of Big Data in manufacturing is improving supply strategies and product quality. In the manufacturing sector, Big Data helps create a transparent infrastructure, predicting uncertainties and incompetencies that can affect the business adversely.
Contd.
IT
 IT companies, among the largest users of Big Data, use it to optimize their functioning, enhance employee productivity, and minimize risks in business operations. By combining Big Data technologies with ML and AI, the IT sector is continually powering innovation to find solutions for even the most complex problems.
Transportation
 Big Data analytics holds immense value for the transportation industry. In countries across the world, both private and government-run transportation companies use Big Data technologies to optimize route planning, control traffic, manage road congestion, and improve services.
Retail
 Big Data has changed the way traditional brick-and-mortar retail stores work. Over the years, retailers have collected vast amounts of data from local demographic surveys, POS scanners, RFID, customer loyalty cards, store inventory, and so on. Now they have started to leverage this data to create personalized customer experiences, boost sales, increase revenue, and deliver outstanding customer service.
 Retailers are even using smart sensors and Wi-Fi to track the movement of customers, the most frequented aisles, how long customers linger in the aisles, and more.
Big Data Case studies
 Walmart leverages Big Data and Data Mining to create personalized
product recommendations for its customers. With the help of these two
emerging technologies, Walmart can uncover valuable patterns showing
the most frequently bought products, most popular products, and even the
most popular product bundles (products that complement each other and
are usually purchased together).
 Based on these insights, Walmart creates attractive and customized recommendations for individual users. By effectively implementing Data Mining techniques, the retail giant has successfully increased its conversion rates and substantially improved its customer service. Furthermore, Walmart uses Hadoop and NoSQL technologies to let customers access real-time data accumulated from disparate sources.
Contd.
 Uber is one of the major cab service providers in the world. It leverages customer data to track and identify the most popular and most used services. Once this data is collected, Uber uses data analytics to analyze customers' usage patterns and determine which services should be given more emphasis and importance.
 Apart from this, Uber uses Big Data in another unique way. Uber closely studies the demand and supply of its services and changes cab fares accordingly. This is its surge pricing mechanism, which works something like this: when you are in a hurry and have to book a cab from a crowded location, Uber may charge you double the normal fare.
Contd.
 Netflix is one of the most popular on-demand online video streaming platforms, used by people around the world. Netflix is a major proponent of the recommendation engine. It collects customer data to understand the specific needs, preferences, and taste patterns of users, then uses this data to predict what individual users will like and to create personalized content recommendation lists for them.
 Today, Netflix has become so vast that it even creates unique content for users. Data is the secret ingredient that fuels both its recommendation engines and its new content decisions. The most pivotal data points used by Netflix include the titles users watch, user ratings, preferred genres, and how often users stop playback, to name a few. Hadoop, Hive, and Pig are the three core components of the data infrastructure used by Netflix.
Characteristics of Big Data
 Big data is characterized by the three Vs: volume, velocity, and variety. Additionally, some
definitions also include other characteristics such as veracity, value, and variability.
1. Volume: Refers to the sheer size of the data generated or collected. Big data involves massive
amounts of data that exceed the capacity of traditional database systems.
2. Velocity: Describes the speed at which data is generated, processed, and analyzed. Big data often
involves real-time or near-real-time data processing to extract meaningful insights quickly.
3. Variety: Encompasses the diverse types of data, including structured, semi-structured, and
unstructured data. Big data includes text, images, videos, social media posts, sensor data, and more.
4. Veracity: Relates to the quality and accuracy of the data. Big data sources may include imperfect or
uncertain data, and dealing with the reliability of the information becomes a significant challenge.
5. Value: The ultimate goal of big data is to derive value and actionable insights from the data.
Analyzing and interpreting large datasets should lead to meaningful business decisions, improved
processes, and innovation.
6. Variability: Refers to the inconsistency of the data flow. Data may come in different formats, structures,
and from various sources, making it challenging to handle and analyze.
7. Volatility: Describes the temporal nature of data. Big data can be dynamic, with patterns and trends
changing over time. The ability to adapt and analyze data changes is crucial in a big data
environment.
Types of Big Data
 Following are the types of Big Data:
 Structured
 Unstructured
 Semi-structured
 According to Merrill Lynch, 80–90% of business data is either unstructured or semi-
structured.
 Gartner also estimates that unstructured data constitutes 80% of the whole enterprise
data.
Structured Data
 Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.
 (For scale: 10^21 bytes = 1 zettabyte, i.e., one billion terabytes.)
 Data stored in a relational database management system is one example of 'structured' data.
 Structured data is data in an organized form (e.g., in rows and columns) that can be easily used by a computer program.
 Relationships exist between entities of the data, such as classes and their objects.
 Data stored in databases is an example of structured data.
Contd.
Structured Data Come from…
Structured vs. Semi-structured Data
Structured Data
Structured Data Retrieval
Structured Data Example
An ‘Employee’ table in a database is an example of Structured Data.
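To make this concrete, here is a minimal sketch (not from the slides) of structured data in practice: a fixed-schema Employee table created and queried with Python's built-in sqlite3 module. The table name, columns, and rows are illustrative assumptions.

```python
import sqlite3

# In-memory database: every row of the table follows one fixed schema.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE Employee (
        emp_id INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        dept   TEXT,
        salary REAL
    )
""")
cur.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?)",
    [(1, "Anil Dubey", "Sales", 52000.0),
     (2, "Sikha", "Finance", 61000.0)],
)

# Because the format is fixed and known, the data can be queried directly.
for row in cur.execute("SELECT name, salary FROM Employee WHERE salary > 55000"):
    print(row)  # ('Sikha', 61000.0)
```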
Unstructured Data
 Any data with unknown form or structure is classified as unstructured data.
 In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value from it.
 A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc.
 Unstructured data refers to data that lacks any specific form or structure whatsoever.
 This makes it very difficult and time-consuming to process and analyze unstructured data.
 Email is an example of unstructured data.
Unstructured data
Unstructured data come from…
Store Unstructured data
Contd.
Extract information from Unstructured data
Contd.
Unstructured Data: Example
 The output returned by ‘Google Search’
Semi-structured Data
 Semi-structured data can contain both forms of data.
 We can see semi-structured data as structured in form, but it is not actually defined by, for example, a table definition in a relational DBMS.
 An example of semi-structured data is data represented in an XML file.
 Semi-structured data pertains to data containing both formats: structured and unstructured data.
 To be precise, it refers to data that, although not classified under a particular repository (database), contains vital information or tags that segregate individual elements within the data.
Semi-structured data
Semi-structured data come from…
Manage Semi-structured Data
Store Semi-structured Data
Contd.
Extract Information from Semi-structured data
Contd.
Semi-structured Data example
 Personal data stored in an XML file:

<rec><name>Anil Dubey</name><sex>Male</sex><age>33</age></rec>
<rec><name>Rohit Rastogi</name><sex>Male</sex><age>43</age></rec>
<rec><name>Sikha</name><sex>Female</sex><age>31</age></rec>
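As a minimal sketch of how such semi-structured records are consumed, the snippet below parses the XML above with Python's standard xml.etree.ElementTree module. The enclosing <records> root element is an assumption added here, since well-formed XML needs a single root.

```python
import xml.etree.ElementTree as ET

# The <records> root is an assumption added so the document is well-formed XML.
xml_data = """
<records>
  <rec><name>Anil Dubey</name><sex>Male</sex><age>33</age></rec>
  <rec><name>Rohit Rastogi</name><sex>Male</sex><age>43</age></rec>
  <rec><name>Sikha</name><sex>Female</sex><age>31</age></rec>
</records>
"""

root = ET.fromstring(xml_data)
for rec in root.findall("rec"):
    # Tags travel with the data, so each field is self-describing.
    print(rec.findtext("name"), rec.findtext("sex"), rec.findtext("age"))
```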
Analytic Processes and Tools
 Analytic processes and tools refers to the set of methods and techniques used in data analysis to uncover insights and patterns in large amounts of data. The goal of these processes and tools is to turn raw data into useful information that can be used to inform decision-making, evaluate performance, and identify trends and relationships.

Big data analytics comprises four types, each answering a different question:
• Descriptive Analytics: What has happened?
• Diagnostic Analytics: Why did it happen?
• Predictive Analytics: What will happen?
• Prescriptive Analytics: What is the solution?
Descriptive Analytics
 This process involves summarizing and describing data to help in understanding it.
 This helps in creating reports on, for example, a company's revenue, profit, and sales.
 Tabulation of social media metrics, like Facebook likes and tweets, is done using descriptive analytics.
 Use Case: The Dow Chemical Company analyzed its past data to increase facility utilization across its office and lab space. Using descriptive analytics, Dow was able to identify underutilized space. This space consolidation helped the company save nearly US $4 million annually.
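A minimal sketch of descriptive analytics, assuming pandas and an invented monthly sales table: summary statistics answer "what has happened?".

```python
import pandas as pd

# Invented monthly sales figures.
sales = pd.DataFrame({
    "month":   ["Jan", "Feb", "Mar", "Apr"],
    "revenue": [120_000, 135_000, 128_000, 150_000],
    "units":   [340, 372, 355, 410],
})

# Descriptive analytics: summarize what has already happened.
print(sales[["revenue", "units"]].describe())  # count, mean, std, min, quartiles, max
print("Total revenue:", sales["revenue"].sum())
```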
Diagnostic Analytics
 This process involves using data to identify the root cause of a problem.
 This type of analytics is characterized by techniques such as drill-down, data mining, and data discovery.
 Organizations use diagnostic analytics because it gives in-depth insight into a particular problem.
 Use Case: An e-commerce company's report shows that sales have gone down even though customers are adding products to their carts. This can be due to various reasons: the form didn't load correctly, the shipping fee is too high, or there are not enough payment options available. This is where you can use diagnostic analytics to find the reason. A drill-down over such an event log might look like the sketch below.
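A minimal sketch, assuming pandas and an invented event log with hypothetical reason codes:

```python
import pandas as pd

# Invented event log: every session added to cart; only some purchased.
events = pd.DataFrame({
    "session":       [1, 2, 3, 4, 5, 6],
    "added_to_cart": [True, True, True, True, True, True],
    "purchased":     [False, False, True, False, False, True],
    "drop_reason":   ["high_shipping", "form_error", None,
                      "high_shipping", "no_payment_option", None],
})

# Drill down: among abandoned carts, which hypothetical reasons dominate?
abandoned = events[events["added_to_cart"] & ~events["purchased"]]
print(abandoned["drop_reason"].value_counts())
# high_shipping        2
# form_error           1
# no_payment_option    1
```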
Predictive Analytics
 This process involves using statistical algorithms and models to predict future events based on past data.
 This type of analytics uses data mining, artificial intelligence, and machine learning to analyze current data and make predictions about the future.
 It works on predicting customer trends, market trends, and so on. This analysis works on probability.
 Use Case: PayPal determines what kind of precautions it has to take to protect its clients against fraudulent transactions. Using predictive analytics, the company takes all the historical payment data and user behavior data and builds an algorithm that predicts fraudulent activities.
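A minimal predictive-analytics sketch in the spirit of the fraud example, using scikit-learn on synthetic data; the features, the labeling rule, and the thresholds are all invented assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
amount = rng.exponential(100.0, n)   # hypothetical transaction amounts
foreign = rng.integers(0, 2, n)      # hypothetical foreign-transaction flag
# Invented labeling rule: large foreign transactions are fraudulent.
fraud = ((amount > 250) & (foreign == 1)).astype(int)

X = np.column_stack([amount, foreign])
X_tr, X_te, y_tr, y_te = train_test_split(X, fraud, random_state=0)

# Fit on "historical" data, then score unseen transactions.
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("Holdout accuracy:", model.score(X_te, y_te))
print("Fraud probability of a 400-unit foreign transaction:",
      model.predict_proba([[400.0, 1]])[0, 1])
```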
Prescriptive Analytics
 This process involves using data and models to recommend actions and decisions.
 Prescriptive analytics works with both descriptive and predictive analytics.
 Most of the time, prescriptive analytics relies on machine learning and artificial intelligence.
 Business rules, algorithms, and computational modelling procedures are used in prescriptive analytics.
 Use Case: Prescriptive analytics can be used to maximize an airline's profit. This type of analytics is used to build an algorithm that automatically adjusts flight fares based on numerous factors, including customer demand, weather, destination, holiday seasons, and oil prices.
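A minimal prescriptive sketch: a rule-based fare recommendation. The thresholds and multipliers are invented assumptions; a real system would learn them from data.

```python
# Hypothetical rule-based fare recommendation; thresholds are assumptions.
def recommend_fare(base_fare: float, demand: int, supply: int,
                   is_holiday: bool) -> float:
    multiplier = 1.0
    if supply > 0 and demand / supply > 1.5:  # demand far outstrips supply
        multiplier *= 1.8
    if is_holiday:
        multiplier *= 1.2
    return round(base_fare * multiplier, 2)

# The output is an action (a price to set), not just a prediction.
print(recommend_fare(base_fare=100.0, demand=90, supply=40, is_holiday=True))
# 216.0
```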
The Lifecycle Phases of Big Data Analytics

• Stage 1 - Business case evaluation - The Big Data analytics lifecycle begins with a
business case, which defines the reason and goal behind the analysis.
• Stage 2 - Identification of data - Here, a broad variety of data sources are identified.
• Stage 3 - Data filtering - All of the identified data from the previous stage is filtered here
to remove corrupt data.
• Stage 4 - Data extraction - Data that is not compatible with the tool is extracted and then
transformed into a compatible form.
• Stage 5 - Data aggregation - In this stage, data with the same fields across different
datasets are integrated.
• Stage 6 - Data analysis - Data is evaluated using analytical and statistical tools to
discover useful information.
• Stage 7 - Visualization of data - With tools like Tableau, Power BI, and QlikView, Big Data
analysts can produce graphic visualizations of the analysis.
• Stage 8 - Final analysis result - This is the last step of the Big Data analytics lifecycle,
where the final results of the analysis are made available to business stakeholders who will
take action.
Big Data Analytics Tools
 Here are some of the key big data analytics tools:
• Hadoop - helps in storing and analyzing data
• MongoDB - used on datasets that change frequently
• Talend - used for data integration and management
• Cassandra - a distributed database used to handle large volumes of data
• Spark - used for real-time processing and analyzing large amounts of data
• Storm - an open-source real-time computational system
• Kafka - a distributed streaming platform used for fault-tolerant storage
Analysis vs Reporting
 Analysis and reporting are related but distinct activities that are commonly used in
business and other organizations.
 Analysis refers to the process of examining and interpreting data in order to draw
conclusions, identify trends and make informed decisions. This involves the use of
statistical and mathematical techniques to evaluate data and identify patterns,
relationships and other meaningful insights.
 Reporting, on the other hand, refers to the process of presenting the results of an
analysis in a clear and concise manner. Reports may include charts, graphs, tables, and
written summaries that present the findings of an analysis in an easily accessible format.
Reports are usually presented to stakeholders, such as managers, clients or customers and
are used to support decision-making, evaluate performance and measure progress.
 In summary, analysis is the process of evaluating data to make informed decisions, while reporting is the process of presenting the results of an analysis to stakeholders. Both are critical to effective decision-making and to the effective communication of data and results.
Statistical Concepts: Sampling Distributions
 The distribution of a statistic (such as the mean, median, or proportion) computed
from multiple random samples drawn from the same population.
• Sampling Distribution of the Mean: When you take multiple samples from a
population and calculate the mean of each sample, the distribution of those
sample means forms the sampling distribution of the mean.
• Central Limit Theorem (CLT): This theorem states that, for a sufficiently large
sample size, the sampling distribution of the sample mean will be approximately
normally distributed, regardless of the original population distribution. This is
crucial for making inferences about population parameters.
• Standard Error: The standard deviation of the sampling distribution is called the
standard error. It quantifies the variability of the sample mean from the true
population mean.
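A minimal NumPy simulation of these ideas: draw many samples from a skewed (exponential) population, and the distribution of sample means comes out approximately normal around the population mean, with spread matching the theoretical standard error. The sample size and trial count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 50, 10_000  # arbitrary sample size and number of repetitions

# Exponential(scale=1) population: mean 1, standard deviation 1, heavily skewed.
samples = rng.exponential(scale=1.0, size=(trials, n))
sample_means = samples.mean(axis=1)  # one mean per sample

print("Mean of the sample means:", sample_means.mean())   # close to 1.0
print("Observed standard error: ", sample_means.std())    # close to theory
print("Theoretical standard error (sigma/sqrt(n)):", 1.0 / np.sqrt(n))
```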
Statistical Concepts: Re-Sampling
 Techniques used to repeatedly sample from a dataset to estimate the sampling distribution of a
statistic. Re-sampling methods are especially useful when theoretical distributions are complex or
unknown.
 Techniques
• Bootstrapping:
• Process: Draw multiple samples with replacement from the original dataset. Each bootstrap sample is
the same size as the original dataset but may contain repeated observations.
• Use: Estimate the sampling distribution of a statistic (e.g., mean, median) and calculate confidence
intervals and standard errors.
• Example: To estimate the confidence interval for the mean income, you create thousands of bootstrap
samples, compute the mean for each, and then analyze the distribution of these means.
• Jackknife:
• Process: Systematically leave out one observation at a time from the dataset and recalculate the
statistic of interest.
• Use: Assess the variability and bias of a statistic by examining how the estimate changes when individual
data points are omitted.
• Example: To estimate the variance of the mean income, calculate the mean for each subset of the data
obtained by omitting one observation at a time, and analyze the variability.
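A minimal sketch of both techniques for the mean of a small, invented income sample:

```python
import numpy as np

rng = np.random.default_rng(7)
incomes = np.array([32, 45, 51, 38, 60, 41, 55, 47, 39, 70], dtype=float)
n = incomes.size

# Bootstrap: resample WITH replacement, same size as the original data.
boot_means = np.array([rng.choice(incomes, size=n, replace=True).mean()
                       for _ in range(5_000)])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"Bootstrap 95% CI for the mean: ({lo:.1f}, {hi:.1f})")

# Jackknife: leave one observation out at a time and recompute the mean.
jack_means = np.array([np.delete(incomes, i).mean() for i in range(n)])
jack_var = (n - 1) / n * ((jack_means - jack_means.mean()) ** 2).sum()
print("Jackknife variance of the mean:", jack_var)
```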
Statistical Concepts: Statistical Inference
 The process of drawing conclusions about a population based on sample data. It includes estimation and
hypothesis testing.
 Components
• Estimation:
• Point Estimation: Provides a single value estimate for a population parameter (e.g., sample mean as an estimate
of the population mean).
• Interval Estimation: Provides a range within which the population parameter is expected to lie with a certain
level of confidence (e.g., 95% confidence interval).
• Example: In big data analytics, you might use sample data to estimate the average customer spending and
provide a confidence interval around that estimate.
• Hypothesis Testing:
• Null Hypothesis (H₀): A statement of no effect or no difference. It is tested against an alternative hypothesis
(H₁).
• P-Value: The probability of observing the sample data, or something more extreme, assuming the null hypothesis
is true. A small p-value indicates strong evidence against the null hypothesis.
• Type I Error (α): Rejecting the null hypothesis when it is true (false positive).
• Type II Error (β): Failing to reject the null hypothesis when it is false (false negative).
• Example: In a big data context, hypothesis testing might be used to determine whether a new marketing strategy significantly changes average sales; a minimal sketch follows below.
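A minimal sketch of such a test, assuming SciPy and synthetic order values for the old and new strategies:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(50, 10, 200)  # synthetic order values, old strategy
treated = rng.normal(53, 10, 200)  # synthetic order values, new strategy

# H0: equal means; H1: the means differ.
t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0 at the 5% level: the strategies appear to differ.")
else:
    print("Fail to reject H0: no significant difference detected.")
```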
Statistical Concepts: Prediction Error
 The discrepancy between the actual values and the values predicted by a model. It measures the accuracy
of the model’s predictions.
 Metrics (with $y_i$ the actual values, $\hat{y}_i$ the predictions, $\bar{y}$ the mean of the actuals, and $n$ the number of observations):
• Mean Absolute Error (MAE):
• Formula: $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$
• Use: Measures the average magnitude of errors in predictions, without considering their direction.
• Mean Squared Error (MSE):
• Formula: $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
• Use: Measures the average of the squares of the errors, giving more weight to larger errors.
• Root Mean Squared Error (RMSE):
• Formula: $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$
• Use: Provides error metrics in the same units as the data, making them easier to interpret.
• R-Squared (R²):
• Formula: $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$
• Use: Represents the proportion of variance in the dependent variable that is predictable from the independent variables. Higher values indicate better model fit.
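A minimal sketch computing all four metrics by hand with NumPy, on made-up actual and predicted values:

```python
import numpy as np

y     = np.array([3.0, 5.0, 2.5, 7.0])  # made-up actual values
y_hat = np.array([2.5, 5.0, 4.0, 8.0])  # made-up predictions

errors = y - y_hat
mae  = np.abs(errors).mean()                                  # average |error|
mse  = (errors ** 2).mean()                                   # average squared error
rmse = np.sqrt(mse)                                           # back in data units
r2   = 1 - (errors ** 2).sum() / ((y - y.mean()) ** 2).sum()  # variance explained

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}")
```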
THANK YOU
