What Is Big Data? Characteristics of Big Data and Significance

1. What is big data? Characteristics of big data and significance.

Ans. “Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
Characteristics of Big Data

Back in 2001, Gartner analyst Doug Laney listed the 3 ‘V’s of Big Data – Variety, Velocity, and Volume.

Together, these three characteristics capture what big data is. Let’s look at each of them in depth:

1) Variety

Variety refers to the structured, unstructured, and semi-structured data that is gathered from multiple sources. While in the past data was collected mainly from spreadsheets and databases, today it arrives in an array of forms such as emails, PDFs, photos, videos, audio files, social media posts, and much more. Variety is one of the important characteristics of big data.

2) Velocity

Velocity refers to the speed at which data is being created in real time. In a broader perspective, it comprises the rate of change, the linking of incoming data sets arriving at varying speeds, and bursts of activity.

3) Volume
Volume is one of the characteristics of big data. Big Data indicates the huge ‘volumes’ of data being generated on a daily basis from sources such as social media platforms, business processes, machines, networks, and human interactions. Such large amounts of data are stored in data warehouses. This concludes the characteristics of big data.
The importance of big data does not revolve around how much data a company has but around how the company utilises the collected data. Every company uses data in its own way; the more efficiently a company uses its data, the more potential it has to grow. A company can take data from any source and analyse it to realise the benefits described below.

Significance of big data

1. Cost Savings: Big Data tools such as Hadoop and cloud-based analytics can bring cost advantages to a business when large amounts of data are to be stored, and these tools also help in identifying more efficient ways of doing business.
2. Time Reductions: The high speed of tools like Hadoop and in-memory analytics makes it easy to identify new sources of data, which helps businesses analyze data immediately and make quick decisions based on the learnings.
3. Understand market conditions: By analyzing big data you can get a better understanding of current market conditions. For example, by analyzing customers’ purchasing behavior, a company can find out which products sell the most and produce products according to this trend. By doing this, it can get ahead of its competitors.
4. Control online reputation: Big data tools can do sentiment analysis. Therefore, you can get feedback about who is saying what about your company. If you want to monitor and improve the online presence of your business, big data tools can help with all of this.
5. Using Big Data Analytics to Boost Customer Acquisition and
Retention

The customer is the most important asset any business depends on. No business can claim success without first establishing a solid customer base. However, even with a customer base, a business cannot afford to ignore the intense competition it faces. If a business is slow to learn what customers are looking for, it is very easy to end up offering poor-quality products. The eventual result is loss of clientele, which has an adverse effect on overall business success. The use of big data allows businesses to observe various customer-related patterns and trends, and observing customer behaviour is important for triggering loyalty.

6. Using Big Data Analytics to Solve Advertisers’ Problems and Offer Marketing Insights

Big data analytics can help change all business operations. This includes the ability to match customer expectations, change the company’s product line, and ensure that marketing campaigns are powerful.

7. Big Data Analytics as a Driver of Innovation and Product Development

Another huge advantage of big data is its ability to help companies innovate and redevelop their products.
2. What is data analytics? What are the different types of analytics? Explain in brief with examples (advantages and disadvantages).
Ans. The four types of analytics are:
1. Descriptive analytics
2. Diagnostic analytics
3. Predictive analytics
4. Prescriptive analytics

1. Descriptive analytics: Descriptive analytics is introductory, retrospective, and answers the question “What happened?” It accounts for roughly 80 percent of business analytics today, making it the most common type of data analysis.

Example of descriptive analytics

Let’s say website traffic numbers fell just short of their goal in 2018. That’s reason enough to run a descriptive analysis to see what went wrong.
The analysis tells us:

● Website traffic fell drastically in Q3.
● It picked back up in early Q4.
● It remained steady through the rest of the year.
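
As a concrete illustration of answering “what happened?”, here is a minimal descriptive-analytics sketch in Python using pandas; the monthly session counts are invented for the example.

```python
# A minimal descriptive-analytics sketch: summarise hypothetical 2018
# monthly website sessions by quarter. All figures are invented.
import pandas as pd

traffic = pd.DataFrame({
    "month": pd.date_range("2018-01-01", periods=12, freq="MS"),
    "sessions": [52_000, 54_500, 53_800, 55_200, 56_100, 55_700,
                 41_300, 39_800, 40_200, 49_500, 51_000, 51_400],
})

# "What happened?" -- aggregate sessions by quarter.
by_quarter = (traffic
              .assign(quarter=traffic["month"].dt.quarter)
              .groupby("quarter")["sessions"].sum())
print(by_quarter)

# The drop in Q3 relative to Q2 is exactly the kind of fact a
# descriptive analysis surfaces.
print("Q3 vs Q2 change: {:.1%}".format(by_quarter[3] / by_quarter[2] - 1))
```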

2. Diagnostic analytics: Diagnostic analytics is retrospective as well, but instead it seeks to explain why the problem identified in the descriptive analysis occurred.

Example of diagnostic analytics

Using our previous example, we now understand where the problem occurred, but exactly why did website traffic plummet so sharply?

The analysis tells us:

● Website traffic fell during a search engine algorithm update.
● There was a 25 percent decrease in published web content.
● A record number of backlinks was lost in Q3.
3. Predictive analytics: Predictive analytics, unlike the previous two analyses, looks ahead to the future and is a bit more proactive with its findings. It attempts to forecast what is likely to happen next, and is one half of what is considered “advanced analytics.”

Example of predictive analytics

The diagnostic analysis showed us a variety of issues; now it’s time to predict next steps so that an accurate website traffic estimate can be generated for the next few quarters. A sketch of how such an estimate might be produced is shown below.
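
A minimal sketch: fit a linear trend with NumPy to hypothetical quarterly figures and extrapolate. A real forecast would use richer models and more features.

```python
# A minimal predictive-analytics sketch: fit a linear trend to past
# quarterly traffic and extrapolate the next two quarters. The history
# is hypothetical.
import numpy as np

quarters = np.arange(1, 9)                      # Q1 2017 .. Q4 2018
sessions = np.array([148, 151, 139, 155,
                     160, 167, 121, 152], dtype=float)  # in thousands

slope, intercept = np.polyfit(quarters, sessions, deg=1)
for q in (9, 10):                               # Q1 and Q2 2019
    print(f"forecast for quarter {q}: {slope * q + intercept:.0f}k sessions")
```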

4. Prescriptive analytics: Prescriptive analytics is the final type of advanced analysis. It takes the information that has been predicted and prescribes calculated next steps to take.
Example of prescriptive analytics

Now that we have an idea where website traffic should be headed, what
are some actionable items to get it there? Prescriptive models should
unveil a variety of answers.

The analysis tells us:

● Publish double the amount of web content to reach traffic goals.
● Sales content will generate the highest amount of traffic.
● Email marketing content is the easiest backlink win.

3. Why not big data analytics?
Ans:
4. What are the digital classifications of big data?
Ans:
5. Applications of big data analytics.
Ans: Big data has found many applications in various fields today. The
major fields where big data is being used are as follows.

● Government

Big data analytics has proven to be very useful in the government sector. Big data analysis played a large role in Barack Obama’s successful 2012 re-election campaign, and more recently it was a major factor in the victory of the BJP and its allies in the 2014 Indian General Election. The Indian Government utilizes numerous techniques to ascertain how the Indian electorate is responding to government action, as well as to gather ideas for policy augmentation.

● Social Media Analytics

The advent of social media has led to an outburst of big data. Various solutions have been built to analyze social media activity; for example, IBM’s Cognos Consumer Insights, a point solution running on IBM’s BigInsights Big Data platform, can make sense of the chatter. Social media can provide valuable real-time insights into how the market is responding to products and campaigns. With the help of these insights, companies can adjust their pricing, promotions, and campaign placements accordingly. Before big data can be utilized, some preprocessing needs to be done on it in order to derive intelligent and valuable results. Thus, to know the consumer mindset, applying intelligent decisions derived from big data is necessary.

● Technology

The technological applications of big data involve companies that deal with huge amounts of data every day and put them to use for business decisions as well. For example, eBay.com uses two data warehouses, at 7.5 petabytes and 40 PB, as well as a 40 PB Hadoop cluster for search, consumer recommendations, and merchandising, roughly 90 PB of data in all. Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based, and as of 2005 they had the world’s three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB. Facebook handles 50 billion photos from its user base. Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work at various times of the day.

● Fraud detection

For businesses whose operations involve any type of claims or transaction processing, fraud detection is one of the most compelling Big Data application examples. Historically, fraud detection on the fly has proven an elusive goal. In most cases, fraud is discovered long after the fact, at which point the damage has been done and all that’s left is to minimize the harm and adjust policies to prevent it from happening again. Big Data platforms that can analyze claims and transactions in real time, identifying large-scale patterns across many transactions or detecting anomalous behavior from an individual user, can change the fraud detection game.
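
To make the idea concrete, here is a toy sketch of per-user anomaly detection on a transaction stream, flagging amounts far above a user’s running mean. The threshold and data are illustrative, not a production fraud model.

```python
# Toy real-time anomaly check: flag any transaction more than 3 standard
# deviations above a user's running mean. Thresholds are illustrative.
from collections import defaultdict
import math

class RunningStats:
    """Welford's online algorithm: mean/variance without storing history."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0

stats = defaultdict(RunningStats)

def check(user, amount, threshold=3.0):
    s = stats[user]
    anomalous = s.n > 10 and s.std() > 0 and \
        (amount - s.mean) / s.std() > threshold
    s.update(amount)          # learn from every transaction
    return anomalous

# Usage: stream transactions through check() as they arrive.
for amt in [20, 25, 22, 19, 24, 21, 23, 20, 22, 25, 24, 900]:
    if check("user-42", amt):
        print(f"flag transaction of {amt} for review")
```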

● Call Center Analytics

Now we turn to the customer-facing Big Data application examples, of which call center analytics are particularly powerful. What’s going on in a customer’s call center is often a great barometer and influencer of market sentiment, but without a Big Data solution, much of the insight that a call center can provide will be overlooked or discovered too late. Big Data solutions can help identify recurring problems or customer and staff behavior patterns on the fly, not only by making sense of time/quality resolution metrics but also by capturing and processing call content itself.

● Banking

The use of customer data invariably raises privacy issues. By uncovering hidden connections between seemingly unrelated pieces of data, big data analytics could potentially reveal sensitive personal information. Research indicates that 62% of bankers are cautious in their use of big data due to privacy issues. Further, outsourcing of data analysis activities or distribution of customer data across departments for the generation of richer insights also amplifies security risks; there have been cases where customers’ earnings, savings, mortgages, and insurance policies ended up in the wrong hands. Such incidents reinforce concerns about data privacy and discourage customers from sharing personal information in exchange for customized offers.

● Agriculture

A biotechnology firm uses sensor data to optimize crop efficiency. It plants test crops and runs simulations to measure how plants react to various changes in conditions. Its data environment constantly adjusts to changes in the attributes of the various data it collects, including temperature, water levels, soil composition, growth, output, and gene sequencing of each plant in the test bed. These simulations allow it to discover the optimal environmental conditions for specific gene types.

● Marketing

Marketers have begun to use facial recognition software to learn how well
their advertising succeeds or fails at stimulating interest in their products. A
recent study published in the Harvard Business Review looked at what
kinds of advertisements compelled viewers to continue watching and what
turned viewers off. Among their tools was “a system that analyses facial
expressions to reveal what viewers are feeling.” The research was
designed to discover what kinds of promotions induced watchers to share
the ads with their social network, helping marketers create ads most likely
to “go viral” and improve sales.

● Smart Phones

Perhaps more impressive, people now carry facial recognition technology in their pockets. Users of iPhone and Android smartphones have applications at their fingertips that use facial recognition technology for various tasks. For example, Android users with the ‘remember’ app can snap a photo of someone, then bring up stored information about that person based on their image when their own memory lets them down, a potential boon for salespeople.

● Telecom

Nowadays big data is used in many different fields, and in telecom it also plays an important role. Operators face an uphill challenge when they need to deliver new, compelling, revenue-generating services without overloading their networks, while keeping their running costs under control. The market demands a new set of data management and analysis capabilities that can help service providers make accurate decisions by taking into account the customer, the network context, and other critical aspects of their businesses. Most of these decisions must be made in real time, placing additional pressure on the operators. Real-time predictive analytics can help operators leverage the data that resides in their multitude of systems, make it immediately accessible, and correlate it to generate insights that help them drive their business forward.
● Healthcare

Traditionally, the healthcare industry has lagged behind other industries in the use of big data. Part of the problem stems from resistance to change: providers are accustomed to making treatment decisions independently, using their own clinical judgment, rather than relying on protocols based on big data. Other obstacles are more structural in nature, and this makes healthcare one of the best places to set an example for big data applications. Even within a single hospital, payer, or pharmaceutical company, important information often remains siloed within one group or department because organizations lack procedures for integrating data and communicating findings.

Health care stakeholders now have access to promising new threads of knowledge. This information is a form of “big data,” so called not only for its sheer volume but for its complexity, diversity, and timeliness. Pharmaceutical industry experts, payers, and providers are now beginning to analyze big data to obtain insights. Recent technological advances in the industry have improved their ability to work with such data, even though the files are enormous and often have different database structures and technical characteristics.

6. Applications of unstructured data.
Ans:
7. MPP versus SMP. What is the CAP theorem?
Ans: MPP Databases

MPP (massively parallel processing) database searches are performed by each processor on the computers where segments of the database are stored. MPP databases can be expanded by adding new CPUs. MPP databases are a form of linearly scalable database, or parallel database. Spreading data across more systems in thinner slices results in faster database searches. Performance of an MPP system is linear, increasing roughly in proportion to the number of nodes. MPP nodes are managed as a single computer. SQL is commonly used as the means of processing data across MPP databases. Cognos Business Intelligence and Teradata software run on MPP databases.

SMP Databases

SMP (symmetric multiprocessing) databases share software, input/output resources, and memory disks. Symmetric multiprocessor databases generally use one CPU to perform database searches. While symmetric multiprocessors can have hundreds of CPUs, they are most commonly configured with 2, 4, 8, or 16. Memory is the primary constraint on SMP databases. SMP databases can run on more than one server, though they will share other resources; this is known as a clustered configuration. SMP databases assign tasks to a single CPU, regardless of how many are in the database. SMP databases have lower fault tolerance and efficiency due to their reliance on shared resources, but they have lower administrative costs than MPP. Oracle and Sybase run on SMP databases.

MPP vs SMP Databases

An MPP database sends the same query to each CPU in the MPP system, where it searches the data. When two MPP databases are connected, the search time will be almost half that of a similarly sized SMP database. The search time is not exactly half, since there are delays as data travels between the MPP nodes. High-speed processors used in an SMP database can make it cost-competitive with MPP systems.
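
The divide-the-work idea behind MPP can be sketched in a few lines of Python: the same query is fanned out to every partition, executed in parallel, and the partial results are combined. This is an illustration only, not a real database engine; the data and "query" are invented.

```python
# Toy illustration of the MPP idea: the same query ("count matching
# rows") runs on every partition in parallel, then partial results
# are combined, like a fan-out query across MPP nodes.
from multiprocessing import Pool

def scan_partition(rows):
    # Each "node" scans only its own slice of the table.
    return sum(1 for r in rows if r % 7 == 0)

if __name__ == "__main__":
    table = list(range(10_000_000))
    n_nodes = 4
    chunk = len(table) // n_nodes
    partitions = [table[i * chunk:(i + 1) * chunk] for i in range(n_nodes)]

    with Pool(n_nodes) as pool:
        partials = pool.map(scan_partition, partitions)   # fan out
    print("total matches:", sum(partials))                # combine
```

Ignoring the cost of shipping the partitions to the workers, the scan time drops roughly in proportion to the number of workers, mirroring the near-linear scaling described above.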

Uses

When a company runs its payroll, records labor time card entries or saves
product data in a drawing database on a single server, it is using an SMP
database. SMP databases are used for hosting small Web sites and email
servers. MPP databases are commonly used for data warehousing. MPP
databases are also used for large scale data processing and data mining.

CAP theorem

The CAP theorem, also known as Brewer’s theorem, states that it is impossible for any distributed database system to provide more than two of the following properties together:

● Consistency
● Availability
● Partition tolerance

With the advances in parallel processing and distributed systems, it is more common to expand horizontally, i.e. to add more machines, and the CAP theorem is the backbone of such architectures. Let’s explore the characteristics of the CAP theorem in detail.

Consistency

A consistent system is one in which all nodes see the same data at the
same time. In other words, if we perform read operations after multiple
write operations, then a consistent system should return the same value for
all the read operations and the most recent write operation.

Note that consistency, as defined in the CAP theorem, is quite different from the consistency guaranteed in ACID database transactions.
Availability

A highly available distributed system is one that remains operational 100% of the time. Every request made should be accepted and receive a (non-error) response. Note: It is not necessary for the response to contain the most recent write value (i.e. the system does not need to be consistent, but it should be available all the time).

Partition Tolerance

Partition tolerance states that a system should continue to run even if the connection between nodes is delayed or broken. Note: This doesn’t mean the nodes have gone down; the nodes are up but can’t communicate.

Let’s say that we have two nodes (N1 and N2), both connected. Now assume that the network connecting the two nodes goes down (the network gets partitioned). Both nodes N1 and N2 are up and running fine, but updates happening at node N1 can no longer reach node N2, and vice versa.

Partition tolerance is more of a necessity than an option in modern distributed systems, hence we cannot avoid the “P” in CAP. So we have to choose either consistency or availability.
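
A toy sketch of that trade-off, assuming a two-replica store where each replica is configured as either “CP” (refuse requests during a partition, staying consistent) or “AP” (keep answering, possibly with stale data):

```python
# Toy two-replica store showing the CAP trade-off during a partition.
# "CP": a cut-off replica refuses reads/writes (consistent, unavailable).
# "AP": it keeps answering, possibly with stale data. Illustrative only.
class Replica:
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.value = None
        self.partitioned = False  # True = can't reach the other replica

    def write(self, value, peer=None):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot confirm with peer")
        self.value = value
        if peer is not None and not self.partitioned:
            peer.value = value    # replicate while the link is up

    def read(self):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: value may be stale")
        return self.value         # in AP mode this may be stale

n1, n2 = Replica("AP"), Replica("AP")
n1.write("v1", peer=n2)
n1.partitioned = n2.partitioned = True   # the network splits
n1.write("v2")                           # accepted locally (available)
print(n2.read())                         # prints "v1": stale but available
```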
8. Advantages and disadvantages of SMP over MPP.
Ans:

9. Hadoop architecture with read and write anatomy.

Ans: Anatomy of File Read in HDFS

Let’s get an idea of how data flows between the client interacting with
HDFS, the name node, and the data nodes with the help of a diagram.
Consider the figure:

Step 1: The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem.

Step 2: DistributedFileSystem (DFS) calls the name node, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file. For each block, the name node returns the addresses of the data nodes that have a copy of that block. The DFS returns an FSDataInputStream to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the data node and name node I/O.

Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the first few blocks in the file, connects to the first (closest) data node for the first block in the file.

Step 4: Data is streamed from the data node back to the client, which calls
read() repeatedly on the stream.

Step 5: When the end of a block is reached, DFSInputStream closes the connection to the data node and then finds the best data node for the next block. This happens transparently to the client, which from its point of view is simply reading a continuous stream. Blocks are read in order, with the DFSInputStream opening new connections to data nodes as the client reads through the stream. It will also call the name node to retrieve the data node locations for the next batch of blocks as needed.

Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
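
From application code, all of these steps are hidden behind a single open-and-read call. A minimal sketch using the third-party HdfsCLI Python package (pip install hdfs) over WebHDFS; the name node URL, user, and file path are assumptions:

```python
# Minimal HDFS read sketch via the third-party "hdfs" (WebHDFS) client.
# The URL, user, and path are assumptions for the example.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hadoop")

# The client library handles locating blocks and streaming the data;
# the read anatomy above happens behind this call.
with client.read("/data/input/sample.txt", encoding="utf-8") as reader:
    content = reader.read()
print(content[:200])
```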

Anatomy of File Write in HDFS

Next, we’ll look at how files are written to HDFS. Consider figure 1.2 to get a better understanding of the concept.
Step 1: The client creates the file by calling create() on DistributedFileSystem (DFS).

Step 2: DFS makes an RPC call to the name node to create a new file in the file system’s namespace, with no blocks associated with it. The name node performs various checks to make sure the file doesn’t already exist and that the client has the right permissions to create the file. If these checks pass, the name node makes a record of the new file; otherwise, the file can’t be created and the client is thrown an error, i.e. an IOException. The DFS returns an FSDataOutputStream for the client to start writing data to.

Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas. The list of data nodes forms a pipeline, and here we’ll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline.

Step 4: Similarly, the second data node stores the packet and forwards it to
the third (and last) data node in the pipeline.

Step 5: The DFSOutputStream maintains an internal queue of packets that are waiting to be acknowledged by data nodes, called the “ack queue”.

Step 6: When the client has finished writing data, it calls close() on the stream. This flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.

HDFS follows a write-once, read-many model. So we can’t edit files that are already stored in HDFS, but we can append to them by reopening the file. This design allows HDFS to scale to a large number of concurrent clients because the data traffic is spread across all the data nodes in the cluster. Thus, it increases the availability, scalability, and throughput of the system.
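
The write path is likewise wrapped in a single call from application code. Again a sketch with the HdfsCLI package; the URL, user, and paths are assumptions. Note how append is a separate, explicit operation, matching the write-once-read-many model described above:

```python
# Minimal HDFS write sketch via the third-party "hdfs" (WebHDFS) client.
# URL, user, and paths are assumptions for the example.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hadoop")

# overwrite=False mirrors HDFS's write-once behaviour: writing to an
# existing path raises an error instead of editing it in place.
with client.write("/data/output/report.txt", encoding="utf-8",
                  overwrite=False) as writer:
    writer.write("line 1\nline 2\n")

# Appending (reopening the file) is a separate operation:
client.write("/data/output/report.txt", data="line 3\n",
             encoding="utf-8", append=True)
```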

10. Processing data with Hadoop.

Ans:
11. Analysing Hadoop MapReduce with a weather data example.
Ans:
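A minimal sketch of the classic weather example: a Hadoop Streaming mapper and reducer that find the maximum temperature per year. The input format ("year<TAB>temperature" per line) is an assumption; real NCDC weather records need positional parsing and quality-code filtering.

```python
# mapper.py -- emit (year, temperature) pairs, skipping malformed lines.
import sys

for line in sys.stdin:
    parts = line.strip().split("\t")
    if len(parts) != 2:
        continue                      # skip malformed lines
    year, temp = parts
    try:
        float(temp)                   # skip unreadable readings
    except ValueError:
        continue
    print(f"{year}\t{temp}")
```

```python
# reducer.py -- Hadoop Streaming sorts mapper output by key, so all
# readings for a year arrive together; keep a running maximum per year.
import sys

current_year, max_temp = None, None
for line in sys.stdin:
    year, temp = line.strip().split("\t")
    temp = float(temp)
    if year != current_year:
        if current_year is not None:
            print(f"{current_year}\t{max_temp}")
        current_year, max_temp = year, temp
    else:
        max_temp = max(max_temp, temp)
if current_year is not None:
    print(f"{current_year}\t{max_temp}")
```

The job would then be launched with the Hadoop Streaming jar, passing the two scripts as the mapper and reducer along with the input and output paths (exact jar location and paths depend on the installation).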
12. Hadoop ecosystem with neat diagram.
Ans:
Hadoop Ecosystem

Overview: Apache Hadoop is an open source framework intended to make interaction with big data easier. However, for those who are not acquainted with this technology, one question arises: what is big data? Big data is a term given to data sets that can’t be processed in an efficient manner with the help of traditional methodologies such as RDBMS. Hadoop has made its place in the industries and companies that need to work on large data sets which are sensitive and need efficient handling. Hadoop is a framework that enables the processing of large data sets which reside in the form of clusters. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.

Introduction: The Hadoop Ecosystem is a platform, or a suite, which provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common. Most of the other tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as the absorption, analysis, storage, and maintenance of data.

Following are the components that collectively form a Hadoop ecosystem:

● HDFS: Hadoop Distributed File System
● YARN: Yet Another Resource Negotiator
● MapReduce: Programming-based data processing
● Spark: In-memory data processing
● PIG, HIVE: Query-based processing of data services
● HBase: NoSQL database
● Mahout, Spark MLlib: Machine learning algorithm libraries
● Solr, Lucene: Searching and indexing
● Zookeeper: Managing the cluster
● Oozie: Job scheduling

13. Flow of big data analytics.

Ans:
