What Is Big Data? Characteristics of Big Data and Significance
Ans. “Big data is high-volume, high-velocity, and high-variety information
assets that demand cost-effective, innovative forms of information
processing for enhanced insight and decision making.”
Characteristics of Big Data
Back in 2001, analyst Doug Laney (then at META Group, which Gartner later
acquired) listed the 3 ‘V’s of Big Data – Variety, Velocity, and Volume.
Together, these characteristics tell us what big data is. Let’s look at
each of them in depth:
1) Variety
2) Velocity
3) Volume
Volume: Big Data indicates huge ‘volumes’ of data being generated on a
daily basis from various sources like social media platforms, business
processes, machines, networks, human interactions, etc. Such large
amounts of data are stored in data warehouses.
Velocity: Velocity is the speed at which this data is generated and must
be collected and processed, often in near real time.
Variety: Variety refers to the many forms the data takes, from structured
records to semi-structured logs and unstructured text, images, audio, and
video.
These three Vs together make up the characteristics of big data.
The importance of big data does not revolve around how much data a
company has but how a company utilises the collected data. Every
company uses data in its own way; the more efficiently a company uses its
data, the more potential it has to grow. The company can take data from
any source and analyse it to find answers which will enable:
1. Cost Savings: Big data tools like Hadoop and cloud-based analytics
can bring cost advantages to a business when large amounts of data
need to be stored, and these tools also help in identifying more
efficient ways of doing business.
2. Time Reductions: The high speed of tools like Hadoop and in-memory
analytics lets businesses identify new sources of data, analyse data
immediately, and make quick decisions based on the learnings.
3. Understand market conditions: By analyzing big data you can get a
better understanding of current market conditions. For example, by
analyzing customers’ purchasing behaviour, a company can find out
which products sell the most and produce products in line with this
trend, getting ahead of its competitors.
4. Control online reputation: Big data tools can perform sentiment
analysis, so you can get feedback about who is saying what about your
company. If you want to monitor and improve the online presence of
your business, big data tools can help with all of this (a toy sketch
of the idea appears after this list).
5. Using Big Data Analytics to Boost Customer Acquisition and
Retention
The customer is the most important asset any business depends on. No
business can claim success without first establishing a solid
customer base. However, even with a customer base, a business cannot
afford to disregard the high competition it faces. If a business is
slow to learn what customers are looking for, it can easily end up
offering poor-quality products. The eventual result is a loss of
clientele, which has an adverse overall effect on business success.
The use of big data allows businesses to observe various
customer-related patterns and trends. Observing customer behaviour is
important for triggering loyalty.
Big data analytics can help change all business operations. This
includes the ability to match customer expectations, change the
company’s product line, and of course ensure that marketing campaigns
are powerful.
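To make the sentiment analysis point above concrete, here is a toy sketch in Java. Real big data tools use trained models over huge corpora; the word lists, class name, and sample posts here are purely illustrative assumptions.

import java.util.List;
import java.util.Set;

// Toy illustration of sentiment scoring: real big data tools use trained
// models over huge corpora; this word-list approach only shows the idea.
public class SentimentSketch {
    private static final Set<String> POSITIVE = Set.of("great", "love", "excellent", "fast");
    private static final Set<String> NEGATIVE = Set.of("bad", "slow", "broken", "hate");

    // Positive words add 1 to the score, negative words subtract 1.
    static int score(String post) {
        int s = 0;
        for (String word : post.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(word)) s++;
            if (NEGATIVE.contains(word)) s--;
        }
        return s;
    }

    public static void main(String[] args) {
        List<String> posts = List.of(
            "Love the new app, excellent and fast!",
            "Support is slow and the update is broken.");
        for (String p : posts) {
            System.out.println(score(p) + " : " + p);
        }
    }
}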
Let’s say website traffic numbers fell just short of their goal in 2018.
That’s enough reason to run a descriptive analysis to see what went wrong
and where website traffic should be headed. Once we have that picture,
what are some actionable items to get it there? Prescriptive models
should unveil a variety of answers.
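As a minimal sketch of what such a descriptive analysis could look like, the Java snippet below compares hypothetical quarterly visit counts against a yearly goal and reports quarter-over-quarter change. All numbers and names are invented for illustration.

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal descriptive analysis: total traffic versus a goal, plus
// quarter-over-quarter change. All figures are illustrative.
public class TrafficDescriptive {
    public static void main(String[] args) {
        Map<String, Integer> visits = new LinkedHashMap<>();
        visits.put("Q1", 120_000);   // invented numbers, not real data
        visits.put("Q2", 135_000);
        visits.put("Q3", 128_000);
        visits.put("Q4", 131_000);

        int goal = 520_000;
        int total = visits.values().stream().mapToInt(Integer::intValue).sum();
        System.out.printf("Total: %d (goal %d, gap %d)%n", total, goal, goal - total);

        Integer previous = null;
        for (Map.Entry<String, Integer> e : visits.entrySet()) {
            if (previous != null) {
                double change = 100.0 * (e.getValue() - previous) / previous;
                System.out.printf("%s: %+.1f%% vs previous quarter%n", e.getKey(), change);
            }
            previous = e.getValue();
        }
    }
}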
● Government
Big data analytics has proven to be very useful in the government sector.
Big data analysis played a large role in Barack Obama’s successful 2012
re-election campaign. More recently, big data analysis was a major factor
in the BJP and its allies winning the 2014 Indian General Election. The
Indian Government utilises numerous such techniques to ascertain how the
Indian electorate is responding to government action, as well as to
gather ideas for policy augmentation.
The advent of social media has led to an outburst of big data. Various
solutions have been built to analyse social media activity; for example,
IBM’s Cognos Consumer Insights, a point solution running on IBM’s
BigInsights big data platform, can make sense of the chatter. Social
media can provide valuable real-time insights into how the market is
responding to products and campaigns. With the help of these insights,
companies can adjust their pricing, promotion, and campaign placements
accordingly. Before big data can be utilised, some preprocessing needs to
be done on it in order to derive intelligent and valuable results; a
small sketch of such preprocessing follows. Thus, to know the consumer
mindset, intelligent decisions derived from big data are necessary.
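A minimal sketch of the kind of preprocessing meant here, assuming nothing beyond standard Java: normalise case, strip URLs and punctuation, and split the text into tokens. Real platforms do far more (language detection, spam filtering, entity extraction).

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Basic social media text preprocessing before any analysis:
// lower-case, strip URLs and punctuation, split into tokens.
public class Preprocess {
    static List<String> tokens(String raw) {
        String cleaned = raw.toLowerCase()
                .replaceAll("https?://\\S+", " ")  // drop URLs
                .replaceAll("[^a-z0-9#@ ]", " ");  // drop punctuation, keep tags
        return Arrays.stream(cleaned.split("\\s+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Prints: [loving, the, new, phone, #launch]
        System.out.println(tokens("Loving the new phone!! https://example.com #launch"));
    }
}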
● Technology
● Fraud detection
● Banking
● Agriculture
● Marketing
Marketers have begun to use facial recognition software to learn how well
their advertising succeeds or fails at stimulating interest in their products. A
recent study published in the Harvard Business Review looked at what
kinds of advertisements compelled viewers to continue watching and what
turned viewers off. Among their tools was “a system that analyses facial
expressions to reveal what viewers are feeling.” The research was
designed to discover what kinds of promotions induced watchers to share
the ads with their social network, helping marketers create ads most likely
to “go viral” and improve sales.
● Smart Phones
● Telecom
Nowadays big data is used in many different fields, and it plays an
important role in telecom as well. Operators face an uphill challenge
when they need to deliver new, compelling, revenue-generating services
without overloading their networks, while keeping their running costs
under control. The market demands a new set of data management and
analysis capabilities that can help service providers make accurate
decisions by taking into account customer, network context, and other
critical aspects of their businesses. Most of these decisions must be
made in real time, placing additional pressure on the operators.
Real-time predictive analytics can help leverage the data that resides in
their multitude of systems, make it immediately accessible, and correlate
it to generate insight that helps them drive their business forward.
● Healthcare
SMP vs. MPP Databases
An MPP (massively parallel processing) database sends the same query to
each CPU in the system, and each CPU searches its own portion of the
data. When two MPP nodes are connected, the search time will be almost
half that of a similarly sized SMP (symmetric multiprocessing) database.
The search time is not exactly half, since there are delays as data
travels between the MPP nodes; a rough model of this trade-off is
sketched below. High-speed processors used in an SMP database can make it
cost-competitive with MPP systems.
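A back-of-the-envelope model of this claim, with purely illustrative numbers: the scan time divides across MPP nodes, but a fixed network overhead keeps the speedup below the node count.

// Model of the claim above: an MPP scan divides the work across nodes
// but adds inter-node network delay, so two nodes give slightly less
// than a 2x speedup. All numbers are illustrative assumptions.
public class MppModel {
    public static void main(String[] args) {
        double smpScanSeconds = 100.0;   // assumed single-node scan time
        double networkOverhead = 3.0;    // assumed per-query coordination cost

        for (int nodes : new int[] {1, 2, 4, 8}) {
            double mpp = smpScanSeconds / nodes + (nodes > 1 ? networkOverhead : 0);
            System.out.printf("%d node(s): %.1f s (speedup %.2fx)%n",
                    nodes, mpp, smpScanSeconds / mpp);
        }
    }
}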
Uses
When a company runs its payroll, records labor time card entries or saves
product data in a drawing database on a single server, it is using an SMP
database. SMP databases are used for hosting small Web sites and email
servers. MPP databases are commonly used for data warehousing. MPP
databases are also used for large scale data processing and data mining.
CAP theorem
● Consistency
● Availability
● Partition tolerance
The CAP theorem states that a distributed system can deliver at most two
of these three guarantees at the same time.
Consistency
A consistent system is one in which all nodes see the same data at the
same time. In other words, if we perform read operations after multiple
write operations, a consistent system should return, for every read, the
value of the most recent write.
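A toy sketch of this definition, not a real distributed store: a write is acknowledged only after both replicas are updated, so a read from either node returns the most recent write. Class and field names are invented for illustration.

import java.util.HashMap;
import java.util.Map;

// Toy model of consistency: a write is acknowledged only after every
// replica has applied it, so a read from any node sees the latest value.
public class ConsistentStore {
    private final Map<String, String> n1 = new HashMap<>();
    private final Map<String, String> n2 = new HashMap<>();

    void write(String key, String value) {
        n1.put(key, value);   // synchronous replication to both nodes;
        n2.put(key, value);   // only now is the write acknowledged
    }

    String readFromN1(String key) { return n1.get(key); }
    String readFromN2(String key) { return n2.get(key); }

    public static void main(String[] args) {
        ConsistentStore store = new ConsistentStore();
        store.write("x", "1");
        store.write("x", "2");
        // Both reads return the most recent write, as consistency requires.
        System.out.println(store.readFromN1("x") + " " + store.readFromN2("x"));
    }
}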
Availability
An available system is one in which every request receives a response,
even when some nodes are down, although the response may not reflect the
most recent write.
Partition Tolerance
Let’s say that we have two nodes (N1 and N2) and both are connected. Now
assume that the network connecting the two nodes goes down (the network
gets partitioned). Both nodes N1 and N2 are up and running fine, but the
updates happening at node N1 can no longer reach node N2 and vice versa.
A partition-tolerant system keeps operating despite such a break in
communication.
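Continuing the toy model, the sketch below shows the choice a partitioned node faces: serve possibly stale data (favouring availability) or refuse the request (favouring consistency). The class and field names are illustrative.

// During a partition, a node that cannot reach its peer must either
// serve possibly stale data (availability) or refuse the request
// (consistency). It cannot do both, which is the CAP trade-off.
public class PartitionedNode {
    private String localValue = "old";
    private boolean peerReachable = false;  // network partition in effect
    private final boolean preferConsistency;

    PartitionedNode(boolean preferConsistency) {
        this.preferConsistency = preferConsistency;
    }

    String read() {
        if (!peerReachable && preferConsistency) {
            throw new IllegalStateException("unavailable: cannot confirm latest value");
        }
        return localValue;  // may be stale if the peer took newer writes
    }

    public static void main(String[] args) {
        System.out.println(new PartitionedNode(false).read()); // AP: returns stale "old"
        try {
            new PartitionedNode(true).read();                  // CP: rejects the read
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}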
Let’s get an idea of how data flows between the client interacting with
HDFS, the name node, and the data nodes with the help of a diagram.
Consider the figure:
Step 1: The client opens the file it wishes to read by calling open() on
the FileSystem object (which for HDFS is an instance of
DistributedFileSystem).
Step 2: The DistributedFileSystem (DFS) calls the name node, using remote
procedure calls (RPCs), to determine the locations of the first few
blocks in the file. For each block, the name node returns the addresses
of the data nodes that have a copy of that block. The DFS returns an
FSDataInputStream to the client for it to read data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages the
data node and name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which
has stored the data node addresses for the first few blocks in the file,
connects to the first (closest) data node for the first block in the
file.
Step 4: Data is streamed from the data node back to the client, which calls
read() repeatedly on the stream.
Step 5: When the end of a block is reached, DFSInputStream closes the
connection to the data node and then finds the best data node for the
next block. This happens transparently to the client, which from its
point of view is simply reading a continuous stream. Blocks are read in
order, with the DFSInputStream opening new connections to data nodes as
the client reads through the stream. It will also call the name node to
retrieve the data node locations for the next batch of blocks as needed.
Step 6: When the client has finished reading the file, it calls close()
on the FSDataInputStream.
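The whole read path above corresponds to only a few lines of client code against the HDFS Java API. The file path below is illustrative, and the cluster address is assumed to come from the usual configuration files (core-site.xml, hdfs-site.xml).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Reading a file through the HDFS client API: open() triggers steps 1-2,
// the copy loop drives steps 3-5, and the implicit close() is step 6.
public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}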
Next, we’ll check out how files are written to HDFS. Consider figure 1.2
to get a better understanding of the concept.
Step 1: The client creates the file by calling create() on
DistributedFileSystem(DFS).
Step 2: DFS makes an RPC call to the name node to create a new file in
the file system’s namespace, with no blocks associated with it. The name
node performs various checks to make sure the file doesn’t already exist
and that the client has the right permissions to create the file. If
these checks pass, the name node makes a record of the new file;
otherwise, the file can’t be created and the client is thrown an
IOException. The DFS returns an FSDataOutputStream for the client to
start writing data to.
Step 3: As the client writes data, the DFSOutputStream splits it into
packets, which it writes to an internal queue called the data queue. The
data queue is consumed by the DataStreamer, which is responsible for
asking the name node to allocate new blocks by picking a list of suitable
data nodes to store the replicas. The list of data nodes forms a
pipeline, and here we’ll assume the replication level is three, so there
are three nodes in the pipeline. The DataStreamer streams the packets to
the first data node in the pipeline, which stores each packet and
forwards it to the second data node in the pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to
the third (and last) data node in the pipeline.
Step 5: When the client has finished writing data, it calls close() on
the stream.
Step 6: This action flushes all the remaining packets to the data node
pipeline and waits for acknowledgments before contacting the name node to
signal that the file is complete.
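Again, the write path corresponds to a short piece of client code against the HDFS Java API. The path and contents below are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Writing a file through the HDFS client API: create() performs steps
// 1-2, writeBytes() feeds the data queue behind steps 3-4, and close()
// flushes the remaining packets and signals the name node (steps 5-6).
public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
            out.writeBytes("hello hdfs\n");
        }
    }
}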
HDFS follows a Write Once, Read Many model. So we can’t edit files that
are already stored in HDFS, but we can add to them by reopening the file
for append; a sketch follows. This design allows HDFS to scale to a large
number of concurrent clients because the data traffic is spread across
all the data nodes in the cluster. Thus, it increases the availability,
scalability, and throughput of the system.
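As a final sketch, “reopening” a file to add data maps to FileSystem.append() in the Java API, assuming the cluster permits appends; the path is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Existing file contents cannot be edited in place, but data can be
// added at the end via append(), assuming the cluster allows appends.
public class HdfsAppend {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataOutputStream out = fs.append(new Path("/data/output.txt"))) {
            out.writeBytes("one more line\n");
        }
    }
}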