Introduction to BDA
➢ Introduction to Hadoop
➢ Open-source technologies
• Structured / Unstructured
❑ According to Gartner, Big Data is high-volume, high-velocity, and
high-variety information assets that demand innovative platforms
for enhanced insight and decision making.
Big Data Analytics deals with mechanisms for bringing the correct meaning out
of data. First, you mine the data, i.e., turn raw data into useful data. Next,
you analyze the data that you have cleaned or retrieved from the raw data.
Finally, you make sure the analysis benefits your business, for example by
yielding insights or results that were not obtainable earlier.
Big Data: Veracity
• The trustworthiness and quality of data.
• It is necessary that the veracity of the data is maintained. For example, think
about Facebook posts, with hashtags, abbreviations, images, videos, etc., which
make them unreliable and hamper the quality of their content.
• Collecting loads and loads of data is of no use if the quality and trustworthiness
of the data are not up to the mark.
Applications of Big Data: Finance
Banking:
o Because massive amounts of data pour in from innumerable sources, banks need
unconventional ways to manage big data.
o It is also essential to examine customer requirements, deliver services according
to their specifications, and reduce risk while maintaining regulatory compliance.
Stock Exchange:
o The NYSE generates about one terabyte of new trade data every day.
o At that rate, roughly 365 terabytes of data accumulate for processing over a
whole year.
Applications of Big Data: Social Network
• Social media is currently considered the largest data generator.
• Statistics show that more than 500 terabytes of new data are added to social media
databases every day, particularly in the case of Facebook.
• The data consist mainly of videos, photos, message exchanges, etc. A single
activity on a social media site generates a lot of data, which is stored and
processed whenever required.
• Since the stored data is measured in terabytes, processing it on legacy systems
would take a very long time. Big Data technologies are a solution to this problem.
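To make the legacy-system bottleneck concrete, here is a rough back-of-the-envelope sketch. Only the 500 TB daily volume comes from the text above; the 100 MB/s read rate per disk and the 1,000-machine cluster are illustrative assumptions, and the function name is hypothetical.

```python
def scan_seconds(total_bytes, bytes_per_second, workers=1):
    """Time to read total_bytes once, split evenly across parallel workers."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return total_bytes / (bytes_per_second * workers)

TB = 10**12  # decimal terabyte
MB = 10**6   # decimal megabyte

DAILY_VOLUME = 500 * TB   # figure quoted in the text
DISK_RATE = 100 * MB      # assumed sequential read rate of one disk, per second
CLUSTER_SIZE = 1000       # assumed number of machines reading in parallel

single = scan_seconds(DAILY_VOLUME, DISK_RATE)
cluster = scan_seconds(DAILY_VOLUME, DISK_RATE, CLUSTER_SIZE)

print(f"one machine: {single / 86400:.1f} days")
print(f"{CLUSTER_SIZE} machines: {cluster / 3600:.1f} hours")
```

Under these assumptions a single machine needs weeks just to read one day's data once, while a cluster that splits the read finishes in about an hour and a half. This is the basic motivation for distributing both storage and processing.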
Applications of Big Data: Healthcare
• Nowadays, doctors rely mostly on patients’ clinical records, which means that a lot
of data must be gathered for many different patients.
• Old or traditional data storage methods cannot store this data.
• Since large amounts of data arrive from different sources in various formats,
the need to handle this data has increased.
Applications of Big Data: E-Commerce
• Maintaining customer relationships is of the utmost importance in the e-commerce industry.
• E-commerce websites use different marketing ideas to sell merchandise to their
customers, manage transactions, and apply innovative ideas with Big Data to
improve their business.
• Flipkart:
▪ Flipkart is a huge e-commerce website that handles heavy traffic every day.
▪ When there is a pre-announced sale on Flipkart, traffic can grow so sharply that
it crashes the website.
▪ To handle this kind of traffic and data, Flipkart uses Big Data.
▪ Big Data helps in organizing and analyzing the data for further use.
Applications of Big Data: Education
[Figure: Big Data, Business Need, Technology Advances, Analytical Platforms]
Technologies for Big Data (and Analytics)
Data warehouses
Appliances
Analytical sandboxes
In-memory analytics
In-database analytics
Columnar databases
Hadoop Ecosystem
Mahout
• Since machine learning algorithms are complex and time-consuming, Mahout
breaks down the work so that it is executed on MapReduce running on many
machines.
ZooKeeper
• Apache ZooKeeper is an independent component used by various distributed
frameworks such as HDFS, HBase, Kafka, and YARN to coordinate their components.
It provides a distributed configuration service, a synchronization service, and
a naming registry for large distributed systems.
Flume
• Flume makes it possible to continuously pump unstructured data from many
sources to a central store such as HDFS.
• If you have many machines continuously generating data such as web server logs,
you can use Flume to aggregate the data at a central place such as HDFS.
Hadoop 2.x Core Components
• The Hadoop 2.0 feature HDFS Federation allows horizontal scaling of the
Hadoop Distributed File System (HDFS). It is one of the features most sought
after by enterprise-class Hadoop users such as Amazon and eBay. HDFS Federation
supports multiple NameNodes and namespaces.
• Hadoop 2.x has the following three major components:
• HDFS
• YARN
• MapReduce
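As a rough illustration of the MapReduce programming model named in the list above, the following single-process sketch runs the map, shuffle, and reduce phases of a word count in plain Python. It does not use Hadoop's API, and the function names are hypothetical. On a real cluster each phase would run in parallel across many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())

counts = word_count(["big data needs hadoop", "hadoop stores big data"])
print(counts)
```

Because each call to `map_phase` looks at one line only and each call to `reduce_phase` looks at one key only, the work can be split across machines without the pieces needing to communicate, except during the shuffle.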
Hadoop 2.x Architecture
HDFS
Hadoop 3.x Core Components
Why Hadoop 3.x
• With Java 7 reaching end of life in 2015, the minimum runtime version had to be raised to Java 8
in a new Hadoop release, so that the release runs on a Java version Oracle still supports with
security fixes and Hadoop can upgrade its dependencies to modern versions.
• In Hadoop 2.0 the shell scripts were difficult to understand: developers had to read almost all of
them to find out which environment variable sets a given option and how to set it, whether
java.library.path, the Java classpath, or GC options.
• With support for only two NameNodes, Hadoop 2 did not provide the maximum level of fault
tolerance. Hadoop 3.x adds fault tolerance by supporting more than two NameNodes.
• Replication is costly in Hadoop 2: its 3x replication scheme leads to 200% additional storage
space and resource overhead. Hadoop 3.0 introduces erasure coding as an alternative to
replication, which consumes comparatively less storage space while providing a comparable level
of fault tolerance.
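The storage arithmetic behind this comparison can be sketched as follows. The Reed-Solomon layout with 6 data blocks and 3 parity blocks is an illustrative assumption not taken from the slides; other layouts are possible. The helper names are hypothetical.

```python
def replication_overhead(replicas):
    """Extra storage, as a fraction of the raw data, for n-way replication."""
    return float(replicas - 1)

def erasure_coding_overhead(data_blocks, parity_blocks):
    """Extra storage, as a fraction of the raw data, for a (data, parity) erasure code."""
    return parity_blocks / data_blocks

rep = replication_overhead(3)        # 3x replication: two extra copies of every block
ec = erasure_coding_overhead(6, 3)   # assumed RS(6,3): 3 parity blocks per 6 data blocks

raw_tb = 6  # example data set size in terabytes
print(f"3x replication: {rep:.0%} overhead, {raw_tb * (1 + rep):.0f} TB stored")
print(f"RS(6,3) coding: {ec:.0%} overhead, {raw_tb * (1 + ec):.0f} TB stored")
```

With 3x replication a block survives the loss of two of its three copies. With an RS(6,3) layout a stripe survives the loss of any three of its nine blocks, at a quarter of the extra storage.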
Hadoop 3.x Architecture
Data Replication in 3.x
Difference between 1.x, 2.x, and 3.x Hadoop components
Difference between Hadoop 1.x and 2.x
Difference between Hadoop 2.x and 3.x
Open-source technologies
• Mobile business intelligence is software that extends desktop business intelligence (BI) applications so they can
be used on a mobile device.
• Business intelligence is a set of software systems that generates meaningful and useful
information, enabling users to gain insight into the company and to follow its trends,
patterns, technologies, and reports.
• Big data is often assumed to be merely huge and unstructured. It is not just about size but also
about the composition of the data, operations on the data, and the value added in developing it.
Cloud and Big Data mobile business intelligence
• The cloud can help you process and analyze your big data faster,
leading to insights that can improve your products and business.
Crowdsourcing
• Crowdsourcing is a combination of “crowd” and “outsourcing”.
• The term was coined at Wired magazine in 2005.
• It is a sourcing model that draws on the depth of experience and ideas of an
open group instead of an organization’s own employees.
• Crowdsourcing taps into the global world of ideas, helping
companies work through a rapid design process.
• You outsource to large crowds in an effort to make sure your
products or services are right.
• The upsides of utilizing crowdsourcing are professed to