Big Data Architecture

Managing large quantities of data and performing complex operations on that
massive data demands specialised big data tools and techniques. By big data
tools and techniques, we refer to the big data ecosystem and its sphere. No
single solution covers every use case; each must be created in an effective
manner according to company demands. A big data solution must be developed and
maintained in accordance with those demands so that it meets the needs of the
company, and a stable solution built this way can be relied on for the problem
at hand.

Introduction

Big data architecture is a comprehensive solution to deal with an enormous
amount of data. It details the blueprint for providing solutions and
infrastructure for dealing with big data based on a company's demands. It
clearly defines the components, layers, and methods of communication. The
reference points are the ingestion, processing, storing, managing, accessing,
and analysing of the data.

There is more than one workload type involved in big data systems, and they are
broadly classified as follows:

1. Batch processing of big data sources at rest.
2. Real-time processing of big data in motion.
3. Interactive exploration of big data with new technologies and tools.
4. Predictive analytics and machine learning.
 Data Sources: Every source that feeds the data extraction pipeline falls
under this definition, making this the starting point of the big data
pipeline. Data sources, both open and third-party, play a significant role in
the architecture. They include relational databases, data warehouses,
cloud-based data warehouses, SaaS applications, real-time data from company
servers and sensors such as IoT devices, third-party data providers, and
static files such as Windows logs. The data they supply can be handled by
both batch processing and real-time processing.

 Data Storage: Data is held in distributed file stores that can accommodate
large files in a variety of formats; such a repository of mixed-format files
is commonly called a data lake. This layer holds the data managed for batch
operations. Options include HDFS and blob containers such as those offered by
Microsoft Azure, AWS, and GCP. A short PySpark sketch of reading from such a
store follows.
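
As a brief illustration of this layer, here is a minimal PySpark sketch that
reads mixed-format files out of a distributed store. The paths and the
particular format mix are illustrative assumptions, not taken from the
original text.

    from pyspark.sql import SparkSession

    # Minimal sketch: one distributed file store (HDFS here) holding
    # large files of several formats side by side. All paths are
    # hypothetical examples.
    spark = SparkSession.builder.appName("lake-read").getOrCreate()

    logs = spark.read.text("hdfs:///lake/raw/windows-logs/")       # static log files
    events = spark.read.json("hdfs:///lake/raw/events/")           # semi-structured data
    history = spark.read.parquet("hdfs:///lake/curated/history/")  # columnar batch output

    print(logs.count(), events.count(), history.count())
    spark.stop()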

 Batch Processing: Data is split into chunks and handled by long-running
jobs, which filter, aggregate, and otherwise prepare it for analysis. These
jobs typically read source files, process them, and write the output to new
files. Multiple approaches to batch processing are employed, including Hive
jobs, U-SQL jobs, Sqoop or Pig jobs, and custom map-reduce jobs written in
Java, Scala, or other languages such as Python. A minimal batch-job sketch is
shown below.
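
The sketch below shows the shape of such a batch job in PySpark: read raw
files, filter and aggregate, and write new output files. The paths and column
names (event_time, status, amount) are hypothetical, chosen only for
illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Minimal batch-job sketch: filter, aggregate, and deliver
    # processed output as new files. Paths and columns are assumptions.
    spark = SparkSession.builder.appName("daily-batch").getOrCreate()

    raw = spark.read.json("hdfs:///lake/raw/events/")  # hypothetical source

    daily_totals = (
        raw.filter(F.col("status") == "completed")        # filter
           .withColumn("day", F.to_date("event_time"))
           .groupBy("day")
           .agg(F.sum("amount").alias("total_amount"))    # aggregate
    )

    # Write the processed result to new files for later analysis.
    daily_totals.write.mode("overwrite").parquet("hdfs:///lake/curated/daily_totals/")
    spark.stop()
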
 Real-Time Message Ingestion: In contrast to batch processing, this covers
all real-time streaming systems that cater to data as it is generated and
received. In the simplest case, it is a data store or folder that collects all
incoming messages and hands them off for processing. If message-based
processing is required, however, message ingestion stores such as Apache
Kafka, Apache Flume, or Azure Event Hubs must be used; these make the delivery
process more reliable through message queuing semantics. A minimal producer
sketch follows.
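
As one concrete example, here is a minimal producer sketch using the
kafka-python client against Apache Kafka, which the text names as an ingestion
store. The broker address, topic name, and message fields are assumptions for
illustration.

    from kafka import KafkaProducer
    import json

    # Minimal ingestion sketch: publish one sensor reading to a Kafka
    # topic for downstream consumers. Broker, topic, and fields are
    # hypothetical.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    producer.send("iot-events", {"device_id": "sensor-42", "temp_c": 21.5})
    producer.flush()   # make sure the message is actually delivered
    producer.close()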

 Stream Processing: Real-time message ingestion and stream processing are
different. Ingestion captures incoming data and makes it available, typically
through a publish-subscribe mechanism, whereas stream processing consumes that
data, handles it in the form of windows or streams, and writes the result to a
sink. Engines in this category include Apache Spark, Flink, Storm, etc. A
windowed-aggregation sketch is shown below.
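
Since the text names Apache Spark, here is a minimal Spark Structured
Streaming sketch that consumes the Kafka topic above, aggregates readings over
one-minute windows, and writes to a console sink. The topic, field names, and
windowing choice are assumptions; running it also requires the Spark Kafka
connector package.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Minimal stream-processing sketch: read a Kafka stream, aggregate
    # over time windows, write results to a sink. Requires the
    # spark-sql-kafka connector on the classpath.
    spark = SparkSession.builder.appName("stream-agg").getOrCreate()

    stream = (
        spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("subscribe", "iot-events")   # hypothetical topic
             .load()
    )

    readings = stream.select(
        F.get_json_object(F.col("value").cast("string"), "$.temp_c")
         .cast("double").alias("temp_c"),
        F.col("timestamp"),
    )

    # Average temperature per one-minute window; late data is bounded
    # by a two-minute watermark.
    windowed = (
        readings.withWatermark("timestamp", "2 minutes")
                .groupBy(F.window("timestamp", "1 minute"))
                .agg(F.avg("temp_c").alias("avg_temp"))
    )

    query = windowed.writeStream.outputMode("update").format("console").start()
    query.awaitTermination()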

 Analytics-Based Datastore: To analyse already processed data, analytical
tools rely on a data store based on HBase or another NoSQL data warehouse
technology. The data can be presented through a Hive database, which provides
metadata abstraction over the data store for interactive use, or queried with
engines such as Spark SQL. A short Hive/Spark SQL sketch follows.
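
To make the Hive metadata abstraction concrete, the sketch below registers the
curated batch output as an external Hive table and queries it through Spark
SQL. The table name, schema, and location are hypothetical.

    from pyspark.sql import SparkSession

    # Minimal analytics-store sketch: expose curated files as a named
    # Hive table so they can be queried interactively. Names and paths
    # are assumptions.
    spark = (
        SparkSession.builder.appName("analytics-store")
                    .enableHiveSupport()
                    .getOrCreate()
    )

    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS daily_totals (
            day DATE, total_amount DOUBLE
        )
        STORED AS PARQUET
        LOCATION 'hdfs:///lake/curated/daily_totals/'
    """)

    spark.sql("SELECT day, total_amount FROM daily_totals ORDER BY day").show()
    spark.stop()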

 Reporting and Analysis: The generated insights must still be presented, and
this is effectively accomplished by reporting and analysis tools that use
embedded technology to produce useful graphs, analyses, and insights
beneficial to the business. Examples include Cognos, Hyperion, and others.

 Orchestration: Big data solutions consist of data-related tasks that are
repetitive in nature and are encapsulated in workflow chains that transform
source data, move data between sources and sinks, and load it into analytical
stores. Sqoop, Oozie, Data Factory, and others are just a few examples. A
sketch of such a workflow is shown below.
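
The text names Sqoop, Oozie, and Data Factory; as a Python illustration of the
same idea, here is a minimal Apache Airflow DAG (a comparable orchestrator not
mentioned in the original) that chains ingestion, transformation, and loading.
All task commands and job paths are hypothetical.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Minimal orchestration sketch: a daily workflow chain that moves
    # data from a source, transforms it, and loads it into a store.
    # Commands and job paths are illustrative assumptions.
    with DAG(
        dag_id="nightly_big_data_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        ingest = BashOperator(
            task_id="ingest",
            bash_command="sqoop import --connect jdbc:mysql://db/sales --table orders",
        )
        transform = BashOperator(
            task_id="transform",
            bash_command="spark-submit /jobs/daily_batch.py",
        )
        load = BashOperator(
            task_id="load",
            bash_command="spark-submit /jobs/load_to_store.py",
        )

        # Run the tasks in order: ingest, then transform, then load.
        ingest >> transform >> load
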
AWS Big Data Architecture
