
LECTURE 4

Selected Topics in Computer Science CS409


By
Dr. Ashraf Hendam
OUTLINE
• Big Data
• Who is generating Big Data?
• Characteristics of Big Data
• Big Data Open Source Tools
• Big data management tools
• Hadoop Distributed File System (HDFS)
• HDFS attributes
Big Data
“Big Data” is data whose scale, distribution, diversity,
and/or timeliness require the use of new technical
architectures and analytics to enable insights that
unlock new sources of business value.
• Requires new data architectures, analytic sandboxes, new tools, and new analytical methods, and integrates multiple skills into the new role of data scientist.
• Organizations are deriving business benefit from analyzing ever larger and more complex data sets that increasingly require real-time or near-real-time capabilities.
Who is generating Big Data?
Characteristics of Big Data
• Big Data involves amounts of data so large that they cannot be processed by traditional data storage or processing systems.
• Big Data technologies are used by many multinational companies to process data and support the business of many organizations.
• There are five V's of Big Data that explain its characteristics:
- Volume
- Variety
- Veracity
- Velocity
- Value
Characteristics of Big Data
Volume
• The name Big Data itself relates to enormous size.
• Big Data consists of vast volumes of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more.
• Facebook, for example, can generate approximately a billion messages, record the "Like" button being pressed around 4.5 billion times, and receive more than 350 million new posts each day. Big data technologies can handle such large amounts of data.
Characteristics of Big Data
Variety
In the past, data was collected only from databases and spreadsheets, but these days data arrives in many forms, such as PDFs, emails, audio, social media posts, photos, videos, etc.
Different types:
• Relational Data (Tables/Transactions/Legacy Data)
• Text Data (Web)
• Graph Data
• Social Network, Semantic Web (RDF), ...
• Streaming Data
Characteristics of Big Data
Variety
Big Data can be structured, semi-structured, or unstructured, and it is collected from different sources.
Characteristics of Big Data
Variety
Structured
• Structured data is data that can be processed, stored, and retrieved in a fixed format.
• It refers to highly organized information that can be readily and seamlessly stored in, and accessed from, a database by simple search-engine algorithms.
• For instance, the employee table in a company database is structured: the employee details, their job positions, their salaries, etc., are present in an organized manner.
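
As a minimal sketch of this (using Python's built-in sqlite3 module; the employee table and its values are invented for illustration), structured data fits a fixed schema and can be queried directly:

    import sqlite3

    # A fixed schema: every row has the same well-defined columns.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (id INTEGER, name TEXT, position TEXT, salary REAL)")
    conn.execute("INSERT INTO employee VALUES (1, 'Alice', 'Engineer', 90000.0)")
    conn.execute("INSERT INTO employee VALUES (2, 'Bob', 'Analyst', 70000.0)")

    # Because the format is fixed, retrieval is a simple declarative query.
    for row in conn.execute("SELECT name, salary FROM employee WHERE salary > 80000"):
        print(row)  # ('Alice', 90000.0)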
Characteristics of Big Data
Variety
Semi-structured
• Semi-structured data contains both of the formats mentioned before, that is, structured and unstructured data.
• It refers to data that has not been classified under a particular repository (database), yet contains vital information or tags that segregate individual elements within the data.
• Example: XML data files, which are self-describing and defined by an XML schema.
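
A minimal sketch of this idea, using Python's standard xml.etree.ElementTree module (the record format below is invented): there is no fixed table schema, but the tags still identify each element:

    import xml.etree.ElementTree as ET

    # Semi-structured: no table schema, but self-describing tags label the elements.
    doc = "<employee id='1'><name>Alice</name><position>Engineer</position></employee>"

    root = ET.fromstring(doc)
    print(root.get("id"))              # 1
    print(root.find("name").text)      # Alice
    print(root.find("position").text)  # Engineer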
Characteristics of Big Data
Variety
Unstructured
• Unstructured data refers to data that lacks any specific form or structure whatsoever.
• This makes it very difficult and time-consuming to process and analyze unstructured data.
• Email content is an example of unstructured data (see the sketch below).
• Structured and unstructured are two important types of big data.
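
To see why this is harder than the structured and semi-structured cases above, here is a sketch (the email text and regular expression are invented): with no tags or schema, extracting even a simple fact requires ad-hoc pattern matching:

    import re

    email_body = """Hi team,
    Please ship order 4471 to Alice by Friday.
    Thanks, Bob"""

    # No schema to query; we must guess patterns and hope they generalize.
    order_ids = re.findall(r"order\s+(\d+)", email_body)
    print(order_ids)  # ['4471']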
Characteristics of Big Data
Variety
• Quasi-structured: textual data with erratic data formats that can be formatted with effort, tools, and time.
• Example: web clickstream data that may contain some inconsistencies in data values and formats, as in the sketch below.
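
A small sketch of this (the clickstream line format and values are invented): the records almost share a format, but cleaning up the erratic fields takes extra work:

    # Two clickstream records that *almost* share a format: note the
    # inconsistent timestamp styles and the missing status code.
    lines = [
        "2024-01-15 10:02:11 /home 200",
        "15/01/2024 10:02:13 /products/42 -",
    ]

    for line in lines:
        date, time, path, status = line.split()
        # Normalizing erratic values takes effort, tools, and time.
        status = None if status == "-" else int(status)
        print(path, status)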
Characteristics of Big Data
Veracity
• Veracity means how reliable the data is.
• There are many ways to filter or translate the data.
• Veracity is about being able to handle and manage data efficiently, which is also essential in business development.
• For example, Facebook posts with hashtags are data whose reliability has to be assessed.
Characteristics of Big Data
Velocity
• Velocity essentially refers to the speed at which data is being created in real time.
• It comprises the rate of change, the linking of incoming data sets arriving at varying speeds, and bursts of activity.
• Velocity deals with the speed at which data flows in from sources such as application logs, business processes, networks, social media sites, sensors, mobile devices, etc.
Characteristics of Big Data
Value
• Value is an essential characteristic of big data.
• It is not just any data that we process or store; it is valuable and reliable data that we store, process, and analyze.
• What makes Big Data valuable is developing better models, which lead to higher precision.
Big Data Open Source Tools
Big data management tools
Hadoop Distributed File System (HDFS)
• HDFS is the primary storage system used by
Hadoop applications.
• This open source framework works by rapidly
transferring data between nodes.
• HDFS is a key component of many Hadoop
systems, as it provides a means for
managing big data, as well as supporting
big data analytics.
Hadoop Distributed File System (HDFS)
• HDFS operates as a distributed file system designed to run on
commodity hardware.
• HDFS provides high-throughput access to application data; it is suitable for applications that have large data sets and enables streaming access to file system data in Apache Hadoop.
• So, what is Hadoop? And how does it vary from HDFS?
A core difference between Hadoop and HDFS is that Hadoop is
the open source framework that can store, process and
analyze data, while HDFS is the file system of Hadoop that
provides access to data.
• This essentially means that HDFS is a module of Hadoop.
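
As an illustration of HDFS "providing access to data", here is a sketch using the third-party hdfs Python package, a WebHDFS client (the NameNode address, user, and file paths are assumptions; adjust them to a real cluster):

    from hdfs import InsecureClient  # third-party WebHDFS client: pip install hdfs

    # Assumed NameNode web address (9870 is the usual default in Hadoop 3.x).
    client = InsecureClient("http://namenode:9870", user="hadoop")

    # Write a file into HDFS, then stream it back.
    client.write("/data/example.txt", data=b"hello hdfs\n", overwrite=True)
    with client.read("/data/example.txt") as reader:
        print(reader.read())

    print(client.list("/data"))  # directory listing, like 'hdfs dfs -ls /data'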
HDFS attributes
Several attributes of HDFS are:
1. Hardware Failure
• Hardware failure is the norm rather than the exception.
• An HDFS instance may consist of hundreds or thousands of
server machines, each storing part of the file system’s data.
• The fact that there are a huge number of components and
that each component has a non-trivial probability of failure
means that some component of HDFS is always non-functional.
• Detection of faults and quick, automatic recovery from them
is a core architectural goal of HDFS.
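
A quick back-of-the-envelope check of why failure is the norm (the node count and per-node failure probability below are invented illustrative numbers):

    # If each of n servers independently fails on a given day with
    # probability p, the chance that at least one fails is 1 - (1 - p)^n.
    n, p = 1000, 0.001  # assumed: 1000 nodes, 0.1% daily failure rate per node
    p_any_failure = 1 - (1 - p) ** n
    print(f"{p_any_failure:.0%}")  # about 63%: some component failing is routine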
HDFS attributes
Several attributes of HDFS are:
2. Streaming Data Access
• Applications that run on HDFS need streaming access to their
data sets.
• HDFS is designed more for batch processing than for interactive use by users.
• The emphasis is on high throughput of data access rather
than low latency of data access.
3. Large Data Sets
• Applications that run on HDFS have large data sets.
• A typical file in HDFS is gigabytes to terabytes in size.
• It should provide high aggregate data bandwidth and scale
to hundreds of nodes in a single cluster.
• It should support tens of millions of files in a single instance.
HDFS attributes
Several attributes of HDFS are:
4. Simple Coherency Model
• Data is considered coherent if the data an application sees is in sync with the data saved in the data source.
• HDFS applications need a write-once-read-many access
model for files.
• A file once created, written, and closed need not be changed.
• A MapReduce application or a web crawler application fits
perfectly with this model.
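
Continuing with the hypothetical WebHDFS client from the earlier sketch, the write-once-read-many model looks like this (html_bytes and process are placeholder names):

    # Write once: the file is created, written, and closed in one call.
    client.write("/data/crawl/page_0001.html", data=html_bytes)

    # Read many: any number of readers can now stream the file.
    # HDFS does not support rewriting bytes in the middle of an existing file.
    with client.read("/data/crawl/page_0001.html") as reader:
        process(reader.read())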
HDFS attributes
Several attributes of HDFS are:
5. “Moving Computation is Cheaper than Moving Data”
• A computation requested by an application is much more
efficient if it is executed near the data it operates on.
• This is especially true when the size of the data set is huge.
• This minimizes network congestion and increases the overall
throughput of the system.
• HDFS provides interfaces for applications to move
themselves closer to where the data is located.
6. Portability Across Heterogeneous Hardware and Software
Platforms
• HDFS has been designed to be easily portable from one
platform to another.
• This facilitates widespread adoption of HDFS as a platform of
choice for a large set of applications.
File System and Storage
• Efficient file handling to support parallel and distributed operations on large data sets required an entirely new file system format.
• Researchers and computing vendors have come up with suitable storage solutions to achieve optimal performance in cloud-like high-performance environments.
