The document discusses Big Data, its characteristics, and the tools used for its management, particularly focusing on the Hadoop Distributed File System (HDFS). It outlines the five v's of Big Data: Volume, Variety, Veracity, Velocity, and Value, explaining how they define the nature of Big Data. Additionally, it describes the attributes of HDFS, emphasizing its role in managing large datasets and supporting big data analytics.
By Dr. Ashraf Hendam

OUTLINE
• Big Data
• Who is generating Big Data?
• Characteristics of Big Data
• Big Data Open Source Tools
• Big data management tools
• Hadoop Distributed File System (HDFS)
• HDFS attributes

Big Data
"Big Data" is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value. It requires:
• New data architectures and analytic sandboxes
• New tools
• New analytical methods
• Integrating multiple skills into the new role of data scientist
Organizations are deriving business benefit from analyzing ever larger and more complex data sets that increasingly require real-time or near-real-time capabilities.

Who is generating Big Data?

Characteristics of Big Data
• Big Data refers to data at a scale that cannot be handled by traditional data storage or processing systems.
• It is used by many multinational companies to process data and run the business of many organizations.
• Five V's describe the characteristics of Big Data:
- Volume
- Variety
- Veracity
- Velocity
- Value

Volume
• The name "Big Data" itself refers to enormous size.
• Big Data is a vast volume of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more.
• Facebook, for example, generates approximately a billion messages, records more than 4.5 billion clicks of the "Like" button, and receives more than 350 million new posts each day.
• Big data technologies are designed to handle data at this scale.

Variety
In the past, data was collected only from databases and spreadsheets; today it arrives in many forms: PDFs, emails, audio, social media posts, photos, videos, etc.
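A short Python sketch of how the same information can arrive in different forms, from a fixed schema down to free text (the sample records are invented for illustration):

```python
import csv
import io
import json  # JSON is another common semi-structured format
import xml.etree.ElementTree as ET

# Structured: fixed schema, every record has the same fields (table/CSV).
structured = io.StringIO("id,name,salary\n1,Alice,50000\n2,Bob,60000\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing tags, but fields may vary per record (XML).
semi = ET.fromstring("<employee id='1'><name>Alice</name></employee>")

# Unstructured: free text with no schema; extracting facts needs parsing or NLP.
unstructured = "Alice was promoted last quarter and now earns 50,000."

print(rows[0]["name"])          # fields addressable by schema
print(semi.findtext("name"))    # fields addressable by tag
print("Alice" in unstructured)  # only raw text search is possible
```

The further down this list a source sits, the more effort (tools, time, cleaning) is needed before it can be queried, which is exactly the cost that the variety dimension of Big Data describes.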
Different types:
• Relational data (tables/transactions/legacy data)
• Text data (web)
• Graph data: social networks, Semantic Web (RDF), ...
• Streaming data

Big Data can be structured, unstructured, or semi-structured, collected from different sources.

Structured
• Structured data is data that can be processed, stored, and retrieved in a fixed format.
• It refers to highly organized information that can be readily and seamlessly stored in, and accessed from, a database by simple search-engine algorithms.
• For instance, the employee table in a company database is structured: the employee details, job positions, salaries, etc., are present in an organized manner.

Semi-structured
• Semi-structured data combines aspects of the two formats mentioned before, structured and unstructured.
• It refers to data that has not been classified under a particular repository (database), but that nevertheless contains vital information or tags that segregate individual elements within the data.
• Example: XML data files, which are self-describing and defined by an XML schema.

Unstructured
• Unstructured data lacks any specific form or structure whatsoever.
• This makes it very difficult and time-consuming to process and analyze.
• Email content is an example of unstructured data.
• Structured and unstructured are two important types of big data.

Quasi-structured
• Textual data with erratic formats that can be formatted with effort, tools, and time.
• Example: web clickstream data that may contain inconsistencies in data values and formats.

Veracity
• Veracity refers to how reliable the data is.
• There are many ways to filter or translate the data.
• Veracity is about being able to handle and manage data of uncertain quality efficiently.
• Big Data is also essential in business development; for example, Facebook posts with hashtags.

Velocity
• Velocity refers to the speed at which data is created, often in real time.
• It comprises the rate of change, the linking of incoming data sets arriving at varying speeds, and activity bursts.
• Velocity deals with the speed at which data flows in from sources such as application logs, business processes, networks, social media sites, sensors, mobile devices, etc.

Value
• Value is an essential characteristic of big data.
• It is not merely the data that we process or store; it is the valuable and reliable data that we store, process, and analyze.
• What makes Big Data valuable is developing better models, which leads to higher precision.

Big Data Open Source Tools

Big data management tools

Hadoop Distributed File System (HDFS)
• HDFS is the primary storage system used by Hadoop applications.
• This open-source framework works by rapidly transferring data between nodes.
• HDFS is a key component of many Hadoop systems, as it provides a means for managing big data as well as supporting big data analytics.
• HDFS is a distributed file system designed to run on commodity hardware.
• HDFS provides high-throughput access to application data and is suitable for applications that have large data sets; it enables streaming access to file system data in Apache Hadoop.
• So, what is Hadoop? And how does it differ from HDFS?
A core difference between Hadoop and HDFS is that Hadoop is the open-source framework that can store, process, and analyze data, while HDFS is the file system of Hadoop that provides access to that data. This essentially means that HDFS is a module of Hadoop.

HDFS attributes
Several attributes of HDFS are:

1. Hardware Failure
• Hardware failure is the norm rather than the exception.
• An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data.
• Because there is a huge number of components and each component has a non-trivial probability of failure, some component of HDFS is always non-functional.
• Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

2. Streaming Data Access
• Applications that run on HDFS need streaming access to their data sets.
• HDFS is designed more for batch processing than for interactive use by users.
• The emphasis is on high throughput of data access rather than low latency.

3. Large Data Sets
• Applications that run on HDFS have large data sets.
• A typical file in HDFS is gigabytes to terabytes in size.
• HDFS should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster.
• It should support tens of millions of files in a single instance.

4. Simple Coherency Model
• Data is considered coherent if the copy you are working with is in sync with the data saved at the source.
• HDFS applications need a write-once-read-many access model for files.
• A file, once created, written, and closed, need not be changed.
• A MapReduce application or a web-crawler application fits this model perfectly.

5. "Moving Computation is Cheaper than Moving Data"
• A computation requested by an application is much more efficient if it is executed near the data it operates on, especially when the data set is huge.
• This minimizes network congestion and increases the overall throughput of the system.
• HDFS provides interfaces for applications to move themselves closer to where the data is located.

6. Portability Across Heterogeneous Hardware and Software Platforms
• HDFS has been designed to be easily portable from one platform to another.
• This facilitates widespread adoption of HDFS as the platform of choice for a large set of applications.

File System and Storage
• Efficient file handling to support parallel and distributed operations on large data sets required an entirely new file system format.
• Researchers and computing vendors have developed storage solutions to achieve optimal performance in cloud-like, high-performance environments.
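The hardware-failure and large-data-set attributes above rest on one basic mechanism: a file is split into large fixed-size blocks, and each block is replicated on several datanodes, so the loss of any one machine loses no data. A minimal Python sketch of the idea (the function names are hypothetical, not the HDFS API; a 128 MB block size and a replication factor of 3 are HDFS defaults):

```python
# Illustrative sketch, not real HDFS code: split a file into blocks and
# place each block's replicas on distinct datanodes (round-robin).

def split_into_blocks(file_size, block_size=128 * 1024 * 1024):
    """Return the sizes of the blocks a file of `file_size` bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    return [
        [nodes[(b + i) % len(nodes)] for i in range(replication)]
        for b in range(num_blocks)
    ]

blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
replicas = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks))   # 3 blocks: 128 MB + 128 MB + 44 MB
print(replicas[0])   # each block lives on 3 different datanodes
```

In real HDFS the NameNode keeps this block-to-datanode mapping and re-replicates blocks when a datanode fails; the large block size is also what makes "moving computation to the data" pay off, since each map task can work on one local block.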