Unit 5

The document discusses concepts of big data and data lakes. It defines big data, data sources, and benefits over traditional databases. It also defines data warehouses, OLTP, and OLAP. Additionally, it defines data lakes, their architecture and significance, and compares them to data warehouses.

Uploaded by

userdemo12334
Copyright
© All Rights Reserved

Unit-5: Concepts of Big Data and Data Lake

5.1 Concepts of Big Data
5.1.1 Sources of Big Data
5.1.2 Big Data Benefits over Traditional Database
5.1.3 Concepts of Data Warehouse
5.1.3.1 Concepts of Data Processing Techniques
5.1.3.1.1 OLTP (Online Transaction Processing)
5.1.3.1.2 OLAP (Online Analytical Processing)
5.2 Concepts of Data Lake
5.2.1 Data Lake Concepts and Its Architecture
5.2.2 Significance of Data Lake
5.2.3 Comparison of Data Lake and Data Warehousing

5.1 Concepts of Big Data


• Big Data refers to the massive volume of structured and unstructured data generated by
various sources, which is characterized by its size, complexity, and the speed at which it is
generated and processed.
• The concept revolves around managing and deriving valuable insights from these vast
datasets that traditional data processing tools may struggle to handle. Key characteristics of
Big Data are often described using the 3Vs: Volume, Velocity, and Variety.

5.1.1 Sources of Big Data


Big Data originates from sources such as social media platforms, sensors, and application
log files. Data from these sources is characterized by the 3Vs: Volume, Velocity, and Variety.

Volume:
• Refers to the massive amounts of data generated daily. Big Data involves datasets
that are too large to be processed and analyzed using traditional databases and
tools.
• Examples include social media posts, sensor data, and log files.

Velocity:
• Describes the speed at which data is generated, processed, and analyzed. Real-time
applications and streaming data contribute to high data velocity.
• Big Data often requires real-time or near-real-time processing to keep up with the
constant influx of data from various sources.

Variety:
• Encompasses the diverse types of data: structured, semi-structured, and
unstructured.
• Examples include text, images, videos, social media posts, and sensor data.

5.1.2 Big Data Benefits over Traditional Database


Big Data technologies offer advantages over traditional databases when workloads involve
large volumes, diverse data types, and high velocities. Benefits include:

Scalability: Big Data technologies can scale horizontally, handling massive amounts of data across
distributed systems.

Flexibility: Big Data systems can accommodate various data types and formats, allowing for flexible
data storage and processing.

Real-time Processing: Big Data platforms enable real-time data processing, which is critical for
applications like fraud detection and monitoring.

Cost-Effectiveness: Distributed computing and open-source solutions can make Big Data platforms
cost-effective compared to traditional databases at large scale.
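
As a rough illustration of horizontal scaling, the sketch below spreads record keys across a configurable number of nodes with a stable hash. The function names (`node_for`, `partition`) and the `user-N` keys are made up for this example; real Big Data systems use more elaborate placement schemes.

```python
import hashlib
from collections import defaultdict


def node_for(key: str, num_nodes: int) -> int:
    """Map a record key to a node index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def partition(keys, num_nodes):
    """Group keys by the (hypothetical) node that would store them."""
    placement = defaultdict(list)
    for key in keys:
        placement[node_for(key, num_nodes)].append(key)
    return dict(placement)


keys = [f"user-{i}" for i in range(1000)]
three_nodes = partition(keys, 3)
six_nodes = partition(keys, 6)

# With more nodes, each node holds fewer records on average.
print({node: len(items) for node, items in sorted(three_nodes.items())})
print({node: len(items) for node, items in sorted(six_nodes.items())})
```

Adding nodes spreads the same 1000 keys over more machines, which is the basic idea behind scaling out rather than buying a larger single server.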

5.1.3 Concepts of Data Warehouse


• A Data Warehouse is a central repository that stores and manages large volumes of
structured data from various sources, making it available for complex analysis and
reporting.
• It is designed to support decision-making processes by providing a consolidated and
organized view of an organization's historical and current data. Understanding a
Data Warehouse starts with the two data processing techniques described next.

5.1.3.1 Concepts of Data Processing Techniques

5.1.3.1.1 OLTP (Online Transaction Processing)


• OLTP is a type of data processing that focuses on managing and processing
transaction-oriented applications. It involves short and simple queries, often related
to inserting, updating, and deleting records.
• OLTP systems are designed for consistency and handle a large number of concurrent
transactions.
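
A minimal sketch of an OLTP-style workload, using Python's built-in `sqlite3` module as a stand-in for a transactional database. The `accounts` table, the `transfer` helper, and the balances are assumptions made for illustration only.

```python
import sqlite3

bank = sqlite3.connect(":memory:")
bank.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
bank.execute("INSERT INTO accounts (id, balance) VALUES (1, 100), (2, 50)")
bank.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit is undone too.
        return False


ok = transfer(bank, 1, 2, 30)
rejected = transfer(bank, 1, 2, 500)
balances = dict(bank.execute("SELECT id, balance FROM accounts ORDER BY id"))
print(ok, rejected, balances)  # True False {1: 70, 2: 80}
```

Each call is a short transaction that touches a couple of rows, and the failed transfer leaves both balances unchanged, which is the consistency property OLTP systems are built around.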

5.1.3.1.2 OLAP (Online Analytical Processing)


• OLAP is geared towards complex queries and analytical processing.
• It involves aggregations and calculations over large datasets.
• OLAP systems are optimized for read-heavy operations and are crucial for business
intelligence and decision support systems.
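
By contrast, an OLAP-style query reads many rows and aggregates them. The sketch below again uses `sqlite3` purely for illustration; the `sales` table and its figures are invented, and a real OLAP system would run such queries over far larger datasets.

```python
import sqlite3

sales_db = sqlite3.connect(":memory:")
sales_db.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount INTEGER)")
sales_db.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("North", 2022, 120),
        ("North", 2023, 150),
        ("South", 2022, 80),
        ("South", 2023, 95),
        ("South", 2023, 25),
    ],
)

# Read-only analytical query: total and average sales per region and year.
summary = sales_db.execute(
    """
    SELECT region, year, SUM(amount) AS total, AVG(amount) AS average
    FROM sales
    GROUP BY region, year
    ORDER BY region, year
    """
).fetchall()

for row in summary:
    print(row)
```

The query never modifies data; it scans the table and summarizes it, which is the read-heavy pattern described above.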

5.2 Concepts of Data Lake

• A Data Lake is a centralized repository that allows organizations to store vast amounts of
structured, semi-structured, and unstructured data at any scale.
• Unlike traditional databases or data warehouses, a Data Lake does not require predefined
schemas before storing the data, making it a highly flexible and scalable solution.

5.2.1 Data Lake Concepts and Its Architecture

• A Data Lake keeps structured, semi-structured, and unstructured data in one centralized
repository at any scale. Key concepts include:

Storage: Data Lakes store data in its raw, unprocessed form, so diverse data types can be kept
without extensive upfront structuring.

Scalability: Data Lakes can scale horizontally, handling vast amounts of data by distributing it across
clusters of inexpensive hardware.

Schema-on-Read: Unlike traditional databases, Data Lakes follow a schema-on-read approach. The
structure is imposed on the data only when it's read, enabling flexibility.
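
The schema-on-read idea can be sketched as follows, assuming raw JSON lines stand in for files in a lake. The field names (`user`, `amount`) and the `read_with_schema` helper are hypothetical and chosen only for this example.

```python
import json

# Raw records are stored exactly as they arrived; nothing is validated on write.
raw_lake = [
    '{"user": "a1", "amount": "19.99", "ts": "2024-01-05"}',
    '{"user": "b2", "amount": 5, "note": "extra field is fine"}',
    '{"user": "c3"}',
]


def read_with_schema(raw_records):
    """Apply a (user: str, amount: float) schema at read time."""
    for line in raw_records:
        record = json.loads(line)
        yield {
            "user": str(record["user"]),
            # Missing amounts default to 0.0; numeric strings are cast to float.
            "amount": float(record.get("amount", 0.0)),
        }


purchases = list(read_with_schema(raw_lake))
print(purchases)
```

A different consumer could read the same raw records with a different schema (for example, one that keeps `ts`), without rewriting anything already stored.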

Figure: Data Lake Architecture

5.2.2 Significance of Data Lake


The significance of Data Lakes lies in their ability to store and process large volumes of raw
data efficiently. Key points include:

Advanced Analytics: Data Lakes support advanced analytics, machine learning, and other data-
intensive applications by providing a flexible and scalable storage solution.

Cost-Efficiency: They offer a cost-effective solution for storing large volumes of data compared to
traditional storage solutions.

Flexibility: Data Lakes allow organizations to store and analyze diverse data types without the need
for extensive upfront structuring.

5.2.3 Comparison of Data Lake and Data Warehousing

Data Lakes and Data Warehouses serve different purposes, and their comparison involves:

Data Types: Data Lakes store raw data of any type (structured, semi-structured, and unstructured),
while Data Warehouses store structured, processed data.

Schema: Data Lakes use a schema-on-read approach, providing flexibility, whereas Data Warehouses
use a schema-on-write approach for structured data.

Processing: Data Lakes are suitable for both real-time and batch processing, while Data Warehouses
are optimized for batch processing and complex queries.
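
To make the schema difference concrete, the sketch below writes the same incoming rows to a warehouse-style table, which enforces its schema on write, and to a lake-style list, which keeps every row as it arrives. The table, column names, and sample rows are assumptions for illustration, with `sqlite3` standing in for a warehouse.

```python
import sqlite3

incoming = [
    ("a1", 19.99),
    ("b2", None),  # amount is missing
]

# Warehouse-style table: the schema is enforced when a row is written.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE purchases (user_id TEXT NOT NULL, amount REAL NOT NULL)"
)

# Lake-style store: rows are kept as they arrive and interpreted later.
lake = []

rejected_rows = []
for row in incoming:
    lake.append(row)
    try:
        with warehouse:
            warehouse.execute("INSERT INTO purchases VALUES (?, ?)", row)
    except sqlite3.IntegrityError:
        # NOT NULL violation: the warehouse refuses the incomplete row.
        rejected_rows.append(row)

stored = warehouse.execute("SELECT user_id, amount FROM purchases").fetchall()
print(len(lake), stored, rejected_rows)
```

The warehouse ends up with only the clean row, while the lake retains both, leaving the decision about the incomplete record to whoever reads it later.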
