Unit 5
Velocity:
Describes the speed at which data is generated, processed, and analyzed. Real-time
applications and streaming sources drive high data velocity, so Big Data systems often
process data in real time or near real time to keep up with the constant influx from
many sources.
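To make near-real-time processing concrete, here is a minimal sketch of a sliding-window
counter that updates its result as each event arrives instead of waiting for a full batch.
The class name SlidingWindowCounter and the sample timestamps are hypothetical, and the
sketch assumes events arrive in non-decreasing time order.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts the events seen in the last `window_seconds` seconds (hypothetical helper)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def add(self, timestamp):
        """Record one event and return how many events fall inside the current window."""
        self.timestamps.append(timestamp)
        # Evict events that are older than the window relative to the newest event.
        cutoff = timestamp - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        return len(self.timestamps)


# Assumed sample stream: event times in seconds, already in arrival order.
counter = SlidingWindowCounter(window_seconds=10)
counts = [counter.add(t) for t in [0, 1, 2, 11, 12, 25]]
print(counts)  # [1, 2, 3, 2, 2, 1]
```

Each call to add does a small, bounded amount of work, which is what lets this style of
processing keep pace with a continuous stream.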
Variety:
Encompasses the diverse types of data that Big Data includes: structured, semi-structured,
and unstructured. Examples are text, images, videos, social media posts, and sensor data.
Scalability: Big Data technologies scale horizontally, adding machines to a distributed system to
handle massive amounts of data.
Flexibility: Big Data systems can accommodate various data types and formats, allowing for flexible
data storage and processing.
Real-time Processing: Big Data platforms enable real-time data processing, critical for applications
like fraud detection and monitoring.
Cost-Effectiveness: Distributed computing on inexpensive hardware and open-source software can make
Big Data platforms cheaper for large data volumes than scaling up a traditional database.
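The horizontal scaling described above can be sketched as hash partitioning: each record is
assigned to one of several nodes by hashing its key, so adding nodes spreads the same data
more thinly. This is only an illustration under stated assumptions; the function names
node_for and partition, the key format, and the in-memory lists standing in for machines are
all hypothetical.

```python
import hashlib


def node_for(key, num_nodes):
    """Map a record key to a node index using a stable hash (hypothetical helper)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition(records, num_nodes):
    """Distribute (key, value) records across `num_nodes` in-memory lists."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in records:
        nodes[node_for(key, num_nodes)].append((key, value))
    return nodes


# Assumed sample data: 1000 records keyed by a made-up user id.
records = [(f"user-{i}", i) for i in range(1000)]
three_nodes = partition(records, 3)
six_nodes = partition(records, 6)

print([len(n) for n in three_nodes])
print([len(n) for n in six_nodes])
```

Simple modulo hashing reassigns most keys when the node count changes. It is used here only
because it is the shortest way to show the idea.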
A Data Lake is a centralized repository that allows organizations to store vast amounts of
structured, semi-structured, and unstructured data at any scale.
Unlike traditional databases or data warehouses, a Data Lake does not require predefined
schemas before storing the data, making it a highly flexible and scalable solution.
Key concepts of a Data Lake include:
Storage: Data Lakes store data in its raw, unprocessed form, without extensive upfront structuring,
so diverse data types can be kept side by side.
Scalability: Data Lakes can scale horizontally, handling vast amounts of data by distributing it across
clusters of inexpensive hardware.
Schema-on-Read: Unlike traditional databases, Data Lakes follow a schema-on-read approach. The
structure is imposed on the data only when it's read, enabling flexibility.
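A small sketch may help with schema-on-read. Below, raw JSON lines are kept exactly as they
arrived, and each reader imposes its own structure only when it reads them. The sample
records, field names, and the functions read_purchases and read_sensor_readings are
hypothetical, and a Python list stands in for files in the lake.

```python
import json

# Raw records are stored as produced; fields and types differ from source to source.
raw_lines = [
    '{"customer": "ana", "amount": "19.90", "ts": "2024-01-05"}',
    '{"customer": "bo", "amount": 5}',
    '{"device": "sensor-7", "temp_c": 21.4}',
]


def read_purchases(lines):
    """Apply a purchase schema (customer: str, amount: float) at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if "customer" not in record or "amount" not in record:
            continue  # this record does not fit the purchase schema; skip it
        rows.append({"customer": str(record["customer"]),
                     "amount": float(record["amount"])})
    return rows


def read_sensor_readings(lines):
    """Apply a different schema (device: str, temp_c: float) to the same raw data."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if "device" not in record or "temp_c" not in record:
            continue
        rows.append({"device": str(record["device"]),
                     "temp_c": float(record["temp_c"])})
    return rows


purchases = read_purchases(raw_lines)
readings = read_sensor_readings(raw_lines)
print(purchases)
print(readings)
```

Nothing was rejected when the data was stored. The cost is that every reader has to handle
missing fields and inconsistent types itself.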
Data Lake Architecture
Advanced Analytics: Data Lakes support advanced analytics, machine learning, and other
data-intensive applications by providing a flexible and scalable storage solution.
Cost-Efficiency: They offer a cost-effective solution for storing large volumes of data compared to
traditional storage solutions.
Flexibility: Data Lakes allow organizations to store and analyze diverse data types without the need
for extensive upfront structuring.
Data Lakes and Data Warehouses serve different purposes, and their comparison involves:
Data Types: Data Lakes store raw data of any kind (structured, semi-structured, and unstructured),
while Data Warehouses store structured, processed data.
Schema: Data Lakes use a schema-on-read approach, providing flexibility, whereas Data Warehouses
use a schema-on-write approach, in which data must fit a predefined schema before it is stored.
Processing Time: Data Lakes are suitable for real-time and batch processing, while Data Warehouses
are optimized for batch processing and complex queries.
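The schema-on-write side of this comparison can be illustrated with Python's built-in sqlite3
module standing in for a warehouse table. The table name, column names, and sample rows are
hypothetical. The point of the sketch is only that rows violating the declared schema are
rejected at write time, whereas a Data Lake would store them as-is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is declared before any data is written.
conn.execute(
    "CREATE TABLE purchases ("
    " customer TEXT NOT NULL,"
    " amount REAL NOT NULL CHECK (typeof(amount) = 'real')"
    ")"
)

# Assumed incoming rows; two of them do not fit the schema.
incoming = [
    ("ana", 19.9),
    ("bo", None),       # missing amount
    ("cy", "unknown"),  # amount is not numeric
]

accepted, rejected = [], []
for row in incoming:
    try:
        conn.execute("INSERT INTO purchases (customer, amount) VALUES (?, ?)", row)
        accepted.append(row)
    except sqlite3.IntegrityError:
        # Constraint violation: the row is refused at write time.
        rejected.append(row)

stored = conn.execute("SELECT customer, amount FROM purchases").fetchall()
print(stored)    # [('ana', 19.9)]
print(rejected)  # [('bo', None), ('cy', 'unknown')]
```

Because every stored row is guaranteed to match the schema, queries against the table need no
defensive checks, which is one reason warehouses suit complex queries.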