What is data
Data can be facts, figures, observations, instructions, or measurements that are collected and stored so they can be used for analysis, decision-making, or problem-solving.
What is information
Information must satisfy the following qualities: timeliness, accuracy, and completeness.
Data science
A multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
Data science is the field that involves using data to gain insights, make decisions, and solve
problems. It combines elements from:
Computer Science (to process and handle data, often using code)
Domain Knowledge (to understand the context and draw meaningful conclusions)
Data processing is the act of taking raw data and turning it into meaningful information. It’s like
taking ingredients (raw data), following a recipe (processing), and ending up with a meal (useful
info).
Collection > Preparation/Cleaning > Input > Processing > Output > Storage
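A minimal sketch of this cycle in Python (the sensor readings and the cleaning rule are invented for illustration):

# Collection: raw data as it might arrive (messy strings, gaps).
raw_readings = ["21.5", "22.0", "n/a", "23.4", ""]
# Preparation / cleaning: drop unusable entries and convert types.
cleaned = [float(r) for r in raw_readings if r.replace(".", "").isdigit()]
# Processing: turn the cleaned data into meaningful information.
average = sum(cleaned) / len(cleaned)
# Output: the "meal" - useful information someone can act on.
print(f"Average reading: {average:.1f}")  # Average reading: 22.3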
Data type is an attribute of data which tells the compiler or interpreter how the programmer
intends to use the data.
In other words, a data type defines what kind of value a piece of data holds, like a label that tells the computer how to interpret and work with that data.
Common data types (in computer science and computer programming) include: integers, Booleans, characters, floating-point numbers, and alphanumeric strings.
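For example, in Python each value carries a data type that tells the interpreter how to work with it:

age = 25  # integer
is_student = True  # Boolean
grade = "A"  # character (in Python, a one-character string)
height = 1.75  # floating-point number
username = "ali25"  # alphanumeric string
print(type(age), type(height))  # <class 'int'> <class 'float'>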
On the other hand, for the analysis of data, it is important to understand that there are three common types of data structures:
Structured data: considered the most traditional form of data storage; it depends on the existence of a data model.
Unstructured data: the ability to analyze unstructured data is especially relevant in the context of big data, since a large part of an organization's data is unstructured. The ability to extract value from unstructured data is one of the main drivers behind the quick growth of big data.
Semi-structured data: does not follow a rigid schema, but tags or markers make its structure easy to recover; JSON and XML are the common examples shown below.
Example:
JSON
{
"name": "Ali",
"age": 25
}
✅ Human-readable
✅ Easy to parse in code
✅ Used a lot in modern web and app development
Example:
<person>
<name>Ali</name>
<age>25</age>
<skills>
<skill>JavaScript</skill>
<skill>React</skill>
</skills>
</person>
✅ Very structured
✅ Good for complex data
✅ Still used in enterprise systems and legacy software
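Both examples can be parsed with Python's standard library (json and xml.etree.ElementTree); a minimal sketch using the records above:

import json
import xml.etree.ElementTree as ET

# Parse the JSON example into a Python dict.
person = json.loads('{"name": "Ali", "age": 25}')
print(person["name"], person["age"])  # Ali 25

# Parse the XML example and read the same fields plus the skills list.
xml_text = """
<person>
<name>Ali</name>
<age>25</age>
<skills>
<skill>JavaScript</skill>
<skill>React</skill>
</skills>
</person>
"""
root = ET.fromstring(xml_text)
print(root.find("name").text)  # Ali
print([s.text for s in root.findall("./skills/skill")])  # ['JavaScript', 'React']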
📥 Data Acquisition means collecting or obtaining data from various sources so it can be used
for analysis, processing, or storage.
Data Analysis is the process of examining, organizing, and interpreting data to discover useful information, patterns, trends, or insights. It answers the question: "What happened?"
Data curation
It is the active management of data over its life cycle to ensure it meets necessary data quality
requirements for its effective usage.
Data storage
Data usage
Big Data refers to extremely large and complex datasets that are difficult to manage, process,
or analyze using traditional tools (like Excel or small databases).
Think data from millions of users, real-time sensors, or social media platforms — way too
much to handle with just your laptop!
The 5 Vs of Big Data (core pillars): the characteristics that make big data different from other data processing.
Volume
Velocity
Variety
Veracity
Value
The Big Data Life Cycle describes the end-to-end journey of data — from the moment it's
generated to the moment it's used for decision-making.
The general categories of activities involved with big data processing:
1. Data Ingestion
2. Data Storage
Data is stored in systems that can handle large volume and variety.
3. Data Analysis
o Statistical methods
o Visualizations
4. Data Visualization
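A toy end-to-end run of these four stages in plain Python (the click counts are invented; real pipelines would use dedicated tools at every step):

import csv, io, statistics
# 1. Ingestion: read raw records (an in-memory CSV stands in for a stream or file).
raw = io.StringIO("user,clicks\nali,12\nsara,7\nali,5\nomar,9\n")
rows = list(csv.DictReader(raw))
# 2. Storage: keep the parsed records in a structure we can query.
store = [(r["user"], int(r["clicks"])) for r in rows]
# 3. Analysis: a simple statistical summary per user.
totals = {}
for user, clicks in store:
    totals[user] = totals.get(user, 0) + clicks
print(totals, "mean:", statistics.mean(totals.values()))  # {'ali': 17, 'sara': 7, 'omar': 9} mean: 11
# 4. Visualization: a crude text bar chart.
for user, total in sorted(totals.items()):
    print(user, "#" * total)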
Clustered computing
Clustered computing is when multiple computers (called nodes) work together as a single,
unified system to perform tasks — especially when one computer alone isn’t powerful or fast
enough.
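On one machine you can mimic the idea with worker processes: the job is split into chunks, each "node" handles its share, and the partial results are combined (a loose analogy sketched in Python, not a real cluster):

from multiprocessing import Pool

def count_words(chunk):
    # The work one "node" does: count the words in its share of the data.
    return len(chunk.split())

if __name__ == "__main__":
    words = "big data needs many machines working together as one system".split()
    mid = len(words) // 2
    chunks = [" ".join(words[:mid]), " ".join(words[mid:])]  # split the job
    with Pool(processes=2) as pool:
        partial = pool.map(count_words, chunks)  # each worker runs in parallel
    print(sum(partial))  # combined result: 10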
🧱 HDFS (Hadoop Distributed File System)
HDFS is primarily a storage system designed for Big Data, and while not a traditional database
management system, it is used in conjunction with data processing frameworks like Hadoop,
Spark, and Hive.
✅ Pros of HDFS:
🔗 Works with Hadoop Ecosystem: seamlessly integrates with big data processing tools like MapReduce, Hive, Pig, and Spark.
❌ Cons of HDFS:
🧠 Single Point of Failure (NameNode): if the NameNode fails (without a high-availability setup), the whole system may stop functioning.
Relational DBMS (RDBMS)
These are traditional databases that store data in structured tables with rows and columns. They support SQL for querying data.
✅ Pros of RDBMS:
🔍 Advanced Querying (SQL): supports complex queries using SQL (e.g., joins, aggregations, filtering), as in the sketch after this list.
❌ Cons of RDBMS:
🧱 Rigid Schema: requires a predefined schema; inflexible when dealing with unstructured data.
🐢 Not Ideal for Big Data: performance may degrade as data grows into the terabyte or petabyte range.
🔄 Write Performance: relational databases can be slower for write-heavy workloads (e.g., logs, real-time streaming).
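The querying strength is easy to demonstrate with Python's built-in sqlite3 module (the tables and rows below are invented for the demo):

import sqlite3
con = sqlite3.connect(":memory:")
cur = con.cursor()
# Rigid, predefined schema: columns and types are fixed up front.
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ali"), (2, "Sara")])
cur.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 30.0), (1, 12.5), (2, 8.0)])
# One declarative statement combines a join, an aggregation, and sorting.
cur.execute("""
    SELECT u.name, SUM(o.amount)
    FROM users u JOIN orders o ON o.user_id = u.id
    GROUP BY u.name ORDER BY u.name
""")
print(cur.fetchall())  # [('Ali', 42.5), ('Sara', 8.0)]
con.close()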
NoSQL DBMS
NoSQL databases (like MongoDB, Cassandra, and Couchbase) are designed for high scalability and flexibility, and they can handle unstructured or semi-structured data. They are often used in distributed systems where horizontal scaling is important.
✅ Pros of NoSQL:
⚖️ Horizontal Scalability: easily scales out across multiple servers and regions, making them suitable for big data.
🚀 High Performance: great for write-heavy workloads (e.g., logs, IoT devices) and handling large volumes of data.
📈 Specialized Data Models: supports various data models such as document (MongoDB), key-value (Redis), graph (Neo4j), and column-family (Cassandra); a toy document-model sketch follows this list.
❌ Cons of NoSQL:
🧱 Eventual Consistency: may not provide strong consistency (ACID) like traditional RDBMS, which can lead to temporarily stale reads.
Complex Data Modeling: designing data models can be more challenging, especially for developers used to relational databases.
🔍 Limited Querying: while NoSQL databases support basic queries, they lack full support for SQL-style joins and complex queries (though tools like the MongoDB Aggregation Framework are improving this).
🧑‍💻 Young Technology: some NoSQL DBMSs are still evolving, and their ecosystem may not be as mature as relational databases.
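The flexibility of the document model can be sketched with plain Python dicts (this only mimics what a store like MongoDB allows; it is not MongoDB's actual API):

# A document "collection": records need not share one fixed schema.
users = [
    {"name": "Ali", "age": 25, "skills": ["JavaScript", "React"]},
    {"name": "Sara", "city": "Amman"},  # different fields, still a valid document
    {"name": "Omar", "age": 30, "premium": True},
]
# A basic query filters documents by a field while tolerating missing keys.
adults = [u for u in users if u.get("age", 0) >= 18]
print([u["name"] for u in adults])  # ['Ali', 'Omar']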
Comparison of data models:
HDFS: files and blocks
Relational DBMS: tables with schema
NoSQL DBMS: key-value, document, column-family, graph
🧠 Summary
HDFS: Ideal for storing massive datasets (often unstructured) in a distributed system for
batch processing. It’s not a DBMS in the traditional sense but works as the storage layer
for big data ecosystems.
Relational DBMS: Best for managing structured data with ACID compliance. Excellent
for transactional systems but struggles with scaling to large datasets (TBs and beyond).
NoSQL DBMS: Best for high scalability, handling unstructured or semi-structured data,
and high write throughput. Offers flexibility, but lacks the full feature set of relational
databases (like complex joins).