
What is data

Data can be facts, figures, observations, instructions, or measurements that are collected and stored so they can be used for analysis, decision-making, or problem-solving.

It should be suitable for communication, interpretation, or processing by humans or electronic machines.

Data is represented by characters such as alphabets and numbers.

What is information

Information is organized, processed data.

Information must satisfy the following qualities: timeliness, accuracy, and completeness.

Data science

A multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.

Data science is the field that involves using data to gain insights, make decisions, and solve
problems. It combines elements from:

 Statistics and Mathematics (to analyze data)

 Computer Science (to process and handle data, often using code)

 Domain Knowledge (to understand the context and make meaningful conclusions)

Data processing is the act of taking raw data and turning it into meaningful information. It’s like
taking ingredients (raw data), following a recipe (processing), and ending up with a meal (useful
info).

Collection > Preparation/Cleaning > Input > Processing > Output > Storage
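A minimal Python sketch of those stages, using made-up survey answers as the raw data:

# Collection: gather raw data (hard-coded here; real data would come from files, sensors, forms)
raw_data = ["  42 ", "17", "", "abc", "25"]

# Preparation / Cleaning: strip whitespace and drop entries that are not numbers
cleaned = [s.strip() for s in raw_data if s.strip().isdigit()]

# Input: convert the cleaned text into values the processing step can use
numbers = [int(s) for s in cleaned]

# Processing: turn raw numbers into meaningful information
average = sum(numbers) / len(numbers)

# Output: present the result
print(f"Average score: {average:.1f}")

# Storage: persist the result so it can be reused later
with open("result.txt", "w") as f:
    f.write(str(average))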

Data types and their representation

A data type is an attribute of data that tells the compiler or interpreter how the programmer intends to use the data.

In other words, a data type defines what kind of value a piece of data holds, like a label that tells the computer how to interpret and work with that data.

All programming languages explicitly include the notion of data type.

In computer science and programming, common data types include integers, Booleans, characters, floating-point numbers, and alphanumeric strings.
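For illustration, here is how those types look in Python, where type() shows how the interpreter will treat each value:

age = 25             # integer
is_active = True     # Boolean
grade = "A"          # character (in Python, a one-character string)
price = 19.99        # floating-point number
username = "ali42"   # alphanumeric string

for value in (age, is_active, grade, price, username):
    print(value, type(value))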
For the analysis of data, on the other hand, it is important to understand that there are three common data types or structures:

Structured data: considered the most traditional form of data storage; it depends on the existence of a data model.

Organized in tables or databases

Easy to search and analyze

Examples: Excel spreadsheets, SQL databases
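A small pandas sketch of why structured data is easy to search and analyze (assumes pandas is installed; the table contents are made up):

import pandas as pd

# Structured data: every row follows the same column layout (the data model)
df = pd.DataFrame({
    "name": ["Ali", "Sara", "Omar"],
    "age": [25, 30, 22],
    "city": ["Cairo", "Amman", "Tunis"],
})

# Because the structure is fixed, querying is straightforward
print(df[df["age"] > 24])   # filter rows
print(df["age"].mean())     # aggregate a column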

Unstructured Data: does not have a predefined data model

The ability to analyze unstructured data is especially relevant in the context of big data, since a large part of an organization's data is unstructured.

The ability to extract value from unstructured data is one of the main drivers behind the quick
growth of big data.

 Not organized in a fixed format

 Examples: Emails, videos, images, social media posts

 Harder to analyze but very rich in information

Semi-structured data: does not fit a rigid tabular data model, but uses tags or markers to separate and label elements. JSON and XML are the two most common formats.

JSON (JavaScript Object Notation)

 A lightweight and easy-to-read format for storing and sharing data.

 Looks like Python dictionaries or JavaScript objects.

 Widely used in web APIs, apps, and data exchange.

Example:

{

"name": "Ali",

"age": 25,

"skills": ["JavaScript", "React"]

}
✅ Human-readable
✅ Easy to parse in code
✅ Used a lot in modern web and app development
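For example, Python's built-in json module turns the JSON above into an ordinary dictionary:

import json

text = '{"name": "Ali", "age": 25, "skills": ["JavaScript", "React"]}'

person = json.loads(text)    # parse JSON text into a Python dict
print(person["name"])        # -> Ali
print(person["skills"][0])   # -> JavaScript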

🔸 XML (eXtensible Markup Language)

 A markup language that uses tags to define data.

 More structured and wordy than JSON.

 Used in older systems, documents, and some APIs.

Example:

<person>
  <name>Ali</name>
  <age>25</age>
  <skills>
    <skill>JavaScript</skill>
    <skill>React</skill>
  </skills>
</person>

✅ Very structured
✅ Good for complex data
✅ Still used in enterprise systems and legacy software
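The same data can be pulled out of the XML above with Python's built-in xml.etree.ElementTree:

import xml.etree.ElementTree as ET

text = """<person>
  <name>Ali</name>
  <age>25</age>
  <skills><skill>JavaScript</skill><skill>React</skill></skills>
</person>"""

person = ET.fromstring(text)          # parse into an element tree
print(person.find("name").text)       # -> Ali
for skill in person.find("skills"):   # iterate over the <skill> tags
    print(skill.text)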

Data value chain

📥 Data Acquisition means collecting or obtaining data from various sources so it can be used
for analysis, processing, or storage.
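A hedged Python sketch of acquisition from a web API (the URL is hypothetical, and the requests library must be installed):

import requests

# Acquire data from a (hypothetical) API endpoint
response = requests.get("https://api.example.com/sales")
response.raise_for_status()   # fail loudly if the request did not succeed

records = response.json()     # parse the JSON payload
print(len(records), "records acquired")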

Data Analysis is the process of examining, organizing, and interpreting data to discover useful
information, patterns, trends, or insights.

It helps us answer questions like:

 "What happened?"

 "Why did it happen?"


 "What will happen next

Data curation

It is the active management of data over its life cycle to ensure it meets necessary data quality
requirements for its effective usage.

Data storage: the persistence and management of data in a scalable way that satisfies the needs of applications requiring fast access to it.

Data usage: covers the data-driven business activities that need access to the data and its analysis.

Big Data refers to extremely large and complex datasets that are difficult to manage, process,
or analyze using traditional tools (like Excel or small databases).

Think data from millions of users, real-time sensors, or social media platforms — way too
much to handle with just your laptop!

The 5 Vs of Big Data (core pillars) are the characteristics that make big data different from other data processing:

 Volume: the sheer scale of the data

 Velocity: the speed at which data is generated and must be processed

 Variety: the many forms the data takes (structured, semi-structured, unstructured)

 Veracity: the trustworthiness and quality of the data

 Value: the useful insight that can be extracted from the data

The Big Data Life Cycle describes the end-to-end journey of data — from the moment it's
generated to the moment it's used for decision-making.
The general categories of activities involved in big data processing are:

1. Data Ingestion

 Bringing data into your system or platform

 Can be done in real-time (streaming) or in batches

 Tools: Apache Kafka, Flume, Sqoop
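As a taste of streaming ingestion, a minimal sketch with the kafka-python package (assumes a Kafka broker on localhost:9092; the topic name and event are made up):

from kafka import KafkaProducer

# Connect to a Kafka broker (assumed to be running locally)
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Stream one click event into the (hypothetical) "clicks" topic
producer.send("clicks", b'{"user": "ali", "page": "/home"}')
producer.flush()   # block until the message is actually delivered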

2. Data Storage

 Data is stored in systems that can handle large volume and variety

 Choices depend on data type:

o HDFS (Hadoop Distributed File System)

o NoSQL databases (like MongoDB, Cassandra)

o Data lakes / cloud storage (AWS S3, Azure Blob)

📍 Example: Saving years of customer click data in Amazon S3
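A hedged boto3 sketch of that S3 example (the bucket and file names are hypothetical, and AWS credentials are assumed to be configured):

import boto3

s3 = boto3.client("s3")

# Store a local file of click data in an S3 bucket (our "data lake")
s3.upload_file("clicks_2024.csv", "my-company-datalake", "raw/clicks_2024.csv")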

3. Data Analysis

 Discover patterns, trends, and insights using:

o Statistical methods

o Machine learning models

o Visualizations

 Tools: Python, R, Power BI, Tableau

📍 Example: Analyzing customer buying trends
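A small pandas sketch of that buying-trends example (the purchase records are made up):

import pandas as pd

purchases = pd.DataFrame({
    "customer": ["Ali", "Sara", "Ali", "Omar", "Sara"],
    "amount": [120.0, 80.0, 60.0, 200.0, 40.0],
})

# A simple trend: total spending per customer, biggest spender first
trend = purchases.groupby("customer")["amount"].sum().sort_values(ascending=False)
print(trend)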

4. Data Visualization

 Turning insights into graphs, charts, dashboards

 Helps decision-makers understand what's happening

📍 Example: A dashboard showing sales performance by region
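A matplotlib sketch of that sales-by-region chart (the figures are made up, and matplotlib must be installed):

import matplotlib.pyplot as plt

regions = ["North", "South", "East", "West"]
sales = [120, 95, 140, 80]

plt.bar(regions, sales)                    # one bar per region
plt.title("Sales performance by region")
plt.ylabel("Sales (units)")
plt.show()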

Clustered computing

Clustered computing is when multiple computers (called nodes) work together as a single,
unified system to perform tasks — especially when one computer alone isn’t powerful or fast
enough.
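A hedged PySpark sketch of the idea (assumes the pyspark package is installed; master("local[4]") only simulates four workers on one machine, whereas a real cluster would point at a cluster manager):

from pyspark.sql import SparkSession

# Start a Spark session; on a real cluster, .master() would name the cluster manager
spark = SparkSession.builder.master("local[4]").appName("cluster-demo").getOrCreate()

# The work of summing ten million numbers is split into slices across the workers
rdd = spark.sparkContext.parallelize(range(10_000_000), numSlices=4)
print(rdd.sum())

spark.stop()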
1. 🧱 HDFS (Hadoop Distributed File System)

HDFS is primarily a storage system designed for Big Data, and while not a traditional database
management system, it is used in conjunction with data processing frameworks like Hadoop,
Spark, and Hive.
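A hedged sketch of talking to HDFS from Python with the third-party hdfs package over WebHDFS (the NameNode address, user, and paths are hypothetical):

from hdfs import InsecureClient

# Connect to the (hypothetical) NameNode's WebHDFS endpoint
client = InsecureClient("http://namenode:9870", user="hadoop")

# Write a file; HDFS replicates its blocks across DataNodes (3 copies by default)
client.write("/data/events.log", data=b"user=ali action=click\n")

print(client.list("/data"))   # list the files under /data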

✅ Pros of HDFS:

 ⚖️ Scalability: Easily scales horizontally by adding more servers (nodes). Can handle petabytes of data.

 Fault Tolerance: Data is replicated (default is 3 copies), ensuring no data is lost even if a node fails.

 🚀 High Throughput: Optimized for reading/writing large amounts of data, especially in batch processing.

 💸 Cost-Effective: Can run on commodity hardware, making it relatively inexpensive.

 🔗 Works with Hadoop Ecosystem: Seamlessly integrates with big data processing tools like MapReduce, Hive, Pig, Spark.

 📈 Big Data Storage: Designed to store and manage large volumes of unstructured or semi-structured data.

❌ Cons of HDFS:

 🐢 Not Ideal for Small Files: Struggles with a large number of small files, as each file creates overhead on the NameNode.

 🔄 No Native Query Support: Does not support SQL or querying natively. External tools (like Hive or Spark SQL) are needed for querying.

 🧱 Batch Processing Only: Best suited for batch processing; not ideal for real-time or interactive queries.

 ⚙️ Complex to Set Up: Requires specialized knowledge to set up and maintain a Hadoop cluster.

 🧠 Single Point of Failure (NameNode): If the NameNode fails (without a high-availability setup), the whole system may stop functioning.

2. Relational Database Management Systems (RDBMS)

These are traditional databases that store data in structured tables with rows and columns.
They support SQL for querying data.

✅ Pros of RDBMS (e.g., MySQL, PostgreSQL, Oracle DB):

 📊 Structured Data: Excellent for structured data where relationships are well-defined (e.g., bank transactions, inventory).

 🔍 Advanced Querying (SQL): Supports complex queries using SQL (e.g., joins, aggregations, filtering).

 Data Integrity (ACID): Guarantees data consistency, reliability, and transaction management (ACID properties).

 💻 Mature Ecosystem: A well-established technology with extensive tools, libraries, and community support.

 🔒 Security: Built-in security features (e.g., access control, encryption).

❌ Cons of RDBMS:

 🧱 Rigid Schema: Requires a predefined schema; inflexible when dealing with unstructured data.

 ⚙️ Scaling: Vertical scaling (adding more power to a single server) is expensive; horizontal scaling (distributing across multiple servers) is challenging.

 🐢 Not Ideal for Big Data: Performance may degrade as data grows into the terabyte or petabyte range.

 🔄 Write Performance: Relational databases can be slower for write-heavy workloads (e.g., logs, real-time streaming).
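To make the relational style concrete, a self-contained sqlite3 sketch (SQLite ships with Python; production systems would use MySQL, PostgreSQL, or Oracle DB, and the data here is made up):

import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()

# A predefined schema: the rigid but reliable side of RDBMS
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
cur.execute("INSERT INTO customers VALUES (1, 'Ali'), (2, 'Sara')")
cur.execute("INSERT INTO orders VALUES (10, 1, 120.0), (11, 1, 60.0), (12, 2, 80.0)")
conn.commit()   # ACID: the changes become durable only when committed

# Advanced querying: a join plus an aggregation
cur.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
""")
print(cur.fetchall())   # -> [('Ali', 180.0), ('Sara', 80.0)]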

3. NoSQL Databases (Distributed DBMS)

NoSQL databases (like MongoDB, Cassandra, and Couchbase) are designed for high scalability,
flexibility, and can handle unstructured or semi-structured data. They are often used in
distributed systems where horizontal scaling is important.
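A hedged pymongo sketch of that flexibility (assumes a MongoDB server on localhost:27017 and the pymongo package; the database and collection names are made up):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["demo"]

# Flexible schema: documents in one collection need not share the same fields
db.people.insert_one({"name": "Ali", "age": 25, "skills": ["JavaScript", "React"]})
db.people.insert_one({"name": "Sara", "city": "Amman"})   # different fields, no error

# Query by field without any predefined schema
for person in db.people.find({"age": {"$gt": 20}}):
    print(person["name"])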

✅ Pros of NoSQL DBMS:

 ⚖️ Horizontal Scalability: Easily scales out across multiple servers and regions, making them suitable for big data.

 🔄 Flexible Schema: Allows for unstructured or semi-structured data and flexible schemas (great for rapidly changing data).

 🚀 High Performance: Great for write-heavy workloads (e.g., logs, IoT devices) and handling large volumes of data.

 🧠 Eventual Consistency: Optimized for availability and partition tolerance, often adopting an eventual consistency model (e.g., Cassandra).

 📈 Specialized Data Models: Supports various data models such as document (MongoDB), key-value (Redis), graph (Neo4j), and column-family (Cassandra).

❌ Cons of NoSQL DBMS:

 🧱 Eventual Consistency: May not provide strong consistency (ACID) like traditional RDBMS, leading to potential data anomalies.

 Complex Data Modeling: Designing data models can be more challenging, especially for developers used to relational databases.

 🔍 Limited Querying: While NoSQL databases support basic queries, they lack full support for SQL-style joins and complex queries (though tools like the MongoDB Aggregation Framework are improving this).

 🧑‍💻 Young Technology: Some NoSQL DBMSs are still evolving, and their ecosystems may not be as mature as those of relational databases.

4. Key Differences Between HDFS and Other DBMSs

Here’s a comparison between HDFS and RDBMS/NoSQL DBMS:

 Data Type
o HDFS: Unstructured, semi-structured
o Relational DBMS: Structured (tables, rows, columns)
o NoSQL DBMS: Structured, semi-structured, unstructured

 Scalability
o HDFS: Horizontal (distributed across nodes)
o Relational DBMS: Vertical (single server)
o NoSQL DBMS: Horizontal (distributed systems)

 Querying
o HDFS: No native querying (external tools like Hive or Spark)
o Relational DBMS: SQL (Structured Query Language)
o NoSQL DBMS: NoSQL query languages (varies)

 Consistency
o HDFS: High fault tolerance (replication)
o Relational DBMS: ACID transactions (strong consistency)
o NoSQL DBMS: Eventual consistency or strong consistency (varies)

 Data Model
o HDFS: Files and blocks
o Relational DBMS: Tables with schema
o NoSQL DBMS: Key-value, document, column-family, graph

 Best For
o HDFS: Big data storage and batch processing
o Relational DBMS: Transactional data
o NoSQL DBMS: High-volume, write-heavy workloads

 Use Case
o HDFS: Large-scale data storage and processing
o Relational DBMS: OLTP systems (e.g., financial, ERP)
o NoSQL DBMS: Real-time applications, distributed systems

🧠 Summary
 HDFS: Ideal for storing massive datasets (often unstructured) in a distributed system for
batch processing. It’s not a DBMS in the traditional sense but works as the storage layer
for big data ecosystems.

 Relational DBMS: Best for managing structured data with ACID compliance. Excellent
for transactional systems but struggles with scaling to large datasets (TBs and beyond).

 NoSQL DBMS: Best for high scalability, handling unstructured or semi-structured data,
and high write throughput. Offers flexibility, but lacks the full feature set of relational
databases (like complex joins).
