Syllabus Solving
o Social networking sites: Facebook, Google, LinkedIn and similar sites generate huge amounts of
data on a day-to-day basis as they have billions of users worldwide.
o E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge amounts of logs from
which users' buying trends can be traced.
o Weather stations: Weather stations and satellites give very large amounts of data which are stored
and processed to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and
accordingly publish their plans, and for this they store the data of their millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data through their
daily transactions.
o Able to handle and process large and complex data sets that cannot be easily managed with
traditional database systems.
o Provides a platform for advanced analytics and machine learning applications.
o Requires specialized skills and expertise in data engineering, data management, and big data
tools and technologies.
o Can be expensive to implement and maintain due to the need for specialized infrastructure
and software.
2. WHAT IS THE PROBLEM OF THE 3 V's
The 3 V's (volume, velocity and variety) are three defining properties or dimensions of big data.
Volume refers to the amount of data, velocity refers to the speed of data processing, and variety refers
to the number of types of data.
In data science, the term "3V" is often used to refer to the three primary challenges
associated with big data. These are:
1. Volume: Volume refers to the sheer amount of data that is generated and collected
in today's digital world. With the proliferation of digital devices, sensors, social
media, and more, data is being generated at an unprecedented rate. Managing and
analyzing large volumes of data can be a significant challenge for data scientists.
2. Velocity: Velocity refers to the speed at which data is generated and how quickly it
must be processed and analyzed. Some data streams in real-time, such as social
media posts and sensor data. Data scientists need to develop systems that can
handle this high velocity of data and provide insights in a timely manner.
3. Variety: Variety relates to the diverse types of data that are generated. Data can
come in structured formats (like databases), semi-structured formats (like XML or
JSON), or unstructured formats (like text, images, and videos). Data scientists must be
able to work with and integrate data from various sources and in different formats.
The Gartner Hype Cycle is a graphic representation of the maturity lifecycle of new technologies and
innovations divided into five phases: Innovation Trigger, Peak of Inflated Expectations, Trough of
Disillusionment, Slope of Enlightenment, and Plateau of Productivity.
1. Innovation Trigger. A breakthrough, public demonstration, product launch or other event sparks
media and industry interest in a technology or other type of innovation.
2. Peak of Inflated Expectations. The excitement and expectations for the innovation exceed the reality
of its current capabilities. In some cases, a financial bubble may form around the innovation.
3. Trough of Disillusionment. The original overexcitement about the innovation dissipates, and
disillusionment sets in due to performance issues, slower-than-expected adoption or a failure to
deliver timely financial returns.
4. Slope of Enlightenment. Some early adopters overcome the initial hurdles and begin to see the
benefits of the innovation. By learning from the experiences of early adopters, organizations gain a
better understanding of where and how the innovation will deliver significant value (and where it will
not).
5. Plateau of Productivity. The innovation has demonstrated real-world productivity and benefits, and
more organizations feel comfortable with the greatly reduced level of risk. A sharp uptick in adoption
begins until the innovation becomes mainstream.
1. Innovation Trigger: This is the initial stage when a new technology or idea is
introduced. It may have the potential for significant impact, but it's often
experimental and unproven at this point.
2. Peak of Inflated Expectations: In this stage, there is a rapid increase in hype and
expectations surrounding the technology. There's often a lot of enthusiasm, media
attention, and sometimes unrealistic expectations about what the technology can
achieve.
3. Trough of Disillusionment: After reaching its peak, the technology often falls into a
"trough" of disappointment. This stage is characterized by disillusionment as early
adopters and others begin to realize the technology's limitations or challenges. Some
technologies may fail and fade away during this stage.
4. Slope of Enlightenment: As the initial hype wears off, a more realistic and practical
understanding of the technology emerges. This stage may involve ongoing research
and development to address the challenges identified during the Trough of
Disillusionment.
5. Plateau of Productivity: Finally, the technology reaches a point of maturity and
widespread adoption. It becomes a standard tool or practice in its respective domain,
and its benefits are widely recognized and realized.
Link: https://www.geeksforgeeks.org/big-data-analytics-life-cycle/
1. Structured data
2. Semi-structured data
3. Unstructured data
Structured data is generally stored in tables in the form of rows and columns. Structured data in these
tables can form relations with other tables. Humans and machines can easily retrieve information
from structured data. This data is meaningful and is used to develop data models.
Structured data refers to highly organized and formatted information that fits neatly into predefined
tables or relational databases. This type of data has a clear schema with fixed fields and data types.
The main disadvantages of keeping data in a structured manner are that it is schema dependent (the
structure must be defined in advance, which limits flexibility) and that it is difficult to scale.
Example: financial data and transaction records stored in relational database tables.
Unprocessed and unorganized data is known as unstructured data. This type of data does not follow a
predefined data model and is not directly used to develop data models. Unstructured data may be text,
images, audio, videos, reviews, satellite images, etc. Almost 80% of the data in this world is in the
form of unstructured data.
Advantages of unstructured data:
o Data is flexible.
o It is very scalable.
o This data can be used for a wide range of purposes as it is in its original form.
Semi-structured data is organized only up to some extent; the rest is unstructured. Hence, the level
of organization is less than that of structured data and higher than that of unstructured data.
Link: https://www.tutorialspoint.com/difference-between-structured-semi-structured-and-
unstructured-data
7. DIFFERENTIATE STRUCTURED,
UNSTRUCTURED AND SEMI-STRUCTURED
DATA
Structured Data vs Unstructured Data vs Semi-structured Data:
o Flexibility and scalability: Structured data is less flexible and difficult to scale (schema
dependent). Unstructured data is flexible and scalable (schema independent). Semi-structured
data is more flexible and simpler to scale than structured data but less so than unstructured data.
o Versioning: For structured data, versioning is possible over tuples, rows and tables. For
unstructured data, versioning is only possible over the data as a whole. For semi-structured data,
versioning over tuples is possible.
o Analysis: Structured data is easy to analyse. Unstructured data is difficult to analyse.
Semi-structured data is more difficult to analyse than structured data but easier than
unstructured data.
o Examples: Financial data and bar codes are examples of structured data. Media logs, videos and
audio are examples of unstructured data. Tweets organised by hashtags and folders organised by
topics are examples of semi-structured data.
NoSQL is a type of database management system (DBMS) that is designed to handle and store large
volumes of unstructured and semi-structured data (for example, Google or Facebook, which collect
terabytes of data every day from their users).
NoSQL databases (aka "not only SQL") are non-tabular databases and store data differently than
relational tables. NoSQL databases come in a variety of types based on their data model. The main
types are document, key-value, wide-column, and graph. They provide flexible schemas and scale
easily with large amounts of data and high user loads.
1. Document databases: These databases store data as semi-structured documents , such as JSON
or XML, and can be queried using document-oriented query languages.
• A collection of documents
• Data in this model is stored inside documents.
• A document is a key value collection where the key allows access to its value.
• Documents are not typically forced to have a schema and therefore are flexible and easy to
change.
• Documents are stored into collections in order to group different kinds of data.
• Documents can contain many different key-value pairs, or key-array pairs, or even nested
documents.
2. Key-value stores: These databases store data as key-value pairs , and are optimized for simple
and fast read/write operations.
3. Column-family stores: These databases store data as column families, which are sets of
columns that are treated as a single entity . They are optimized for fast and efficient querying of
large amounts of data.
4. Graph databases: These databases store data as nodes and edges, and are designed to handle
complex relationships between data.
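As a rough, language-level illustration of the document model described in point 1 above, the sketch below builds a schema-free "document" as nested Java maps; a real document store such as MongoDB would persist an equivalent structure as JSON/BSON. All field names and values here are made up for illustration.
CODE: import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DocumentModelSketch {
    public static void main(String[] args) {
        // A document is a key-value collection; values may be scalars, arrays, or nested documents.
        Map<String, Object> address = new HashMap<>();
        address.put("city", "Pune");
        address.put("zip", "411001");

        Map<String, Object> user = new HashMap<>();
        user.put("_id", "u1001");                            // key that identifies the document
        user.put("name", "Asha");
        user.put("interests", List.of("cricket", "music"));  // a key-array pair
        user.put("address", address);                        // a nested document
        // Another document in the same collection can carry a different set of fields (flexible schema).

        System.out.println(user);
    }
}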
NoSQL Advantages
High scalability
Distributed Computing
Lower cost
No complicated Relationships
NoSQL Disadvantages
No standardization
SQL vs NoSQL:
o SQL databases are not suited for hierarchical data storage, whereas NoSQL databases are best
suited for hierarchical data storage.
o SQL gives only read scalability, whereas NoSQL gives both read and write scalability.
o In SQL, data arrives from one or a few locations, whereas in NoSQL, data arrives from many
locations.
o In SQL it is difficult to make changes to the database once it is defined, whereas NoSQL enables
easy and frequent changes to the database.
NoSQL vs RDBMS:
o Definition: NoSQL databases are non-relational databases, often also known as distributed
databases. RDBMS stands for Relational Database Management System and is the most common
name for SQL databases.
o Query language: NoSQL has no declarative query language, whereas an RDBMS uses SQL
(Structured Query Language).
Advantages of NoSQL
1. Flexible Data Structures – NoSQL databases allow for more flexible data structures than
traditional relational databases. This means that data can be stored in a variety of formats,
which is particularly useful when dealing with unstructured or semi-structured data, such as
social media posts or log files.
2. Scalability – NoSQL databases are highly scalable, which means they can easily handle large
amounts of data and traffic. This is achieved through a distributed architecture that allows
data to be spread across multiple servers, making it easy to add more servers as needed.
3. High Performance – NoSQL databases are designed for high performance, meaning that
they can process large amounts of data quickly and efficiently. This is especially important
for applications that require real-time data processing, such as financial trading platforms or
social media analytics tools.
4. Availability – NoSQL databases are highly available, which means that they are designed to
be up and running at all times. This is achieved through a distributed architecture that allows
data to be replicated across multiple servers, ensuring that the system is always available,
even if one or more servers fail.
5. Cost-Effective – NoSQL databases can be cost-effective, especially for large-scale
applications. This is because they are designed to be run on commodity hardware, rather than
expensive proprietary hardware, which can save companies a significant amount of money.
Disadvantages of NoSQL
1. Limited Query Capability – NoSQL databases offer limited query capability when
compared to traditional relational databases. This is because NoSQL databases are designed
to handle unstructured data, which can make it difficult to perform complex queries or
retrieve data in a structured manner.
2. Data Consistency – NoSQL databases often sacrifice data consistency for scalability and
performance. This means that there may be some lag time between when data is written to the
database and when it is available for retrieval. Additionally, because NoSQL databases often
use distributed architectures, there may be instances where different nodes of the database
contain different versions of the same data.
3. Lack of Standardization – NoSQL databases lack standardization, meaning that different
NoSQL databases can have vastly different structures and query languages. This can make it
difficult for developers to learn and work with different NoSQL databases.
4. Limited Tooling – Because NoSQL databases are a relatively new technology, there is
limited tooling available for them when compared to traditional relational databases. This can
make it more difficult for developers to work with NoSQL databases and to debug issues
when they arise.
5. Limited ACID Compliance – NoSQL databases often sacrifice ACID compliance for
scalability and performance. ACID compliance refers to a set of properties that guarantee that
database transactions are processed reliably. Because NoSQL databases often use distributed
architectures and eventual consistency models, they may not always be fully ACID
compliant.
The CAP theorem applies a similar type of logic to distributed systems—namely, that a distributed
system can deliver only two of three desired characteristics: consistency, availability, and partition
tolerance (the ‘C,’ ‘A’ and ‘P’ in CAP).
Consistency
Consistency means that all clients see the same data at the same time, no matter which node they
connect to. For this to happen, whenever data is written to one node, it must be instantly forwarded or
replicated to all the other nodes in the system before the write is deemed ‘successful.’
Availability
Availability means that any client making a request for data gets a response, even if one or more
nodes are down. Another way to state this—all working nodes in the distributed system return a valid
response for any request, without exception.
Partition tolerance
A partition is a communication break within a distributed system—a lost or temporarily delayed
connection between two nodes. Partition tolerance means that the cluster must continue to work
despite any number of communication breakdowns between nodes in the system.
Link: https://towardsdatascience.com/cap-theorem-and-distributed-database-management-systems-5c2be977950e
UNIT 2 and 4:
1. WHAT IS HADOOP
Hadoop is an open source framework based on Java that manages the storage and processing of large
amounts of data for applications.
Hadoop is an open source framework from Apache and is used to store, process and analyze data
which is very huge in volume. Hadoop is written in Java and is not OLAP (online analytical
processing). It is used for batch/offline processing. It is being used by Facebook, Yahoo, Google,
Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes to the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its paper GFS and on the basis of
that HDFS was developed. It states that the files will be broken into blocks and stored in
nodes over the distributed architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and for managing the cluster.
3. Map Reduce: This is a framework which helps Java programs to do parallel computation
on data using key-value pairs. The Map task takes input data and converts it into a data set
which can be computed as key-value pairs. The output of the Map task is consumed by the Reduce
task, and the output of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other
Hadoop modules.
Advantages of Hadoop
o Fast: In HDFS the data distributed over the cluster and are mapped which helps in faster
retrieval. Even the tools to process the data are often on the same servers, thus reducing the
processing time. It is able to process terabytes of data in minutes and Peta bytes in hours.
o Scalable: Hadoop cluster can be extended by just adding nodes in the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really
cost-effective as compared to a traditional relational database management system.
o Resilient to failure: HDFS has the property with which it can replicate data over the
network, so if one node is down or some other network failure happens, then Hadoop takes
the other copy of the data and uses it. Normally, data is replicated three times, but the replication
factor is configurable.
o Ability to store a large amount of data.
o High flexibility.
o Cost effective.
o High computational power.
o Tasks are independent.
o Linear scaling.
Disadvantages:
Link: https://geeksforgeeks.org/hadoop-ecosystem/
Hadoop 1 vs Hadoop 2:
o Scalability: As Hadoop 1 is prior to Hadoop 2, it is comparatively less scalable; in terms of scaling
of nodes it is limited to about 4000 nodes per cluster. Hadoop 2 has better scalability and is
scalable up to 10000 nodes per cluster.
o Implementation: Hadoop 1 follows the concept of slots, which can be used to run a Map task or a
Reduce task only. Hadoop 2 follows the concept of containers, which can be used to run generic
tasks.
1. Components: Components are self-contained, independent units of software that perform specific
functions or provide certain features. Each component should have a well-defined interface, clearly
specifying how it can be used by other components or parts of the system.
2. Reusability: One of the primary goals of component-based architecture is reusability. Components
should be designed to be reusable in different parts of the system or in other projects. This reduces
redundancy and accelerates development.
3. Interchangeability: Components should be designed to be interchangeable. This means that you can
replace one component with another that provides the same interface and functionality without
affecting the overall system. Interchangeability promotes flexibility and scalability.
4. Encapsulation: Components encapsulate their internal details, making their inner workings hidden
from the rest of the system. This concept is based on the principle of information hiding, which
enhances security and simplifies maintenance.
5. Independence: Components should be as independent as possible, meaning that they shouldn't rely
heavily on other components. This reduces coupling between components, making the system more
robust and easier to maintain.
6. Communication: Components communicate with each other through well-defined interfaces.
Interactions between components are typically achieved using protocols, such as API calls, message
passing, or remote procedure calls.
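A minimal Java sketch of points 1-6 above, showing two interchangeable components behind one well-defined interface; all class and method names here are hypothetical and chosen only for illustration.
CODE: // The interface is the component's published contract; internals stay encapsulated.
interface MessageSender {
    void send(String recipient, String text);
}

// One concrete, self-contained component...
class EmailSender implements MessageSender {
    public void send(String recipient, String text) {
        System.out.println("email to " + recipient + ": " + text);
    }
}

// ...and an interchangeable alternative offering the same interface.
class SmsSender implements MessageSender {
    public void send(String recipient, String text) {
        System.out.println("sms to " + recipient + ": " + text);
    }
}

public class ComponentDemo {
    public static void main(String[] args) {
        // Swapping SmsSender for EmailSender does not affect the calling code.
        MessageSender sender = new SmsSender();
        sender.send("alice", "build finished");
    }
}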
1. Modularity: The system is divided into smaller, manageable pieces, making it easier to develop, test,
and maintain.
2. Reusability: Components can be reused in different projects, saving time and effort in software
development.
3. Scalability: Components can be added or replaced as the system evolves, allowing for easy expansion
and adaptation to changing requirements.
4. Parallel Development: Different teams or developers can work on individual components
simultaneously, speeding up the development process.
5. Easier Testing: Components can be tested in isolation, simplifying the testing process and ensuring
the quality of each component.
6. Flexibility: The interchangeability of components allows for system flexibility and the incorporation of
third-party components.
7. Improved Maintenance: With well-defined interfaces and encapsulation, it's easier to troubleshoot
and update individual components without affecting the entire system.
Object-Oriented Programming (OOP): In OOP, classes and objects are used as components,
encapsulating both data and behavior. Classes can be reused and extended in various contexts.
Web Development: In web development, components are commonly used for user interface
elements. For example, React and Angular are JavaScript libraries/frameworks that encourage
component-based development for building web applications.
Service-Oriented Architecture (SOA): SOA is an architectural style that promotes the creation of
services as components. These services can be distributed and used in different applications.
5. NOTE ON YARN
YARN stands for “Yet Another Resource Negotiator“.
Consider YARN as the brain of your Hadoop Ecosystem. It performs all your processing activities by
allocating resources and scheduling tasks.
It has two major components, i.e. ResourceManager and NodeManager.
YARN Features: YARN gained popularity because of the following features-
Components Of YARN
Hive provides the functionality of reading, writing, and managing large datasets residing in
distributed storage. It runs SQL-like queries called HQL (Hive Query Language), which get internally
converted into MapReduce jobs.
Architecture of Hive
It is based on Google's Bigtable. It has a set of tables which keep data in key-value format. HBase is
well suited for sparse data sets, which are very common in big data use cases. HBase provides APIs
enabling development in practically any programming language. It is a part of the Hadoop ecosystem
that provides random real-time read/write access to data in the Hadoop File System.
Why HBase
Features of HBase
Apache HBase is used to have random, real-time read/write access to Big Data.
It hosts very large tables on top of clusters of commodity hardware.
Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable works
on top of the Google File System, Apache HBase works on top of Hadoop and HDFS.
History: In October 2007, the first usable HBase was released along with Hadoop 0.15.0.
Link: https://www.tutorialspoint.com/hbase/hbase_overview.htm
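The HBase client API referred to above is Java-based. Below is a minimal put/get sketch; the table name, column family, and row key are illustrative, and an hbase-site.xml pointing at the cluster (plus the hbase-client library) is assumed on the classpath.
CODE: import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
            table.put(put);

            // Random real-time read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}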
9. NOTE ON KAFKA
What is Kafka?
Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle
a high volume of data and enables you to pass messages from one end-point to another. Kafka is
suitable for both offline and online message consumption. Kafka messages are persisted on the disk
and replicated within the cluster to prevent data loss. Kafka is built on top of the ZooKeeper
synchronization service. It integrates very well with Apache Storm and Spark for real-time streaming
data analysis.
Kafka is often used to build real-time data streams and applications. Combining
communications, storage, and stream processing enables the collection and analysis of
real-time and historical data
Benefits
Kafka is very fast and guarantees zero downtime and zero data loss.
Use Cases
Kafka can be used in many Use Cases. Some of them are listed below −
Metrics − Kafka is often used for operational monitoring data. This involves aggregating
statistics from distributed applications to produce centralized feeds of operational data.
Log Aggregation Solution − Kafka can be used across an organization to collect logs from
multiple services and make them available in a standard format to multiple consumers.
Stream Processing − Popular frameworks such as Storm and Spark Streaming read data from
a topic, process it, and write the processed data to a new topic where it becomes available for
users and applications. Kafka’s strong durability is also very useful in the context of stream
processing.
Kafka Architecture:
Producers: Producers are responsible for publishing data to Kafka topics. They can send data in the
form of records, and each record consists of a key, a value, and a topic. Producers distribute data to
Kafka brokers, typically in a round-robin fashion or based on custom partitioning logic.
Brokers: Kafka brokers are the servers that store and manage data. They are responsible for receiving
data from producers, storing it, and serving it to consumers. Brokers are distributed and can
communicate with each other for data replication.
Topics: Topics are logical channels for organizing and categorizing data in Kafka. Producers write data
to topics, and consumers subscribe to topics to read data. Each topic can have multiple partitions for
parallelism.
Partitions: Topics can be split into multiple partitions, allowing data to be distributed and processed
in parallel. Partitions are the unit of parallelism in Kafka and provide fault tolerance.
Consumers: Consumers subscribe to topics and read data from Kafka. They can process data in real-
time or batch mode. Kafka uses offsets to keep track of the last message consumed by each consumer,
so that consumption can resume from the right position.
Zookeeper: Kafka uses Apache ZooKeeper for distributed coordination and management of Kafka
brokers. ZooKeeper helps in leader election, broker discovery, and synchronization.
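A minimal producer sketch in Java using the official kafka-clients library; the broker address, topic name, key, and value are illustrative.
CODE: import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // broker(s) to connect to
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Each record carries a topic, an optional key (used for partitioning), and a value.
            producer.send(new ProducerRecord<>("page-views", "user-42", "clicked: /home"));
        } // close() flushes any buffered records before returning
    }
}
A consumer would subscribe to the same topic and poll for records, tracking its position with offsets as described above.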
1. Log Aggregation: Kafka is used to collect and aggregate logs from various sources, making it easier
to monitor and analyze application performance.
2. Real-Time Analytics: Kafka allows organizations to process and analyze data in real-time, making it
suitable for applications like fraud detection, recommendation engines, and monitoring systems.
3. Data Integration: Kafka Connectors enable data integration between various systems, such as
databases, messaging systems, and data lakes.
4. IoT and Sensor Data: Kafka can handle the high volume and velocity of data generated by IoT
devices and sensors, making it an ideal choice for IoT applications.
5. Event Sourcing: Kafka is used in event-driven architectures and event sourcing patterns, where events
are stored and processed for maintaining system state.
6. Clickstream Analysis: Kafka is employed for real-time processing of user activity data, making it
valuable for applications like e-commerce analytics.
7. Replication and Data Backup: Kafka's durability and fault tolerance features make it suitable for
replicating data across data centers for backup and disaster recovery .
Message passing in distributed systems refers to the communication mechanism used by nodes
(computers or processes) to exchange information and coordinate their actions. It involves
transferring messages between nodes to achieve various goals such as coordination,
synchronization, and data sharing.
Message passing is a flexible and scalable method for inter-node communication in distributed
systems. It enables nodes to exchange information, coordinate activities, and share data without
relying on shared memory or direct method invocations. Models like synchronous and asynchronous
message passing offer different synchronization and communication semantics to suit system
requirements. Synchronous message passing ensures sender and receiver synchronization, while
asynchronous message passing allows concurrent execution and non-blocking communication.
Types of Message Passing
1. Synchronous Message Passing
Concurrent programming often uses synchronous message passing, which allows processes or threads to
exchange messages in real time. To ensure coordination and predictable execution, the sender waits until
the receiver has received and processed the message before continuing. The most common way to
implement this strategy is through blocking method calls or procedure invocations, where a process or
thread blocks until the called system returns a result or completes its execution. This blocking behaviour
forces the caller to wait until the message has been processed. However, there are drawbacks to
synchronous message passing, such as system halts or delays if the receiver takes too long to process
the message or gets stuck. It is critical to implement synchronous message passing precisely in
concurrent systems to guarantee its proper operation.
3. Hybrid Message Passing
Hybrid message passing combines elements of both synchronous and asynchronous message passing. It
gives the sender the flexibility to choose whether to block and wait for a response or to continue
execution asynchronously. The choice between synchronous and asynchronous behaviour can be made
based on the specific requirements of the system or the nature of the communication. Hybrid message
passing allows for optimization and customization based on different scenarios, enabling a balance
between the synchronous and asynchronous paradigms.
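Between distributed nodes these ideas are usually realised over sockets or middleware such as Kafka; as a small in-process illustration of blocking (synchronous-style) versus non-blocking (asynchronous-style) sends, here is a Java sketch that uses a BlockingQueue as the message channel. All names and messages are illustrative.
CODE: import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class MessagePassingSketch {
    public static void main(String[] args) throws InterruptedException {
        // The queue plays the role of the channel between a sender and a receiver thread.
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(2);

        Thread receiver = new Thread(() -> {
            try {
                // take() blocks until a message arrives: the receiving side of synchronous-style passing.
                System.out.println("received: " + channel.take());
                System.out.println("received: " + channel.take());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        receiver.start();

        // put() blocks when the channel is full (synchronous-style send).
        channel.put("hello");
        // offer() never blocks; it simply reports whether the message was accepted (asynchronous-style send).
        boolean accepted = channel.offer("world");
        System.out.println("offer accepted: " + accepted);

        receiver.join();
    }
}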
Benefits of Zookeeper
Architecture of Zookeeper
Server: The server sends an acknowledgement when any client connects. If there is no
response from the connected server, the client automatically redirects the message to another server.
Client: A client is one of the nodes in the distributed application cluster. It helps you to access
information from the server. Every client sends a message to the server at regular intervals, which helps
the server to know that the client is alive.
Leader: One of the servers is designated the Leader. It gives all the information to the clients as well as
an acknowledgment that the server is alive. It performs automatic recovery if any of the
connected nodes fail.
o Data Ecosystem: Several applications that use Apache Kafka form an ecosystem. This
ecosystem is built for data processing. It takes inputs in the form of applications that create
data, and outputs are defined in the form of metrics, reports, etc. The diagram below
represents a circulatory data ecosystem for Kafka.
o Kafka Cluster: A Kafka cluster is a system that comprises different brokers, topics, and
their respective partitions. Data is written to a topic within the cluster and read by the
cluster itself.
o Producers: A producer sends or writes data/messages to the topic within the cluster. In order
to store a huge amount of data, different producers within an application send data to the
Kafka cluster.
o Consumers: A consumer is the one that reads or consumes messages from the Kafka cluster.
There can be several consumers consuming different types of data from the cluster. The
beauty of Kafka is that each consumer knows from where it needs to consume the data.
o Brokers: A Kafka server is known as a broker. A broker is a bridge between producers and
consumers. If a producer wishes to write data to the cluster, it is sent to the Kafka server. All
brokers lie within a Kafka cluster itself. Also, there can be multiple brokers.
o Topics: It is a common name or a heading given to represent a similar type of data. In Apache
Kafka, there can be multiple topics in a cluster. Each topic specifies different types of
messages.
o Partitions: The data or messages are divided into small subparts known as partitions. Each
partition carries data within it, and each record in a partition has an offset value. The data is
always written in a sequential manner. A topic can have many partitions, each with an
ever-growing sequence of offset values. However, it is not guaranteed to which partition a
message will be written.
o ZooKeeper: ZooKeeper is used to store information about the Kafka cluster and details of
the consumer clients. It manages brokers by maintaining a list of them. Also, ZooKeeper is
responsible for choosing a leader for the partitions. If any change occurs, such as a broker dying
or a new topic being created, ZooKeeper sends a notification to Apache Kafka. ZooKeeper is
designed to operate with an odd number of servers. ZooKeeper has a leader server that
handles all the writes, and the rest of the servers are followers who handle all the reads.
However, a user does not interact with ZooKeeper directly, but via brokers. No Kafka
server can run without a ZooKeeper server; it is mandatory to run the ZooKeeper server.
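To complement the component description above, here is a minimal consumer sketch using the official kafka-clients Java library; the broker address, consumer group id, and topic name are illustrative.
CODE: import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "analytics-group");   // consumers in one group share the topic's partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            // Poll a few times; each record carries the partition offset Kafka uses to track position.
            for (int i = 0; i < 10; i++) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}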
Advantages of Hive
Disadvantages of Hive
-KAFKA,
Advantages of Apache Kafka
1. User Friendly
More than one consumer may be waiting to handle messages. When there is a need to integrate with multiple
consumers, creating one integration is sufficient. Integration is made simple even for consumers with a variety
of languages and behaviors.
2. Reliability
As compared to other messaging services, Kafka is considered to be more reliable. In the event of a machine
failure, Kafka provides resistance through means of replicating data. Thus, the consumers are automatically
balanced.
3. Durability
Kafka ensures a durable messaging service by storing data as quickly as possible. The messages are persisted on
disk, which is one of the reasons data is not lost.
4. Latency
The latency offered by Kafka is very low, typically not more than 10 milliseconds, so messages received by a
consumer can be consumed almost instantly. Kafka can sustain very large volumes of messages because
producers and consumers are decoupled from each other.
5. Scalability
Apache Kafka is a scalable solution. It allows you to add additional nodes without facing any downtime.
Kafka also possesses transparent message handling capabilities and is able to process even terabytes of data
seamlessly.
7. Buffering Action
Apache Kafka comes with its own set of servers known as clusters. These clusters make sure that the system does
not crash when data transfer is happening in real time. Kafka acts as a buffer by taking data from source
systems and redirecting it to the target systems.
With the above advantages, there are following limitations/disadvantages of Apache Kafka:
1. Does not have a complete set of monitoring tools: Apache Kafka does not contain a complete
set of monitoring and managing tools. Thus, new startups or enterprises are wary of working
with Kafka.
2. Message tweaking issues: The Kafka broker uses system calls to deliver messages to the
consumer. In case, the message needs some tweaking, the performance of Kafka gets
significantly reduced. So, it works well if the message does not need to change.
3. Does not support wildcard topic selection: Apache Kafka does not support wildcard topic
selection. Instead, it matches only the exact topic name. This makes it incapable of addressing
certain use cases.
4. Reduces Performance: Brokers and consumers reduce the performance of Kafka by
compressing and decompressing the data flow. This not only affects its performance but also
affects its throughput.
5. Clumsy Behaviour: Apache Kafka often behaves a bit clumsily when the number of
queues increases in the Kafka cluster.
6. Lack some message paradigms: Certain message paradigms such as point-to-point queues,
request/reply, etc. are missing in Kafka for some use cases.
-HBASE,
Advantages of HBase –
Disadvantages of HBase –
2. No transaction support
-APACHE HADOOP,
Pros
1. Cost
Hadoop is open-source and uses cost-effective commodity hardware, which provides a cost-efficient
model, unlike traditional relational databases that require expensive hardware and high-end
processors to deal with Big Data. The problem with traditional relational databases is that storing
a massive volume of data is not cost-effective, so companies started to discard the raw data, which
may not give a correct picture of their business. Hadoop therefore provides two main benefits on the
cost side: it is open-source and free to use, and it uses commodity hardware, which is also
inexpensive.
2. Scalability
Hadoop is a highly scalable model. A large amount of data is divided across multiple inexpensive
machines in a cluster and processed in parallel. The number of these machines or nodes can be
increased or decreased as per the enterprise’s requirements. In a traditional RDBMS (Relational
Database Management System), the system cannot be scaled to approach large amounts of data.
3. Flexibility
Hadoop is designed in such a way that it can deal with any kind of dataset, such as structured (MySQL
data), semi-structured (XML, JSON) and unstructured (images and videos), very efficiently. This
means it can easily process any kind of data independent of its structure, which makes it highly
flexible. This is very useful for enterprises, as they can process large datasets easily, so businesses
can use Hadoop to analyze valuable insights from data sources like social media, email, etc. With this
flexibility, Hadoop can be used for log processing, data warehousing, fraud detection, etc.
4. Speed
Hadoop uses a distributed file system to manage its storage, i.e. HDFS (Hadoop Distributed File
System). In a DFS (Distributed File System), a large file is broken into small file blocks that are then
distributed among the nodes available in a Hadoop cluster. This massive number of file blocks is
processed in parallel, which makes Hadoop faster and gives it high performance compared to
traditional database management systems. When you are dealing with a large amount of unstructured
data, speed is an important factor; with Hadoop you can easily access terabytes of data in just a few
minutes.
5. Fault Tolerance
Hadoop uses commodity hardware (inexpensive systems) which can crash at any moment. In
Hadoop, data is replicated on various DataNodes in the cluster, which ensures the availability of data
even if one of your systems crashes. If a machine faces a technical issue, the same data can be read
from other nodes in the Hadoop cluster because the data is copied or replicated by default. Hadoop
makes three copies of each file block and stores them on different nodes.
6. High Throughput
Hadoop works on a distributed file system where various jobs are assigned to various DataNodes in a
cluster, and these parts of the data are processed in parallel across the Hadoop cluster, which produces
high throughput. Throughput is nothing but the work done per unit time.
7. Minimum Network Traffic
In Hadoop, each task is divided into various small sub-tasks, which are then assigned to the data nodes
available in the Hadoop cluster. Each data node processes a small amount of data, which leads to low
traffic in the Hadoop cluster.
Cons
-TEZ,
Apache Tez is a data processing framework that is often used as an alternative to the
classic MapReduce processing model in the Hadoop ecosystem. Tez is designed to
improve the performance and efficiency of data processing, making it a valuable tool
in data science and big data analytics. Here are some of the advantages of Apache
Tez in the context of data science:
Advantages of Pig
Disadvantages of Pig
Pig vs Hive:
o Pig is used for handling structured and semi-structured data, whereas Hive is used for handling
structured data.
o Pig supports the Avro file format, whereas Hive does not support the Avro file format.
Hive vs HBase:
o Apache Hive is a query engine, whereas HBase is a data store, particularly for unstructured data.
o Apache Hive is not ideally a database but a MapReduce-based SQL engine that runs atop Hadoop,
whereas HBase is a NoSQL database that is commonly used for real-time data streaming.
o Apache Hive is to be used for analytical queries, whereas HBase is to be used for real-time queries.
o Apache Hive has the limitation of higher latency, whereas HBase does not have any analytical
capabilities.
o Query Language: Hive uses HQL (Hive Query Language). HBase does not have a specialized
query language for CRUD (Create, Read, Update, and Delete) activities; it includes a Ruby-based
shell where you can use Get, Put, and Scan functions to edit your data.
o Level of Consistency: Hive provides eventual consistency, whereas HBase provides immediate
consistency.
Hive vs MongoDB:
o Server operating systems: Hive runs on any OS with a Java VM, whereas MongoDB runs on
Solaris, Linux, OS X and Windows.
o Replication method: Hive supports a selectable replication factor, whereas MongoDB supports
master-slave replication.
o Primary database model: Hive's primary database model is a relational DBMS, whereas
MongoDB's is a document store.
o APIs and access methods: Hive uses JDBC, ODBC and Thrift, whereas MongoDB uses a
proprietary protocol based on JSON.
HBase vs MongoDB:
o Developed by: HBase is developed by the Apache Software Foundation, whereas MongoDB is
developed by MongoDB Inc.
o Technical documentation: hbase.apache.org for HBase and docs.mongodb.com/manual for
MongoDB.
o Implementation language: HBase is written in Java, whereas MongoDB is written in C++.
o Secondary index: HBase has no secondary indexes, whereas MongoDB has secondary indexes.
HDFS vs HBase:
o HDFS provides only sequential access to data, whereas HBase internally uses hash tables and
provides random access, storing the data in indexed HDFS files for faster lookups.
o HDFS is based on a write-once, read-many-times file system, whereas HBase supports random
read and write operations.
o HDFS is preferable for offline batch processing, whereas HBase is preferable for real-time
processing.
HIVE:
Hive supports Data definition Language(DDL), Data Manipulation Language(DML) and User defined
functions.
Hive DDL Commands
create database
drop database
create table
drop table
alter table
create index
create views
Select
Where
Group By
Order By
Load Data
Join:
o Inner Join
o Left Outer Join
o Right Outer Join
o Full Outer Join
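The DDL and DML commands above are usually issued from the Hive shell or Beeline, but they can also be run from Java over JDBC through HiveServer2. Below is a minimal sketch; the connection URL (HiveServer2 assumed on localhost:10000), the employees table, and the query are illustrative, and the hive-jdbc driver is assumed to be on the classpath.
CODE: import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Not strictly required with JDBC 4 drivers, but harmless.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement()) {

            // DDL: create a table (HQL statements are translated into MapReduce/Tez jobs internally).
            stmt.execute("CREATE TABLE IF NOT EXISTS employees (id INT, name STRING, dept STRING)");

            // DML: a simple aggregation with GROUP BY.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
                }
            }
        }
    }
}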
Tez's architecture is centered around Directed Acyclic Graphs (DAGs) and provides a
flexible and efficient execution framework for complex data processing workflows.
The key components and concepts in Tez's architecture include:
1. DAG (Directed Acyclic Graph): DAGs are used to represent complex data
processing workflows. They consist of vertices and edges, where vertices represent
tasks and edges define the data flow between tasks. DAGs allow users to model and
execute multi-stage data processing jobs.
2. Vertices: In Tez, vertices are the units of computation within a DAG. They can be of
various types, including map vertices, reduce vertices, and custom vertices. Custom
vertices can encapsulate user-specific processing logic, allowing for more specialized
data processing.
3. Edges: Edges in a DAG connect vertices and define the data flow between them.
Edges specify the type of data movement, such as one-to-one, one-to-many, or
many-to-one, and also define data ordering and routing.
4. DAG ApplicationMaster (AM): The DAG ApplicationMaster is responsible for
managing and coordinating the execution of a DAG. It interacts with the
ResourceManager in the Hadoop YARN cluster to allocate resources, such as
containers for tasks. The AM monitors task progress, handles task failures, and
ensures the successful execution of the DAG.
5. Tez Sessions: A Tez session represents a runtime execution context for a DAG. It
includes a set of containers where tasks run and where data is managed. These
sessions facilitate efficient resource allocation and data movement.
6. Task Containers: Tez tasks run within containers allocated by YARN (Yet Another
Resource Negotiator). Containers provide a controlled execution environment with
access to CPU, memory, and data locality considerations. Containers manage task
execution, which includes reading input, processing data, and writing output.
7. Data Flow: Tez uses a data flow model to manage the movement of data between
tasks. The data flow is directed along edges, allowing for efficient data transfer and
data locality optimization.
8. Input/Output (I/O) Handlers: Tez provides Input and Output Handlers, which
enable reading data from and writing data to external storage systems. These
handlers support various data formats and data sources, making it easier to integrate
Tez with different data systems.
Architecture of Hive
This component diagram contains different units. The following table describes each unit:
o HDFS or HBASE: The Hadoop Distributed File System or HBase is the data storage technique
used to store the data into the file system.
In HBase, tables are split into regions and are served by the region servers. Regions are vertically
divided by column families into “Stores”. Stores are saved as files in HDFS. Shown below is the
architecture of HBase.
Note: The term ‘store’ is used for regions to explain the storage structure.
HBase has three major components: the client library, a master server, and region servers. Region
servers can be added or removed as per requirement.
MasterServer
Assigns regions to the region servers and takes the help of Apache ZooKeeper for this task.
Handles load balancing of the regions across region servers. It unloads the busy servers and
shifts the regions to less occupied servers.
Maintains the state of the cluster by negotiating the load balancing.
Is responsible for schema changes and other metadata operations such as creation of tables
and column families.
Regions
Regions are nothing but tables that are split up and spread across the region servers.
Region server
When we take a deeper look into the region server, it contains regions and stores as shown below:
The store contains the memstore and HFiles. The memstore is just like a cache memory: anything that
is entered into HBase is stored here initially. Later, the data is transferred and saved in HFiles as
blocks and the memstore is flushed.
Zookeeper
Spark Architecture
The Spark follows the master-slave architecture. Its cluster consists of a single master and multiple
slaves.
The Resilient Distributed Datasets are the group of data items that can be stored in-memory on worker
nodes. Here,
A Directed Acyclic Graph is a finite directed graph that performs a sequence of computations on data.
Each node is an RDD partition, and each edge is a transformation on top of the data. Here, the graph
refers to the navigation, whereas directed and acyclic refer to how it is done.
The Driver Program is a process that runs the main() function of the application and creates
the SparkContext object. The purpose of SparkContext is to coordinate the spark applications,
running as independent sets of processes on a cluster.
To run on a cluster, the SparkContext connects to one of several types of cluster managers and then
performs the following tasks:
o It acquires executors on nodes in the cluster.
o It sends the application code to the executors.
o Finally, it sends tasks to the executors to run.
Cluster Manager
o The role of the cluster manager is to allocate resources across applications. The Spark is
capable enough of running on a large number of clusters.
o It consists of various types of cluster managers such as Hadoop YARN, Apache Mesos and
Standalone Scheduler.
o Here, the Standalone Scheduler is a standalone spark cluster manager that facilitates to install
Spark on an empty set of machines.
Worker Node
o The worker node is a slave node.
o Its role is to run the application code in the cluster.
Executor
o An executor is a process launched for an application on a worker node.
o It runs tasks and keeps data in memory or disk storage across them.
o It reads and writes data to external sources.
o Every application contains its own executors.
Task
o A unit of work that will be sent to one executor.
Link: https://www.javatpoint.com/apache-spark-architecture
27. WRITE DOWN STEPS TO PERFORM
SINGLE NODE INSTALLATION OF HADOOP
Link: https://www.javatpoint.com/hadoop-installation
Link: https://www.tutorialspoint.com/how-to-install-and-configure-apache-hadoop-on-a-single-node-
in-centos-8
Installation Steps:
2. Configuration:
Navigate to the Hadoop configuration directory:
CODE: cd hadoop-x.y.z/etc/hadoop
Edit the hadoop-env.sh file to set the Java home. Set the value of JAVA_HOME to
the path of your Java installation:
Configure the Hadoop core-site.xml and hdfs-site.xml files. You can use the provided
template files ( core-site.xml and hdfs-site.xml) and customize them to specify
the Hadoop configuration settings. For example:
core-site.xml:
CODE: <configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml:
CODE: <configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
3. Format the NameNode (first run only):
CODE: hdfs namenode -format
4. Start Hadoop Services:
CODE: start-dfs.sh
start-yarn.sh
The start-dfs.sh script starts the HDFS daemons (NameNode and DataNode) and start-yarn.sh
starts the YARN daemons (ResourceManager and NodeManager).
5. Verify Installation:
Open a web browser and access the Hadoop NameNode web interface at
http://localhost:50070. You should see the HDFS web interface, which indicates that
Hadoop is up and running.
Run the example MapReduce job, for instance (adjust the jar version to your installation):
CODE: bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-x.y.z.jar wordcount /input /output
This command will count the words in the input file and store the result in the
output directory.
UNIT 4:
1. Introduction to Map-Reduce
MapReduce is a programming model for writing applications that can process Big Data in parallel on
multiple nodes. MapReduce provides analytical capabilities for analysing huge volumes of complex
data.
What is MapReduce?
MapReduce is a data processing tool which is used to process data in parallel in a distributed
form. It was developed in 2004, on the basis of the paper titled "MapReduce: Simplified Data
Processing on Large Clusters," published by Google.
The MapReduce is a paradigm which has two phases, the mapper phase, and the reducer phase. In the
Mapper, the input is given in the form of a key-value pair. The output of the Mapper is fed to the
reducer as input. The reducer runs only after the Mapper is over. The reducer too takes input in key-
value format, and the output of reducer is the final output.
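As a concrete illustration of the two phases, here is the canonical word-count job written against the Hadoop Java MapReduce API (essentially the standard WordCount example from the Hadoop documentation); the input and output paths are passed as command-line arguments.
CODE: import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: for every input line, emit a (word, 1) key-value pair per word.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: receives (word, [1, 1, ...]) and emits (word, total count).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}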
Why MapReduce?
Traditional enterprise systems normally have a centralized server to store and process data. This
traditional model is certainly not suitable for processing huge volumes of data, which cannot be
accommodated by standard database servers. Moreover, the centralized system creates too much of a
bottleneck while processing multiple files simultaneously.
Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a task
into small parts and assigns them to many computers. Later, the results are collected at one place and
integrated to form the result dataset.
Usage of MapReduce
o It can be used in various applications like document clustering, distributed sorting, and web
link-graph reversal.
o It can be used for distributed pattern-based searching.
o We can also use MapReduce in machine learning.
o It was used by Google to regenerate Google's index of the World Wide Web.
o It can be used in multiple computing environments such as multi-cluster, multi-core, and
mobile environments.
MapReduce Architecture:
Components of MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the Job to the MapReduce for processing.
There can be multiple clients available that continuously send jobs for processing to the Hadoop
MapReduce Manager.
2. Job: The MapReduce Job is the actual work that the client wanted to do which is comprised of
so many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The task or sub-jobs that are obtained after dividing the main job. The result of all
the job-parts combined to produce the final output.
5. Input Data: The data set that is fed to the MapReduce for processing.
6. Output Data: The final result is obtained after the processing.
UNIT 3:
1. HDFS Architecture
Hadoop File System was developed using distributed file system design. It is run on commodity
hardware. Unlike other distributed systems, HDFS is highly fault tolerant and designed using low-cost
hardware.
HDFS holds very large amount of data and provides easier access. To store such huge data, the files
are stored across multiple machines. These files are stored in redundant fashion to rescue the system
from possible data losses in case of failure. HDFS also makes applications available to parallel
processing.
Features of HDFS
It is suitable for the distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of name node and data node help users to easily check the status of
cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
HDFS Architecture
Given below is the architecture of a Hadoop File System.
HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and the
namenode software. It is a software that can be run on commodity hardware. The system having the
namenode acts as the master server and it does the following tasks −
It manages the file system namespace.
It regulates clients' access to files.
It also executes file system operations such as renaming, closing, and opening files and
directories.
Datanode
The datanode is a commodity hardware having the GNU/Linux operating system and datanode
software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These
nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according to
the instructions of the namenode.
Block
Generally, the user data is stored in the files of HDFS. The file in a file system will be divided into
one or more segments and/or stored in individual data nodes. These file segments are called blocks.
In other words, the minimum amount of data that HDFS can read or write is called a block. The
default block size is 64 MB (128 MB in Hadoop 2 and later), but it can be increased as per the need by
changing the HDFS configuration.
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity hardware,
failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic
fault detection and recovery.
Huge datasets − HDFS should have hundreds of nodes per cluster to manage the applications having
huge datasets.
Hardware at data − A requested task can be done efficiently, when the computation takes place near
the data. Especially where huge datasets are involved, it reduces the network traffic and increases the
throughput.
3. HDFS Commands
Commonly used HDFS shell commands:
o -ls <path> : Lists the contents of the directory specified by path, showing the names, permissions,
owner, size and modification date for each entry.
o -lsr <path> : Behaves like -ls, but recursively displays entries in all subdirectories of path.
o -du <path> : Shows disk usage, in bytes, for all the files which match path; filenames are reported
with the full HDFS protocol prefix.
o -dus <path> : Like -du, but prints a summary of disk usage of all files/directories in the path.
o -mv <src> <dest> : Moves the file or directory indicated by src to dest, within HDFS.
o -rm <path> : Removes the file or empty directory identified by path.
o -rmr <path> : Removes the file or directory identified by path. Recursively deletes any child
entries (i.e., files or subdirectories of path).
o -cat <filename> : Displays the contents of filename on stdout.
o -mkdir <path> : Creates a directory named path in HDFS. Creates any parent directories in path
that are missing (e.g., mkdir -p in Linux).
o -touchz <path> : Creates a file at path containing the current time as a timestamp. Fails if a file
already exists at path, unless the file is already size 0.
o -help <cmd-name> : Returns usage information for one of the commands listed above. You must
omit the leading '-' character in cmd.
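The same kind of file operations can also be performed programmatically through Hadoop's FileSystem Java API. Below is a minimal sketch, assuming a single-node cluster with the NameNode at hdfs://localhost:9000 (as configured in the installation section) and illustrative paths; the hadoop-client libraries are assumed to be on the classpath.
CODE: import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");   // NameNode address (assumed)

        try (FileSystem fs = FileSystem.get(conf)) {
            Path dir = new Path("/user/demo");
            fs.mkdirs(dir);                                  // like: -mkdir

            // Write a small file (like: -touchz / putting data into HDFS).
            try (FSDataOutputStream out = fs.create(new Path(dir, "hello.txt"))) {
                out.writeUTF("hello hdfs");
            }

            // List directory contents (like: -ls).
            for (FileStatus status : fs.listStatus(dir)) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
        }
    }
}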
UNIT 5:
Formally, an RDD is a read-only, partitioned collection of records. There are two ways to create
RDDs − parallelizing an existing collection in your driver program, or referencing a dataset in an
external storage system, such as a shared file system, HDFS, HBase, or any data source offering a
Hadoop Input Format.
RDDs address MapReduce's shortcomings in data sharing. When reusing data for computations,
MapReduce requires writing to external storage (HDFS, Cassandra, HBase, etc.). The read and write
processes between jobs consume a significant amount of memory.
Furthermore, data sharing between tasks is slow due to replication, serialization, and increased disk
usage.
RDDs aim to reduce the usage of external storage systems by leveraging in-memory compute
operation storage. This approach improves data exchange speeds between tasks by 10 to 100 times.
An RDD stores data in read-only mode, making it immutable. Performing operations on existing
RDDs creates new objects without manipulating existing data.
RDDs reside in RAM through a caching process. Data that does not fit is either recalculated to reduce
the size or stored on a permanent storage. Caching allows retrieving data without reading from disk,
reducing disk overhead.
RDDs further distribute the data storage across multiple partitions. Partitioning allows data recovery
in case a node fails and ensures the data is available at all times.
Spark's RDD uses a persistence optimization technique to save computation results. Two methods
help achieve RDD persistence:
cache()
persist()
These methods provide an interactive storage mechanism by choosing different storage levels. The
cached memory is fault-tolerant, allowing the recreation of lost RDD partitions through the initial
creation operations.
In-memory computation. Data calculation resides in memory for faster access and fewer I/O
operations.
Fault tolerance. The tracking of data creation helps recover or recreate lost data after a node
failure.
Immutability. RDDs are read-only. The existing data cannot change, and transformations on
existing data generate new RDDs.
Lazy evaluation. Data does not load immediately after definition - the data loads when
applying an action to the data.
Data resilience. The self-recovery mechanism ensures data is never lost, regardless of whether
a machine fails.
Data consistency. Since RDDs do not change over time and are only available for reading,
data consistency is maintained throughout various operations.
Performance speeds. Storing data in RAM whenever possible instead of on disk. However,
RDDs maintain the possibility of on-disk storage to provide a massive performance and
flexibility boost.
The disadvantages when working with Resilient Distributed
Datasets include:
No schematic view of data. RDDs have a hard time dealing with structured data. A better
option for handling structured data is through the DataFrames and Datasets APIs, which fully
integrate with RDDs in Spark.
Garbage collection. Since RDDs are in-memory objects, they rely heavily on Java's memory
management and serialization. This causes performance limitations as data grows.
Overflow issues. When RDDs run out of RAM, the information resides on a disk, requiring
additional RAM and disk space to overcome overflow issues.
No automated optimization. An RDD does not have functions for automatic input
optimization. While other Spark objects, such as DataFrames and Datasets, use the Catalyst
optimizer, for RDDs, optimization happens manually.
2. RDD Operations
Spark RDD Operations
RDDs offer two operation types:
1. Transformations are operations (such as map, filter, or join) that are applied to an existing
RDD and produce a new RDD. Transformations are evaluated lazily; they only describe the
computation to be performed.
2. Actions are operations that do not result in RDD creation and provide some other value.
Actions come as a final step after completed modifications and return a non-RDD result (such
as the total count) from the data stored in the Spark Driver.
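A minimal sketch of the two operation types above using Spark's Java API; the local master URL and the sample data are illustrative.
CODE: import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddOperationsSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {

            // Create an RDD by parallelizing a driver-side collection.
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

            // Transformations: lazily describe new RDDs; nothing runs yet.
            JavaRDD<Integer> squares = numbers.map(n -> n * n);
            JavaRDD<Integer> evens = squares.filter(n -> n % 2 == 0);

            // Actions: trigger execution and return values to the driver.
            long count = evens.count();
            int sum = evens.reduce(Integer::sum);
            System.out.println("count=" + count + ", sum=" + sum);
        }
    }
}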
3. Printing elements of an RDD
On a cluster, calling foreach with a print statement on an RDD prints the elements on the executors,
not on the driver. To print elements on the driver, first bring them back with collect() (or take(n) for a
sample when the RDD is large) and then print the returned collection.
4. Introduction to Graphx
GraphX is the newest component in Spark. It is built around a property graph: a directed multigraph
with properties attached to each vertex and edge, which can be used to represent a wide range of data
structures.
GraphX supports several fundamental operators and an optimized variant of the Pregel API.
In addition to these tools, it includes a growing collection of algorithms that help you analyze
your data.
Spark GraphX is the most powerful and flexible graph processing system available today. It
has a growing library of algorithms that can be applied to your data, including PageRank,
connected components, SVD++, and triangle count.
In addition, Spark GraphX can also view and manipulate graphs and computations. You can
use RDDs to transform and join graphs. A custom iterative graph algorithm can also be
written using the Pregel API.
While Spark GraphX retains its flexibility, fault tolerance, and ease-of-use, it delivers
comparable performance to the fastest specialized graph processors.
5. Features of Graphx
1. Flexibility
Apache Spark GraphX is capable of working with graphs and performing
computations on them. Spark GraphX can be used for ETL processing,
iterative graph computation, exploratory analysis, and so on. The data can
be viewed as a collection as well as a graph, and the transformation and
joining of that data can be efficiently performed with Spark RDDs.
2. Speed
Apache Spark GraphX provides performance comparable to the fastest
graph processing systems, and since it works on top of Spark it by default
inherits the features of Apache Spark such as fault tolerance, flexibility,
ease of use, and so on.
3. Algorithms
Apache Spark GraphX provides the following graph algorithms.
o PageRank
o Connected Components
o Label Propagation
o SVD++
o Strongly Connected Components
o Triangle Count
Data visualization converts large and small data sets into visuals, which are easy to
understand and process for humans.
Data visualization tools provide accessible ways to understand outliers, patterns, and
trends in the data.
In the world of Big Data, the data visualization tools and technologies are required to
analyze vast amounts of information.
Data visualizations are common in your everyday life, and they usually appear in the
form of graphs and charts. A combination of multiple visualizations and bits of
information is referred to as an infographic.
Data visualizations are used to discover unknown facts and trends. You can see
visualizations in the form of line charts to display change over time. Bar and column
charts are useful for observing relationships and making comparisons. A pie chart is a
great way to show parts-of-a-whole. And maps are the best way to share
geographical data visually.
American statistician and Yale professor Edward Tufte believes useful data
visualizations consist of "complex ideas communicated with clarity, precision, and
efficiency."
To craft an effective data visualization, you need to start with clean data that is well-
sourced and complete. After the data is ready to visualize, you need to pick the right
chart.
9. Various BI tools