Big Data Notes
1 – Maintenance Problem
The maintenance of the relational database becomes difficult over time due to the
increase in the data. Developers and programmers have to spend a lot of time
maintaining the database.
2 – Cost
The relational database system is costly to set up and maintain. The initial cost of the
software alone can be quite pricey for smaller businesses, but it gets worse when you
factor in hiring a professional technician who must also have expertise with that
specific kind of program.
3 – Physical Storage
A relational database is made up of rows and columns, which requires a lot of
physical storage because each operation performed relies on its own separate storage.
The storage requirement grows as the data grows.
4 – Lack of Scalability
When a relational database is spread over multiple servers, its structure becomes
difficult to manage, especially when the quantity of data is large. As a result, the
data does not scale well across different physical storage servers. As the database
grows larger or is distributed over more servers, latency and availability problems
increase and overall performance suffers.
5 – Complexity in Structure
Relational databases can only store data in tabular form which makes it difficult to
represent complex relationships between objects. This is an issue because many
applications require more than one table to store all the necessary data required by
their application logic.
6 – Decrease in performance over time
A relational database can become slower over time, and not only because of its reliance
on multiple tables. A large number of tables and a large volume of data increase the
complexity of the system. This can lead to slow query response times, or even to queries
failing completely, depending on how many users are connected to the server at a given
time.
Introduction to NoSQL
NoSQL, originally meaning “non-SQL” or “non-relational”, refers to a database that
provides a mechanism for storing and retrieving data modeled by means other than the
tabular relations used in relational databases. Such databases have existed since the
late 1960s, but did not obtain the NoSQL moniker until a surge of popularity in the
early twenty-first century. NoSQL databases are used in real-time web applications and
big data, and their use is increasing over time.
• NoSQL systems are also sometimes called “Not only SQL” to emphasize the
fact that they may support SQL-like query languages. The appeal of NoSQL
databases includes simplicity of design, simpler horizontal scaling to clusters
of machines, and finer control over availability. The data structures used by
NoSQL databases are different from those used by default in relational
databases, which makes some operations faster in NoSQL. The suitability
of a given NoSQL database depends on the problem it should solve.
• NoSQL databases, also known as “not only SQL” databases, are a type
of database management system that has gained popularity in recent
years. Unlike traditional relational databases, NoSQL databases are
designed to handle large amounts of unstructured or semi-structured data,
and they can accommodate dynamic changes to the data model. This
makes NoSQL databases a good fit for modern web applications, real-time
analytics, and big data processing.
• Data structures used by NoSQL databases are sometimes also viewed as
more flexible than relational database tables. Many NoSQL stores
compromise consistency in favour of availability, speed and partition
tolerance. Barriers to the greater adoption of NoSQL stores include the use
of low-level query languages, lack of standardized interfaces, and huge
previous investments in existing relational databases.
• Most NoSQL stores lack true ACID (Atomicity, Consistency, Isolation,
Durability) transactions but a few databases, such as MarkLogic,
Aerospike, FairCom c-treeACE, Google Spanner (though technically a
NewSQL database), Symas LMDB, and OrientDB have made them central
to their designs.
• Most NoSQL databases offer eventual consistency, in which database
changes are propagated to all nodes over time. Queries might therefore
not return updated data immediately, or might read data that is out of
date, a problem known as stale reads. Some NoSQL systems may also
exhibit lost writes and other forms of data loss, and some provide
mechanisms such as write-ahead logging to avoid data loss.
• One simple example of a NoSQL database is a document database. In a
document database, data is stored in documents rather than tables. Each
document can contain a different set of fields, making it easy to
accommodate changing data requirements.
• For example, consider a database that holds data about employees.
In a relational database, this information might be stored in
tables, with one table for employee information and another table for
department information. In a document database, each employee would
be stored as a separate document, with all of their information contained
within the document.
• NoSQL databases are a type of database management system that has
gained popularity in recent years due to its scalability and flexibility.
They are designed to handle large amounts of unstructured
or semi-structured data and can handle dynamic changes to the data
model. This makes NoSQL databases a good fit for modern web
applications, real-time analytics, and big data processing.
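The employee example above can be sketched in plain Python. This is only an illustration: the field names, ids, and sample values are hypothetical, and a real document store would hold the documents in a collection rather than in a Python list.

```python
# Relational layout: two "tables" linked by a department id (hypothetical data).
employees_table = [
    {"emp_id": 1, "name": "Asha", "dept_id": 10},
    {"emp_id": 2, "name": "Ravi", "dept_id": 20},
]
departments_table = [
    {"dept_id": 10, "dept_name": "Engineering"},
    {"dept_id": 20, "dept_name": "Sales"},
]

def department_of_relational(emp_id):
    """Look up an employee's department by joining the two tables."""
    emp = next(e for e in employees_table if e["emp_id"] == emp_id)
    dept = next(d for d in departments_table if d["dept_id"] == emp["dept_id"])
    return dept["dept_name"]

# Document layout: each employee is one self-contained document.
# Documents need not share the same set of fields.
employee_documents = [
    {"emp_id": 1, "name": "Asha", "department": {"name": "Engineering"}},
    {"emp_id": 2, "name": "Ravi", "department": {"name": "Sales"},
     "skills": ["negotiation"]},  # extra field present only on this document
]

def department_of_document(emp_id):
    """Read the department straight from the employee's own document."""
    doc = next(d for d in employee_documents if d["emp_id"] == emp_id)
    return doc["department"]["name"]

print(department_of_relational(2))
print(department_of_document(2))
```

Both lookups return the same answer; the difference is that the relational version has to combine two tables, while the document version reads one record.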
Key Features of NoSQL:
1. Dynamic schema: NoSQL databases do not have a fixed schema and can
accommodate changing data structures without the need for migrations
or schema alterations.
2. Horizontal scalability: NoSQL databases are designed to scale out by
adding more nodes to a database cluster, making them well-suited for
handling large amounts of data and high levels of traffic.
3. Document-based: Some NoSQL databases, such as MongoDB, use a
document-based data model, where data is stored in semi-structured
format, such as JSON or BSON.
4. Key-value-based: Other NoSQL databases, such as Redis, use a key-value
data model, where data is stored as a collection of key-value pairs.
5. Column-based: Some NoSQL databases, such as Cassandra, use a column-
based data model, where data is organized into columns instead of rows.
6. Distributed and high availability: NoSQL databases are often designed to
be highly available and to automatically handle node failures and data
replication across multiple nodes in a database cluster.
7. Flexibility: NoSQL databases allow developers to store and retrieve data
in a flexible and dynamic manner, with support for multiple data types and
changing data structures.
8. Performance: NoSQL databases are optimized for high performance and
can handle a high volume of reads and writes, making them suitable for big
data and real-time applications.
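The key-value and column-based models listed above can be pictured with ordinary Python containers. This is a simplified sketch with hypothetical records and key names; real systems such as Redis or Cassandra have their own storage formats and APIs, and wide-column stores in particular are more involved than a plain "one list per column" layout.

```python
# Hypothetical sample records, shown first as ordinary rows.
rows = [
    {"id": 1, "name": "Asha", "city": "Pune"},
    {"id": 2, "name": "Ravi", "city": "Delhi"},
]

# Key-value model: each value is stored and fetched under a unique key.
key_value_store = {f"user:{r['id']}": r for r in rows}

# Column-based model: values are grouped by column rather than by row.
column_store = {col: [r[col] for r in rows] for col in rows[0]}

print(key_value_store["user:2"]["name"])
print(column_store["city"])
```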
Advantages of NoSQL:
There are many advantages of working with NoSQL databases such as MongoDB and
Cassandra. The main advantages are high scalability and high availability.
1. High scalability: NoSQL databases use sharding for horizontal scaling.
Sharding means partitioning the data and placing the partitions on multiple
machines, in some schemes in such a way that the order of the data is
preserved. Vertical scaling means adding more resources to the existing
machine, whereas horizontal scaling means adding more machines to handle
the data. Vertical scaling is limited by the capacity of a single machine,
while horizontal scaling can continue by adding machines. Examples of
horizontally scaling databases are MongoDB, Cassandra, etc. Because of
this scalability, NoSQL can handle a huge amount of data: as the data grows,
the database scales out to handle it efficiently.
2. Flexibility: NoSQL databases are designed to handle unstructured or
semi-structured data, which means that they can accommodate dynamic
changes to the data model. This makes NoSQL databases a good fit for
applications that need to handle changing data requirements.
3. High availability: The automatic replication feature of NoSQL databases
makes them highly available, because in case of a failure the data can be
restored from a replica to the last consistent state.
4. Scalability: NoSQL databases are highly scalable, which means that they
can handle large amounts of data and traffic with ease. This makes them a
good fit for applications that need to handle large amounts of data or
traffic.
5. Performance: NoSQL databases are designed to handle large amounts of
data and traffic, which means that they can offer improved performance
compared to traditional relational databases.
6. Cost-effectiveness: NoSQL databases are often more cost-effective than
traditional relational databases, as they are typically less complex and do
not require expensive hardware or software.
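Sharding, mentioned in the first advantage, can be sketched as a function that maps each key to one of several machines. The version below assumes hash-based partitioning, which spreads keys evenly but does not preserve their order (a range-based scheme would be needed for that). The function names and the `user:N` keys are hypothetical.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def distribute(keys, num_shards):
    """Group keys by the shard they would be stored on."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards

keys = [f"user:{i}" for i in range(10)]
print(distribute(keys, 3))
```

Adding a machine corresponds to raising `num_shards`. With this simple modulo scheme most keys would move to a different shard when that happens, which is one reason production systems tend to use more elaborate partitioning.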
Disadvantages of NoSQL:
NoSQL has the following disadvantages.
1. Lack of standardization : There are many different types of NoSQL
databases, each with its own unique strengths and weaknesses. This lack
of standardization can make it difficult to choose the right database for a
specific application.
2. Lack of ACID compliance: Many NoSQL databases are not fully ACID-compliant,
which means that they do not guarantee the consistency, integrity, and
durability of data. This can be a drawback for applications that require
strong data consistency guarantees.
3. Narrow focus: NoSQL databases have a narrow focus. They are mainly
designed for storage and provide little functionality beyond it. Relational
databases are a better choice than NoSQL in the field of transaction
management.
4. Open-source: Many NoSQL databases are open-source, and there is no
reliable standard for NoSQL yet. In other words, two database systems are
likely to differ from each other.
5. Lack of support for complex queries : NoSQL databases are not designed
to handle complex queries, which means that they are not a good fit for
applications that require complex data analysis or reporting.
6. Lack of maturity : NoSQL databases are relatively new and lack the
maturity of traditional relational databases. This can make them less
reliable and less secure than traditional databases.
7. Management challenge: The purpose of big data tools is to make the
management of a large amount of data as simple as possible, but in practice
it is not so easy. Data management in NoSQL is much more complex than in a
relational database. NoSQL, in particular, has a reputation for being
challenging to install and even more demanding to manage on a daily basis.
8. Limited GUI tools: GUI tools for accessing the database are not widely
available in the market.
9. Backup: Backup is a weak point for some NoSQL databases such as
MongoDB, where taking a backup of the data in a consistent manner can be
difficult.
10. Large document size: Some database systems like MongoDB and
CouchDB store data in JSON or JSON-like formats. This means that documents
are quite large, which costs storage, network bandwidth, and speed, and
descriptive key names actually hurt since they increase the document size.
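The point about key names can be checked with a small measurement. The sketch below serializes the same hypothetical record twice with Python's standard `json` module, once with descriptive keys and once with abbreviated keys, and compares the encoded sizes. Because every document repeats its own key names, the overhead is paid once per document.

```python
import json

# The same (hypothetical) record with descriptive and with abbreviated keys.
descriptive = {"first_name": "Asha", "last_name": "Rao", "date_of_birth": "1990-01-01"}
abbreviated = {"fn": "Asha", "ln": "Rao", "dob": "1990-01-01"}

# Size in bytes of each record once encoded as JSON text.
size_descriptive = len(json.dumps(descriptive).encode("utf-8"))
size_abbreviated = len(json.dumps(abbreviated).encode("utf-8"))

print(size_descriptive, size_abbreviated)
```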
Types of NoSQL database:
The types of NoSQL databases, and the names of database systems that fall into each
category, are:
1. Graph Databases: Examples – Amazon Neptune, Neo4j
2. Key value store: Examples – Memcached, Redis, Coherence
3. Tabular: Examples – HBase, Bigtable, Accumulo
4. Document-based: Examples – MongoDB, CouchDB, Cloudant
However, the decision to choose a database is not that simple (what is really?!!).
Both the SQL and NoSQL databases have different structures and different data
storage methods. So the choice between SQL vs NoSQL essentially boils down to the
type of database that is required for a particular project.
What’s so different?
Both SQL and NoSQL databases serve the same purpose i.e. storing data but they go
about it in vastly different ways. There are multiple differences between the SQL and
NoSQL databases and it is important to understand them in order to make an
informed choice about the type of database required.
Keeping that in mind, some of the important differences between the SQL and NoSQL
databases are given as follows:
1. Language:
Let’s imagine that in the database world, everyone speaks X Language. So it would
be quite confusing if you started speaking Y language in the middle of that. This is
the case with SQL databases. The SQL databases manipulate the data based on SQL
which is one of the most versatile and widely-used language options available. While
this makes it a safe choice especially for complex queries, it can also be restrictive.
This is because it requires the use of predefined schemas to determine the structure
of data before you work with it and changing the structure can be quite confusing
(like using Y language).
Now again imagine a database world where multiple languages are spoken.
While this world would be a little chaotic, speaking Y language would be fine because
you would be sure to find someone who understands it! This is a NoSQL database, which
has a dynamic schema for unstructured data. Here, data is stored in many ways which means it can
be document-oriented, column-oriented, graph-based, etc. This flexibility means
that documents can be created without having a defined structure and so each
document can have its own unique structure.
2. Scalability
Think about a tall building in your neighborhood. If given the option, would it be
better to add more floors in this building or create a new building entirely for more
residents?
This is the problem for SQL and NoSQL databases. SQL databases are vertically
scalable. This means that the capacity of a single server can be increased by adding
things like RAM, CPU, or SSD. (More floors can be added to this building). On the
other hand, NoSQL databases are horizontally scalable. This means that more traffic
can be handled by sharding, or adding more servers in your NoSQL database. (More
buildings can be added to the neighborhood).
In the long run, it is better to add more buildings than floors as that is more stable
(Less chance of creating a Leaning Tower of Pisa!!!). Thus, NoSQL can ultimately
become larger and more powerful, making NoSQL databases the preferred choice for
large or ever-changing data sets.
3. Schema Design
A schema refers to the blueprint of a database i.e how the data is organized. The
schema of an SQL database and a NoSQL database is markedly different. Let’s use a
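joke to better understand this: a group of database admins walked into a NoSQL bar,
but soon walked out because they couldn’t find a table.
This basically means that the poor database admins couldn’t find a table in NoSQL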
because there is no standard schema definition for NoSQL databases. They are either
key-value pairs, document-based, graph databases or wide-column stores depending
on the requirements. On the other hand, if those database admins had gone to a SQL
bar, they certainly would have found tables as SQL databases have a table-based
schema.
This difference in schema makes relational SQL databases a better option for
applications that require multi-row transactions such as an accounting system or for
legacy systems that were built for a relational structure. However, NoSQL databases
are much better suited for big data as flexibility is an important requirement which
is fulfilled by their dynamic schema.
4. Community
SQL is a mature technology (like your old but very wise uncle) and there are many
experienced developers who understand it. Also, great support is available for all SQL
databases from their vendors. There are even a lot of independent consultants who
can help with SQL databases for very large scale deployments.
On the other hand, NoSQL is comparatively new (the young and fun cousin!) and so
some NoSQL databases are reliant on community support. Also, only a limited number of
outside experts are available for setting up and deploying large scale NoSQL deployments.
The Big Questions!!!
NoSQL is a recent technology compared to SQL. So naturally, there are lots of
questions in regards to it especially in the context of big data and data analytics.
Some of the major questions relating to this are addressed below:
Is NoSQL faster than SQL?
In general, NoSQL is not faster than SQL just as SQL is not faster than NoSQL. For
those that didn’t get that statement, it means that speed as a factor for SQL and
NoSQL databases depends on the context.
SQL databases are normalized databases where the data is broken down into various
logical tables to avoid data redundancy and data duplication. In this scenario, SQL
databases are faster than their NoSQL counterparts for joins, queries, updates, etc.
On the other hand, NoSQL databases are specifically designed for unstructured data
which can be document-oriented, column-oriented, graph-based, etc. In this case, a
particular data entity is stored together and not partitioned. So performing read or
write operations on a single data entity is faster for NoSQL databases as compared
to SQL databases.
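The contrast between the two storage styles can be sketched with Python's built-in `sqlite3` and `json` modules. The table names, columns, and sample data are hypothetical, and the sketch illustrates only how the data is laid out and read, not actual performance.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
    INSERT INTO customers VALUES (1, 'Asha');
    INSERT INTO orders VALUES (100, 1, 'laptop'), (101, 1, 'mouse');
""")

# Normalized SQL: the entity is reassembled with a join at read time.
joined = conn.execute(
    "SELECT c.name, o.item FROM customers c "
    "JOIN orders o ON o.customer_id = c.id ORDER BY o.id"
).fetchall()

# Document style: the whole entity is stored and read back as one unit.
documents = {"customer:1": json.dumps({"name": "Asha", "orders": ["laptop", "mouse"]})}
customer = json.loads(documents["customer:1"])

print(joined)
print(customer["orders"])
```

The normalized form stores the customer name once and is convenient for queries across many customers. The document form returns everything about one customer in a single read.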
Is NoSQL better for Big Data Applications?
They say “Necessity is the Mother of Invention!” and that certainly turned out to be
true in the case of NoSQL. The NoSQL databases for big data were specifically
developed by the top internet companies such as Google, Yahoo, Amazon, etc. as the
existing relational databases were not able to cope with the increasing data
processing requirements.
NoSQL databases have a dynamic schema that is much better suited for big data as
flexibility is an important requirement. Also, large amounts of analytical data can be
stored in NoSQL databases for predictive analysis. An example of this is data from
various social media sites such as Instagram, Twitter, Facebook, etc. NoSQL
databases are horizontally scalable and can ultimately become larger and more
powerful if required. All of this makes NoSQL databases the preferred choice for big
data applications.
The choice between SQL and NoSQL depends entirely on individual circumstances as
both of them have advantages as well as disadvantages. SQL databases are long
established with fixed schema design and a set structure. They are ideal for
applications that require multi-row transactions such as an accounting system or for
legacy systems that were built for a relational structure.
On the other hand, NoSQL databases are easily scalable, flexible and simple to use
as they have no rigid schema. They are ideal for applications with no specific schema
definitions such as content management systems, big data applications, real-time
analytics, etc.
A graph is a pictorial representation of data in the form of nodes and relationships,
where the relationships are represented by edges. A graph database is a type of database
that represents data in the form of a graph. It has three components: nodes,
relationships, and properties. These components are used to model the data. The
concept of a graph database is based on graph theory, and graph databases in their
current form date from around the year 2000. They are commonly classed as NoSQL
databases because data is stored as nodes, relationships, and properties instead of in
tables. A graph database is very useful for heavily interconnected data. Relationships
between data are given priority, so they can be easily visualized. Graph databases are
flexible, as new data can be added without disturbing the existing data. They are useful
in fields such as social networking, fraud detection, and AI knowledge graphs.
The components are described as follows:
• Nodes: They represent the objects or instances, and are equivalent to a row in
a relational database. A node acts as a vertex in the graph. Nodes are
grouped by applying a label to each member.
• Relationships: They are the edges in the graph. Each has a specific
direction and type, and together they form the patterns of the data. They
establish the connections between nodes.
• Properties: They are the information associated with the nodes and relationships.
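The three components can be modelled with a couple of small Python classes. This is only a sketch of the data model, not the storage format of any particular graph database; the class names, field names, and the `WORKS_AT` example are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    label: str                                      # groups nodes, e.g. "Person"
    properties: dict = field(default_factory=dict)  # information about the node

@dataclass
class Relationship:
    start: int                                      # id of the source node
    end: int                                        # id of the target node
    rel_type: str                                   # e.g. "WORKS_AT"
    properties: dict = field(default_factory=dict)  # information about the edge

alice = Node(1, "Person", {"name": "Alice"})
acme = Node(2, "Company", {"name": "Acme"})
works_at = Relationship(alice.node_id, acme.node_id, "WORKS_AT", {"since": 2020})

print(works_at)
```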
Some examples of graph database software are Neo4j, Oracle NoSQL DB, and Graph base;
of these, Neo4j is the most popular.
In traditional databases, the relationships between data are not stored directly. In a
graph database, by contrast, the relationships between data are prioritized. Much of
the data in use today is interconnected, with one item connected directly or indirectly
to others. Since the concept of this database is based on graph theory, it is flexible
and works very fast for associative data. Interconnected data also helps to establish
further relationships. Querying is fast as well, because the relationships lead quickly
to the desired nodes. Join operations are not required in this database, which reduces
the cost. The relationships and properties are stored as first-class entities in a graph
database.
Graph databases allow organizations to connect their data with external sources as well.
Since organizations require a huge amount of data, it often becomes cumbersome to
store it in the form of tables. For instance, if an organization wants to find a
particular item that is connected with an item in another table, a join operation is
first performed between the tables, and then the data is searched row by row. A graph
database solves this problem by storing the relationships and properties along with
the data. If the organization needs to search for a particular item, the relationships
and properties lead to the nodes without joining tables or traversing them row by row.
Thus the cost of finding nodes depends on the relationships followed rather than on the
total amount of data.
Types of Graph Databases:
• Property Graphs: These graphs are used for querying and analyzing data
by modelling the relationships among the data. A property graph comprises
vertices that hold information about a particular subject and edges that
denote the relationships. The vertices and edges have additional attributes
called properties.
• RDF Graphs: RDF stands for Resource Description Framework. RDF graphs
focus more on data integration and are used to represent complex data with
well-defined semantics. Each statement is represented by three elements: two
vertices and an edge, reflecting the subject and object (the vertices) and the
predicate (the edge) of a sentence. Every vertex and edge is identified by a
URI (Uniform Resource Identifier).
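The subject, predicate, object structure of an RDF graph can be sketched as a list of three-element tuples. The `http://example.org/` URIs below are placeholders invented for illustration, and a real RDF store would be queried with a dedicated language such as SPARQL rather than a Python loop.

```python
# Each statement is a (subject, predicate, object) triple of URIs.
# The example.org URIs are placeholders, not real resources.
EX = "http://example.org/"
triples = [
    (EX + "alice", EX + "knows", EX + "bob"),
    (EX + "bob", EX + "knows", EX + "carol"),
    (EX + "alice", EX + "worksAt", EX + "acme"),
]

def objects_of(subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects_of(EX + "alice", EX + "knows"))
```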
When to Use Graph Database?
• Graph databases should be used for heavily interconnected data.
• They should be used when the amount of data is large and relationships are
present.
• They can be used to represent a cohesive picture of the data.
How Do Graphs and Graph Databases Work?
Graph databases provide graph models. They allow users to perform traversal queries,
since the data is connected. Graph algorithms are also applied to find patterns, paths,
and other relationships, thus enabling further analysis of the data. The algorithms help
to explore neighboring nodes, cluster vertices, and analyze relationships and
patterns. Countless joins are not required in this kind of database.
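One of the simplest path-finding algorithms of this kind is breadth-first search, which explores neighboring nodes level by level. The sketch below runs it over a small hypothetical graph stored as an adjacency list; it illustrates the traversal idea only and is not the internal algorithm of any specific graph database.

```python
from collections import deque

# Hypothetical undirected graph stored as an adjacency list.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns a shortest path as a list, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(shortest_path(graph, "A", "E"))
```

Each step follows edges directly from the current node, so no join between tables is involved.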
Example of Graph Database:
• Recommendation engines in e-commerce use graph databases to provide
customers with accurate recommendations and updates about new products,
thus increasing sales and satisfying the customer’s desires.
• Social media companies use graph databases to find the “friends of friends”
or products that the user’s friends like, and send suggestions accordingly to
the user.
• Graph databases play a major role in detecting fraud. Users can create a
graph from the transactions between entities and store other important
information. Once the graph is created, running a query can help to identify
the fraud.
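The “friends of friends” lookup from the social media example can be sketched as a two-hop traversal. The friendship data and names below are hypothetical; a graph database would express the same idea in its own query language.

```python
# Hypothetical friendship graph: each person maps to a set of direct friends.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave", "erin"},
    "dave": {"bob", "carol"},
    "erin": {"carol"},
}

def friends_of_friends(person):
    """People two hops away who are not already direct friends."""
    direct = friends[person]
    two_hops = set()
    for friend in direct:
        two_hops |= friends[friend]
    return two_hops - direct - {person}

print(sorted(friends_of_friends("alice")))
```

The people returned are candidates for friend suggestions, since they share at least one mutual friend with the user.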
Advantages of Graph Database:
• A potential advantage of a graph database is that it can establish relationships
with external sources as well.
• No joins are required since relationships are already specified.
• Query performance depends on the concrete relationships and not on the amount
of data.
• It is flexible and agile.
• It is easy to manage the data in terms of a graph.
• Efficient data modeling: Graph databases allow for efficient data modeling
by representing data as nodes and edges. This allows for more flexible and
scalable data modeling than traditional relational databases.
• Flexible relationships: Graph databases are designed to handle complex
relationships and interconnections between data elements. This makes
them well-suited for applications that require deep and complex queries,
such as social networks, recommendation engines, and fraud detection
systems.
• High performance: Graph databases are optimized for handling large and
complex datasets, making them well-suited for applications that require
high levels of performance and scalability.
• Scalability: Graph databases can be easily scaled horizontally, allowing
additional servers to be added to the cluster to handle increased data
volume or traffic.
• Easy to use: Graph databases are typically easier to use than traditional
relational databases. They often have a simpler data model and query
language, and can be easier to maintain and scale.
Disadvantages of Graph Database:
• Searching often becomes slower for complex relationships.
• The query language is platform dependent.
• They are inappropriate for transactional data.
• They have a smaller user base.
• Limited use cases: Graph databases are not suitable for all applications. They
may not be the best choice for applications that require simple queries or
that deal primarily with data that can be easily represented in a traditional
relational database.
• Specialized knowledge: Graph databases may require specialized
knowledge and expertise to use effectively, including knowledge of graph
theory and algorithms.
• Immature technology: The technology for graph databases is relatively new
and still evolving, which means that it may not be as stable or well-
supported as traditional relational databases.
• Integration with other tools: Graph databases may not be as well-integrated
with other tools and systems as traditional relational databases, which can
make it more difficult to use them in conjunction with other technologies.
• Overall, graph databases within the NoSQL family offer many advantages for applications
that require complex and deep relationships between data elements. They
are highly flexible, scalable, and performant, and can handle large and
complex datasets. However, they may not be suitable for all applications,
and may require specialized knowledge and expertise to use effectively.
Future of Graph Database:
A graph database is an excellent tool for storing data, but it cannot completely
replace the traditional database, because it deals with a particular kind of
interconnected data. Although graph databases are still in a developmental phase, they
are becoming increasingly important as businesses and organizations use big data and
rely on graph databases for complex analysis. Thus these databases have become a must
for today’s needs and tomorrow’s success.
Schemaless Database
Traditional relational databases are well-defined, using a schema to describe every functional
element, including tables, rows, views, indexes, and relationships. By exerting a high degree
of control, the database administrator can improve performance and prevent capture of low-
quality, incomplete, or malformed data. In a SQL database, the schema is enforced by the
Relational Database Management System (RDBMS) whenever data is written to disk.
But in order to work, data needs to be heavily formatted and shaped to fit into the table
structure. This means sacrificing any undefined details during the save, or storing valuable
information outside the database entirely.
A schemaless database, like MongoDB, does not have these up-front constraints, mapping to
a more ‘natural’ database. Even when sitting on top of a data lake, each document is created
with a partial schema to aid retrieval. Any formal schema is applied in the code of your
applications; this layer of abstraction protects the raw data in the NoSQL database and allows
for rapid transformation as your needs change.
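The idea that the formal schema lives in application code rather than in the database can be sketched as a small validation function. The schema, field names, and documents below are hypothetical, and the list stands in for a schemaless store that accepts documents of any shape.

```python
# Hypothetical application-level schema: field name -> required type.
USER_SCHEMA = {"name": str, "age": int}

def validate(document, schema):
    """Return a list of problems; an empty list means the document conforms."""
    problems = []
    for field_name, expected_type in schema.items():
        if field_name not in document:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(document[field_name], expected_type):
            problems.append(f"wrong type for field: {field_name}")
    return problems

# The store itself accepts any shape; extra fields are kept as-is.
stored = [
    {"name": "Asha", "age": 30, "nickname": "Ash"},
    {"name": "Ravi"},
]

for doc in stored:
    print(validate(doc, USER_SCHEMA))
```

Because the check runs in the application, the schema can change without rewriting the documents already stored; older documents are simply reported as non-conforming until they are updated.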
Any data, formatted or not, can be stored in a non-tabular NoSQL type of database. At the
same time, using the right tools in the form of a schemaless database can unlock the value of
all of your structured and unstructured data types.