Unit 2 Evaluating NoSQL Databases
Objectives:
The objectives of this module are to understand:
• NoSQL databases.
• Basic principles and design criteria of NoSQL databases.
• Comparisons among different types of NoSQL databases.
• Different types of features of different NoSQL databases.
• Internals of different NoSQL databases.
• Different use cases for different NoSQL databases.
• Data storage and processing techniques.
• Advantages of NoSQL database over RDBMS.
Outcome:
At the end of this module, you are expected to explain/describe:
• NoSQL databases.
• Basic principles and design criteria of NoSQL databases.
• Comparisons among different types of NoSQL databases.
• Different types of features of different NoSQL databases.
• Internals of different NoSQL databases.
• Different use cases for different NoSQL databases.
• Data storage and processing techniques.
• Advantages of NoSQL database over RDBMS.
Content
• The Technical Evaluation
• Choosing NoSQL
• Search Features
• Scaling NoSQL
• Keeping Data Safe
• Visualizing NoSQL
• Extending Data Layer
• Business Evaluation
• Deploying Skills
• Deciding Open Source versus commercial software
• Business critical features, Security
The Technical Evaluation
• The term “NoSQL” is very broad: it is meant to encapsulate any database that does not follow the relational model.
• There are several categories of NoSQL databases, and within each category there are several options.
• It is hard to discern which of these best fits your application requirements and allows for future support and growth within your enterprise.
• As we go about investigating the different options, there are several important criteria
to mull over.
• Firstly, whether the database enjoys widespread adoption and support. Does the
database have a robust developer community and partner ecosystem? If so, then that is
a good indicator of its potential for future growth and adoption within your
organization.
• Another aspect to consider is the applicability of the database to a broad variety of use
cases. If the database technology is only good at satisfying one or two scenarios, then you
can expect it to be unsuccessful at an early stage.
• Based on these two criteria alone, MongoDB is the NoSQL database to consider for modern Big Data applications.
• With 40 million downloads and counting, MongoDB has a thriving developer community with thousands of certified professionals, and it consistently ranks as the most popular NoSQL database according to DB-Engines’ monthly rankings.
• Also, as a general-purpose database, MongoDB can be employed successfully to address many different use cases.
Choosing NoSQL
• NoSQL databases provide high operational speed and increased flexibility for software developers and other users when compared to traditional tabular (or SQL) databases.
• NoSQL databases can be scaled across thousands of servers, though sometimes with
loss of data consistency. But what makes NoSQL databases especially relevant today is
that they are particularly well suited for working with large sets of distributed data,
which makes them a good choice for big data and analytics projects.
How to choose a NoSQL database: Key factors
• With more than two dozen open source and commercial NoSQL databases in the
market, how do you choose the right product or cloud service?
• One vital factor is to know the purpose to which you want to put the data, says Carl Olofson, an IDC research vice president.
• NoSQL databases vary in architecture and function, so you need to pick the type that is best for the desired task.
• In general, key-value stores are best for the persistent sharing of data by multiple
processes or microservices in an application.
• Do not assume your initial project is the only usage model that you will apply to the
database. You might start out just doing state or session data management, then look to
do transaction processing, and still later do some analytics.
• For the near term, the focus should be around performance, scale, security, support for
various workloads (including transactional, operational, and analytics), integration
with existing ecosystems, administration effort, cloud support, and type of use cases
supported, says Noel Yuhanna, a principal analyst at Forrester Research.
• Of all these, security is critical. NoSQL databases that have security certifications should
be given higher consideration.
• Look for features such as encryption of both data at rest and data in motion to protect
sensitive information.
• Also, not all NoSQL databases scale well, Yuhanna says, so do not take for granted that just because a product is in the NoSQL category it will scale and perform better than relational databases.
• NoSQL offers different consistency levels in the scale-out model, so look at solutions that
meet your specific requirements. For example, if you want to support highly critical
banking-like transactions, relational databases are still the best solution.
The NoSQL databases you should consider
MongoDB
MongoDB is the most popular NoSQL database. A free and open source, cross-platform, document-oriented database, MongoDB uses JSON-like documents with optional schemas. The platform is maintained by MongoDB Inc. and was published under a combination of the GNU Affero General Public License and the Apache License; since October 2018, new server releases have been published under the Server Side Public License (SSPL).
MongoDB Atlas incorporates operational best practices the company has learned from
optimizing thousands of deployments at organizations of all sizes. The cloud-based offering
handles database management, setup and configuration, software patching, monitoring,
and backups, and it operates as a distributed database cluster.
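The JSON-like document model mentioned above can be illustrated without a database server. The following is a minimal sketch in plain Python, assuming a hypothetical `posts` collection and a hypothetical `find` helper; it does not use MongoDB's driver API and only shows how documents in one collection may carry different fields.

```python
import json

# Two hypothetical documents in the same collection; the second has an
# extra nested field, which a fixed relational schema would not allow.
posts = [
    {"_id": 1, "title": "Intro to NoSQL", "tags": ["nosql", "intro"]},
    {"_id": 2, "title": "Scaling out", "tags": ["sharding"],
     "author": {"name": "A. Writer", "country": "IN"}},
]

def find(collection, **criteria):
    """Return documents whose top-level fields equal all given criteria."""
    return [doc for doc in collection
            if all(doc.get(key) == value for key, value in criteria.items())]

# Documents serialise naturally to JSON text.
as_json = json.dumps(posts[1], sort_keys=True)
match = find(posts, title="Scaling out")
```

Here `match` holds only the second document, even though the two documents do not share the same set of fields.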
Amazon DynamoDB
Amazon DynamoDB is another popular cloud-based NoSQL database. It is a fully managed NoSQL platform that uses solid-state drives (SSDs) to store, process, and access data in support of high-performance, scale-driven applications.
It automatically shards data across servers based on the workload’s throughput and storage
requirements, and handles larger high-performance use cases.
Users can scale, monitor, and manage their tables both via Application Programming
Interfaces (APIs) and the Amazon Web Services Management Console. DynamoDB is tightly
integrated with Amazon EMR (a managed framework for Apache Hadoop, Apache Spark,
and HBase) that offers the ability to run queries that span multiple data sources.
The platform supports both key-value and document models and also has a library for
geospatial indexing. Organizations use DynamoDB to support a variety of use cases,
including advertising campaigns, social media applications, tracking gaming information,
collecting and analysing sensor and log data, and e-commerce.
DataStax and DataStax Enterprise Platform
DataStax leverages Apache Cassandra for distribution across data centres. A strong plus for
DataStax NoSQL has been its global distributed architecture, says Forrester’s Yuhanna.
DataStax distributes, contributes to, and supports the commercial enterprise version of Apache Cassandra, an open source project. Cassandra is a distributed wide-row, key-value database whose data model is based on Google Bigtable.
Among its key features are fault tolerance, scale-out architecture, low-latency data access,
and simplified administration. DataStax provides additional features such as analytics,
search, monitoring, in-memory, and security to support critical applications.
Couchbase
Couchbase is a database platform with JSON document support, distributed by Couchbase Inc. The open source NoSQL DBMS supports broad use cases.
Couchbase Server, an open source NoSQL key-value and document database with built-in cache, appeals to enterprises that need a database that can deliver performance, multi-model support, scale, and automation.
Organisations use Couchbase to support social and mobile applications, content and metadata stores, e-commerce transactions, and online gaming applications. Couchbase provides full support for documents, a flexible data model, indexing, full-text search, and MapReduce for real-time analytics.
The platform is used by large enterprises to support various critical workloads, including operational and analytical processes.
Redis Enterprise
Sponsored by Redis Labs, the open source platform Redis Enterprise is one of the most common key-value NoSQL databases, says IDC’s Olofson.
Redis offers a high-performing, in-memory database that supports both relaxed and strong consistency, a flexible schemaless model, high availability, and ease of deployment.
Redis Labs developed additional features and technology that encapsulate the open source software and provide an enhanced deployment architecture for Redis, while supporting the open source API.
The data model supports key-value; a variety of data structures such as lists, sets, bitmaps, and hashes; and a range of models through pluggable modules such as search, graph, JSON, and XML. Redis supports a variety of use cases, including real-time analytics, transactions, data ingestion, social media, job management, message queuing, and caching.
MarkLogic
MarkLogic NoSQL Database is an operational and transactional enterprise database
designed for NoSQL speed and scale. Using a multi-model approach, the database integrates and stores critical data, then lets you view that data as documents, as a graph, or as relational data, whether on-premises, virtualized, or in the cloud.
It provides high availability and security features at the data level, including ACID
compliance, element-level security, anonymisation, redaction, and advanced encryption. For
those reasons, it is suitable for enterprises looking to share massive amounts of sensitive
information. MarkLogic is also the only NoSQL database with a Common Criteria
certification.
Other NoSQL options
Other open source and commercial NoSQL database offerings include:
Search Features
Many NoSQL databases support query capabilities and certain search capabilities. Choosing
the right one often comes down to understanding the features you need to support.
Relevancy calculations enable many more-flexible search interactions. The users doing the searches make the final call about which result is a match for them; the search engine just provides ordered hints.
Both search and query enable exact value matches and range queries — for example, where
a date field value in a record lies between two values. Range queries are not supported by
many NoSQL databases or search engines, so if you need them, be sure to check for this
early in your selection process.
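A range query of the kind described above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming hypothetical records with a `published_on` field and a hypothetical `range_query` helper; real databases implement this with their own index structures.

```python
from bisect import bisect_left, bisect_right
from datetime import date

# Hypothetical records with a "published_on" date field.
records = [
    {"id": "a", "published_on": date(2020, 1, 15)},
    {"id": "b", "published_on": date(2021, 6, 1)},
    {"id": "c", "published_on": date(2022, 3, 9)},
    {"id": "d", "published_on": date(2023, 11, 30)},
]

# A sorted index over the field is what makes a range query cheap.
index = sorted((r["published_on"], r["id"]) for r in records)
keys = [k for k, _ in index]

def range_query(low, high):
    """Return ids of records with low <= published_on <= high."""
    start = bisect_left(keys, low)
    end = bisect_right(keys, high)
    return [record_id for _, record_id in index[start:end]]

in_range = range_query(date(2021, 1, 1), date(2022, 12, 31))
```

A store without an ordered index on the field would have to scan every record to answer the same question, which is one reason to check for range-query support early.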
Most search engines let you search entire records or limit query terms to specific fields (such as a “published on” date). Typically, multiple free-text query methods are available, including these:
Word query (where the words are OR’ed together): “Adam Fowler blog” is evaluated as Adam OR Fowler OR blog; a match on all the words results in a higher relevancy score than a match on just one of them.
Phrase query (where the whole phrase is treated as one unit): “Of Mice and Men” is evaluated such that the result must have all the words, in the same order, to be a match.
Wildcard: Searching for “run*” returns results for “run,” “runs,” “running,” and “runner.”
Stemming: A search for “run” also returns results for “ran” and “runs,” but not “running”
or “runner”; searching for “cat” also returns results for “cats.”
Scaling NoSQL
A key concept in scaling out is "shared nothing." An ideal scale-out architecture is based on a shared-nothing architecture, where all nodes are peers and there is no single shared resource that serves as a bottleneck.
In addition to all nodes being independent, all the data must be evenly distributed or
partitioned across these nodes through a process called sharding. This is an important
process and can be accomplished either manually or through an automated system.
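One common way to spread data evenly across peer nodes is to place each record by a hash of its key. The sketch below assumes four hypothetical nodes and a hypothetical `shard_for` function; it illustrates the idea and is not any particular database's sharding algorithm.

```python
import hashlib
from collections import Counter

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical peer nodes

def shard_for(key, nodes=NODES):
    """Pick a node from a stable hash of the key (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

# Distribute 10,000 hypothetical record keys and count records per node.
placement = Counter(shard_for(f"user:{i}") for i in range(10_000))
```

Each node ends up with roughly a quarter of the keys. A simple modulo scheme like this one moves most keys when the number of nodes changes, a cost this sketch does not attempt to address.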
Manual vs. autosharding
To understand the differences between manual and autosharding, consider the registration
process at a typical conference. When you walk into the registration area, you may be asked
to go to the registration booth that corresponds to the first initial of your last name to check
in. For instance, A through D might check in at booth No. 1, E through H at booth No. 2, and
so on. This is an example of scaling via manual sharding.
With manual sharding, the registrations are distributed across a series of check-in booths.
This works because there is a well-defined, pre-determined scheme. There are no
guarantees, however, that the data or registrants will be distributed evenly or that the
booths can be easily expanded without reshuffling all the registrations.
Furthermore, the shutdown of a single booth (equivalent to a node failure) requires
reshuffling across the other booths.
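The conference analogy can be written down directly. In this minimal sketch the first two booth ranges follow the text; the remaining ranges, the `booth_for` helper, and the registrant names are assumptions added for illustration.

```python
from collections import Counter

# Fixed, pre-determined ranges of last-name initials, one per booth.
# Booths 1 and 2 follow the text; the others are assumed for illustration.
BOOTHS = {
    1: "ABCD", 2: "EFGH", 3: "IJKL", 4: "MNOP",
    5: "QRST", 6: "UVWXYZ",
}

def booth_for(last_name):
    """Return the booth whose letter range contains the first initial."""
    initial = last_name[0].upper()
    for booth, letters in BOOTHS.items():
        if initial in letters:
            return booth
    raise ValueError(f"no booth handles initial {initial!r}")

# Hypothetical registrants: the fixed scheme does nothing to balance load.
registrants = ["Smith", "Singh", "Sato", "Stone", "Adams", "Evans", "Tan"]
load = Counter(booth_for(name) for name in registrants)
```

With these names, booth 5 handles five of the seven registrants while booths 3, 4, and 6 handle none, which is the uneven distribution the text warns about.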
Querying large datasets
The contrast between manual and autosharding also emerges in querying large data sets.
Think of the registration analogy again. With manual sharding, if you want to find all
individuals at the conference with a last name beginning with "S," you would only have to
go to a single check-in booth to determine those names.
With autosharding, if you wanted to find the same information, you would have to go to
each booth to determine who checked in with the last name beginning with "S" to pull the
same data. Typically, map-reduce techniques are used to accomplish this.
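The "visit every booth" query can be sketched as a scatter-gather over shards, in the spirit of map-reduce. The cluster size, the checksum-based placement, and the function names below are assumptions for illustration only.

```python
from functools import reduce
import zlib

NUM_SHARDS = 3  # hypothetical cluster size

def shard_of(name):
    """Autosharding stand-in: place each name by a stable checksum."""
    return zlib.crc32(name.encode("utf-8")) % NUM_SHARDS

names = ["Smith", "Singh", "Adams", "Sato", "Evans", "Stone", "Tan"]
shards = [[] for _ in range(NUM_SHARDS)]
for name in names:
    shards[shard_of(name)].append(name)

def map_shard(shard, initial):
    """Map step: each shard filters its own local data."""
    return [n for n in shard if n.upper().startswith(initial)]

def find_by_initial(initial):
    """Scatter the query to every shard, then reduce the partial results."""
    partials = [map_shard(shard, initial) for shard in shards]
    return sorted(reduce(lambda acc, part: acc + part, partials, []))

s_names = find_by_initial("S")
```

Because placement ignores the initial, no single shard can answer the query alone; every shard contributes a partial result that is merged at the end.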
Challenges ahead
As an industry based on sharded scale-out architectures, we need to solve the access
patterns that require related objects to be retrieved, plus support querying data through
secondary indexing. While map-reduce techniques have been useful in building the first
generation of solutions, interesting challenges arise in building the next generation of
these innovations.
A large body of work has been accomplished in distributed algorithms that is relevant in
moving towards distributed query processing and query optimizers for large scale-out
architectures. The future of managing large data sets is likely to see significant
innovations in indexing schemes and query optimization.
Keeping Data Safe
Data breaches are a serious concern for any enterprise, especially as the frequency and
severity of security breaches are increasing. In fact, some researchers on the matter believe
that attacks will increase nearly 50% every year. Securing your database, then, should be a
top priority in database administration.
MongoDB, the leading NoSQL database according to monthly DB-Engines rankings, offers
Enterprise Server, the commercial version of MongoDB with advanced security features.
The Enterprise version meets strict security and compliance standards with Kerberos and
LDAP authentication, Red Hat Identity Management Certification, and auditing.
With these advanced security features, you can defend, detect, and control access to your data. MongoDB’s comprehensive security framework features include:
• Auditing. A native audit log lets you track access to, and operations performed on, the database, which helps with regulatory compliance.
• Encryption. MongoDB data can be encrypted on the network and on disk. Protection of data at rest is an integral feature of the database, thanks to the introduction of MongoDB’s encrypted storage engine.
As you evaluate different NoSQL database systems, you should give particular attention to the database’s security architecture. How it handles data security has serious implications for your business.
Visualizing NoSQL
You can use database triggers, alert actions, and external systems to analyse source data. The data may be mostly free text that mentions known subjects. These triggers and alert actions could highlight the text as being a Person or Organization, effectively tagging the content itself and the document it lies within.
A good example is the content in a news article. You can use a tool like Apache Stanbol or
OpenCalais to identify key terms. These tools may see “President Putin” and decide that
this relates to a person called Vladimir Putin, who is Russian, and is the current president
of the Russian Federation.
SEARCH AND ALERTING
Once you store your information, you may want to search it. Free‐text search is straightforward, but after performing entity extraction, you have more options. You can search specifically for a person named “Orange” (as in William of Orange) rather than search records that mention the term orange, which, of course, is also a colour and a fruit.
Doing so results in a more granular search. It also allows faceted navigation. If you go to
Amazon and search for Harry Potter, you will see categories for books, movies, games, and
so on. The product category is an example of a facet, which shows you an aspect of data
within the search results — that is, the most common values of each facet across all search
results, even those not on the current page.
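Facet counts of this kind are simple to compute once each result carries a category. The sketch below uses made-up search results and a hypothetical `facet_counts` helper; it shows only that facets are counted over all results rather than over the visible page.

```python
from collections import Counter

# Hypothetical search results for "Harry Potter", each tagged with a category.
results = [
    {"title": "Harry Potter and the Philosopher's Stone", "category": "Books"},
    {"title": "Harry Potter and the Chamber of Secrets", "category": "Books"},
    {"title": "Harry Potter: Complete Film Collection", "category": "Movies"},
    {"title": "LEGO Harry Potter", "category": "Games"},
    {"title": "Harry Potter and the Goblet of Fire", "category": "Books"},
]

def facet_counts(hits, field):
    """Count the values of one facet across ALL hits, not just one page."""
    return Counter(hit[field] for hit in hits if field in hit)

PAGE_SIZE = 2
first_page = results[:PAGE_SIZE]             # what the user sees
facets = facet_counts(results, "category")   # computed over every result
top_facets = facets.most_common()
```

Although the first page shows only two books, the facet list still reports the movie and the game found further down the result set.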
AGGREGATE FUNCTIONS
Once you find relevant information, you may want to dig deeper. Depending on the source,
you might ask how many countries have a GDP of greater than $400 billion, or what is the
average age of all the members in your family tree, or where do the most snake bites occur
in Australia. These examples illustrate how analytics are performed over a set of search
results. These are count, mean average, and geospatial heat map calculations, respectively.
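The three calculations named above can be sketched over a small result set. All figures in this example are invented for illustration, and the one-degree grid used for the heat map is an assumption; real systems offer their own aggregation operators.

```python
from collections import Counter
from statistics import mean

# Hypothetical search results; all figures below are made-up illustration data.
countries = [
    {"name": "A", "gdp_billion_usd": 950},
    {"name": "B", "gdp_billion_usd": 120},
    {"name": "C", "gdp_billion_usd": 430},
]
family_ages = [4, 31, 35, 62, 68]
bite_locations = [(-33.9, 151.2), (-33.4, 151.9), (-27.5, 153.0), (-12.4, 130.8)]

# Count: how many results satisfy a condition.
large_economies = sum(1 for c in countries if c["gdp_billion_usd"] > 400)

# Mean average over a numeric field.
average_age = mean(family_ages)

def cell(lat, lon):
    """Map a coordinate to the 1-degree grid cell that contains it."""
    return (int(lat // 1), int(lon // 1))

# Heat map: count points per grid cell.
heat = Counter(cell(lat, lon) for lat, lon in bite_locations)
```

The densest cell in `heat` is the one a heat map would colour most strongly.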
CHARTING AND BUSINESS INTELLIGENCE
The next obvious user‐interface extension involves charting and viewing table summaries
for live management information and historical business intelligence analysis.
Most NoSQL databases provide an easy‐to‐integrate REST API. This means you can plug in a range of application tiers, or even directly connect JavaScript applications to these databases. A variety of excellent charting libraries are available for JavaScript. You can even use the R ecosystem to create charts based on data held in these databases, after installing an appropriate database connector.
Some NoSQL databases even provide an ODBC or JDBC relational database plug‐in. Creating
indexes within a given record and showing them as a relational view is a neat way to turn
unstructured data in a NoSQL document database into data that can be analysed with a
business intelligence tool.
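The relational-view idea can be imitated with Python's built-in `sqlite3` module. This is a stand-in, not an ODBC or JDBC plug-in: the documents, the `doc_view` table, and the chosen columns are all hypothetical, and the point is only to show selected document fields exposed as rows that a SQL-based tool could query.

```python
import sqlite3

# Hypothetical documents; "views" is missing from one to show schema flexibility.
documents = [
    {"_id": 1, "title": "Q1 report", "region": "EMEA", "views": 120},
    {"_id": 2, "title": "Q2 report", "region": "APAC", "views": 80},
    {"_id": 3, "title": "Memo", "region": "EMEA"},
]
COLUMNS = ["_id", "title", "region", "views"]  # fields exposed as columns

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE doc_view (_id INTEGER, title TEXT, region TEXT, views INTEGER)"
)
# Project each document onto the chosen columns; absent fields become NULL.
conn.executemany(
    "INSERT INTO doc_view VALUES (?, ?, ?, ?)",
    [tuple(doc.get(col) for col in COLUMNS) for doc in documents],
)

# A BI-style query over the relational view.
rows = conn.execute(
    "SELECT region, COUNT(*), SUM(views) FROM doc_view "
    "GROUP BY region ORDER BY region"
).fetchall()
```

Fields that a document lacks simply appear as NULL in the view, so aggregate functions such as `SUM` skip them.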
Extending Data Layer
DataLayer.Mongo Find retrieves data from the MongoDB system. Its attributes can be used to form a "query" so that specific data is returned. It also supports a cursor, which allows result sets to be greater than 16 MB in size.
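The benefit of a cursor is that results arrive in batches rather than as one large response. The sketch below is a plain-Python illustration with a hypothetical `find_in_batches` generator and made-up data; it is not the DataLayer.Mongo API.

```python
def find_in_batches(collection, query, batch_size=2):
    """Yield matching documents in lists of at most batch_size items.

    A hypothetical stand-in for a cursor: the caller never has to hold
    the whole result set in memory at once.
    """
    batch = []
    for doc in collection:
        if all(doc.get(k) == v for k, v in query.items()):
            batch.append(doc)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # flush the final, partial batch

# Hypothetical data: five orders, three of them open.
orders = [{"id": i, "status": "open" if i % 2 else "closed"} for i in range(1, 6)]
batch_ids = [[doc["id"] for doc in batch]
             for batch in find_in_batches(orders, {"status": "open"})]
```

Each batch stays small regardless of how large the overall result set grows.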
Deciding Open Source versus commercial software
Open source software (OSS) is software whose source code is freely available on the Internet. The code can be copied, modified or deleted by other users and organizations. Because the software is open to the public, it is constantly updated, improved, and expanded as more people work on it.
Closed source software (CSS) is the opposite of OSS: software that uses proprietary, closely guarded code. Only the original authors of the software can access, copy, and alter it. With closed source software, you are not purchasing the software itself; you only pay to use it.
To better understand the characteristics of open source and closed source software, we compare five basic aspects: pricing, security, support, source availability, and usability.
Price Policy
Open source software is often referred to as free of cost. It can, however, have costs for extras such as assistance, additional services, or added functionality. Thus, you may still pay for a service with OSS.
Closed source software is usually paid software. The cost can vary depending on the complexity of the software. While the price can be higher, what you typically get in return is a better product, full support, functionality, and innovation. Moreover, most companies provide free trials to convince the purchaser that their software is the right fit.
Security
The question of security is very controversial, as each type of software has two sides. The code of open source software can be viewed, shared, and modified by the community, which means anyone can fix, upgrade, and test broken code. Bugs are fixed quickly, and the code is checked thoroughly after each release.
However, because it is openly available, the source code is also open for hackers to practice on.
By contrast, closed source software can be fixed only by the vendor. If something goes wrong with the software, you send a request and wait for an answer from the support team. Solving the problem can take much longer than with OSS.
When it comes to choosing the most secure software, the answer is that each has its pros and cons. Thus, the choice is often a challenge for firms that work in a particular industry.
Quality of Support
Comparing support for open source and closed source software, CSS clearly has the advantage. Its cost includes the option to contact support and, in most cases, get a response within one business day. The response is well organized and documented.
For open source software, no such option is provided. The only support options are forums, useful articles, and hired experts. Not surprisingly, such channels do not guarantee a high level of response.
Source Code Availability
Open source software provides an ability to change the source code without any
restrictions. Individual users can develop what they want and get benefits from innovation
developed by others within the user community. As the source code is easily accessible, it
enables the software developers to improve the already existing programs.
Closed source software is more restricted than open source software because the source code cannot be changed or viewed. However, this limitation may contribute to CSS security and reliability.
Usability
Usability is a sore point for open source software. User guides are written for developers rather than for lay users, and these manuals often fail to conform to common standards and structure.
For closed source software, usability is one of its merits. Documentation is usually well written and contains detailed instructions.
Security
Security is a major concern for enterprise IT infrastructures. Security in many NoSQL databases is weak: authentication and encryption are either almost nonexistent or weak when implemented.
Given these security problems, it helps to remember that NoSQL databases are still new technologies, and more security enhancements will be added in newer versions. Enterprise Cassandra packages provided by companies such as DataStax have more security enhancements, and hence are more secure and give companies the security they need.
DataStax Enterprise provides what Internet enterprises need to compete in today’s high-speed, always-on data economy. With in-memory computing capabilities, enterprise-level security, fast and powerful integrated analytics and enterprise search, visual management, and expert support, DataStax Enterprise is the leading distributed database choice for online applications that require fast performance with no downtime.
Self-Assessment Questions
1. In a collection that contains 100 post documents, what does the following
command do?
db.posts.find().skip(5).limit(5)
a) Skip and limit nullify each other, hence returning the first five documents
b) Skips the first five documents and returns the sixth document five times
c) Skips the first five documents and returns the next five
d) Limits the first five documents and then returns them in reverse order
Answer: c
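The behaviour in option c can be checked with an in-memory analogy. The sketch below emulates skip and limit with list slicing over 100 hypothetical post documents; the `skip_then_limit` helper is an assumption for illustration and does not call MongoDB.

```python
# 100 hypothetical post documents, numbered 1..100 in natural order.
posts = [{"_id": n, "title": f"Post {n}"} for n in range(1, 101)]

def skip_then_limit(docs, skip, limit):
    """Emulate find().skip(skip).limit(limit) on an in-memory list."""
    return docs[skip:skip + limit]

page = skip_then_limit(posts, skip=5, limit=5)
ids = [doc["_id"] for doc in page]
```

The resulting `ids` are the sixth to tenth documents, matching option c.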
a) ACID Transactions
b) Relationships between Collections (Primary Key – Foreign Key)
c) Journaling
d) Transaction Management
Answer: c
a) SQL Server
b) MongoDB
c) Cassandra
d) None of the mentioned
Answer: a
i. Documents can contain many different key-value pairs, or key-array pairs, or even
nested documents.
ii. MongoDB has official drivers for a variety of popular programming languages and development environments.
iii. When compared to relational databases, NoSQL databases are more scalable and
provide superior performance.
a) Only i
b) Only ii
c) Only i and ii
d) All i, ii and iii
a) SQL
b) Document databases
c) JSON
d) All of the mentioned
Answer: b
a) Cassandra
b) Riak
c) MongoDB
d) Redis
Answer: a
i. Non Relational databases require that schemas be defined before you can
add data.
ii. NoSQL databases are built to allow the insertion of data without a
predefined schema.
iii. NewSQL databases are built to allow the insertion of data without a
predefined schema.
a) Only i
b) Only ii
c) Only i and ii
d) All i, ii and iii
a) LAN
b) SAN
c) MAN
Answer: b
9. Most NoSQL databases support automatic ______, meaning that you get high availability and disaster recovery.
a) Processing
b) Scalability
c) Replication
Answer: c
10. Which one of the given options is the simplest type of NoSQL database?
a) Key-value
b) Wide-column
c) Document
d) All of the mentioned
Answer: a
11. ______ stores are used to store information about networks, such as social connections.
a) Key-value
b) Wide-column
c) Document
d) Graph
Answer: d
12. NoSQL databases are used mainly for handling large volumes of ______ data.
a) Unstructured
b) Structured
c) Semi-structured
Answer: a
a) Hive
b) MapReduce
c) Oozie
d) None of the mentioned
Answer: b
a) Primary
b) Secondary
c) Capped
Answer: c
15. MongoDB uses a ______ lock that allows concurrent read access to a database but exclusive write access to a single write operation.
a) Readers
b) Readers-writer
c) Writer
Answer: b
a) Collation
b) Collection
c) Heap
Answer: a
17. Which one of the given statements is not correct?
i) Non Relational databases require that schemas be defined before you can
add data
ii) NoSQL databases are built to allow the insertion of data without a
predefined schema
iii) NewSQL databases are built to allow the insertion of data without a
predefined schema
a) Only i
b) Only ii
c) Only i and ii
d) All i, ii and iii
18. Most NoSQL databases support automatic ______, meaning that you get high availability and disaster recovery.
a) Processing
b) Scalability
c) Replication
Answer: c
19. Which one of the given options is the simplest type of NoSQL database?
a) Key-value
b) Wide-column
c) Document
Answer: a
a) Performance
b) Availability
c) Scalability
Answer: b
Answer: c
22. ______ replicas maintain a copy of the data on the primary using built-in replication.
a) Primary
b) Secondary
c) Backup
Answer: b
a) Replication
b) Partitioning
c) Sharding
Answer: c
24. Which one of the given statements is not correct?
i) Each replica set member will act in the role of primary replica only.
ii) The primary replica performs all writes and reads by default.
iii) Secondaries can also perform read operations, but the data is eventually
consistent by default.
a) Only i
b) Only ii
c) Only i and ii
d) All i, ii and iii
Answer: a
25. MongoDB can be used as a ______, taking advantage of load balancing and data replication features over multiple machines for storing files.
a) AMS
b) CMS
c) File system
Answer: c