NoSQL Module 1
22MCA335
Module 1
Introduction to NoSQL
A NoSQL database is a non-relational data management system that does not require a fixed
schema.
The major purpose of using a NoSQL database is for distributed data stores with huge data
storage needs.
For example, companies like Twitter, Facebook and Google collect terabytes of user data
every single day.
A traditional RDBMS uses SQL syntax to store and retrieve data for further insights.
A NoSQL database system, in contrast, encompasses a wide range of database technologies that
can store structured, semi-structured, unstructured and polymorphic data.
Definition of NoSQL
NoSQL is a type of database management system (DBMS) that is designed to handle and
store large volumes of data, whether that data is structured, semi-structured, unstructured
or polymorphic.
NoSQL databases (aka "not only SQL") are non-tabular databases and store data differently
than relational tables.
Unlike traditional relational databases that use tables with pre-defined schemas to store
data, NoSQL databases use flexible data models that can adapt to changes in data structures
and are capable of scaling horizontally to handle growing amounts of data.
NoSQL History
The term NoSQL was first used in 1998 by Carlo Strozzi as the name of his lightweight,
open-source “relational” database that did not use SQL.
The concept was later adopted and popularized by the large internet companies often grouped
as GAFAM (Google, Apple, Facebook, Amazon, and Microsoft), in particular Google, Facebook
and Amazon, which faced volumes of data that relational databases had become too slow to handle.
The graph database Neo4j traces its origins to the year 2000. Then it was the turn of
Google Bigtable, in 2004, and CouchDB in 2005. The history of NoSQL databases was also
marked by Amazon Dynamo in 2007.
Then, in 2008, Facebook open-sourced the non-relational database it used
internally: Cassandra. This tool became the reference for NoSQL databases, and put the
term NoSQL back in the spotlight by giving it its current meaning and popularity.
Why NoSQL?
The concept of NoSQL databases became popular with Internet giants like Google,
Facebook, Amazon, etc. who deal with huge volumes of data. The system response time
becomes slow when you use RDBMS for massive volumes of data.
To resolve this problem, we could “scale up” our systems by upgrading our existing
hardware. This process is expensive.
The alternative for this issue is to distribute database load on multiple hosts whenever the
load increases. This method is known as “scaling out.”
Handling large volumes of data at scale seamlessly. NoSQL databases can handle large
amounts of data by spreading it across multiple servers.
Enabling rapid development.
Key-value stores: Data is stored in key/value pairs. These stores are designed to handle
large amounts of data and heavy load.
➢ Key-value pair storage databases store data as a hash table where each key is unique, and
the value can be JSON, a BLOB (Binary Large Object), a string, etc.
➢ Column-based databases deliver high performance on aggregation queries like SUM, COUNT,
AVG, MIN etc., as the data is readily available in a column.
➢ Column-based NoSQL databases are widely used to manage data warehouses, business
intelligence, CRM, and library card catalogs.
Graph databases: A graph database stores entities as well as the relations amongst those
entities.
➢ Each entity is stored as a node, with relationships stored as edges. An edge gives a
relationship between nodes.
➢ Graph-based databases are mostly used for social networks, logistics, and spatial data.
Other Types
5. Object-Oriented Database: Objects, classes, and methods are used to organize and store
data. This model is particularly suitable for applications with complex data structures.
Data Models used are
1. Document Store:
2. Key-Value Store:
3. Wide-Column Store:
4. Graph Database:
5. Object-Oriented Database:
6. Time-Series Database:
Features of NoSQL
➢ Dynamic schema: NoSQL databases do not have a fixed schema and can accommodate
changing data structures without the need for migrations or schema alterations.
➢ Horizontal scalability: NoSQL databases are designed to scale out by adding more nodes to a
database cluster, making them well-suited for handling large amounts of data and high
levels of traffic.
➢ Key-value-based: Some NoSQL databases, such as Redis, use a key-value data model, where
data is stored as a collection of key-value pairs.
➢ Column-based: Some NoSQL databases, such as Cassandra, use a column-based data model,
where data is organized into columns instead of rows.
➢ Distributed and high availability: NoSQL databases are often designed to be highly available
and to automatically handle node failures and data replication across multiple nodes in a
database cluster.
➢ Flexibility: NoSQL databases allow developers to store and retrieve data in a flexible and
dynamic manner, with support for multiple data types and changing data structures.
➢ Performance: NoSQL databases are optimized for high performance and can handle a high
volume of reads and writes, making them suitable for big data and real-time applications.
Advantages of NoSQL
➢ High scalability: NoSQL databases use sharding for horizontal scaling. Sharding means
partitioning the data and placing it on multiple machines in such a way that the order of
the data is preserved.
➢ Vertical scaling means adding more resources to the existing machine, whereas
horizontal scaling means adding more machines to handle the data. Vertical scaling
is not that easy to implement, but horizontal scaling is easy to implement.
➢ Examples of horizontally scaling databases are MongoDB, Cassandra, etc. NoSQL can
handle a huge amount of data because of this scalability: as the data grows, NoSQL
scales itself automatically to handle that data in an efficient manner.
➢ High availability: The auto-replication feature in NoSQL databases makes them highly
available, because in case of any failure the data is restored from a replica to the
previous consistent state.
➢ Scalability: NoSQL databases are highly scalable, which means that they can handle large
amounts of data and traffic with ease. This makes them a good fit for applications that need
to handle large amounts of data or traffic.
➢ Performance: NoSQL databases are designed to handle large amounts of data and traffic,
which means that they can offer improved performance compared to traditional relational
databases.
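The sharding idea described above, partitioning data across machines while preserving key order, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the split points "h" and "p" and the sample keys are hypothetical, and real databases choose and rebalance their split points automatically.

```python
import bisect

# Hypothetical split points: keys below "h" go to shard 0,
# keys from "h" up to (but not including) "p" go to shard 1,
# and everything else goes to shard 2.
SPLIT_POINTS = ["h", "p"]


def shard_for_key(key):
    """Return the index of the shard whose key range contains the key."""
    return bisect.bisect_right(SPLIT_POINTS, key)


def shard_keys(keys):
    """Place every key on its shard, keeping keys sorted within each shard."""
    shards = [[] for _ in range(len(SPLIT_POINTS) + 1)]
    for key in sorted(keys):
        shards[shard_for_key(key)].append(key)
    return shards


sample_keys = ["mango", "apple", "zebra", "kiwi", "banana", "peach"]
shards = shard_keys(sample_keys)
for index, shard in enumerate(shards):
    print("shard", index, shard)
```

Reading the shards one after another yields the keys in sorted order, which is what makes range scans across shards possible in this scheme.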
Disadvantages of NoSQL
➢ Lack of standardization: There are many different types of NoSQL databases, each with its
own unique strengths and weaknesses. This lack of standardization can make it difficult to
choose the right database for a specific application.
➢ Lack of ACID compliance: Many NoSQL databases are not fully ACID-compliant, which means
that they may not guarantee the consistency, integrity, and durability of data. This can be a
drawback for applications that require strong data consistency guarantees.
➢ Lack of support for complex queries: Many NoSQL databases are not designed to handle
complex queries, which means that they are not a good fit for applications that require
complex data analysis or reporting.
➢ Lack of maturity: NoSQL databases are relatively new and lack the maturity of traditional
relational databases. This can make them less reliable and less secure than traditional
databases.
MongoDB:
Cassandra:
Couchbase:
Neo4j:
CHALLENGES OF RDBMS
RDBMS builds on a prerequisite that the properties of the data can be defined up front and
that its interrelationships are well established and systematically referenced.
It also assumes that indexes can be consistently defined on data sets and that such indexes
can be uniformly leveraged for faster querying.
Unfortunately, RDBMS starts to show signs of giving way as soon as these assumptions don’t
hold true.
RDBMS can certainly deal with some irregularities and lack of structure but in the context of
massive sparse data sets with loosely defined structures, RDBMS appears a forced fit.
With massive data sets the typical storage mechanisms and access methods also get
stretched.
Denormalizing tables, dropping constraints, and relaxing transactional guarantees can help an
RDBMS scale, but after these modifications an RDBMS starts resembling a NoSQL product.
NoSQL alleviates the problems that RDBMS imposes and makes it easy to work with large
sparse data, but in turn takes away the power of transactional integrity and flexible indexing
and querying.
Gigabyte (GB) — 10 ^ 9
Terabyte (TB) — 10 ^ 12
Petabyte (PB) — 10 ^ 15
Yottabyte (YB) — 10 ^ 24
Scalability
Scalability can be achieved either by provisioning a large and powerful resource to meet the
additional demands or it can be achieved by relying on a cluster of ordinary machines to
work as a unit.
Provisioning supercomputers with many CPU cores and large amounts of directly attached
storage is a typical vertical scaling solution.
Such vertical scaling options are typically expensive and proprietary. The alternative to
vertical scalability is horizontal scalability.
Horizontal scalability involves a cluster of commodity systems where the cluster scales as
load increases.
Horizontal scalability typically involves adding additional nodes to serve additional load.
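As a rough illustration of how a cluster can spread load over its nodes, the following Python sketch assigns each key to a node by hashing the key and taking the result modulo the number of nodes. The node names and keys are hypothetical, and the modulo scheme is a deliberately simple assumption rather than what any particular product does.

```python
import hashlib


def node_for_key(key, nodes):
    """Pick a node by hashing the key and taking it modulo the node count."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys, nodes):
    """Return a mapping of node name to the list of keys placed on it."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for_key(key, nodes)].append(key)
    return placement


# Hypothetical workload: 1000 user keys spread over three, then four nodes.
keys = ["user:%d" % i for i in range(1000)]
three_nodes = distribute(keys, ["node-a", "node-b", "node-c"])
four_nodes = distribute(keys, ["node-a", "node-b", "node-c", "node-d"])

for label, placement in (("3 nodes", three_nodes), ("4 nodes", four_nodes)):
    print(label, {node: len(ks) for node, ks in placement.items()})
```

Adding the fourth node lowers the share of keys each node holds, but with plain modulo hashing many keys also move to a different node. Distributed stores commonly use consistent hashing to limit that movement.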
The MapReduce model provides one of the best available methods to process large-scale
data on a horizontal cluster of machines.
MapReduce is a parallel programming model that allows distributed processing of large data
sets on a cluster of computers.
MapReduce derives its ideas and inspiration from concepts in the world of functional
programming.
A map function applies an operation to each element of a list. For example, applying a
multiply-by-two function to the list [1, 2, 3, 4] would generate another list, [2, 4, 6, 8].
Like the map function, functional programming has a concept of a reduce function,
commonly known as a fold function.
Reduce or fold function applies a function on all elements of a data structure, such as a list,
and produces a single result or output.
So applying a reduce function-like summation on the list generated out of the map function,
that is, [2, 4, 6, 8], would generate an output equal to 20.
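The list example above can be reproduced directly with Python's built-in map and functools.reduce. This is a minimal sketch of the functional idea only; it runs on one machine and says nothing about distribution.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# Map: apply the multiply-by-two function to every element.
doubled = list(map(lambda x: x * 2, numbers))

# Reduce (fold): combine all elements into a single value by summation.
total = reduce(lambda acc, x: acc + x, doubled, 0)

print(doubled)  # [2, 4, 6, 8]
print(total)    # 20
```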
This same simple idea of map and reduce has been extended to work on large data sets.
The map function applies a function on every key/value pair in the collection and generates
a new collection.
Then the reduce function works on the new generated collection and applies an aggregate
function to compute a final output.
A simple map function on this collection could get the names of all those who reside in a
particular zip code.
Now a reduce function could work on this output to simply count the number of people who
belong to a particular zip code. The final output then would be as follows:
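A minimal Python sketch of this map and reduce step is shown here. The records are hypothetical sample data made up for illustration, and the grouping step in the middle stands in for the shuffle that a real MapReduce framework performs between the two phases.

```python
from collections import defaultdict

# Hypothetical sample records; names and zip codes are illustrative only.
people = [
    {"name": "Person A", "zip_code": "10001"},
    {"name": "Person B", "zip_code": "10001"},
    {"name": "Person C", "zip_code": "94129"},
    {"name": "Person D", "zip_code": "33101"},
    {"name": "Person E", "zip_code": "94129"},
]


def map_person(person):
    """Map step: emit a (zip_code, name) pair for one record."""
    return (person["zip_code"], person["name"])


def reduce_count(zip_code, names):
    """Reduce step: count the names collected for one zip code."""
    return (zip_code, len(names))


# Map every record, then group the emitted pairs by key.
grouped = defaultdict(list)
for zip_code, name in map(map_person, people):
    grouped[zip_code].append(name)

# Reduce each group to a single count.
counts = dict(reduce_count(z, names) for z, names in grouped.items())
print(counts)
```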
The column-oriented storage allows data to be stored effectively.
It avoids consuming space when storing nulls by simply not storing a column when a value
doesn’t exist for that column.
Each unit of data can be thought of as a set of key/value pairs, where the unit itself is
identified with the help of a primary key.
Bigtable and its clones tend to call this primary key the row-key.
The units of data are sorted and ordered on the basis of the row-key.
Consider a simple table of values that keeps information about a set of people.
Such a table could have columns like first_name, last_name, occupation, zip_code, and
gender. A person’s information in this table could be as follows:
first_name: John
last_name: Doe
zip_code: 10001
gender: male
first_name: Jane
zip_code: 94303
The profile column-family has values only for the data point with row-key 1 so it stores only the
following:
The column-family acts as a key for the columns it contains and the row-key acts as the key
for the whole data set.
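The layout described above can be pictured as nested dictionaries: row-key, then column-family, then column. The sketch below uses the two people from the example. The column-family names "name" and "location" are assumptions made for illustration; only "profile" is named in the text.

```python
# row-key -> column-family -> column -> value
# Missing values are simply absent, so no space is spent on nulls.
table = {
    "1": {
        "name": {"first_name": "John", "last_name": "Doe"},
        "location": {"zip_code": "10001"},
        "profile": {"gender": "male"},
    },
    "2": {
        "name": {"first_name": "Jane"},
        "location": {"zip_code": "94303"},
    },
}


def get_value(row_key, family, column):
    """Look up one cell; return None when the column was never stored."""
    return table.get(row_key, {}).get(family, {}).get(column)


def family_view(family):
    """Return only the rows that actually hold data in this column-family."""
    return {row: fams[family] for row, fams in table.items() if family in fams}


print(get_value("1", "name", "last_name"))
print(get_value("2", "name", "last_name"))
print(family_view("profile"))
```

The profile view contains an entry for row-key 1 only, since nothing was stored in that column-family for row-key 2.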
KEY/VALUE STORES
The key of a key/value pair is a unique value in the set and can be easily looked up to access
the data.
The data is fetched by a unique key or a number of unique keys to retrieve the associated
value with each key.
The values can be simple data types like strings and numbers or complex objects.
A key-value database is defined by the fact that it allows programs or users of programs to
retrieve data by keys, which are essentially names, or identifiers, that point to some stored
value.
Common features
Retrieving a value (if there is one) stored and associated with a given key
Deleting the value (if there is one) stored and associated with a given key
Setting, updating, and replacing the value (if there is one) associated with a given key
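The three common operations listed above can be sketched with a small in-memory class in Python. The class name and method names are hypothetical, chosen only for illustration; real key/value stores add persistence, expiry, and networking on top of this idea.

```python
class KeyValueStore:
    """A toy in-memory key/value store backed by a dictionary."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        """Retrieve the value for a key, or None if there is none."""
        return self._data.get(key)

    def put(self, key, value):
        """Set, update, or replace the value associated with a key."""
        self._data[key] = value

    def delete(self, key):
        """Delete the value for a key; return True if something was removed."""
        return self._data.pop(key, None) is not None


store = KeyValueStore()
store.put("user:1", {"name": "John Doe", "zip": "10001"})  # complex object
store.put("visits", 1)                                     # simple value
store.put("visits", 2)                                     # replaces the old value
print(store.get("visits"))
print(store.delete("user:1"), store.get("user:1"))
```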
DOCUMENT DATABASES
The word document in document databases connotes loosely structured sets of key/value
pairs in documents, typically JSON (JavaScript Object Notation), and not documents or
spreadsheets (though these could be stored too).
Document databases treat a document as a whole and avoid splitting a document into its
constituent name/value pairs.
At a collection level, this allows for putting together a diverse set of documents into a single
collection.
Document databases allow indexing of documents on the basis of not only their primary
identifiers but also their properties.
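To make the last two points concrete, here is a small Python sketch of a collection that holds documents with different shapes, plus an index built on one of their properties. The document ids and the "occupation" field are hypothetical, and a plain dictionary stands in for what a real document database maintains internally.

```python
from collections import defaultdict

# A single collection holding documents with different sets of fields.
collection = {
    "doc1": {"title": "an interesting blog post", "author": "a blogger"},
    "doc2": {"name": "John Doe", "zip": 10001},
    "doc3": {"name": "Lee Chang", "zip": 94129, "occupation": "engineer"},
}


def build_index(documents, field):
    """Index documents by the value of a property, skipping those without it."""
    index = defaultdict(list)
    for doc_id, doc in documents.items():
        if field in doc:
            index[doc[field]].append(doc_id)
    return dict(index)


zip_index = build_index(collection, "zip")
print(zip_index)

# Look up documents by property value instead of by primary identifier.
matching = [collection[doc_id] for doc_id in zip_index.get(10001, [])]
print(matching)
```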
GRAPH DATABASES
A graph database is a database designed to treat the relationships between data as equally
important to the data itself.
Rather than forcing data into a predefined model, the data is stored as we would first draw
it out, showing how each individual entity connects with or is related to others.
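A graph of this kind can be sketched in Python as a set of nodes plus a list of labeled edges. The node names and relationship types below are hypothetical examples, and the linear scan over edges is for clarity only; graph databases store relationships so that such traversals are fast.

```python
# Nodes carry properties; edges are (source, relationship, target) triples.
nodes = {
    "alice": {"kind": "person"},
    "bob": {"kind": "person"},
    "carol": {"kind": "person"},
}
edges = [
    ("alice", "FOLLOWS", "bob"),
    ("bob", "FOLLOWS", "carol"),
    ("carol", "FOLLOWS", "alice"),
]


def neighbors(node, relationship):
    """Return the nodes reached from a node over edges of one relationship type."""
    return [dst for src, rel, dst in edges if src == node and rel == relationship]


def two_hops(node, relationship):
    """Follow the same relationship twice, e.g. who the people I follow are following."""
    return [far for near in neighbors(node, relationship)
            for far in neighbors(near, relationship)]


print(neighbors("alice", "FOLLOWS"))
print(two_hops("alice", "FOLLOWS"))
```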
Exploring NoSQL
Location-based services are gaining prominence as local businesses are trying to connect
with users who are in the neighborhood and large companies are trying to customize their
online experience and offerings based on where people are stationed.
The location preferences are maintained by storing a user identifier and a zip code.
Data points like “John Doe, 10001,” “Lee Chang, 94129,” “Jenny Gonzalez 33101,” and
“Srinivas Shastri, 02101” will need to be maintained.
To store such data in a flexible and extensible way, this example uses a non-relational
database product named MongoDB.
Assuming you have installed MongoDB successfully, start the server and connect to it.
You can start a MongoDB server by running the mongod program within the bin folder of the
distribution. Distributions vary according to the underlying environment, which can be
Windows, Mac OS X, or a Linux variant, but in each case the server program has the same
name and it resides in a folder named bin in the distribution.
The simplest way to connect to the MongoDB server is to use the JavaScript shell available
with the distribution. Simply run mongo from your command-line interface. The mongo
JavaScript shell command is also found in the bin folder.
Now that the database server is up and running, use the mongo JavaScript shell to connect
to it. The initial output of the shell should be as follows:
PS C:\applications\mongodb-win32-x86_64-1.8.1> bin/mongo
By default, the mongo shell connects to the “test” database available on localhost.
> help
To start out, create a preferences database called prefs. After you create it, store tuples (or
pairs) of usernames and zip codes in a collection, named location, within this database. Then
store the available data sets in this defined structure.
use prefs
db.location.save(w);
db.location.save(x);
db.location.save(y);
db.location.save(z);
> db.location.find()
"zip" : 10001 }
"zip" : 94129 }
"zip" : 33101 }
"zip" : 1089 }
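The last row shows 1089 where the data point was 02101. A likely explanation, offered here as an assumption rather than something the listing confirms, is that the shell's JavaScript engine read the numeric literal 02101, with its leading zero, as an octal number. The arithmetic can be checked in Python:

```python
# Interpret the digits 02101 in base 8:
# 2*512 + 1*64 + 0*8 + 1*1 = 1089
as_octal = int("02101", 8)
print(as_octal)

# Storing zip codes as strings avoids this and keeps the leading zero.
zip_as_string = "02101"
print(zip_as_string)
```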
> db.location.save(a);
> db.location.save(b);
To get a list of only those people who are in the 10001 zip code, you could query as follows:
"zip" : 10001 }
For starters, list the existing keyspaces in your Cassandra server. Go to the cassandra-cli,
type the show keyspaces command, and press Enter.
To include NoSQL solutions into the application stack, it’s extremely important that robust
and flexible language bindings allow access and manipulation of these stores from some of
the most popular languages.
MongoDB’s Drivers
In this section, MongoDB drivers for four different languages, Java, PHP, Ruby, and Python,
are introduced in the order in which they are listed.
First, download the latest distribution of the MongoDB Java driver from the MongoDB
GitHub code repository at http://github.com/mongodb. All officially supported drivers are
hosted in this code repository. At the time of writing, the latest version of the driver was
2.5.2, so the downloaded jar file is named mongo-2.5.2.jar.
Once again start the local MongoDB server by running bin/mongod from within the
MongoDB distribution. Now use a Java program to connect to this server. Look at Listing 2-2
for a sample java program that connects to MongoDB, lists all the collections in the prefs
database, and then lists all the documents within the location collection.
import java.net.UnknownHostException;
import java.util.Set;

import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.Mongo;
import com.mongodb.MongoException;

public class ConnectToMongoDB {
    Mongo m = null;
    DB db;

    // Open a connection to the MongoDB server on localhost, default port 27017.
    public void connect() {
        try {
            m = new Mongo("localhost", 27017);
        } catch (UnknownHostException e) {
            e.printStackTrace();
        } catch (MongoException e) {
            e.printStackTrace();
        }
    }

    // Print the names of all collections in the given database.
    public void listAllCollections(String dbName) {
        if (m != null) {
            db = m.getDB(dbName);
            Set<String> collections = db.getCollectionNames();
            for (String s : collections) {
                System.out.println(s);
            }
        }
    }

    // Print every document in the location collection of the prefs database.
    public void listLocationCollectionDocuments() {
        if (m != null) {
            db = m.getDB("prefs");
            DBCollection collection = db.getCollection("location");
            DBCursor cur = collection.find();
            while (cur.hasNext()) {
                System.out.println(cur.next());
            }
        } else {
            System.out.println("Please connect to MongoDB and then fetch the collection");
        }
    }

    public static void main(String[] args) {
        ConnectToMongoDB connectToMongoDB = new ConnectToMongoDB();
        connectToMongoDB.connect();
        connectToMongoDB.listAllCollections("prefs");
        connectToMongoDB.listLocationCollectionDocuments();
    }
}
MongoDB PHP Driver
First, download the PHP driver from the MongoDB GitHub code repository and configure the driver
to work with your local PHP environment.
A sample PHP program that connects to a local MongoDB server and lists the documents in the
location collection in the prefs database is as follows:
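// Assumed: the connection line is missing from this listing; the legacy driver's Mongo class is used.
$connection = new Mongo("localhost:27017");
$collection = $connection->prefs->location;
$cursor = $collection->find();
foreach ($cursor as $id => $value) {
    echo "$id: ";
    var_dump($value);
}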
To get ready to connect to MongoDB from Ruby, get at least the mongo and bson gems. You
can install the mongo gem as follows:
gem install mongo
Then connect to the prefs database and get the location collection:
db = Mongo::Connection.new("localhost", 27017).db("prefs")
locationCollection = db.collection("location")
The easiest way to install the Python driver is to run easy_install pymongo.
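from pymongo import MongoClient  # assumed import; missing from the listing
connection = MongoClient("localhost", 27017)  # assumed local server on the default port
db = connection.prefs
collection = db.location
for doc in collection.find():  # assumed loop; the listing breaks off at "doc"
    print(doc)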
Thrift
Thrift is an interface definition language and binary communication protocol used for
defining and creating services for numerous programming languages. It was developed at
Facebook for "scalable cross-language services development".
Apache Cassandra uses the Thrift interface to provide a layer of abstraction to interact with
the column data store.
Import the Cassandra class from the cassandra library. This likely refers to a Python client for Apache
Cassandra, a NoSQL database.
This includes various Thrift-generated types used in communication with Cassandra, such as
ConsistencyLevel and ColumnParent.
import time:
Import the standard Python time module, which provides various time-related functions.
Querying CarDataStore keyspace using the Thrift interface
To explain the different ways of data storage and access in NoSQL, I first classify them into the
following types:
➢ Key/value store (in-memory, persistent and even ordered) — Redis and BerkeleyDB
Storing Data In and Accessing Data from MongoDB
The Apache web server Combined Log Format captures the following request and response
attributes for a web server:
➢ IP address of the client — This value could be the IP address of the proxy if the client
requests the resource via a proxy.
➢ Identity of the client — Usually this is not a reliable piece of information and often is not
recorded.
➢ Referrer — Typically, the URI or the URL that links to a web page or resource.
➢ User-agent — The client application, usually the program or device that accesses a web page
or resource.
➢ Time when the request was received — Includes date and time, along with timezone.
➢ The request itself — This can be further broken down into four different pieces: method
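As a rough sketch of how one log line in this format could be turned into a document ready for storage, the following Python code parses a Combined Log Format line with a regular expression. The field names follow those visible in the stored document in this section; the sample line is assembled from the same values, and the pattern is an assumption that covers well-formed lines only.

```python
import re

# One named group per attribute of the Combined Log Format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<http_user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request_line>[^"]*)" '
    r'(?P<http_response_code>\d{3}) (?P<http_response_size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Break the request itself into method, URL and protocol version.
    parts = record["request_line"].split()
    if len(parts) == 3:
        record["http_method"], record["url"], record["http_vers"] = parts
    return record


sample = ('123.125.66.32 - - [09/Oct/2010:07:30:01 -0600] '
          '"GET /hi/tag/2009/ HTTP/1.1" 200 13308 "-" '
          '"Baiduspider+(+http://www.baidu.com/search/spider.htm)"')
record = parse_line(sample)
for key in sorted(record):
    print(key, "=", record[key])
```

Each resulting dictionary maps naturally onto one document in a collection, with all values kept as strings.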
Querying MongoDB
This prints the data set in a nice presentable format like this:
{
"_id" : ObjectId("4cb164b75a91870732000000"),
"http_vers" : "HTTP/1.1",
"ident" : "-",
"http_response_code" : "200",
"referrer" : "-",
"url" : "/hi/tag/2009/",
"ip" : "123.125.66.32",
"time" : "09/Oct/2010:07:30:01 -0600",
"http_response_size" : "13308",
"http_method" : "GET",
"user_agent" : "Baiduspider+(+http://www.baidu.com/search/spider.htm)",
"http_user" : "-",
"request_line" : "GET /hi/tag/2009/ HTTP/1.1"
}
The method db.logdata.find() returns all the records in the logdata collection so you have
the entire set to iterate over using the cursor.
The previous code sample simply iterates through the elements of the cursor and prints
them out.
The printjson function prints out the elements in a nice JSON-style formatting for easy
readability.
Redis supports a few different data structures, namely:
Lists, or more specifically, linked lists — Collections that maintain an indexed list of elements
in a particular order. With linked lists, access to either of the end points is fast irrespective of
the number of elements in the list.
Sets — Collections that store unique elements and are unordered.
Sorted sets — Collections that store sorted sets of elements.
Hashes — Collections that store key/value pairs.
Strings — Collections of characters.
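To get a feel for how these structures behave, the sketch below uses plain Python containers as stand-ins. It does not talk to Redis at all; the comments name the Redis commands each step loosely resembles, and the sorted-set stand-in is simplified in that it does not enforce unique members.

```python
import bisect
from collections import deque

# List: cheap pushes and pops at both ends.
recent = deque(["second", "third"])
recent.appendleft("first")   # loosely like LPUSH
recent.append("fourth")      # loosely like RPUSH

# Set: unique, unordered members.
tags = set()
for tag in ["nosql", "redis", "nosql"]:
    tags.add(tag)            # loosely like SADD; the duplicate is ignored

# Sorted set: members kept in order of an associated score.
ranking = []
for score, member in [(3, "carol"), (1, "alice"), (2, "bob")]:
    bisect.insort(ranking, (score, member))  # loosely like ZADD

# Hash: field/value pairs stored under one key.
book = {"title": "The Omnivore's Dilemma", "author": "Michael Pollan"}

# String: a single value.
greeting = "hello"

print(list(recent))
print(sorted(tags))
print([member for _, member in ranking])
print(book["author"], greeting)
```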
Querying Redis
Continuing with the redis-cli session, you can first list the title and author of member 1,
identified by the id 1, of the set of books as follows:
$ ./redis-cli smembers books:1:title
1. "The Omnivore\xe2\x80\x99s Dilemma"
$ ./redis-cli smembers books:1:author
1. "Michael Pollan"
To store this data, I intend to create a collection named blogposts and save pieces of information
into two categories, post and multimedia. So, a possible entry, in JSON-like format, could be as
follows:
{
  "post" : {
    "title": "an interesting blog post",
    "author": "a blogger",
    "body": "interesting content",
  },
  "multimedia": {
    "header": header.png,
    "body": body.mpeg,
  },
}
Querying HBase