3.3 NOSQL DATA ARCHITECTURE PATTERNS
3.3.1 Key-Value Store
The simplest way to implement a schema-less data store is to use key-value pairs.
The data store characteristics are high performance, scalability and flexibility. Data retrieval
is fast in a key-value pairs data store. A simple string, called the key, maps to a large data string
or BLOB (Basic Large Object). Key-value store accesses use a primary key for accessing the
values. Therefore, the store can be easily scaled up for very large data. The concept is similar
to a hash table where a unique key points to a particular item(s) of data. Figure 3.4 shows the key-
value pairs architectural pattern and an example of a students' database as key-value pairs.
Key        Value
"ashish"   "Category: Student; Class: B.Tech.; Semester: VII; Branch: Engineering; Mobile: 3999912345"
"mayuri"   "Category: Student; Class: M.Tech.; Mobile: 8888823456"
Key1       Values1
Key2       Values2
...
KeyN-1     ValuesN-1
KeyN       ValuesN

The number of key-value pairs, N, can be a very large number.
Advantages of a key-value store are as follows:
1. Data store can store any data type in a value field. The key-value system
stores the information as a BLOB of data (such as text, hypertext, images, video and
audio) and returns the same BLOB when the data is retrieved. Storage is like an English
dictionary: a query for a word retrieves its meanings, usages and different forms as a single
item in the dictionary. Similarly, querying for a key retrieves the values.
2. A query just requests the values and returns the values as a single item. Values can
be of any data type.
3. Key-value store is eventually consistent.
4. Key-value data store may be hierarchical or may be an ordered key-value store.
5. Returned values on queries can be converted into lists, table columns, data-frame
fields and columns.
6. Key-value stores have (i) scalability, (ii) reliability, (iii) portability and (iv) low operational cost.
7. The key can be synthetic or auto-generated. The key is flexible and can be represented in
many formats: (i) artificially generated strings created from a hash of a value, (ii) logical
path names to images or files, (iii) REST web-service calls (request-response cycles), and (iv)
SQL queries.
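The pattern above can be sketched with a small dict-backed store in Python. This is purely illustrative (the class and method names are ours, not from any particular product): the key is a simple string and the value is an opaque BLOB that is stored and returned as a single item.

```python
# Minimal sketch of the key-value pattern: a string key maps to an
# opaque value (BLOB), and a query on a key returns the whole value.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key: str, value) -> None:
        # The value may be text, hypertext, image bytes, audio, etc.
        self._data[key] = value

    def get(self, key: str):
        # Returns the whole BLOB as a single item, or None if absent.
        return self._data.get(key)

store = KeyValueStore()
store.put("ashish", "Category: Student; Class: B.Tech.; Semester: VII")
print(store.get("ashish"))
```

Note that, as the limitations below describe, such a store offers no way to filter on the contents of the values; only whole-value lookup by key is supported.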
Limitations of the key-value store architectural pattern are:
1. No indexes are maintained on values; thus a subset of values is not searchable.
2. Key-value store does not provide traditional database capabilities, such as atomicity of transactions,
or consistency when multiple transactions are executed simultaneously. The application needs to
implement such capabilities.
3. Maintaining unique values as keys may become more difficult when the volume of data increases.
One cannot retrieve a single result when a key-value pair is not uniquely identified.
4. Queries cannot be performed on individual values. No clause, like 'where' in a relational
database, is usable for filtering a result set.
Table 3.2 Traditional relational data model vs. the key-value store model

Traditional relational model                    Key-value store model
Result set based on row values                  Queries return a single item
Values of rows for large datasets are indexed   No indexes on values
Same data type values in columns                Any data type values
Typical uses of key-value store are:
(i) Image store,
(ii) Document or file store,
(iii) Lookup table, and
(iv) Query-cache.
Riak is an open-source data store written in the Erlang language. It is a key-value data store system.
Four ways for handling big data problems
Big Data Analytics (18CS72)
Following are the ways:
1. Evenly distribute the data on a cluster using hash rings: Consistent hashing refers to a
process where the datasets in a collection are distributed using a hashing algorithm which generates
the pointer for a collection. Using only the hash of the Collection_ID, a client
node determines the data location in the cluster. A hash ring refers to a map of hashes with
locations. The client, resource manager or scripts use the hash ring for data searches and Big
Data solutions. The ring enables the consistent assignment and usage of a dataset to a
specific processor.
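The hash-ring idea can be sketched in Python as follows. This is a minimal illustration (the node names are hypothetical, and a production ring would also use virtual nodes for smoother balance): each node owns a position on the ring, and a dataset is assigned to the first node clockwise from the hash of its Collection_ID.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    """Map any string onto a fixed position of the hash ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring: a map of hashes to node locations."""
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)
        self.keys = [h for h, _ in self.ring]

    def node_for(self, collection_id: str) -> str:
        """Locate the node clockwise from the hash of the collection id."""
        h = ring_hash(collection_id)
        i = bisect.bisect(self.keys, h) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node1", "node2", "node3"])
print(ring.node_for("students_collection"))
```

Because only the hash of the Collection_ID is needed, every client computes the same data location without consulting a central directory, which is what enables the consistent assignment described above.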
2. Use replication to horizontally distribute the client read-requests: Replication means
creating backup copies of data in real time. Many Big Data clusters use replication to make
retrieval of data failure-proof in a distributed environment. Using replication enables
horizontal scaling out of the client requests.
3. Moving queries to the data, not the data to the queries: Most NoSQL data stores use cloud
utility services (Large graph databases may use enterprise servers). Moving client node queries
to the data is efficient as well as a requirement in Big Data solutions.
4. Queries distribution to multiple nodes: Client queries for the DBs are analyzed at the query
analyzers, which evenly distribute the queries to data nodes/replica nodes. High-performance
query processing requires the use of multiple nodes. The query execution takes place separately
from the query evaluation (evaluation means interpreting the query and generating a plan
for its execution sequence).
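The even distribution of queries can be sketched with a simple round-robin dispatcher. The node names and the answer format below are illustrative only; a real analyzer would also consider node load and data placement.

```python
from itertools import cycle

# Replica nodes that can each serve a read query (names are hypothetical).
replicas = ["replica1", "replica2", "replica3"]

# The "analyzer" hands each incoming query to the next node in turn.
dispatch = cycle(replicas)

def run_query(query: str) -> str:
    node = next(dispatch)          # pick the next replica round-robin
    return f"{node} executes: {query}"

for q in ["q1", "q2", "q3", "q4"]:
    print(run_query(q))
# replica1 executes: q1
# replica2 executes: q2
# replica3 executes: q3
# replica1 executes: q4
```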
Data types which a MongoDB document supports:

Type                 Description
Double               Represents a float value.
String               UTF-8 format string.
Object               Represents an embedded document.
Array                Sets or lists of values.
Binary data          String of arbitrary bytes to store images, binaries.
Object id            ObjectIds (the MongoDB document identifier, equivalent to a primary
                     key) are small, likely unique, fast to generate, and ordered. The
                     value consists of 12 bytes, where the first four bytes are a
                     timestamp that reflects the instant of the ObjectId's creation.
Boolean              Represents a logical true or false value.
Date                 BSON Date is a 64-bit integer that represents the number of
                     milliseconds since the Unix epoch (Jan 1, 1970).
Null                 Represents a null value. A value which is missing or unknown is Null.
Regular expression   RegExp maps directly to a JavaScript RegExp.
32-bit integer       Numbers without decimal points save and return as 32-bit integers.
Timestamp            A special timestamp type for internal MongoDB use, not associated
                     with the regular Date type. Timestamp values are a 64-bit value,
                     where the first 32 bits are the time t (seconds since the Unix
                     epoch), and the next 32 bits are an incrementing ordinal for
                     operations within a given second.
64-bit integer       Numbers without a decimal point save and return as 64-bit integers.
Min key              MinKey compares less than all other possible BSON element values
                     and exists primarily for internal use.
Max key              MaxKey compares greater than all other possible BSON element values
                     and exists primarily for internal use.
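The ObjectId layout above (first four bytes hold the creation timestamp) can be checked with a short Python sketch. The helper function is ours, not a MongoDB API; it simply decodes the leading 8 hex characters of a 24-hex-character ObjectId string.

```python
import datetime

def objectid_timestamp(oid_hex: str) -> datetime.datetime:
    """Extract the creation instant from a 24-hex-char (12-byte) ObjectId.

    The first four bytes of an ObjectId hold the seconds since the
    Unix epoch at which the id was generated.
    """
    seconds = int(oid_hex[:8], 16)   # first 4 bytes = 8 hex characters
    return datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc)

# A made-up ObjectId whose timestamp bytes are 0x65000000:
print(objectid_timestamp("650000000000000000000000"))
```

This ordering by creation time is why ObjectIds are described as "ordered" even though they are generated independently on different clients.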
MongoDB querying commands:

Command                     Description
mongo                       Starts MongoDB (mongo is the MongoDB client). The default
                            database in MongoDB is test.
db.help()                   Runs help; displays the list of all the commands.
db.stats()                  Gets statistics about the MongoDB server.
use <database>              Switches to the named database.
db.<collection>.find()      Views all documents in a collection.
db.<collection>.update()    Updates a document.
db.<collection>.remove()    Deletes a document.
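The semantics of find(), update() and remove() can be mimicked by a toy in-memory collection in Python. This is purely illustrative (real MongoDB stores BSON documents on a mongod server and supports far richer query operators); it only shows the match-by-field behaviour of the commands above.

```python
# A toy in-memory "collection" mimicking find/update/remove semantics.
class ToyCollection:
    def __init__(self):
        self.docs = []

    def insert(self, doc):
        self.docs.append(dict(doc))

    def find(self, query=None):
        """Return documents whose fields match every key in `query`."""
        query = query or {}
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in query.items())]

    def update(self, query, new_values):
        """Merge `new_values` into every matching document."""
        for d in self.find(query):
            d.update(new_values)

    def remove(self, query):
        """Delete every document matching `query`."""
        matched = self.find(query)
        self.docs = [d for d in self.docs if d not in matched]

students = ToyCollection()
students.insert({"name": "ashish", "class": "B.Tech."})
students.update({"name": "ashish"}, {"semester": "VII"})
print(students.find({"name": "ashish"}))
```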
Replication: Replication ensures high availability in Big Data. The presence of multiple
copies on different database servers makes DBs fault-tolerant against any database
server failure. Multiple copies of data certainly help in localizing the data and ensure
availability of data in a distributed system environment.
MongoDB replicates with the help of a replica set. A replica set in MongoDB is a group of
mongod (MongoDB server) processes that store the same dataset. Replica sets provide
redundancy and high availability. A replica set usually has a minimum of three nodes. One
of them is called the primary. The primary node receives all the write operations. All the other
nodes are termed secondary. The data replicates from the primary to the secondary nodes. A new
primary node can be chosen among the secondary nodes at the time of automatic failover or
maintenance. The failed node, when recovered, can join the replica set as a secondary node again.
Command         Description
rs.initiate()   Initiates a new replica set
rs.conf()       Checks the replica set configuration
rs.status()     Checks the status of a replica set
rs.add()        Adds members to a replica set
Figure 3.13 shows a replicated dataset after creating three secondary members from a primary
member.

Figure 3.13 Replicated set on creating secondary members
Auto-sharding: Sharding is a method for distributing data across multiple machines in a
distributed application environment. MongoDB uses sharding to provide services to Big Data.

Limitations of Hive are:
✓ Not a full database. The main disadvantage is that Hive does not provide
update, alter and deletion of records in the database.
✓ Not developed for unstructured data.
✓ Not designed for real-time queries.
✓ Performs the partition always from the last column.
Figure 4.10 Hive architecture
Components of Hive architecture are:
✓ Hive Server (Thrift) - An optional service that allows a remote client to submit requests to
Hive and retrieve results. Requests can use a variety of programming languages. The Thrift
Server exposes a very simple client API to execute HiveQL statements.
✓ Hive CLI (Command Line Interface) - Popular interface to interact with Hive. Hive runs in
local mode, using local storage rather than HDFS, when running the CLI on a Hadoop cluster.
✓ Web Interface - Hive can be accessed using a web browser as well. This requires
an HWI Server running on some designated node. The URL http://hadoop:<port>/hwi can be
used to access Hive through the web.
✓ Metastore - The system catalog. All other components of Hive interact with
the Metastore. It stores the schema or metadata of tables, databases, columns
in a table, their data types and HDFS mapping.
✓ Hive Driver - Manages the life cycle of a HiveQL statement during
compilation, optimization and execution.

Schema-Less Database
The schema of a database system refers to designing a structure for datasets and data structures for
storing into the database. NoSQL data stores do not necessarily have a fixed table schema. The systems do not
use the concept of Join (between distributed datasets). A cluster-based, highly distributed node
manages a single large data store with a NoSQL DB. Data written at one node replicates to multiple
nodes. Therefore, these are identical, fault-tolerant and partitioned into shards. Distributed databases
can store and process a set of information on more than one computing node.
Figure 3.2 Characteristics of the schema-less model
Increasing Flexibility for Data Manipulation
NoSQL data stores possess the characteristic of increasing flexibility for data manipulation. New
attributes can be increasingly added to the database. Late binding of them is also permitted.
BASE Properties: BA stands for basic availability, S stands for soft state and E stands for
eventual consistency.
1. Basic availability is ensured by distribution of shards (many partitions of a huge data store) across
many data nodes with a high degree of replication. Then, a segment failure does not necessarily
mean complete data store unavailability.
2. Soft state ensures processing even in the presence of inconsistencies, but achieving
consistency eventually. A program suitably takes into account the inconsistency found during
processing. NoSQL database design does not consider the need for consistency all along the
processing time.
3. Eventual consistency means the consistency requirement in NoSQL databases is met at some
point of time in the future. Data converges eventually to a consistent state, with no time-frame
specification for achieving it. ACID rules require consistency all along the processing, on
completion of each transaction; BASE does not have that requirement and has the flexibility.

Key-Value Pair
Hadoop MapReduce functions take key-value pairs as input and produce key-value pairs as
output. Data should first be converted into key-value pairs before it is passed to
the Mapper, as the Mapper only understands key-value pairs of data.
Key-value pairs in Hadoop MapReduce are generated as follows:
InputSplit - Defines a logical representation of data and presents a split of the data for
processing at an individual map().
RecordReader - Communicates with the InputSplit and converts the split into
records, which are in the form of key-value pairs in a format suitable for reading by
the Mapper.
RecordReader uses TextInputFormat by default for converting data into key-value
pairs.
RecordReader communicates with the InputSplit until the file is read.
Figure 4.5 Key-value pairing in MapReduce

Figure 4.5 shows the steps in MapReduce key-value pairing.
Generation of a key-value pair in MapReduce depends on the dataset and the required
output. Also, the functions use the key-value pairs at four places: map() input, map()
output, reduce() input and reduce() output.
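The default TextInputFormat behaviour described above (key = byte offset of the line, value = the line text) can be sketched in Python. The function name is ours; it only imitates how a RecordReader turns a split into map() input pairs.

```python
# Sketch of TextInputFormat-style record reading: each record becomes a
# (key, value) pair where key = byte offset of the line, value = line text.
def text_input_records(data: str):
    offset = 0
    for line in data.splitlines(keepends=True):
        yield (offset, line.rstrip("\n"))
        offset += len(line)   # advance past the line and its newline

split = "hello world\nbig data\n"
for key, value in text_input_records(split):
    print(key, value)
# 0 hello world
# 12 big data
```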
Grouping by Key
When a map task completes, the Shuffle process aggregates (combines) all the
Mapper outputs by grouping the key-value pairs of the Mapper output, and the values v2
append into a list of values. A "Group By" operation on intermediate keys creates the v2 list.
Shuffle and Sorting Phase
All pairs with the same group key (k2) collect and group together, creating one group
for each key.
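The shuffle "Group By" step above can be sketched as follows (a simplified single-process illustration; real Hadoop shuffles the pairs across the network to reducer nodes):

```python
from collections import defaultdict

# Mapper output pairs (k2, v2) are sorted and grouped so that each
# reducer receives one (k2, [v2, v2, ...]) group per intermediate key.
def shuffle(mapper_output):
    groups = defaultdict(list)
    for k2, v2 in sorted(mapper_output):   # sorting phase
        groups[k2].append(v2)              # group-by-key phase
    return dict(groups)

pairs = [("data", 1), ("big", 1), ("data", 1)]
print(shuffle(pairs))
# {'big': [1], 'data': [1, 1]}
```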