Big Data Unit 5
NOSQL DATABASE
The availability of a high-performance, elastic, distributed data environment allows creative algorithms to exploit different modes of data management in different ways.
These data management frameworks are bundled under the term “NoSQL databases”.
They may combine traditional SQL (or SQL-like query languages) with alternative means of querying and access.
NoSQL data systems hold out the promise of greater flexibility in database management
while reducing the dependence on more formal database administration.
Types: Key-Value Stores:
Values (or sets of values, or even more complex entity objects) are associated with
distinct character strings called keys.
Programmers may see similarity with the data structure known as a hash table.
Key      Value
BMW      {“1-Series”, “3-Series”, “5-Series”, “5-Series GT”, “7-Series”, “X3”, “X5”, “X6”, “Z4”}
Buick    {“Enclave”, “LaCrosse”, “Lucerne”, “Regal”}
The key is the name of the automobile make, while the value is a list of names of models
associated with that automobile make.
Operations:
Get(key), which returns the value associated with the provided key.
Put(key, value), which associates the value with the key.
Multi-get(key1, key2, ..., keyN), which returns the list of values associated with the list of keys.
Delete(key), which removes the entry for the key from the data store.
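The four operations above behave much like the hash table mentioned earlier. The following is a minimal, illustrative sketch in Python (not the API of any particular key-value product):

class KeyValueStore:
    # In-memory key-value store sketch; real products add persistence,
    # replication, and sharding on top of this basic interface.
    def __init__(self):
        self._data = {}              # backing hash table: key -> value

    def put(self, key, value):
        self._data[key] = value      # associate the value with the key

    def get(self, key):
        return self._data.get(key)   # return the value for the exact key

    def multi_get(self, *keys):
        return [self._data.get(k) for k in keys]   # values for a list of keys

    def delete(self, key):
        self._data.pop(key, None)    # remove the entry for the key

# Usage mirroring the automobile example above.
store = KeyValueStore()
store.put("BMW", ["1-Series", "3-Series", "X5"])
store.put("Buick", ["Enclave", "LaCrosse", "Lucerne", "Regal"])
print(store.get("Buick"))
print(store.multi_get("BMW", "Buick"))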
Characteristics:
Uniqueness of the key - to find the values you are looking for, you must use the exact
key.
In this data management approach, if you want to associate multiple values with a single
key, you need to consider the representations of the objects and how they are associated
with the key.
Key-value stores are essentially very long, and likely thin, tables (many rows but few columns).
The table’s rows can be sorted by the key value to simplify finding the key during a
query.
The keys can be hashed using a hash function that maps the key to a particular location
(sometimes called a “bucket”) in the table.
The representation can grow indefinitely, which makes it good for storing large amounts
of data that can be accessed relatively quickly, as well as environments requiring
incremental appends of data.
Examples include capturing system transaction logs and managing profile data about individuals.
The simplicity of the representation allows massive amounts of indexed data values to be
appended to the same key value table, which can then be sharded, or distributed across
the storage nodes.
Under the right conditions, the table is distributed in a way that is aligned with the way
the keys are organized.
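As a hedged illustration of the hashing and sharding ideas above, the sketch below maps a key to a bucket within the table and then maps that bucket to a storage node; the hash function, bucket count, and node count are assumptions chosen for the example:

import hashlib

NUM_BUCKETS = 1024   # buckets within the table (assumed size)
NUM_NODES = 4        # storage nodes the table is sharded across (assumed)

def bucket_for(key: str) -> int:
    # A stable hash of the key decides its location ("bucket") in the table.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def node_for(key: str) -> int:
    # The same idea, applied again, assigns each bucket to a storage node,
    # so the distribution of the table is aligned with how keys are organized.
    return bucket_for(key) % NUM_NODES

print(bucket_for("BMW"), node_for("BMW"))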
While key-value pairs are very useful both for storing the results of analytical algorithms (such as phrase counts across massive numbers of documents) and for producing those results for reports, the model does pose some potential drawbacks.
Drawbacks:
The model will not inherently provide any kind of traditional database capabilities (such
as atomicity of transactions, or consistency when multiple transactions are executed
simultaneously)—those capabilities must be provided by the application itself.
Another drawback is that as the model grows, maintaining unique keys may become more difficult.
Types: Document Stores:
A document store is similar to a key value store in that stored objects are associated (and
therefore accessed via) character string keys.
The difference is that the values being stored, which are referred to as “documents,”
provide some structure and encoding of the managed data.
Common encodings include XML and JSON.
Example:
{StoreName: "Retail Store #34", {Street: "1203 O ST", City: "Lincoln", State: "NE", ZIP: "68508"}}
{StoreName: "Retail Store #65", {MallLocation: "Westfield Wheaton", City: "Wheaton", State: "IL"}}
{StoreName: "Retail Store #102", {Latitude: "40.748328", Longitude: "-73.985560"}}
The document representation embeds the model so that the meanings of the document
values can be inferred by the application.
One of the differences between a key value store and a document store is that while the
former requires the use of a key to retrieve data, the latter often provides a means (either
through a programming API or using a query language) for querying the data based on
the contents.
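To illustrate querying by content rather than by key, the sketch below filters a small in-memory collection of documents. It is a conceptual stand-in for a document store's query API or query language, not the interface of any specific product:

# Documents from the retail-store example, keyed by an (arbitrary) store id.
documents = {
    "store34":  {"StoreName": "Retail Store #34", "City": "Lincoln", "State": "NE"},
    "store65":  {"StoreName": "Retail Store #65", "City": "Wheaton", "State": "IL"},
    "store102": {"StoreName": "Retail Store #102", "Latitude": "40.748328"},
}

def find(collection, **criteria):
    # Return every document whose fields match all of the given criteria;
    # unlike a pure key-value lookup, no key is needed.
    return [doc for doc in collection.values()
            if all(doc.get(field) == value for field, value in criteria.items())]

print(find(documents, State="IL"))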
Types: Tabular Stores
Tabular, or table-based stores are largely derived from Google’s original Bigtable design
to manage structured data.
HBase, a Hadoop-related NoSQL data management system, evolved from the Bigtable design.
The Bigtable NoSQL model allows sparse data to be stored in a three-dimensional table
that is indexed by a row key, a column key that indicates the specific attribute for which a
data value is stored, and a timestamp that may refer to the time at which the row’s column
value was stored.
As an example, various attributes of a web page can be associated with the web page’s
URL: the HTML content of the page, URLs of other web pages that link to this web page,
and the author of the content.
Columns in a Bigtable model are grouped together as “families,” and the timestamps
enable management of multiple versions of an object.
The timestamp can be used to maintain history—each time the content changes, new
column attachments can be created with the timestamp of when the content was
downloaded.
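The three-dimensional model can be pictured as nested maps, as in the conceptual Python sketch below using the web-page example; the row key, column families, and timestamps are illustrative, and this shows the data model rather than any client API:

# row key -> column key ("family:qualifier") -> timestamp -> value
bigtable = {
    "com.example/index": {
        "contents:html": {
            1700000000: "<html>v1</html>",   # earlier version of the page
            1700086400: "<html>v2</html>",   # newer version kept alongside it
        },
        "anchor:com.other/links": {1700000000: "link text from a referring page"},
        "meta:author": {1700000000: "Jane Doe"},
    }
}

def latest(row_key, column_key):
    # Multiple timestamped versions are retained; reads usually want the newest.
    versions = bigtable[row_key][column_key]
    return versions[max(versions)]

print(latest("com.example/index", "contents:html"))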
HIVE
Apache Hive enables users to process data without explicitly writing MapReduce code.
One key difference from Pig is that the Hive language, HiveQL (Hive Query Language),
resembles Structured Query Language (SQL) rather than a scripting language.
A Hive table structure consists of rows and columns.
The rows typically correspond to some record, transaction, or particular entity (for
example, customer) detail.
The values of the corresponding columns represent the various attributes or
characteristics for each row.
Hadoop and its ecosystem are used to apply some structure to unstructured data.
Therefore, if a table structure is an appropriate way to view the restructured data, Hive
may be a good tool to use.
Additionally, a user may consider using Hive if the user has experience with SQL and the
data is already in HDFS.
Another consideration in using Hive may be how data will be updated or added to the
Hive tables.
If data will simply be added to a table periodically, Hive works well, but if there is a need
to update data in place, it may be beneficial to consider another tool, such as HBase.
A Hive query is first translated into a MapReduce job, which is then submitted to the
Hadoop cluster.
Thus, the execution of the query has to compete for resources with any other submitted
job.
Hive is intended for batch processing. Hive is a good choice when:
Data easily fits into a table structure.
Data is already in HDFS.
Developers are comfortable with SQL programming and queries.
There is a desire to partition datasets based on time.
Batch processing is acceptable.
Basics:
From the command prompt, a user enters the interactive Hive environment by simply
entering hive:
$ hive
hive>
From this environment, a user can define new tables, query them, or summarize their
contents.
hive> create table customer (cust_id bigint, first_name string, last_name string,
      email_address string) row format delimited fields terminated by '\t';
A HiveQL query is then executed to count the number of records in the newly created table, customer.
Because the table is currently empty, the query returns a result of zero.
The query is converted and run as a MapReduce job, which results in one map task and
one reduce task being executed.
hive> select count(*) from customer;
When querying large tables, Hive outperforms and scales better than most conventional databases.
HBASE
HBase is a distributed column-oriented database built on top of the Hadoop file system. It
is an open-source project and is horizontally scalable.
Apache HBase is capable of providing real-time read and write access to datasets with billions of rows and millions of columns.
HBase is a data model similar to Google’s Bigtable, designed to provide quick random access to huge amounts of structured data.
It leverages the fault tolerance provided by the Hadoop File System.
It is a part of the Hadoop ecosystem that provides random real-time read/write access to
data in the Hadoop File System.
One can store the data in HDFS either directly or through HBase.
Data consumer reads/accesses the data in HDFS randomly using HBase.
HBase sits on top of the Hadoop File System and provides read and write access.
Storage Mechanism in HBase:
HBase is a column-oriented database and the tables in it are sorted by row.
The table schema defines only column families, which are the key value pairs.
A table has multiple column families and each column family can have any number of
columns.
Subsequent column values are stored contiguously on the disk. Each cell value of the
table has a timestamp.
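As a small illustration of this storage model, the sketch below uses the third-party happybase Python client (an assumption; the HBase shell or Java API would work equally well) to create a table whose schema declares only a column family, then write and read one row. The host, table, and column names are hypothetical:

import happybase   # HBase client over the HBase Thrift gateway (assumed installed)

connection = happybase.Connection('localhost')   # Thrift server host is an assumption

# The schema declares only the column family 'cf'; columns are added freely later.
connection.create_table('web_pages', {'cf': dict()})
table = connection.table('web_pages')

# The row key is the page URL; columns are addressed as 'family:qualifier'.
table.put(b'com.example/index', {b'cf:html': b'<html>...</html>',
                                 b'cf:author': b'Jane Doe'})

# Random, real-time read of a single row returns the latest cell value per column.
print(table.row(b'com.example/index'))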
Architecture:
Tables are split into regions and are served by the region servers.
Regions are vertically divided by column families into “Stores”.
Stores are saved as files in HDFS.
Master Server:
Assigns regions to the region servers
Handles load balancing of the regions across region servers. It unloads the busy servers
and shifts the regions to less occupied servers.
Maintains the state of the cluster by negotiating the load balancing.
Regions:
Tables are split up and spread across the region servers.
Region Servers:
Communicate with the clients and handle data-related operations.
Handle read and write requests for all the regions under them.
Decide the size of the regions by following the region size thresholds.
MemStore is the in-memory cache; data written to HBase is held here initially.
The data is later transferred and saved in HFiles as blocks, and the MemStore is flushed.
ZooKeeper:
It provides services like maintaining configuration information, naming, providing
distributed synchronization, etc.
It keeps track of all the region servers in the HBase cluster, including how many region servers there are and which region servers hold which DataNode.
Services:
Establishing client communication with region servers.
Tracking server failures and network partitions.
Maintaining configuration information.
SHARDING
Sharding is a database architecture pattern related to horizontal partitioning.
It is the practice of separating one table’s rows into multiple different tables, known as partitions.
Each partition has the same schema and columns, but entirely different rows.
The data held in each is unique and independent of the data held in other partitions.
(Figures: the example table before sharding and after sharding.)
In a vertically-partitioned table, entire columns are separated out and put into new,
distinct tables.
The data held within one vertical partition is independent from the data in all the others,
and each holds both distinct rows and columns.
Horizontal or Range Based Sharding:
In this case, the data is split based on the value ranges that are inherent in each entity.
For example, if you store the contact info for your online customers, you might choose to store the info for customers whose last names start with A-H on one shard, while storing the rest on another shard.
ID   Name   Mail ID
1    A      [email protected]
2    B      [email protected]
3    C      [email protected]
4    D      [email protected]
If the data is not evenly distributed (for example, most customers’ last names fall in the A-H range), the first shard will experience a much heavier load than the second shard and can become a system bottleneck.
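A brief sketch of range-based routing on last name (the range boundary and shard names are assumptions), which also shows how skewed data overloads one shard:

from collections import Counter

def shard_for(last_name: str) -> str:
    # Customers with last names A-H go to shard1, the rest to shard2.
    return "shard1" if last_name[0].upper() <= "H" else "shard2"

customers = ["Adams", "Brown", "Chen", "Garcia", "Hill", "Iyer", "Zhang"]
print(Counter(shard_for(name) for name in customers))
# If most last names fall in A-H, shard1 carries most of the load.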
Vertical Sharding:
In this case, different features of an entity will be placed in different shards on different
machines.
Original table:
ID   Name   Mail ID
1    A      [email protected]
2    B      [email protected]
3    C      [email protected]
4    D      [email protected]
After vertical sharding:
Shard 1:             Shard 2:
ID   Name            ID   Mail ID
1    A               1    [email protected]
2    B               2    [email protected]
3    C               3    [email protected]
4    D               4    [email protected]
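A small sketch of splitting each row’s columns into the two vertical shards shown above; the column grouping, shard assignment, and sample values are illustrative assumptions:

# One logical row of the customer entity (values are made up for illustration).
row = {"ID": 1, "Name": "A", "MailID": "a@example.com"}

# Column groups assigned to each vertical shard; the ID is kept in both so the
# pieces of an entity can be re-joined later.
SHARD1_COLS = ("ID", "Name")
SHARD2_COLS = ("ID", "MailID")

def split_row(row):
    shard1 = {col: row[col] for col in SHARD1_COLS}
    shard2 = {col: row[col] for col in SHARD2_COLS}
    return shard1, shard2

print(split_row(row))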
Benefits:
It lets you handle the critical part of your data differently from the not-so-critical part and build different replication and consistency models around each.
Disadvantages:
It increases the development and operational complexity of the system.
If your site/system experiences additional growth, it may become necessary to further shard a feature-specific database across multiple servers.
Key or hash based sharding:
In this case, an entity has a value that can be used as input to a hash function, and a resultant hash value is generated. This hash value determines which database server (shard) to use.
The main drawback of this method is that elastic load balancing (dynamically
adding/removing database servers) becomes very difficult and expensive.
Because adding or removing servers changes the hash mapping, existing data must be migrated; a large number of requests cannot be serviced during the migration, and you will incur downtime until it completes.
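A sketch of key- or hash-based shard selection, and of why changing the number of servers is expensive: when the modulus changes, most keys map to a different shard and must be migrated. The server counts here are assumptions:

import hashlib

def shard_id(entity_value: str, num_shards: int) -> int:
    # Hash the entity's value and reduce it modulo the number of database servers.
    digest = int(hashlib.md5(entity_value.encode()).hexdigest(), 16)
    return digest % num_shards

keys = [f"user{i}" for i in range(1000)]
before = [shard_id(k, 4) for k in keys]   # 4 database servers
after  = [shard_id(k, 5) for k in keys]   # one server added

moved = sum(b != a for b, a in zip(before, after))
print(f"{moved} of {len(keys)} keys would have to migrate")   # roughly 80%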
Directory based sharding:
Directory based shard partitioning involves placing a lookup service in front of the
sharded databases.
The lookup service knows the current partitioning scheme and keeps a map of each entity
and which database shard it is stored on.
The client application first queries the lookup service to figure out the shard (database
partition) on which the entity resides/should be placed.
Then it queries / updates the shard returned by the lookup service.
The lookup service is usually implemented as a web service.
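A minimal sketch of the lookup-service idea: the client consults a map from entity to shard before every read or write. The map contents and shard names here are hypothetical, and a real lookup service would be a separate (replicated) web service rather than a local dictionary:

# Entity key -> shard, as maintained by the lookup service.
lookup = {"customer:1": "shard_a", "customer:2": "shard_b"}

def route(entity_key: str) -> str:
    # The client asks the lookup service which shard holds (or should hold) the
    # entity, then queries or updates that shard.
    return lookup[entity_key]

print(route("customer:2"))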
Steps (example: growing from 4 shards, placed with a modulo-4 hash function, to 10 shards placed with a modulo-10 hash function):
1. Keep the modulo-4 hash function in the lookup service.
2. Determine the data placement based on the new hash function, modulo 10.
3. Write a script to copy all the data based on #2 into the six new shards, and possibly onto the 4 existing shards; note that it does not delete any existing data on the 4 existing shards (a sketch of this copy step follows the list).
4. Once the copy is complete, change the hash function to modulo 10 in the lookup service.
5. Run a cleanup script to purge unnecessary data from the 4 existing shards based on step #2; the purged data now exists on other shards.
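A sketch of the copy step (step 3) under stated assumptions: keys are hashed with md5, the old placement is modulo 4, the new placement is modulo 10, and read_row/copy_row are hypothetical helpers for reading a row from one shard and writing it to another:

import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def old_shard(key: str) -> int:
    return _hash(key) % 4     # placement under the old scheme (4 shards)

def new_shard(key: str) -> int:
    return _hash(key) % 10    # placement under the new scheme (10 shards)

def copy_all(all_keys, read_row, copy_row):
    # Copy each row to its new shard if the placement changed; existing data on
    # the 4 old shards is left untouched until the later cleanup step.
    for key in all_keys:
        src, dst = old_shard(key), new_shard(key)
        if src != dst:
            copy_row(dst, key, read_row(src, key))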
There are two practical considerations that need to be solved on a per-system basis:
While the migration is happening, the users might still be updating their data. Options
include putting the system in read-only mode or placing new data in a separate server that
is placed into correct shards once migration is done.
The copy and cleanup scripts might have an effect on system performance during the
migration. It can be circumvented by using system cloning and elastic load balancing -
but both are expensive.