Survey on data management system for Big data analytics

Conference Paper · February 2016

Thulasi A., Remya K T V., and Raju G, Kannur University

Abstract— The current trend in data management focuses on Big Data, which concerns the large volume, variety and velocity of growing data sets. Conventional database management systems fail to handle this phenomenon, and a rich set of alternatives has been introduced. This paper focuses on the evolution of NoSQL as an alternative to the conventional DBMS for handling the data management issues associated with big data.

Index Terms— NoSQL; BigData; DBMS; ACID; CAP; sharding; mapreduce

I. INTRODUCTION
At the initial stage, databases were managed using flat file systems, which faced many challenges. In 1970 Codd introduced the concept of the relational DBMS, which uses tables (relations) to hold the data [1]. The RDBMS remained the emperor of data management for many decades. Veterans from industry reinforced the RDBMS with documentation and support.

In 1983 Andreas Reuter and Theo Härder came up with the acronym ACID (Atomicity, Consistency, Isolation, Durability) to describe the properties of a DBMS [2]. Atomicity ensures that the entire transaction fails if any one part of it fails, leaving the database unchanged. Consistency makes sure that a transaction changes the database from one valid state to another. Isolation controls the concurrent execution of transactions, ensuring serializability or a relaxed form of it. Durability protects the effects of committed transactions from crashes, power loss and errors.

The RDBMS was designed for centralised storage. With the emergence of technologies such as cloud computing, which demand distributed management of data, new challenges started to appear. Further, the volume of data to be managed and the dimensionality of the data increased exponentially, affecting the performance of RDBMS-based systems. In such a situation the design and normalization of an effective database is comparatively time consuming. Key relations, normal forms, data types and similar design concepts must be properly understood and implemented to avoid mistakes and inconsistencies in practice.

The Big Data era changed the concept of structured data storage, since the data became more unstructured and the conventional systems failed to manage this unstructured data efficiently. Michael Stonebraker and his co-authors published a paper titled "The end of an architectural era (it's time for a complete rewrite)", which argues that DBMS vendors should start again and design for tomorrow's requirements [3].

To overcome the complexity of relational or indexed SQL-based systems over large amounts of data, the new concept of NoSQL was introduced. It supports the management of very large amounts of data without following a tabular format or schema. It can thus handle highly scalable data with high availability while maintaining good performance.

II. NOSQL
With the advent of Web 2.0, the insufficiencies of data stores built on the traditional DBMS, such as centralized data management and limited horizontal scalability, came to the forefront. Large-scale data that requires parallel processing across distributed server systems paved the way for the non-relational database, NoSQL [4].

The reliability, scalability and accessibility of data in the cloud, together with low-cost servers and high-bandwidth networks, have brought NoSQL data stores within the reach of users. NoSQL focuses on either the CAP (Consistency, Availability, Partition Tolerance) or the BASE (Basically available, Soft state, Eventually consistent) concept instead of the transactional ACID characteristics of the traditional RDBMS.

A. CAP-Theorem
In 2000, Eric Brewer introduced the concept of the CAP theorem in the paper titled “Towards Robust Distributed Systems” [5]. In this paper he stated that, out of the three properties of consistency, availability and partition tolerance, a system can have at most two, compromising on the third.
● Consistency: In a distributed system the data source is shared among multiple servers. Consistency ensures that every server sees a single, up-to-date copy of the data.
● Availability: High availability means that the system is designed in such a manner that it continues to operate even if nodes in a cluster crash, hardware fails, or software components go down for upgrades.
● Partition Tolerance: Partition tolerance is the ability of the system to continue operating even though a disturbance occurs on the network (for example, the addition or removal of nodes).

NoSQL systems generally give up strict consistency in order to retain the other two properties. NoSQL has thus emerged as a solution for today’s data store needs and has been a topic of discussion and research in recent times.
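The trade-off stated by the CAP theorem can be made concrete with a toy model. The sketch below is only an illustration, not how any particular NoSQL product works: the class names `Replica` and `TwoNodeStore`, the `mode` flag, the shared version counter and the last-writer-wins merge are all assumptions made for this example. It shows two replicas that either refuse writes during a network partition (consistency first) or accept them and reconcile later (availability first, eventually consistent).

```python
class Replica:
    """One node holding a full copy of the data as key -> (version, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class TwoNodeStore:
    """Two replicas joined by a link that can be cut to simulate a partition.

    mode="CP" refuses writes while partitioned (consistency first).
    mode="AP" accepts them locally and reconciles later (availability first).
    Both modes are hypothetical labels used only in this sketch.
    """

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False
        self._version = 0  # simplification: one shared counter orders all writes

    def write(self, replica, key, value):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("write rejected: replicas cannot be kept in sync")
        self._version += 1
        replica.data[key] = (self._version, value)
        if not self.partitioned:
            # The link is up, so the other replica is updated immediately.
            other = self.b if replica is self.a else self.a
            other.data[key] = (self._version, value)

    def read(self, replica, key):
        return replica.data.get(key, (0, None))[1]

    def heal(self):
        """Restore the link and merge with last-writer-wins on the version."""
        self.partitioned = False
        for key in set(self.a.data) | set(self.b.data):
            newest = max(
                self.a.data.get(key, (0, None)),
                self.b.data.get(key, (0, None)),
                key=lambda entry: entry[0],
            )
            self.a.data[key] = newest
            self.b.data[key] = newest


# Availability first: the write succeeds, but the other replica is stale.
ap = TwoNodeStore("AP")
ap.write(ap.a, "x", "old")
ap.partitioned = True
ap.write(ap.a, "x", "new")
stale_read = ap.read(ap.b, "x")
ap.heal()
healed_read = ap.read(ap.b, "x")

# Consistency first: the write is refused while the link is down.
cp = TwoNodeStore("CP")
cp.write(cp.a, "x", "old")
cp.partitioned = True
try:
    cp.write(cp.a, "x", "new")
    cp_write_accepted = True
except RuntimeError:
    cp_write_accepted = False
```

In the availability-first run the second replica keeps answering during the partition but returns the older value until `heal()` is called, which is the "eventually consistent" behaviour of BASE. In the consistency-first run no replica ever diverges, at the price of rejecting the write.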
B. Features of NoSQL Database
● NoSQL databases are able to handle any type of data (structured, semi-structured and unstructured).
● NoSQL databases are simple and fast because they lack rigid schemas.
● Most NoSQL databases support object-oriented programming.
● NoSQL adopts the sharding technique, which achieves horizontal scalability by connecting multiple databases.
● NoSQL ensures high availability of contents through a distributed environment (backups, recovery).

III. COMMON CONCEPTS
A. Sharding
When dealing with a large volume of data, a single machine cannot store or process the data with its limited RAM size and the limited input/output capacity of its disk drives, so the concept of scaling was introduced. Sharding is horizontal scaling, which stores the data across multiple servers, or shards. Each shard is an independent database with high availability and consistency, and collectively the shards form a single logical database [6].

Figure 1: Sharding (a 1 TB database split into four shards, Shard 1 to Shard 4, of 256 GB each)

B. Consistent hashing
Consistent hashing allows incremental scaling of clusters without rehashing the older values. Many NoSQL databases use this method to implement sharding elegantly. When the hash table is resized, consistent hashing needs to remap only k/n keys on average, where k is the number of keys and n is the number of shards or slots. It therefore prevents servers from being overwhelmed, because the whole key set never has to be remapped. It uses a hash ring: each object is mapped to a point on the edge of a circle and then walks around the circle to fall into the first bucket encountered [7].

Figure 2: Consistent hashing

C. Mapreduce
Mapreduce is a programming paradigm suitable for handling large amounts of data in parallel. The framework takes its input as key-value pairs; the map function performs filtering and sorting, and the reduce function performs a summary operation and produces the output. The main contribution of mapreduce is the scalability and fault tolerance achieved for a variety of domains by optimizing the execution engine once [8].

Figure 3: Mapreduce

IV. DATA MODELS
NoSQL data models differ from those of the RDBMS. The challenges faced by the traditional system became the requirements for this new set of data stores: storage of non-relational data in a distributed environment, open source, effective horizontal scalability, schema-less design, replication support, eventual consistency and user-friendly APIs. NoSQL data models fall mainly into four categories: key value stores, column family stores, document stores and graph stores.
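As a rough orientation before the individual categories are described, the sketch below shapes one hypothetical record ("user 42, named Asha, follows user 7") in each of the four data models covered in this section: key value, column family, document and graph. The record, the field names and the plain Python dictionaries are assumptions chosen for illustration; real stores add persistence, distribution and their own query interfaces.

```python
# One hypothetical record ("user 42, named Asha, follows user 7") in four shapes.

# Key value store: the value is opaque; only the key can be used for lookup.
key_value = {"user:42": "Asha,7"}

# Column family store: row key -> column family -> column -> value.
column_family = {
    "42": {
        "profile": {"name": "Asha"},
        "social": {"follows": "7"},
    }
}

# Document store: key -> nested, self-describing document (JSON-like).
document = {"42": {"name": "Asha", "follows": [7]}}

# Graph store: entities are nodes; the relationship is a first-class edge.
graph = {
    "nodes": {42: {"name": "Asha"}, 7: {"name": "Ravi"}},
    "edges": [(42, "FOLLOWS", 7)],
}

# The same question, "whom does user 42 follow?", answered in each model.
followed_kv = int(key_value["user:42"].split(",")[1])  # caller must parse the value
followed_cf = int(column_family["42"]["social"]["follows"])
followed_doc = document["42"]["follows"][0]
followed_graph = [dst for src, rel, dst in graph["edges"]
                  if src == 42 and rel == "FOLLOWS"][0]
```

The lookups at the end show where the structure lives in each case: in the application code for the key value shape, in named columns or fields for the column family and document shapes, and in the edge itself for the graph shape.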
A. Key Value Store
A key value store holds simple key and value pairs and follows a hash- or dictionary-like storage mechanism. The key is a unique identifier used to manage the values. It supports a schema-free, distributed environment, but it lacks relational structure, indexing and data-level querying [9].
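The dictionary-like mechanism described above can be sketched in a few lines. This is a minimal in-memory illustration only: the class `KeyValueStore`, its method names and the sample pairs are assumptions for this example and do not correspond to the API of any particular product. The `scan` method shows what the absence of indexing and data-level querying means in practice: a lookup by value has to inspect every pair.

```python
class KeyValueStore:
    """Dictionary-backed store: every operation is addressed by a unique key."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value

    def get(self, key, default=None):
        return self._items.get(key, default)

    def delete(self, key):
        """Remove the key; return True if it existed."""
        if key in self._items:
            del self._items[key]
            return True
        return False

    def scan(self, predicate):
        """Value-based lookup: with no index, every pair must be inspected."""
        return sorted(k for k, v in self._items.items() if predicate(v))


store = KeyValueStore()
store.put("K1", "AAA,BB,CCC")
store.put("K2", "AAA,BBB")
store.put("K5", "3,ZZZ,5623")

by_key = store.get("K2")          # direct lookup by key
missing = store.get("K9")         # unknown keys fall back to the default
keys_with_aaa = store.scan(lambda value: "AAA" in value.split(","))
removed = store.delete("K1")
```

Lookups by key are a single dictionary access, while the value-based query walks the whole store, which is why key value stores suit caching-style access patterns better than ad hoc queries.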

Key    Value
K1     AAA,BB,CCC
K2     AAA,BBB
K3     AAA,DDD
K4     AAA,2,01/01/2015
K5     3,ZZZ,5623

A key value store performs simple operations based on the key attribute. Such data stores are normally effective as a caching mechanism and have comparatively low memory requirements, which leads to high performance. Some of the popular key-value databases are:
● Riak: Riak is a distributed, eventually consistent key-value store whose goals are high availability, fault tolerance, operational simplicity and scalability. Riak is not a good solution for handling small data sets. It is available in different versions: open source, supported enterprise and cloud storage [10].
● Redis (often referred to as a data structure server): Redis is among the most popular key-value stores. It is an open-source, networked store with optional durability [11].
● Dynamo (not open-source): Amazon developed this technology to implement high scalability and availability of data while also ensuring reduced cost and high performance. It combines concepts from both databases and distributed hash tables. Dynamo is a vital part of the Amazon Web Services infrastructure [12].

B. Column family stores
Most column stores follow the concepts of Google’s Bigtable, which emphasizes a column-oriented representation for storage [13]. This provides better performance and consistency for the values. Columns can be grouped into column families, which is especially important for data organization and partitioning. Because of this efficient partitioning, column family stores are useful for very large cluster applications. The column family configuration is typically predefined, but it allows more flexibility in storing any data type.

Figure 4: Column family stores

Compared with relational databases, column family stores offer a similar graphical representation because of their tabular format, but they handle null values effectively by allowing a varying number of attributes in large datasets. The popular column-family databases are:
● Cassandra: Cassandra is an open source, distributed data store that can be optimized for a large number of write requests. Because the cluster does not have a master node, reads and writes can be handled by any node. Apache Cassandra was developed by Avinash Lakshman and Prashant Malik. Cassandra is in use at Rackspace, Digg, Facebook, Twitter, Cisco, Mahalo and other companies that handle large data sets [14].
● HBase: HBase is a non-relational, distributed database built on the Hadoop Distributed File System (HDFS). HBase offers parallelism through MapReduce and also indexing using B+Tree.
● Hypertable: Hypertable is a massively scalable database modeled after Bigtable, Google's proprietary, massively scalable database [15].

C. Document Stores
Document stores, or document-oriented systems, are meant for storing and retrieving documents in formats such as XML, JSON, BSON and YAML (known as semi-structured data). Document databases store documents in the value part of a key-value store, which further needs a database engine for optimization. The key part is used to locate the document. Document databases are designed to offer rich support for modern programming concepts that do not rely on a fixed schema. Indexing and querying are the highlights of the document store [16].

Figure 5: Document Stores

Some of the popular document databases:
● CouchDB: CouchDB is an open-source, document-oriented NoSQL data store that uses the JSON format along with technologies such as JavaScript, HTTP and MapReduce. CouchDB was first released in 2005 and became an Apache project in 2008 (Apache CouchDB). CouchDB is a rich data store that does not require predefined schemas. Its commit mechanism protects data in a system crash scenario, offering durability similar to that of ACID.
● MongoDB: MongoDB is free and open-source software that supports features such as indexing, replication, load balancing, file storage and aggregation. MongoDB uses BSON documents instead of conventional database concepts to integrate data in certain kinds of applications. MongoDB achieves high scalability through sharding, where data is distributed across multiple nodes [6][16].
● OrientDB: OrientDB is a multi-model data store that supports key-value, document and graph structures, with relationships managed mainly through the graph model. It uses the MVRB-Tree as its indexing solution, offers user-preferred security, and provides a simple query layer for traversal [16].

D. Graph Stores
As the name implies, a graph store is an interconnected collection of nodes: each node is an entity, and the edges represent the relationships between nodes. The nodes can be interpreted in different ways depending on their relationships. Edges, or relationships, have directional significance, which reveals interesting patterns in the dataset [16].

Figure 6: Graph Stores

Unlike in an RDBMS, traversing the joins or relationships in a graph store is fast, because the relationships are persisted rather than computed at query time. Relationships also possess properties that add intelligence, such as the distance between nodes or the aspects that nodes share. These properties can in turn be used for querying the graph.

A lot of thought and design work is needed to model the relationships in the domain that we are trying to work with. Adding new relationship types is easy; changing existing nodes and their relationships is similar to data migration, because these changes will have to be done on each node and each relationship in the existing data.

There are many graph databases available, such as Neo4J, Infinitegraph, OrientDB and FlockDB.

V. MAJOR RESEARCH CHALLENGES
Although big data management is not a new technology, research in this field is in its early stages. Several existing issues have not been fully addressed. Moreover, new challenges continue to emerge from the applications of organizations. A single data management system that deals with big data is yet to be designed [17]. We have already discussed many data models for dealing with big data, each of which addresses only certain aspects, so the research challenges remain.

NoSQL technologies can scale beyond the capabilities of the conventional RDBMS and can handle unstructured data with their schema-less architecture. However, NoSQL is not a complete substitute for the RDBMS, and migration between relational databases and NoSQL is still a major undertaking [18]. The relational database is a mature area, and bringing its traditional features to NoSQL is also challenging.

Another important research challenge relates to privacy and security. The distributed architecture of NoSQL has many limitations, such as inadequate encryption support, poor authentication between client and server, and vulnerability to SQL injection and Denial of Service attacks.

VI. CONCLUSION
In this paper we presented the limitations of the conventional relational database and the evolution of NoSQL. The common concepts of NoSQL data management and the different data models were discussed, and some major research challenges were identified. Even though NoSQL is the first preference for big data management, it is not an ultimate solution. More collaborative research is needed to manage big data more effectively.

REFERENCES
[1] E. F. Codd, “A relational model of data for large shared data banks,” Communications of the ACM, vol. 13 (6), pp. 377-387, June 1970.
[2] Theo Härder, Andreas Reuter, “Principles of Transaction-Oriented Database Recovery,” ACM Comput. Surv., vol. 15 (4), pp. 287-317, December 1983.
[3] M. Stonebraker, S. Madden, D. Abadi, S. Harizopoulos, N. Hachem, and P. Helland, “The end of an architectural era (it’s time for a complete rewrite),” in Proc. of VLDB Conf., pp. 1150–1160, Sep 2007.
[4] https://en.wikipedia.org/wiki/NoSQL
[5] Eric A. Brewer, “Towards robust distributed systems,” in Proceedings of the nineteenth annual ACM symposium on Principles of distributed computing, Page 7, 2000.
[6] Sharding and MongoDB, https://docs.mongodb.org/master/MongoDB-sharding-guide-master.pdf
[7] Consistent Hashing, http://blog.plasmaconduit.com/consistent-hashing/
[8] Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” http://static.googleusercontent.com/media/research.google.com/es/us/archive/mapreduce-osdi04.pdf
[9] Key-value Stores, http://db-engines.com/en/article/Key-value+Stores
[10] Why Riak, http://docs.basho.com/riak/latest/theory/why-riak/
[11] Redis Documentation, http://redis.io/documentation
[12] Amazon DynamoDB Documentation, https://aws.amazon.com/documentation/dynamodb/
[13] Chang F, Dean J, Ghemawat S, Hsieh W, Wallach D, Burrows M, Chandra T, Fikes A, Gruber R (2006) “Bigtable: A distributed structured data storage system.” 7th OSDI 26:305 – 314
[14] Lakshman A, Malik P (2010) “Cassandra: a decentralized structured storage system.” ACM SIGOPS Operating Syst Rev 44(2):35 – 40.
[15] Hypertable, http://hypertable.org/
[16] https://www.thoughtworks.com/insights/blog/nosql-databases-overview
[17] Divyakant Agrawal, Sudipto Das, Amr El Abbadi, “Big data and cloud computing: current state and future opportunities,” in Proceedings of the 14th International Conference on Extending Database Technology (EDBT/ICDT '11), pp. 530-533, 2011.
[18] Aaron Schram, Kenneth M. Anderson, “MySQL to NoSQL: data modeling challenges in supporting scalability,” in Proceedings of the 3rd annual conference on Systems, programming, and applications: software for humanity (SPLASH '12), pp. 191-202, 2012.