Big Data Analytics (18CS72)

3.3 NOSQL DATA ARCHITECTURE PATTERNS

3.3.1 Key-Value Store

The simplest way to implement a schema-less data store is to use key-value pairs. The data store characteristics are high performance, scalability and flexibility. Data retrieval is fast in a key-value data store. A simple string, called the key, maps to a large data string or BLOB (Basic Large Object). Key-value store accesses use a primary key for accessing the values. Therefore, the store can easily be scaled up for very large data. The concept is similar to a hash table, where a unique key points to a particular item(s) of data. Figure 3.4 shows the key-value pair architectural pattern and an example of a students' database as key-value pairs.

Key      | Value
"ashish" | "Category: Student; Class: B.Tech.; Semester: VII; Branch: Engineering; Mobile: 3999912345"
"mayuri" | "Category: Student; Class: M.Tech.; Mobile: 8888823456"
Key1     | Values1
Key2     | Values2
...      | ...
KeyN     | ValuesN

The number of key-value pairs, N, can be a very large number.

Advantages of a key-value store are as follows:
1. The data store can store any data type in a value field. The key-value system stores the information as a BLOB of data (such as text, hypertext, images, video and audio) and returns the same BLOB when the data is retrieved. Storage is like an English dictionary: a query for a word retrieves the meanings, usages and different forms as a single item in the dictionary. Similarly, querying for a key retrieves the values.
2. A query just requests the values and returns the values as a single item. Values can be of any data type.
3. A key-value store is eventually consistent.
4. A key-value data store may be hierarchical or may be an ordered key-value store.
5. Returned values on queries can be converted into lists, table columns, data-frame fields and columns.
6. Key-value stores have (i) scalability, (ii) reliability, (iii) portability and (iv) low operational cost.
7. The key can be synthetic or auto-generated.
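The behaviour described above can be sketched with a minimal in-memory key-value store. The class name, method names and sample records below are illustrative assumptions, not the API of any particular product.

```python
# Minimal in-memory key-value store sketch (illustrative only).
class KVStore:
    def __init__(self):
        self._data = {}          # key -> opaque value (BLOB-like)

    def put(self, key, value):
        self._data[key] = value  # any data type may be stored as the value

    def get(self, key):
        # The whole value comes back as a single item; there is no
        # 'where'-style clause to filter inside the value.
        return self._data.get(key)

store = KVStore()
store.put("ashish", "Category: Student; Class: B.Tech.; Semester: VII")
store.put("mayuri", "Category: Student; Class: M.Tech.")
print(store.get("ashish"))
```

Note that, as in the limitations listed for this pattern, a lookup by anything other than the exact key (for example, "all students in semester VII") is not possible without scanning every value.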
The key is flexible and can be represented in many formats: (i) artificially generated strings created from a hash of a value, (ii) logical path names to images or files, (iii) REST web-service calls (request-response cycles), and (iv) SQL queries.

Limitations of the key-value store architectural pattern are:
1. No indexes are maintained on values; thus a subset of values is not searchable.
2. A key-value store does not provide traditional database capabilities, such as atomicity of transactions, or consistency when multiple transactions are executed simultaneously. The application needs to implement such capabilities.
3. Maintaining unique values as keys may become more difficult when the volume of data increases. One cannot retrieve a single result when a key-value pair is not uniquely identified.
4. Queries cannot be performed on individual values. No clause like 'where' in a relational database is usable to filter a result set.

Table 3.2 Traditional relational data model vs. the key-value store model

Traditional relational model                 | Key-value store model
Result set based on row values               | Queries return a single item
Values of rows for large datasets are indexed | No indexes on values
Same data type values in columns             | Any data type values

Typical uses of a key-value store are: (i) image store, (ii) document or file store, (iii) lookup table, and (iv) query-cache.

Riak is an open-source Erlang-language data store. It is a key-value data store system.

[Figure: Four ways of handling Big Data problems]

Following are the ways:
1. Evenly distribute the data on a cluster using hash rings: Consistent hashing refers to a process where the datasets in a collection are distributed using a hashing algorithm which generates the pointer for a collection.
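The consistent-hashing idea introduced in point 1 can be sketched as a hash ring: each node is hashed onto a ring of integers, and a key is stored on the first node clockwise from the key's own hash. The node names and the choice of MD5 below are assumptions for the example, not a prescription.

```python
# Consistent-hash-ring sketch (illustrative, not a production design).
import bisect
import hashlib

def ring_hash(s):
    # Map any string to a point on the ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Sorted (hash, node) pairs form the ring.
        self._ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key):
        h = ring_hash(key)
        points = [p for p, _ in self._ring]
        i = bisect.bisect(points, h) % len(self._ring)  # wrap around the ring
        return self._ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("students/ashish")  # deterministic placement
```

Because placement depends only on the hashes, any client holding the same ring map locates the same node for a given key, which is what lets clients, resource managers or scripts route requests consistently.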
Using only the hash of the Collection_ID, a node determines the data location in the cluster. A hash ring refers to a map of hashes with node locations. The client, resource manager or scripts use the hash ring for data searches in Big Data solutions. The ring enables the consistent assignment and usage of a dataset to a specific processor.
2. Use replication to horizontally distribute the client read-requests: Replication means creating backup copies of data in real time. Many Big Data clusters use replication to make retrieval of data failure-proof in a distributed environment. Using replication enables horizontal scaling out of the client requests.
3. Move queries to the data, not the data to the queries: Most NoSQL data stores use cloud utility services (large graph databases may use enterprise servers). Moving client-node queries to the data is efficient, as well as a requirement, in Big Data solutions.
4. Distribute queries to multiple nodes: Client queries for the DBs are analyzed at the analyzers, which evenly distribute the queries to data nodes/replica nodes. High-performance query processing requires the use of multiple nodes. The query execution takes place separately from the query evaluation (evaluation means interpreting the query and generating a plan for its execution sequence).

Data types which a MongoDB document supports:

Data type   | Description
Double      | Represents a float value.
String      | UTF-8 format string.
Object      | Represents an embedded document.
Array       | Sets or lists of values.
Binary data | String of arbitrary bytes to store images, binaries.
Object id   | ObjectIds (MongoDB document identifier, equivalent to a primary key) are small, likely unique, fast to generate, and ordered. The value consists of 12 bytes, where the first four bytes are a timestamp that reflects the instant when the ObjectId was created.
Boolean     | Represents a logical true or false value.
Date        | BSON Date is a 64-bit integer that represents the number of milliseconds since the Unix epoch (Jan 1, 1970).
Null        | Represents a null value. A value which is missing or unknown is Null.
Regular Expression | RegExp maps directly to a JavaScript RegExp.
32-bit integer | Numbers without decimal points are saved and returned as 32-bit integers.
Timestamp   | A special timestamp type for internal MongoDB use, not associated with the regular Date type. Timestamp values are 64-bit values, where the first 32 bits are the time t (seconds since the Unix epoch) and the next 32 bits are an incrementing ordinal for operations within a given second.
64-bit integer | Numbers without a decimal point are saved and returned as 64-bit integers.
Min key     | MinKey compares less than all other possible BSON element values and exists primarily for internal use.
Max key     | MaxKey compares greater than all other possible BSON element values and exists primarily for internal use.

MongoDB querying commands:

Command                   | Description
mongo                     | Starts the MongoDB client (mongod is the server). The default database in MongoDB is test.
db.help()                 | Runs help; displays the list of all commands.
db.stats()                | Gets statistics about the MongoDB server.
db.<collection>.find()    | Views all documents in a collection.
db.<collection>.update()  | Updates a document.
db.<collection>.remove()  | Deletes a document.

Replication: Replication ensures high availability in Big Data. The presence of multiple copies on different database servers makes the DBs fault-tolerant against any database server failure. Multiple copies of data also help in localizing the data and ensure its availability in a distributed system environment. MongoDB replicates with the help of a replica set. A replica set in MongoDB is a group of mongod (MongoDB server) processes that store the same dataset. Replica sets provide redundancy and high availability. A replica set usually has a minimum of three nodes. Any one of them is called the primary. The primary node receives all the write operations.
All the other nodes are termed secondary. The data replicates from the primary to the secondary nodes. A new primary node can be chosen among the secondary nodes at the time of automatic failover or maintenance. The failed node, when recovered, can join the replica set again as a secondary node.

Command       | Description
rs.initiate() | Initiates a new replica set.
rs.conf()     | Checks the replica set configuration.
rs.status()   | Checks the status of a replica set.
rs.add()      | Adds members to a replica set.

Figure 3.13 shows a replicated set after creating three secondary members from a primary member.

Auto-sharding: Sharding is a method for distributing data across multiple machines in a distributed application environment. MongoDB uses sharding to provide services to Big Data.

Limitations of Hive are:
- It is not a full database. The main disadvantage is that Hive does not provide update, alter and deletion of records in the database.
- It is not developed for unstructured data.
- It is not designed for real-time queries.
- It performs the partition always from the last column.

[Figure 4.10 Hive architecture: web browser, user applications and JDBC/ODBC applications connect through the Thrift Server, CLI and Web Interface to the Driver and Metastore]

Components of Hive architecture are:
- Hive Server (Thrift): An optional service that allows a remote client to submit requests to Hive and retrieve results. Requests can use a variety of programming languages. The Thrift Server exposes a very simple client API to execute HiveQL statements.
- Hive CLI (Command Line Interface): A popular interface to interact with Hive. Hive runs in local mode, using local storage instead of HDFS, when running the CLI on a Hadoop cluster.
- Web Interface: Hive can also be accessed using a web browser. This requires an HWI Server running on a designated node. The URL http://hadoop:<port>/hwi can be used to access Hive through the web.
- Metastore: The system catalog.
All other components of Hive interact with the Metastore. It stores the schema or metadata of tables, databases, columns in a table, their data types and the HDFS mapping.
- Hive Driver: Manages the life cycle of a HiveQL statement during compilation, optimization and execution.

Schema-Less Database
The schema of a database system refers to designing a structure for the datasets and the data structures for storing them in the database. NoSQL data does not necessarily have a fixed table schema. The systems do not use the concept of a join (between distributed datasets). A cluster of highly distributed nodes manages a single large data store with a NoSQL DB. Data written at one node replicates to multiple nodes; therefore, these are identical, fault-tolerant and partitioned into shards. Distributed databases can store and process a set of information on more than one computing node.

[Figure 3.2 Characteristics of the schema-less model]

Increasing Flexibility for Data Manipulation
A NoSQL data store possesses the characteristic of increasing flexibility for data manipulation. New attributes can be incrementally added to the database. Late binding of them is also permitted.

BASE Properties
BA stands for basic availability, S stands for soft state and E stands for eventual consistency.
1. Basic availability is ensured by the distribution of shards (many partitions of a huge data store) across many data nodes with a high degree of replication. Then, a segment failure does not necessarily mean complete data store unavailability.
2. Soft state ensures processing even in the presence of inconsistencies, while achieving consistency eventually. A program suitably takes into account any inconsistency found during processing.
NoSQL database design does not consider the need for consistency all along the processing time.
3. Eventual consistency means that the consistency requirement in NoSQL databases is met at some point of time in the future. Data converges eventually to a consistent state, with no time-frame specification for achieving it. ACID rules require consistency all along the processing, on completion of each transaction; BASE does not have that requirement and has flexibility.

Key-Value Pair
Hadoop MapReduce uses key-value pairs as input and output. Data should first be converted into key-value pairs before it is passed to the Mapper, as the Mapper only understands key-value pairs of data. Key-value pairs in Hadoop MapReduce are generated as follows:
- InputSplit: Defines a logical representation of the data and presents a split of the data for processing at an individual map().
- RecordReader: Communicates with the InputSplit and converts the split into records, which are in the form of key-value pairs in a format suitable for reading by the Mapper. The RecordReader uses TextInputFormat by default for converting data into key-value pairs. The RecordReader communicates with the InputSplit until the file is read.

[Figure 4.5 Key-value pairing in MapReduce]

Figure 4.5 shows the steps in MapReduce key-value pairing. Generation of a key-value pair in MapReduce depends on the dataset and the required output. The functions use the key-value pairs at four places: map() input, map() output, reduce() input and reduce() output.

Grouping by Key
When a map task completes, the Shuffle process aggregates (combines) all the Mapper outputs by grouping the keys of the Mapper output, and the values v2 are appended in a list of values. A "Group By" operation on the intermediate keys creates this list.

Shuffle and Sorting Phase
All pairs with the same group key (k2) are collected and grouped together, creating one group for each key.
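The map, shuffle/group-by-key and reduce steps above can be sketched in pure Python with the classic word-count example. Hadoop itself is not involved; the function names and the sample lines are assumptions made for the illustration.

```python
# Word-count sketch of the map -> shuffle/group-by-key -> reduce flow.
from collections import defaultdict

def map_fn(_, line):                 # map() input: (k1, v1)
    for word in line.split():
        yield word, 1                # map() output: (k2, v2)

def shuffle(pairs):                  # group all v2 values sharing the same k2
    groups = defaultdict(list)
    for k2, v2 in pairs:
        groups[k2].append(v2)
    return sorted(groups.items())    # sorting phase: one group per key

def reduce_fn(k2, values):           # reduce() input: (k2, [v2, ...])
    return k2, sum(values)           # reduce() output: (k3, v3)

lines = ["big data big analytics", "big data"]
mapped = [p for i, line in enumerate(lines) for p in map_fn(i, line)]
result = dict(reduce_fn(k, vs) for k, vs in shuffle(mapped))
print(result)   # {'analytics': 1, 'big': 3, 'data': 2}
```

The four places where key-value pairs appear are marked in the comments: map() input and output, and reduce() input and output.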
