Bda - Unit 2

BDA unit 2 Notes

Uploaded by

VISHWA PRIYA I

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF or read online on Scribd

0% found this document useful (0 votes)

128 views30 pages

Bda - Unit 2

BDA unit 2 Notes

Uploaded by

VISHWA PRIYA I

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF or read online on Scribd

You are on page 1/ 30

UNIT II NoSQL Data Management Syllabus Introduction to NoSQL - aggregate data models - key-value and document data models - relationships - graph databases - schemaless databases - materialized views - distribution models - master-slave replication - consistency - Cassandra - Cassandra data model - Cassandra examples ~ Cassandra clients Contents 24 22 23 24 25 26 27 28 Introduction to NoSQL Aggregate Data Models Schemaless Databases Materialized Views Distribution Models Consistency Cassandra ‘ Two Marks Questions with Answers (2-1)NoSQL Di j Big Data Analytics 2:2 SOF Pata Managemen, ERG introduction to NoSQL | NoSQL means Not Only SQL, it solves the problem of handling huge volume o, data that relational databases cannot handle. NoSQL databases are schema free and are non-relational databases. Most of the NoSQL databases are open source, | NoSQL is also type of distributed database, which means that information jg | copied and stored on various servers, which can be remote or local. This ensures | availability and reliability of data, If some of the data goes offline, the rest of the database can continue to run, | NoSQL encompasses polymorphic data. structured data, semi-structured data, unstructured data ang | No SQL database provides a mechanism for storage ‘and retrieval of data that employs less constrained consistency models than traditional relational databases, NoSQL is a response to nowadays business data related factors : 1. Volume and velocity, referring to the ability to handle large datasets that arrive quickly; 2. Variability, referring to how diverse data types don't fit into structured tables; Agility, referring to how fast an organization responds to business changes. NoSQL databases are very often referred to as data stores rather than data-bases, NoSQL. sy computer s ems work on multiple processors and can run on low-cost separate stems - No need for expensive nodes to get high-speed performance, It supports linear scalability. Every time we add more processors, we get a consistent increase in performance. History of NoSQL = © The acronym NoSQL was first used in 1998 by Carlo Strozzi while naming, his lightweight, open-source "relational" database that did not use SQL. The name came up again in 2009 when Erie Evans and Johan Oskarsson used It to describe non-relational database » Relational databases are oft referred to as SQL systems, The term NoSQL can mean either No SQL systems” or the more commonly accepted translation of “Not only SQ1," to emphasize the fact wome systems might support SQLlike query languages, # NoSQL de ped at Teast Jn the beginning ana reaponne to web data, the need for processing unstructured data and the need for faster processing: The NoSQL model uses a distributed databane system, meaning a nystem with multiple computers. TEGHMIOAL PUBLICATIONS” « an up-thrust for KnowledgeBig Data Analytics 2-3 NoSQL Data Management Not only can NoSQL systems handle both structured and unstructured data, but they can also process unstructured big data quickly. This led to organizations such as Facebook, Twitter, LinkedIn, and Google adopting NoSQL systems, These organizations process tremendous amounts of unstructured data, coordinating it to find patterns and gain business insights. Big data became an official term in 2005, Why NoSQL 7 It can handle large volumes of structured, semi-structured and unstructured data. = Agile sprints, quick iteration and frequent code pushes. = Object-oriented programming that is easy to use and flexible. = Scale-out architecture, Types of NoSQL Stores : 1 Column Oriented (Accumulo, Cassandra, Hbase) 2, Document Oriented (MongoDB, Couchbase, Clusterpoint) 3. 4, Graph (Allegro, Neodj, OrientDB) Key-value (Dynamo, MemeacheDB, Riak) NoSQL databases are guaranteed to adhere to two of the CAP properties. Such databases are of several types, 1, Key-value store : Stores in the form of a hash table (Example - Rink, Amazon $3 (Dynamo), Redis) 2, Document-based store : Stores objects, mostly JSON, which is web friendly or supports ODM (Object Document Mappings). (Example - CouchDB, MongoDB) 3. Column-based store : Ench storage block contains data from only one column (Example - HBase, Cassandra) 4. Graph-based : Graph representation of relationships, mostly used by social networks, (Example - NeodJ) The Dofinition of Four Types of NoSQL Databases NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases, NoSQL Is often interpreted as Not-only-SQL to emphasize that they may also support SQL-like query languages. Most NoSQL databases are designed to store Inege quantities of data in a fault-tolerant way, NoSQL tn simply the term that Is used to describe a family of databases that are All nor-relational, While the technologies, data types and use casen vary widely Smount them, it I generally agreed that there are four types of NoSQL databases s TECHNICAL PUBLICATIONS! 40 upethruat for knowledge<4 NoSQL Date Menagon, Big Data Anelytics | using any of four primary aul 7, manage information © NoSQL databases can 8 mn based and graph based, “4, models : Key-value store, document-based, colu [ERE Example and Advantages © Examples of NoSQL databases a) Apache CouchDB, an open source, JSON document-based database that y,. JavaScript as its query language. | b) Elasticsearch, a document-based database that includes a full-text search enging | ©) Couchbase, a key-value and document database that empowers developers 1. build responsive and flexible applications for cloud, mobile and edgy computing. t Advantages a) NoSQL databases have a simple and flexible structure. They are schema-free, b) NoSQL databases are based on key-value pairs. ©) Some store types of NoSQL databases include column store, document store key value store, graph store, object store, XML, store and other data store modes. | | @) Each value in the database has a key. Some NoSQL database stores also allow developers to store serialized objects into the database, not just simple string) values. e) Open-source NoSQL databases do not require expensive licensing fees and can| run on inexpensive hardware, rendering their deployment cost-effective. | Disadvantages : | 2) Most NoSQL databases do not support reliability features that are natively! supported by relational database systems. | b) In order to support reliability and consistency features, developers mus implement their own proprietary code, which adds more complexity to the /\ system. CAP Theorem JS Fig. 2.1.1 shows the three properties of the CAP theorem. © The theorem states that distributed data systems will offer a trade-off between” consistency, availability and partition tolerance. And, that any database can only guarantee two of the three properties : f © Consistency : Every node in the cluster responds with the most recent data, even” if the system must block the request until all replicas update. If you query * TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeBig Data Analytics NoSQL Data Management Avaliablity Partition tolerance Fig. 2.1.1 CAP theorem “consistent system” for an item that is currently updating, you will wait for that i response until all replicas successfully update. However, you will receive the most current data. Availability : Every node returns an immediate response, even if that response is | not the most recent data. If you query an “available system" for an item that is j updating, you will get the best possib le answer the service can provide at that moment. Partition tolerance : Guarantees the system continues to operate even if a : replicated data node fails or loses connectivity with other replicated data nodes. Comparison of SQL and NoSQL Databases Sr. No. SQL 1 SQL databases are relational. 2. SQL databases are vertically scalable. SQL databases use strisctured language and have a predefined | Lan as ae 4. SQL databases are table-based. NoSQL databases are: “document, _key-value, graph, or wide-column stores, | SQL databases are better for multi-row while NoSQL is better for unstructured transactions. data like documents or JSON. at We a | EBV Keshrosat Data Models * Aggregate means a collection of objects that are treated as a unit. In NoSQL Databases, an aggregate is a collection of data that interact as a unit. Moreover, these units of data or aggregates of data form the boundaries for the ACID operations. ao TECHNICAL PUBLICATIONS® - an up-thrust for knowiedgeData Ansitios 2-6 NoSQL Data Menagom en . dseeee's a models in NoSQL make it easier for the databases to manage = a a a as the aggregate data or unit can now reside on se sane nats, Whenever data is retrieved from the databate all the da i S aggregate data models in NoSQL. er eon in NoSQL do not support ACID transactions and SACtificg pan ot toe AC Properties. With the help of aggregate data models in Nosqy cen easily perform OLAP operations on the database. We can achieve higy, efficiency of the aggregate data models in the NoSQL database if the d transactions and interactions take place within the same aggregate. .: Key-value Store * In the key-value structure, the key is usually a simple string of characters and the value is 2 series of uninterrupted bytes that are opaque to the database. Key-value store is like 2 relational database with only two columns : The key or attribute name and the value. Fig. 22.1 shows key-value store. Key Value oo > [sereete_] Aadher 222222222222 cay —— it Fig. 2.2.1 Key-value store 2. comes © Saves data as a group of key value pairs, which are made up of two data items that are linked. The link between the items is a "key" which acts as an identifier for an item within the data and the "value" that is the data that has been identified. . The data itself is usually some primitive data type (string, integer, array) or a more complex object that an application needs to persist and access directly. This replaces the rigidity of relational schemas with a more flexible data model that allows developers to easily modify fields and object structures as. their applications evolve. ¢ systems treat the data as a single opaque collection which may have Key valu different fields for every record. In each key value pair, a) The key is represented by an arbitrary string b) The value can be any kind of data like an image, file, text or document. TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeBig Data Analytics 2-7 NoSQL Data Management « In general, key-value stores have no query language. They simply Provide a way to store, retrieve, and update data using simple GET, PUT and DELETE commands. The simplicity of this model makes a key-value store fast, easy to use, scalable, portable and flexible. Advantages of key value stores : 2 ‘i a) The secret to its speed lies in its simplicity. The Path to retrieve data is a direct request to the object in memory or on disk. b) The relationship between data does not have to be calculated by a query language, there is no optimization performed. ©) They can exist on distributed systems and do not need to worry about where to store indexes. Disadvantages of key value stores : a) No complex query filters b) All joins must be done in code ©) No foreign key constraints 4) No trigger. Document-based * A document is an object and keys (strings) that have values of recognizable types, including numbers, Booleans and strings, as well as nested arrays and dictionaries. All data is stored in one table, so there is no need for cross-referencing and instead of storing information in a table, it is stored in a document. *° Fig. 2.2.2 shows document based data model. Documents Documents Fig. 2.2.2 Document based data model * Document databases are designed for flexibility. They are not typically forced to have a schema and are therefore easy to modify. TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeBig Date Analytics 2-8 (E283 column-based < amounts of data, document databases are a good oP! NoSQL Data Menogn,. | nt o store varying attributes along with jg, ion. Be including XML and JSON, 7, ypedance match. is If an application requires the ability t Document stores work with multiple formats in¢ allows for storage and retrieval of data without an #™ Terminologied in document data store are as follows : a) A table is called a collection b) A row is called a document c) A column in called a field. Typical use cases for document stores include the storage blog posts, news articles and data analysis. MongoDB and Apache CouchDB are databases. Do not use document databases for transactions across multiple documents and retrieval of catalog, examples of popular document-baseg (records) and Ad hoc cross-document queries. Advantages of document based model : a) Faster retrieval of data. b) Dynamic architecture for unstructured data and storage options ¢) Sharing for horizontal scalability 4) Replication is managed internally, so chances of accidental loss of data is negligible. Disadvantages of document data model : a) No views, triggers, scripts or stored procedure. b) Relationship not well defined. ©) No support for transactions, which could lead to data corruption. ! Column-based is also called 'wide column’ models enabling very quick data access using a row key, column name and cell timestamp. The flexible schema of these types of databases means that the columns do nt have to be consistent across records and you can add a column to specific 1o¥8 | without having to add them to every single record. | | It is also called a two-level map as it offers a two-level aggregate structure. ‘As data is organized into columns, we have better indexing compared to oth key-value stores. Also, when it comes to updates, multiple column block update can be aggregated. TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeig Data Analytics 2-9 NoSQL Data Management Column store databases were born when of a Column store NoSQL database called well-known Google e-mail service, NoSQL Database. The wide, columnar storés data model, like derived from Google's BigTable paper. Google open sourced its implementation Big Table. Apparently, the data for the Gmail, is stored in the Google Big Table that found in Apache Cassandra, are Organizations mostly use Column data stores for data warehousing and data processing, which is evident in services suchas Amazon Redshift. Advantages of column data stores : a) Column stores are very efficient at data compression and/or partitioning. b) Columnar stores can be loaded extremely fast, ©) Columnar databases are very scalable. 4) Due to their structure, columnar databases perform particularly well with aggregation queries. Disadvantages of column data store : - a) Updates can be inefficient. The fact that columnar families group attributes, as opposed to rows of tuples, works against it. b) If multiple attributes are touched by a join or query, this may also lead to column storage experiencing slower performance. ©) It is also slower when deleting rows from columnar systems, as a record needs to be deleted from each of the record files. EJ craph-based The modern graph database is a data storage and processing engine that makes the persistence and exploration of data and relationships more efficient. Graph-based data models store data in nodes that are connected by edges. These Aggregate Data Models in NoSQL are widely used for storing the huge volumes of complex aggregates and multidimensional data having many interconnections between them. In graph theory, structures are composed of vertices and edges, or what would later be called "data relationships". Graphs behave similarly to how people think, in specific relationships between discrete units of data. This database type is particularly useful for visualizing, analyzing, or helping to find connections between different pieces of data. TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeBig Dota Analytics 2-10 : No standard language. J joSQL Key/Value Database : MongoDB . NoSQL Data Me 1eG0m on | } ologies for recommendation en | ines | ies of graph-based NoSQL dat Binoy | As a result, businesses leverage graph tech Abate fraud analytics and network analysis. Examp! include Neodj and JanusGraph. tions, social media any analyze customer interac Graph databases can be used to 0 traverse Jong relationship graph |, | scientific applications where it is crucial ( | better understand data. | Advantages of graph data : a) More descriptive queries | b) Greater flexibility in adapting your model ¢) Greater performance when traversing data relationships. | Disadvantages of graph data stores : i a) Difficult to scale | MongoDB is an open-source document database that provides high performance, | high availability and automatic scaling. MongoDB is one of the most popular open-source NoSQL databases written in C++. As of February 2015, MongoDB is | the fourth most popular database management system. It was developed by a | company 10gen which is now known as MongoDB Inc. Why use MongoDB ? } a) Simple queries i b) Functionality provided applicable to most web applications c) Easy and fast integration of data d) No ERD diagram ¢) Not well suited for heavy and complex transactions systems. t MongoDB did not provides any command to create a “database”. Actually, you do | not need to create it manually, because, MangoDB will create it on the fly, during — the first time you save the value into the defined collection (or table in SQL) and database. | MongoDB is a document-oriented database which stores data in JSON-like } documents with dynamic schema. It means you can store your records without! worrying about the data structure such as the number of fields or types of fields to store values. MongoDB documents are similar to JSON objects. } MongoDB stores data records 2s BSON documents, BSON is a binary | representation of JSONdocuments, though it contains more data types than JSON. | , TECHNICAL PUBLICATIONS” - an up-thrust for knewdodyoBoa Anas aq NoSQL Data Management « MongoDB stores data in documents in-spite of tables, You can change the structure of records simply by adding new fields or deleting existing ones. This ability of MongoDB helps you to represent hierarchical relationships, to store arrays and other more complex structures easily. + MongoDB uses Mongo server and Mongo shell commands to fetch records or the information from the database (i.e. collections). Few areas where MongoDB is ideal are big data, user data management, mobile and social infrastructure, content management and delivery, data hub. + A MongoDB instance may have zero or more databases. A database may have zero or more ‘collections’. A collection may have zero or more ‘documents’. A document may have one or more ‘fields’. MongoDB ‘Indexes’ function much like their RDBMS counterparts. + Database is a physical container for collections. Collection is a group of documents and is similar to an RDBMS table, A document is a set of key-value pairs. Documents have dynamic schema. + MongoDB documents are composed of field-and-value pairs and have the following structure : fleld1: valuel, flold2; value2, fleld3: valuo3, floldN: valueNy } * The value of a field can be any of the BSON data types, including other documents, arrays and arrays of documents. MongoDB supports many data types such as : String, integer, boolean, double, arrays, timestamp, object, null, symbol, date, code and binary data. * Fig. 2.2.3 shows relation between SQL terms and MangoBD terms. weve Fig. 2.2.3 Relation between SQL terms and MangoBD terms TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeBig Data Analytics NoSQL Data Managemen supports Ad hoc queries * MongoDB uses MongoDB language and ee f MongoDB that helps it 4, replication and sharding. Sherding is a feature o| operate as a distributed data system. * Sharding is used by MongoDB to store data across multiple machines. Tt use. horizontal scaling to add more machines to distribute data and operation with Tespect to the growth of load and demand. * Sharding arrangement in MongoDB has mainly three components : Fig. 22.4 Sharding by MongoDB 2) Shards or replica sets : Each shard serves 2s a separate replica set. They store all the date. They target to increase the consistency and availability of the data. are like the managers of the clusters. These metadatz. They actually have the mapping of the The b) Configuration servers router is mongo instances which serve as interfaces y take in the user queries from the applications and ne required results serve the applications v Advantages of MangoDB : »goDB is 2 schema - less document type database. © MongoDB supports field, range based query, regular expression for searching the from the stored data. TECHNICAL PUBLICATIONS® - an upthnust for knowtedgeBig Dato Analytics 2-13 NoSQL Data Management MongoDB is very easy to scale up or down, It uses internal memo, ry for storing the working temporary datasets for which it is much faster. MongoDB support primary and Secondary indexes on any field. MongoDB supports replication of databases, MongoDB can be used as a file storage system which is known as a GridES, EE] schemaless Databases Since NoSQL does not require a schema, there is no blueprint on how data should be stored and therefore varies between databases, Generally, there are two ways that NoSQL data storage functions : 1. On-the-disk using B-Trees, with the top of it being permanently in RAM. 2. Inmemory where it is all on RAM using RB-Trees and anything stored on the disc is just an append. Schemaless databases are a type of No: Predefined schema or structure for data. This means that data can be inserted and retrieved without adhering to a specific structure and the database can adapt to changes in data over time without requiring schema migrations or changes. Schemaless database manages information without the need for a blueprint. The onset of building a schemaless datab jase does not rely on conforming to certain fields, tables, or data model structures. SQL databases that do not have a There is no Relational Database Management System (RDBMS) to enforce any specific kind of structure. In other words, it is a nor-relational database thet can handle any database type, whether that be a key-value store, document store, in-memory, column-oriented, or graph data model. In actuality, there is no such thing as schema-less dataset : 1. In a relational database, the schema is explicit and created separately in advance. 2. In column-based databases, we create a fresh schema for each row and in fact, We often reuse schema fragments from rows that are grouped together. The same is true for document databases. 3. Im column-based and also in document databases, users directly query data based on the schema. 4. In graph-based databases, we are in essence building the schema as we build the data. TECHNICAL PUBLICATIONS® - an upthrust for knowledgeig Dota Analytics et4 NoSQL Data Monogeme | ony | In schemaless databases, information is stored in JSON-style documents which «, have varying sets of fields with different data types for each field. So, a collects. I { could look like this : | t name : "Joe", ago : 30, interests : ‘football’ } a t name : ‘Kate’, age : 25 In the above condition, the data itself normally has a fairly consistent structuy | With the schemaless MongoDB database, there is some additional structure, tp, | system namespace contains an explicit list of collections and indexes. Collection, may be implicitly or explicitly created, indexes must be explicitly declared. | Benefits of using schemaless databases : 1. Flexibility : Schemaless databases allow for greater flexibility in data modeling, | 2. Scalability : Schemaless databases are designed for scalability, ag they can handle large amounts of unstructured data with ease. 3. Reduced complexity : Schemaless databases can reduce the complexity of data| modeling and development. | 4. Good support for non-uniform data. | Disadvantages : 1, Potentially inconsistent names and data types for a single value, 2. Management of the implicit schema migrates into the application layer. Ba Materialized Views | Materialized views solve the problem of views. The views provide a mechanism t hide from the client whether data is derived data or base data, Views are used when data is to be accessed infrequently and the data in a table gets updated on a_ frequent basis. 7 ‘A materialized view is a replica of a target master from a single point in time. The * master can be either a master table at a master site or a master materialized view | at a materialized view site. A materialized view is like a cache, a copy of the dalt_ that can be accessed quickly. | If a regular view is a saved query, a materialized view is a saved query plus it results stored as a table. } NoSQL databases do not have views, they may have precomputed and cached queries and they reuse the term "materialized view" to describe them NoSql. | TECHNICAL PUBLICATIONS® - an up-hrust for knowiadgo |Big Data Analytics 15 NoSQL Data Management We can use mat ized vi i 1. Ease network lower 4 VMS ' achieve one or more of the following goals : 2. Create a mass deployment environment 3. Enable data subsetting 4. Enable disconnected ¢ * Two methods are used f ‘or building a materialized view « 1. Eager approach : user . eviews: ‘omputing, it becomes more difficult and « bigger server to run the database on. ERI single server * Database is run on a single machine which handles all the reads and writes to the data store. Organizations prefer a single server because it eliminates all the complexities that the other options introduce. * Single server is eas 'y to manage for application developers. Lot of NoSQL databases are designe d around the idea of running on a cluster, it can make sense to use NoSQL with a single-server distribution model if the data model of the NoSQL store is more suited to the application. Single-server configuration is suitable for graph-database. If data usage is mostly about Processing aggregates, then a single-server document or key-value store may be useful. EEF sharding * Sharding is a method for distributing a single dataset across multiple databases, which can then be stored on multiple machines. This allows for larger datasets to be split into smaller chunks and stored in multiple data nodes, increasing the total storage capacity of the system. TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeBig Date Analytics 2-16 NoSQL Data Managemen q as horizontal scaling or scale-out, he load. Horizontal scaling allows | intense workloads. tf Sharding is a form of scaling known additional nodes are brought on to share # near-limitless scalability to handle big data and Sharding is also known as data partitioning. Many NoSQL databases of, | auto-sharding. Fig. 2.5.1 shows Sharding. -|—_ eee Fig. 2.5.1 Sharding Sharding is the process of splitting a large dataset into many small partitions which are placed on different machines. Each partition is known as a “shard”, Each shard has the same database schema as the original database. Most data is distributed such that each row appears in exactly one shard. The combined data from all shards is the same as the data from the original database. The load is balanced out nicely between servers, for example, if we have five servers, each one only has to handle 20 % of the load. The NoSQL framework is natively designed to support automatic distribution of the data across multiple servers including the query load. Both data and query replacements are automatically distributed across multiple servers located in the | different geographic regions and this facilitates rapid, automatic and transparent — replacement of the data or query instances without any disruption. Sharding is particularly valuable for performance because it can improve both read i and write performance. Using replication, particularly with caching, can greatly improve read performance but does little for applications that have a lot of writes. Advantages of Sharding a) Faster performance : There are more servers available to handle input/output. b) Horizontal scaling : We can quickly add additional servers to a cluster. ©) Costs : Horizontal scaling can often be less expensive than vertical scaling. d) Distribution/uptime : A horizontally scaled distributed database can achieve better uptime than a traditional single server. TECHNICAL PUBLICATIONS® - an up-thrst for knowledgeBig Data Analytics a 2-17 NoSQL Data Management © Disadvantages of Sharding a) Complexity : Depending on the database system, sharding complexity can vary. b) Rebalancing : When adding additional machines to a cluster, the shards will likely need to be rebalanced to distribute data evenly. c) Increased infrastructure costs. Master-slave Replication * We replicate data across multiple nodes. One node is designed as primary (master), others as secondary (slaves). Master is responsible for processing any updates to that data. A replication process synchronizes the slaves with the master. Master is the authoritative source for the data. It is responsible for processing any updates to that data. Masters can be appointed manually or automatically. Slaves is a replication process that synchronizes the slaves with the master. After a failure of the master, a slave can be appointed as new master very quickly. Fig. 2.5.2 shows master-slave replication. All updates aro, Read made to master Masse —— Changés propogates to all slaves Fig. 2.5.2 Master-slave replication Master-slave replication is most helpful for scaling when we have a read-intensive dataset. It will scale horizontally to handle more reads. This design offers read resilience. Even if one or more of the servers fails, the remaining servers can keep offering read access. This can help a lot with read-heavy applications, but will offer little benefit to write-intensive applications. As the slaves are exact replicas of the master server, one of them can assume the role of the master in case the master fails. In fact most of the time you can simply create a set of nodes and have them automatically decide who would be the master. There are some consistency issues that occur due to the delay in updating between master and slaves. TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeBig Date Analytics 2-18 NoSQL Data Menagemen, automatically. In manual appointing d we configure one node as the * Masters can be appointed manually 0! Juster of nodes and they clog, performed when we configure our cluster am master. With automatic appointment, we create a ¢ one of themselves to be the master. * Problems of master-slave replication : 1. Does not help with scalability of writes 2. Provides resilience against failure of a slave, but not of a master 3. The master is still a bottleneck. EBERT Peer-to-Peer Replication Any node © In a peerto-peer replication setup the various nodes are all “equals’. can accept reads as well as writes and they communicate these writes to each other, In peer-to-peer replication updates on any one server are replicated to all other associated servers. Fig. 2.5.3 shows peer-to-peer replications. Requests: Node] Node] Node} Fig. 2.5.3 Peer-to-peer replications « The advantage of this setup is its read and write resilience. One node's failure does not cause problems, as the remaining nodes can continue their work without losing a beat. The problem that arises is that of consistency. For example we may have conflicting write requests that come to different nodes and then those nodes attempt to communicate those requests to the rest of the nodes. This could lead to considerable inconsistencies. There are various ways to resolve this problem. The most standard approach would be to have the replicas communicate their writes first before they “accept, them. Once a majority of the replicas has confirmed a write, it can now be considered as having been successfully performed and a response sent to the client. This requires a certain amount of network traffic in coordinating these writes. TECHNICAL PUBLICATIONS® - an up-thrust for knowledgelg Date Anelytics 2-19 NoSOL Data Management There is a problem of write-write conflict. Two users can update different copies of the same record stored on different nodes at the same time is called a write-write conflict. EI combining Sharding and Replication « Sharding and replication can be combined to get a better response. If we use both master-slave replication With Sharding and Peer-to-peer replication with Sharding. 4, Master-slave replication and Sharding : We have multiple masters, but each data item only has a single master. « A-node can be a master for some data and a slave for others. 2, Peer-to-peer replication and Sharding : + A common strategy for column-family databases. « A good starting point for peer-to-peer replication is to have a replication factor of 3, so each shard is present on three nodes. [EGES Difference between Replication and Sharding Replication Sharding ‘sr. No The primary server node copies data Sharding : Handles horizontal scaling: onto secondary server nodes. This can across servers using a shard key. help increase data availability and act as | a backup, in case the primary server =| fails. serve multiple s Replication copies data across multiple Sharding distributes different data = Each __ places. of data can be found in multiple a subset of data. Each server acts as the single source for } de Replicated servers contain identical Sharded database servers each contain a | copies of the entire database. part of the overall data, ie. they store | | different data on separate nodes: More read AA Gonsistony * The CAP theorem is important when considering a distributed database, since we must make a decision about what we are willing to give up. The database we choose will lose either availability or consistency. Reading about NoSQL databases we can face the concept of quorum. A quorum is the minimal number of nodes that_must respond to a read or write operation to be considered complete. It can improve both reads and writes. TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeBig Date Analytics 2-20 NOSGt Pala Managemen Update Consistency OF course having a maximum quorum and querying all servers is the way we , determine the correct result. : Consistency can be simply defined by how the copies from the same data may vary quip, the same replicated database system, ‘ Nowadays systems need to scale, The “traditional monolithic databas, | architecture, based on a powerful server, does not guarantee the high availabitin and network partition required by today’s web-scale systems, as demonstrated py the CAP theorem. To achieve such requirements, systems cannot impose strong consistency. In the past, almost all architectures used in database systems were strongly consistent. In these cases, most architectures would have a single database instang only responding to a few hundred clients. Nowadays, many systems are accesseq by hundreds of thousands of clients, so there was a mandatory requirement tg system's architectures that scale. However, considering the CAP theorem, | high-availability and consistency do conflict on distributed systems when subjeq to a network partition event. Two users updating the same data item at the same time is called write-write conflict. When the writes reach the server of the two users, the server will serialize them and decide to apply one, then the other. First user's update would be applied and immediately overwritten by the second user. In this case first user's is a lost update. Here the lost update is not a big problem. We see this as a failure of consistency because second user's update was based on the state before first user's update, yet was applied after it. Approaches for maintaining consistency in the case of concurrency are often described as pessimistic or optimistic. A pessimistic approach works by preventing conflicts from occurring; an optimistic approach lets conflicts occur, but detects them and takes action to sort them out. For update conflicts, the most common pessimistic approach is to have write locks, so that in order to change a value we need to acquire a lock and the system ensures that only one client can get a lock at a time. 7 So both users would attempt to acquire the write lock, but only the first user would succeed, Second user would then see the result of the first user's write before deciding whether to make his own update. TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeBig Data Analytics 2-21 NoSQL Data Management + A common optimistic approach is a conditional update where any client that does an update tests the value just before updating it to see if it is changed since his last read. + Both the pessimistic and optimistic approaches rely on a consistent serialization of the updates and it is possible for a single server. « Two general solutions for write-write conflict are as follows : 1. Pessimistic approach : Preventing conflicts from occurring. Also acquire write locks before update. 2. Optimistic approach : Lets conflicts occur, but detects them and takes actions to resolve them. « If there are more than one server ie. peer-to-peer replication, then two nodes might apply the updates in a different order, resulting in a different value for the telephone number on each peer. Sequential consistency is used in distributed systems. * Optimistic way to handle a write-write conflict is to save both updates and record that they are in conflict. Replication makes it much more likely to run into write-write conflicts. If different nodes have different copies of some data which can be independently updated, then we will get conflicts unless we take specific measures to avoid them. Using a single node as the target for all writes for some data makes it much easier to maintain update consistency. [EQ Read Consistency * Problem : One user reads in the middle of another user's writing. + It is called read-write conflict, inconsistent reading. This leads to logical inconsistency, * In NoSQL databases, read consistency refers to the level of consistency between multiple read operations on the same data. In a distributed database, where data can be replicated across multiple nodes, ensuring read consistency can be challenging. * Aggregate-oriented databases do support atomic updates, but only within a single aggregate. This means that we will have logical consistency within an aggregate but not between aggregates. * The length of time an inconsistency is present is called the inconsistency window. A NoSQL system may have a quite short inconsistency window. * There are different levels of read consistency available in NoSQL databases, ranging from eventual consistency to strong consistency, TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeBig Date Analytics 2-22 NeS2L Data Manag, Eventual consistency allows for a cersin degree of Ineosistency t0 ocqy different replicas of data, In this model, the database guarantees that ay) “i . but it makes no guarantees aboys 4 Peay Will eventually propagate to all nodes, but ill be applied, ho this will take or about the order in which updates will De applted. Read-your-writes consistency means that once we have updated a Tecord, Yall our subsequent reads of that record will return the updated value, Nyy * Session consistency means read-yourswrites consistency but at session ley Session can be identified with a conversation between a client and a sere, long as the conversation continues, we will read everything we haye Wri during this conversation. If the session ends and we start another session, with _ same server, there is no guarantee that we can read values we haye Mite during previous conversation. * Session consistency is of two types : Sticky session and version stamps, Quorums * Quorum consistency is used in systems where consistency is more important thy availability (CAP theorem) for write and read. + In systems with multiple replicas there is a possibility that the user rea inconsistent data. This happens say when there are 2 replicas, N1 and N2 in cluster and 2 user writes value v1 to node NI and then another user reads fron! node N2 which is still behind Ni and thus will not have the value v1, so th second user will not get the consistent state of data. * In order to achieve a state where at least one node has consistent data we wi quorum consistency. © Fig. 2.6.1 shows write and read quorums Write (key, A) Write (key, of Bh (key, A) ‘Write (key, B),. (a) Write quorums Fig. 2.6.1 TECHNICAL PUBLICATIONS® ~ an up-thrust for knowledge2-23 NoSQL Data Management pig Dota Analytics Quorum is achieved when nodes follow the below protocol: w+r>n where n= Nodes in the quorum group, w = Minimum write nodes, 1 = Minimum read nodes Here w is our write quorum and r is our read quorum. oo Relaxing Durability 1a write is committed, the change is permanent. « Whe strict durability is not essential and it In some cases, (write performance). relax durability is to store data in memory and flush to disk we lose updates in memory. can be traded for scalability + A simple way to regularly. f the system shuts down, ERA Cejsancra base. It was initially developed by Facebook to Cassandra is a column NoSQL datat fulfill the needs of the company's Inbox Search services. In 2009, it became an ‘Apache Project. ‘Apache Cassandra is an open source, distributed, NoSQL database. Apache Cassandra is a distributed database system using a shared nothing architecture. Apache Cassandra was initially designed at Facebook using a Staged Event-Driven ‘Architecture (SEDA) to implement a combination of Amazon's Dynamo distributed storage and replication techniques and Google's Bigtable data and storage engine model. + A columnar database, also called a column-oriented database or a wide-column store, is a database that stores the values of each column together, rather than storing the values of each row together. * Columnar databases are well suited for big data processing, Business Intelligence (BD and analytics. * Cassandra provides tunable consistency i.e. users can determine the consistency level by tuning it via read and write operations. Cassandra enables users to configure the number of replicas in a cluster that must acknowledge a read or write operation before considering the operation successful. * Cassandra uses a gossip protocol to discover node state for all nodes in a cluster. Cassandra is designed to handle “big data" workloads by distributing data, reads and writes (eventually) across multiple nodes with no single point of failure. TECHNICAL PUBLICATIONS? an up-thrust for knowledgeBig Data Anaiytios 2. NoSQL. Data Managomont it allows to add more alata as per requirement, Features of Cassandra : 1) Blastic scalability: Cassandra is highly. seal) le; hardware te accommodate more customers andl More point of failure. 2) Always on architecture : Cassandra has no single is Hnearly scalable, ie, it increases 3) Fast linearseale performance : Cassandra in the cluster. throughput as we increase the number of node 4 Mlenible data storage : Cassandra accommodtates all poss incleding : Structured, semi-structured and unstructured 5) Transaction support : Cassandra supports properties Tike ACID. canta provides the flexibility to distribute data datacenters. ible data formats 6) Easy data distribution : C where you need by replicating data across multiple Cassandra Architecture © Components of C shows Cassandra architecture. Fig, 2.7.1 Cassandra architecture andra architecture are node, data center, cluster, commit log, Table, Bloom Filters and Cassandra query Janguage. memtable, Node ; A Cassandra node is a place where data is stored. Data center ; Data center is a collection of related nodes. Cluster ; A cluster is a component which contains one or more data centers. Commit log : In Cassandra, the commit log is a crash-recovery mechanism. Every write operation is written to the commit log. Mem+-table : A menvtable is a memory-resident data structure. After commit log, the data will be written to the mem-table, Sometimes, for a single-column family, there will be multiple memtables. TECHNICAL PUBLICATIONS® - an up-thrust for knowtedge5 © SSToble : tt ig 4 disk file te contents reach a thr shold vale ea nian Bloom fi ment is Bloo : ° Bloom filter ; Hers are very fast, nondete in whether an ole 4 member of a 4, ae filters are Accessed after every query, et. Tt is a Special algorithms for testing kind of cache, Bloom * Some of the features o} 1) Data in Cassandta i f Cassandra data model ai is stored as a 2) Tables are also called column fa re as follows ; Set of rows that are organized into tables, milies. 3) Bach row is identified by a primary key value, *) Data is partitioned by the primary key, Data modelling in Cassandra uses a Query-driven approach, in which specific Gueries are the key to organizing the data, The mein goal of Cassandra data modeling is to develop and design a hightperformance ang well-organized Cluster, Apache Cassandra data model components include keyspaces, tables and columns : a) Cassandra stores data as a set of rows organized into tables or column families b) A primary key value identifies each row ©) The primary key partitions data 4) We can fetch data in part or in its entirety based on the primary key. Cassandra data model provides a mechanism for data storage, The components of Cassandra data model are keyspaces, tables and columns, TECHNICAL PUBLICATIONS® - an upthrust for knowledgeNoSQL Data Man Big Data Analytics 2-26 =aemony © Fig, 2.7.2 shows Cassandra data model Cluster | KeySpacet Column familyt ‘Column femily4 L Column family2 Row Row Row Ro, | KeySpace2 | coun ICotumn2}| Cotumnt}|Cotumn2 cone coun Cotumng [cote cota Value |) Value |) Value | Fig. 2.7.2 Cassandra data model Value Value |} Value |) Value vaiue | Value 4. Keyspaces : * Ata high level, the Cassandra NoSQL data model consists of data containers called keyspaces. Keyspaces are similar’ to the schema in a relational database, ‘Typically, there are many tables in a keyspace. i © Features of keyspaces are : a) A keyspace needs to be defined before creating tables, as there is no default keyspace. a b) A keyspace can contain any number of tables and a table bélongs only td one keyspace. This represents a one-to-many relationship. ©) Replication is specified at the keyspace level. For example, replication of three implies that each data row in the keyspace will have three copies. f 2. Tables : j * Tables, also called column families in eatlier iterations of Cassandra, aré (defined within the keyspaces. Tables store data in a set of rows and contain a primary key and a set of columns. * Cassandra tables are used to hold the actual data in the form of rows and columns. A table in Cassandra must be created with the primary key during table creation time, post that it can not be altered. © To alter the table new tables should be created with existing data, The primary key would be used to locate and order the data, TECHNICAL PUBLICATIONS® - an up.tnrust for knowledgeBig Data Analytics 2-27 NoSQL Data Management * Some of the features of tables are : a) Tables have multiple rows and col P lumns. As mentioned earlier, a table is also called column family in the earlier versions of Cassandra. b) It is still referred to as column family in some of the error messages and documents of Cassandra. ¢) It is important to define a primary key for a table. 3. Columns : + Columns define data structure within a table. There are various types of columns, such as Boolean, double, integer and text. * Cassandra column is used to store a sin; gle piece of data. The column can consist of various types of data such as big inte; ger, double, text, float and Boolean. Each column value has a timestamp associated with it that shows the time of update. Cassandra provides the collection type of columns such as list, set and map. © Some of its features'are : a) Columns consist of various types, such as integer, big integer, text, float, double and Boolean. b) Cassandra also provides collection types such as set, list and map. ©) Further, column values have an associated time stamp representing the time of update. d) This timestamp can be retrieved using the function write time. Cassandra Clients * Thrift is the driver-level interface; it provides the API for client implementations in a wide variety of languages. Thrift was developed at Facebook. A Client holds connections to a Cassandra cluster, allowing it to be queried. Each Client instance maintains multiple connections to the cluster nodes, provides Policies to choose which node to use for each query and handles retries for failed query etc... Client instances are designed to be long-lived and usually a single instance is enough per application, As a given Client can only be “logged” into one keyspace at a time, it can make sense to create one client per keyspace used. This is however not necessary to query multiple keyspaces since it is always possible to use a single session with fully qualified table name in queries. The Cassandra cluster is denoted as a ring. The idea behind this representation is to show token distribution. TECHNICAL PUBLICATIONS® ~ an up-thrust for knowledgeaie NoSOL Date Managem, | iace to write the date to. The task oy ‘sczs) that are responsible to hold g. strategy to do that. ctlizes 2 replication axe idected, # sends the RowMutetion message to § - hace nodes, but # does not weit for all the reptie, the node waits west individually. Every node Gist writes the cad Gen write: the mutztion to the memiable . Sls is Gokhed to an immmnteble structure celled as SSTable (Sorted Sting Table). The commit log is used for playback purposes in case data Som the memizble is lost due to node feilure. EZ] Two Marks Questions with Answers az What is consistency In a distributed system ? Ans. stem, consistency will be defined as one that responds with | the seme output for the same request at the same time across all the replicas. z Hbuting 2 single dataset across multiple databases, which can then be stored on multiple machines. This allows for larger datasets to be s and stored in multiple deta nodes, increasing the total storage TECHNICAL PUBLICATIONS” - an up-thrust for bmoutedgeQ6 What are write-write and read-erite conficts 7 Ans. : Write-write confiicts ocr when tro diews = middle of another dien!’s write, Q7 Define Cassandra. Cessendra is a peerto-peer distributed system made up any nods can accept 2 read or write request. Qs What is the use of Bloom filters in Cassandra 7 Ans. : Bloom filters are used as 2 performance booster: Bloom flies ae vey Sst | nondeterministic algorithms for testing whether an clement is 2 member of 2 set They are nondeterministic because it is possible to get 2 telsepositive read fom 2 Bloom filter, but not 2 false-negative. Bloom filtess work by mapping the values ina doe into a bit array and condensing a larger data set into 2 digest sting Bloom filter is 2 | special kind of cache. Q3 Explain sorted strings table. Ans. : Sorted strings table which is 2 fle format used by Cassandra to store the statics and the data from the memtzbles. The Cassendre $STables are ixmutzble hence any update on the table creates 2 new SSTables fle The dete structie focmat wed by SSTables is Log-Structured Mezge which is qualified for writing intense heavy dete sets compared to the traditional B tree structure. TECHNICAL PUBLICATIONS? - en ustteust for Imontedgeig Dots Analytics oo NoSOL Data Managemen, Q.10 Explain Cassandra data center. Ans. : Cassandra data center is the collection of related nodes which are configured in | the cluster to perform the replication. The data centers can be physical data centers o, | logical data center and depending upon the workload a separate data center can be | used, | | Qi1 Explain advantages and disadvantages of graph data. | Ans. : © Advantages of graph data: | 2) More descriptive queries | b) Greater flexibility in adapting your model <) Greater performance when traversing data relationships. | © Disadvantages of graph data stores : | 2) Difficult to scale, j b) No standard language. | Q32 Describe session consistency. | Ans. : Session consistency means read-your-writes consistency but at session level. | Session can be identified with 2 conversation between a client and a server. As long as the conversation continues, we will read everything we have written during this conversation. If the session ends and we start another session with the same server, there is no guarantee that we can read values we have written during previous conversation. Qi3 What are schemaless databases 7 Ans. : Schemaless databases are a type of NoSQL databases that do not have a predefined schema or structure for data. This means that data can be inserted and ved without adhering to a specific structure and the database can adapt to changes in date over time without requiring schema migrations or changes. qaa + an up-thrust for knowledge

P.prabu (29x61c) CCS334 BDA - Unit 2
No ratings yet
P.prabu (29x61c) CCS334 BDA - Unit 2
29 pages
Big Data Analytics Unit-5
No ratings yet
Big Data Analytics Unit-5
28 pages
Big Data Analytics Unit-2
No ratings yet
Big Data Analytics Unit-2
30 pages
Bda Unit-2
No ratings yet
Bda Unit-2
29 pages
Module1 Introduction
No ratings yet
Module1 Introduction
42 pages
Unit 3
No ratings yet
Unit 3
127 pages
BDA.Unit-2
No ratings yet
BDA.Unit-2
30 pages
Music recommendation system- Mini project
No ratings yet
Music recommendation system- Mini project
19 pages
Full Stack UNIT 4
No ratings yet
Full Stack UNIT 4
50 pages
Cs3591-Unit 5
No ratings yet
Cs3591-Unit 5
27 pages
Bda Unit 5
No ratings yet
Bda Unit 5
29 pages
DBMS Module 3 PPT
No ratings yet
DBMS Module 3 PPT
51 pages
Mern lab Manual
No ratings yet
Mern lab Manual
12 pages
Backend Concepts
No ratings yet
Backend Concepts
46 pages
Mc5502 Bda Unit I Notes
No ratings yet
Mc5502 Bda Unit I Notes
106 pages
4.2 NoSQL Databases UNIT-1
No ratings yet
4.2 NoSQL Databases UNIT-1
35 pages
Bda Unit Iv
No ratings yet
Bda Unit Iv
102 pages
Oop PPT 1
No ratings yet
Oop PPT 1
94 pages
UNIT V Streaming
No ratings yet
UNIT V Streaming
22 pages
SPARQL
No ratings yet
SPARQL
39 pages
DBMS Unit 1 Notes
100% (1)
DBMS Unit 1 Notes
22 pages
Bda - Unit 1
No ratings yet
Bda - Unit 1
33 pages
CS3251 UNIT IV Notes
No ratings yet
CS3251 UNIT IV Notes
27 pages
Unit III - Graphs
No ratings yet
Unit III - Graphs
38 pages
Data Warehousing Full
No ratings yet
Data Warehousing Full
41 pages
UNIT 3 DBMS R23
No ratings yet
UNIT 3 DBMS R23
24 pages
Bda - Unit 5
No ratings yet
Bda - Unit 5
30 pages
Module 3 Nosql
No ratings yet
Module 3 Nosql
12 pages
HBase
No ratings yet
HBase
36 pages
Unit 3
No ratings yet
Unit 3
28 pages
Full Stack UNIT 3
No ratings yet
Full Stack UNIT 3
36 pages
Bda - Unit 3
No ratings yet
Bda - Unit 3
29 pages
Mc4102 Oose Unit 1 KVL Notes
No ratings yet
Mc4102 Oose Unit 1 KVL Notes
26 pages
DW DM Notes
No ratings yet
DW DM Notes
107 pages
Unit 3 1
No ratings yet
Unit 3 1
20 pages
Unit 4 - Device Drivers
No ratings yet
Unit 4 - Device Drivers
8 pages
CS8603 U.ii
No ratings yet
CS8603 U.ii
20 pages
BDA Unit2 Complete
No ratings yet
BDA Unit2 Complete
56 pages
What Is Pre-Production
No ratings yet
What Is Pre-Production
7 pages
unit V
No ratings yet
unit V
67 pages
FSD Unit2
No ratings yet
FSD Unit2
41 pages
Module 4 Nosql
No ratings yet
Module 4 Nosql
8 pages
Features of MapReduce
No ratings yet
Features of MapReduce
4 pages
Stack Using Linked List
No ratings yet
Stack Using Linked List
20 pages
M2M Architecture in IOT
No ratings yet
M2M Architecture in IOT
5 pages
React Intro
No ratings yet
React Intro
45 pages
Lab Manual: Sri Ramakrishna Institute of Technology
No ratings yet
Lab Manual: Sri Ramakrishna Institute of Technology
49 pages
Hadoop Unit-4
No ratings yet
Hadoop Unit-4
44 pages
The CAP Theorem
100% (1)
The CAP Theorem
3 pages
B Tree Assignment
No ratings yet
B Tree Assignment
4 pages
CCS334 BIG DATA ANALYTICS Session 1 Intr
No ratings yet
CCS334 BIG DATA ANALYTICS Session 1 Intr
18 pages
Distributed Deadlock Detection
No ratings yet
Distributed Deadlock Detection
18 pages
Notes - Unit 3 - Map Reduce Applications
No ratings yet
Notes - Unit 3 - Map Reduce Applications
11 pages
DBMS Architecture: 1-Tier, 2-Tier & 3-Tier: What Is Database Architecture?
100% (1)
DBMS Architecture: 1-Tier, 2-Tier & 3-Tier: What Is Database Architecture?
3 pages
Ads Bot
No ratings yet
Ads Bot
2 pages
Unit No.4 Parallel Database
No ratings yet
Unit No.4 Parallel Database
32 pages
Unit 5 - Chapter 2 - Introduction To MongoDB
No ratings yet
Unit 5 - Chapter 2 - Introduction To MongoDB
53 pages
WT Unit 3
No ratings yet
WT Unit 3
57 pages
UNIT-3 Javascript: Introduction Java Script
No ratings yet
UNIT-3 Javascript: Introduction Java Script
45 pages
NOSQL
No ratings yet
NOSQL
23 pages
DSA RTU 2022 Paper
No ratings yet
DSA RTU 2022 Paper
15 pages
CCA 3 QP 2021-Final
No ratings yet
CCA 3 QP 2021-Final
2 pages
FDS Unit 5
No ratings yet
FDS Unit 5
22 pages
Distributed Database Systems: January 2002
No ratings yet
Distributed Database Systems: January 2002
25 pages
CSE287 (Database Management Systems Laboratory) - Final
No ratings yet
CSE287 (Database Management Systems Laboratory) - Final
13 pages
Anna University IT 7th Sem Ebooks
No ratings yet
Anna University IT 7th Sem Ebooks
3 pages
5.1 Mining Data Streams
No ratings yet
5.1 Mining Data Streams
16 pages

Bda - Unit 2

Uploaded by

Bda - Unit 2

Uploaded by

You might also like