Big Data Analytics - Module 5

Big Data Analytics (APJ Abdul Kalam Technological University)


UNIT V - NOSQL DATA MANAGEMENT FOR BIG DATA AND VISUALIZATION


NoSQL Databases: Schema-less Models: Increasing Flexibility for Data Manipulation - Key-Value Stores - Document Stores - Tabular Stores - Object Data Stores - Graph Databases - Hive - Sharding - HBase - Analyzing Big Data with Twitter - Big Data for E-Commerce - Big Data for Blogs - Review of Basic Data Analytic Methods using R.

NOSQL DATABASE
 The availability of a high-performance, elastic distributed data environment enables
creative algorithms to exploit variant modes of data management in different ways.
 Such data management frameworks are bundled under the term “NoSQL databases”.
 They combine traditional SQL (or SQL-like query languages) with alternative means of querying and access.
 NoSQL data systems hold out the promise of greater flexibility in database management
while reducing the dependence on more formal database administration.
Types: Key-Value Stores:
 Values (or sets of values, or even more complex entity objects) are associated with
distinct character strings called keys.
 Programmers may see similarity with the data structure known as a hash table.
Key      Value
BMW      {"1-Series", "3-Series", "5-Series", "5-Series GT", "7-Series", "X3", "X5", "X6", "Z4"}
Buick    {"Enclave", "LaCrosse", "Lucerne", "Regal"}
 The key is the name of the automobile make, while the value is a list of names of models
associated with that automobile make.
Operations:
 Get(key), which returns the value associated with the provided key.
 Put(key, value), which associates the value with the key.
 Multi-get(key1, key2,.., keyN), which returns the list of values associated with the list of
keys.
 Delete(key), which removes the entry for the key from the data store
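A minimal sketch of these four operations, using an in-memory Python dictionary to stand in for a distributed key-value store; the class name KeyValueStore and its methods are illustrative only and do not correspond to any particular product.

# In-memory sketch of a key-value store (illustrative only).
class KeyValueStore:
    def __init__(self):
        self._data = {}                  # keys map to opaque values

    def put(self, key, value):
        self._data[key] = value          # associate the value with the key

    def get(self, key):
        return self._data.get(key)       # value for the exact key, or None

    def multi_get(self, *keys):
        return [self._data.get(k) for k in keys]

    def delete(self, key):
        self._data.pop(key, None)        # remove the entry if present

store = KeyValueStore()
store.put("BMW", ["1-Series", "3-Series", "X5"])
store.put("Buick", ["Enclave", "LaCrosse", "Regal"])
print(store.get("BMW"))                  # the exact key is required
print(store.multi_get("BMW", "Buick"))
store.delete("Buick")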
Characteristics:
 Uniqueness of the key - to find the values you are looking for, you must use the exact
key.


 In this data management approach, if you want to associate multiple values with a single
key, you need to consider the representations of the objects and how they are associated
with the key.
 A key-value store is essentially a very long, and likely thin, table (many rows, relatively few columns).
 The table’s rows can be sorted by the key value to simplify finding the key during a
query.
 The keys can be hashed using a hash function that maps the key to a particular location
(sometimes called a “bucket”) in the table.
 The representation can grow indefinitely, which makes it good for storing large amounts
of data that can be accessed relatively quickly, as well as environments requiring
incremental appends of data.
 Examples include capturing system transaction logs and managing profile data about individuals.
 The simplicity of the representation allows massive amounts of indexed data values to be
appended to the same key value table, which can then be sharded, or distributed across
the storage nodes.
 Under the right conditions, the table is distributed in a way that is aligned with the way
the keys are organized.
 While key value pairs are very useful for both storing the results of analytical algorithms
(such as phrase counts among massive numbers of documents) and for producing those
results for reports, the model does pose some potential drawbacks.
Drawbacks:
 The model will not inherently provide any kind of traditional database capabilities (such
as atomicity of transactions, or consistency when multiple transactions are executed
simultaneously)—those capabilities must be provided by the application itself.
 Another is that as the model grows, maintaining unique values as keys may become more
difficult.
Types: Document Stores:
 A document store is similar to a key value store in that stored objects are associated (and
therefore accessed via) character string keys.
 The difference is that the values being stored, which are referred to as “documents,”
provide some structure and encoding of the managed data.


 Common encodings include XML and JSON; the examples below use a JSON-like notation.
Example:
{ "StoreName": "Retail Store #34", { "Street": "1203 O ST", "City": "Lincoln", "State": "NE", "ZIP": "68508" } }
{ "StoreName": "Retail Store #65", { "MallLocation": "Westfield Wheaton", "City": "Wheaton", "State": "IL" } }
{ "StoreName": "Retail Store #102", { "Latitude": "40.748328", "Longitude": "-73.985560" } }
 The document representation embeds the model so that the meanings of the document
values can be inferred by the application.
 One of the differences between a key value store and a document store is that while the
former requires the use of a key to retrieve data, the latter often provides a means (either
through a programming API or using a query language) for querying the data based on
the contents.
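As a hedged sketch of content-based querying, the example below models the “documents” as Python dictionaries and filters them by a field value rather than by key; the Address wrapper field is an assumption added to make the nesting explicit, and a real document store (MongoDB, for example) would expose this through its own query API.

# Documents as JSON-like Python dictionaries.
documents = [
    {"StoreName": "Retail Store #34",
     "Address": {"Street": "1203 O ST", "City": "Lincoln", "State": "NE", "ZIP": "68508"}},
    {"StoreName": "Retail Store #65",
     "Address": {"MallLocation": "Westfield Wheaton", "City": "Wheaton", "State": "IL"}},
]

# Query by content (all stores in Nebraska) instead of by key.
nebraska = [d for d in documents if d.get("Address", {}).get("State") == "NE"]
print([d["StoreName"] for d in nebraska])     # ['Retail Store #34']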
Types: Tabular Stores
 Tabular, or table-based stores are largely derived from Google’s original Bigtable design
to manage structured data.
 HBase is a Hadoop-related NoSQL data management system that evolved from Bigtable.
 The Bigtable NoSQL model allows sparse data to be stored in a three-dimensional table that is indexed by a row key, a column key that indicates the specific attribute for which a data value is stored, and a timestamp that may refer to the time at which the row’s column value was stored.
 As an example, various attributes of a web page can be associated with the web page’s
URL: the HTML content of the page, URLs of other web pages that link to this web page,
and the author of the content.
 Columns in a Bigtable model are grouped together as “families,” and the timestamps
enable management of multiple versions of an object.
 The timestamp can be used to maintain history—each time the content changes, new
column attachments can be created with the timestamp of when the content was
downloaded.
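To make the three-dimensional indexing concrete, the sketch below models a Bigtable/HBase-style table as a nested Python dictionary keyed by row key, then by a "family:qualifier" column key, then by timestamp. This is a conceptual model only, not how any real system stores data internally; the row and column names are illustrative.

# table[row_key][column_key][timestamp] = value   (conceptual model)
table = {}

def put(row_key, column_key, value, ts):
    table.setdefault(row_key, {}).setdefault(column_key, {})[ts] = value

def get_latest(row_key, column_key):
    versions = table.get(row_key, {}).get(column_key, {})
    return versions[max(versions)] if versions else None

# Attributes of a web page keyed by its URL; versions are kept by timestamp.
put("com.example.www", "contents:html", "<html>version 1</html>", ts=1)
put("com.example.www", "contents:html", "<html>version 2</html>", ts=2)
put("com.example.www", "anchor:referrer.example", "Example link text", ts=1)

print(get_latest("com.example.www", "contents:html"))   # <html>version 2</html>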


Types: Object Data Stores:


 Object databases are similar to document stores, except that a document store explicitly serializes the object so the data values are stored as strings, while object databases maintain the object structures as they are bound to object-oriented programming languages.
 Object database management systems are more likely to provide traditional ACID (atomicity, consistency, isolation, and durability) compliance: characteristics that are bound to database reliability. Object databases are not relational databases and are not queried using SQL.
Types: Graph Databases:
 Graph databases provide a model of representing individual entities and numerous kinds
of relationships that connect those entities.
 A graph consists of a collection of vertices that represent the modeled entities, connected by edges that capture the way that two entities are related.
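A minimal sketch of this idea using plain Python structures: vertices carry entity data, and edges record typed relationships between them. Real graph databases (Neo4j, for example) add query languages and indexing on top of this model; the entity names here are illustrative.

# Vertices: entity id -> properties.  Edges: (source, relationship, target).
vertices = {
    "alice": {"type": "Person",  "name": "Alice"},
    "bob":   {"type": "Person",  "name": "Bob"},
    "acme":  {"type": "Company", "name": "Acme Corp"},
}
edges = [
    ("alice", "KNOWS", "bob"),
    ("alice", "WORKS_FOR", "acme"),
    ("bob",   "WORKS_FOR", "acme"),
]

# Simple traversal: who works for Acme?
employees = [src for (src, rel, dst) in edges if rel == "WORKS_FOR" and dst == "acme"]
print([vertices[e]["name"] for e in employees])    # ['Alice', 'Bob']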

HIVE
 Apache Hive enables users to process data without explicitly writing MapReduce code.
 One key difference from Pig is that the Hive language, HiveQL (Hive Query Language), resembles Structured Query Language (SQL) rather than a scripting language.
 A Hive table structure consists of rows and columns.
 The rows typically correspond to some record, transaction, or particular entity (for
example, customer) detail.
 The values of the corresponding columns represent the various attributes or
characteristics for each row.
 Hadoop and its ecosystem are used to apply some structure to unstructured data.
 Therefore, if a table structure is an appropriate way to view the restructured data, Hive
may be a good tool to use.
 Additionally, a user may consider using Hive if the user has experience with SQL and the
data is already in HDFS.
 Another consideration in using Hive may be how data will be updated or added to the
Hive tables.
 If data will simply be added to a table periodically, Hive works well, but if there is a need to update data in place, it may be beneficial to consider another tool, such as HBase.


 A Hive query is first translated into a MapReduce job, which is then submitted to the
Hadoop cluster.
 Thus, the execution of the query has to compete for resources with any other submitted
job.
 Hive is intended for batch processing and is a good choice when:
 Data easily fits into a table structure.
 Data is already in HDFS.
 Developers are comfortable with SQL programming and queries.
 There is a desire to partition datasets based on time.
 Batch processing is acceptable.
Basics:
From the command prompt, a user enters the interactive Hive environment by simply
entering hive:
$ hive
hive>
From this environment, a user can define new tables, query them, or summarize their
contents.
hive> create table customer (cust_id bigint, first_name string, last_name string,
      email_address string) row format delimited fields terminated by '\t';
 A HiveQL query is executed to count the number of records in the newly created table, customer.
 Because the table is currently empty, the query returns a result of zero, shown as the last line of the provided output.
 The query is converted and run as a MapReduce job, which results in one map task and one reduce task being executed.
hive> select count(*) from customer;
 When querying large tables, Hive outperforms most conventional database queries and scales better.
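The same query can also be submitted programmatically. Below is a hedged sketch using the third-party PyHive package against a running HiveServer2 instance; the host, port, username and table name are assumptions for illustration.

# Requires: pip install pyhive   (plus a reachable HiveServer2)
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="hadoop")  # adjust to your cluster
cursor = conn.cursor()

cursor.execute("SELECT count(*) FROM customer")   # still runs as a batch job on the cluster
print(cursor.fetchall())                          # e.g. [(0,)] for an empty table

cursor.close()
conn.close()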


HBASE
 HBase is a distributed column-oriented database built on top of the Hadoop file system. It
is an open-source project and is horizontally scalable.
 Apache HBase is capable of providing real-time read and write access to datasets with billions of rows and millions of columns.
 HBase is a data model similar to Google’s Bigtable, designed to provide quick random access to huge amounts of structured data.
 It leverages the fault tolerance provided by the Hadoop File System.
 It is a part of the Hadoop ecosystem that provides random real-time read/write access to
data in the Hadoop File System.
 One can store the data in HDFS either directly or through HBase.
 Data consumer reads/accesses the data in HDFS randomly using HBase.
 HBase sits on top of the Hadoop File System and provides read and write access.
Storage Mechanism in HBase:
 HBase is a column-oriented database and the tables in it are sorted by row.
 The table schema defines only column families, which are the key value pairs.
 A table has multiple column families and each column family can have any number of
columns.
 Subsequent column values are stored contiguously on the disk. Each cell value of the
table has a timestamp.
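A hedged sketch of this storage model from client code, using the third-party happybase Python package (which talks to HBase through its Thrift gateway); the table name, column family and row key below are illustrative assumptions, and the table is assumed to already exist.

# Requires: pip install happybase   (plus a running HBase Thrift server)
import happybase

connection = happybase.Connection("localhost")   # Thrift gateway host
table = connection.table("customer")             # assumed table with column family 'info'

# Cells are addressed by row key and 'family:qualifier'; HBase adds the timestamp.
table.put(b"row-001", {b"info:first_name": b"Ada", b"info:email": b"ada@example.com"})

print(table.row(b"row-001"))                     # latest version of each cell
connection.close()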
Architecture:
 Tables are split into regions and are served by the region servers.
 Regions are vertically divided by column families into “Stores”.
 Stores are saved as files in HDFS


Master Server:
 Assigns regions to the region servers
 Handles load balancing of the regions across region servers. It unloads the busy servers
and shifts the regions to less occupied servers.
 Maintains the state of the cluster by negotiating the load balancing
Regions:
 Tables are split up and spread across the region servers.
Region Server:
 Communicates with the client and handles data-related operations.
 Handles read and write requests for all the regions under it.
 Decides the size of the region by following the region size thresholds.
 Memstore – an in-memory write cache.
 When the memstore is flushed, its data is transferred and saved in HFiles as blocks.
ZooKeeper:
 It provides services like maintaining configuration information, naming, and providing distributed synchronization.
 It keeps track of all the region servers in the HBase cluster, including how many region servers there are and which region server is holding which DataNode.
Services:
 Establishing client communication with region servers.
 Tracking server failures and network partitions.
 Maintaining configuration information.

SHARDING
 Sharding is a database architecture pattern related to horizontal partitioning: the practice of separating one table’s rows into multiple different tables, known as partitions.
 Each partition has the same schema and columns, but entirely different rows.
 The data held in each partition is unique and independent of the data held in other partitions.
Before sharding: all rows live in one table. After sharding: the rows are split across multiple tables (shards) that share the same schema.

 In a vertically-partitioned table, entire columns are separated out and put into new,
distinct tables.
 The data held within one vertical partition is independent from the data in all the others,
and each holds both distinct rows and columns.
Horizontal or Range Based Sharding:
 In this case, the data is split based on value ranges that are inherent in each entity.
 For example, if you store the contact info for your online customers, you might choose to store the info for customers whose last name starts with A-H on one shard, while storing the rest on another shard.
Original table:
ID  Name  Mail ID
1   A     [email protected]
2   B     [email protected]
3   C     [email protected]
4   D     [email protected]

Shard 1:
ID  Name  Mail ID
1   A     [email protected]
2   B     [email protected]

Shard 2:
ID  Name  Mail ID
3   C     [email protected]
4   D     [email protected]
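A minimal sketch of routing by last-name range, matching the A-H example above; the shard names and customer names are illustrative.

# Route a customer record to a shard based on the first letter of the last name.
def range_shard(last_name):
    return "shard-1" if "A" <= last_name[0].upper() <= "H" else "shard-2"

for name in ["Anderson", "Garcia", "Miller", "Young"]:
    print(name, "->", range_shard(name))
# Anderson, Garcia -> shard-1 (A-H);  Miller, Young -> shard-2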
Advantages:
 Each shard also has the same schema as the original database.
 It works well for relatively static data (data unlikely to see huge churn), for example storing the contact info for students in a college.
Disadvantages:
 The disadvantage of this scheme is that the last names of the customers may not be evenly
distributed.


 In that case, your first shard will be experiencing a much heavier load than the second
shard and can become a system bottleneck.
Vertical Sharding:
 In this case, different features of an entity will be placed in different shards on different
machines.
Original table:
ID  Name  Mail ID
1   A     [email protected]
2   B     [email protected]
3   C     [email protected]
4   D     [email protected]

Shard 1 (Name):
ID  Name
1   A
2   B
3   C
4   D

Shard 2 (Mail ID):
ID  Mail ID
1   [email protected]
2   [email protected]
3   [email protected]
4   [email protected]

Benefits:
 It allows you to handle the critical part of your data differently from the not-so-critical part and to build different replication and consistency models around each.
Disadvantages:
 It increases the development and operational complexity of the system.
 If your site/system experiences additional growth, it may be necessary to further shard a feature-specific database across multiple servers.
Key or hash based sharding:
 In this case, an entity has a value which can be used as an input to a hash function and a
resultant hash value generated. This hash value determines which database server(shard)
to use.
 The main drawback of this method is that elastic load balancing (dynamically adding or removing database servers) becomes very difficult and expensive: changing the number of shards changes the hash mapping, so most keys must be moved.
 A large number of requests cannot be serviced during this rebalancing, and you will incur downtime until the migration completes.
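A minimal sketch of key/hash-based shard selection. Note how changing the shard count reassigns most keys, which is exactly the elastic-rebalancing drawback described above; the hash function and shard counts are illustrative.

import hashlib

def shard_for(key, num_shards):
    # Stable hash of the key (Python's built-in hash() is salted per run), then modulo.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

keys = ["user-1", "user-2", "user-3", "user-4"]
print([shard_for(k, 4) for k in keys])    # placement with 4 shards
print([shard_for(k, 10) for k in keys])   # adding shards moves most keys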
Directory based sharding:
 Directory based shard partitioning involves placing a lookup service in front of the
sharded databases.
 The lookup service knows the current partitioning scheme and keeps a map of each entity
and which database shard it is stored on.


 The client application first queries the lookup service to figure out the shard (database
partition) on which the entity resides/should be placed.
 Then it queries / updates the shard returned by the lookup service.
 The lookup service is usually implemented as a web service.
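A hedged sketch of the lookup-service idea: a map from entity to shard that the client consults before reading or writing. In practice this map lives in a separate, highly available service (often exposed as a web service) rather than an in-process dictionary; all names below are illustrative.

# The lookup service's map: entity id -> shard name (illustrative data).
shard_map = {"user-1": "shard-A", "user-2": "shard-B", "user-3": "shard-A"}

def locate(entity_id, default_shard="shard-A"):
    # Return the shard holding the entity; unknown entities get a default placement.
    return shard_map.setdefault(entity_id, default_shard)

# Client flow: ask the directory first, then query or update that shard.
shard = locate("user-2")
print("read/write user-2 on", shard)      # shard-B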

Steps (resharding from 4 shards to 10 using the lookup service):
1. Keep the modulo-4 hash function in the lookup service.
2. Determine the data placement based on the new hash function, modulo 10.
3. Write a script to copy all the data, based on step 2, into the six new shards and possibly onto the 4 existing shards. Note that it does not delete any existing data on the 4 existing shards.
4. Once the copy is complete, change the hash function to modulo 10 in the lookup service.
5. Run a cleanup script to purge unnecessary data from the 4 existing shards based on step 2; the purged data now exists on other shards.
There are two practical considerations which need to be solved on a per-system basis:
 While the migration is happening, the users might still be updating their data. Options include putting the system in read-only mode or placing new data in a separate server that is placed into the correct shards once the migration is done.
 The copy and cleanup scripts might have an effect on system performance during the migration. This can be circumvented by using system cloning and elastic load balancing, but both are expensive.
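A minimal sketch of the copy step under these assumptions: placement is recomputed with the new modulo-10 function, and only the keys whose shard changes are copied; the old data is purged only after the lookup service is switched over. The hash function and key names are illustrative.

import hashlib

def stable_hash(key):
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

old_shard = lambda k: stable_hash(k) % 4      # current scheme (4 shards)
new_shard = lambda k: stable_hash(k) % 10     # target scheme (10 shards)

keys = ["customer-%d" % i for i in range(12)] # illustrative keys

# Copy every key whose placement changes under the new function.
for k in keys:
    if old_shard(k) != new_shard(k):
        print("copy %s: shard %d -> shard %d" % (k, old_shard(k), new_shard(k)))
# After the copy completes, switch the lookup service to new_shard and
# run a cleanup script to purge the moved rows from the old shards.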


ANALYSIS OF BIG DATA FOR REAL-TIME APPLICATIONS LIKE E-COMMERCE
Big data allows businesses to gain access to a significantly larger amount of data in order to convert growth into revenue, streamline operational processes, and gain more customers.
Optimize Customer Shopping Experience:
 Customer behaviour patterns, purchase histories and browsing interests are analyzed.
 These are used to re-target buyers by displaying or recommending products that they are interested in.
Higher Customer Satisfaction:
 Analytics helps the stores understand their customers better and build a lasting relationship with them.
Streaming Analytics:
 Used to gain valuable customer insights.
 Streaming data from the store provides insights that help in personalizing the shopper’s experience and generating more revenue.
Recommendation Engine:
 Analyzes all the actions of a particular customer: product pages visited, products they liked, products added to their carts, and products finally bought or abandoned.
 The system can also compare the behaviour pattern of a certain visitor to those of other visitors.
 Based on this analysis, it recommends products that a visitor may like.
Personalized Shopping Experience:
 A key to successful e-commerce marketing is reacting to customers’ actions properly and in real time.
 All customer activities in the e-shop are analyzed to create a picture of customer behaviour patterns.
Everything in the cart is tracked:
 The system encourages the customer to finish the purchase, for example with discounts.
 Example: a customer bought a winter coat two weeks ago and visited some product pages with winter gloves, scarves and hats at that time.
Voice of the customer:
 Sentiment analysis is added to the standard approach of analyzing products and brands by their sales value, volume, revenues, number of orders, etc.
 Sentiment analysis is the evaluation of comments that customers left about different
products and brands.


 The system identifies whether each comment is positive or negative.


 Positive Comments: Happy, Great, Recommend or Satisfied
 Negative Comments: Bad, Terrible.
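A toy sketch of this keyword-based idea: score a comment by counting positive and negative words. The word lists below are just the examples from these notes; real sentiment analysis uses much richer lexicons or trained models.

import re

POSITIVE = {"happy", "great", "recommend", "satisfied"}
NEGATIVE = {"bad", "terrible"}

def sentiment(comment):
    words = set(re.findall(r"[a-z']+", comment.lower()))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("Great product, would recommend"))    # positive
print(sentiment("Terrible quality, bad packaging"))   # negative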
Dynamic Pricing:
 Setting price rules, monitoring competitors and adjusting prices in real time.
Demand Forecasting:
 Creating customer profiles by looking at customer behaviour: how many items they usually purchase and which products they buy.
 Collecting and analyzing the data, and visualizing the analysis results.
 Retailers will also analyze external big data.

EXPLORATORY DATA ANALYSIS IN R


Exploratory data analysis is the study of the data in terms of basic statistical measures, and the creation of graphs and plots to visualize and identify relationships and patterns.
R is a programming language and software framework for statistical analysis and graphics.
read.csv() – to import CSV file.
head() – to display first six records of file.
summary() – provides some descriptive statistics, such as the mean and median, for each data
column.
R software uses a command line interface.
To improve the ease of writing, executing and debugging R code, several additional GUIs
have been written for R.
Example: RStudio, R Commander
Data Import and Export:
setwd() – to set the working directory for the subsequent import and export operations.
Example: setwd("path of the directory")
read.table() and read.delim() – to import other common file types, such as TXT.
Attribute and Data Type:
The characteristics or attributes provide the qualitative and quantitative measures for each
item or subject of interest.
Attribute types:
Nominal (categorical) – the values represent labels that distinguish one from another. Example: ZIP code.
Ordinal (categorical) – the attributes imply a sequence. Example: academic grades.
Interval (numeric) – the difference between two values is meaningful. Example: calendar dates.
Ratio (numeric) – both the difference and the ratio of two values are meaningful. Example: age, length.
Attribute and Data Type:
 class() – represents the abstract class of an object.
 typeof() – determines the way an object is stored in memory
 is.data_type(object) – verify if object is of a certain datatype
 as.data_type(object) – convert data type of object to another
 Predefined constants: pi, letters, LETTERS, month.name, month.abb
Data Types:
Logical – TRUE or FALSE
Integer – set of all integers
Numeric – set of all real numbers
Character – e.g., "a", "b"
Basic object:
Vector – Ordered collection of elements of the same data type
List – Ordered collection of objects
Data Frame – Generic tabular object
Basic object – Vectors:
An ordered collection of basic data types of given length
All the elements of a vector must be of the same data type.
Example: x=c(1, 2, 3)
Basic object – List:
A generic object consisting of an ordered collection of objects.
A list could consist of a numeric vector, a logical value, a matrix, a complex vector, and so on.
To access a top-level component, use the double-bracket operator [[ ]] (which returns the component itself) or single brackets [ ] (which return a sublist); for lower or inner-level components, use [ ] together with [[ ]].
Basic object – Data Frame:
Used to store tabular data.


df[val1, val2] – row "val1", column "val2"
val1 and val2 can also be arrays of values, such as 1:2 or c(1:2)
df[val2] – refers to column "val2" only
subset():
Extracts a subset of the data based on conditions.
runif(75, 0, 10) – generates 75 random numbers uniformly distributed between 0 and 10.
Visualizing a single variable:
 plot(data) – suitable for low volume data
 barplot(data) – Vertical or horizontal bars
 dotchart(data) – dot plot
 hist(data) – histogram
Statistical Methods for Evaluation:
These methods are used during the initial data exploration and data preparation, during model building, and in the evaluation of the final models.
Hypothesis Testing:
 The idea is to form a statement (a hypothesis) and test it with data.
 When performing hypothesis tests, the common assumption is that there is no difference between the two samples; this is the null hypothesis (H0).
Application: Accuracy forecast. Null hypothesis: Model X does not predict better than the existing model. Alternative hypothesis: Model X predicts better than the existing model.
Application: Recommendation engine. Null hypothesis: Algorithm Y does not produce better recommendations than the current algorithm being used. Alternative hypothesis: Algorithm Y produces better recommendations than the current algorithm being used.
Wilcoxon Rank-Sum Test:
 It is a nonparametric hypothesis test that checks whether two populations are identically
distributed.
 wilcox.test() – ranks the observations, determines the respective rank sums corresponding to each population’s sample, and then determines the probability of rank sums of such magnitude being observed, assuming that the population distributions are identical.


Type I and Type II Errors:


 Type I Error: The rejection of the null hypothesis when the null hypothesis is TRUE.
Probability is denoted by the Greek letter α.
 Type II Error: The acceptance of a null hypothesis when the null hypothesis is FALSE.
Probability is denoted by the Greek letter β.
ANOVA:
 Analysis of Variance.
 ANOVA is a generalization of the hypothesis testing of the difference of two population
means.
 The null hypothesis of ANOVA is that all the population means are equal.
 The alternative hypothesis is that at least one pair of the population means is not equal.
