CS8091 BIG DATA ANALYTICS

UNIT V NOSQL DATA MANAGEMENT FOR BIG DATA AND VISUALIZATION

NoSQL Databases: Schema-less Models: Increasing Flexibility for Data Manipulation - Key-Value
Stores - Document Stores - Tabular Stores - Object Data Stores - Graph Databases - Hive -
Sharding - HBase - Analyzing big data with Twitter - Big data for E-Commerce - Big data for
blogs - Review of Basic Data Analytic Methods using R.

NOSQL
The availability of a high-performance, elastic distributed data environment enables creative
algorithms to exploit variant modes of data management in different ways. In fact, some
algorithms will not be able to consume data in traditional RDBMS systems and will be acutely
dependent on alternative means for data management. Many of these alternate data management
frameworks are bundled under the term "NoSQL databases."
The term "NoSQL" carries two different connotations: one implying that the data management
system is not SQL-compliant, and the more accepted one, "Not only SQL," suggesting
environments that combine traditional SQL with alternative means of querying and access.

“Schema-Less Models”: Increasing Flexibility for Data Manipulation


NoSQL data systems reduce the dependence on formal database administration. NoSQL
databases have more relaxed modeling constraints, which may benefit both the application
developer and the end-user analyst: interactive analyses are not throttled by the need to cast
each query in terms of a relational table-based environment. Different NoSQL frameworks are
optimized for different types of analyses.
Ex. Key-value stores, graph databases etc.
The general concepts for NoSQL include schema-less modeling, in which the semantics of the
data are embedded within a flexible connectivity and storage model; this provides for automatic
distribution of data and elasticity with respect to the use of computing, storage, and network
bandwidth in ways that do not force specific binding of data to be persistently stored in particular
physical locations. NoSQL databases also provide integrated data caching that helps reduce
data access latency and speed performance.
Because of the "relaxed" approach to modeling and management that does not enforce
shoehorning data into strictly defined relational structures, the models themselves do not
necessarily impose any validity rules; this potentially introduces risks associated with
ungoverned data management activities such as inadvertent inconsistent data replication,
reinterpretation of semantics, and currency and timeliness issues.


KEY-VALUE STORES
Key-value store is a simple type of NoSQL data store. It is a schema-less model in which values
(or sets of values) are associated with distinct character strings called keys.
Example data represented in a Key-value store

The key is the name of the automobile make, while the value is a list of names of models
associated with that automobile make.
The key-value store does not impose any constraints about data typing or data structure—the
value associated with the key is the value, and it is up to the consuming business applications to
assert expectations about the data values and their semantics and interpretation.
The core operations performed on a key-value store include:
 Get(key), which returns the value associated with the provided key.
 Put(key, value), which associates the value with the key.
 Multi-get(key1, key2, ..., keyN), which returns the list of values associated with the list of
keys.
 Delete(key), which removes the entry for the key from the data store.
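These operations can be sketched in R (the language used later in this unit) by treating an R
environment as a simple in-memory key-value store. This is only an illustration of the model, not
the API of any NoSQL product, and the makes and models below are made-up values.

kv <- new.env(hash = TRUE)              # an environment acts as a hash table
assign("Ford", c("Fiesta", "Focus", "Mustang"), envir = kv)   # Put(key, value)
assign("Honda", c("Civic", "Accord"), envir = kv)             # Put(key, value)
get("Ford", envir = kv)                 # Get(key)
mget(c("Ford", "Honda"), envir = kv)    # Multi-get(key1, key2)
rm("Ford", envir = kv)                  # Delete(key)
exists("Ford", envir = kv)              # FALSE: the entry has been removed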
One critical characteristic of a key-value store is the uniqueness of the key: to look up any value,
the exact key must be used. In order to associate multiple values with a single key, consideration
should be given to the representations of the objects and how they are associated with the key.
Key-value stores are essentially very long, and presumably thin, tables (in that there are not many
columns associated with each row). The table's rows can be sorted by the key value to simplify
finding the key during a query. Alternatively, the keys can be hashed using a hash function that
maps the key to a particular location (sometimes called a "bucket") in the table. Additional
supporting data structures and algorithms (such as bit vectors and Bloom filters) can even be
used to determine whether the key exists in the data set at all.
The simplicity of the representation allows massive amounts of indexed data values to be
appended to the same key-value table, which can then be sharded, or distributed across the
storage nodes. Under the right conditions, the table is distributed in a way that is aligned with the
way the keys are organized, so that the hashing function that is used to determine where any
specific key exists in the table can also be used to determine which node holds that key’s bucket.


While key-value pairs are very useful for both storing the results of analytical algorithms (such
as phrase counts among massive numbers of documents) and for producing those results for
reports, the model does pose some potential drawbacks. One is that the model will not inherently
provide any kind of traditional database capabilities (such as atomicity of transactions, or
consistency when multiple transactions are executed simultaneously)—those capabilities must be
provided by the application itself. Another is that as the model grows, maintaining unique values
as keys may become more difficult, requiring the introduction of some complexity in generating
character strings that will remain unique among a myriad of keys.

DOCUMENT STORES
A document store is similar to a key-value store in that stored objects are associated with (and
therefore accessed via) character string keys. The difference is that the values being stored,
which are referred to as "documents," provide some structure and encoding of the managed data.
There are different common encodings, including XML (Extensible Markup Language), JSON
(JavaScript Object Notation), BSON (a binary encoding of JSON objects), and other means of
serializing data.
Example of document store.

Even though the three examples all represent locations, the representative models differ. The
document representation embeds the model so that the meanings of the document values can be
inferred by the application. One of the differences between a key-value store and a document
store is that while the former requires the use of a key to retrieve data, the latter often provides a
means (either through a programming API or using a query language) for querying the data
based on the contents. Because the approaches used for encoding the documents embed the
object metadata, one can use methods for querying by example.
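As a small, hedged illustration (the field names below are invented), a location document can be
serialized to and parsed from JSON in R with the jsonlite package, showing how the metadata
travels with the document itself:

library(jsonlite)                       # assumed to be installed from CRAN
loc <- list(city = "Chennai", pincode = "600025", country = "India")
doc <- toJSON(loc, auto_unbox = TRUE)   # serialize the object to a JSON document
doc                                     # {"city":"Chennai","pincode":"600025","country":"India"}
fromJSON(doc)                           # parse it back; fields can be queried by name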


TABULAR STORES
Tabular, or table-based, stores are largely descended from Google's original Bigtable design to
manage structured data. The Hadoop-related NoSQL data management system HBase is one
example.
The Bigtable NoSQL model allows sparse data to be stored in a three-dimensional table that is
indexed by a row key (used in a fashion similar to the key-value and document stores), a
column key that indicates the specific attribute for which a data value is stored, and a
timestamp that may refer to the time at which the row's column value was stored.
As an example, various attributes of a web page can be associated with the web page’s URL: the
HTML content of the page, URLs of other web pages that link to this web page, and the author
of the content. Columns in a Bigtable model are grouped together as "families," and the
timestamps enable management of multiple versions of an object. The timestamp can be used to
maintain history—each time the content changes, new column affiliations can be created with the
timestamp of when the content was downloaded.

OBJECT DATA STORES


Object data stores and object databases seem to bridge the worlds of schema-less data
management and the traditional relational models. On the one hand, approaches to object
databases can be similar to document stores, except that document stores explicitly serialize
the object so the data values are stored as strings, while object databases maintain the object
structures as they are bound to object-oriented programming languages such as C++, Objective-C,
Java, and Smalltalk. On the other hand, object database management systems are more likely
to provide traditional ACID (atomicity, consistency, isolation, and durability) compliance,
characteristics that are bound to database reliability. Object databases are not relational databases
and are not queried using SQL.

GRAPH DATABASES
Graph databases provide a model for representing individual entities and the numerous kinds of
relationships that connect those entities. More precisely, a graph database employs the graph
abstraction for representing connectivity: a collection of vertices (also referred to as nodes or
points) that represent the modeled entities, connected by edges (also referred to as links,
connections, or relationships) that capture the way two entities are related. The analytics
performed on graph data stores differ somewhat from the more frequently used querying and
reporting.
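As a hedged sketch (using the igraph package and made-up entities), a small graph of vertices
and edges can be built and traversed in R as follows:

library(igraph)                    # assumed to be installed from CRAN
# each consecutive pair of names below defines one edge between two entities
g <- make_graph(c("Alice", "Bob",
                  "Bob", "Carol",
                  "Alice", "Carol"), directed = FALSE)
neighbors(g, "Bob")                # entities directly related to "Bob"
degree(g)                          # number of relationships per entity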

CONSIDERATIONS


The decision to use a NoSQL data store instead of a relational model must be aligned with the
data consumers' expectations of compliance with relational-model behavior. As should be
apparent, many NoSQL data management environments are engineered for two key criteria:
1. fast accessibility, whether that means inserting data into the model or pulling it out via some
query or access method;
2. scalability for volume, so as to support the accumulation and management of massive amounts
of data.
The different approaches are amenable to extensibility, scalability, and distribution, and these
characteristics blend nicely with programming models (like MapReduce) that allow
straightforward creation and execution of many parallel processing threads. Distributing a tabular
data store or a key-value store allows many queries/accesses to be performed simultaneously,
especially when the hashing of the keys maps to different data storage nodes. Employing different
data allocation strategies will allow the tables to grow indefinitely without requiring significant
rebalancing. In other words, these data organizations are designed for high-performance
computing for reporting and analysis. However, most NoSQL environments are not generally
designed for transaction processing, and it would require some whittling down of the list of
vendors to find those that support ACID transactions.

HIVE
One of the often-noted issues with MapReduce is that although it provides a methodology for
developing and executing applications that use massive amounts of data, it is not more than that.
While the data can be managed within files using HDFS, many business applications expect
representations of data in structured database tables. That was the motivation for the
development of Hive.
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc
queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive is
specifically engineered for data warehouse querying and reporting and is not intended for use
within transaction processing systems that require real-time query execution or transaction
semantics for consistency at the row level.
Hive is layered on top of the file system and execution framework for Hadoop and enables
applications and users to organize data in a structured data warehouse and therefore query the
data using a query language called HiveQL that is similar to SQL. The Hive system provides
tools for extracting/transforming/loading (ETL) data into a variety of different data formats. And
because the data warehouse system is built on top of Hadoop, it enables native access to the
MapReduce model, allowing programmers to develop custom Map and Reduce functions that
can be directly integrated into HiveQL queries. Hive provides scalability and extensibility for
batch-style queries for reporting over large datasets that are typically being expanded while
relying on the fault tolerant aspects of the underlying Hadoop execution model.
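As a hedged sketch of what this looks like in practice, HiveQL can be submitted from R over
ODBC in the same way as the RODBC examples later in this unit, assuming a Hive ODBC
driver is configured; the DSN name hive_dsn and the web_pages table here are hypothetical.

library(RODBC)                          # assumes a Hive ODBC driver and DSN exist
hive_conn <- odbcConnect("hive_dsn")    # "hive_dsn" is a hypothetical data source name
# HiveQL resembles SQL; this aggregate would execute as a batch job on the cluster
authors <- sqlQuery(hive_conn,
                    "SELECT author, COUNT(*) AS page_count
                     FROM web_pages GROUP BY author")
odbcClose(hive_conn)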

HBASE
HBase is an example of a non-relational data management environment that distributes massive
datasets over the underlying Hadoop framework. HBase is derived from Google’s BigTable and
is a column-oriented data layout that, when layered on top of Hadoop, provides a fault-tolerant
method for storing and manipulating large data tables. Data stored in a columnar layout is
amenable to compression, which increases the amount of data that can be represented while
decreasing the actual storage footprint. HBase also supports in-memory execution.
HBase is not a relational database, and it does not support SQL queries. There are some basic
operations for HBase:
 Get (which accesses a specific row in the table),
 Put (which stores or updates a row in the table),
 Scan (which iterates over a collection of rows in the table), and
 Delete (which removes a row from the table).
Because it can be used to organize massive datasets, coupled with the performance provided by
its columnar orientation, HBase is a reasonable alternative as a persistent storage paradigm when
running MapReduce applications.

Review of Basic Data Analytic Methods Using R


The six phases of the Data Analytics Lifecycle
 Phase 1: Discovery
 Phase 2: Data Preparation
 Phase 3: Model Planning
 Phase 4: Model Building
 Phase 5: Communicate Results
 Phase 6: Operationalize
Introduction to R
 R is a programming language and software framework for statistical analysis and
graphics
 It is available for use under the GNU General Public License
 R software and installation instructions can be obtained via the Comprehensive R
Archive Network (CRAN)


The flow of a basic R script to address an analytical problem


 Import dataset
 Examine the contents of the dataset
 Execute the model building task
Ex.
The annual sales in U.S. dollars for 10,000 retail customers

The data file is imported using the read.csv( ) function. Once the file has been imported, it is
useful to examine the contents to ensure that the data was loaded properly as well as to become
familiar with the data. In the example, the head( ) function, by default, displays the first six
records of sales.
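A minimal sketch of these steps (assuming the yearly_sales.csv file used throughout this unit,
with columns cust_id, sales_total, num_of_orders, and gender):

sales <- read.csv("c:/data/yearly_sales.csv")   # import the dataset
head(sales)                                     # examine the first six records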


The summary ( ) function provides some descriptive statistics, such as the mean and median, for
each data column. Additionally, the minimum and maximum values as well as the 1st and 3rd
quartiles are provided. Because the gender column contains two possible characters, an "F"
(female) or "M" (male), the summary () function provides the count of each character's
occurrence.

The plot( ) function generates a scatterplot of the number of orders (sales$num_of_orders)


against the annual sales (sales$sales_total). The $ is used to reference a specific column in the
dataset sales. The resulting plot is shown below
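A sketch of the call just described (the plot title is illustrative):

plot(sales$num_of_orders, sales$sales_total,
     main = "Number of Orders vs. Total Sales")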

Each point corresponds to the number of orders and the total sales for each customer. The plot
indicates that the annual sales are proportional to the number of orders placed. Although the
observed relationship between these two variables is not purely linear, the analyst decided to
apply linear regression using the lm( ) function as a first step in the modeling process.

The object returned by lm( ), stored in the variable results, contains more information that can
be examined with the summary( ) function. The summary( ) function is an example of a generic
function. A generic function is a group of
functions sharing the same name but behaving differently depending on the number and the type
of arguments they receive. The generic function hist( ) is used to generate a histogram of the
residuals stored in results.
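A hedged reconstruction of these modeling steps (the breaks value is illustrative):

results <- lm(sales$sales_total ~ sales$num_of_orders)   # fit the linear model
summary(results)                       # the generic function summarizes the fitted model
hist(results$residuals, breaks = 800)  # histogram of the model residuals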

R Graphical User Interfaces


R provides two types of basic user interfaces
1. Command line interface (CLI)
2. Graphical user interface (GUI)
To improve the ease of writing, executing, and debugging R code, several additional GUIs have
been written for R. They are
 R commander
 Rattle
 RStudio
Window panes of RStudio GUI
 Script - Serves as an area to write and save R code
 Workspace - Lists the datasets and variables in the R environment
 Plot - Displays the plots generated by the R code and provides a
straightforward mechanism to export the plots
 Console - Provides a history of the executed R code and the output
Obtaining help in RStudio
 The console pane can be used to obtain help information on R.
By entering either ? followed by the function name, or entering help(function name), we
can get the help information
Ex.
 ?lm
 help(lm)


Updating the contents of a variable


 Functions such as edit( ) and fix( ) allow the user to update the contents of an R variable.
 Such changes can also be implemented with RStudio by selecting the appropriate variable
from the workspace pane.
Saving and loading the workspace environment
 R allows one to save the workspace environment, including variables and loaded
libraries, into an .RData file using the save.image( ) function. An existing .RData file
can be loaded using the load( ) function.

Data Import and Export


Importing of R dataset
The dataset was imported into R using the read.csv( ) function
Ex. sales <- read.csv("c:/data/yearly_sales.csv")
R uses a forward slash ( / ) as the separator character in directory and file paths, which may be
a change for Windows users, who are accustomed to using a backslash ( \ ) as a separator.

To simplify the import of multiple files with long path names, the setwd( ) function can be used
to set the working directory for the subsequent import and export operations, as shown in the
following R code.
Ex.
setwd("c:/data/")
sales <- read.csv("yearly_sales.csv")

Other import functions include read.table( ) and read.delim( ) ,which are intended to import
other common file types such as TXT.
Ex.
sales_table <- read.table("yearly_sales.csv", header=TRUE, sep=",")
sales_delim <- read.delim("yearly_sales.csv", sep=",")

The read.delim( ) function expects the column separator to be a tab ("\t") by default, which is
why sep="," is specified above. In the event that the numerical data in a data file uses a comma
for the decimal point, R also provides two additional functions, read.csv2( ) and read.delim2( ).
Import function details


Exporting of R datasets to an external file

The functions used are


 write.table( )
 write.csv ( )
 write.csv2 ( )
Ex.
To add an additional column to the sales dataset and export the modified dataset to an external
file:
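A short sketch (the per_order column and the output file name are illustrative):

# add a column for the average amount per order, then export as tab-delimited text
sales$per_order <- sales$sales_total / sales$num_of_orders
write.table(sales, "sales_modified.txt", sep = "\t", row.names = FALSE)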

Sometimes it is necessary to read data from a database management system. R packages such as
DBI and RODBC are available for this purpose. These packages provide database interfaces
for communication between R and DBMSs such as MySQL, Oracle, SQL Server, PostgreSQL,
and Pivotal Greenplum. The following R code demonstrates how to install the RODBC package
with the install.packages( ) function. The library( ) function loads the package into the R
workspace. Finally, a connector (conn) is initialized for connecting to a Pivotal Greenplum
database training2 via open database connectivity (ODBC) with the user ID user.

install.packages("RODBC")
library(RODBC)
conn <- odbcConnect("training2", uid="user", pwd="password")

The connector needs to be present to submit a SQL query to an ODBC database by using the
sqlQuery( ) function from the RODBC package.
Ex.
Retrieving specific columns from the housing table in which household income (hinc) is greater
than $1,000,000.

housing_data <- sqlQuery(conn, "select serialno, state, persons, room
from housing where hinc > 1000000")

head(housing_data) will display the first 6 results

Although plots can be saved using the RStudio GUI, plots can also be saved using R code by
specifying the appropriate graphic devices. Using the jpeg( ) function, the following R code
creates a new JPEG file, adds a histogram plot to the file, and then closes the file. Such
techniques are useful when automating standard reports. Other functions, such as png( ), bmp( ),
pdf( ), and postscript( ), are available in R to save plots in the desired format.


Ex.
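A hedged sketch (the file path is illustrative):

jpeg(file = "c:/data/sales_hist.jpeg")   # create a new JPEG file
hist(sales$sales_total)                  # add a histogram plot to the file
dev.off()                                # close the file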

Attribute and Data Types


There are four categories of attributes, namely
 nominal - values drawn from unordered categories, such as zip codes or gender
 ordinal - values drawn from ordered categories, such as quality ratings
 interval - values where differences are meaningful but there is no true zero, such as
temperature in degrees Celsius
 ratio - values where both differences and ratios are meaningful, with a true zero, such as
age or income
NOIR Attribute Types

There are three categories of data types namely


 numeric
 character
 logical
Ex.

To examine the characteristics of the variables class( ) and typeof( ) functions can be used. The
class( ) function represents the abstract class of an object. The typeof( ) function determines the
way an object is stored in memory.
Ex.


To test whether a variable is of a specific type, and to coerce it into another type, R provides
functions such as is.integer( ) and as.integer( ).
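A brief sketch covering the three data types and these examination functions:

i <- 1                # numeric
sport <- "football"   # character
flag <- TRUE          # logical
class(i)              # "numeric": the abstract class
typeof(i)             # "double": how it is stored in memory
is.integer(i)         # FALSE
j <- as.integer(i)    # coerce to an integer
is.integer(j)         # TRUE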

Vectors
Vectors are a basic building block for data in R. Simple R variables are actually vectors. The
tests for vectors can be conducted using the is.vector( ) function.

Creation and manipulation of vectors

A vector can be created using the combine function, c( ) or the colon operator :

Ex.

To initialize a vector of a specific length and populate its contents later, the vector( ) function
is used
Ex.
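A sketch of both creation approaches:

u <- c(4, 5, 6)       # combine function
v <- 1:5              # colon operator: 1 2 3 4 5
is.vector(u)          # TRUE
w <- vector(mode = "logical", length = 3)   # initialize: FALSE FALSE FALSE
w[2] <- TRUE          # populate the content later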


Arrays and Matrices


The array ( ) function can be used to restructure a vector as an array.

Ex.

To build a three-dimensional array to hold the quarterly sales for three regions over a two-year
period and then assign the sales amount of $158,000 to the second region for the first quarter of
the first year.
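A sketch of that example:

quarterly_sales <- array(0, dim = c(3, 4, 2))   # 3 regions x 4 quarters x 2 years
quarterly_sales[2, 1, 1] <- 158000              # region 2, quarter 1, year 1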

Matrix
A two-dimensional array is known as a matrix. The parameters nrow and ncol define the number
of rows and columns; a combined sketch follows the list of supported operations below.

Operations supported
 addition
 subtraction
 multiplication
 transpose function
 inverse function
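A base-R sketch of matrix creation and the supported operations (the matrices themselves are
illustrative):

M0 <- matrix(0, nrow = 3, ncol = 4)   # nrow and ncol define the dimensions
M <- diag(2, nrow = 3) + 1            # a square, invertible 3x3 matrix
M + M                                 # addition
M - diag(1, nrow = 3)                 # subtraction
M %*% M                               # matrix multiplication
t(M)                                  # transpose
solve(M)                              # inverse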

Data frames
Data frames provide a structure for storing and accessing several variables of possibly different
data types. Because of their flexibility to handle many data types, data frames are the preferred
input format for many of the modeling functions available in R.
As the is.data.frame( ) function indicates, a data frame was created by the read.csv( )
function

To display the structure of the data frame, use the str( ) function
Ex. str(sales)

A subset of the data frame can be retrieved through subsetting operators.


Ex.

 # extract the fourth column of the sales data frame


sales[ ,4]
 # extract the gender column of the sales data frame
sales$gender
 # retrieve the first two rows of the data frame
sales[1:2,]
 # retrieve all the records whose gender is female
sales[sales$gender=="F",]

Lists
A list is a collection of objects that can be of various types.
Ex.
R code to create an assortment, a list of different object types
# build an assorted list of a string, a numeric, a list, a vector, and a matrix
# (v and M are the vector and matrix created in the earlier examples)
housing <- list("own", "rent")
assortment <- list("football", 7.5, housing, v, M)
assortment

[[1]]
[1] "football"

[[2]]
[1] 7.5

[[3]]
[[3]][[1]]
[1] "own"

[[3]][[2]]
[1] "rent"

[[4]]
[1] 1 2 3 4 5

[[5]]
(the contents of the matrix M are printed here)

The use of a single set of brackets only accesses an item in the list, not its content:
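For example, with the assortment list above:

assortment[4]     # single brackets: a sub-list that contains the vector
assortment[[4]]   # double brackets: the vector itself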

Factor
A factor denotes a categorical variable, typically with a few finite levels such as "F" and "M" in
the case of gender. Factors can be ordered or not ordered.
Ex.
class(sales$gender) # returns "factor"
is.ordered(sales$gender) # returns FALSE
head(sales$gender) # display first six values and the levels
[1] F F M M F F
Levels: F M

The cbind( ) function is used to combine variables column-wise. The rbind( ) function is used
to combine datasets row-wise.
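A hedged sketch of both (the spender column defined here is illustrative):

spender <- sales$sales_total > 500               # an illustrative logical column
sales <- cbind(sales, spender)                   # combine variables column-wise
stacked <- rbind(sales[1:5, ], sales[6:10, ])    # combine datasets row-wise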

Contingency Tables
In R, table refers to a class of objects used to store the observed counts across the factors for a
given dataset. Such a table is commonly referred to as a contingency table and is the basis for
performing a statistical test on the independence of the factors used to build the table.


Ex.
To build a contingency table based on the sales$gender and sales$spender factors:
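A hedged sketch, assuming a sales$spender factor (or logical) column such as the one added in
the cbind( ) example earlier:

counts <- table(sales$gender, sales$spender)
counts            # the contingency table of observed counts
summary(counts)   # performs a chi-squared test of independence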

Based on the observed counts in the table, the summary( ) function performs a chi-squared test
on the independence of the two factors. Because the reported p-value is greater than 0.05, the
assumed independence of the two factors is not rejected.

Exploratory Data Analysis


Exploratory data analysis is a data analysis approach to reveal the important characteristics of a
dataset, mainly through visualization.
Visualization gives a succinct, holistic view of the data that may be difficult to grasp from the
numbers and summaries alone.
Ex.
Variables x and y of the data frame data can instead be visualized in a scatterplot

The code to generate data as well as scatterplot
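A hedged reconstruction (the data is simulated, since the original listing is not shown):

x <- rnorm(50)                           # 50 random values
y <- x + rnorm(50, mean = 0, sd = 0.5)   # y is related to x, plus noise
data <- data.frame(x, y)
summary(data)                            # numeric summaries alone can hide the relationship
plot(data$x, data$y)                     # the scatterplot makes it visible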


Statistical Properties of Anscombe's Quartet


 Mean of x - 9
 Variance of x - 11
 Mean of y - 7.50
 Variance of y - 4.12 or 4.13
 Correlation between x and y - 0.816
 Linear regression line - y=3.00+0.50x
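Anscombe's quartet, four datasets that share these statistics yet look completely different when
plotted, ships with base R as the anscombe dataset, so the properties can be verified directly:

data(anscombe)
sapply(anscombe[, c("x1", "y1")], mean)   # means of the first pair: 9 and 7.5
sapply(anscombe[, c("x1", "y1")], var)    # variances: 11 and about 4.13
cor(anscombe$x1, anscombe$y1)             # approximately 0.816
lm(y1 ~ x1, data = anscombe)              # intercept near 3.00, slope near 0.50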

Visualizing a Single Variable


R has many functions available to examine a single variable. Some of these functions are listed
below

Dotchart and Barplot


Dotchart and barplot portray continuous values with labels from a discrete variable. A dotchart
can be created in R with the function dotchart (x, label= ...), where x is a numeric vector and
label is a vector of categorical labels for x. A barplot can be created with the barplot (height)
function, where height represents a vector or matrix.


(a) Dotchart on the miles per gallon of cars and (b) Barplot on the distribution of car cylinder counts

Code for the dotchart and barplot

data(mtcars)
dotchart(mtcars$mpg, labels=row.names(mtcars), cex=.7,
         main="Miles Per Gallon (MPG) of Car Models", xlab="MPG")
barplot(table(mtcars$cyl), main="Distribution of Car Cylinder Counts",
        xlab="Number of Cylinders")

Histogram and Density Plot


To represent the household income

The histogram shows a clear concentration of low household incomes on the left and the long tail
of the higher incomes on the right.

The density plot represents the logarithm of the household income values, which emphasizes the
shape of the distribution

The code to generate the two plots
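A hedged reconstruction (the income vector is simulated, since the original data is not shown);
the rug( ) call is explained next:

income <- rlnorm(5000, meanlog = log(40000), sdlog = log(5))   # long right tail
hist(income, breaks = 500, main = "Histogram of Income", xlab = "Income")
plot(density(log10(income)), main = "Distribution of Log Income")
rug(log10(income))   # mark the individual observations along the axis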


The rug( ) function creates a one-dimensional density plot along the bottom of the graph to
emphasize the distribution of the observations.

Density plot

library("ggplot2")
data(diamonds) # load the diamonds dataset from ggplot2
# Only keep the premium and ideal cuts of diamonds
niceDiamonds <- diamonds[diamonds$cut=="Premium" | diamonds$cut=="Ideal", ]
# plot density plot of diamond prices
ggplot(niceDiamonds, aes(x=price, fill=cut)) + geom_density(alpha=.3, color=NA)

Examining Multiple Variables


Scatterplot

A scatterplot is a simple and widely used visualization for finding the relationship among
multiple variables. A scatterplot can represent data with up to five variables using the x-axis,
y-axis, size, color, and shape, but usually only two to four variables are portrayed to
minimize confusion.
If the functional relationship between the variables is somewhat pronounced, the data may
roughly lie along a straight line, a parabola, or an exponential curve. If variable y is related
exponentially to x, then the plot of x versus log (y) is approximately linear. If the plot looks more
like a cluster without a pattern, the corresponding variables may have a weak relationship.

Examining two variables with regression
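A hedged sketch of such a plot (the data is simulated):

x <- runif(75, 0, 10)                             # 75 values between 0 and 10
y <- 200 + 3 * x + rnorm(75, mean = 0, sd = 10)   # a roughly linear relationship
plot(x, y)
abline(lm(y ~ x))    # overlay the fitted regression line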

Dotchart and Barplot


Dotchart and barplot can visualize multiple variables. Both of them use color as an additional
dimension for visualizing the data.

Dotchart to represent multiple variables


Barplot to visualize multiple variables

Box-and-Whisker Plot
Box-and-whisker plots show the distribution of a continuous variable for each value of a discrete
variable. The box-and-whisker plot in the accompanying figure visualizes mean household
incomes as a function of region in the United States.
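As a hedged stand-in for that figure (which is not reproduced here), the same kind of plot can be
drawn with the mtcars data used earlier:

boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of Cylinders", ylab = "Miles Per Gallon")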


Hexbinplot for Large Data sets

A hexbinplot combines the ideas of a scatterplot and a histogram. Similar to a scatterplot, a
hexbinplot visualizes data on the x-axis and y-axis. The data is placed into hexagonal bins, and a
third dimension uses shading to represent the concentration of data in each bin.
Hexbinplot of household income against years of education
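A hedged sketch using the hexbin package (assumed to be installed); the education and income
vectors are simulated stand-ins for the original variables:

library(hexbin)
education <- rnorm(10000, mean = 12, sd = 2)
income <- 20000 + 3000 * education + rnorm(10000, sd = 20000)
hexbinplot(income ~ education)   # darker hexagons indicate more observations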


Statistical Methods for Evaluation


Hypothesis Testing

When comparing populations, such as testing or evaluating the difference of the means from two
samples of data, a common technique to assess the difference, or the significance of the
difference, is hypothesis testing.

The basic concept of hypothesis testing is to form an assertion and test it with data. When
performing hypothesis tests, the common assumption is that there is no difference between two
samples. This assumption is used as the default position for building the test or conducting a
scientific experiment. Statisticians refer to this as the null hypothesis (H0). The alternative
hypothesis (HA) is that there is a difference between two samples.

For example, if the task is to identify the effect of drug A compared to drug B on patients, the
null hypothesis and alternative hypothesis would be this
• H0: Drug A and drug B have the same effect on patients.
• HA: Drug A has a greater effect than drug B on patients.

Difference of Means
The difference in means can be tested using Student's t-test or the Welch's t-test.

Student's t-test
Student's t-test assumes that the distributions of the two populations have equal but unknown
variances. Suppose n1 and n2 samples are randomly and independently selected from two
populations, pop1 and pop2, respectively. If each population is normally distributed with the
same mean (µ1 = µ2) and with the same variance, then T (the t-statistic), given in the equation
below, follows a t-distribution with n1 + n2 - 2 degrees of freedom (df).
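The standard pooled-variance form of this statistic (reconstructed here, as the original equation
image did not survive) is:

T = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad
S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}

where the X-bars are the sample means, the S^2 terms are the sample variances, and S_p is the
pooled standard deviation.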

Ex. R code
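A hedged sketch with two simulated samples:

x <- rnorm(10, mean = 100, sd = 5)
y <- rnorm(20, mean = 105, sd = 5)
t.test(x, y, var.equal = TRUE)   # Student's t-test with pooled variance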


Welch's t-test

When the equal population variance assumption is not justified in performing Student's t-test for
the difference of means, Welch's t-test can be used, based on the T statistic expressed below
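In its standard form (reconstructed here, as the original equation image did not survive):

T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}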

Ex. R code
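Reusing the simulated samples from the previous sketch:

t.test(x, y)   # Welch's t-test is R's default when var.equal is not specified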

The degrees of freedom for Welch's t-test are given by the Welch-Satterthwaite approximation
shown below
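In its standard form:

df = \frac{\left( \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} \right)^2}
          {\frac{(S_1^2 / n_1)^2}{n_1 - 1} + \frac{(S_2^2 / n_2)^2}{n_2 - 1}}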

Wilcoxon Rank-Sum Test

The Wilcoxon rank-sum test is a nonparametric hypothesis test that checks whether two
populations are identically distributed.
Ex.
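Reusing the simulated samples above:

wilcox.test(x, y, conf.int = TRUE)   # also reports a confidence interval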

Type I and Type II Errors

A hypothesis test may result in two types of errors, depending on whether the test accepts or
rejects the null hypothesis. These two errors are known as type I and type II errors.

 A type I error is the rejection of the null hypothesis when the null hypothesis is TRUE.
The probability of the type I error is denoted by the Greek letter α
 A type II error is the acceptance of a null hypothesis when the null hypothesis is FALSE.
The probability of the type II error is denoted by the Greek letter β


ANOVA (Analysis of Variance)

ANOVA is a generalization of the hypothesis testing of the difference of two population means.
ANOVA tests if any of the population means differ from the other population means. The null
hypothesis of ANOVA is that all the population means are equal. The alternative hypothesis is
that at least one pair of the population means is not equal.

The aov ( ) function performs the ANOVA

Ex.

Consider an example in which every customer who visits a retail website gets one of two
promotional offers or gets no promotion at all. The goal is to see if making the promotional
offers makes a difference. ANOVA could be used, and the null hypothesis is that neither
promotion makes a difference. The code that follows randomly generates a total of 500
observations of purchase sizes on three different offer options.
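A hedged reconstruction of that code (the means and spreads of the purchase sizes are
illustrative):

offers <- sample(c("offer1", "offer2", "nopromo"), size = 500, replace = TRUE)
# simulated purchase sizes; ifelse evaluates both branches and selects per observation
purchase_amt <- ifelse(offers == "offer1", rnorm(500, mean = 80, sd = 30),
                ifelse(offers == "offer2", rnorm(500, mean = 85, sd = 30),
                                           rnorm(500, mean = 40, sd = 30)))
offertest <- data.frame(offer = as.factor(offers), purchase_amt = purchase_amt)
summary(offertest)                                     # counts per offer and purchase ranges
model <- aov(purchase_amt ~ offer, data = offertest)   # perform the ANOVA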

The summary of the offertest data frame shows that 170 offer1, 161 offer2, and 169 nopromo (no
promotion) offers have been made. It also shows the range of purchase size (purchase_amt) for
each of the three offer options.


The summary( ) function shows a summary of the model:
summary(model)
