Big Data Testing
Big Data is a big topic in software development today. In practice, however, software testers may not yet fully understand what Big Data actually is; what they do know is that they need a plan for testing it. The problem is the lack of a clear understanding of what to test and how deep a tester should go. Some key questions must be answered before going down this path. Since most Big Data lacks a traditional structure, what does Big Data quality look like? And what are the most appropriate software testing tools?
As a software tester, it is imperative to first have a clear definition of Big Data. Many of us improperly believe that Big Data is just a large amount of information. This is a completely incorrect assumption. For example, a 2-petabyte Oracle database alone doesn't constitute a Big Data situation, just a high-load one. To be precise, Big Data is a set of approaches, tools and methods for processing high volumes of structured and (most importantly) unstructured data. The key difference between Big Data and ordinary high-load systems is the ability to create flexible queries.
The Big Data trend first appeared about five years ago in the U.S., when researchers from Google announced a global achievement in the scientific journal Nature: without any results from medical tests, they were able to track the spread of flu in the U.S. by analyzing the number of Google search queries related to influenza-like illness in the population.
Today, Big Data can be described by three Vs: Volume, Variety and Velocity. In other words, you have to process an enormous amount of data in various formats at high speed. The processing of Big Data, and therefore its testing, can be split into three basic components.
The process is illustrated below by an example based on the open source Apache Hadoop
software framework:
1. Loading the initial data into the Hadoop Distributed File System (HDFS).
2. Execution of Map-Reduce operations.
3. Rolling out the output results from the HDFS.
Loading the Initial Data into HDFS
In this first step, the data is retrieved from various sources (social media, web logs, social networks, etc.) and uploaded into the HDFS, where it is split into multiple files. The checks at this stage are:
Verify that the required data was extracted from the original system and that no data was corrupted along the way (a minimal record-count check is sketched after this list).
Validate that the data files were loaded into the HDFS correctly.
Check that the files are partitioned and replicated across different data nodes.
Determine the most complete set of data that needs to be checked. For step-by-step validation, you can use tools such as Datameer, Talend or Informatica.
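As an illustration, here is a minimal sketch in Python of the first check. It assumes a newline-delimited source extract at a hypothetical local path and a hypothetical HDFS target directory, and simply compares record counts on both sides using the standard hdfs dfs command.

```python
"""Minimal sketch of a step-1 check: verify that the record count of the
source extract matches what was loaded into HDFS. The file paths and the
assumption that records are newline-delimited are illustrative only."""
import subprocess

SOURCE_FILE = "/data/exports/weblogs_2024-01-01.log"   # hypothetical source extract
HDFS_DIR = "/user/etl/weblogs/2024-01-01"              # hypothetical HDFS target

def count_source_records(path):
    # Count newline-delimited records in the local source extract.
    with open(path, "rb") as f:
        return sum(1 for _ in f)

def count_hdfs_records(hdfs_dir):
    # Stream every part file in the HDFS directory and count its lines.
    cat = subprocess.run(
        ["hdfs", "dfs", "-cat", f"{hdfs_dir}/*"],
        capture_output=True, check=True,
    )
    return len(cat.stdout.splitlines())

if __name__ == "__main__":
    src, dst = count_source_records(SOURCE_FILE), count_hdfs_records(HDFS_DIR)
    assert src == dst, f"record count mismatch: source={src}, HDFS={dst}"
    print(f"OK: {src} records loaded without loss")
```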
Execution of Map-Reduce Operations
In this second step, the Map-Reduce jobs that implement the business logic are run against the loaded data. The checks at this stage are:
Check the required business logic on a single node and then across the full set of nodes.
Validate the Map-Reduce process to ensure that key-value pairs are generated correctly.
Check the aggregation and consolidation of data after the "reduce" operation has been performed (a reference-aggregation check is sketched after this list).
Compare the output data with the initial files to make sure that the output file was generated and that its format meets all the requirements.
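To illustrate the aggregation check, the following sketch recomputes the expected "reduce" result locally over a small input sample and compares it with the key-value pairs the job actually emitted. The comma-separated input layout, the tab-separated output format and the "count records per key" logic are assumptions made for the example.

```python
"""Sketch of a step-2 check: recompute the expected "reduce" aggregation
locally over a small input sample and compare it with the key/value pairs
the Map-Reduce job produced. Input/output layouts are assumptions."""
from collections import Counter

def expected_counts(input_lines):
    # Reference implementation of the map + reduce logic: count records per key.
    keys = (line.strip().split(",")[0] for line in input_lines if line.strip())
    return Counter(keys)

def parse_job_output(output_lines):
    # The job is assumed to emit "key<TAB>count" pairs, one per line.
    result = {}
    for line in output_lines:
        if line.strip():
            key, value = line.rstrip("\n").split("\t")
            result[key] = int(value)
    return result

if __name__ == "__main__":
    with open("sample_input.csv") as f:      # hypothetical sample of the input split
        expected = expected_counts(f)
    with open("part-r-00000") as f:          # hypothetical reducer output file
        actual = parse_job_output(f)
    assert actual == dict(expected), "aggregation mismatch between job output and reference"
    print(f"OK: {len(actual)} keys aggregated correctly")
```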
The most appropriate language for verifying the data is Hive. Testers prepare requests in the SQL-like Hive Query Language (HQL) and run them against HBase to verify that the output complies with the requirements. HBase is a NoSQL database that can serve as both the input and the output for Map-Reduce jobs.
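A validation query of this kind might look like the following sketch, which uses the third-party pyhive package to run an HQL aggregate against the output table. The host name, table and column names, and the expected total are all hypothetical.

```python
"""Sketch of an output check with HQL: compare an aggregate computed over the
job's output table against the value expected from the source data. Assumes a
reachable HiveServer2 endpoint; all names and figures are hypothetical."""
from pyhive import hive

EXPECTED_TOTAL = 1_482_907   # hypothetical total derived from the source system

conn = hive.Connection(host="hive.example.internal", port=10000, username="qa")
cursor = conn.cursor()

# Aggregate the reduced output and check it against the expected figure.
cursor.execute(
    "SELECT COUNT(*) AS rows_loaded, SUM(event_count) AS total_events "
    "FROM weblog_aggregates WHERE load_date = '2024-01-01'"
)
rows_loaded, total_events = cursor.fetchone()

assert total_events == EXPECTED_TOTAL, (
    f"aggregate mismatch: expected {EXPECTED_TOTAL}, got {total_events}"
)
print(f"OK: {rows_loaded} rows, {total_events} events match the source")
```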
You can also use other Big Data processing engines as an alternative to Map-Reduce. Frameworks such as Spark and Storm are good substitutes for this programming model, as they provide similar functionality and are compatible with the Hadoop ecosystem.
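For instance, the same "count per key" aggregation used in the verification sketch above could be expressed in Spark roughly as follows; the input path and record layout are again assumptions.

```python
"""Sketch of the same aggregation expressed with Spark instead of Map-Reduce,
so the verification approach stays unchanged. Path and layout are assumptions."""
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("qa-aggregation-check").getOrCreate()

# Re-create the map ("emit key") and reduce ("count per key") steps.
lines = spark.sparkContext.textFile("hdfs:///user/etl/weblogs/2024-01-01/*")
counts = (
    lines.filter(lambda line: line.strip())
         .map(lambda line: (line.split(",")[0], 1))
         .reduceByKey(lambda a, b: a + b)
)

# Collect a small result set and spot-check it against expected values.
for key, count in counts.take(10):
    print(key, count)

spark.stop()
```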
Rolling out the Output Results from HDFS
This final step unloads the data generated by the second step and loads it into the downstream system, which may be a repository used to generate reports or a transactional analysis system for further processing. The checks at this stage are:
Inspect the data aggregation to make sure that the data has been loaded into the required system and was not distorted (a total-reconciliation check is sketched after this list).
Validate that the reports include all the required data and that every indicator refers to the correct measure and is displayed correctly.
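As a rough illustration of the reconciliation check, the sketch below compares a total computed from the exported output file with the same figure read back from the downstream reporting store. A local SQLite database stands in for the real reporting system, and the file, table and column names are assumptions.

```python
"""Sketch of a step-3 check: confirm that the figures shown in a downstream
report match the data exported from HDFS. SQLite stands in for the reporting
system; file, table and column names are assumptions."""
import csv
import sqlite3

# Sum the metric directly from the exported output file.
with open("hdfs_export/part-r-00000.csv") as f:   # hypothetical exported result
    exported_total = sum(int(row["event_count"]) for row in csv.DictReader(f))

# Read the same figure back from the downstream reporting store.
conn = sqlite3.connect("reporting.db")            # stand-in for the real reporting system
(reported_total,) = conn.execute(
    "SELECT SUM(event_count) FROM daily_report WHERE load_date = '2024-01-01'"
).fetchone()

assert exported_total == reported_total, (
    f"report mismatch: export={exported_total}, report={reported_total}"
)
print(f"OK: report total {reported_total} matches the exported data")
```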
Test data for a Big Data project can be obtained in two ways: by copying actual production data or by creating data exclusively for testing purposes, the former being the preferred method for software testers. In that case the conditions are as realistic as possible, which makes it easier to work with a larger number of test scenarios. However, not all companies are willing to provide real data, since they may prefer to keep some information confidential. In that case you must create the test data yourself or request artificial data. The main drawback of this approach is that artificial business scenarios built on limited data inevitably restrict testing; some defects will only be detected by real users.
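When artificial data has to be created, a simple generator along the following lines can produce it. The record layout shown here is purely illustrative and should mirror whatever schema the real pipeline ingests.

```python
"""Sketch of generating artificial test data when production data cannot be
shared. The record layout (timestamp, user id, URL, status code) is illustrative."""
import csv
import random
from datetime import datetime, timedelta

URLS = ["/home", "/search", "/cart", "/checkout"]
START = datetime(2024, 1, 1)

def synthetic_records(n):
    # Yield n fake weblog records with a realistic-looking spread of values.
    for _ in range(n):
        yield {
            "timestamp": (START + timedelta(seconds=random.randrange(86_400))).isoformat(),
            "user_id": f"user_{random.randrange(10_000):05d}",
            "url": random.choice(URLS),
            "status": random.choices([200, 404, 500], weights=[95, 4, 1])[0],
        }

if __name__ == "__main__":
    with open("synthetic_weblogs.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["timestamp", "user_id", "url", "status"])
        writer.writeheader()
        writer.writerows(synthetic_records(100_000))
    print("wrote 100,000 synthetic records")
```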
Big Data creates a new layer in the economy, one that is all about turning information, or data, into revenue. This will accelerate growth in the global economy and create jobs. In 2013, Big Data is forecast to drive $34 billion of IT spending (Gartner).
Data science is about creating processes that open up new ways of thinking about novel problems, or about using existing data creatively and pragmatically.
Businesses are struggling to grapple with the phenomenal information explosion. Conventional
database systems and business intelligence applications have given way to horizontal databases,
columnar designs and cloud-enabled schemas powered by sharding techniques.
The role of QA is particularly challenging in this context, as the discipline is still at a nascent stage. Testing Big Data applications requires a specific mindset, skill set, a deep understanding of the technologies and pragmatic approaches to data science. From a tester's perspective, it is fundamentally important to understand how Big Data evolved, what it is meant for, and why Big Data applications need to be tested.
The following are some of the needs and challenges that make it imperative for Big Data
applications to be tested thoroughly.
An in-depth understanding of the four Vs of Big Data is key to successful Big Data testing.
Data Integration - Drawing large and disparate data sets together in real time.
Current data integration platforms, which were built for an older generation of data challenges, limit IT's ability to support the business. In order to keep up, organizations are beginning to look at next-generation data integration techniques and platforms.
The ability to understand, analyze and create test sets that encompass multiple data sets is vital to ensuring comprehensive Big Data testing.
Testing Data Intensive Applications and Business Intelligence Solutions
Cigniti leverages its experience of having tested large scale data warehousing and business
intelligence applications to offer a host of Big Data Testing services and solutions.
To learn more about how Cigniti can help you take advantage of large data sets through comprehensive testing of your Big Data application, write to [email protected]
Big data has purpose, little data has hope. While current trends suggest that Big Data driven business is an avenue that requires substantial investment, the future will see growth of Big Data applications among ISVs and the small and medium enterprise segment as well. Moreover, as business grows, enterprises need to accommodate and manage the increasing volume, variety and velocity of the data that flows into their IT systems.
The conventional columnar designs and horizontal databases demand continuous expansion to store and retrieve this data. The sheer volume in itself weighs on cloud-enabled schemas and sharding techniques, forcing enterprises to look for new ways to accept, model and discard
data. Findings of an MIT research project by Andrew McAfee and Erik Brynjolfsson indicate
that companies which inject big data and analytics into their operations show productivity rates
and profitability that are 5 to 6 percent higher than those of their peers.
The possibility of unknown scenarios in Big Data testing is vast compared with testing techniques for conventional applications. The scope and range of the data harnessed in Big Data applications will demand new benchmarks for software quality assurance.
To accommodate Big Data test requirements, processes and infrastructure will have to be redesigned to achieve new levels of scalability, reusability and compatibility, and to ensure comprehensive, continuous and context-driven test capabilities. To handle the volume and ensure live data integration, Big Data testing needs to give developers and enterprises the freedom to experiment and innovate.
One data layer
From a Big Data perspective, enterprises will seek validation of application design, data security,
source verification and compliance with industry standards. The parameters of performance,
speed, security and load will add magnitude and precision to sculpt and reorganize data volumes
into blocks that match the emerging requirements.
Over time, the database and storage layers will merge into a single data layer with options of
retrieval and transmission exported out of the layer.
Business leaders now look at data maps to estimate and draft plans for emerging scenarios. The
transformation of data into comprehensive reports in real time will add value to business
decisions and enrich operations with higher levels of speed and accuracy. Test capabilities will need the ability to untangle data sources, types and structures and channel them along specified contexts to align with business objectives.
In a story titled "The Top 7 Things Obama Taught Us About the Future of Business," Forbes reported that the Obama campaign used a testing tool called Optimizely to improve efficiency. Dan Siroker, co-founder of Optimizely, was quoted as saying, "We ran over 240 A/B tests to try different messaging, calls to action, and in an attempt to raise more money. Because of our efforts, we increased effectiveness 49 percent."
Why is Big Data a good opportunity for software testers?
Consider this. A joint report by NASSCOM and CRISIL Global Research & Analytics suggests
that by 2015, Big Data is expected to become a USD 25 billion industry, growing at a CAGR of
45 per cent. Managing data growth is the number two priority for IT organizations over the next
12-18 months. In order to sustain growth, enterprises will adopt next generation data integration
platforms and techniques fueling the demand for Quality Assurance mechanisms around the new
data perspectives.
Be a smart tester and ride the next wave of IT on Big Data
Testers can formulate service models through operational exposure to data acquisition techniques on Hadoop and related platforms. Test approaches can be developed by studying the deployment strategies of Mahout, Java, Python, Pig, Hive and similar technologies. Contextualizing data from diverse sources into streamlined outputs helps testers understand the channels of business logic in data science.
Big Data is an emerging discipline that will leave a profound impact on the global economy. Testers who explore the power of Big Data testing will find themselves at a hotspot of innovation, well placed to meet emerging test requirements.