Big Data Testing

The document discusses three key components of testing Big Data applications: 1) Loading initial data into HDFS and verifying it is complete and correctly partitioned, 2) Executing Map-Reduce operations to process the data and validating the outputs meet requirements, and 3) Rolling out results from HDFS and ensuring they are accurately loaded into downstream systems. Big Data testing requires understanding the technologies, developing automated testing tools, and obtaining real production data when possible for the most realistic testing.


Testing Big Data: Three Fundamental Components

Big Data is a big topic in software development today. In practice, however, many software testers do not yet fully understand what Big Data actually is; what they do know is that you need a plan for testing it. The problem is the lack of a clear understanding of what to test and how deeply a tester should go. Some key questions must be answered before going down this path. Since most Big Data lacks a traditional structure, what does Big Data quality look like? And what are the most appropriate software testing tools?
As a software tester, it is imperative to first have a clear definition of Big Data. Many of us mistakenly believe that Big Data is just a large amount of information. This is incorrect: a 2-petabyte Oracle database alone doesn't constitute a Big Data situation, just a high-load one. To be precise, Big Data is a set of approaches, tools and methods for processing high volumes of structured and, most importantly, unstructured data. The key difference between Big Data and ordinary high-load systems is the ability to create flexible queries.
The Big Data trend first appeared five years ago in the U.S., when researchers from Google announced their achievement in the scientific journal Nature. Without relying on the results of medical tests, they were able to track the spread of flu in the U.S. by analyzing the volume of Google search queries related to influenza-like illness in the population.
Today, Big Data can be described by three Vs: Volume, Variety and Velocity. In other words, you have to process an enormous amount of data of various formats at high speed. The processing of Big Data, and therefore its software testing process, can be split into three basic components.
The process is illustrated below by an example based on the open source Apache Hadoop
software framework:
1. Loading the initial data into the Hadoop Distributed File System (HDFS).
2. Execution of Map-Reduce operations.
3. Rolling out the output results from the HDFS.
Loading the Initial Data into HDFS
In this first step, the data is retrieved from various sources (social media, web logs, social networks, etc.) and uploaded into HDFS, where it is split into multiple files:

- Verify that the required data was extracted from the original system and that no data was corrupted (a minimal sketch of this check appears after the list).

- Validate that the data files were loaded into HDFS correctly.

- Check the partitioning of the files and their replication across different data nodes.

- Determine the most complete set of data that needs to be checked. For step-by-step validation, you can use tools such as Datameer, Talend or Informatica.
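
As an illustration of the first two checks, the sketch below compares record counts and a checksum between the source extract and the files landed in HDFS. It is a minimal sketch, not a production harness: the file paths are placeholders, it assumes the hdfs dfs command-line client is available on the test machine, and it assumes the load preserves record order (otherwise an order-insensitive comparison would be needed).

import hashlib
import subprocess

def local_stats(path):
    """Count records and hash the source extract before it is loaded."""
    digest, count = hashlib.md5(), 0
    with open(path, "rb") as f:
        for line in f:
            digest.update(line)
            count += 1
    return count, digest.hexdigest()

def hdfs_stats(hdfs_glob):
    """Stream the loaded files back out of HDFS and compute the same stats.
    Assumes the hdfs dfs CLI is on the PATH of the test machine."""
    proc = subprocess.run(["hdfs", "dfs", "-cat", hdfs_glob],
                          capture_output=True, check=True)
    digest, count = hashlib.md5(), 0
    for line in proc.stdout.splitlines(keepends=True):
        digest.update(line)
        count += 1
    return count, digest.hexdigest()

if __name__ == "__main__":
    # Placeholder paths: substitute the real source extract and HDFS target.
    source = local_stats("/data/export/weblogs_2014-01-01.csv")
    loaded = hdfs_stats("/user/etl/weblogs/2014-01-01/part-*")
    assert source == loaded, "record count or checksum mismatch after HDFS load"
    print("HDFS load verified: %d records, checksum %s" % source)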

Execution of Map-Reduce Operations


In this step, you process the initial data using a Map-Reduce operation to obtain the desired result. Map-Reduce is a data processing concept for condensing large volumes of data into useful aggregated results:

- Check the required business logic on a standalone unit and then on the full set of units (a sketch of such a standalone check follows this list).

- Validate the Map-Reduce process to ensure that key-value pairs are generated correctly.

- Check the aggregation and consolidation of the data after the "reduce" operation has been performed.

- Compare the output data with the initial files to make sure that the output file was generated and that its format meets all the requirements.
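
As a sketch of the first check, the map and reduce logic can be exercised standalone on a small, known sample before it runs on the cluster. The example below follows the Hadoop Streaming style in Python; the web-log format, field layout and function names are illustrative assumptions rather than part of any particular framework.

from itertools import groupby
from operator import itemgetter

def map_fn(line):
    """Emit (key, value) pairs: here, one hit per page found in a web-log line."""
    user, page = line.rstrip("\n").split("\t")   # assumes tab-separated user/page
    yield page, 1

def reduce_fn(key, values):
    """Aggregate all values for one key: here, summing the hit counts."""
    yield key, sum(values)

def run_local(lines):
    """Run map and reduce in-process, mimicking the cluster's shuffle-and-sort."""
    mapped = sorted(kv for line in lines for kv in map_fn(line))
    for key, group in groupby(mapped, key=itemgetter(0)):
        yield from reduce_fn(key, (value for _, value in group))

if __name__ == "__main__":
    sample = ["u1\t/home", "u2\t/home", "u3\t/cart"]
    result = dict(run_local(sample))
    # Standalone validation of the business logic against a known answer.
    assert result == {"/home": 2, "/cart": 1}, result
    print("map/reduce logic verified on sample:", result)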

The most appropriate language for verifying the data is Hive. Testers prepare queries in the Hive (SQL-style) Query Language, HQL, and run them against HBase to verify that the output complies with the requirements. HBase is a NoSQL database that can serve as the input and output for Map-Reduce jobs.
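
One way to automate that verification is to fire the HQL checks at HiveServer2 from the test harness. The sketch below assumes the optional PyHive client and purely illustrative host, table and column names; the same queries could equally be run through the beeline CLI.

# Minimal sketch using the optional PyHive client; the host, port and
# table names are illustrative placeholders for this example.
from pyhive import hive

EXPECTED_TOTAL = 1000000   # e.g. a record count known from the source extract

conn = hive.connect(host="hive-server.example.com", port=10000)
cursor = conn.cursor()

# The aggregated output should account for every record in the raw input.
cursor.execute("SELECT SUM(hit_count) FROM page_hits_aggregated")
aggregated_total = cursor.fetchone()[0]

cursor.execute("SELECT COUNT(*) FROM web_logs_raw")
raw_total = cursor.fetchone()[0]

assert aggregated_total == raw_total == EXPECTED_TOTAL, (
    "aggregation mismatch: raw=%s aggregated=%s" % (raw_total, aggregated_total))
print("HQL verification passed:", aggregated_total)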
You can also use other Big Data processing frameworks as an alternative to Map-Reduce. Spark and Storm are good examples of substitutes for this programming model, as they provide similar functionality and are compatible with the Hadoop ecosystem.
Rolling out the Output Results from HDFS
This final step consists of unloading the data generated in the second step and loading it into the downstream system, which may be a repository used to generate reports or a transactional analysis system for further processing:

- Inspect the data aggregation to make sure the data has been loaded into the required system without being distorted (a reconciliation sketch follows this list).

- Validate that the reports include all the required data and that all indicators refer to concrete measures and are displayed correctly.
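
One way to automate that inspection is to reconcile the HDFS output with what actually landed downstream, as sketched below. The paths, the table name and the use of sqlite3 as a stand-in for the downstream reporting store are assumptions made for illustration only.

import sqlite3
import subprocess

def hdfs_output_totals(hdfs_glob):
    """Read the job output (key<TAB>count per line) back from HDFS.
    Assumes the hdfs dfs CLI is available to the test harness."""
    out = subprocess.run(["hdfs", "dfs", "-cat", hdfs_glob],
                         capture_output=True, check=True, text=True)
    totals = {}
    for line in out.stdout.splitlines():
        key, value = line.split("\t")
        totals[key] = int(value)
    return totals

def downstream_totals(db_path):
    """Read the same aggregates from the downstream reporting store.
    sqlite3 stands in here for whichever warehouse the results feed."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT page, hit_count FROM page_hits_report")
    return {page: count for page, count in rows}

if __name__ == "__main__":
    produced = hdfs_output_totals("/user/etl/output/page_hits/part-*")
    loaded = downstream_totals("/var/reports/analytics.db")
    assert produced == loaded, "downstream load does not match the HDFS output"
    print("roll-out verified: %d keys reconciled" % len(loaded))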
Test data for a Big Data project can be obtained in two ways: by copying actual production data or by creating data exclusively for testing purposes, the former being the method preferred by software testers. In that case the conditions are as realistic as possible, and it becomes easier to work with a larger number of test scenarios. However, not all companies are willing to provide real data, since they prefer to keep some information confidential. In that case, you must create test data yourself or request artificially generated data. The main drawback of this approach is that artificial business scenarios built from limited data inevitably restrict testing; only real users can then detect the remaining defects.

As speed is one of Big Data's main characteristics, performance testing is mandatory. A huge volume of data and an infrastructure similar to the production infrastructure are usually created for performance testing. Furthermore, where acceptable, data is copied directly from production.
To determine the performance metrics and to detect errors, you can use, for instance, Hadoop's performance monitoring tools. Performance testing covers fixed indicators such as operating time and capacity, as well as system-level metrics such as memory usage.
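
As one possible way to collect such indicators automatically, the sketch below polls the YARN ResourceManager REST API for finished applications and flags any job whose elapsed time exceeds an agreed budget. The ResourceManager address and the threshold are assumptions made for this example.

import json
import urllib.request

# Illustrative ResourceManager address and SLA threshold for this sketch.
RM_APPS_URL = "http://resourcemanager.example.com:8088/ws/v1/cluster/apps?states=FINISHED"
MAX_ELAPSED_MS = 30 * 60 * 1000      # e.g. a 30-minute budget per job

with urllib.request.urlopen(RM_APPS_URL) as response:
    apps = json.load(response)["apps"]["app"]

# Flag finished jobs that exceeded the agreed operating-time budget.
slow = [(app["name"], app["elapsedTime"]) for app in apps
        if app["elapsedTime"] > MAX_ELAPSED_MS]

for name, elapsed in slow:
    print("SLOW JOB: %s took %.1f minutes" % (name, elapsed / 60000.0))

# Memory-related metrics (e.g. allocatedMB) can be read from the same report.
assert not slow, "%d jobs exceeded the operating-time budget" % len(slow)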
To be successful, Big Data testers have to learn the components of the Big Data ecosystem from scratch. Since the market has yet to produce fully automated testing tools for Big Data validation, the tester has little choice but to acquire the same skill set as the Big Data developer when it comes to leveraging Big Data technologies like Hadoop. This requires a tremendous mindset shift for testers as well as for testing units within organizations. To stay competitive, companies should invest in Big Data-specific training and in developing automation solutions for Big Data validation.
In conclusion, Big Data processing holds much promise for today's businesses. If you apply the right test strategies and follow best practices, you will improve Big Data testing quality, which will help identify defects at early stages and reduce overall cost.

Big Data Testing

Big data creates a new layer in the economy that is all about turning information, or data, into revenue. This will accelerate growth in the global economy and create jobs. In 2013, big data is forecast to drive $34 billion of IT spending - Gartner
Data science is about creating a process that lets you chart out new ways of thinking about novel problems, or about using existing data creatively with a pragmatic approach.
Businesses are struggling to grapple with the phenomenal information explosion. Conventional
database systems and business intelligence applications have given way to horizontal databases,
columnar designs and cloud-enabled schemas powered by sharding techniques.
The role of QA is particularly challenging in this context, as the discipline is still at a nascent stage. Testing Big Data applications requires a specific mindset, skill set, deep understanding of the technologies and pragmatic approaches to data science. Big Data is an interesting area from a tester's perspective, and understanding how Big Data evolved, what it is meant for, and why Big Data applications must be tested is fundamentally important.

Big Data Testing Needs and Challenges

The following are some of the needs and challenges that make it imperative for Big Data applications to be tested thoroughly.
An in-depth understanding of the four Vs of Big Data is key to successful Big Data testing.

Increasing need for live integration of information: With multiple sources of information and different data formats, it has become imperative to facilitate live integration of information. This forces enterprises to maintain constantly clean and reliable data, which can only be ensured through end-to-end testing of the data sources and integrators.

Instant data collection and deployment: The power of predictive analytics and the ability to take decisive actions have pushed enterprises to adopt instant data collection solutions. These decisions bring significant business impact by leveraging insights from the minute patterns in large data sets. Add to that the CIO's profile, which demands deployment of instant solutions to stay in tune with the changing dynamics of the business. Unless the applications and data feeds are tested and certified for live deployment, these challenges cannot be met with the assurance that is essential for every critical operation.

Real-time scalability challenges: Big Data applications are built to match the level of scalability and monumental data processing involved in a given scenario. Critical errors in the architectural elements governing the design of Big Data applications can lead to catastrophic situations. Rigorous testing involving smarter data sampling and cataloguing techniques, coupled with high-end performance testing capabilities, is essential to meet the scalability problems that Big Data applications pose.

Data Integration - Drawing large and disparate data sets together in real time. Current data integration platforms, which were built for an older generation of data challenges, limit IT's ability to support the business. In order to keep up, organizations are beginning to look at next-generation data integration techniques and platforms.

The ability to understand, analyze and create test sets that encompass multiple data sets is vital to ensure comprehensive Big Data testing.
Testing Data Intensive Applications and Business Intelligence Solutions
Cigniti leverages its experience in testing large-scale data warehousing and business intelligence applications to offer a host of Big Data testing services and solutions.

Testing New Age Big Data Applications - Cigniti Testlets


Cigniti Testlets offer point solutions for all the problems that a new-age Big Data application has to go through before being certified at QA levels that match industry standards.

To know more about how Cigniti can help you take advantage of large data sets through comprehensive testing of your Big Data application, write to [email protected]

Big Data testing: The challenge and the opportunity


The possibility of unknown scenarios in Big Data testing is gigantic compared to testing techniques for conventional applications. The scope and range of the data harnessed in Big Data applications will demand new benchmarks of Software Quality Assurance.
The inherent production of digital data across economies and institutions is seen as an enormous source of information, which can help build a reliable knowledge base for critical decisions. As the IT-enabled global economy moves ahead, enterprises look at new ways of utilizing existing and growing data. At such moments, the Big Data perspective bridges current and emerging trends.

Big data has purpose, little data has hope. While current trends suggest Big Data-driven business is an avenue that requires substantial investments, the future will see a growth of Big Data apps by ISVs and the small and medium enterprise segment as well. Moreover, as business grows, enterprises need to accommodate and manage the increasing volume, variety and velocity of the data that flows into their IT systems.
The conventional columnar designs and horizontal databases demand continuous expansion to
store and retrieve this data. The sheer volume in itself weighs on the cloud-enabled schemas and
sharding techniques, forcing enterprises to look for new ways to accept, model and discard the
data. Findings of an MIT research project by Andrew McAfee and Erik Brynjolfsson indicate
that companies which inject big data and analytics into their operations show productivity rates
and profitability that are 5 to 6 percent higher than those of their peers.
To accommodate Big Data test requirements, processes and infrastructure will be redesigned to achieve new levels of scalability, reusability and compatibility, ensuring comprehensive, continuous and context-driven test capabilities. To handle the volume and ensure live data integration, Big Data testing needs to empower developers and enterprises with the freedom to experiment and innovate.
One data layer
From a Big Data perspective, enterprises will seek validation of application design, data security,
source verification and compliance with industry standards. The parameters of performance,
speed, security and load will add magnitude and precision to sculpt and reorganize data volumes
into blocks that match the emerging requirements.
Over time, the database and storage layers will merge into a single data layer, with options for retrieval and transmission exported out of that layer.
Business leaders now look at data maps to estimate and draft plans for emerging scenarios. The
transformation of data into comprehensive reports in real time will add value to business
decisions and enrich operations with higher levels of speed and accuracy. Test capabilities will acquire the ability to untangle data sources, types and structures and to channel them along specified contexts to align with objectives.

In a story titled "The Top 7 Things Obama Taught Us About the Future of Business", Forbes reported that the Obama campaign used a testing tool called Optimizely to improve efficiency. Dan Siroker, co-founder of Optimizely, was quoted as saying: "We ran over 240 A/B tests to try different messaging, calls to action, and in an attempt to raise more money. Because of our efforts, we increased effectiveness 49 percent."
Why is Big Data a good opportunity for software testers?
Consider this: a joint report by NASSCOM and CRISIL Global Research & Analytics suggests that by 2015, Big Data is expected to become a USD 25 billion industry, growing at a CAGR of 45 per cent. Managing data growth is the number-two priority for IT organizations over the next 12-18 months. In order to sustain growth, enterprises will adopt next-generation data integration platforms and techniques, fueling the demand for Quality Assurance mechanisms around the new data perspectives.
Be a smart tester and ride the next wave of IT on Big Data
Testers can formulate service models through operational exposure to data acquisition techniques on Hadoop and related platforms. Test approaches can be developed by studying the deployment strategies of Mahout, Java, Python, Pig, Hive, etc. Contextualizing data from diverse sources into streamlined outputs helps testers understand the channels of business logic in data science.
Big Data is an emerging discipline that will leave a profound impact on the global economy. The ability to explore the power of Big Data testing is like being in a hotspot that will see action in terms of innovations matching emerging test requirements.
