Big Data Testing
Big Data is a big topic in software development today. In practice, however, software testers may not yet fully understand what Big Data actually is; what they do know is that they need a plan for testing it. The problem is the lack of a clear understanding of what to test and how deep a tester should go. Some key questions must be answered before going down this path. Since most Big Data lacks a traditional structure, what does Big Data quality look like? And what are the most appropriate software testing tools?
As a software tester, it is imperative to first have a clear definition of Big Data. Many of us improperly believe that Big Data is just a large amount of information. This is a completely incorrect assumption. For example, a 2-petabyte Oracle database alone doesn't constitute a Big Data situation, just a high-load one. To be precise, Big Data is a set of approaches, tools and methods for processing high volumes of structured and (most importantly) unstructured data. The key difference between Big Data and ordinary high-load systems is the ability to create flexible queries.
The Big Data trend first appeared about five years ago in the U.S., when researchers from Google announced a global achievement in the scientific journal Nature: without any results from medical tests, they were able to track the spread of flu in the U.S. by analyzing the number of Google search queries related to influenza-like illness in the population.
Today, Big Data can be described by three Vs: Volume, Variety and Velocity. In other words, you have to process an enormous amount of data in various formats at high speed. The processing of Big Data, and therefore its testing, can be split into three basic components.
The process is illustrated below by an example based on the open source Apache Hadoop
software framework:
1. Loading the initial data into the Hadoop Distributed File System (HDFS).
2. Execution of Map-Reduce operations.
3. Rolling out the output results from the HDFS.
Loading the Initial Data into HDFS
In this first step, the data is retrieved from various sources (social media, web logs, social networks, etc.) and uploaded into the HDFS, where it is split into multiple files. The checks at this stage are:
Verify that the required data was extracted from the original system and that no data was corrupted along the way (a minimal record-count check is sketched after this list).
Validate that the data files were loaded into the HDFS correctly.
Check that the files are partitioned and replicated across different data nodes.
Determine the most complete set of data that needs to be checked. For step-by-step validation, you can use tools such as Datameer, Talend or Informatica.
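As an illustration, here is a minimal sketch in Python of the first check. It assumes a newline-delimited source extract at a hypothetical local path and a hypothetical HDFS target directory, and simply compares record counts on both sides using the standard hdfs dfs command.

```python
"""Minimal sketch of a step-1 check: verify that the record count of the
source extract matches what was loaded into HDFS. The file paths and the
assumption that records are newline-delimited are illustrative only."""
import subprocess

SOURCE_FILE = "/data/exports/weblogs_2024-01-01.log"   # hypothetical source extract
HDFS_DIR = "/user/etl/weblogs/2024-01-01"              # hypothetical HDFS target

def count_source_records(path):
    # Count newline-delimited records in the local source extract.
    with open(path, "rb") as f:
        return sum(1 for _ in f)

def count_hdfs_records(hdfs_dir):
    # Stream every part file in the HDFS directory and count its lines.
    cat = subprocess.run(
        ["hdfs", "dfs", "-cat", f"{hdfs_dir}/*"],
        capture_output=True, check=True,
    )
    return len(cat.stdout.splitlines())

if __name__ == "__main__":
    src, dst = count_source_records(SOURCE_FILE), count_hdfs_records(HDFS_DIR)
    assert src == dst, f"record count mismatch: source={src}, HDFS={dst}"
    print(f"OK: {src} records loaded without loss")
```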
Execution of Map-Reduce Operations
In this second step, the Map-Reduce jobs that implement the business logic are run against the loaded data. The checks at this stage are:
Check the required business logic on a single node and then across the full set of nodes.
Validate the Map-Reduce process to ensure that key-value pairs are generated correctly.
Check the aggregation and consolidation of data after the "reduce" operation has been performed (a reference-aggregation check is sketched after this list).
Compare the output data with the initial files to make sure that the output file was generated and that its format meets all the requirements.
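To illustrate the aggregation check, the following sketch recomputes the expected "reduce" result locally over a small input sample and compares it with the key-value pairs the job actually emitted. The comma-separated input layout, the tab-separated output format and the "count records per key" logic are assumptions made for the example.

```python
"""Sketch of a step-2 check: recompute the expected "reduce" aggregation
locally over a small input sample and compare it with the key/value pairs
the Map-Reduce job produced. Input/output layouts are assumptions."""
from collections import Counter

def expected_counts(input_lines):
    # Reference implementation of the map + reduce logic: count records per key.
    keys = (line.strip().split(",")[0] for line in input_lines if line.strip())
    return Counter(keys)

def parse_job_output(output_lines):
    # The job is assumed to emit "key<TAB>count" pairs, one per line.
    result = {}
    for line in output_lines:
        if line.strip():
            key, value = line.rstrip("\n").split("\t")
            result[key] = int(value)
    return result

if __name__ == "__main__":
    with open("sample_input.csv") as f:      # hypothetical sample of the input split
        expected = expected_counts(f)
    with open("part-r-00000") as f:          # hypothetical reducer output file
        actual = parse_job_output(f)
    assert actual == dict(expected), "aggregation mismatch between job output and reference"
    print(f"OK: {len(actual)} keys aggregated correctly")
```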
The most appropriate language for verifying the data is Hive. Testers prepare requests in the SQL-like Hive Query Language (HQL) and run them against HBase to verify that the output complies with the requirements. HBase is a NoSQL database that can serve as both the input and the output for Map-Reduce jobs.
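A validation query of this kind might look like the following sketch, which uses the third-party pyhive package to run an HQL aggregate against the output table. The host name, table and column names, and the expected total are all hypothetical.

```python
"""Sketch of an output check with HQL: compare an aggregate computed over the
job's output table against the value expected from the source data. Assumes a
reachable HiveServer2 endpoint; all names and figures are hypothetical."""
from pyhive import hive

EXPECTED_TOTAL = 1_482_907   # hypothetical total derived from the source system

conn = hive.Connection(host="hive.example.internal", port=10000, username="qa")
cursor = conn.cursor()

# Aggregate the reduced output and check it against the expected figure.
cursor.execute(
    "SELECT COUNT(*) AS rows_loaded, SUM(event_count) AS total_events "
    "FROM weblog_aggregates WHERE load_date = '2024-01-01'"
)
rows_loaded, total_events = cursor.fetchone()

assert total_events == EXPECTED_TOTAL, (
    f"aggregate mismatch: expected {EXPECTED_TOTAL}, got {total_events}"
)
print(f"OK: {rows_loaded} rows, {total_events} events match the source")
```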
You can also use other Big Data processing engines as an alternative to Map-Reduce. Frameworks such as Spark and Storm are good substitutes for this programming model, as they provide similar functionality and are compatible with the Hadoop ecosystem.
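For instance, the same "count per key" aggregation used in the verification sketch above could be expressed in Spark roughly as follows; the input path and record layout are again assumptions.

```python
"""Sketch of the same aggregation expressed with Spark instead of Map-Reduce,
so the verification approach stays unchanged. Path and layout are assumptions."""
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("qa-aggregation-check").getOrCreate()

# Re-create the map ("emit key") and reduce ("count per key") steps.
lines = spark.sparkContext.textFile("hdfs:///user/etl/weblogs/2024-01-01/*")
counts = (
    lines.filter(lambda line: line.strip())
         .map(lambda line: (line.split(",")[0], 1))
         .reduceByKey(lambda a, b: a + b)
)

# Collect a small result set and spot-check it against expected values.
for key, count in counts.take(10):
    print(key, count)

spark.stop()
```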
Rolling out the Output Results from HDFS
This final step unloads the data generated by the second step and loads it into the downstream system, which may be a repository used to generate reports or a transactional analysis system for further processing. The checks at this stage are:
Inspect the data aggregation to make sure that the data has been loaded into the required system and was not distorted (a total-reconciliation check is sketched after this list).
Validate that the reports include all the required data and that every indicator refers to the correct measure and is displayed correctly.
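As a rough illustration of the reconciliation check, the sketch below compares a total computed from the exported output file with the same figure read back from the downstream reporting store. A local SQLite database stands in for the real reporting system, and the file, table and column names are assumptions.

```python
"""Sketch of a step-3 check: confirm that the figures shown in a downstream
report match the data exported from HDFS. SQLite stands in for the reporting
system; file, table and column names are assumptions."""
import csv
import sqlite3

# Sum the metric directly from the exported output file.
with open("hdfs_export/part-r-00000.csv") as f:   # hypothetical exported result
    exported_total = sum(int(row["event_count"]) for row in csv.DictReader(f))

# Read the same figure back from the downstream reporting store.
conn = sqlite3.connect("reporting.db")            # stand-in for the real reporting system
(reported_total,) = conn.execute(
    "SELECT SUM(event_count) FROM daily_report WHERE load_date = '2024-01-01'"
).fetchone()

assert exported_total == reported_total, (
    f"report mismatch: export={exported_total}, report={reported_total}"
)
print(f"OK: report total {reported_total} matches the exported data")
```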
Test data for a Big Data project can be obtained in two ways: by copying actual production data or by creating data exclusively for testing purposes, the former being the preferred method for software testers. In that case the conditions are as realistic as possible, which makes it easier to work with a larger number of test scenarios. However, not all companies are willing to provide real data, since they may prefer to keep some information confidential. In that case you must create the test data yourself or request artificial data. The main drawback of this approach is that artificial business scenarios built on limited data inevitably restrict testing; some defects will only be detected by real users.
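When artificial data has to be created, a simple generator along the following lines can produce it. The record layout shown here is purely illustrative and should mirror whatever schema the real pipeline ingests.

```python
"""Sketch of generating artificial test data when production data cannot be
shared. The record layout (timestamp, user id, URL, status code) is illustrative."""
import csv
import random
from datetime import datetime, timedelta

URLS = ["/home", "/search", "/cart", "/checkout"]
START = datetime(2024, 1, 1)

def synthetic_records(n):
    # Yield n fake weblog records with a realistic-looking spread of values.
    for _ in range(n):
        yield {
            "timestamp": (START + timedelta(seconds=random.randrange(86_400))).isoformat(),
            "user_id": f"user_{random.randrange(10_000):05d}",
            "url": random.choice(URLS),
            "status": random.choices([200, 404, 500], weights=[95, 4, 1])[0],
        }

if __name__ == "__main__":
    with open("synthetic_weblogs.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["timestamp", "user_id", "url", "status"])
        writer.writeheader()
        writer.writerows(synthetic_records(100_000))
    print("wrote 100,000 synthetic records")
```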
Big Data creates a new layer in the economy, one that is all about turning information, or data, into revenue. This will accelerate growth in the global economy and create jobs. In 2013, Big Data is forecast to drive $34 billion of IT spending (Gartner).
Data science is about creating processes that open up new ways of thinking about novel problems, or about using existing data creatively and pragmatically.
Businesses are struggling to grapple with the phenomenal information explosion. Conventional
database systems and business intelligence applications have given way to horizontal databases,
columnar designs and cloud-enabled schemas powered by sharding techniques.
The role of QA is particularly challenging in this context, as the discipline is still at a nascent stage. Testing Big Data applications requires a specific mindset, skill set, a deep understanding of the technologies and pragmatic approaches to data science. From a tester's perspective, it is fundamentally important to understand how Big Data evolved, what it is meant for, and why Big Data applications need to be tested.
The following are some of the needs and challenges that make it imperative for Big Data
applications to be tested thoroughly.
An in-depth understanding of the four Vs of Big Data is key to successful Big Data testing.
Data Integration - Drawing large and disparate data sets together in real time.
Current data integration platforms, which were built for an older generation of data challenges, limit IT's ability to support the business. In order to keep up, organizations are beginning to look at next-generation data integration techniques and platforms.
The ability to understand, analyze and create test sets that encompass multiple data sets is vital to ensuring comprehensive Big Data testing.
Testing Data Intensive Applications and Business Intelligence Solutions
Cigniti leverages its experience of having tested large scale data warehousing and business
intelligence applications to offer a host of Big Data Testing services and solutions.
To learn more about how Cigniti can help you take advantage of large data sets through comprehensive testing of your Big Data application, write to [email protected]
Big data has purpose, little data has hope. While current trends suggest that Big Data driven business is an avenue that requires substantial investment, the future will see growth of Big Data applications among ISVs and the small and medium enterprise segment as well. Moreover, as business grows, enterprises need to accommodate and manage the increasing volume, variety and velocity of the data that flows into their IT systems.
The conventional columnar designs and horizontal databases demand continuous expansion to store and retrieve this data. The sheer volume in itself weighs on cloud-enabled schemas and sharding techniques, forcing enterprises to look for new ways to accept, model and discard
data. Findings of an MIT research project by Andrew McAfee and Erik Brynjolfsson indicate
that companies which inject big data and analytics into their operations show productivity rates
and profitability that are 5 to 6 percent higher than those of their peers.
The possibility of unknown scenarios in Big Data testing is vast compared with testing techniques for conventional applications. The scope and range of the data harnessed in Big Data applications will demand new benchmarks for software quality assurance.
To accommodate Big Data test requirements, processes and infrastructure will have to be redesigned to achieve new levels of scalability, reusability and compatibility, and to ensure comprehensive, continuous and context-driven test capabilities. To handle the volume and ensure live data integration, Big Data testing needs to give developers and enterprises the freedom to experiment and innovate.
One data layer
From a Big Data perspective, enterprises will seek validation of application design, data security,
source verification and compliance with industry standards. The parameters of performance,
speed, security and load will add magnitude and precision to sculpt and reorganize data volumes
into blocks that match the emerging requirements.
Over time, the database and storage layers will merge into a single data layer with options of
retrieval and transmission exported out of the layer.
Business leaders now look at data maps to estimate and draft plans for emerging scenarios. The
transformation of data into comprehensive reports in real time will add value to business
decisions and enrich operations with higher levels of speed and accuracy. Test capabilities will need the ability to untangle data sources, types and structures and channel them along specified contexts to align with business objectives.
In a story titled "The Top 7 Things Obama Taught Us About the Future of Business," Forbes reported that the Obama campaign used a testing tool called Optimizely to improve efficiency. Dan Siroker, co-founder of Optimizely, was quoted as saying, "We ran over 240 A/B tests to try different messaging, calls to action, and in an attempt to raise more money. Because of our efforts, we increased effectiveness 49 percent."
Why is Big Data a good opportunity for software testers?
Consider this. A joint report by NASSCOM and CRISIL Global Research & Analytics suggests
that by 2015, Big Data is expected to become a USD 25 billion industry, growing at a CAGR of
45 per cent. Managing data growth is the number two priority for IT organizations over the next
12-18 months. In order to sustain growth, enterprises will adopt next generation data integration
platforms and techniques fueling the demand for Quality Assurance mechanisms around the new
data perspectives.
Be a smart tester and ride the next wave of IT on Big Data
Testers can formulate service models through operational exposure to data acquisition techniques on Hadoop and related platforms. Test approaches can be developed by studying the deployment strategies of Mahout, Java, Python, Pig, Hive and similar technologies. Contextualizing data from diverse sources into streamlined outputs helps testers understand the channels of business logic in data science.
Big Data is an emerging discipline that will leave a profound impact on the global economy. Testers who explore the power of Big Data testing will find themselves at a hotspot of innovation, well placed to meet emerging test requirements.