Terminologies Used in Big Data Environments
As-a-service infrastructure
Data-as-a-service, software-as-a-service, platform-as-a-service – all refer to the idea that rather than
selling data, licences to use data, or platforms for running Big Data technology as one-off products, these
can be provided “as a service”. This reduces the upfront capital investment customers must make before
putting their data, or platforms, to work for them, since the provider bears the costs of setting up and
hosting the infrastructure. For customers, as-a-service infrastructure can greatly reduce the initial cost
and setup time of getting Big Data initiatives up and running.
Data science
Data science is the professional field that deals with turning data into value such as new insights or
predictive models. It brings together expertise from fields including statistics, mathematics, computer
science and communication, as well as domain expertise such as business knowledge. Data scientist has
recently been voted the No. 1 job in the U.S., based on current demand, salary and career
opportunities.
Data mining
Data mining is the process of discovering insights from data. With Big Data, because the datasets
involved are so large, this is generally done computationally and in an automated way, using techniques
such as decision trees, clustering analysis and, most recently, machine learning. It can be thought of as
using the brute mathematical power of computers to spot patterns in data that would not be visible to the
human eye due to the complexity of the dataset.
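As a rough illustration of the clustering analysis mentioned above, the sketch below uses the scikit-learn library to group a handful of invented customer records into two clusters automatically; the feature names and figures are hypothetical, not taken from the text.

```python
# A minimal clustering sketch with scikit-learn; the customer data is invented.
from sklearn.cluster import KMeans
import numpy as np

# Hypothetical data: [annual spend, visits per month] for six customers
customers = np.array([
    [200,  2], [220,  3], [250,  2],   # low-spend, infrequent visitors
    [900, 12], [950, 14], [880, 11],   # high-spend, frequent visitors
])

# Ask the algorithm to find two groups in the data, with no labels supplied
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(customers)

print(labels)                  # e.g. [0 0 0 1 1 1] -- two groups found automatically
print(model.cluster_centers_)  # the "typical" customer in each group
```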
Hadoop
Hadoop is a framework for Big Data computing which has been released as open source software, and so
can be used freely by anyone. It consists of a number of modules, each tailored to a different vital step of
the Big Data process – from file storage (the Hadoop Distributed File System – HDFS) to databases
(HBase) to carrying out data operations (Hadoop MapReduce – see below). It has become so popular,
thanks to its power and flexibility, that it has developed its own industry of vendors (selling tailored
distributions), support service providers and consultants.
Predictive modeling
At its simplest, this is predicting what will happen next based on data about what has happened
previously. In the Big Data age, because there is more data around than ever before, predictions are
becoming more and more accurate. Predictive modelling is a core component of most Big Data initiatives,
which are formulated to help us choose the course of action which will lead to the most desirable
outcome. The speed of modern computers and the volume of data available mean that predictions can be
based on a huge and ever-increasing number of variables, each assessed for the probability that it will
lead to success.
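A minimal sketch of the idea, using scikit-learn's logistic regression: the model learns from data about what has happened previously, then estimates the probability of an outcome for a new case. The features, figures and outcome here are purely illustrative.

```python
# A toy predictive-modelling sketch; the historical data is invented.
from sklearn.linear_model import LogisticRegression

# Hypothetical history: [customer age, visits last month] and whether
# each customer renewed their subscription (1 = yes, 0 = no)
X_history = [[25, 1], [34, 6], [45, 2], [52, 8], [23, 0], [40, 7]]
y_history = [0, 1, 0, 1, 0, 1]

model = LogisticRegression()
model.fit(X_history, y_history)           # learn from what happened previously

new_customer = [[30, 5]]
print(model.predict(new_customer))        # predicted outcome, e.g. [1]
print(model.predict_proba(new_customer))  # probability of each outcome
```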
MapReduce
MapReduce is a computing procedure for working with large datasets, devised because of the difficulty of
reading and analysing really Big Data using conventional computing methodologies. As its name
suggests, it consists of two procedures – mapping (sorting information into the format needed for
analysis – e.g. sorting a list of people according to their age) and reducing (performing an operation on
that information, such as checking the age of everyone in the dataset to see who is over 21).
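The toy sketch below mimics those two steps in plain Python on a single machine, mirroring the age example above; a real MapReduce job would distribute the same pattern across a cluster, and the names and ages here are invented.

```python
# A single-machine illustration of the map and reduce steps (not a
# distributed implementation).
from functools import reduce

people = [("Ana", 34), ("Ben", 19), ("Cara", 27), ("Dev", 17)]

# Map: emit a (key, value) pair for each record -- here, 1 if over 21, else 0
mapped = list(map(lambda person: ("over_21", 1 if person[1] > 21 else 0), people))

# Reduce: combine all the values into one result -- the count of people over 21
over_21_count = reduce(lambda total, pair: total + pair[1], mapped, 0)

print(over_21_count)  # 2
```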
NoSQL
NoSQL refers to database designs that can hold more than just data arranged neatly into tables, rows
and columns, as is the case in a conventional relational database. This database format has proven
very popular in Big Data applications because Big Data is often messy, unstructured and does not easily
fit into traditional database frameworks.
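A small sketch of what such document-style records can look like: each record below has a different shape, which a fixed table of rows and columns cannot easily accommodate. The records are invented, and a document store such as MongoDB or CouchDB would accept each one as-is.

```python
# Invented document-style (NoSQL) records expressed as JSON-like dictionaries.
import json

records = [
    {"name": "Ana", "email": "ana@example.com"},
    {"name": "Ben", "purchases": [{"item": "laptop", "price": 899.0}],
     "support_tickets": 2},
    {"name": "Cara", "social": {"twitter": "@cara"}, "notes": "prefers phone contact"},
]

# No requirement that every record share the same columns
print(json.dumps(records[1], indent=2))
```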
Python
Python is a programming language which has become very popular in the Big Data space due to its ability
to work very well with large, unstructured datasets (see Part II for the difference between structured and
unstructured data). It is considered to be easier to learn for a data science beginner than other languages
such as R (see also Part II) and more flexible.
R Programming
R is another programming language commonly used in Big Data, and can be thought of as more
specialised than Python, being geared towards statistics. Its strength lies in its powerful handling of
structured data. Like Python, it has an active community of users who are constantly expanding and
adding to its capabilities by creating new libraries and extensions.
Recommendation engine
A recommendation engine is a system which analyses data about what a user has done or bought
previously (and what similar users have done) in order to suggest products, services or content they are
likely to want next. Familiar examples include the suggestions made by online retailers and streaming
services.
Real-time
Real-time means “as it happens” and in Big Data refers to a system or process which is able to give data-
driven insights based on what is happening at the present moment. Recent years have seen a large push
for the development of systems capable of processing and offering insights in real-time (or near-real-
time), and advances in computing power as well as development of techniques such as machine learning
have made it a reality in many applications today.
Reporting
The crucial “last step” of many Big Data initiatives involves getting the right information to the people
who need it to make decisions, at the right time. When this step is automated, analytics is applied to the
insights themselves to ensure that they are communicated in a way that they will be understood and easy
to act on. This will usually involve creating multiple reports based on the same data or insights but each
intended for a different audience (for example, in-depth technical analysis for engineers, and an overview
of the impact on the bottom line for c-level executives).
Spark
Spark is another open source framework like Hadoop, but more recently developed and better suited to
handling cutting-edge Big Data tasks involving real-time analytics and machine learning. Unlike Hadoop
it does not include its own filesystem, though it is designed to work with Hadoop’s HDFS or a number of
other options. However, for certain data-related processes it is able to calculate at over 100 times the
speed of Hadoop MapReduce, thanks to its in-memory processing capability. This means it is becoming an
increasingly popular choice for projects involving deep learning, neural networks and other compute-
intensive tasks.
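A minimal sketch of what a Spark job can look like from Python, assuming a local Spark installation and the pyspark package; the CSV file and its column names are invented for the example. It reads the data into a distributed, in-memory DataFrame and runs a simple aggregation.

```python
# A small PySpark sketch; "sensor_readings.csv" is a hypothetical file
# with columns machine_id and temperature.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.read.csv("sensor_readings.csv", header=True, inferSchema=True)

# Because the data is held in memory across the cluster, repeated
# operations like this avoid re-reading from disk each time.
summary = df.groupBy("machine_id").agg(F.avg("temperature").alias("avg_temp"))
summary.show()

spark.stop()
```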
Structured Data
Structured data is simply data that can be arranged neatly into charts and tables consisting of rows,
columns or multi-dimensional matrices. This is traditionally the way that computers have stored data,
and information in this format can easily and simply be processed and mined for insights. Data gathered
from machines is often a good example of structured data: various data points – speed, temperature,
rate of failure, RPM etc. – can be neatly recorded and tabulated for analysis.
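As a small illustration, the sketch below tabulates a few invented machine readings with pandas and summarises them directly – exactly the kind of neat, columnar analysis that structured data makes easy.

```python
# A small pandas sketch of structured machine data; the readings are invented.
import pandas as pd

readings = pd.DataFrame({
    "machine_id":  ["A1", "A1", "B2", "B2"],
    "speed_rpm":   [1500, 1520, 980, 1010],
    "temperature": [71.2, 73.5, 65.0, 66.4],
})

# Neatly tabulated data can be summarised in one line
print(readings.groupby("machine_id")["temperature"].mean())
```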
Unstructured Data
Unstructured data is any data which cannot easily be put into conventional charts and tables. This can
include video data, pictures, recorded sounds, text written in human languages and a great deal more. This
data has traditionally been far harder to draw insight from using computers which were generally
designed to read and analyze structured information. However, since it has become apparent that a huge
amount of value can be locked away in this unstructured data, great efforts have been made to create
applications which are capable of understanding unstructured data – for example visual recognition and
natural language processing.
Visualization
Humans find it very hard to understand and draw insights from large amounts of text or numerical data –
we can do it, but it takes time, and our concentration and attention are limited. For this reason, effort has
been made to develop computer applications capable of rendering information in a visual form – charts
and graphics which highlight the most important insights which have resulted from our Big Data projects.
A subfield of reporting (see above), visualization is now often an automated process, with visualizations
customized by algorithm to be understandable to the people who need to act or take decisions based on
them.
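As a simple illustration, the sketch below uses the matplotlib library to turn a short series of invented monthly failure rates into a bar chart, which is far quicker to absorb than the raw numbers.

```python
# A minimal visualization sketch; the monthly figures are invented.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
failure_rate = [0.042, 0.038, 0.051, 0.029]

plt.bar(months, failure_rate)
plt.title("Machine failure rate by month")
plt.ylabel("Failure rate")
plt.savefig("failure_rate.png")  # or plt.show() for interactive use
```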