
Data Wrangling: The Challenging Journey from the Wild to the Lake
Ignacio Terrizzano, Peter Schwarz, Mary Roth, John E. Colino
IBM Research
650 Harry Rd
San Jose, CA 95120
{igterriz, pschwarz, torkroth, jcolino}@us.ibm.com

ABSTRACT
Much has been written about the explosion of data, also known as the "data deluge". Similarly, much of today's research and decision making are based on the de facto acceptance that knowledge and insight can be gained from analyzing and contextualizing the vast (and growing) amount of "open" or "raw" data. The concept that the large number of data sources available today facilitates analyses on combinations of heterogeneous information that would not be achievable via "siloed" data maintained in warehouses is very powerful. The term data lake has been coined to convey the concept of a centralized repository containing virtually inexhaustible amounts of raw (or minimally curated) data that is readily made available anytime to anyone authorized to perform analytical activities. The often unstated premise of a data lake is that it relieves users from dealing with data acquisition and maintenance issues, and guarantees fast access to local, accurate and updated data without incurring the development costs (in terms of time and money) typically associated with structured data warehouses. However appealing this premise, practically speaking, it is our experience, and that of our customers, that "raw" data is logistically difficult to obtain, quite challenging to interpret and describe, and tedious to maintain. Furthermore, these challenges multiply as the number of sources grows, thus increasing the need to thoroughly describe and curate the data in order to make it consumable. In this paper, we present and describe some of the challenges inherent in creating, filling, maintaining, and governing a data lake, a set of processes that collectively define the actions of data wrangling, and we propose that what is really needed is a curated data lake, whose contents have undergone a curation process that enables their use and delivers the promise of ad-hoc data accessibility to users beyond the enterprise IT staff.

Categories and Subject Descriptors
H.1.1 [Systems and Information] Value of Information.
H.3.2 [Information Storage and Retrieval] Information Storage.

General Terms
Management, Documentation, Design, Legal Aspects.

Keywords
Data lake, data wrangling, data curation, data integration, metadata, schema mapping, analytics sandboxes

This article is published under a Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits distribution and reproduction in any medium as well as allowing derivative works, provided that you attribute the original work to the author(s) and CIDR 2015.
7th Biennial Conference on Innovative Data Systems Research (CIDR '15), January 4-7, 2015, Asilomar, California, USA.

1. INTRODUCTION
We have all been inundated with facts and statistics about the data deluge that surrounds us from consumer-generated and freely available social media data, from the vast corpus of open data, and from the growing body of sensor data as we enter the era of the Internet of Things [34]. Along with the bombardment of statistics about this data deluge, there appears to be a de facto acceptance that there is critical new business value or scientific insight that can be gained from analyzing the zettabytes of data now at our fingertips, if only enterprise data can be freed from its silos and easily mixed with external "raw" data for self-serve, ad-hoc analysis by an audience broader than the enterprise IT staff.

Financial institutions, for example, now speak of offering personalized services, such as determining if a client is exposed to legal risks due to the contents of his or her portfolio. Such analysis requires access to internal data, external news reports and market data about the companies that make up the portfolio, as well as publicly available regulatory information. As another example, a Fortune 1000 information processing company that provides outsourcing services to manage their clients' data processing systems would also like to offer them analytic sandboxes and customized access to demographic data and economic data by geography, all of which is available from sources like the U.S. Census Bureau and the U.S. Bureau of Labor Statistics. As yet another example, IBM Research itself has recognized that gathering a large body of contextual data and making it readily accessible to its research scientists is strategically important for innovation [12].

IBM estimates that a staggering 70% of the time spent on analytic projects is concerned with identifying, cleansing, and integrating data, due to the difficulties of locating data that is scattered among many business applications, the need to reengineer and reformat it in order to make it easier to consume, and the need to regularly refresh it to keep it up-to-date [5]. This cost, along with recent trends in the growth and availability of data, has led to the concept of a capacious repository for raw data called a data lake. According to a recent definition, and as shown in Figure 1, a data lake is a set of centralized repositories containing vast amounts of raw data (either structured or unstructured), described by metadata, organized into identifiable data sets, and available on demand [5]. Data in the lake supports discovery, analytics, and reporting, usually by deploying cluster tools like Hadoop. Unlike traditional warehouses, the format of the data is not described (that is, its schema is not available) until the data is needed. By delaying the categorization of data from the point of entry to the point of use [10], analytical operations that transcend the rigid format of an adopted schema become possible. Query and search operations on the data can be performed using traditional database technologies (when structured), as well as via alternate means such as indexing and NoSQL derivatives.
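To make the schema-on-read idea concrete, the following minimal Python/pandas sketch (our own illustration; the file layout, series identifier, and values are assumptions, not drawn from any particular provider) shows a consumer imposing column names and types only at the moment of use, rather than at ingestion time.

```python
import io

import pandas as pd

# Simulated contents of a file landed in the lake untouched; in practice this
# would be a raw bulk download whose layout is an illustrative assumption here.
raw_file = io.StringIO(
    "series_id\tyear\tperiod\tvalue\n"
    "APU000074714\t2014\tM06\t3.701\n"
    "APU000074714\t2014\tM07\t3.631\n"
)

# Schema-on-read: the consumer, not the ingestion pipeline, decides how the raw
# text is typed and interpreted at the moment it is queried.
schema = {"series_id": str, "year": int, "period": str, "value": float}
df = pd.read_csv(raw_file, sep="\t", dtype=schema)

# A different consumer could re-read the same raw file under a different schema
# (e.g., keeping 'value' as text) without any re-ingestion step.
print(df.dtypes)
```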
Given the challenges present when working with vast amounts of raw data, particularly upon first use, we propose that what is needed to provide self-service, agile access to data is a curated data lake.

In this paper, we present a number of challenges inherent in creating, filling, maintaining, and governing a curated data lake, a set of processes that collectively define the actions of data wrangling (see Figure 2). These are challenges not only reported by our customers, but are also challenges that we face ourselves in creating a work-in-progress data lake to be used both internally for IBM research staff as well as in client engagements.

Figure 1: Data Lake Logical Architecture

While this definition of a data lake is not difficult to understand, it originates from an essential premise: the data in the lake is readily available and readily consumable by users who have less technical skill than traditional IT staff. This implies that the data lake somehow relieves such users from the well-defined but tedious and technical tasks that are required to prepare data associated with a traditional data integration platform and data warehouse architecture, such as defining standard data models, extracting and transforming data to a common data model, performing cleansing, validation, and error handling, and documenting the process [5].
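As a rough illustration of what one such preparation step looks like in practice, here is a minimal sketch (the column names, rules, and values are our own assumptions) of transforming raw input to a common model, cleansing and validating it, and documenting what was done, the kind of work a raw data lake defers to each individual consumer.

```python
import pandas as pd

def to_common_model(raw: pd.DataFrame, audit_log: list) -> pd.DataFrame:
    """One illustrative preparation step: map to standard column names,
    cleanse non-numeric values, validate, and record what was done."""
    df = raw.rename(columns={"PER": "period", "VAL": "value"})
    df["value"] = pd.to_numeric(df["value"], errors="coerce")  # cleanse bad numerics
    bad = df["value"].isna()
    audit_log.append(f"dropped {int(bad.sum())} row(s) that failed numeric validation")
    return df[~bad]

audit_log: list = []
raw = pd.DataFrame({"PER": ["M01", "M02", "M03"], "VAL": ["1.92", "2.04", "n/a"]})
clean = to_common_model(raw, audit_log)
print(clean)
print(audit_log)
```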

This premise, however, is in stark contrast with a repository of raw data. As Gartner recently noted, there exist pitfalls in creating and using an enterprise-level data lake [28]. Gartner portrays a data lake as a "catch all" repository and, as such, cites problems inherent from data quality, provenance, and governance, all of which have been historically associated with traditional data warehouses. We argue here that a "raw" data lake does not enhance the agility and accessibility of data, since much of the necessary data massaging is simply postponed, potentially to a time far removed from the moment that the data was acquired. And, in addition, we believe a data lake introduces legal ramifications, from adherence to licensing terms to determination of liability and ownership of derived data. A raw data lake places the burden of such tasks squarely on the data consumer. The steps associated with a traditional data integration platform exist for a reason; data in its raw format is rarely immediately consumable for use in a specific application. For example, economic data from the U.S. Bureau of Labor Statistics [4] (BLS) represents geographic regions using its own set of codes, without which the statistical data is difficult to interpret and use in a meaningful way.

The term data curation is increasingly being used to describe the actions necessary to maintain and utilize digital data during its useful life-cycle for current and future interested users.

Digital curation involves selection and appraisal by creators and archivists; evolving provision of intellectual access; redundant storage; data transformations; and, for some materials, a commitment to long-term preservation. Digital curation is stewardship that provides for the reproducibility and re-use of authentic digital data and other digital assets. Development of trustworthy and durable digital repositories; principles of sound metadata creation and capture; use of open standards for file formats and data encoding; and the promotion of information management literacy are all essential to the longevity of digital resources and the success of curation efforts. [1]

Figure 2: Data Wrangling Process Overview

We begin by describing in Section 2 the motivation for the creation of a data lake at IBM Research, as well as a high-level architecture. This allows us to draw on our experiences in order to contextualize the challenges presented starting in Section 3. Section 3 describes concerns around data procurement, focusing on data selection, obtainment, and description, paying particular attention to issues around licensing and governance. We then continue in Section 4 by describing the difficulties in readying data for use, a process called data grooming which encompasses data massaging and normalization. Section 5 details the concerns around usage of data in the lake, and the challenges around ensuring compliance by authorized users. Section 6 provides a brief description of data preservation, a key concern of data maintenance. Finally, we summarize and conclude in Sections 7 and 8 by introducing related work and a description of future work.

2. THE IBM RESEARCH DATA LAKE
The IBM Research Accelerated Discovery Lab [12] was started in late 2012 to support multiple, independent analytic projects that may involve participants from several institutions. It is currently supporting over a dozen projects from several domains, including, for example, a project to use literature-based discovery over medical journals and patent databases to support cancer research, and a project that tracks the cost of water in different geographies by analyzing public utility reports and news articles to define a global water index.

A key service provided by the lab is a lake of contextual data that can be used across different research projects. For example, data
provided by the government agencies we have been using as
examples in this paper can supply location-specific demographic
data, economic data grouped by location and industry, climate data,
SEC filings and the like, all of which are useful in many contexts.
Important contextual information available from other sources
includes worldwide patent data, medical journals, and many kinds
of geo-spatial data.

Figure 3 shows a high-level overview of our data lake architecture. The lake is intended to support over 500 researchers across multiple research labs. As shown in the figure, IBM researchers develop applications that run in both internal and external cloud environments and require access to data stored in the data lake. Because a firewall between the cloud environments prevents processes on the external cloud from accessing the internal cloud, we have chosen a master/slave architecture for the data lake storage, with a pipeline to transfer data on an as-needed basis from the master lake located in the internal cloud environment to the slave lake located in the external cloud.

As will be described in Section 3.1, compliance with data licensing terms and other controls on data usage is a critical and nontrivial exercise, and failure to comply can introduce significant liability for an enterprise. To assist in this task, we have developed a governance tool that tracks requests for acquisition of new data for the lake and access requests for data already present in the lake. The tool collects input from all stakeholders in the governance process, recording them in a secure system-of-record, along with all relevant licenses, wrangling guidelines, usage guidelines and data-user agreements. The tool runs in the internal cloud, and is accessed via a proxy from the external cloud.

Figure 3: IBM's Accelerated Discovery Lab Data Lake High-Level Architecture

Figure 4 shows an overview of a dashboard displaying the current status of the lake. The dashboard shows a point-in-time status of data as it moves through the various steps described below. A data set represents a collection of logically related data objects (such as files or tables) that correspond to a single topic. For example, an average price data set [2] available from the Bureau of Labor Statistics includes a set of tables for gasoline prices, food prices and household energy prices. At the time this snapshot was taken, the dashboard showed that 75 data sets had been considered for inclusion in the data lake, in categories such as biomedical, social and economic data; 59 data sets were still in various steps of the process described below, and 16 had completed the process and were available for use in the lake.

Figure 4: Data Lake Dashboard
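The governance tool and dashboard described above suggest a simple system-of-record entry per data set; the sketch below is our own illustration of what such an entry might contain (the field names, statuses, and example data sets are assumptions, not the actual schema of the IBM tool).

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class DataSetRecord:
    """Hypothetical system-of-record entry for one data set in the lake."""
    name: str
    category: str                     # e.g. "economic", "biomedical", "social"
    status: str                       # "considered" | "in process" | "available"
    license_terms: str = ""           # pointer to the governing license text
    usage_guidelines: str = ""        # guidelines a user must accept before access
    access_expires: Optional[date] = None  # re-approval is needed after this date

records = [
    DataSetRecord("Average prices", "economic", "available"),
    DataSetRecord("Patent abstracts", "biomedical", "in process"),
    DataSetRecord("Public utility reports", "economic", "considered"),
]

# The dashboard view is essentially a point-in-time tally over such records.
print(Counter(record.status for record in records))
```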


3. PROCURING DATA
Data procurement is the first step performed by data wranglers; it describes the process of obtaining data and metadata and preparing them for eventual inclusion in a data lake. The potential to achieve new insights from Big Data depends in part on the ability to combine data from different domains in novel ways. In each domain, however, a plethora of data is often available, frequently from multiple sources. Given a particular domain, the data wrangler's first task is likely to be the identification of the specific sources and data sets that will be of most value to the enterprise. This can be quite challenging.

For example, basic information about the U.S. economy is available from the Bureau of Labor Statistics [4], the Bureau of Economic Analysis [3] and probably from several other sources as well. Even a single source may offer a wide variety of similar information. The National Climatic Data Center [24], for example, offers climate data as recorded by land stations, weather balloons, satellites, paleo-climatological readings, and several other options. Data may be provided at different levels of granularity (in time or space), for different time periods or locations, and in different formats. The more a data wrangler is aware of how the data will ultimately be consumed, the better he or she will be able to make good choices about which data to wrangle, but the ad-hoc nature of Big Data analysis means the wrangler must anticipate these needs, rather than react to them. Another consideration in selecting a data source centers on whether the provider supplies data in bulk, or only a few items at a time, such as in response to a narrow query. The latter type of source may provide very valuable information, but performance may preclude its use in Big Data analytics. Furthermore, patterns in the queries issued by an enterprise may reveal to the supplier information the enterprise would prefer to keep in confidence.

Beyond the utility of a data set, the wrangler must also consider the terms under which it is made available and the mechanisms needed to obtain it. These are the topics of the following two sections.

3.1 Vetting Data for Licensing and Legal Use
Once the data to be obtained has been identified and selected, the next step is to determine the terms and conditions under which it may be licensed. Often, license terms are available on a web page, but locating the license that applies to a specific data set is not always easy, and once terms are located, the typical data scientist is not qualified to understand them. To understand a license, the reader must be able to discern:

• What data is being licensed, and how or where is it being made available?
• Can the data be obtained at no cost, or is there a charge associated with access? If there is a charge, how is it applied (e.g. one-time, periodic, per data item accessed, etc.)?
• What kinds of use are permitted/prohibited by the license?
• What risks are incurred by the enterprise in accepting the license?

The latter two questions are closely related, and often difficult to answer. Restrictions on the use of data abound. For example, the Terms of Use for the LinkedIn Self-Service API [22] include the following clause:

You … cannot use our self-service program if your Application targets current or potential paying customers of LinkedIn products or people engaging in activities related to those products—in other words, Applications used for hiring, marketing, or selling.

By accepting such an agreement, an employee inevitably exposes the enterprise to a certain level of risk. Many of the terms therein (e.g. "potential paying customers of LinkedIn") are not precisely defined, and while an employee may believe that their intended use of the data does not violate this license, there is always a chance that a lawsuit may be filed and a court may disagree. Furthermore, ungoverned redistribution of the data, even within the enterprise, greatly increases the likelihood that some users of the data may violate the license terms.

Other licenses constrain data use in other ways. Some licenses limit or prohibit retention of data. Many licenses require data consumers to cite the source of any data displayed by an application.

Other sources of risk arise even when the usage of the data adheres closely to the license terms. Data from third parties may contain errors, and licenses typically include a disclaimer that limits the provider's liability for damages caused by such errors. If the data is subsequently used by the enterprise in a manner that affects their customers or clients, any liability for errors must either be passed on to the customer or accepted by the enterprise as a risk. Similarly, licenses may contain terms that indemnify the provider against damages due to the accidental disclosure of Sensitive Personal Information (SPI). For certain kinds of data, notably health care data, accidental disclosure can result in very large fines.

Still another source of risk that may be incurred when an enterprise uses third-party data centers on issues of copyright. Many suppliers of data distribute or redistribute material subject to copyright and control the subsequent use of that data, either through license terms that prohibit further redistribution or by constraining the licensee to redistribute the data subject to specific terms. For example, the Creative Commons Attribution – ShareAlike 3.0 License [35], under which Wikipedia is distributed, provides free access to Wikipedia content, but specifies that material obtained under this license may only be redistributed under the same (no-cost) license. Such restrictions may be incompatible with an enterprise's business model. Furthermore, the same restriction applies to adaptations of the original data, or derived works. The Wikipedia terms state:

If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.

Risk arises because it is often unclear whether a particular use of data does or does not constitute a derived work. For example, consider a process that uses text annotators to analyze copyrighted data licensed under such terms, and builds a knowledge graph to represent the extracted information. Is the knowledge graph a "derived work" that must be distributed free of charge? Ultimately, the answer to such questions may have to come from a court, and different jurisdictions may answer the question in different ways.

Certain special classes of data introduce additional risks. If data that contains, or might contain, Sensitive Personal Information (SPI) is placed in the data lake, controls must be in place to ensure that it is only used for legal and authorized purposes, whether the data is internal to the enterprise or acquired from third parties.
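One way to keep the answers to the license-vetting questions of Section 3.1 in a form the governance process can act on is to record them as structured metadata alongside the data set. The sketch below is purely illustrative (the field names, the allows check, and the example entry are our own assumptions, not a prescribed format), and any ambiguous case still goes to legal review.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LicenseAssessment:
    """Illustrative record of the license-vetting questions from Section 3.1."""
    source: str
    what_is_licensed: str              # what data, and where/how it is made available
    cost_model: str                    # "free", "one-time", "periodic", "per item", ...
    permitted_uses: List[str] = field(default_factory=list)
    prohibited_uses: List[str] = field(default_factory=list)
    risks: List[str] = field(default_factory=list)  # e.g. vague terms, SPI exposure

    def allows(self, intended_use: str) -> bool:
        """First-pass check only; ambiguous terms still require legal review."""
        return (intended_use in self.permitted_uses
                and intended_use not in self.prohibited_uses)

wikipedia = LicenseAssessment(
    source="Wikipedia (CC BY-SA 3.0)",
    what_is_licensed="article text obtained via bulk dumps",
    cost_model="free",
    permitted_uses=["research", "redistribution under the same license"],
    prohibited_uses=["redistribution under incompatible terms"],
    risks=["boundary of 'derived work' is unclear (e.g. extracted knowledge graphs)"],
)
print(wikipedia.allows("research"))
```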
In a global enterprise, movement of data across international boundaries introduces yet more complexity. Export controls may prohibit transmission of certain kinds of sensitive data, privacy laws vary from country to country, and data may be licensed under different terms in different places. For example, the SNOMED medical terminology system can be licensed free of charge in countries that are members of the International Health Terminology Standards Development Organisation [19] (IHTSDO), but requires a fee to be paid in other countries.

Lastly, data providers often make a distinction between research or personal use and commercial use of the data they distribute. Even many so-called "open" data sites allow their data to be used freely for research, but require a special license to be negotiated for other uses. For example, the City of Boston [6] restricts the use of their open data by businesses as follows:

User may use the City's Data in the form provided by the City for User's own internal business or organizational purposes and for no other purpose.

Similarly, Yelp's [36] terms of service contain an outright prohibition on commercial use of the data in their RSS feed. As in other cases mentioned above, the lines between permitted and prohibited uses may be unclear and subject to interpretation.

What is needed to manage the various risks associated with third-party data and prevent the data lake from becoming a data swamp is a data governance process that brings together the many stakeholders that are affected by the decision to use such data: domain experts that can determine the data's potential value, legal advisors to digest and interpret the license terms and identify other risks, and management representatives empowered to weigh the risks and benefits and come to a decision. Assuming the benefits outweigh the risks, the end result of this process is a set of guidelines that delineate how employees are permitted to obtain and use third-party data from a particular source, expressed in clear terms that a data scientist can understand and abide by.

We distinguish two sets of guidelines that are typically needed. The first set, the wrangling guidelines, advises the team that will obtain the data from its source about rules they must follow to comply with the license. For example, wrangling guidelines may include technical restrictions on how a provider's web site may be accessed (e.g. "only from a specified IP address, allowing at least 2 seconds between download requests"), as illustrated in the sketch below. The wranglers may also be asked to look for and exclude certain material, such as copyrighted images, that falls outside the scope of the license, and must be prepared to remove any material if ordered to do so.

The second set of guidelines, usage guidelines, must be tailored to the specific use case(s) contemplated by the enterprise, and spell out, in context, how employees may use the data while complying with the supplier's license. Any employee wishing to obtain the data from the lake must agree to these guidelines. In most cases, permission to use the data will be granted only for a limited time, after which re-approval will be needed. Similar usage guidelines are required for data internal to the enterprise that has been contributed to the lake. In either case, controls must be in place to ensure that the data is only used for appropriate purposes.
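The sketch below (the host, file names, and excluded suffixes are hypothetical; real wrangling guidelines come from the governance process, not from this code) shows how two such wrangling guidelines might be encoded directly in an acquisition script: a minimum delay between requests, and the exclusion of material that falls outside the license.

```python
import time
import urllib.request

# All values below are illustrative assumptions, not real guidelines or URLs.
BASE_URL = "https://example.gov/opendata/"      # hypothetical provider
FILES = ["prices_2013.csv", "prices_2014.csv", "brochure_photo.jpg"]
MIN_DELAY_SECONDS = 2.0                          # "at least 2 seconds between requests"
EXCLUDED_SUFFIXES = (".jpg", ".png")             # e.g. copyrighted images out of scope

def wrangle(files):
    last_request = float("-inf")
    for name in files:
        if name.endswith(EXCLUDED_SUFFIXES):
            continue                             # material outside the license scope
        wait = MIN_DELAY_SECONDS - (time.monotonic() - last_request)
        if wait > 0:
            time.sleep(wait)                     # honor the provider's rate guideline
        last_request = time.monotonic()
        urllib.request.urlretrieve(BASE_URL + name, filename=name)

# wrangle(FILES)  # not executed here: the endpoint above is hypothetical
```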

Figure 5: Sample Data Governance Process


Figure 5 depicts a high-level overview of the governance process adopted at the IBM Accelerated Discovery Lab. It illustrates a general process tailored to the two channels that require guidelines: the data acquisition channel, and the data access channel. Notice the link between the two; once wrangling and usage guidelines are in place, access requests may be processed. The process shows the lifetime of lake-resident data, ending with decommission due to expiration of one of the following: individual data access, licensing terms, or staleness (i.e., renewal not needed).

In addition to producing the wrangling and usage guidelines, the data governance process should also create a permanent record of all the information that went into the decision to use a particular data set. Given that suppliers frequently change their licensing terms, it must always be possible to ascertain exactly who agreed to what, and when.

3.2 Obtaining Data
Once data is selected and licensing terms accepted, the next challenge is transferring the data physically from the source to the data lake. As we noted above, the data sources of interest for populating a data lake are for the most part those that support bulk data download. Frequently, data sources themselves provide guidance on how best to acquire their data. Bulk data is typically delivered in files, either from a static inventory provided by the source, or through an API that dynamically constructs files in response to a query. Sometimes, the data set is so large that the only practical means of obtaining it is through the physical shipment of disk drives or tapes. This is the case, for example, if one requires a significant fraction of the National Elevation Dataset [25], a high-resolution rasterized topographical map of the United States. In most cases, however, wranglers obtain files by writing scripts that employ common tools and protocols like ftp, wget, rsync, and http that are readily available and widely understood.

In addition to the common protocols noted above, a growing number of sites support specialized protocols for transferring open data. The two most widely-used such protocols are CKAN [7] and Socrata [30]. While consumers must invest extra effort to implement these protocols, they supplement the raw data they provide with valuable metadata that might otherwise have to be collected and entered manually. We will have more to say about the importance of metadata in the next section.

Data wranglers also need to be concerned with overwhelming the source server, and in many cases data providers request adherence to procedural guidelines that aim to mitigate server overload. For example, the Securities and Exchange Commission's (SEC) EDGAR [8] (Electronic Data Gathering and Retrieval) service requests that bulk ftp downloads be made between 6pm and 9am ET. For some sources such as Yelp, consumers that violate access guidelines may be subject to enterprise-wide denial of access under the licensing terms and infringement policy. Other sites, such as Wikipedia, actively manage download requests, for example by limiting the number of simultaneous connections per IP address to two.

Once the copying starts, the wrangler's job is to verify both the validity and fidelity of the download. Source providers may offer file identification checksums, file counts and sizes, or hash values for this purpose.

Of course, obtaining data is rarely a one-time event. Providers tend to provide updates frequently, causing versioning to become a concern. Furthermore, providers may provide updates in different ways. Easiest to handle are the cases where the provider either adds new files with each version, or updates the content of existing files. More complex versioning approaches entail deletion of files from one version to another, and changes in schema or file structure from one version to another. The ability to handle the different types of versioning approaches demands that the data wrangler's scripts accurately reflect the versioning strategy adopted by the provider.

3.3 Describing Data
Data alone is not useful. A data scientist searching a data lake for useful data must be able to find the data relevant to his or her needs, and once a potentially useful data set is found, he or she will want to know many things about it, e.g.:
• How is this data represented?
• Where did this data come from? (Can I trust it?)
• How old is this data?
• Can I connect this data to data I already have?

Answers to such questions require metadata of various kinds. Schematic metadata is the basic information needed to ingest and process the data, i.e., to answer the first of the four questions above. Turning again to the Bureau of Labor Statistics for an example, they distribute information about wholesale prices in the US economy as a set of related files. The schematic metadata for this data set would include information about how the data is formatted (e.g. string and/or column delimiters) and information about the schema (e.g. column names and types, foreign keys that relate the values in the various files). Unfortunately, this information is not supplied in machine-processable form. Instead, this metadata must be entered manually by the data wrangler or be re-discovered by tooling.

A second type of metadata, semantic metadata, adds meaning to data independent of its representation. It enables the data scientist to find potentially useful data sets and to answer questions like the latter three of those listed above. Placing data sets into categories and/or tagging them and their components (files, tables, columns, documents, etc.) with keywords makes searching for information easier, and information about the data set's provenance can help to resolve issues of reliability or timeliness. An advantage of obtaining data sets using "open data" protocols like CKAN is that the data returned is supplemented with a core set of important semantic metadata, including a title, description, categorization, tags, revision history, license information and more.

Semantic metadata can also help to link disparate data sets to one another. Associating elements of a data set with concepts or objects in an ontology that represents the real world can reveal connections between data sets that would not otherwise be apparent. For example, schematic metadata indicating that a numeric column actually contains postal codes, and understanding that a postal code represents a geographic region, allows the data to be plotted on a map and potentially integrated with additional geospatial data from other sources. Providing such metadata manually is a tedious process, and building better tools to do so automatically is an important area of research [16][23].

A third type of metadata is less frequently studied and less well understood. The first user of almost any non-trivial data set discovers idiosyncrasies in the data that are crucial to understanding it and using it effectively. Continuing to use the BLS data as an example, rows representing data reported monthly use two columns to encode the year and month. Curiously, the month column contains values that range from 'M01' to 'M13'.
Upon deeper investigation, one learns that rows containing values between 'M01' and 'M12' represent data for months January through December, whereas a row containing 'M13' contains the annual average value of the statistic. A great deal of effort could be saved if subsequent users of this data set were able to consult the initial user, and become aware of this and other similar features of the data without having to rediscover them afresh.

We call this type of metadata, information about who else has used the data, what their experiences were, where they did or did not find value, and so forth, conversational metadata, and believe it to be of equal importance to the other types of metadata we have discussed [21]. The conversation that revolves around a data set among a group of data scientists bears a strong resemblance to the "buzz" that develops around a band or movie on social media, and we believe that tools for recording and searching this conversation should follow a similar paradigm. To emphasize the need for such metadata, Zeng and Qin [37] have noted that it is indispensable even if the secondary user is the same as the original one; human memory is so short that even originators must rely on their own metadata. This problem will only get worse as the amount and variety of available data increases.

4. GROOMING DATA
As we have noted, data obtained in its raw form is often not suitable for direct use by analytics. We use the term data grooming to describe the step-by-step process through which raw data is made consumable by analytic applications. Metadata plays a crucial role throughout this process. The first steps in the grooming process use schematic metadata to transform raw data into data that can be processed by standard data management tools. Which tools are appropriate depends on the type of data being ingested: those used for searching and manipulating genomic sequences differ from those used for geospatial data, which in turn differ from those used for tabular data.

Even focusing just on tabular data, there are a myriad of ways in which it can be represented. In some cases, the sought-after data may be embedded in PDF files or in other types of documents designed for human readability rather than processing by machine. Spreadsheets, for example, often contain artifacts like multi-column headings that do not translate directly to the abstractions of database management software. In other cases, the information may be represented in a custom format that must be converted, or at least understood, before it can be processed with conventional data management tools. Other common formats include delimited or fixed-format text files, JSON, XML and HTML. Even for formats like these, which were designed for automated processing, information like delimiters, field widths and data types must either be supplied or deduced, and in either case becomes a critical aspect of the schematic metadata associated with the data set.

We also note that it is not uncommon for providers to change how their data is formatted, or to provide data for different time periods in different formats. For example, certain data collected by the National Climatic Data Center through 2011 conforms to one schema, but similar information collected from 2012 onward uses a different schema. Such changes disrupt a smoothly-running data grooming pipeline, and must be detected and accommodated.

Once the data can be ingested, normalization of certain values can facilitate further processing and enable integration with other data sets. For example, we have already noted the idiosyncratic way in which the Bureau of Labor Statistics represents dates. Integration of this data with other sources of economic data, or even something as simple as creating a graph that shows how a value (e.g. the price of gasoline) varies over time, is difficult without normalizing the dates to a standard format. Similarly, a data table must often be pivoted to permit optimal processing. Economic data from the Bureau of Economic Analysis, for example, is structured so that each year is represented by a column, with rows corresponding to specific measures, and rows containing subtotals interleaved with regular data rows. A conventional representation of this data would invert this relationship, making computation of aggregates and time series much simpler.

Throughout the grooming process, a detailed record must be kept of exactly what was done at each stage. This is particularly the case if the grooming process alters the "information content" of the data in any way. While normalization, annotation, etc. may add significant value to a data set, the consumer of the data must always be able to observe and understand the provenance of the data they rely upon.

5. PROVISIONING DATA
The previous sections have focused on getting data into the data lake. We now turn to the means and policies by which consumers take data out of the data lake, a process we refer to as data provisioning. It is our belief that running sophisticated analytics directly against the data lake is usually impractical. In most cases, a data scientist will want to extract a data set (or subset) from the lake and customize the manner and location in which it is stored so that the analytics can execute as efficiently as possible. However, before undertaking a possibly complex and time-consuming provisioning process, the data scientist should be able to do a preliminary exploration of the data, perhaps including simple visualizations and the like, to determine the data's utility and spot anomalies that may require further consideration.

The technical issues that arise in getting data out of the data lake are similar to those that arise in putting data into the lake, and are handled with similar tools and techniques, often in ways that are particular to the infrastructure of the enterprise. However, the point when data is taken out of the data lake represents a critical event in the data's life cycle. Once data leaves the lake, it becomes far more difficult to enforce controls over its use. A data scientist checking out data must be made aware of, and have agreed to, the usage guidelines that were prepared for his or her use case.

Unless the target user is familiar with a raw data set, uncurated data is frequently very difficult to work with. Users are required to understand its content, structure, and format prior to deploying it for gainful purpose. Additionally, as described in [12], contextual data is often necessary to enhance analytical practices performed on core domain data. That is, value is derived by combining pertinent domain data with related (contextual) data from other sources. For example, a recent study on the spread of diseases analyzes DNA sample data swiped from surfaces in a city, such as turnstiles, public railings, and elevator buttons, to identify the microbes present at each location, but it is contextual data such as demographic data and traffic patterns that brings insight into patterns of microbes across neighborhoods, income levels, and populations. In enterprise environments, open data only provides value when it can be contextualized with the enterprise's private data. But identifying and leveraging contextual data is very difficult given that providers such as data.gov, BLS (Bureau of Labor Statistics), NOAA (National Oceanic and Atmospheric Administration) and most others typically organize their data in a hierarchy with either categorical or data-driven delineations that make sense to the applicable domain, and hence it is not readily consumable unless thoroughly described via metadata.
6. PRESERVING DATA
Managing a data lake also requires attention to maintenance issues such as staleness, expiration, decommissions and renewals, as well as the logistical issues of the supporting technologies (assuring uptime access to data, sufficient storage space, etc.).

For completeness, we provide a high-level description of the issues that arise around data archiving and preservation. The reason for the light treatment is that the literature is quite rich in this regard, as evidenced by the copious references collected in the Research Data Curation Bibliography [2].

Data preservation has gained much momentum in recent years. In fact, scientific project proposals presented to the NSF must now include a Data Management Plan, essentially a description of how data will be preserved [14].

A seminal paper [11] on scientific data preservation makes a distinction between ephemeral data, which cannot be reproduced and must hence be preserved, and stable data, which is derived and therefore disposable. In non-scientific domains, such a distinction is not as simple, given that issues of currency need be addressed. For example, it was widely publicized that Twitter experienced heavy soccer-related volume during this summer's World Cup, with a steady decline since [32]. While it is highly conceivable that this data will get much use as businesses wish to optimize social behaviors during sporting events, it is equally conceivable that the amount of analytics performed over this event's generated data will wane as it is replaced by information from more recent events. At what point is the data no longer necessary, if ever? The manner in which dormant data is handled becomes relevant, as access to it may come in spurts. Furthermore, identifying the point in time when data is no longer necessary, whether due to staleness, age, or lack of context, requires setting up a preservation strategy.

7. RELATED WORK
The concept of a data lake is a natural evolution from the solid foundation of work in data federation and data integration, including Extract-Transform-Load (ETL) techniques, data cleansing, schema integration and metadata management systems; [13][15][29][33] provide a historical perspective of the research challenges in these areas. All of this work contributed to the mature enterprise data integration platforms upon which many enterprises rely to build and populate data warehouses [17][18]. However, such systems require a heavy investment in IT infrastructure and skilled developers and administrators, and are tailored for use with tightly controlled enterprise data. As such, they restrict the flow of data into the warehouse as well as its use within the enterprise. Many recent efforts have focused on providing automation and tools to enable less skilled workers to clean, integrate and link data [20][26][31], thus enabling the flow of contextual data into an enterprise to be more fluid.

A closely related field is Digital Rights Management, which focuses on the distribution and altering of digital works [9] and the application of transformation and fair use in copyright law [27], such as is the case for artistic mashups in audio or video recordings. To date, however, we know of no software platform or business process that systematically defines and provides provenance to support the legal and governance issues that would enable curated data to flow into and out of an enterprise with the agility needed to support a new class of applications: applications that create derived works by reusing and recombining enterprise and curated data, while still ensuring legal compliance with the potentially myriad license restrictions associated with the source data.

8. CONCLUSION
We have shown that the creation and use of a data lake, while a simple concept, presents numerous challenges every step of the way. Even after overcoming the legal aspects of "open" data, which deal primarily with licensing and privacy issues, numerous logistical and technical challenges arise in the filling of the lake with raw data. These challenges range over issues such as data selection, description, maintenance, and governance. We have included examples of user scenarios as well as examples of terms and conditions imposed by data providers.

The daunting nature of populating a data lake may lead some to question its purpose. However, given the vast amount of potential observations, analytics, and discoveries that can be derived from cheaply homogenizing data, combined with the evolution of new software tools that take advantage of data in its raw state, not only can the data lake not be ignored, we contend that it will gain prominence in an enterprise's core operational business processes.

Further research in this area focuses on streamlining the processes around data procurement, both in terms of technical automation and logistical optimization. Much of our immediate work concentrates on automatic data interpretation. Given the varying formats of data (tabular, csv, excel, geospatial, text, JSON, XML, proprietary, http, and many others), we are investigating automated analysis and description with the goal of expediting the process of filling the lake.

Additionally, our focus also centers on the area of collaboration, so as to optimize the applicability of lake-resident data. Given the democratization of data that the lake provides, in addition to analytical value (whether business-oriented, scientific, decision support, etc.), further value can be mined from the very way that curated data is used, both within a domain and across domains. In the former case, experts within a domain should be able to systematically share and leverage discoveries with colleagues. In the latter case, it is common for experts in one domain to experience difficulty when communicating with experts from other domains, thus highlighting the importance of both semantic and conversational metadata, as described in Section 3.3, and underlining the need for tools that facilitate data integration.

9. ACKNOWLEDGMENTS
Our thanks to Mandy Chessell and Dan Wolfson, IBM Distinguished Engineers, for their valuable insight into data lake and open data issues.

10. REFERENCES
[1] Angevaare, Inge. 2009. Taking Care of Digital Collections and Data: 'Curation' and Organisational Choices for Research Libraries. LIBER Quarterly: The Journal of European Research Libraries 19, no. 1 (2009): 1-12. http://liber.library.uu.nl/index.php/lq/article/view/7948
[2] Bailey, C. 2014. Research Data Curation Bibliography. http://digital-scholarship.org/rdcb/rdcb.htm
[3] Bureau of Economic Analysis. http://www.bea.gov
[4] Bureau of Labor Statistics. http://www.bls.gov
[5] Chessell, M., Scheepers, F., Nguyen, N., van Kessel, R., and van der Starre, R. 2014. Governing and Managing Big Data for Analytics and Decision Makers. IBM Redguides for Business Leaders. http://www.redbooks.ibm.com/redpapers/pdfs/redp5120.pdf
[6] http://www.cityofboston.gov/doit/databoston/data_disclaimer.asp
[7] CKAN. http://www.ckan.com
[8] EDGAR. U.S. Securities and Exchange Commission. http://www.sec.gov/edgar.shtml
[9] Feigenbaum, Joan. "Security and Privacy in Digital Rights Management, ACM CCS-9 Workshop, DRM 2002, Washington, DC, USA, November 18, 2002, Revised Papers, volume 2696 of Lecture Notes in Computer Science." Lecture Notes in Computer Science (2003).
[10] http://www.forbes.com/sites/ciocentral/2011/07/21/big-data-requires-a-big-new-architecture/
[11] Gray, J., Szalay, A., Thakar, A., Stoughton, C., and vandenBerg, J. 2002. Online Scientific Data Curation, Publication, and Archiving. Technical Report MSR-TR-2002-74. http://www.sdss.jhu.edu/sx/pubs/msr-tr-2002-74.pdf
[12] Haas, L., Cefkin, M., Kieliszewski, C., Plouffe, W., and Roth, M. The IBM Research Accelerated Discovery Lab. 2014 SIGMOD.
[13] Haas, Laura M., Mauricio A. Hernández, Howard Ho, Lucian Popa, and Mary Roth. "Clio grows up: from research prototype to industrial tool." In Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pp. 805-810. ACM, 2005.
[14] Halbert, Martin. "Prospects for research data management." Research Data Management (2013): 1. http://www.clir.org/pubs/reports/pub160/pub160.pdf
[15] Halevy, Alon, Anand Rajaraman, and Joann Ordille. "Data integration: the teenage years." In Proceedings of the 32nd international conference on Very large data bases, pp. 9-16. VLDB Endowment, 2006.
[16] Hassanzadeh, O., et al. "Helix: Online Enterprise Data Analytics". Proceedings of the 20th international conference companion on the World Wide Web, pp. 225-228. ACM, New York, New York, 2011.
[17] http://www-01.ibm.com/software/data/integration/
[18] http://www.informatica.com/ETL
[19] International Health Terminology Standards Development Organisation. http://www.ihtsdo.org/licensing
[20] Kandel, Sean, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. "Wrangler: Interactive visual specification of data transformation scripts." In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 3363-3372. ACM, 2011.
[21] Kandogan, E., Roth, M., Kieliszewski, C., Ozcan, F., Schloss, B., and Schmidt, M. "Data for All: A Systems Approach to Accelerate the Path from Data to Insight." 2013 IEEE International Congress on Big Data.
[22] LinkedIn. https://developer.linkedin.com/documents/linkedin-apis-terms-use
[23] Murthy, K., et al. "Exploiting Evidence from Unstructured Data to Enhance Master Data Management". PVLDB 5(12): 1862-1873, 2012.
[24] National Climatic Data Center. http://www.ncdc.noaa.gov/
[25] National Elevation Dataset. http://ned.usgs.gov
[26] http://openrefine.org
[27] Power, Aaron. "15 Megabytes of Fame: A Fair Use Defense for Mash-Ups as DJ Culture Reaches its Postmodern Limit." Sw. UL Rev. 35 (2005): 577.
[28] Rivera, J., and van der Meulen, R. 2014. Gartner Says Beware of the Data Lake Fallacy. Gartner Press Release. http://www.gartner.com/newsroom/id/2809117
[29] Roth, Mary, and Peter M. Schwarz. "Don't Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources." In VLDB, vol. 97, pp. 25-29. 1997.
[30] Socrata. http://www.socrata.com
[31] Stonebraker, Michael, Daniel Bruckner, Ihab F. Ilyas, George Beskales, Mitch Cherniack, Stanley B. Zdonik, Alexander Pagan, and Shan Xu. "Data Curation at Scale: The Data Tamer System." In CIDR. 2013.
[32] https://blog.twitter.com/2014/insights-into-the-worldcup-conversation-on-twitter
[33] Vassiliadis, Panos. "A survey of Extract–transform–Load technology." International Journal of Data Warehousing and Mining (IJDWM) 5, no. 3 (2009): 1-27.
[34] http://wikibon.org/blog/big-data-statistics/
[35] Wikipedia. http://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License
[36] http://www.yelp.com/static?p=tos
[37] Zeng, Marcia L., and Qin, Jian. "Metadata". New York: Neal-Schuman, 2008. ISBN: 978-1555706357
