Data Wrangling: The Challenging Journey from the Wild to the Lake
Ignacio Terrizzano, Peter Schwarz, Mary Roth, John E. Colino
IBM Research
650 Harry Rd
San Jose, CA 95120
{igterriz, pschwarz, torkroth, jcolino}@us.ibm.com
ABSTRACT
Much has been written about the explosion of data, also known as the “data deluge”. Similarly, much of today's research and decision making is based on the de facto acceptance that knowledge and insight can be gained from analyzing and contextualizing the vast (and growing) amount of “open” or “raw” data. The concept that the large number of data sources available today facilitates analyses on combinations of heterogeneous information that would not be achievable via “siloed” data maintained in warehouses is very powerful. The term data lake has been coined to convey the concept of a centralized repository containing virtually inexhaustible amounts of raw (or minimally curated) data that is readily made available anytime to anyone authorized to perform analytical activities. The often unstated premise of a data lake is that it relieves users from dealing with data acquisition and maintenance issues, and guarantees fast access to local, accurate and updated data without incurring the development costs (in terms of time and money) typically associated with structured data warehouses. However appealing this premise, it is our practical experience, and that of our customers, that “raw” data is logistically difficult to obtain, quite challenging to interpret and describe, and tedious to maintain. Furthermore, these challenges multiply as the number of sources grows, thus increasing the need to thoroughly describe and curate the data in order to make it consumable. In this paper, we present and describe some of the challenges inherent in creating, filling, maintaining, and governing a data lake, a set of processes that collectively define the actions of data wrangling, and we propose that what is really needed is a curated data lake, where the lake contents have undergone a curation process that enables their use and delivers the promise of ad-hoc data accessibility to users beyond the enterprise IT staff.

Categories and Subject Descriptors
H.1.1 [Systems and Information Theory]: Value of Information.
H.3.2 [Information Storage and Retrieval]: Information Storage.

General Terms
Management, Documentation, Design, Legal Aspects.

Keywords
Data lake, data wrangling, data curation, data integration, metadata, schema mapping, analytics sandboxes

This article is published under a Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits distribution and reproduction in any medium as well as allowing derivative works, provided that you attribute the original work to the author(s) and CIDR 2015.
7th Biennial Conference on Innovative Data Systems Research (CIDR '15), January 4-7, 2015, Asilomar, California, USA.

1. INTRODUCTION
We have all been inundated with facts and statistics about the data deluge that surrounds us from consumer-generated and freely available social media data, from the vast corpus of open data, and from the growing body of sensor data as we enter the era of the Internet of Things [34]. Along with the bombardment of statistics about this data deluge, there appears to be a de facto acceptance that there is critical new business value or scientific insight that can be gained from analyzing the zettabytes of data now at our fingertips, if only enterprise data can be freed from its silos and easily mixed with external “raw” data for self-serve, ad-hoc analysis by an audience broader than the enterprise IT staff.

Financial institutions, for example, now speak of offering personalized services, such as determining if a client is exposed to legal risks due to the contents of his or her portfolio. Such analysis requires access to internal data, external news reports and market data about the companies that make up the portfolio, as well as publicly available regulatory information. As another example, a Fortune 1000 information processing company that provides outsourcing services to manage their clients' data processing systems would also like to offer them analytic sandboxes and customized access to demographic and economic data by geography, all of which is available from sources like the U.S. Census Bureau and the U.S. Bureau of Labor Statistics. As yet another example, IBM Research itself has recognized that gathering a large body of contextual data and making it readily accessible to its research scientists is strategically important for innovation [12].

IBM estimates that a staggering 70% of the time spent on analytic projects is concerned with identifying, cleansing, and integrating data, due to the difficulties of locating data that is scattered among many business applications, the need to reengineer and reformat it in order to make it easier to consume, and the need to regularly refresh it to keep it up-to-date [5]. This cost, along with recent trends in the growth and availability of data, has led to the concept of a capacious repository for raw data called a data lake. According to a recent definition, and as shown in Figure 1, a data lake is a set of centralized repositories containing vast amounts of raw data (either structured or unstructured), described by metadata, organized into identifiable data sets, and available on demand [5]. Data in the lake supports discovery, analytics, and reporting, usually by deploying cluster tools like Hadoop. Unlike traditional warehouses, the format of the data is not described (that is, its schema is not available) until the data is needed. By delaying the categorization of data from the point of entry to the point of use [10], analytical operations that transcend the rigid format of an adopted schema become possible. Query and search operations on the data can be performed using traditional database technologies (when structured), as well as via alternate means such as indexing and NoSQL derivatives.
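This deferred, schema-on-read categorization can be illustrated with a small sketch (the file contents, column names, and types below are invented for illustration, not drawn from any particular lake): the raw text is stored untyped at ingest, and a caller-supplied schema is applied only at the point of use.

```python
import csv
import io

# At ingest, the lake stores the data as raw, untyped text; no schema is declared.
raw = "station,date,temp_c\nKSJC,2014-07-01,24.3\nKSFO,2014-07-01,17.8\n"

def read_with_schema(text, schema):
    """Apply a caller-supplied schema while reading: 'schema on read'."""
    reader = csv.DictReader(io.StringIO(text))
    return [{col: cast(row[col]) for col, cast in schema.items()} for row in reader]

# Only at the point of use does this particular analysis decide the column types.
rows = read_with_schema(raw, {"station": str, "date": str, "temp_c": float})
hottest = max(rows, key=lambda r: r["temp_c"])  # typed comparison is now possible
```

A warehouse would have fixed this schema at load time; here, a different analysis could reread the same bytes under a different schema.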
Given the challenges present when working with vast amounts of raw data, particularly upon first use, we propose that what is needed to provide self-service, agile access to data is a curated data lake.

Figure 1: Data Lake Logical Architecture

In this paper, we present a number of challenges inherent in creating, filling, maintaining, and governing a curated data lake, a set of processes that collectively define the actions of data wrangling (see Figure 2). These are challenges not only reported by our customers, but also ones that we face ourselves in creating a work-in-progress data lake to be used both internally by IBM research staff and in client engagements.
- What data is being licensed, and how or where is it being made available?
- Can the data be obtained at no cost, or is there a charge associated with access? If there is a charge, how is it applied (e.g. one-time, periodic, per data item accessed, etc.)?
- What kinds of use are permitted or prohibited by the license?
- What risks are incurred by the enterprise in accepting the license?

The latter two questions are closely related, and often difficult to answer. Restrictions on the use of data abound. For example, the Terms of Use for the LinkedIn Self-Service API [22] include the following clause:

Risk arises because it is often unclear whether a particular use of data does or does not constitute a derived work. For example, consider a process that uses text annotators to analyze copyrighted data licensed under such terms, and builds a knowledge graph to represent the extracted information. Is the knowledge graph a “derived work” that must be distributed free of charge? Ultimately, the answer to such questions may have to come from a court, and different jurisdictions may answer the question in different ways.

Certain special classes of data introduce additional risks. If data that contains, or might contain, Sensitive Personal Information (SPI) is placed in the data lake, controls must be in place to ensure that it is only used for legal and authorized purposes, whether the data is internal to the enterprise or acquired from third parties.
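One way to keep the answers to such licensing questions actionable, including the SPI concern, is to record them as structured metadata attached to each data set when it enters the lake. The sketch below is our own illustration; the field names and purpose strings are invented, not a standard.

```python
from dataclasses import dataclass, field

@dataclass
class LicenseMetadata:
    """Per-data-set record of the licensing answers gathered during wrangling."""
    source: str                    # where the data is made available
    cost: str                      # e.g. "free", "one-time", "periodic", "per-item"
    permitted_uses: set = field(default_factory=set)
    contains_spi: bool = False     # Sensitive Personal Information present?

def use_allowed(meta: LicenseMetadata, purpose: str, spi_authorized: bool) -> bool:
    """Gate a proposed use: the purpose must be licensed, and SPI needs authorization."""
    if meta.contains_spi and not spi_authorized:
        return False
    return purpose in meta.permitted_uses

# An illustrative "open" data set licensed for research use only.
meta = LicenseMetadata(source="example.org/open-data", cost="free",
                       permitted_uses={"research"})
```

A real governance process would add fields for jurisdiction, expiration, and the approving stakeholders, but even this minimal record makes the license terms machine-checkable at access time.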
In a global enterprise, movement of data across international boundaries introduces yet more complexity. Export controls may prohibit transmission of certain kinds of sensitive data, privacy laws vary from country to country, and data may be licensed under different terms in different places. For example, the SNOMED medical terminology system can be licensed free of charge in countries that are members of the International Health Terminology Standards Development Organisation [19] (IHTSDO), but requires a fee to be paid in other countries.

Lastly, data providers often make a distinction between research or personal use and commercial use of the data they distribute. Even many so-called “open” data sites allow their data to be used freely for research, but require a special license to be negotiated for other uses. For example, the City of Boston [6] restricts the use of its open data by businesses as follows:

    User may use the City's Data in the form provided by the City for User's own internal business or organizational purposes and for no other purpose.

Similarly, Yelp's [36] terms of service contain an outright prohibition on commercial use of the data in their RSS feed. As in other cases mentioned above, the lines between permitted and prohibited uses may be unclear and subject to interpretation.

What is needed to manage the various risks associated with third-party data, and to prevent the data lake from becoming a data swamp, is a data governance process that brings together the many stakeholders affected by the decision to use such data: domain experts who can determine the data's potential value, legal advisors to digest and interpret the license terms and identify other risks, and management representatives empowered to weigh the risks and benefits and come to a decision. Assuming the benefits outweigh the risks, the end result of this process is a set of guidelines that delineate how employees are permitted to obtain and use third-party data from a particular source, expressed in clear terms that a data scientist can understand and abide by.

We distinguish two sets of guidelines that are typically needed. The first set, the wrangling guidelines, advises the team that will obtain the data from its source about the rules they must follow to comply with the license. For example, wrangling guidelines may include technical restrictions on how a provider's web site may be accessed (e.g. “only from a specified IP address, allowing at least 2 seconds between download requests”). The wranglers may also be asked to look for and exclude certain material, such as copyrighted images, that falls outside the scope of the license, and must be prepared to remove any material if ordered to do so.

The second set of guidelines, the usage guidelines, must be tailored to the specific use case(s) contemplated by the enterprise, and spell out, in context, how employees may use the data while complying with the supplier's license. Any employee wishing to obtain the data from the lake must agree to these guidelines. In most cases, permission to use the data will be granted only for a limited time, after which re-approval will be needed. Similar usage guidelines are required for data internal to the enterprise that has been contributed to the lake. In either case, controls must be in place to ensure that the data is only used for appropriate purposes.
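Technical wrangling restrictions like the rate-limit example above translate directly into acquisition code. A minimal sketch, assuming a plain HTTP provider (the 2-second figure mirrors the example; a real wrangler would also handle errors, authentication, and material exclusions):

```python
import time
import urllib.request

MIN_INTERVAL = 2.0  # seconds between requests, per the wrangling guidelines

def throttle_wait(last_request, now, min_interval=MIN_INTERVAL):
    """How long to sleep so successive requests are at least min_interval apart."""
    return max(0.0, min_interval - (now - last_request))

def polite_fetch(urls, min_interval=MIN_INTERVAL):
    """Download each URL without exceeding the provider's licensed request rate."""
    last = float("-inf")   # no previous request yet
    pages = []
    for url in urls:
        time.sleep(throttle_wait(last, time.monotonic(), min_interval))
        last = time.monotonic()
        with urllib.request.urlopen(url) as resp:  # fetch from the provider
            pages.append(resp.read())
    return pages
```

Encoding the guideline as code, rather than relying on wranglers to remember it, makes compliance auditable.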
Even focusing just on tabular data, there is a myriad of ways in which it can be represented. In some cases, the sought-after data may be embedded in PDF files or in other types of documents designed for human readability rather than processing by machine. Spreadsheets, for example, often contain artifacts like multi-column headings that do not translate directly to the abstractions of database management software. In other cases, the information may be represented in a custom format that must be converted, or at least understood, before it can be processed with conventional data management tools. Other common formats include delimited or fixed-format text files, JSON, XML and HTML. Even for formats like these, which were designed for automated processing, information like delimiters, field widths and data types must either be supplied or deduced, and in either case becomes a critical aspect of the schematic metadata associated with the data set.

We also note that it is not uncommon for providers to change how their data is formatted, or to provide data for different time periods in different formats. For example, certain data collected by the National Climatic Data Center through 2011 conforms to one schema, but similar information collected from 2012 onward uses a different schema. Such changes disrupt a smoothly-running data grooming pipeline, and must be detected and accommodated.

Once the data can be ingested, normalization of certain values can facilitate further processing and enable integration with other data sets. For example, we have already noted the idiosyncratic way in

The technical issues that arise in getting data out of the data lake are similar to those that arise in putting data into the lake, and are handled with similar tools and techniques, often in ways that are particular to the infrastructure of the enterprise. However, the point when data is taken out of the data lake represents a critical event in the data's life cycle. Once data leaves the lake, it becomes far more difficult to enforce controls over its use. A data scientist checking out data must be made aware of, and have agreed to, the usage guidelines that were prepared for his or her use case.

Unless the target user is familiar with a raw data set, uncurated data is frequently very difficult to work with. Users are required to understand its content, structure, and format before deploying it for gainful purpose. Additionally, as described in [12], contextual data is often necessary to enhance analytical practices performed on core domain data. That is, value is derived by combining pertinent domain data with related (contextual) data from other sources. For example, a recent study on the spread of diseases analyzes DNA samples swiped from surfaces in a city, such as turnstiles, public railings, and elevator buttons, to identify the microbes present at each location, but it is contextual data such as demographics and traffic patterns that brings insight into patterns of microbes across neighborhoods, income levels, and populations.

In enterprise environments, open data only provides value when it can be contextualized with the enterprise's private data. But identifying and leveraging contextual data is very difficult, given that providers such as data.gov, BLS (Bureau of Labor Statistics), NOAA (National Oceanic and Atmospheric Administration) and most others typically organize their data in a hierarchy with either categorical or data-driven delineations that make sense to the applicable domain; hence it is not readily consumable unless thoroughly described via metadata.

To date, however, we know of no software platform or business process to systematically define and provide provenance to support the legal and governance issues that enable curated data to flow into and out of an enterprise with the agility needed to support a new class of applications that create derived works by reusing and recombining enterprise and curated data, while still ensuring legal compliance with the potential myriad of license restrictions associated with the source data.
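For delimited text files, some of the schematic metadata described above (the delimiter, the presence of a header row) can be deduced rather than supplied. A small sketch using Python's csv.Sniffer on an invented sample (a real grooming pipeline would validate the guess against more of the file and record it for reuse):

```python
import csv

# A fragment of an incoming file whose delimiter and header are not declared.
sample = "id|city|population\n1|Austin|885400\n2|Boston|645966\n"

sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample)          # deduce the delimiter from the sample
has_header = sniffer.has_header(sample)  # deduce whether a header row is present

# The deductions become schematic metadata recorded alongside the data set.
schema_metadata = {"delimiter": dialect.delimiter, "has_header": has_header}
```

Comparing such deduced metadata for each new delivery against the recorded values is also one way to detect the provider-side schema changes noted above.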
6. PRESERVING DATA
Managing a data lake also requires attention to maintenance issues such as staleness, expiration, decommissions and renewals, as well as to the logistical issues of the supporting technologies (assuring uptime access to data, sufficient storage space, etc.).

For completeness, we provide a high-level description of the issues that arise around data archiving and preservation. The reason for the light treatment is that the literature is quite rich in this regard, as evidenced by the copious references collected in the Research Data Curation Bibliography [2].

Data preservation has gained much momentum in recent years. In fact, scientific project proposals presented to NSF must now include a Data Management Plan, essentially a description of how data will be preserved [14].

A seminal paper [11] on scientific data preservation makes a distinction between ephemeral data, which cannot be reproduced and must hence be preserved, and stable data, which is derived and therefore disposable. In non-scientific domains, such a distinction is not as simple, given that issues of currency must also be addressed. For example, it was widely publicized that Twitter experienced heavy soccer-related volume during this summer's World Cup, with a steady decline since [32]. While it is highly conceivable that this data will see much use as businesses seek to optimize social behaviors during sporting events, it is equally conceivable that the amount of analytics performed over this event's data will wane as it is replaced by information from more recent events. At what point is the data no longer necessary, if ever? The manner in which dormant data is handled becomes relevant, as access to it may come in spurts. Furthermore, identifying the point in time when data is no longer necessary, whether due to staleness, age, or lack of context, requires setting up a preservation strategy.

7. RELATED WORK
The concept of a data lake is a natural evolution from the solid foundation of work in data federation and data integration, including Extract-Transform-Load (ETL) techniques, data cleansing, schema integration and metadata management systems; [13][15][29][33] provide a historical perspective of the research challenges in these areas. All of this work contributed to the mature enterprise data integration platforms upon which many enterprises rely to build and populate data warehouses [17][18].

However, such systems require a heavy investment in IT infrastructure and in skilled developers and administrators, and are tailored for use with tightly controlled enterprise data. As such, they restrict the flow of data into the warehouse as well as its use within the enterprise. Many recent efforts have focused on providing automation and tools that enable less skilled workers to clean, integrate and link data [20][26][31], thus enabling the flow of contextual data into an enterprise to be more fluid.

A closely related field is Digital Rights Management, which focuses on the distribution and alteration of digital works [9] and on the application of transformation and fair use in copyright law [27], as in the case of artistic mashups of audio or video recordings.

8. CONCLUSION
We have shown that the creation and use of a data lake, while a simple concept, presents numerous challenges at every step of the way. Even after overcoming the legal aspects of “open” data, which deal primarily with licensing and privacy issues, numerous logistical and technical challenges arise in filling the lake with raw data. These challenges span issues such as data selection, description, maintenance, and governance. We have included examples of user scenarios as well as examples of terms and conditions imposed by data providers.

The daunting nature of populating a data lake may lead some to question its purpose. However, given the vast number of potential observations, analytics, and discoveries that can be derived from cheaply homogenized data, combined with the evolution of new software tools that take advantage of data in its raw state, we contend that not only can the data lake not be ignored, but that it will gain prominence in an enterprise's core operational business processes.

Further research in this area focuses on streamlining the processes around data procurement, in terms of both technical automation and logistical optimization. Much of our immediate work concentrates on automatic data interpretation. Given the varying formats of data (tabular, CSV, Excel, geospatial, text, JSON, XML, proprietary, HTTP, and many others), we are investigating automated analysis and description with the goal of expediting the process of filling the lake.

Additionally, our focus also centers on collaboration, so as to optimize the applicability of lake-resident data. Given the democratization of data that the lake provides, further value can be mined, beyond analytical value (whether business-oriented, scientific, decision support, etc.), from the very way that curated data is used, both within a domain and across domains. In the former case, experts within a domain should be able to systematically share and leverage discoveries with colleagues. In the latter case, it is common for experts in one domain to experience difficulty when communicating with experts from other domains, highlighting the importance of both semantic and conversational metadata, as described in Section 3.3, and underlining the need for tools that facilitate data integration.

9. ACKNOWLEDGMENTS
Our thanks to Mandy Chessell and Dan Wolfson, IBM Distinguished Engineers, for their valuable insight into data lake and open data issues.

10. REFERENCES
[1] Angevaare, Inge. 2009. Taking Care of Digital Collections and Data: 'Curation' and Organisational Choices for Research Libraries. LIBER Quarterly: The Journal of European Research Libraries 19, no. 1 (2009): 1-12. http://liber.library.uu.nl/index.php/lq/article/view/7948
[2] Bailey, C. 2014. Research Data Curation Bibliography. http://digital-scholarship.org/rdcb/rdcb.htm
[3] Bureau of Economic Analysis. http://www.bea.gov
[4] Bureau of Labor Statistics. http://www.bls.gov
[5] Chessell, M., Scheepers, F., Nguyen, N., van Kessel, R., and van der Starre, R. 2014. Governing and Managing Big Data for Analytics and Decision Makers. IBM Redguides for Business Leaders. http://www.redbooks.ibm.com/redpapers/pdfs/redp5120.pdf
[6] City of Boston. http://www.cityofboston.gov/doit/databoston/data_disclaimer.asp
[7] CKAN. http://www.ckan.com
[8] EDGAR. U.S. Securities and Exchange Commission. http://www.sec.gov/edgar.shtml
[9] Feigenbaum, Joan. "Security and Privacy in Digital Rights Management, ACM CCS-9 Workshop, DRM 2002, Washington, DC, USA, November 18, 2002, Revised Papers." Volume 2696 of Lecture Notes in Computer Science (2003).
[10] http://www.forbes.com/sites/ciocentral/2011/07/21/big-data-requires-a-big-new-architecture/
[11] Gray, J., Szalay, A., Thakar, A., Stoughton, C., and vandenBerg, J. 2002. Online Scientific Data Curation, Publication, and Archiving. Technical Report MSR-TR-2002-74. http://www.sdss.jhu.edu/sx/pubs/msr-tr-2002-74.pdf
[12] Haas, L., Cefkin, M., Kieliszewski, C., Plouffe, W., and Roth, M. "The IBM Research Accelerated Discovery Lab." 2014 SIGMOD.
[13] Haas, Laura M., Mauricio A. Hernández, Howard Ho, Lucian Popa, and Mary Roth. "Clio grows up: from research prototype to industrial tool." In Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pp. 805-810. ACM, 2005.
[14] Halbert, Martin. "Prospects for research data management." Research Data Management (2013). http://www.clir.org/pubs/reports/pub160/pub160.pdf
[15] Halevy, Alon, Anand Rajaraman, and Joann Ordille. "Data integration: the teenage years." In Proceedings of the 32nd international conference on Very large data bases, pp. 9-16. VLDB Endowment, 2006.
[16] Hassanzadeh, O., et al. "Helix: Online Enterprise Data Analytics." In Proceedings of the 20th international conference companion on the World Wide Web, pp. 225-228. ACM, New York, New York, 2011.
[17] http://www-01.ibm.com/software/data/integration/
[18] http://www.informatica.com/ETL
[19] International Health Terminology Standards Development Organisation. http://www.ihtsdo.org/licensing
[20] Kandel, Sean, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. "Wrangler: Interactive visual specification of data transformation scripts." In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 3363-3372. ACM, 2011.
[21] Kandogan, E., Roth, M., Kieliszewski, C., Ozcan, F., Schloss, B., and Schmidt, M. "Data for All: A Systems Approach to Accelerate the Path from Data to Insight." 2013 IEEE International Congress on Big Data.
[22] LinkedIn. https://developer.linkedin.com/documents/linkedin-apis-terms-use
[23] Murthy, K., et al. "Exploiting Evidence from Unstructured Data to Enhance Master Data Management." PVLDB 5(12): 1862-1873, 2012.
[24] National Climatic Data Center. http://www.ncdc.noaa.gov/
[25] National Elevation Dataset. http://ned.usgs.gov
[26] http://openrefine.org
[27] Power, Aaron. "15 Megabytes of Fame: A Fair Use Defense for Mash-Ups as DJ Culture Reaches its Postmodern Limit." Sw. UL Rev. 35 (2005): 577.
[28] Rivera, J., and van der Meulen, R. 2014. Gartner Says Beware of the Data Lake Fallacy. Gartner Press Release. http://www.gartner.com/newsroom/id/2809117
[29] Roth, Mary, and Peter M. Schwarz. "Don't Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources." In VLDB, vol. 97, pp. 25-29. 1997.
[30] Socrata. http://www.socrata.com
[31] Stonebraker, Michael, Daniel Bruckner, Ihab F. Ilyas, George Beskales, Mitch Cherniack, Stanley B. Zdonik, Alexander Pagan, and Shan Xu. "Data Curation at Scale: The Data Tamer System." In CIDR. 2013.
[32] https://blog.twitter.com/2014/insights-into-the-worldcup-conversation-on-twitter
[33] Vassiliadis, Panos. "A survey of Extract-transform-Load technology." International Journal of Data Warehousing and Mining (IJDWM) 5, no. 3 (2009): 1-27.
[34] http://wikibon.org/blog/big-data-statistics/
[35] Wikipedia. http://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License
[36] Yelp. http://www.yelp.com/static?p=tos
[37] Zeng, Marcia L., and Qin, Jian. "Metadata." New York: Neal-Schuman, 2008. ISBN: 978-1555706357