© Cloudera, Inc. All rights reserved.
TRANSFORMING AND SCALING LARGE SCALE DATA ANALYTICS: MOVING TO A CLOUD-BASED ENTERPRISE DATA LAKE
Terry Padgett - Senior Solutions Architect, Cloudera
Nitin Naik - Chief Technology Officer, U.S. Census Bureau
Census Background
• The United States' leading provider of quality data about its people and economy
• Decennial, Economic, Demographic, and a multitude of other surveys
• Serves other federal agencies
• Processes large volumes of data
• Preparing for future analytic needs of the enterprise
Enterprise Data Lake
Make changes to culture, processes, and technologies to practice
and accelerate the efforts required to remain a leader in data and
technology innovation.
• Optimize Survey Operations
• Reduce Respondent Burden
• Improve Data Products
• Consolidate Data and Code
• Manage Large Datasets
• Centralize Security
EDL Guiding Principles
• Scalability
• Availability
• Automation
• Security & Privacy
• Data Diversity
• Data Stewardship
• Identifiable, Locatable and Linkable Data
• Reproducibility
• Governance
Cloud First
• Establish the EDL in GovCloud
• On-demand Server Instances
• Cloud Object Stores
• Leverage Serverless Computing
Data Availability
• Short-term and shared data available through cloud object stores
• Long-term data available through archival stores
• Built-in resiliency of storage to prevent data loss
• EDL applications deployed for high availability
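As an illustration of the short-term versus long-term split above, the following is a minimal sketch of an S3 lifecycle configuration that moves objects under a long-term prefix to an archival storage class. The prefix name, the 90-day threshold, and the rule ID are assumptions made for the example; they do not come from the presentation, and the actual EDL retention rules are not described in it.

```python
import json


def build_lifecycle_config(long_term_prefix: str, archive_after_days: int) -> dict:
    """Build an S3 lifecycle configuration (as a plain dict) that transitions
    objects under `long_term_prefix` to the GLACIER storage class.

    Both arguments are hypothetical illustration values, not EDL settings.
    """
    if archive_after_days < 0:
        raise ValueError("archive_after_days must be non-negative")
    return {
        "Rules": [
            {
                "ID": "archive-long-term-data",  # hypothetical rule name
                "Filter": {"Prefix": long_term_prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


# Assumed layout: long-term data lives under "long-term/"; short-term and
# shared data live elsewhere in the bucket and are left in the object store.
config = build_lifecycle_config("long-term/", 90)
print(json.dumps(config, indent=2))
```

The dict follows the shape accepted by the S3 lifecycle API, so it could be passed to a client call or rendered into infrastructure templates; only the document is built here, no AWS call is made.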
Deployment Automation
• Repeatable deployment across tiers
• Platform Infrastructure - Terraform
• AMI Creation - Ansible, Puppet, scripting
• HDP Clusters - Cloudbreak
Census Business Process and EDL
Mapping the Survey Lifecycle to the Data Lifecycle and identifying the data flow allows the EDL to incrementally build upon the key areas highlighted in green. The EDL will focus on the Enterprise Data Lifecycle stages (Process, Derive, etc.) and leverage technology advances (e.g., data mashups, machine learning, distributed computing).
[Figure: Survey Lifecycle mapped to Data Lifecycle]
• Consolidation areas: Data Collection Systems; Data Management / Store Systems; Data Dissemination Systems
• Programs shown: CEDCaP, Enterprise Data Lake (EDL), CEDSCI
• Survey Lifecycle: Survey Design; Frame Development; Sample Design; Instrument Development; Response Data Collection; Data Editing & Imputation; Estimation, Data Review, & Analysis; Disclosure Avoidance; Data Product Dissemination; Research/Analytics
• Data Lifecycle: Define, Collect, Capture (3rd Party Data Capture), Process, Derive, Publish, Research, Disseminate
• Legend: areas currently in the scope of CEDCaP, CEDSCI or other programs; areas currently in the scope of EDL; EDL supported areas (not in scope)
Enterprise Data Lake Features
• Data Control
• Data Lineage
• Authorization Model
• Storage Management
• Data Sharing
• Dynamic Platform Provisioning
• Cloud-based
• Cost Control
Data Control
• Data Registration
• Datasets onboarded with mandatory metadata
• Registered in the existing Data Management System
• Project access controls generated
• Code Repositories
• Data Lineage: Atlas
• Authorization: Ranger
• Controls for projects
• Column protection
• Row filtering
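To make the column protection and row filtering controls above concrete, here is a small sketch of what such a project-level policy does to a result set. It is a plain-Python model for illustration only: it is not Ranger's API, and the `ProjectPolicy` class, the column names, and the sample rows are all hypothetical. In the EDL these controls are enforced by Ranger inside the query engine, not by application code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

Row = Dict[str, object]


@dataclass
class ProjectPolicy:
    """Hypothetical stand-in for a project's access policy."""
    masked_columns: Set[str]          # columns whose values are hidden
    row_filter: Callable[[Row], bool]  # rows the project may see


def apply_policy(rows: List[Row], policy: ProjectPolicy) -> List[Row]:
    """Drop rows the policy filters out, then mask protected columns."""
    visible = []
    for row in rows:
        if not policy.row_filter(row):
            continue  # row filtering: the project never sees this row
        visible.append(
            {
                col: ("****" if col in policy.masked_columns else val)
                for col, val in row.items()
            }
        )
    return visible


# Made-up sample data, not real survey records.
rows = [
    {"respondent_id": "R-001", "state": "MD", "income": 52000},
    {"respondent_id": "R-002", "state": "VA", "income": 61000},
]

# Assumed policy: the project sees only Maryland rows, with IDs masked.
policy = ProjectPolicy(
    masked_columns={"respondent_id"},
    row_filter=lambda row: row["state"] == "MD",
)

result = apply_policy(rows, policy)
print(result)
```

Filtering before masking means the row predicate can still reference a protected column even though its value is hidden in the output.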
Data Platform
Common Shared Services
Data Platform
Compute On Demand
Data Platform
Transient Clusters
Data Movement
DataPlane Service
Data Sharing
• S3 as first-class storage - permanent data
• Local HDFS - working data
• Hive: Data Warehouse
Data Science
• Spark
• R
• SAS
Data Lineage Traceability
Data Protection
• EBS volume encryption
• S3 server side encryption
• SSL/TLS
• Hadoop TDE
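One common way to back up the S3 server-side encryption item above is a bucket policy that rejects uploads not requesting encryption. The sketch below builds such a policy document. The bucket name is hypothetical, the choice of `AES256` (SSE-S3) rather than a KMS key is an assumption, and the presentation does not say whether the EDL uses a policy like this; the `aws-us-gov` partition is used because the EDL is described as running in GovCloud.

```python
import json


def deny_unencrypted_uploads_policy(bucket: str, partition: str = "aws-us-gov") -> dict:
    """Build a bucket policy that denies s3:PutObject requests whose
    server-side-encryption header is not AES256.

    `bucket` is a hypothetical name supplied by the caller.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedObjectUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:{partition}:s3:::{bucket}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "AES256"
                    }
                },
            }
        ],
    }


# "example-edl-bucket" is a placeholder, not a real EDL bucket.
bucket_policy = deny_unencrypted_uploads_policy("example-edl-bucket")
print(json.dumps(bucket_policy, indent=2))
```

Only the policy document is generated here; attaching it to a bucket, and choosing between SSE-S3 and SSE-KMS, would depend on the agency's key management requirements.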
THANK YOU
