HDInsight Essentials - Second Edition
About this ebook
- Learn how to quickly provision a Hadoop cluster using Windows Azure Cloud Services
- Build an end-to-end application for a big data problem using open source software
- Discover more about modern data architecture with this guide, which helps you understand the transition from the legacy relational Enterprise Data Warehouse
If you want to discover one of the latest tools designed to produce stunning Big Data insights, this book features everything you need to get to grips with your data. Whether you are a data architect, a developer, or a business strategist, HDInsight adds value in everything from development and administration to reporting.
HDInsight Essentials - Second Edition - Rajesh Nadipalli
Table of Contents
HDInsight Essentials Second Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Instant updates on new Packt books
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Hadoop and HDInsight in a Heartbeat
Data is everywhere
Business value of big data
Hadoop concepts
Brief history of Hadoop
Core components
Hadoop cluster layout
HDFS overview
Writing a file to HDFS
Reading a file from HDFS
HDFS basic commands
YARN overview
YARN application life cycle
YARN workloads
Hadoop distributions
HDInsight overview
HDInsight and Hadoop relationship
Hadoop on Windows deployment options
Microsoft Azure HDInsight Service
HDInsight Emulator
Hortonworks Data Platform (HDP) for Windows
Summary
2. Enterprise Data Lake using HDInsight
Enterprise Data Warehouse architecture
Source systems
Data warehouse
Storage
Processing
User access
Provisioning and monitoring
Data governance and security
Pain points of EDW
The next generation Hadoop-based Enterprise data architecture
Source systems
Data Lake
Storage
Processing
User access
Provisioning and monitoring
Data governance, security, and metadata
Journey to your Data Lake dream
Ingestion and organization
Transformation (rules driven)
Access, analyze, and report
Tools and technology for Hadoop ecosystem
Use case powered by Microsoft HDInsight
Problem statement
Solution
Source systems
Storage
Processing
User access
Benefits
Summary
3. HDInsight Service on Azure
Registering for an Azure account
Azure storage
Provisioning an HDInsight cluster
Cluster topology
Provisioning using Azure PowerShell
Creating a storage container
Provisioning a new HDInsight cluster
HDInsight management dashboard
Dashboard
Monitor
Configuration
Exploring clusters using the remote desktop
Running a sample MapReduce
Deleting the cluster
HDInsight Emulator for the development
Installing HDInsight Emulator
Installation verification
Using HDInsight Emulator
Summary
4. Administering Your HDInsight Cluster
Monitoring cluster health
Name Node status
The Name Node Overview page
Datanode Status
Utilities and logs
Hadoop Service Availability
YARN Application Status
Azure storage management
Configuring your storage account
Monitoring your storage account
Managing access keys
Deleting your storage account
Azure PowerShell
Access Azure Blob storage using Azure PowerShell
Summary
5. Ingest and Organize Data Lake
End-to-end Data Lake solution
Ingesting to Data Lake using HDFS command
Connecting to a Hadoop client
Getting your files on the local storage
Transferring to HDFS
Loading data to Azure Blob storage using Azure PowerShell
Loading files to Data Lake using GUI tools
Storage access keys
Storage tools
CloudXplorer
Key benefits
Registering your storage account
Uploading files to your Blob storage
Using Sqoop to move data from RDBMS to Data Lake
Key benefits
Two modes of using Sqoop
Using Sqoop to import data (SQL to Hadoop)
Organizing your Data Lake in HDFS
Managing file metadata using HCatalog
Key benefits
Using HCatalog Command Line to create tables
Summary
6. Transform Data in the Data Lake
Transformation overview
Tools for transforming data in Data Lake
HCatalog
Persisting HCatalog metastore in a SQL database
Apache Hive
Hive architecture
Starting Hive in HDInsight
Basic Hive commands
Apache Pig
Pig architecture
Starting Pig in HDInsight node
Basic Pig commands
Pig or Hive
MapReduce
The mapper code
The reducer code
The driver code
Executing MapReduce on HDInsight
Azure PowerShell for execution of Hadoop jobs
Transformation for the OTP project
Cleaning data using Pig
Executing Pig script
Registering a refined and aggregate table using Hive
Executing Hive script
Reviewing results
Other tools used for transformation
Oozie
Spark
Summary
7. Analyze and Report from Data Lake
Data access overview
Analysis using Excel and Microsoft Hive ODBC driver
Prerequisites
Step 1 – installing the Microsoft Hive ODBC driver
Step 2 – creating Hive ODBC Data Source
Step 3 – importing data to Excel
Analysis using Excel Power Query
Prerequisites
Step 1 – installing the Microsoft Power Query for Excel
Step 2 – importing Azure Blob storage data into Excel
Step 3 – analyzing data using Excel
Other BI features in Excel
PowerPivot
Power View and Power Map
Step 1 – importing Azure Blob storage data into Excel
Step 2 – launch map view
Step 3 – configure the map
Power BI Catalog
Ad hoc analysis using Hive
Other alternatives for analysis
RHadoop
Apache Giraph
Apache Mahout
Azure Machine Learning
Summary
8. HDInsight 3.1 New Features
HBase
HBase positioning in Data Lake and use cases
Provisioning HDInsight HBase cluster
Creating a sample HBase schema
Designing the airline on-time performance table
Connecting to HBase using the HBase shell
Creating an HBase table
Loading data to the HBase table
Querying data from the HBase table
HBase additional information
Storm
Storm positioning in Data Lake
Storm key concepts
Provisioning HDInsight Storm cluster
Running a sample Storm topology
Connecting to Storm using Storm shell
Running the Storm Wordcount topology
Monitoring status of the Wordcount topology
Additional information on Storm
Apache Tez
Summary
9. Strategy for a Successful Data Lake Implementation
Challenges on building a production Data Lake
The success path for a production Data Lake
Identifying the big data problem
Proof of technology for Data Lake
Form a Data Lake Center of Excellence
Executive sponsors
Data Lake consumers
Development
Operations and infrastructure
Architectural considerations
Extensible and modular
Metadata-driven solution
Integration strategy
Security
Online resources
Summary
Index
HDInsight Essentials Second Edition
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: September 2013
Second edition: January 2015
Production reference: 1200115
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78439-942-9
www.packtpub.com
Credits
Author
Rajesh Nadipalli
Reviewers
Simon Elliston Ball
Anindita Basak
Rami Vemula
Commissioning Editor
Taron Pereira
Acquisition Editor
Owen Roberts
Content Development Editor
Rohit Kumar Singh
Technical Editors
Madhuri Das
Taabish Khan
Copy Editor
Rashmi Sawant
Project Coordinator
Mary Alex
Proofreaders
Ting Baker
Ameesha Green
Indexer
Rekha Nair
Production Coordinator
Melwyn D'sa
Cover Work
Melwyn D'sa
About the Author
Rajesh Nadipalli currently manages software architecture and delivery of Zaloni's Bedrock Data Management Platform, which enables customers to quickly and easily realize true Hadoop-based Enterprise Data Lakes. Rajesh is also an instructor and a content provider for Hadoop training, including Hadoop development, Hive, Pig, and HBase. In his previous role as a senior solutions architect, he evaluated big data goals for his clients, recommended a target state architecture, and conducted proofs of concept and production implementations. His clients include Verizon, American Express, NetApp, Cisco, EMC, and UnitedHealth Group.
Prior to Zaloni, Rajesh worked for Cisco Systems for 12 years and held a technical leadership position. His key focus areas have been data management, enterprise architecture, business intelligence, data warehousing, and Extract Transform Load (ETL). He has demonstrated success by delivering scalable data management and BI solutions that empower businesses to make informed decisions.
Rajesh authored the first edition of HDInsight Essentials (Packt Publishing), released in September 2013. It was the first book in print for HDInsight and gave data architects, developers, and managers an introduction to the new Hadoop distribution from Microsoft.
He has over 18 years of IT experience. He holds an MBA from North Carolina State University and a BSc degree in Electronics and Electrical from the University of Mumbai, India.
I would like to thank my family for their unconditional love, support, and patience during the entire process.
To my friends and coworkers at Zaloni, thank you for inspiring and encouraging me.
And finally a shout-out to all the folks at Packt Publishing for being really professional.
About the Reviewers
Simon Elliston Ball is a solutions engineer at Hortonworks, where he helps a wide range of companies get the best out of Hadoop. Before that, he was the head of big data at Red Gate, creating tools to make HDInsight and Hadoop easier to work with. He has also spoken extensively on big data and NoSQL at conferences around the world.
Anindita Basak works as a big data cloud consultant and a big data Hadoop trainer and is highly enthusiastic about Microsoft Azure and HDInsight along with the Hadoop open source ecosystem. She works as a specialist for Fortune 500 brands, including cloud and big data companies in the US. She has been playing with Hadoop on Azure since the incubation phase (http://www.hadooponazure.com). Previously, she worked as a module lead for the Alten group and as a senior system analyst at Sonata Software Limited, India, in the Azure Professional Direct Delivery group of Microsoft. She worked as a senior software engineer on the implementation and migration of various enterprise applications on the Azure cloud in the healthcare, retail, and financial domains. She started her journey with Microsoft Azure in the Microsoft Cloud Integration Engineering (CIE) team and worked as a support engineer in Microsoft India (R&D) Pvt. Ltd.
With more than 6 years of experience in the Microsoft .NET technology stack, she is solely focused on big data cloud and data science. As a Most Valued Blogger, she loves to share her technical experience and expertise through her blog at http://anindita9.wordpress.com and http://anindita9.azurewebsites.net. You can find more about her on her LinkedIn page and you can follow her at @imcuteani on Twitter.
She recently worked as a technical reviewer for the books HDInsight Essentials and Microsoft Tabular Modeling Cookbook, both by Packt Publishing. She is currently working on Hadoop Essentials, also by Packt Publishing.
I