Data Platform and Analytics Foundational Training: (Speaker Name)

Uploaded by

Kathalina Suarez

Microsoft C+E Technology Training

Data Platform and Analytics Foundational Training
Solution Area: Data Analytics
Solution: Big Data
Technology: Hadoop on Azure

[Speaker Name]
What is Big Data?
The business imperative

1. Increasing data volumes
2. Increasing complexity of data and analysis
3. Changing economics and emerging technologies
A new set of questions
• SOCIAL & WEB ANALYTICS: What’s the social sentiment for my brand or products?
• LIVE DATA FEEDS: How do I optimize my fleet based on weather and traffic patterns?
• ADVANCED ANALYTICS: How do I better predict future outcomes?
What is big data?

Big data solutions deal with the complexities of:
• VOLUME (Size)
• VARIETY (Structure)
• VELOCITY (Speed)
What is big data?
[Figure: data sources plotted by data volume (megabytes, gigabytes, terabytes, petabytes) against data complexity (variety and velocity). Groups labeled: ERP/CRM, Web 2.0, Big data. Sources shown: Payables, Payroll, Inventory, Contacts, Deal tracking, Sales pipeline, Advertising, Collaboration, Mobile, eCommerce, Digital marketing, Search marketing, Web logs, Recommendations, Clickstream, Sensors/RFID/devices, Social sentiment, Wikis/blogs, Audio/video, Log files, Spatial & GPS coordinates, Data market feeds, eGov feeds, Weather, Text/image.]



Common big-data customer scenarios
• IT infrastructure optimization
• Legal discovery
• Social network analysis
• Traffic flow optimization
• Web app optimization
• Churn analysis
• Natural resource exploration
• Weather forecasting
• Healthcare outcomes
• Fraud detection
• Life sciences research
• Advertising analysis
• Equipment monitoring
• Smart meter monitoring

Store now, question later


Introducing Apache Hadoop
• Apache open source project
• Highly scalable distributed file system (HDFS)
• Distributed processing on data nodes

• Key attributes:
• Open source
• Highly scalable
• Runs on commodity hardware
• Redundant and reliable (no data loss)
• Batch processing centric, using a “Map-Reduce” processing paradigm
Hadoop is not…
• A replacement for a data warehouse
• A place to learn how to code
• A place for low-latency data
Business applications of Hadoop

Financial services
• New account risk screens
• Fraud prevention
• Trading risk
• Maximum deposit spread
• Insurance underwriting
• Accelerated loan processing

Retail
• 360° view of customer
• Analysis of brand sentiment
• Localized, personalized promotions
• Website optimization
• Optimal store layout

Telecom
• Call detail records (CDRs)
• Infrastructure investment
• Next product to buy (NPTB)
• Real-time bandwidth allocation
• New product development

Manufacturing
• Supplier consolidation
• Supply chain and logistics
• Assembly-line quality assurance
• Proactive maintenance
• Crowdsource quality assurance

Healthcare
• Genomic data for medical trials
• Patient vitals monitoring
• Reduced readmittance rates
• Storage of medical research data
• Recruitment of cohorts for pharmaceutical trials

Utilities, oil, and gas
• Smart meter-stream analysis
• Slow oil-well decline curves
• Optimized lease bidding
• Compliance reporting
• Proactive equipment repair
• Seismic image processing

Public sector
• Analysis of public sentiment
• Protected critical networks
• Fraud and waste prevention
• Crowdsource reporting for repairs to infrastructure
• Fulfillment of open records requests
Hadoop Components
Hadoop – What is it?
A highly reliable, distributed, and parallel programming framework for analyzing big data

• A Java-based, open source Apache project
• Capable of running on a variety of hardware platforms, including clusters of commodity hardware
• The Hadoop core includes:
  • A scalable, reliable file system (HDFS)
  • A framework that enables development of programs based on the MapReduce (MR) or directed acyclic graph (DAG) model
  • YARN, a distributed resource manager that allocates and controls access to cluster resources
• In addition to the core, Hadoop has a rich ecosystem that supports SQL/NoSQL, streaming, real-time, and interactive applications

[Diagram: Hadoop core, top to bottom: MapReduce and Tez (data processing framework); YARN (cluster resource manager); HDFS (redundant, reliable storage)]
Hadoop MapReduce concept
Programming framework (library and runtime) for analyzing data sets stored in HDFS

Composed of user-supplied Map and Reduce functions:
• Map(): Subdivide and conquer
• Reduce(): Combine and reduce cardinality

[Diagram: Divide large problem into sub-problems; Map(): perform same function on all sub-problems (Do work(), Do work(), Do work(), …); Reduce(): combine output from all sub-functions; Output]
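
The Map/Reduce flow on this slide can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop's API: the names map_fn, reduce_fn, and run_mapreduce, and the word-count task itself, are assumptions chosen for the example.

```python
from collections import defaultdict

def map_fn(line):
    """Map(): emit a (word, 1) pair for every word in one input split."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce(): combine all values for one key into a single result."""
    return word, sum(counts)

def run_mapreduce(lines):
    # Shuffle step: group the mapped pairs by key before reducing.
    grouped = defaultdict(list)
    for line in lines:  # each line stands in for one sub-problem
        for key, value in map_fn(line):
            grouped[key].append(value)
    # One reduce call per distinct key lowers the cardinality of the output.
    return dict(reduce_fn(key, values) for key, values in grouped.items())

if __name__ == "__main__":
    print(run_mapreduce(["big data big insight", "data at scale"]))
```

On a real cluster the map calls would run in parallel on the data nodes that hold each split; here they run in a simple loop.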
Introducing Azure HDInsight
Azure HDInsight – What is it?
A standard Apache Hadoop distribution offered as a managed service on Microsoft Azure

• Based on Hortonworks Data Platform (HDP)
• Provisioned as clusters on Azure that can run on Windows or Linux servers
• Offers capacity-on-demand, pay-as-you-go pricing model
• Integrates with:
  • Azure Blob Storage and Azure Data Lake Store for Hadoop File System (HDFS)
  • Azure Portal for management and administration
  • Visual Studio for application development tooling

In addition to the core, HDInsight supports the Hadoop ecosystem (diagram label: Hive)
HDInsight and Hadoop ecosystem
[Diagram: HDInsight and the Hadoop ecosystem.
Legend: Red = Core Hadoop; Blue = Data processing; Gray = Microsoft integration points and value adds; Orange = Data movement; Green = Packages.
Components shown: Distributed storage (HDFS); YARN; Distributed processing (MapReduce); Scripting (Pig); Query (Hive); Metadata (HCatalog); NoSQL database (HBase); Event-driven/real-time processing (Storm); Pipeline/workflow (Oozie); Graph (Pegasus); Stats processing (RHadoop); Machine learning (Mahout); Event pipeline (Event hub/Flume); Data integration (ODBC/SQOOP/REST); Relational (SQL Server); PDW PolyBase; Business intelligence (Excel, Power BI, SSAS); C#, F#, .NET; JavaScript; Monitoring & deployment (System Center); World's data (Azure Data Marketplace); Azure Storage Vault (ASV); Active Directory (Security).]
HDInsight: Built for Windows or Linux
• Managed and supported by Microsoft
• Familiarity of Windows
• Reuse of common tools, documentation, samples from Hadoop/Linux ecosystem
• Addition of Hadoop projects that were authored on Linux to HDInsight
• Easier transition from on-premises to cloud
HDInsight supports Hive
SQL-like queries on Hadoop data in HDInsight
• HDInsight provides an easy-to-use graphical query interface for Hive
• HiveQL is a SQL-like language (subset of SQL)
• Hive structures include well-understood database concepts such as tables, rows, columns, partitions
• Queries are compiled into MapReduce jobs that are executed on Hadoop
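
To make "compiled into MapReduce jobs" concrete, the sketch below walks a simple GROUP BY through map, shuffle, and reduce phases in plain Python. It is a conceptual illustration only and does not reproduce Hive's actual query plan; the table name page_views, its columns, and the sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical HiveQL query being illustrated (table and columns are made up):
#   SELECT country, COUNT(*) FROM page_views GROUP BY country;

rows = [
    {"user": "a", "country": "US"},
    {"user": "b", "country": "CO"},
    {"user": "c", "country": "US"},
]

def map_phase(rows):
    # The GROUP BY column becomes the key; COUNT(*) contributes 1 per row.
    return [(row["country"], 1) for row in rows]

def shuffle(pairs):
    # Pairs with the same key are routed to the same reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Each reducer aggregates the values collected for its key.
    return {key: sum(values) for key, values in groups.items()}

result = reduce_phase(shuffle(map_phase(rows)))
print(result)
```

The point of the sketch is the division of labor: the SQL-like statement says what to compute, and the engine decides how to split it into phases.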
Dramatic performance gains with Stinger/Tez
• Stinger is a Microsoft, Hortonworks, and OSS-driven initiative to bring interactive queries with Hive
• Brings query execution engine technology from Microsoft SQL Server to Hive
• Performance gains up to 100x

[Chart: Microsoft contribution to Apache code; sample query run times of 1400s, 44.3s (32x speedup), 35.1s (40x speedup), and 15s (100x speedup). Axis labels: Hive 10, HDP 1.3 / HDP 2.0, HDP 2.1]
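
The speedup labels on the chart can be checked with a few lines of arithmetic. This sketch assumes that 1400s is the baseline run time and that each speedup is the baseline divided by the faster run time; the slide does not state either point explicitly.

```python
# Run times as printed on the chart; treating 1400s as the baseline is an assumption.
baseline_s = 1400.0
faster_runs_s = [44.3, 35.1, 15.0]

def speedup(baseline, runtime):
    """Speedup factor = baseline run time / new run time."""
    return baseline / runtime

speedups = [speedup(baseline_s, t) for t in faster_runs_s]
for runtime, factor in zip(faster_runs_s, speedups):
    print(f"{runtime:>5.1f}s -> {factor:.1f}x")
```

Under that assumption the ratios come to roughly 32x, 40x, and 93x, so the 100x label appears to be a rounded figure.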
HDInsight supports HBase
NoSQL database on data in HDInsight
• Columnar, NoSQL database
• Runs on top of Hadoop Distributed File System (HDFS)
• Provides flexibility for new columns to be added to column families at any time

[Diagram: HMaster and Coordination at the top; a Name node with JobTracker alongside four Region servers, each above a Data node and a TaskTracker]
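
The idea of adding new columns to column families at any time can be illustrated with a toy in-memory model. This is not the HBase client API; the class ToyTable, its methods, and the sample data are hypothetical, and the rule that column families are fixed at table definition is a simplification made for the sketch.

```python
class ToyTable:
    """Toy in-memory model of an HBase-style table (not the HBase client API)."""

    def __init__(self, column_families):
        # In this sketch, column families are fixed when the table is defined.
        self.column_families = set(column_families)
        self.rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row_key, family, qualifier, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        # Any qualifier is accepted: a new column needs no schema change.
        self.rows.setdefault(row_key, {})[f"{family}:{qualifier}"] = value

    def get(self, row_key):
        return self.rows.get(row_key, {})

table = ToyTable(column_families=["profile", "metrics"])
table.put("user1", "profile", "name", "Ada")
table.put("user1", "metrics", "logins", 3)
table.put("user2", "profile", "twitter", "@ada")  # column first seen on this row
print(table.get("user2"))
```

Rows in the same table can therefore carry different sets of columns, which is the flexibility the slide refers to.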


HDInsight supports Mahout
Machine learning library
A library of machine learning algorithms to execute on data in HDFS
Algorithms are not dependent on size of data and can scale with large data sets
Library includes: collaborative filtering, classification, clustering, dimensionality reduction, topic models

HDInsight supports Storm
Stream analytics for near real-time processing
Consumes millions of real-time events from a scalable event broker (e.g., Apache Kafka, Azure Event Hub)
Performs time-sensitive computation
Outputs to persistent stores, dashboards, or devices
Customizable with Java + .NET
Deeply integrated with Visual Studio
[Diagram: event pipeline stages and examples.
Event producers: Applications, Devices, Sensors, Web and social.
Collection: Cloud gateways (web APIs), Field gateways.
Event queuing system: Apache Kafka/RabbitMQ/ActiveMQ, Event hubs.
Transformation (stream processing): Storm on HDInsight, Azure Stream Analytics.
Long-term storage: HBase, HDFS, Azure DBs, Azure Storage, Storage adapters.
Presentation and action: Web/thick client dashboards, Search and query, Data analytics (Excel), Live Dashboards, Devices to take action.]
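
The consume, compute, and output steps of a stream pipeline can be sketched with Python generators. This is only an analogy for spouts and bolts and does not use Storm's API; the function names, the device names, and the threshold of 90 are hypothetical.

```python
from collections import Counter

def sensor_spout(events):
    """Stand-in for a spout: emits one event (tuple) at a time."""
    for event in events:
        yield event

def threshold_bolt(stream, limit):
    """Stand-in for a bolt: keeps only readings above a limit."""
    for device, reading in stream:
        if reading > limit:
            yield device, reading

def count_bolt(stream):
    """Stand-in for a terminal bolt: counts alerts per device."""
    counts = Counter()
    for device, _ in stream:
        counts[device] += 1
    return dict(counts)

events = [("pump-1", 71), ("pump-2", 95), ("pump-1", 99), ("pump-2", 97)]
alerts = count_bolt(threshold_bolt(sensor_spout(events), limit=90))
print(alerts)
```

Each stage processes events one at a time as they arrive, which is what separates this style from the batch-oriented MapReduce model.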
HDInsight supports Spark
In-memory processing on multiple workloads
Single execution model for multiple tasks (SQL query, Spark Streaming, machine learning, and graph)
Up to 100x faster processing
Developer friendly (Java, Python, Scala)
BI tool of choice (Power BI, Tableau, Qlik, SAP)
Notebook experience (Jupyter/iPython, Zeppelin)

[Diagram: Spark SQL; Spark Streaming; Machine Learning (MLlib); Graph (GraphX)]
Microsoft makes Hadoop easier
Deep Visual Studio integration
Debug Hive jobs through YARN logs or troubleshoot Storm topologies
Visualize Hadoop clusters, tables, and storage
Submit Hive queries and Storm topologies (C# or Java spouts/bolts)
IntelliSense
Azure HDInsight
Positioning
Why Microsoft Azure?
[Diagram: Azure data services (ML, Search, Data Factory, Event Hubs, Database, HDInsight, Stream Analytics, DocumentDB, Azure Storage) alongside on-premises Hadoop (appliances, software)]

Azure facts
• >4 trillion objects in Azure
• 300,000-1M+ requests per second
• Compute and storage double every 6 months
No hardware challenges
HDInsight in the cloud bypasses hardware costs
• Hardware acquisition
• Hardware maintenance
• Performance tuning

HDInsight in the cloud bypasses capacity planning
• Spin up any number of Hadoop nodes on demand
• Go from tens to thousands of nodes

[Graphic labels: $0; No HW costs; Unlimited scale]
Mission-critical, enterprise-ready
Managed Hadoop, backed by SLA
• Three nines of availability: 99.9% uptime

HDInsight auto-replicates data
• Automatic geo-replication of data
• Data replicates only within the same geopolitical boundary (e.g., country or region)

[Graphic label: Mission-critical Hadoop]
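
To put "three nines" in concrete terms, the sketch below converts an availability percentage into the maximum downtime it allows. The 30-day month and 365-day year are assumptions made for the arithmetic; how the SLA actually measures uptime is not stated on the slide.

```python
from decimal import Decimal

def allowed_downtime_minutes(sla_percent, period_minutes):
    """Maximum downtime that still meets the given availability percentage."""
    unavailable_fraction = (Decimal(100) - Decimal(sla_percent)) / Decimal(100)
    return Decimal(period_minutes) * unavailable_fraction

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # assumption: a 30-day month
MINUTES_PER_YEAR = 365 * 24 * 60         # assumption: a 365-day year

monthly = allowed_downtime_minutes("99.9", MINUTES_PER_30_DAY_MONTH)
yearly = allowed_downtime_minutes("99.9", MINUTES_PER_YEAR)
print(f"per month: {monthly} min, per year: {yearly / 60} h")
```

Under those assumptions, 99.9% uptime allows about 43.2 minutes of downtime per month, or about 8.76 hours per year.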
Maintenance done for you
Minimal IT resources for upgrades/patching
• HDInsight adds latest version of Hadoop for you
• OS patching and security updates done automatically

Minimal IT resources to update Hadoop versions
• Hadoop versions are rapidly releasing throughout year
• Always be on latest version of Hadoop, without effort

Release timeline:
• HDInsight on Hadoop 1.1.2: Oct 2013
• HDInsight on Hadoop 2.2: April 2014
• HDInsight on Hadoop 2.4: June 2014

[Graphic labels: O/S upgrades; O/S patching]
Low cost
HDInsight is billed by usage
• Billed for usage
• Clusters can be deleted when no longer used

No additional price for support
• Azure Support includes Hadoop support
• What usually costs thousands of dollars per node is included
© 2016 Microsoft Corporation. All rights reserved. Microsoft, Windows, Microsoft Azure, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The
information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market
conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT
MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION
