Data Platform and Analytics Foundational Training: (Speaker Name)
What is Big Data?
The business imperative
1. Increasing data volumes
2. Increasing complexity of data and analysis
3. Changing economics and emerging technologies
A new set of questions
Social & web analytics: What’s the social sentiment for my brand or products?
Advanced analytics: How do I better predict future outcomes?
How do I optimize my fleet based on weather and traffic patterns?
What is big data?
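[Figure: data sources plotted by volume, rising from megabytes through gigabytes to terabytes]
Volume scale: Megabytes, Gigabytes, Terabytes
Source categories: ERP/CRM, Web 2.0
Example data: payables, payroll, inventory, contacts, deal tracking, sales pipeline, advertising, collaboration, eCommerce, digital marketing, search marketing, recommendations, web logs, log files, mobile, spatial & GPS coordinates, data market feeds, eGov feeds, weather, text/image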
• Key attributes:
• Open source
• Highly scalable
• Runs on commodity hardware
• Redundant and reliable (no data loss)
• Batch processing centric—using a “Map-Reduce” processing paradigm
Hadoop is not…
• A replacement for data warehouse
• A place to learn how to code
• A place for low-latency data
Business applications of Hadoop
Financial services: New account risk screens; Fraud prevention; Trading risk; Maximum deposit spread; Insurance underwriting; Accelerated loan processing
Retail: 360° view of customer; Analysis of brand sentiment; Localized, personalized promotions; Website optimization; Optimal store layout
Telecom: Call detail records (CDRs); Infrastructure investment; Next product to buy (NPTB); Real-time bandwidth allocation; New product development
Manufacturing: Supplier consolidation; Supply chain and logistics; Assembly-line quality assurance; Proactive maintenance; Crowdsource quality assurance
Composed of user-supplied Map and Reduce functions:
• Map(): Subdivide and conquer
• Reduce(): Combine and reduce cardinality
[Diagram: parallel Do work() tasks run side by side; Reduce() combines the output from all sub-functions into a single Output]
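The Map and Reduce roles can be illustrated with a minimal word-count sketch in plain Python. It runs in a single process and only mimics the paradigm; it is not the Hadoop API, and the function names (map_words, shuffle, reduce_counts, run_job) are hypothetical names chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce(): collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    # Each line could be mapped on a different machine; here it is sequential.
    mapped = chain.from_iterable(map_words(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


word_counts = run_job(["big data big insight", "data at scale"])
```

Map() splits the work into independent pieces (one per line), while Reduce() lowers cardinality by turning many (word, 1) pairs into one count per word.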
Introducing Azure HDInsight
Azure HDInsight – What is it?
A standard Apache Hadoop distribution offered as a managed service on Microsoft Azure
[Diagram: HDInsight architecture, with components color-coded (red, blue, gray, orange, green) into five groups: Core Hadoop, Data processing, Microsoft integration points and value adds, Data movement, Packages]
Component labels recoverable from the diagram:
• Distributed storage (HDFS), YARN, Distributed processing (MapReduce)
• Scripting (Pig), Query (Hive), NoSQL database (HBase), Event-driven / real-time processing (Storm)
• Metadata (HCatalog), Pipeline/workflow (Oozie), Graph processing (Pegasus), Machine learning (Mahout), RHadoop
• Data integration (ODBC/SQOOP/REST), Event pipeline (Flume)
• Relational (SQL Server), PDW PolyBase, Business Intelligence (SSAS), .NET, C#, F#, JavaScript
• Azure Storage Vault (ASV), Active Directory (Security), Monitoring & deployment (System Center), World's data (Azure Data Marketplace)
HDInsight: Built for Windows or Linux
Managed and supported by Microsoft
Familiarity of Windows
Reuse of common tools, documentation, samples from Hadoop/Linux ecosystem
Addition of Hadoop projects that were authored on Linux to HDInsight
Easier transition from on-premises to cloud
HDInsight supports Hive
SQL-like queries on Hadoop data in HDInsight
HDInsight provides easy-to-use graphical query interface for Hive
HiveQL is a SQL-like language (supports a subset of SQL, plus Hive-specific extensions)
Hive structures include well-understood database concepts such as tables, rows, columns, partitions
Hive queries are compiled into MapReduce jobs that are executed on Hadoop
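To make the link between a SQL-like query and map/reduce steps concrete, here is a conceptual sketch in plain Python. It is not how Hive's planner is implemented and it uses no Hive API; the sales table, its year partitions, and the total_by_region function are all hypothetical, invented for this example.

```python
from collections import defaultdict

# Hypothetical "sales" table partitioned by year; each row is (region, amount).
sales_partitions = {
    2013: [("east", 100), ("west", 50)],
    2014: [("east", 70), ("west", 30), ("east", 5)],
}


def total_by_region(partitions, year):
    """Rough equivalent of the query:

    SELECT region, SUM(amount) FROM sales WHERE year = <year> GROUP BY region
    """
    # Partition pruning: only the partition matching the filter is read.
    rows = partitions.get(year, [])
    # Map step: emit a (key, value) pair per row, keyed by the GROUP BY column.
    mapped = ((region, amount) for region, amount in rows)
    # Shuffle and reduce steps: sum the values that share a key.
    totals = defaultdict(int)
    for region, amount in mapped:
        totals[region] += amount
    return dict(totals)


totals_2014 = total_by_region(sales_partitions, 2014)
```

The GROUP BY column plays the role of the map output key, and the aggregate (SUM) plays the role of the reduce function.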
Dramatic performance gains with Stinger/Tez
Stinger is a Microsoft, Hortonworks, and OSS-driven initiative to bring interactive queries to Hive
Brings query execution engine technology from Microsoft SQL Server to Hive
Performance gains up to 100x
[Chart: query times of 1400s, 44.3s, 35.1s, and 15s, annotated with 32x, 40x, and 100x speedups; series labels: Hive 10, HDP 1.3 / HDP 2.0, HDP 2.1]
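The speedup labels can be checked against the timings with simple arithmetic. The sketch below assumes that 1400s is the baseline run and that the other three timings are the improved runs; the chart itself does not survive in this text, so that pairing is an assumption.

```python
# Timings taken from the chart; treating 1400s as the baseline is an assumption.
baseline_seconds = 1400.0
improved_seconds = [44.3, 35.1, 15.0]


def speedup(baseline, improved):
    """Speedup factor = baseline time divided by improved time."""
    return baseline / improved


speedups = [round(speedup(baseline_seconds, t), 1) for t in improved_seconds]
```

Under that assumption the first two ratios come out close to the 32x and 40x labels, while the third is nearer 93x, which the slide appears to round up to 100x.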
HDInsight supports HBase
NoSQL database on data in HDInsight
Columnar, NoSQL database
Runs on top of Hadoop Distributed File System (HDFS)
Provides flexibility for new columns to be added to column families at any time
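The column-family idea can be pictured with a toy in-memory model. This is a sketch of the data model only, not the HBase client API; the ToyTable class and its put and get methods are hypothetical names made up for this example.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a wide-column table; not the HBase client API."""

    def __init__(self, column_families):
        # Column families are declared up front, when the table is created.
        self.column_families = set(column_families)
        # Layout: row key -> column family -> column qualifier -> value.
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        # Any new column qualifier is accepted without a schema change.
        self.rows[row_key][family][column] = value

    def get(self, row_key, family):
        return dict(self.rows[row_key][family])


table = ToyTable(column_families=["profile"])
table.put("user1", "profile", "name", "Ada")
table.put("user2", "profile", "name", "Lin")
table.put("user2", "profile", "twitter", "@lin")  # new column, only on this row
```

Rows in the same family can carry different columns, which is what lets new columns appear at any time without altering existing rows.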
[Diagram: HBase cluster layout showing HMaster, Coordination, Name node, JobTracker, and four Region servers]
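[Diagram: event pipeline for Apache Storm on HDInsight. Sources (applications, field gateways, web and social) feed message queues (Kafka/RabbitMQ/ActiveMQ); Storm processes the events and writes through storage adapters to HBase and HDFS; results reach web/thick client dashboards and devices to take action]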
HDInsight supports Spark
In-memory processing on multiple workloads
Single execution model for multiple tasks (SQL Query, Spark Streaming, Machine Learning, and Graph)
Up to 100x faster processing than MapReduce for in-memory workloads
Developer friendly (Java, Python, Scala)
BI tool of choice (Power BI, Tableau, Qlik, SAP)
Notebook experience (Jupyter/IPython, Zeppelin)
Azure Storage
Mission-critical Hadoop
Maintenance done for you
Minimal IT resources for upgrades/patching
HDInsight adds latest version of Hadoop for you
OS patching and security updates done automatically
[Timeline: Hadoop version and O/S upgrades, with markers at Oct 2013 and June 2014 (HDInsight on Hadoop 2.4)]
New Hadoop versions are released rapidly throughout the year
Always be on the latest version of Hadoop, without effort
Low cost
HDInsight is billed by usage
Clusters can be deleted when no longer used
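A rough worked example shows why deleting idle clusters matters under usage-based billing. Every number below is hypothetical: the rate is not an actual Azure price, no currency is implied, and storage or other charges are ignored. The cluster_cost function is a name invented for this sketch.

```python
def cluster_cost(nodes, hours, rate_per_node_hour):
    """Usage-based cost: nodes x hours running x hourly rate per node."""
    return nodes * hours * rate_per_node_hour


HOURLY_RATE = 0.50  # hypothetical rate per node-hour, not a real price
NODES = 8           # hypothetical cluster size

# Cluster left running all day for a 30-day month.
always_on = cluster_cost(NODES, 24 * 30, HOURLY_RATE)

# Cluster created for a 4-hour daily job and deleted afterwards.
on_demand = cluster_cost(NODES, 4 * 30, HOURLY_RATE)

savings = always_on - on_demand
```

With these made-up figures, running the cluster only while jobs execute costs one sixth of keeping it up around the clock.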