HCIA-Big Data V3.0 Training Material
Contents
1. Chapter 1 Big Data Development Trend and Kunpeng Big Data Solution ∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙ 4
7. Chapter 7 Flink, Stream and Batch Processing in a Single Engine ∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙ 274
1 Huawei Confidential
2 Huawei Confidential
Chapter 1 Big Data Development Trend
and Kunpeng Big Data Solution
Foreword
This chapter consists of two parts.
The first part mainly describes what big data is and the opportunities and challenges we face in the
age of big data.
To keep up with the trend and help partners improve their computing and data governance
capabilities during intelligent transformation, Huawei has proposed the Kunpeng ecosystem
strategy. The second part therefore describes the Huawei Kunpeng Big Data Solution, including
Kunpeng servers based on the Kunpeng chipset and HUAWEI CLOUD Kunpeng cloud services. In
addition, this part briefly describes the common public cloud services related to big data and data
analysis and processing in HUAWEI CLOUD Stack 8.0, and introduces the advantages and
application scenarios of HUAWEI CLOUD MRS.
4 Huawei Confidential
Objectives
5 Huawei Confidential
Contents
6 Huawei Confidential
Ushering in the 4th Industrial Revolution and Embracing
the Intelligent Era
Age of Intelligence (?): cloud computing, big data, IoT, and AI
Information Age (1940s - 2010s): computer and communication
Electricity Age (1860s - 1920s, United States): electric power
Steam Age (1760s - 1840s, originating in the United Kingdom and spreading across Europe and America): steam power
7 Huawei Confidential
Moving From Data Management to Data Operations
120 million data labels every day
50 years of oilfield data analytics
450,000 collisions of data every day
10 Huawei Confidential
Everything Is Data and Data Is Everything
11 Huawei Confidential
Big Data Era
Definition on Wikipedia:
Big data refers to data sets with sizes beyond the ability of commonly used software tools to
capture, curate, manage, and process data within a tolerable elapsed time.
The 4Vs of big data: Volume, Velocity, Variety, and Value
12 Huawei Confidential
Big Data Processing vs. Traditional Data Processing
From databases to big data: "fishing in a pond" vs. "fishing in the ocean"
(the "fish" is the data to be processed)
13 Huawei Confidential
Contents
15 Huawei Confidential
Big Data Era Leading the Future
Data has penetrated every industry and business domain.
Discerning the essence of services, forecasting trends, and guiding the future are at the core of
the big data era.
With a clear future target, seize every opportunity to harness big data and secure future success.
16 Huawei Confidential
Application Scenarios of Enterprise - Level Big Data
Platforms (1)
Operations: performance analytics, telecom signaling, financial sub-ledgers, financial documents, power distribution, smart grid
Management: operational management, report analysis, historical data analysis, social security analysis, tax analysis, decision-making support and prediction
Profession: audio and video, seismic exploration, meteorological cloud charts, satellite remote sensing, radar data, IoT
Driven by strong demand for data analytics from telecom carriers, financial institutions, and
governments, the Internet industry has adopted new technologies to process big data of low value density.
17 Huawei Confidential
Application Scenarios of Enterprise - Level Big Data
Platforms (2)
Marketing analysis, customer analysis, and internal operational management are the top three application scenarios of enterprise big data, followed by supply chain management and others.
18 Huawei Confidential
Big Data Market Analytics
It is predicted that the overall scale of the big data industry will exceed CNY 1 trillion by the end of 2020, in
which industry-specific solutions and big data applications account for the largest proportion.
Figures (original slide): overview of the big data industry in China (overall scale of the big data industry in CNY 100 million, with growth rate); market scale proportion of big data market segments.
19 Huawei Confidential
Big Data Application Scenarios - Finance
Importance of data mining: financial services are shifting from conventional customers and institutions to new, data-driven ones.
Conventional customers: receive services at a fixed location and time; passively receive data; trust market information; passively receive propagation.
New customers: obtain service analysis and create data anytime and anywhere; seek meaningful experiences; review the details; take part in creating content, products, and experiences.
Conventional financial institutions: offer standard industry services; focus on processes and procedures; passively receive information from a single source; reach customers through customer managers; interact with customers in fixed channels.
New financial institutions: customer operations, omni-channel reach, scenario-focused analytics, efficiency, customer marketing, and flexible personalized services.
20 Huawei Confidential
Big Data Application Scenarios - Education
Big data analytics has now been widely applied to the education field.
Dimensions analyzed include: average time for answering each question, sequence of questions answered during exams, academic records, hand-raising times in class, literacy, and homework correctness.
21 Huawei Confidential
Big Data Application Scenarios - Traffic Planning
Traffic planning: multi-dimensional analysis of crowds. Areas where the people flow has exceeded the specified threshold:
North gate of the Workers' Stadium: > 500 people/hour
Sanlitun: > 800 people/hour
Beijing Workers' Stadium: > 800 people/hour
24 Huawei Confidential
Big Data Application Scenarios - Clean Energy
Clean energy powered China's Qinghai province for nine consecutive days.
Coal consumption down 800,000 tons
CO2 emissions down 1.44 million tons
Data sources (original figure): data surveys, offline data, real-time data, and data centers
25 Huawei Confidential
Contents
26 Huawei Confidential
I/O-intensive Tasks
I/O-intensive tasks are tasks that involve network, disk, memory, and other I/O operations.
Characteristics: CPU usage is low, and most latency is caused by waiting for I/O
(CPU and memory computing is far quicker than I/O processing).
Running more I/O-intensive tasks concurrently raises CPU utilization, but only up to a
limit. Most applications, such as web applications, are I/O-intensive.
During the execution of I/O-intensive tasks, 99% of the time is spent waiting for I/O.
Therefore, the top priority is to improve network transmission and read/write efficiency.
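The benefit of overlapping I/O waits can be made concrete with a short sketch (illustrative Python only; `fetch_page` and its `time.sleep` are stand-ins for real network or disk reads, not part of any library):

```python
# Sketch: overlapping I/O waits with a thread pool. fetch_page and its
# time.sleep are stand-ins for real network or disk reads.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_page(url: str) -> str:
    time.sleep(0.05)               # simulated I/O wait; the CPU is idle here
    return f"content of {url}"

def fetch_all(urls):
    # Far more threads than cores is fine for I/O-bound work: a thread
    # blocked on I/O costs almost no CPU.
    with ThreadPoolExecutor(max_workers=32) as pool:
        return list(pool.map(fetch_page, urls))

urls = [f"http://example.com/{i}" for i in range(16)]
start = time.perf_counter()
pages = fetch_all(urls)
elapsed = time.perf_counter() - start
# The sixteen 50 ms waits overlap, so the wall time stays well below the
# 0.8 s a serial loop would need.
```

Because the threads spend almost all their time blocked, the scheduler can keep many of them in flight at once, which is exactly why I/O-intensive workloads tolerate far more concurrent tasks than there are cores.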
27 Huawei Confidential
CPU-intensive Tasks
Characteristics: a large number of computations are performed, such as calculating pi or
decoding HD video, which consumes CPU resources.
CPU-intensive tasks can be completed in parallel. However, the more tasks there are, the
more time is spent switching between them, and the less efficient task processing on the
CPU becomes. Hence, to get the best CPU performance, keep the number of parallel
CPU-intensive tasks equal to the number of CPU cores.
CPU-intensive tasks mainly consume CPU resources. Therefore, the running efficiency of
the code is critically important.
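The advice to match worker count to core count can be sketched as follows (illustrative Python; `count_primes` is a stand-in for any pure computation, and the "fork" start method is assumed so the sketch runs without a `__main__` guard on Linux):

```python
# Sketch: sizing a process pool to the CPU core count for CPU-bound work.
# count_primes is a stand-in for any pure computation; the "fork" start
# method is used so this sketch runs without a __main__ guard (Linux/Unix).
import os
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def count_primes(limit: int) -> int:
    """Count primes below `limit` by trial division (deliberately CPU-heavy)."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

def count_primes_parallel(limits):
    # One worker per core: extra processes would only add switching overhead
    # for CPU-bound tasks.
    ctx = mp.get_context("fork")
    with ProcessPoolExecutor(max_workers=os.cpu_count(), mp_context=ctx) as pool:
        return list(pool.map(count_primes, limits))

results = count_primes_parallel([1_000, 2_000, 5_000])
```

Raising `max_workers` beyond `os.cpu_count()` would not make this finish faster; it would only add context-switch overhead, which is the point the slide makes.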
28 Huawei Confidential
Data-intensive Tasks
Unlike CPU-intensive applications where a single computing task occupies a large
number of computing nodes, data-intensive applications have the following
characteristics:
A large number of independent data analysis and processing tasks run on different nodes of a
loosely coupled computer cluster system.
High I/O throughput is required by massive volumes of data.
Most data-intensive applications have a data-flow-driven process.
Stream computing
Allows you to calculate and process stream data in real time. Major technologies: Spark, Storm, Flink,
Flume, and DStream
Graph computing
Allows you to process large volumes of graph structure data. Major technologies: GraphX, Gelly, Giraph,
and PowerGraph
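A minimal, single-machine sketch of the stream-computing idea (plain Python, not Flink or Spark code): records are processed one at a time as they arrive, and results are emitted per time window, so the whole stream is never held in memory.

```python
# Sketch: tumbling-window word counting over an event stream. Records are
# consumed one at a time; only the current window's state is kept in memory.
# (Real engines such as Flink or Spark Streaming add distribution, state
# backends, and fault tolerance on top of this idea.)
from collections import Counter

def windowed_word_count(events, window_size):
    """Yield (window_start, Counter) for each tumbling window of
    `window_size` time units over (timestamp, word) events."""
    current_window, counts = None, Counter()
    for ts, word in events:
        window = ts - ts % window_size
        if current_window is not None and window != current_window:
            yield current_window, counts       # window closed: emit its result
            counts = Counter()
        current_window = window
        counts[word] += 1
    if current_window is not None:
        yield current_window, counts           # flush the last open window

events = [(1, "a"), (2, "b"), (3, "a"), (11, "a"), (12, "c")]
results = list(windowed_word_count(events, window_size=10))
```

The generator never materializes the full event list internally, which mirrors how stream engines bound their state to the open windows rather than to the size of the stream.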
31 Huawei Confidential
Hadoop Big Data Ecosystem
Ambari (installation and deployment tool)
Oozie (workflow scheduling system)
Sqoop (ETL tool)
ZooKeeper (distributed coordination service)
HBase (distributed column-oriented database)
Pig (data analysis platform)
Hive (data warehouse)
Mahout (machine learning)
Yarn (unified resource allocation manager)
Flume (log collection)
HDFS (distributed file management system)
33 Huawei Confidential
Contents
34 Huawei Confidential
Traditional Data Processing Is Facing Great Challenges
35 Huawei Confidential
Challenge 1: Few Business Departments Have Clear Big
Data Requirements
Many enterprise business departments have no
idea of the value and application scenarios of
big data, and therefore have no accurate
requirements for it. In addition, enterprise
decision-makers worry that establishing a big
data department may yield little profit, and
some even delete large amounts of historical
data with potential value.
36 Huawei Confidential
Challenge 2: Data Silos Within Enterprises
The greatest challenge for enterprises to implement the big data strategy is that
different types of data are scattered in different departments. As a result, data
in the same enterprise cannot be efficiently shared, and the value of big data
cannot be brought into full play.
37 Huawei Confidential
Challenge 3: Poor Data Availability and Quality
Many large and medium-sized enterprises
generate large volumes of data every day.
However, many of them fail to pay enough
attention to data preprocessing. As a result,
data is not processed in a standard way. In
the big data preprocessing phase, data needs
to be extracted and converted into types
that can be easily processed, and cleaned by
removing noisy data. According to Sybase, a
10% improvement in the availability of
high-quality data would improve enterprise
profits by 20%.
38 Huawei Confidential
Challenge 4: Unsatisfactory Data Management
Technologies and Architecture
Traditional databases are not suitable for processing PB-scale data.
It is difficult for traditional database systems to process semi-structured and
unstructured data.
The O&M of massive volumes of data requires data stability, high concurrency,
and server load reduction.
39 Huawei Confidential
Challenge 5: Data Security Risks
The rapid spread of the Internet increases the chance of breaching individuals' privacy
and also gives rise to crime methods that are difficult to track and prevent.
Ensuring user information security is a key issue in the big data era. In addition, the
growing amount of big data poses higher requirements on the physical security of
data storage, and therefore on multi-copy and disaster recovery (DR) mechanisms.
40 Huawei Confidential
Challenge 6: Lack of Big Data Talent
Each step of big data construction must be completed by professionals. It is therefore
necessary to develop professional teams that understand big data, know administration
well, and have experience in big data applications. Hundreds of thousands of big
data-related jobs are added each year around the world, and the talent gap is expected
to exceed 1 million. Universities and enterprises are therefore making joint efforts to
explore and develop big data talent.
41 Huawei Confidential
Challenge 7: Trade-off Between Data Openness and
Privacy
As big data applications become increasingly important, data resource openness and sharing have
become the key to maintaining advantages against competitors. However, opening up data
inevitably risks exposing some users' private information. Hence, it is a major challenge in this big
data era to effectively protect citizens' and enterprises' privacy while promoting data openness,
application, and sharing, and gradually strengthen privacy legislation.
42 Huawei Confidential
Standing Out in the Competition Using Big Data
Big data can bring huge commercial value, and it is believed it will spark a revolution
comparable to the computer revolution of the 20th century. Big data is already reshaping
commerce, economics, and other fields: it opens a new blue ocean, hastens the creation
of new economic growth points, and has become a focus of competition among
enterprises.
43 Huawei Confidential
Opportunity 1: Big Data Mining Becomes the Core of
Business Analysis
The focus of big data has gradually shifted from storage and transmission to
data mining and application, which will have a profound impact on enterprises'
business models. Big data can directly bring enterprises profits and
incomparable competitive advantages through positive feedback.
On the one hand, big data technologies can effectively help enterprises integrate,
mine, and analyze the large volumes of data they have, build a systematic data
system, and improve the enterprise structure and management mechanism.
On the other hand, with the increase of personalized requirements of consumers, big
data is gradually applied in a wide range of fields and is shifting the development
paths and business models of most enterprises.
44 Huawei Confidential
Opportunity 2: Big Data Bolsters the Application of
Information Technologies
Big data processing and analysis underpins the application of next-generation
information technologies.
Mobile Internet, IoT, social networking, digital home, and e-commerce are the application
forms of next-generation information technologies. These applications continuously
aggregate the information they generate. Processing, analyzing, and optimizing data from
different sources in a unified, comprehensive manner, then feeding the results back into
the applications, further improves user experience and creates huge business, economic,
and social value.
Big data has the power to drive social transformation. However, unleashing this power
requires stricter data governance, insightful data analysis, and an environment that
stimulates management innovation.
45 Huawei Confidential
Opportunity 3: Big Data Is a New Engine for Continuous
Growth of the Information Industry
Big data, with its tremendous business value and market demands, becomes a new engine that
drives the continuous growth of the information industry.
With the increasing recognition of big data’s value by industry users, market requirements will burst, and
new technologies, products, services, and business models will emerge continuously in the big data
market.
Big data drives the development of a new market with high growth for the information industry: In the
field of hardware and integrated devices, big data faces challenges such as effective storage, fast
read/write, and real-time analysis. This will have an important impact on the chip and storage industry
and also give rise to the market of integrated data storage and processing servers and in-memory
computing. In the software and service field, the value of big data brings urgent requirements for quick
data processing and analysis, which will lead to unprecedented prosperity of the data mining and business
intelligence markets.
46 Huawei Confidential
Quiz
1. Where is big data from? What are the characteristics of big data?
47 Huawei Confidential
Contents
48 Huawei Confidential
Internet of Everything - Massive Volumes of Data
Requires High Computing Power
Smart mobile devices are replacing traditional PCs, and the world is moving towards an
era in which all things are connected.
The number of connected devices worldwide exceeded 23 billion in 2018.
Access from intelligent terminals: 1.6 billion devices in 2018, up 1.2% and growing continuously.
Access using PCs: 250 million devices in 2018, down 1.3% and declining for 7 years in a row.
IoT scenarios include industrial sensors, smart energy, safe city, smart home, autonomous
driving, and data centers.
The transition from traditional PCs to intelligent mobile terminals and the massive volumes
of data they generate create demand for new computing power.
49 Huawei Confidential
Application and Data Diversity Requires a New
Computing Architecture
Examples of diverse applications and data: smartphones, texts, and videos.
50 Huawei Confidential
Over Trillions of Dollars of Computing Market Space
As the ICT industry landscape is being reshaped by new applications, technologies, computing architectures, tens of billions of
connections, and explosive data growth, a new computing industry chain is taking shape, creating new vendors and hardware
and software systems:
Hardware: servers, components, and enterprise storage devices
Software: operating systems, virtualization software, databases, middleware, big data platforms, enterprise application software, cloud
services, and data center management services
Market sizes (from the original figure): servers 112.1 billion; enterprise storage 31.1 billion;
infrastructure software 15.25 billion; databases 56.9 billion; middleware 43.4 billion;
enterprise application software 402 billion; public cloud IaaS 141 billion; big data
platforms 41 billion; DC management services 159.5 billion.
51 Huawei Confidential
Advantages of the Kunpeng Computing Industry
Advantages:
1. Industry applications incubated and optimized in the Chinese market contribute to a healthy global industry chain.
2. The ecosystem shared by Kunpeng and Arm accelerates development.
52 Huawei Confidential
Overall Architecture of the Kunpeng Computing Industry
Based on Kunpeng processors, the Kunpeng computing industry covers full-stack IT infrastructure, industry applications, and
services, including PCs, servers, storage devices, OSs, middleware, virtualization software, databases, cloud services, and
consulting and management services.
The stack (from the original figure) includes, from top to bottom: cloud services (Kunpeng ECS, Kunpeng BMS, Kunpeng containers, Kunpeng RDS, Kunpeng DWS, and more); databases and middleware; OSs (compatible with different OSs); and Kunpeng processors.
53 Huawei Confidential
Typical Applications
Driven by technologies such as 5G, AI, cloud computing, and big data, various industries have raised such requirements for
computing platforms as device-cloud synergy, intelligent processing of massive volumes of diverse data, and real-time
analysis. The powerful computing base provided by Kunpeng processors will play an important role in the digital
transformation of industries.
54 Huawei Confidential
Panorama of Kunpeng Computing Industry Ecosystems
Technical ecosystem: The Kunpeng computing platform is an open technological ecosystem
that is compatible with mainstream operating systems, databases, and middleware.
Collaboration with universities: Talent for the computing industry is continuously developed
through university-enterprise collaboration in various forms:
• University-enterprise joint courses
• University-enterprise joint publications
• Training centers in universities
• Huawei ICT Job Fair
Developer ecosystem: The Kunpeng computing platform encourages developers to develop
and innovate services based on the platform:
• Kunpeng Developer Contest
• Kunpeng online courses/cloud lab
• Kunpeng career certification
Industry ecosystem: Huawei collaborates with partners and customers on industry solutions
for government, finance, gaming, media and entertainment, and carriers.
Community building: The Kunpeng community provides customers, partners, and developers
with abundant resources, and an open and equal space for technological exchange.
Partner ecosystem: The Kunpeng Partner Program provides partners with comprehensive
support in training, technology, and marketing.
55 Huawei Confidential
Build the Computing Capability of the Entire System
Based on Huawei Kunpeng Processors
Efficient computing: Arm-compliant high-performance Kunpeng processors, TaiShan servers,
and solutions efficiently improve the computing capabilities of data centers.
Safe and reliable: Kunpeng processors use Huawei-developed cores, and TaiShan servers use
Huawei-developed computing chips; 17 years of computing innovation guarantee high quality.
Open ecosystem: The open platform supports mainstream hardware and software, building a
Kunpeng ecosystem and establishing a new smart computing base with developers, partners,
and other industry organizations.
Processor timeline (from the original figure):
1991: first ASIC chip for transmission networks
2005: first Arm-based base station chip
2009: K3, first Arm-based mobile device CPU
2014: Hi1612 (Kunpeng 912), first Arm-based 64-bit CPU
2016: Huawei Kunpeng 916, first multi-socket Arm CPU (powering TaiShan 100)
2019: Huawei Kunpeng 920, first 7 nm data center CPU (powering TaiShan 200)
2021: Huawei Kunpeng 930
2023: Huawei Kunpeng 950
56 Huawei Confidential
OSs Compatible with the Kunpeng Ecosystem
OSs compatible with the Kunpeng ecosystem include both community and commercial editions, covering China-made and non-China-made OSs.
57 Huawei Confidential
Overview of HUAWEI CLOUD Kunpeng Cloud Services
Based on Kunpeng processors and other diverse infrastructures, HUAWEI CLOUD Kunpeng cloud services cover bare metal servers
(BMSs), virtual machines, and containers. The services feature multi-core and high concurrency and are suitable for scenarios
such as AI, big data, HPC, cloud phone, and cloud gaming.
The full stack (from the original figure):
Business development: finance, large enterprises, Internet, and multiple other scenarios across industries
Solutions: enterprise dedicated cloud, HPC, big data, AI, and native applications
Full-stack Kunpeng cloud services: Kunpeng BMS, ECS, EVS, OBS, SFS, CCE, CCI, VPC, ELB, NAT Gateway, and more
Core infrastructure: Kunpeng 920 high-performance CPU, Hi181x storage controller, Hi182x server management controller, Hi171x high-performance network chip, and Huawei Ascend 310/910 AI processors
58 Huawei Confidential
HUAWEI CLOUD Kunpeng Cloud Services Support a Wide Range of Scenarios
Transaction processing: OLTP, web servers, app services, SAP
Big data analytics: OLAP, offline analysis, AI inference
Databases: MySQL, Redis, Oracle
Scientific computing: CAE/CFD, CAD/EDA, molecular dynamics, defense/security
Cloud services: front-end web, data cache, mobile office
Storage: block storage, object storage
Mobile native apps: cloud gaming, game development and testing
59 Huawei Confidential
Contents
60 Huawei Confidential
Huawei Big Data Solution
Kunpeng Big Data Solution
Huawei's secure and controllable Kunpeng Big Data Solution provides one-stop high-performance big
data computing and data security capabilities. This solution aims to resolve basic problems such as data
security, efficiency, and energy consumption during intelligent big data construction in the public safety
industry.
BigData Pro
This solution adopts the public cloud architecture with storage-compute decoupling, ultimate scalability,
and highest possible efficiency. The highly scalable Kunpeng computing power is used as the computing
resource and Object Storage Service (OBS) that supports native multi-protocols is used as the storage
pool. The resource utilization of big data clusters can be greatly improved, and the big data cost can be
halved.
Drawing on HUAWEI CLOUD's extensive experience, the Huawei big data solution provides
high-performance, highly reliable infrastructure resources and AI training and inference
platforms for big data services, supporting your success in digitization and intelligent
transformation.
61 Huawei Confidential
Advantages of Huawei Big Data Solution
High security
Controllable servers and big data platforms
62 Huawei Confidential
HUAWEI CLOUD Big Data Services
One-stop service for data development, test, and application
DAYU: data integration, data standards, data development, data governance, data assets, and data openness for enterprise apps
One-stop big data platform (MRS): IoT access and data access feeding batch processing, stream processing, interactive query, and data mining; data services for reports and dashboards
100% compatibility with open-source ecosystems; third-party components managed as plug-ins; a one-stop enterprise platform
Storage-compute decoupling plus Kunpeng optimization for better performance
63 Huawei Confidential
HUAWEI CLOUD MRS Overview
MRS is a HUAWEI CLOUD service that is used to deploy and manage the Hadoop
system and enables one-click Hadoop cluster deployment.
MRS provides enterprise-level big data clusters on the cloud. Tenants can fully control
clusters and easily run big data components such as Hadoop, Spark, HBase, Kafka, and
Storm. MRS is fully compatible with open source APIs, and incorporates advantages of
HUAWEI CLOUD computing and storage and big data industry experience to provide
customers with a full-stack big data platform featuring high performance, low cost,
flexibility, and ease-of-use. In addition, the platform can be customized based on service
requirements to help enterprises quickly build a massive data processing system and
discover new value points and business opportunities by analyzing and mining massive
amounts of data in either real time or non-real time.
64 Huawei Confidential
Advantages of MRS (1)
High performance
Leverages Huawei-developed CarbonData storage technology which allows one data set to
apply to multiple scenarios.
Supports such features as multi-level indexing, dictionary encoding, pre-aggregation, dynamic
partitioning, and quasi-real-time data query. This improves I/O scanning and computing
performance and returns analysis results of tens of billions of data records in seconds.
Supports Huawei-developed enhanced scheduler Superior, which breaks the scale bottleneck
of a single cluster and is capable of scheduling over 10,000 nodes in a cluster.
Optimizes software and hardware based on Kunpeng processors to fully release hardware
computing power and achieve cost-effectiveness.
65 Huawei Confidential
Advantages of MRS (2)
Easy O&M
Provides a visualized big data cluster management platform, improving O&M efficiency.
Supports rolling patch upgrade and provides visualized patch release information.
Supports one-click patch installation without manual intervention, ensuring long-term stability of user clusters.
Delivers high availability (HA) and real-time SMS and email notification on all nodes.
66 Huawei Confidential
Advantages of MRS (3)
High security
With Kerberos authentication, MRS provides role-based access control (RBAC) and sound audit functions.
MRS is a one-stop big data platform that allows different physical isolation modes to be set up for customers in the
public resource area and dedicated resource area of HUAWEI CLOUD as well as HCS Online in the customer's equipment
room.
A cluster supports multiple logical tenants. Permission isolation enables the computing, storage, and table resources of
the cluster to be divided based on tenants.
67 Huawei Confidential
Advantages of MRS (4)
Cost-effectiveness
Provides various computing and storage
choices based on diverse cloud infrastructure.
68 Huawei Confidential
Application Scenarios of MRS (1)
Offline analysis of massive volumes of data
Low cost: OBS offers cost-effective storage.
Mass data analysis: Hive analyzes TB/PB-scale data.
Visualized data import and export tool: Loader exports data to Data Warehouse Service (DWS) for business intelligence
(BI) analysis.
69 Huawei Confidential
Application Scenarios of MRS (2)
Large-scale data storage
Real time: Kafka accesses massive amounts of vehicle messages in real time.
Massive data storage: HBase stores massive volumes of data and supports data queries in milliseconds.
Distributed data query: Spark analyzes and queries massive volumes of data.
(Original figure: IoV system data is ingested through Storm into HBase and analyzed and queried with Spark.)
70 Huawei Confidential
Application Scenarios of MRS (3)
Low-latency real-time data analysis
Real-time data ingestion: Flume implements real-time data ingestion and provides
various data collection and storage access methods.
Data source access: Kafka accesses data of tens of thousands of elevators and
escalators in real time.
(Original figure: details on each elevator/escalator flow from the IoEE system through Flume and Kafka to Storm for processing, with results stored in HBase and analyzed by Spark.)
71 Huawei Confidential
Summary
This chapter describes the opportunities and challenges in the big data era and Huawei
Kunpeng big data solution. In this booming big data era, every business is a data business.
On the one hand, big data analytics technologies have been fully applied in a wealth of
fields, such as finance, education, government and public security, transportation planning,
and clean energy. On the other hand, the development of big data also faces many
challenges. To address these challenges, Huawei proposes the Kunpeng strategy: based on
Huawei Kunpeng processors and TaiShan servers, Huawei continuously improves computing
power, develops the Kunpeng computing industry, and builds the Kunpeng ecosystem. In the
big data field in particular, Huawei offers multiple public cloud services to help partners
complete intelligent transformation faster and better.
72 Huawei Confidential
Recommendations
75 Huawei Confidential
Chapter 2 HDFS and ZooKeeper
Foreword
This course describes HDFS, the big data distributed storage system, and
ZooKeeper, a distributed service framework that resolves frequently
encountered data management problems in distributed services. This
chapter lays a solid foundation for learning subsequent components.
77 Huawei Confidential
Objectives
78 Huawei Confidential
Contents
2. HDFS-related Concepts
3. HDFS Architecture
4. Key Features
6. ZooKeeper Overview
7. ZooKeeper Architecture
79 Huawei Confidential
Dictionary and File System
80 Huawei Confidential
HDFS Overview
Hadoop Distributed File System (HDFS) is a distributed file system designed to run on
commodity hardware.
HDFS has a high fault tolerance capability and is deployed on cost-effective hardware.
HDFS provides high-throughput access to application data and applies to applications
with large data sets.
HDFS relaxes some Portable Operating System Interface (POSIX) requirements to
implement streaming access to file system data.
HDFS was originally built as the foundation for the Apache Nutch Web search engine
project.
HDFS is a part of the Apache Hadoop Core project.
81 Huawei Confidential
HDFS Application Scenario Example
82 Huawei Confidential
Contents
2. HDFS-related Concepts
3. HDFS Architecture
4. Key Features
6. ZooKeeper Overview
7. ZooKeeper Architecture
83 Huawei Confidential
Computer Cluster Structure
The distributed file system stores files on multiple
computer nodes. Thousands of computer nodes form
a computer cluster.
Currently, the computer cluster used by the
distributed file system consists of common hardware,
which greatly reduces the hardware overhead.
84 Huawei Confidential
Basic System Architecture
HDFS Architecture
Metadata ops
Block ops
Client
Replication
Blocks Blocks
Client
Rack 1 Rack 2
85 Huawei Confidential
Block
The default size of an HDFS block is 128 MB. A file is divided into multiple
blocks, and each block is used as a storage unit.
The block size is much larger than that of a common file system, which minimizes
the addressing overhead.
The abstract block concept brings the following obvious benefits:
Supporting large-scale file storage
Simplifying system design
Applicable to data backup
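The way a file maps onto 128 MB blocks can be sketched with simple arithmetic (illustrative Python, not Hadoop code):

```python
# Sketch: splitting a file into HDFS-style blocks at the default 128 MB
# block size (pure arithmetic with illustrative values, not Hadoop code).
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs for each block of a file.
    The last block only occupies as much space as the data it holds."""
    blocks, offset = [], 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file occupies three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
```

Because a file of any size reduces to a list of fixed-size blocks, the blocks can be scattered and replicated across DataNodes independently, which is what enables large-scale storage and simple backup.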
86 Huawei Confidential
NameNode and DataNode (1)
NameNode: metadata is stored in memory; saves the mapping between files, blocks, and DataNodes.
DataNode: file content is stored on disk; maintains the mapping between block IDs and local files on the DataNode.
87 Huawei Confidential
NameNode and DataNode (2)
88 Huawei Confidential
DataNodes
DataNodes are working nodes that store and read data in HDFS. DataNodes
store and retrieve data based on the scheduling of the clients or NameNodes,
and periodically send the list of stored blocks to the NameNodes.
Data on each DataNode is stored in the local Linux file system of the node.
89 Huawei Confidential
Contents
2. HDFS-related Concepts
3. HDFS Architecture
4. Key Features
6. ZooKeeper Overview
7. ZooKeeper Architecture
90 Huawei Confidential
HDFS Architecture Overview
91 Huawei Confidential
HDFS Namespace Management
The HDFS namespace contains directories, files, and blocks.
HDFS uses the traditional hierarchical file system. Therefore, users can create
and delete directories and files, move files between directories, and rename files
in the same way as using a common file system.
NameNode maintains the file system namespace. Any changes to the file
system namespace or its properties are recorded by the NameNode.
92 Huawei Confidential
Communication Protocol
HDFS is a distributed file system deployed on a cluster. Therefore, a large
amount of data needs to be transmitted over the network.
All HDFS communication protocols are based on the TCP/IP protocol.
The client initiates a TCP connection to the NameNode through a configurable port
and uses the client protocol to interact with the NameNode.
The NameNode and the DataNode interact with each other by using the DataNode
protocol.
The interaction between the client and the DataNode is implemented through the
Remote Procedure Call (RPC). In design, the NameNode does not initiate an RPC
request, but responds to RPC requests from the client and DataNode.
93 Huawei Confidential
Client
The client is the most commonly used method for users to operate HDFS. HDFS
provides a client during deployment.
The HDFS client is a library that contains HDFS file system interfaces that hide
most of the complexity of HDFS implementation.
Strictly speaking, the client is not a part of HDFS.
The client supports common operations such as opening, reading, and writing,
and provides a command line mode similar to Shell to access data in HDFS.
HDFS also provides Java APIs as client programming interfaces for applications
to access the file system.
94 Huawei Confidential
Disadvantages of the HDFS Single-NameNode
Architecture
Only one NameNode is set for HDFS, which greatly simplifies the system design but
also brings some obvious limitations. The details are as follows:
Namespace limitation: the NameNode stores all metadata in memory. Therefore, the number of
objects (files and blocks) a NameNode can hold is limited by its memory size.
Performance bottleneck: The throughput of the entire distributed file system is limited by the
throughput of a single NameNode.
Isolation: Because there is only one NameNode and one namespace in the cluster, different
applications cannot be isolated.
Cluster availability: Once the only NameNode is faulty, the entire cluster becomes unavailable.
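To see why memory bounds the namespace, the back-of-the-envelope sketch below uses a commonly cited rule of thumb (roughly 150 bytes of NameNode heap per file or block object — an approximation, not a figure from this material) to estimate how many objects a given heap can hold.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb heap cost per file/block object (approximate)

def max_objects(heap_bytes: int, bytes_per_object: int = BYTES_PER_OBJECT) -> int:
    """Rough upper bound on how many namespace objects fit in the given heap."""
    return heap_bytes // bytes_per_object

# With a 64 GB heap, the NameNode can hold on the order of ~458 million objects.
print(max_objects(64 * 1024**3))
```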
95 Huawei Confidential
Contents
2. HDFS-related Concepts
3. HDFS Architecture
4. Key Features
6. ZooKeeper Overview
7. ZooKeeper Architecture
96 Huawei Confidential
HDFS High Availability (HA)
(Figure: HDFS HA — active and standby NameNodes exchange heartbeats with their ZKFC processes and share the EditLog through a quorum of JournalNodes (JN); the client reads and writes data through the active NameNode, and DataNodes report blocks to both NameNodes.)
Metadata Persistence
(Figure: metadata persistence — the standby node fetches the FsImage and EditLog (1–2), merges them into a new FsImage.ckpt (3), uploads the newly generated FsImage file to the active node (4), and the active node rolls to the new FsImage (5).)
99 Huawei Confidential
HDFS Federation
(Figure: HDFS Federation — multiple NameNodes serve clients 1..n through independent namespaces, each with its own block pool (Pool 1 … Pool n); all block pools share the common storage of DataNode1 … DataNodeN.)
Metadata reliability:
The log mechanism is used to operate metadata, and metadata is stored on the active and standby NameNodes.
The snapshot mechanism implements the common snapshot mechanism of file systems, ensuring that data can be
restored in a timely manner in the case of mis-operations.
Security mode:
HDFS provides a unique security mode mechanism to prevent faults from spreading when DataNodes or disks are faulty.
2. HDFS-related Concepts
3. HDFS Architecture
4. Key Features
6. ZooKeeper Overview
7. ZooKeeper Architecture
(Figures: HDFS write process — the client writes through an FSDataOutputStream, packets are pipelined across DataNodes (4–5), and the file is closed (6) and committed to the NameNode; HDFS read process — the client reads data (4) from the DataNodes that hold the blocks.)
2. HDFS-related Concepts
3. HDFS Architecture
4. Key Features
6. ZooKeeper Overview
7. ZooKeeper Architecture
2. HDFS-related Concepts
3. HDFS Architecture
4. Key Features
6. ZooKeeper Overview
7. ZooKeeper Architecture
ZooKeeper Service
Leader
The ZooKeeper cluster consists of a group of server nodes. In this group, there is only one leader node, and
other nodes are followers.
The leader is elected during the startup.
ZooKeeper uses the custom atomic message protocol to ensure data consistency among nodes in the entire
system.
After receiving a data change request, the leader node writes the data to the disk and then to the memory.
(Figure: a client's write request (1) goes to the leader, which broadcasts the change to the followers, and the write response (6) is returned to the client.)
HDFS supports inexpensive hardware devices, streaming data access, large data sets, a simple file model, and
strong cross-platform compatibility. However, HDFS has its own limitations. For example, it is not suitable for
low-latency data access, cannot efficiently store a large number of small files, and does not support multi-user
writes or arbitrary file modification.
"Block" is the core concept of HDFS. A large file is split into multiple blocks. HDFS adopts the abstract block concept,
supports large-scale file storage, simplifies system design, and is suitable for data backup.
The ZooKeeper distributed service framework is used to solve some data management problems that are frequently
encountered in distributed applications and provide distributed and highly available coordination service capabilities.
2. Why is the HDFS data block size larger than the disk block size?
The Apache Hive data warehouse software helps read, write, and manage
large data sets that reside in distributed storage by using SQL. Structures
can be projected onto stored data. The command line tool and JDBC driver
are provided to connect users to Hive.
1. Hive Overview
Data extraction
Data Data loading
warehouse Data transformation
Usage
HQL (SQL-like) SQL
Method
Metadata storage is independent of data
Flexibility storage, decoupling metadata and data.
Low flexibility. Data can be used for limited purposes.
Advantages
High Reliability
and SQL-like Scalability Multiple APIs
Fault Tolerance
1. Cluster 1. SQL-like 1. User-defined 1. Beeline
deployment syntax storage 2. JDBC
of HiveServer 2. Large number format 3. Thrift
2. Double of built-in 2. User-defined 4. ODBC
MetaStores functions function
3. Timeout retry
mechanism
1 2 3 4
1. Hive Overview
Hive
JDBC ODBC
Web
Thrift Server
Interface
Driver
MetaStore
(Compiler, Optimizer, Executor)
Database
Table Table
Partition
CREATE/LOAD Data is moved to the repository directory. The data location is not moved.
1. Hive Overview
$ $HIVE_HOME/bin/beeline -u jdbc:hive2://$HS2_HOST:$HS2_PORT
Running Hcatalog:
$ $HIVE_HOME/hcatalog/sbin/hcat_server.sh
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
-- Describe a table:
hive> DESCRIBE invites;
-- Modify a table:
hive> ALTER TABLE events RENAME TO 3koobecaf;
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
--GROUP BY:
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200;
--JOIN:
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
--STREAMING:
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING
'/bin/cat' WHERE a.ds > '2008-08-09';
1. Introduction to HBase
3. HBase Architecture
5. HBase Highlights
Data
User image Time series
storage
data
Meteorological
Message/Order HBase data
storage Scenarios
1. Introduction to HBase
3. HBase Architecture
5. HBase Highlights
Row key | Column family "Info": name, age, gender
20200301 | Tom, 18, male
20200302 | Jack, 19, male
20200303 | Lily, 20, female
A cell can hold multiple versions, for example at timestamps t1 and t2; one timestamp corresponds to one data version.
1. Introduction to HBase
3. HBase Architecture
5. HBase Highlights
(Figure: HBase storage architecture — each HRegionServer manages multiple HRegions and a shared HLog; each Store within a Region holds StoreFiles, which are persisted as HFiles on HDFS.)
Storage hierarchy: Table → Region → Store → StoreFile (StoreFiles for each Store for each Region of the table) → Block (blocks within a StoreFile).
A table is divided into Regions by row key, in lexicographical order; when a Region grows too large, it is split into two Regions.
1. Introduction to HBase
3. HBase Architecture
5. HBase Highlights
(Figure: compaction and split — several small StoreFiles (for example, 64 MB each) are compacted into larger files (128 MB, 256 MB); when a StoreFile exceeds the threshold, it is split, for example a 256 MB StoreFile into two 128 MB files.)
1. Introduction to HBase
3. HBase Architecture
5. HBase Highlights
Write
put MemStore
Flush
Minor Compaction
Major Compaction
HFile
ColumnFamily-1
MemStore
HFile-11
HFile-12
Region
ColumnFamily-2
MemStore
HFile-21
HFile-22
1. Introduction to HBase
3. HBase Architecture
5. HBase Highlights
1. Introduction to HBase
3. HBase Architecture
5. HBase Highlights
This course describes the knowledge about the HBase database. HBase is an open
source implementation of BigTable. Similar to BigTable, HBase supports a large
amount of data and distributed concurrent data processing. It is easy to expand,
supporting dynamic scaling, and is applicable to inexpensive devices.
Additionally, this course describes the differences between the conceptual view and
physical view of HBase data. HBase is a mapping table that stores data in a sparse,
multi-dimensional, and persistent manner. It uses row keys, column keys, and
timestamps for indexing, and each value is an uninterpreted string of bytes.
B. Long
C. String
D. Byte[]
B. Column Family
C. Column
D. Cell
4. Enhanced Features
Therefore, Hadoop 2.0 introduces the YARN framework to better schedule and allocate cluster resources to
overcome the shortcomings of Hadoop 1.0 and meet the diversified requirements of programming paradigms.
4. Enhanced Features
(Figure: MapReduce data flow — the input is divided into splits 0–4; each split is processed by a map() task; the shuffle phase redistributes the map output among the reduce() tasks, which produce outputs 0–2.)
(Figure: shuffle detail — each map phase writes its MapOutFiles (MOF) to local disk, with an optional combine step; the reduce side copies and merges the MOFs through cache and disk, and the reduce phase writes its results to HDFS.)
Input (file that contains words):
Hello World Bye World
Hello Hadoop Bye Hadoop
Bye Hadoop Hello Hadoop
Output (number of times that each word is repeated):
Bye 3, Hadoop 4, Hello 3, World 2
(Figure: WordCount processing — each Map task emits <word,1> pairs, e.g. "Hello World Bye World" → <Hello,1> <World,1> <Bye,1> <World,1>; the pairs are sorted by key, and an optional Combine step pre-aggregates each map's output (e.g. <World,2>, <Hadoop,2>); after merging, each Reduce task receives the grouped values (e.g. <Bye,1 1 1>, <Hadoop,2 2>, <Hello,1 1 1>, <World,2>) and sums them to produce Bye 3, Hadoop 4, Hello 3, World 2.)
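The WordCount flow can be simulated in a few lines of plain Python (illustrative only — the real job runs as distributed map and reduce tasks):

```python
from collections import defaultdict

lines = ["Hello World Bye World",
         "Hello Hadoop Bye Hadoop",
         "Bye Hadoop Hello Hadoop"]

# Map: emit <word, 1> for every word in every split.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort: group the pairs by key, as the framework does between map and reduce.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the value list of each key.
result = {word: sum(counts) for word, counts in sorted(groups.items())}
print(result)  # {'Bye': 3, 'Hadoop': 4, 'Hello': 3, 'World': 2}
```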
(Figure: YARN architecture — clients submit jobs to the ResourceManager (Applications Manager and Scheduler); NodeManagers report node status and host Containers; each application's App Master runs in a container, requests resources, and reports MapReduce status.)
(Figure: application execution flow — the client submits the application (1), the ResourceManager allocates a container and starts the ApplicationMaster (2–3), the ApplicationMaster registers with the ResourceManager and requests resources (4–5), NodeManagers launch task containers (6–7), and the ApplicationMaster unregisters after the job finishes (8).)
(Figure: ApplicationMaster fault tolerance — application state kept in the ZooKeeper cluster allows the AM-1 container to be restarted after a failure.)
4. Enhanced Features
In Hadoop 3.x, the YARN resource model has been generalized to support user-defined
countable resource types beyond CPU and memory.
Common countable resource types include GPUs, software licenses, and locally-attached
storage in addition to CPU and memory, but not ports or labels.
4. Enhanced Features
B. Outstanding scalability
C. Real-time computing
B. CPU
C. Container
D. Disk space
B. Offline computing
D. Stream computing
B. Flexibility
C. Multi-leasing
This course describes the basic concepts of Spark and the similarities and
differences between the Resilient Distributed Dataset (RDD), DataSet, and
DataFrame data structures in Spark. Additionally, you can understand the
features of Spark SQL, Spark Streaming, and Structured Streaming.
1. Spark Overview
Lightweight: the Spark core code has about 30,000 lines.
Fast: the delay for small datasets reaches the sub-second level.
Flexible: Spark offers different levels of flexibility.
Smart: Spark smartly uses existing big data components.
1. Spark Overview
(Figure: narrow dependencies (map, filter) vs. a wide dependency (groupByKey) between the partitions of RDD1 and RDD2.)
As shown in the figure, if the b1 partition is lost, a1, a2, and a3 need to be recalculated.
(Figure: stage division — the DAG A–G is cut at wide dependencies into Stage1 (groupBy), Stage2 (map, union), and Stage3 (join).)
Transformation: Description
map(func): applies func to each element of the RDD that invokes map and returns a new RDD.
filter(func): applies func to each element of the RDD that invokes filter and returns an RDD containing the elements for which func returns true.
reduceByKey(func, [numTasks]): similar to groupByKey, but the values of each key are aggregated with the provided func to obtain a new value.
join(otherDataset, [numTasks]): for datasets of type (K, V) and (K, W), returns (K, (V, W)). leftOuterJoin, rightOuterJoin, and fullOuterJoin are supported.
Action: Description
reduce(func): aggregates the elements of the dataset using func.
collect(): returns the dataset (for example, a filter result that is small enough) as an array.
count(): returns the number of elements in the RDD.
first(): returns the first element of the dataset.
take(n): returns the first n elements of the dataset as an array.
saveAsTextFile(path): writes the dataset to a text file or HDFS; Spark converts each record into a row and writes it to the file.
1. Spark Overview
Structured Spark
Spark SQL MLlib GraphX SparkR
Streaming Streaming
Spark Core
1. Spark Overview
(Figure: Catalyst optimization in Spark SQL — a DataFrame or Dataset is parsed into a logical plan, resolved against the Catalog, rewritten into an optimized logical plan, expanded into candidate physical plans, and the cost model selects the physical plan that is finally executed as RDDs.)
1. Spark Overview
(Figure: the Structured Streaming result table is updated continuously — at times 1, 2, and 3 the calculation results for cat, dog, and owl grow as new data arrives, e.g. dog 3 → dog 4, owl 1 → owl 2.)
1. Spark Overview
(Figure: Spark Streaming divides the input data stream into batches of input data that the Spark Engine processes into batches of processed data; a window-based operation over the original DStream yields a windowed DStream, with windows at times 1, 3, and 5.)
The window slides on the Dstream. The RDDs that fall within the window are merged and operated to
generate a window-based RDD.
Window length: indicates the duration of a window.
Sliding window interval: indicates the interval for performing window operations.
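The window length and sliding interval can be sketched in plain Python (illustrative, not the Spark Streaming API; the event tuples and parameter values below are made up):

```python
def sliding_windows(events, window_length, slide_interval):
    """Group (time, value) events into overlapping [start, start+window_length) windows."""
    if not events:
        return []
    last = max(t for t, _ in events)
    windows = []
    start = 0
    while start <= last:
        end = start + window_length
        windows.append([v for t, v in events if start <= t < end])
        start += slide_interval  # the window slides by the sliding interval
    return windows

# Window length 4, sliding interval 2: consecutive windows overlap by 2 time units.
events = [(0, 'a'), (1, 'b'), (3, 'c'), (4, 'd'), (5, 'e')]
print(sliding_windows(events, 4, 2))  # [['a', 'b', 'c'], ['c', 'd', 'e'], ['d', 'e']]
```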
290 Huawei Confidential
Comparison Between Spark Streaming and Storm
Dynamic adjustment
Supported Not supported
parallelism
3. What are the differences between wide dependencies and narrow dependencies
of Spark?
3. Flink Watermark
Flink
301 Huawei Confidential
Key Concepts
Continuous processing of streaming data, event time, stateful stream processing, and
state snapshots
RabbitMQ
HBase
Flume
Collections
Implement your own SourceFunction
308 Huawei Confidential
Flink Program Running Diagram
(Figure: Flink program running process — the user submits the Flink program to the Client, the Client sends the program to the JobManager, and the JobManager submits tasks to the TaskManagers for execution.)
If the input data is bounded, the result of the following code is the same as that of the
preceding code:
val counts = visits
.groupBy("region")
.sum("visits")
317 Huawei Confidential
Flink Batch Processing Model
Flink uses a bottom-layer engine to support both stream and batch processing.
For stream processing, the engine adds checkpoints, state management, watermarks, windows, and triggers.
For batch processing, it adds backtracking for scheduling and recovery, special memory data structures, and query optimization.
3. Flink Watermark
Processing time is the local time at which a node processes the data; event time is the timestamp carried in the record.
(Figure: a tumbling window divides the input into fixed-size, non-overlapping windows; a sliding window produces overlapping windows that advance by a sliding interval.)
3. Flink Watermark
Ideal situation
Actual situation
334 Huawei Confidential
Out-of-Order Example
An app records all user clicks and sends logs back. If the network condition is poor, the
logs are saved locally and sent back later. User A performs operations on the app at
11:02 and user B performs operations on the app at 11:03. However, the network of
user A is unstable and log backhaul is delayed. As a result, the server receives the
message from user B at 11:03 and then the message from user A at 11:02.
The following figure shows the watermark of out-of-order streams (Watermark is set to
2).
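The watermark rule described above — the watermark trails the maximum observed event time by the allowed delay — can be sketched in plain Python (illustrative, not Flink's API):

```python
def watermarks(event_times, max_delay=2):
    """Watermark after each event = max event time seen so far minus the allowed delay."""
    wm, out = float('-inf'), []
    for t in event_times:
        wm = max(wm, t - max_delay)  # watermarks never move backwards
        out.append(wm)
    return out

# Out-of-order stream: the event at time 7 arrives before the events at 5 and 6,
# but the watermark holds at 5 instead of regressing.
print(watermarks([1, 3, 7, 5, 6]))  # [-1, 1, 5, 5, 5]
```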
3. Flink Watermark
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
env.getCheckpointConfig().setCheckpointTimeout(60000)
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500)
env.getCheckpointConfig().setMaxConcurrentCheckpoints(500)
env.getCheckpointConfig().enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
Checkpoint vs. Savepoint:
Triggering and management: a checkpoint is automatically triggered and managed by Flink; a savepoint is manually triggered and managed by users.
Purpose: a checkpoint quickly restores tasks from failures, for example, a timeout due to network jitter; a savepoint backs up data as planned, for example, before modifying code or adjusting concurrency.
Features: checkpoints are lightweight, recover automatically from failures, and their state is cleared by default after a job is stopped; savepoints are persistent, stored in a standard format that allows code or configuration changes, and are restored manually from savepoints.
This course explains the architecture and technical principles of Flink and
the running process of Flink programs. The focus is on the difference
between Flink stream processing and batch processing. In the long run, the
DataStream API should subsume the DataSet API through bounded data
streams.
2. What are the two types of APIs for Flink stream processing and batch processing?
2. Key Features
3. Applications
Source Sink
Multi-agent architecture: Flume can connect multiple agents to collect raw data and store them in the final
storage system. This architecture is used to import data from outside the cluster to the cluster.
(Figure: Agent 1 on the log host forwards events through Source → Channel → Sink to Agent 2, whose sink writes to HDFS; in the consolidation variant, Agent 2 and Agent 3 collect logs and send them to Agent 4, which writes to HDFS.)
You can configure multiple level-1 agents and point them to the source of an agent using Flume. The source
of the level-2 agent consolidates the received events and sends the consolidated events into a single channel.
The events in the channel are consumed by a sink and then pushed to the destination.
(Figure: inside an agent — events from the Source pass through an Interceptor chain and a Channel Selector into one or more Channels; a Sink Runner drives the Sink Processor, which selects the Sink that consumes events from its Channel.)
2. Key Features
3. Applications
(Figure: collecting local logs into HDFS through Source → Channel → Sink; agents can also cascade, with Agent 1 forwarding events to the source of a downstream agent.)
(Figure: Flume transactions — the source-side transaction starts, puts events into the channel, and commits; the sink-side transaction starts, takes events from the channel, sends them, and ends the transaction.)
(Figure: one source can fan out through an interceptor and multiple channels to multiple sinks, for example writing the same events to different HDFS destinations.)
2. Key Features
3. Applications
server.sources = a1
server.channels = ch1
server.sinks = s1
# the source configuration of a1
server.sources.a1.type = spooldir
server.sources.a1.spoolDir = /tmp/log_test
server.sources.a1.fileSuffix = .COMPLETED
server.sources.a1.deletePolicy = never
server.sources.a1.trackerDir = .flumespool
server.sources.a1.ignorePattern = ^$
server.sources.a1.batchSize = 1000
server.sources.a1.inputCharset = UTF-8
server.sources.a1.deserializer = LINE
server.sources.a1.selector.type = replicating
server.sources.a1.fileHeaderKey = file
server.sources.a1.fileHeader = false
server.sources.a1.channels = ch1
server.sinks.s1.type = hdfs
server.sinks.s1.hdfs.path = /tmp/flume_avro
server.sinks.s1.hdfs.filePrefix = over_%{basename}
server.sinks.s1.hdfs.inUseSuffix = .tmp
server.sinks.s1.hdfs.rollInterval = 30
server.sinks.s1.hdfs.rollSize = 1024
server.sinks.s1.hdfs.rollCount = 10
server.sinks.s1.hdfs.batchSize = 1000
server.sinks.s1.hdfs.fileType = DataStream
server.sinks.s1.hdfs.maxOpenFiles = 5000
server.sinks.s1.hdfs.writeFormat = Writable
server.sinks.s1.hdfs.callTimeout = 10000
server.sinks.s1.hdfs.threadsPoolSize = 10
server.sinks.s1.hdfs.failcount = 10
server.sinks.s1.hdfs.fileCloseByEndEvent = true
server.sinks.s1.channel = ch1
mv /var/log/log.11 /tmp/log_test
server.sources = a1
server.channels = ch1
server.sinks = s1
# the source configuration of a1
server.sources.a1.type = spooldir
server.sources.a1.spoolDir = /tmp/log_click
server.sources.a1.fileSuffix = .COMPLETED
server.sources.a1.deletePolicy = never
server.sources.a1.trackerDir = .flumespool
server.sources.a1.ignorePattern = ^$
server.sources.a1.batchSize = 1000
server.sources.a1.inputCharset = UTF-8
server.sources.a1.selector.type = replicating
server.sources.a1.basenameHeaderKey = basename
server.sources.a1.deserializer.maxBatchLine = 1
server.sources.a1.deserializer.maxLineLength = 2048
server.sources.a1.channels = ch1
Loader is used for efficient data import and export between the big data
platform and structured data storage (such as relational databases). Based
on the open-source Sqoop 1.99.x, Loader functions have been enhanced.
1. Introduction to Loader
(Figure: Loader exchanges data between external sources — relational databases (RDB), SFTP servers, and customized data sources — and Hadoop components such as HDFS, HBase, and Hive.)
Loader features: graphical operation, high performance, high reliability, and security.
(Figure: Loader architecture — the Loader Client (tool and WebUI) and external JDBC/file accesses reach the Loader Server through the REST API; the server comprises the Transform Engine, Execution Engine, Scheduler, Submission Engine, Job Manager, Metadata Repository, and HA Manager, and submits Map/Reduce tasks to YARN that exchange data with HDFS, HBase, and Hive over JDBC or SFTP/FTP.)
Term: Description
Loader Client: provides a web user interface (WebUI) and a command-line interface (CLI).
Loader Server: processes operation requests sent from the client, manages connectors and metadata, submits MapReduce jobs, and monitors MapReduce job statuses.
REST API: provides RESTful APIs (HTTP + JSON) to process requests from the client.
Job Scheduler: a simple job scheduling module that supports periodic execution of Loader jobs.
Transform Engine: processes data transformation; supports field combination, string cutting, string reversing, and other data transformations.
Execution Engine: the execution engine of Loader jobs; provides the detailed processing logic of MapReduce jobs.
Submission Engine: the submission engine of Loader jobs; supports the submission of jobs to MapReduce.
Metadata Repository: stores and manages data about Loader connectors, transformation procedures, and jobs.
HA Manager: manages the active/standby status of Loader servers; two Loader servers are deployed in active/standby mode.
1. Introduction to Loader
This chapter describes Loader's main functions, features, job management,
and job monitoring.
1. (T or F) True or false: MRS Loader supports only data import and export between
relational databases and Hadoop’s HDFS or HBase.
B. Dirty data refers to the data that does not comply with the conversion rules.
B. After Loader submits a job to MapReduce for execution, the job execution will be
retried if a Mapper fails to execute the job.
C. If a job execution fails, you need to manually clear the data remanence.
D. After Loader submits a job to MapReduce for execution, it cannot submit another job
before the preceding job completes.
1. Introduction
3. Data Management
(Figure: Kafka acts as the message hub of the big data ecosystem, connecting components such as Flume, Storm, Hadoop, and Spark.)
1. Introduction
3. Data Management
(Figure: a Kafka cluster of Brokers coordinated by a ZooKeeper ensemble.)
Broker: A Kafka cluster contains one or more service instances, which are called brokers.
Topic: Each message published to the Kafka cluster has a category, which is called a topic.
Partition: Kafka divides a topic into one or more partitions. Each partition physically
corresponds to a directory for storing all messages of the partition.
Consumer: consumes messages and functions as a client to read messages from Kafka
Broker.
Consumer Group: Each consumer belongs to a given consumer group. You can specify a group
name for each consumer.
(Figure: a Kafka topic with partitions 0, 1, and 2 — producers always append messages to the end of a partition, with the oldest messages at the head and the newest at the tail.)
(Figure: a consumer group (C1) maintains its own offset in each partition; in a Kafka cluster where partitions P0–P3 are spread across servers 1 and 2, each partition is consumed by exactly one consumer (C1–C6) within a group.)
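To illustrate the rule that each partition is consumed by exactly one consumer of a group, here is a simple round-robin assignment sketch in plain Python (illustrative only — Kafka's actual client-side assignors implement range and round-robin strategies):

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin;
    each partition goes to exactly one consumer of the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

print(assign_partitions(['P0', 'P1', 'P2', 'P3'], ['C1', 'C2']))
# {'C1': ['P0', 'P2'], 'C2': ['P1', 'P3']}
```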
1. Introduction
3. Data Management
Data Storage Reliability
Data Transmission Reliability
Old Data Processing Methods
(Figure: each partition in the Kafka cluster is an append-only log — the producer writes new messages at the tail ("new") while the oldest messages sit at the head ("old").)
1. Introduction
3. Data Management
Data Storage Reliability
Data Transmission Reliability
Old Data Processing Methods
1. Introduction
3. Data Management
Data Storage Reliability
Data Transmission Reliability
Old Data Processing Methods
$KAFKA_HOME/config/server.properties
This chapter describes the basic concepts of the message system, and Kafka
application scenarios, system architecture, and data management.
B. Distributed
C. Message persistence
2. Which of the following components does the Kafka cluster depend on during its running?( )
A. HDFS
B. Zookeeper
C. HBase
D. Spark
Playing games
Signing in Weibo
Reading comics
Following a TV series
Listening to music
(Figure: LDAP identity authentication — an application server connects to the Thrift Service, whose Driver and Metastore authenticate users against LDAP.)
(Figure: visitors are authenticated by the authentication server before accessing servers in the cluster.)
(Figure: Kerberos authentication — KrbServer provides the AS and TGS. ① The Kerberos client requests a ticket-granting ticket (TGT) from the AS; ② the AS returns the TGT; ③ the client uses the TGT to request a service ticket from the TGS; ④ the TGS returns the service ticket; ⑤ the client presents the ticket to the Kerberos server protecting the service; ⑥ the service responds.)
User management is provided by Kerberos, and permission management by LDAP.
In the prestart phase, invoke the API to generate the keytab file,
and save it to a directory on the HDFS server.
Kerberos service is deployed in load sharing mode. During the installation, Kerberos
service needs to be distributed to the two control nodes in the cluster.
LDAP server service role: SLAPD server.
LDAP server service is deployed in active/standby mode. During the installation, LDAP
server service needs to be distributed to the two control nodes in the cluster.
To achieve optimal performance, it is recommended that LDAP server and KrbServer in
all clusters be deployed on the same node.
LDAP_OPTION_TIMEOUT: timeout interval for the connection between Kerberos and the back-end LDAP database. If the connection time exceeds the timeout interval, a failure message is returned.
LDAP_SEARCH_TIMEOUT: timeout interval for Kerberos to query the back-end LDAP database. If the query time exceeds the timeout interval, a failure message is returned.
max_retries: the maximum number of attempts made by the JDK process to connect to the KDC for authentication. If the connection attempts exceed the max_retries value, a failure message is returned.
ldapadd Indicates the command-line tool of LDAP to add the user information to LDAP.
ldapdelete Indicates the command-line tool of LDAP to remove entries from the LDAP.
Indicates the command-line tool of Kerberos to authenticate users. Only authenticated users
kinit can run the shell command of each MRS component to complete maintenance tasks.
Indicates the command-line tool of Kerberos to deregister users after tasks of components
kdestroy are completed.
Indicates the command-line tool of Kerberos to switch to the Kerberos admin who can
kadmin obtain and modify Kerberos user information.
kpasswd Indicates the command-line tool of Kerberos to change the user password.
This chapter introduces the security authentication system of Huawei’s big data
platform, including the basic authentication process (by explaining the protocol),
how to productize the security authentication system (from the perspective of big
data integration and deployment), and new features developed during the
productization process.
After learning this chapter, you can better understand LDAP, Kerberos, and the
security authentication of MRS products so that you can better maintain the
products.
1. Elasticsearch Overview
High performance: search results are returned immediately; full-text search is implemented with an inverted index.
Scalability: horizontal scaling is supported; Elasticsearch can run on hundreds or thousands of servers, and a prototype environment can be switched to a production environment seamlessly.
Relevance: search results are ranked by relevance, based on elements ranging from term frequency and proximity to popularity.
Reliability: faults such as hardware failures and network partitions are detected automatically, ensuring the security and availability of the cluster and data.
Write and read: The written data can be searched in real time.
User access
layer
ELK/ELKB provides a
complete set of solutions.
Data access
layer
1. Introduction to Elasticsearch
Index shard. Elasticsearch splits a complete index into multiple shards and distributes
Shard them on different nodes.
Index replica. Elasticsearch allows you to set multiple replicas for an index. Replicas can
improve the fault tolerance of the system. When a shard on a node is damaged or lost,
Replica the data can be recovered from the replica. In addition, replicas can improve the search
efficiency of Elasticsearch by automatically balancing the load of search requests.
Data recovery or re-distribution. When a node is added to or deleted from the cluster,
Recovery Elasticsearch redistributes shards based on the load of the node. When a failed node is
restarted, data will be recovered.
Mode for storing Elasticsearch index snapshots. By default, Elasticsearch stores indexes in the
memory and only makes them persistent on the local hard disk when the memory is full. The
Gateway gateway stores index snapshots. When an Elasticsearch cluster is disabled and restarted, the
cluster reads the index backup data from the gateway. Elasticsearch supports multiple types
of gateways, including the default local file system, distributed file system, Hadoop HDFS,
and Amazon S3.
Interaction mode between an Elasticsearch internal node or cluster and the client. By default,
internal nodes use the TCP protocol for interaction. In addition, such transmission protocols
Transport (integrated using plugins) as the HTTP (JSON format), Thrift, Servlet, Memcached, and
ZeroMQ are also supported.
1. Elasticsearch Overview
Precautions: before deleting an instance, ensure that replicas of its shards exist on other instances,
or that the data in its shards has been migrated to another node in Elasticsearch.
539 Huawei Confidential
Elasticsearch Multi-instance Deployment on a Node
Multiple Elasticsearch instances can be deployed on one node, and differentiated from
each other based on the IP address and port number. This method increases the usage
of the single-node CPU, memory, and disk, and improves the indexing and search
capability of Elasticsearch.
(Figure: multi-instance deployment — each node runs two Elasticsearch instances (EsNode1 and EsNode2) under the EsMaster, each holding shard replicas such as coll_shard_replica1 and coll_shard_replica2.)
B. Unstructured data
C. Semi-structured data
B. MongoDB
C. Memcached
D. Lucene
4. Redis Optimization
4. Redis Optimization
Redis Cluster
Server2
4. Redirect to Server3
5. Request
6. Response
1. The client connects to any node in the cluster to request the cluster topology.
2. The server node returns the cluster topology, including the cluster node list and the mapping relationship between slots and nodes. The client caches the cluster topology in memory.
3. The client calculates the slot of the key as hash(KEY) % 16384, queries the mapping relationship between slots and nodes, and then accesses the node that the key belongs to (Server2) to read and write data.
4. Server2 receives the request sent by the client and checks whether the key's slot is served locally. If it is not, Server2 tells the client to redirect the request to the Server3 node; if it is, Server2 returns the result of the operation.
5. The client receives the redirection response and sends the read/write request to Server3.
6. Server3 receives the request and processes it in the same way as step 4.
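The slot computation in step 3 can be made concrete. The slide writes it as hash(KEY) % 16384; in open-source Redis Cluster the hash is specifically CRC16 (XMODEM variant), and a {hash tag} inside the key, if present, is hashed instead of the whole key. A minimal Python sketch:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM), the checksum Redis Cluster uses for key slots."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Map a key to one of the 16384 cluster slots, honouring {hash tags}."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:  # non-empty tag between braces
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384
```

Keys sharing a hash tag, such as {user1000}.following and {user1000}.followers, land in the same slot, which is what makes multi-key operations on them possible in a cluster.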
4. Redis Optimization
hset/hget/hmset/hmget/hgetall (hsetnx)
hexists (checks whether a field exists in the key)
hincrby (the hash type has no hincr command)
hdel
hkeys/hvals
hlen (obtains the number of fields contained in a key)
sadd/smembers/srem/sismember;
sdiff (difference set)/sinter (intersection set)/sunion (union set);
sdiffstore/sinterstore/sunionstore;
scard (Obtains the set length.)/spop (Randomly takes an element out of the set and deletes it.);
srandmember key [count]:
If count is a positive number and less than the set cardinality, the command returns an array
containing count different elements.
If count is greater than or equal to the set cardinality, the entire set is returned.
If count is a negative number, the command returns an array. The elements in the array may
appear multiple times, and the length of the array is the absolute value of count.
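The three srandmember count rules above can be sketched against a plain Python set (an illustration of the semantics, not Redis itself):

```python
import random

def srandmember(members: set, count: int) -> list:
    """Emulate SRANDMEMBER key count on a plain Python set."""
    pool = list(members)
    if count >= 0:
        # Positive count: up to `count` distinct elements;
        # the entire set when count >= cardinality.
        return random.sample(pool, min(count, len(pool)))
    # Negative count: |count| elements, repeats allowed.
    return [random.choice(pool) for _ in range(-count)]
```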
zadd/zscore/zrange/zrevrange/
zrangebyscore (A closed interval is used by default. Users can use "(" to adopt an
open interval.)
zincrby/zcard/zcount (Obtains the number of elements in a specified score range. A
closed interval is used by default. Users can use "(" to adopt an open interval.)
zrem/zremrangebyrank/zremrangebyscore (A closed interval is used by default.
Users can use "(" to adopt an open interval.)
Extension: +inf (positive infinity) -inf (negative infinity)
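The interval notation above (closed bounds by default, "(" for open bounds, and +inf/-inf) can be illustrated with a small zcount sketch over a list of scores (a model of the semantics, not a Redis client):

```python
def _parse_bound(bound: str):
    """Parse a ZCOUNT/ZRANGEBYSCORE bound: a '(' prefix means open interval."""
    exclusive = bound.startswith("(")
    if exclusive:
        bound = bound[1:]
    return float(bound), exclusive  # float() accepts "+inf" and "-inf"

def zcount(scores, min_bound: str, max_bound: str) -> int:
    """Count scores inside the range, honouring open/closed bounds."""
    lo, lo_open = _parse_bound(min_bound)
    hi, hi_open = _parse_bound(max_bound)
    def in_range(s):
        above = s > lo if lo_open else s >= lo
        below = s < hi if hi_open else s <= hi
        return above and below
    return sum(1 for s in scores if in_range(s))
```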
Application scenarios:
Time-limited preferential activity information
Website data cache (for data that needs to be updated periodically, for example, bonus point rankings)
Limiting the frequency of access to a website (for example, a maximum of 10 requests per minute).
Extended get parameter: the pattern rule of the get parameter is the same as that of the by parameter; get # returns the element itself.
Extended store parameter
Use the store parameter to save the sort result to a specified list.
Performance optimization:
Reduce the number of elements in the key to be sorted as much as possible.
Use the limit parameter to obtain only the required data.
If there is a large amount of data to be sorted, use the store parameter to cache the result.
Run the save or bgsave command to make Redis perform a snapshot operation.
The difference between the two commands is that save takes the snapshot in the main process, which blocks other requests, whereas bgsave makes Redis fork a child process that takes the snapshot in the background.
Note: When Redis is started, if both RDB persistence and AOF persistence are enabled, the
program preferentially uses the AOF mode to restore the data set because the data stored in the
AOF mode is the most complete. If the AOF file is lost, the database is empty after the startup.
Note: To switch a running Redis database from RDB to AOF, use the dynamic switchover first and then modify the configuration file. (Do not simply edit the configuration file and restart the database; otherwise, the database comes up empty.)
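The dynamic switchover mentioned in the note can be sketched with standard redis-cli commands (run against a live instance; shown for illustration only):

```shell
# Enable AOF at runtime; Redis rewrites an AOF file from the current
# data set, so nothing is lost during the switch.
redis-cli CONFIG SET appendonly yes

# Persist the change into redis.conf so the next restart keeps AOF on.
redis-cli CONFIG REWRITE
```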
4. Redis Optimization
If data persistence is not required in service scenarios, disable all data persistence modes to achieve optimal
performance.
Optimize internal coding (only need to have a basic understanding of it).
Redis provides two internal coding methods for each data type. Redis can automatically adjust the coding method in
different scenarios.
SLOWLOG [get/reset/len]
Commands whose execution time exceeds the value (in microseconds; one second = one million microseconds) of slowlog-log-slower-than are recorded.
slowlog-max-len determines the maximum number of entries that can be kept in the slow log.
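A small Python model of these two parameters: entries over the slowlog-log-slower-than threshold are kept newest first, capped at slowlog-max-len (a sketch of the semantics, not the Redis implementation):

```python
from collections import deque

class SlowLog:
    """Minimal model of Redis SLOWLOG behavior."""
    def __init__(self, log_slower_than_us: int, max_len: int):
        self.threshold = log_slower_than_us   # slowlog-log-slower-than
        self.entries = deque(maxlen=max_len)  # slowlog-max-len cap

    def record(self, command: str, duration_us: int):
        if duration_us > self.threshold:
            # Newest entries first; the deque drops the oldest when full.
            self.entries.appendleft((command, duration_us))

    def get(self):       # SLOWLOG GET
        return list(self.entries)

    def length(self):    # SLOWLOG LEN
        return len(self.entries)

    def reset(self):     # SLOWLOG RESET
        self.entries.clear()
```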
Redis Optimization (2)
Modify the memory allocation policy of the Linux kernel.
Add vm.overcommit_memory = 1 to /etc/sysctl.conf and restart the server.
Alternatively, run the sysctl vm.overcommit_memory=1 command (take effect
immediately).
If the total data volume is not large and the memory is sufficient, users do not need to limit the memory used
by Redis. If the data volume is unpredictable and the memory is limited, limit the memory used by Redis to
prevent Redis from using the swap partition or prevent OOM errors.
Note: If the memory is not limited, Redis starts using the swap partition once physical memory is exhausted, and performance drops sharply. If the memory is limited, no more data can be added once the specified limit is reached and an OOM error is reported; users can set maxmemory-policy to evict data when memory is insufficient.
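The kernel and memory settings above can be sketched as configuration fragments; the 2 GB cap and the allkeys-lru policy are illustrative assumptions, not recommendations from this course:

```shell
# Kernel memory-allocation policy: persist it in /etc/sysctl.conf ...
echo "vm.overcommit_memory = 1" >> /etc/sysctl.conf
# ... or apply it immediately without a restart
sysctl vm.overcommit_memory=1

# redis.conf fragment (illustrative values): cap Redis memory and evict
# data instead of failing writes with an OOM error.
#   maxmemory 2gb
#   maxmemory-policy allkeys-lru
```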
Single-key commands and their batch equivalents: set → mset, get → mget, lindex → lrange, hset → hmset, hget → hmget.
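The point of these batch equivalents is fewer network round trips. A toy client that only counts round trips makes the difference visible (a sketch, not a Redis client):

```python
class RoundTripCounter:
    """Toy key-value client that counts network round trips, to show why
    batch commands (mset/mget) beat loops of single-key calls."""
    def __init__(self):
        self.store = {}
        self.round_trips = 0

    def set(self, key, value):
        self.round_trips += 1          # one round trip per key
        self.store[key] = value

    def mset(self, mapping):
        self.round_trips += 1          # one round trip for the whole batch
        self.store.update(mapping)

    def mget(self, keys):
        self.round_trips += 1
        return [self.store.get(k) for k in keys]
```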
4. Redis Optimization
Multiple recommendation results (offerings) may exist. To avoid recommending the same offering repeatedly, a set is used to store and access the results. The key is designed as res-<user id>.
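A minimal Python sketch of this dedup pattern, modelling the res-<user id> set with an in-memory dict (SADD/SMEMBERS against a real Redis would play the same role):

```python
def recommend(store: dict, user_id: str, candidates: list) -> list:
    """Return only candidates not recommended before, recording them
    in a per-user set under a res-<user id> key."""
    key = f"res-{user_id}"
    seen = store.setdefault(key, set())
    fresh = [c for c in candidates if c not in seen]
    seen.update(fresh)   # remember what was recommended this round
    return fresh
```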
1. Which Redis data structure is appropriate when the top N records for an
application need to be obtained?
2. Does the Redis server forward a key operation request that does not belong to its
node to the correct node?
This course introduces the application scenarios, features, data types, and
service data reading and writing processes of the Redis component. After
learning this course, users can select a proper Redis data structure based
on the specific service scenario to access Redis data.
The solution implements cross-cloud seamless synchronization of advanced service capabilities and multi-scenario collaboration, and supports Huawei Kunpeng and Ascend computing power to help governments and enterprises realize refined resource control, cross-cloud hybrid orchestration, and collaboration across scenarios such as combining online development and testing with offline deployment, and online training with offline inference. In this way, each enterprise can build its own cloud.
Huawei's big data services are deployed in 150 countries and regions, serving more than 7,000 customers.
As cloud-based transformation of governments and enterprises develops, hybrid clouds are favored by more and more government and enterprise users. User requirements drive industry evolution and replacement. HUAWEI CLOUD Stack 8.0 has evolved from "hybrid resource management" to "refined management and control + hybrid services", redefining the hybrid cloud.
Agricultural economy (land + labor force), industrial economy (capital + technology), digital economy (data + intelligence): 2005, 2015, 2025.
Source: Oxford Economics; Huawei Global Industry Vision
(Diagram: traditional siloed systems (ERP, CRM, finance, supply chain, logistics, e-commerce, stores) serving group, company, department, and individual levels lead to slow requirement implementation and difficult service supervision; an AI-powered data mid-end addresses these problems.)
Trend 3: Data management and resource management make up for the weaknesses of the AI industry. Logic: the success of AI depends on the understanding of data, and only one resource pool is required to maximize resource utilization.
Evolution stages (with Volcano scheduling):
Big data computing: storage-compute decoupling + brute-force scanning
Introduction of databases: ACID, updating, and indexing
Data management added to AI development: data preprocessing, feature processing, unified access, and resource scheduling
(Diagram: AI apps run on an AI development platform over data collection and processing, following six steps (1 label, 2 evaluate, 3 train, 4 evaluate, 5 infer, 6 encapsulate), backed by data management/views, unified data access, unified resource scheduling, and object storage.)
Huawei aims to build an open, cooperative, and win-win cloud ecosystem and helps partners quickly integrate into the local ecosystem. HUAWEI CLOUD adheres to business boundaries, respects data sovereignty, does not monetize customer data, and works with partners on joint innovation to continuously create value for customers and partners. By the end of 2019, HUAWEI CLOUD had launched 200+ cloud services and 190+ solutions, serving many well-known enterprises around the world.
Share 200+ services of 18 categories, 200+ solutions, and a big cloud ecosystem.
(Diagram: unified architecture, seamless experience, unified ecosystem, and unified O&M spanning the AI data mid-end and the technology mid-end for innovation services.)
... computing, big data, and AI. Rebuilding core application systems using a distributed architecture helps transform customers' core assets into digital services that can be made available to external systems, leading to improved IT efficiency, accelerated ecosystem growth, and energized innovation.
(Diagram: an aPaaS layer providing data enablement, security enablement, business enablement, and operations enablement (AI, big data, IoT, video, converged communications, GIS) over cloud core services and PaaS, plus security, O&M, support, and IT operations. Connected devices include parking-garage sensors, navigational lighting, check-in terminals, FIDS, airport security equipment, city IoT sensors, power-consumption sensors, and hand-held and vehicle-mounted terminals.)
(Diagram: a city data mid-end serving scenarios such as smart city, public transport, emergency, transportation, smart water conservation, and smart campus. It combines big data services (MapReduce Service (MRS), Data Warehouse Service (DWS), Data Lake Insight (DLI), Cloud Search Service (CSS), Graph Engine Service (GES)) with ROMA integration (FDI, APIC, MQS, LINK), DRS data replication, DAYU-CDM data migration, and ModelArts for AI development (data processing, model training, model management, and deployment). AI capabilities include speech recognition, image recognition, OCR, NLP, IVA, TrafficGo, digital twins, animation, and intelligent robots, all running on HUAWEI CLOUD Stack with Kunpeng and Ascend. DAYU provides data integration, data design, data development, data governance, data assets, and data openness for enterprise applications.)
One-Stop Big Data Platform – MRS
Data access: IoT access, Flume, Loader, Kafka, and third-party tools
Stream computing: SparkStreaming, Flink, Storm; batch processing: Spark, MapReduce, Tez
Stream query: Flink SQL; interactive query: SparkSQL, Presto; data mining: MLlib; batch query: Hive
Storage formats: Parquet, TXT, ORC, CarbonData, and HBase on HDFS or OBS
Upper-layer services: data services, reports and dashboards, data visualization (DLV), OLAP analysis, Data Warehouse Service (DWS), Cloud Search Service (CSS), Graph Engine Service (GES), track mining, smart assistant, and vision services
100% compatibility with open-source ecosystems, 3rd-party components managed as plugins, one-stop enterprise platform
Storage-compute decoupling + Kunpeng optimization for better performance
(Diagram: API gateway, cloud security, middleware integration, cloud market, automated orchestration, and alarm monitoring over compute, storage, network, heterogeneous, and public-cloud resource pools, on infrastructure of TaiShan, Atlas, and x86 servers, OceanStor and FusionStorage storage, and CloudEngine network devices.)
Challenges:
Data silos: systems were built as silos, so data sharing is inefficient and hinders global services.
Rigid capacity expansion: as long-term storage of data is mandated by China's Cyber Security Law, the three-replica policy of HDFS brings huge cost pressure; IT capacities must be flexibly scaled to meet changing demands, e.g. peaks and troughs.
Low utilization: storage and compute resources are unbalanced; data kept for long-term auditing is only occasionally accessed, so resource utilization is below 20%.
Customer benefits of compute-storage decoupling (MRS/DLI on OBS with Kunpeng and Ascend):
Unified storage, no silos: raw data is all stored in OBS; the multiple protocols of OBS (object and file semantics) support multi-architecture computing and data interaction.
Elastic scalability, higher resource utilization: compute and storage are decoupled and scaled separately; compute resource utilization can reach 75%.
Guaranteed performance with storage-compute decoupling + Kunpeng: Kunpeng multi-core optimization, a software cache layer, and OBS high-concurrency support.
Doubled cost-efficiency: 40% higher storage utilization with EC (erasure coding).
High reliability:
Anti-affinity, live migration of ECSs, and cross-AZ data synchronization and backup.
Storage-compute decoupling: cross-AZ OBS DR and cross-AZ compute resource deployment; an external Relational Database Service (RDS) stores metadata for cross-AZ reliability.
Performance advantages of multiple models + Kunpeng:
Software optimization: Spark CBO, CarbonData, HBase secondary and tag indexes, and an optimized compression algorithm improve performance by 50% on average.
GUI-based management:
One-click deployment and capacity expansion; configuration, status management, and alarm monitoring all on a web portal.
Seamless interconnection with legacy management systems via standard SNMP, FTP, and Syslog APIs.
Auto scaling:
The smallest cluster has only two nodes (4 vCPUs and 8 GB memory); all nodes can be scaled out, and specification scale-up and node scale-out/scale-in events do not affect services.
Enterprise-Level Distributed Multi-Module Data Warehouse (DWS)
(Diagram: CN access nodes, GTM for distributed transaction management, a TCP/RDMA computing network, and primary/standby DataNodes running distributed SQL and distributed execution, with cross-AZ DR, over a data zone and source systems.)
Enterprise-level multi-module data warehouse that supports OLAP data analysis and a time-series flow engine.
Distributed architecture, storage-compute decoupling, and on-demand independent scaling.
Compatible with standard SQL 2003, ensuring transaction ACID and data consistency.
Industry Background
• Elasticsearch is the mainstream information retrieval and log analysis engine in the industry.
• In the DB-Engines ranking, it is the No. 7 database and the No. 1 search engine in the world.
Application Scenarios
• Log analysis and O&M monitoring
• Search, recommendation, and database
acceleration
Advantages
• High performance: Provides a vector retrieval
engine with 10x higher performance than the
open source Elasticsearch, supports automatic
rollup of time series data, and improves the
aggregation analysis capability by 10x.
• Low cost: Provides Kunpeng computing power,
separation of hot and cold data, storage-
compute decoupling, and index lifecycle
management solutions, reducing the storage
cost of cold data by more than 80%.
• High availability: services are not interrupted when cluster specifications, plug-ins, or parameters are changed. Supports automatic data backup and delivers 99.9999% data availability.
(Diagram: the No. 1 search engine with Kibana-based visualization and HUAWEI CLOUD enhancements.)
Graph data sources: social relationships, transaction records, call records, information propagation, historical browsing records, transportation networks, communications networks, and more. Massive, complex, associated data is naturally graph data.
GES analysis highlights:
• Diverse data support, not merely structured data modeling
• Multi-source data association and automatic propagation analysis
• Dynamic data changes and real-time interactive analysis without training
• Visualized and interpretable results
EYWA high-performance cloud graph engine:
• Abundant graph analysis algorithm libraries
• High-performance graph computing kernel
• Distributed high-performance graph storage engine
(Diagram: individual developers, group services, application services, and mobile-client users publish algorithms and submit analysis jobs through the algorithm web portal and visualizer, and GES returns results.)
Product Advantages
• Large scale: tens of billions of vertices and hundreds of billions of edges
• High performance: 20,000+ QPS per instance, with responses within seconds
• Integration: integration with abundant algorithms for querying and analysis
• Ease of use: wizard-based GUI and compatibility with Gremlin facilitate easy graph analysis
DAYU: data integration, data development, data governance, data assets, and data services.
+ Industry know-how: data assets with data virtualization and federation, Huawei data enablement methodologies, and industry templates; open data layer by layer, and work with partners to develop templates that formalize industry know-how.
+ AI: auto-tuning and system self-optimization for autonomous, continuous performance improvement; an AI engine with optimized AI algorithms/models for processing unstructured data.
Multi-architecture computing: stream computing, batch computing, interactive analysis, search, graph, and data warehouse workloads on x86, Kunpeng, Ascend, and GPU; big data + data warehouse convergence with leading time-series analysis performance; cross-source interactive analysis; Kubernetes container scheduling.
Data quality monitoring and improvement: monitor and improve data quality throughout the lifecycle, and output clean and reliable data.
Periodic task processing and scheduling: metadata collection and quality tasks can be scheduled and monitored periodically for continuous data value extraction.
(Diagram: data access and conversion, basic/ad hoc library creation, and converged data supporting upper-layer apps and service operation on the MRS/DLI/DWS data platform, with all-domain data management for data lineage analysis.)
Data integration: batch data migration between homogeneous and heterogeneous data sources is provided to help customers implement free data flow inside and outside the data lake and between lakes.
Data development: a one-stop big data development environment and fully-hosted big data scheduling help users quickly and efficiently develop data services.
(Also shown: data application, data openness, and data visualization.)
Data governance
Objectives:
• Unified standards
• Meeting data requirements throughout the full link
• Automated and codeless data development
(Diagram: metric requirements, metric definitions and data models, data modeling, a workflow engine, publishing/decommission and construction history management, permissions and security, and metric content, driven by service data requirements and standard aPaaS design.)
Product advantages:
• Systematic and process-based data standards, built on data construction methodology and industry standards
• Metadata center, intelligent auxiliary tools, data quality reports, a data asset map, and full-link lineage
DLG: One-stop Data Governance Platform, providing one-stop management of metadata, data standards, data quality, data security, data lifecycle management, and the data map.
This chapter introduces Huawei big data services and data mid-end services.
HUAWEI CLOUD Stack 8.0 is a brand-new hybrid cloud solution based on Huawei
Kunpeng processors.