Big Data Native Cluster (Built Around Hadoop 2.x): Local Test Environment Setup, Part 5

尘世壹俗人

Last modified: 2025-08-13 14:15:29

License: CC 4.0 BY-SA

Column: Big Data Cluster Setup Guides (all cluster types). Tags: big data

First published: 2020-11-19 18:48:54

Article link: https://blog.csdn.net/dudadudadd/article/details/109726023

Versions installed in this part

flink1.7
Azkaban-2.5.0
presto 0.196
druid (imply-2.7.10)
jupyterLab: the latest release is fine

Flink

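I. Extract flink-1.7.2-bin-hadoop27-scala_2.11.tgz and change into the conf directory.

II. Edit the configuration

1) Edit flink/conf/flink-conf.yaml. Near the top is jobmanager.rpc.address, the address of the node that runs the JobManager; the cluster's RPC traffic goes to this address at runtime. I set it to hdp2, because I allocate only three machines to Flink in total.

2) Edit conf/slaves and list the cluster's TaskManager nodes in it, one hostname per line. The list may include hdp2, configured above, so that it also runs tasks.

3) Use scp to distribute the Flink directory to the other two machines.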

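4) Start the cluster by running bin/start-cluster.sh on the JobManager node. Running it on another node raises no error, but it does not bring Flink up completely. Then open http://hdp2:8081 to monitor and manage the Flink cluster and its jobs.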
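
Steps 1) to 4) can be sketched as a few shell helpers. This is only a minimal sketch under stated assumptions, not the exact commands used for this cluster: the install path /opt/flink-1.7.2, the hostnames hdp3 and hdp4, and the function names are all hypothetical, and it presumes GNU sed and passwordless scp between the nodes.

```shell
#!/usr/bin/env bash
# Sketch of steps 1) to 4). Assumptions (not from the article):
#   - Flink is unpacked at /opt/flink-1.7.2 on every node (FLINK_HOME)
#   - the other two machines are called hdp3 and hdp4 (hypothetical names)
#   - GNU sed, and passwordless scp between the nodes
FLINK_HOME="${FLINK_HOME:-/opt/flink-1.7.2}"
JOBMANAGER_HOST="hdp2"
OTHER_HOSTS="hdp3 hdp4"

# Step 1): point jobmanager.rpc.address at the JobManager node.
# $1 = path to flink-conf.yaml, $2 = hostname
set_jobmanager_address() {
  sed -i "s/^jobmanager\.rpc\.address:.*/jobmanager.rpc.address: $2/" "$1"
}

# Step 2): write the TaskManager hosts, one per line.
# $1 = path to the slaves file, remaining arguments = hostnames
write_slaves() {
  local file="$1"
  shift
  printf '%s\n' "$@" > "$file"
}

# Step 3): copy the configured Flink directory to the other machines.
distribute_flink() {
  local host
  for host in $OTHER_HOSTS; do
    scp -r "$FLINK_HOME" "$host:$(dirname "$FLINK_HOME")/"
  done
}

# Step 4): start the cluster; run this on the JobManager node only.
start_flink() {
  "$FLINK_HOME/bin/start-cluster.sh"
}
```

On hdp2 the helpers would be called in order: set_jobmanager_address "$FLINK_HOME/conf/flink-conf.yaml" "$JOBMANAGER_HOST", then write_slaves "$FLINK_HOME/conf/slaves" "$JOBMANAGER_HOST" $OTHER_HOSTS, then distribute_flink, and finally start_flink.
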
The Flink configuration file rarely needs changes, but you can adjust it when necessary. The annotated file below explains what each setting does.

#==============================================================================
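# Basic settings. Flink is usually run with YARN as the resource scheduler,
# because Flink's built-in resource management is costly to work with:
# it does not cap the resources of an individual job; it only limits how many tasks can run in parallel on the cluster, by means of task slots.
# Those tasks share all the resources of the TaskManager node. The web UI shows the resources of every TaskManager node, and Flink simply uses everything that a server registered as a TaskManager has.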
#==============================================================================

# Node that runs the JobManager
jobmanager.rpc.address: hdp2

# RPC port of the JobManager
jobmanager.rpc.port: 6123

# JVM heap size of the JobManager process, used when this node runs the JobManager
jobmanager.heap.size: 1024m

# Same as above, but for the TaskManager
taskmanager.heap.size: 1024m

# Number of task slots this node offers when it runs a TaskManager
taskmanager.numberOfTaskSlots: 1

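# Default parallelism of a job. It determines how many parallel subtasks a job runs, and therefore how many slots (roughly, cores) it occupies. It is usually set in code, so the default here can be left alone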
parallelism.default: 1

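# File system against which paths without an explicit scheme in your job are resolved. The default is the local file system, 'file:///'; you can set it to, for example, 'hdfs://mynamenode:12345'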
# fs.default-scheme

#==============================================================================
# High-availability settings
#==============================================================================

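# Component that high availability relies on. The only options are 'NONE' and 'zookeeper'. The default is 'NONE'; the commented sample below switches it to zookeeper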
# high-availability: zookeeper

# Where the high-availability metadata is stored. It must be a durable path; the sample value below points to hdfs:///flink/ha/ on HDFS
# Possible storage systems: HDFS, S3, Ceph, nfs, ...
# high-availability.storageDir: hdfs:///flink/ha/

# Address of the ZooKeeper ensemble
# high-availability.zookeeper.quorum: localhost:2181

# ACL options are based on https://zookeeper.apache.org/doc/r3.1.2/zookeeperProgrammers.html#sc_BuiltinACLSchemes
# It can be either "creator" (ZOO_CREATE_ALL_ACL) or "open" (ZOO_OPEN_ACL_UNSAFE)
# The default value is "open" and it can be changed to "creator" if ZK security is enabled
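# For personal use this is normally left at the default, open. It controls whether access to the high-availability metadata is further restricted; if you set any other value, consult the official Flink and ZooKeeper documentation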
# high-availability.zookeeper.client.acl: open

#==============================================================================
# Cluster fault tolerance. These are cluster-wide defaults; a job can override them in code
#==============================================================================

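# State backend used to store state and checkpoints when your job takes checkpoints
# Options: 'jobmanager', 'filesystem', 'rocksdb', or the <class-name-of-factory>.
# The sample value below is filesystem; if the key is left unset, Flink falls back to the jobmanager (in-memory) backend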
# state.backend: filesystem

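# Directory where checkpoints are stored, used for fault tolerance and state recovery.
# state.checkpoints.dir: hdfs://namenode-host:port/flink-checkpoints

# Directory where savepoints are stored. Savepoints let users manage job state manually and restore state across versions of a job.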
# state.savepoints.dir: hdfs://namenode-host:port/flink-checkpoints

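# If the state backend is rocksdb, i.e. RocksDBStateBackend, this can be set to true
# With it enabled, each checkpoint stores only the state changes since the previous checkpoint (incrementally) instead of a full copy of the state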
# state.backend.incremental: false

#==============================================================================
# Web frontend, i.e. the built-in UI
#==============================================================================

# Listen address. Usually left unchanged; the UI starts together with the JobManager
#web.address: 0.0.0.0

# Port
rest.port: 8081

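# Whether jobs can be submitted and managed through the web UI. Submission is enabled by default (true), which is why you can still submit from the web UI while this line stays commented out; uncomment it (value false) to disable web submission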
#web.submit.enable: false

#==============================================================================
# Other advanced settings
#==============================================================================

# Directories for the temporary files of submitted jobs. Under YARN, Flink follows YARN's configuration, so nothing needs to be set here
# io.tmp.dirs: /tmp

# Specify whether TaskManager's managed memory should be allocated when starting
# up (true) or when memory is requested.
#
# We recommend to set this value to 'true' only in setups for pure batch
# processing (DataSet API). Streaming setups currently do not use the TaskManager's
# managed memory: The 'rocksdb' state backend uses RocksDB's own memory management,
# while the 'memory' and 'filesystem' backends explicitly keep data as objects
# to save on serialization cost.
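# Whether the TaskManager's managed memory is allocated up front at startup (true) or only when it is requested (false). Flink recommends true only for pure batch processing
# Streaming jobs do not use managed memory: the memory and filesystem backends keep data as objects to save serialization cost, and RocksDB has its own memory management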
# taskmanager.memory.preallocate: false

# The classloading resolve order. Possible values are 'child-first' (Flink's default)
# and 'parent-first' (Java's default).
#
# Child first classloading allows users to use different dependency/library
# versions in their application than those in the classpath. Switching back
# to 'parent-first' may help with debugging dependency issues.
# Class loading order; Flink defaults to child-first
# classloader.resolve-order: child-first

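# Size of the network buffers that Flink tasks use to transfer data. By default it is 10% of the TaskManager's JVM memory
# The fraction is not absolute: the result is clamped to the minimum and maximum values below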
# taskmanager.network.memory.fraction: 0.1
# taskmanager.network.memory.min: 64mb
# taskmanager.network.memory.max: 1gb
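# Worked example (illustration only, assuming the fraction is applied to the 1024m set in taskmanager.heap.size above):
# 0.1 * 1024 MB = 102.4 MB, which lies between the 64mb minimum and the 1gb maximum, so about 102 MB would go to network buffers
# With a hypothetical 512m TaskManager, 0.1 * 512 MB = 51.2 MB falls below the minimum, so 64 MB would be used instead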

#==============================================================================
# Security credential settings for the Flink cluster
#==============================================================================

# Kerberos authentication for various components - Hadoop, ZooKeeper, and connectors -
# may be enabled in four steps:
# 1. configure the local krb5.conf file
# 2. provide Kerberos credentials (either a keytab or a ticket cache w/ kinit)
# 3. make the credentials available to various JAAS login contexts
# 4. configure the connector to use JAAS/SASL

# The below configure how Kerberos credentials are provided. A keytab will be used instead of
# a ticket cache if the keytab path and principal are set.

# security.kerberos.login.use-ticket-cache: true
# security.kerberos.login.keytab: /path/to/kerberos/keytab
# security.kerberos.login.principal: flink-user

# The configuration below defines which JAAS login contexts

# security.kerberos.login.contexts: Client,KafkaClient

#==============================================================================
# ZooKeeper security settings
#==============================================================================

# Below configurations are applicable if ZK ensemble is configured for security

# Override below configuration to provide custom ZK service name if configured
# zookeeper.sasl.service-name: zookeeper

# The configuration below must match one of the values set in "security.kerberos.login.contexts"
# zookeeper.sasl.login-context-name: Client

#==============================================================================
# History server
#==============================================================================

# fl