Setting Up Hadoop-HA Without Memorizing the Configuration [Big Data Competition, Continuously Updated]

This article walks through building a Hadoop-HA cluster inside Docker containers, covering installation and the core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml configuration, including the NameNode, YARN, and ZooKeeper settings needed for high availability and service-state monitoring.


Training and Coaching

I am well versed in the knowledge needed for big data competitions and familiar with the competition environment, and have coached N+ vocational colleges. Feel free to reach out on WeChat: 18233661567

Environment

Contestants are required to build a Hadoop-HA (high availability) cluster inside Docker containers. The server runs CentOS 7 with Docker installed; Docker hosts three CentOS 7 containers named bigdata1, bigdata2, and bigdata3, which can be entered directly with the docker exec command (an example follows).
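For example, a standard way to get a shell inside bigdata1 (assuming bash is available in the image, which it is on stock CentOS 7 images):

docker exec -it bigdata1 /bin/bash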

Installation Procedure

This task is the focus and the hardest part of the whole setup: it requires both NameNode HA and YARN HA, which means a large number of configuration properties, and memorizing them all is slow and error-prone. So we take a shortcut!
The Hadoop distribution ships a static HTML copy of its documentation under /opt/module/hadoop-3.1.3/share/doc/hadoop. We copy that documentation to the host machine (CentOS in the competition), then from the host to our client machine (Ubuntu in the competition), and fill in the configuration by reading it.
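A minimal sketch of that copy, run on the host once the Hadoop tarball has been unpacked inside bigdata1 (step 2 below); the client hostname and destination path here are placeholders:

docker cp bigdata1:/opt/module/hadoop-3.1.3/share/doc/hadoop ./hadoop-doc
scp -r ./hadoop-doc user@ubuntu-client:/home/user/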

  1. Copy the relevant packages from the host into the bigdata1 container:
docker cp hadoop-3.1.3.tar.gz bigdata1:/opt/software/
docker cp jdk-8u212-linux-x64.tar.gz bigdata1:/opt/software/
docker cp apache-zookeeper-3.5.7-bin.tar.gz bigdata1:/opt/software/
  2. Enter the bigdata1 container and extract the packages to the target directory:
tar -zxvf /opt/software/hadoop-3.1.3.tar.gz -C /opt/module/
tar -zxvf /opt/software/jdk-8u212-linux-x64.tar.gz -C /opt/module/
tar -zxvf /opt/software/apache-zookeeper-3.5.7-bin.tar.gz -C /opt/module/
  3. Configure the environment variables and set up the ZooKeeper component

Edit /etc/profile to configure the run-as user for each daemon:

[root@bigdata1 hadoop]# vim /etc/profile

export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
export HDFS_JOURNALNODE_USER=root
export HDFS_ZKFC_USER=root

Also edit /etc/profile to add the environment variables for the three components extracted above; the exact lines are omitted here, but a sketch follows.
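A minimal sketch of those variables, assuming the extraction paths used above and the JDK tarball's default directory name (append to /etc/profile, then run source /etc/profile):

export JAVA_HOME=/opt/module/jdk1.8.0_212
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export ZOOKEEPER_HOME=/opt/module/apache-zookeeper-3.5.7-bin
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$ZOOKEEPER_HOME/bin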

The ZooKeeper setup steps are omitted here (a sketch follows); the setup is successful when zkServer.sh status reports leader on one node and follower on the others.
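A minimal sketch of that setup on bigdata1, assuming the paths above; the zoo.cfg values are the usual defaults. Repeat on bigdata2 and bigdata3, writing 2 and 3 into myid respectively, then start ZooKeeper on every node:

cd /opt/module/apache-zookeeper-3.5.7-bin
mkdir -p data
echo 1 > data/myid
cat > conf/zoo.cfg <<'EOF'
tickTime=2000
initLimit=10
syncLimit=5
clientPort=2181
dataDir=/opt/module/apache-zookeeper-3.5.7-bin/data
server.1=bigdata1:2888:3888
server.2=bigdata2:2888:3888
server.3=bigdata3:2888:3888
EOF
zkServer.sh start
zkServer.sh status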

  4. Open the offline documentation mentioned above by opening index.html in the Google Chrome browser.

Fill in the configuration files by following the documentation:


[root@bigdata1 /]# cd /opt/module/hadoop-3.1.3/etc/hadoop/
[root@bigdata1 hadoop]# vim core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/opt/module/hadoop-3.1.3/data/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>ha.zookeeper.quorum</name>
  <description>
    A list of ZooKeeper server addresses, separated by commas, that are
    to be used by the ZKFailoverController in automatic failover.
  </description>
        <value>bigdata1:2181,bigdata2:2181,bigdata3:2181</value>
</property>

</configuration>


[root@bigdata1 hadoop]# vim hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
  <description>
    Comma-separated list of nameservices.
  </description>
</property>
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>bigdata1:9868</value>
  <description>
    The secondary namenode http server address and port.
  </description>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>
    sshfence
    shell(/bin/true)
  </value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
  <description>
    The prefix for a given nameservice, contains a comma-separated
    list of namenodes for a given nameservice (eg EXAMPLENAMESERVICE).

    Unique identifiers for each NameNode in the nameservice, delimited by
    commas. This will be used by DataNodes to determine all the NameNodes
    in the cluster. For example, if you used "mycluster" as the nameservice
    ID previously, and you wanted to use "nn1" and "nn2" as the individual
    IDs of the NameNodes, you would configure a property
    dfs.ha.namenodes.mycluster, and its value "nn1,nn2".
  </description>
</property>

<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>bigdata1:8020</value>
  <description>
    RPC address that handles all clients requests. In the case of HA/Federation where multiple namenodes exist,
    the name service id is added to the name e.g. dfs.namenode.rpc-address.ns1
    dfs.namenode.rpc-address.EXAMPLENAMESERVICE
    The value of this property will take the form of nn-host1:rpc-port. The NameNode's default RPC port is 8020.
  </description>
</property>

<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>bigdata2:8020</value>
  <description>
    RPC address that handles all clients requests. In the case of HA/Federation where multiple namenodes exist,
    the name service id is added to the name e.g. dfs.namenode.rpc-address.ns1
    dfs.namenode.rpc-address.EXAMPLENAMESERVICE
    The value of this property will take the form of nn-host1:rpc-port. The NameNode's default RPC port is 8020.
  </description>
</property>

<property>
  <name>dfs.namenode.http-address.mycluster.nn1</name>
  <value>bigdata1:9870</value>
  <description>
    The address and the base port where the dfs namenode web ui will listen on.
  </description>
</property>

<property>
  <name>dfs.namenode.http-address.mycluster.nn2</name>
  <value>bigdata2:9870</value>
  <description>
    The address and the base port where the dfs namenode web ui will listen on.
  </description>
</property>

<property>
  <name>dfs.namenode.shared.edits.dir</name>
 <value>qjournal://bigdata1:8485;bigdata2:8485;bigdata3:8485/mycluster</value>
  <description>A directory on shared storage between the multiple namenodes
  in an HA cluster. This directory will be written by the active and read
  by the standby in order to keep the namespaces synchronized. This directory
  does not need to be listed in dfs.namenode.edits.dir above. It should be
  left empty in a non-HA cluster.
  </description>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
  <description>
    Whether automatic failover is enabled. See the HDFS High
Availability documentation for details on automatic HA
    configuration.
  </description>
</property>
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/opt/module/hadoop-3.1.3/data/journal/</value>
  <description>
    The directory where the journal edit files are stored.
  </description>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
</configuration>
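Note: the sshfence method above, as well as start-all.sh later, relies on passwordless SSH between the nodes. A minimal sketch on bigdata1, assuming sshd is running in the containers and root logins are allowed (repeat on bigdata2 so both NameNodes can fence each other):

[root@bigdata1 /]# ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
[root@bigdata1 /]# ssh-copy-id root@bigdata1
[root@bigdata1 /]# ssh-copy-id root@bigdata2
[root@bigdata1 /]# ssh-copy-id root@bigdata3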

[root@bigdata1 hadoop]# vim yarn-site.xml

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <description>Enable RM high-availability. When enabled,
      (1) The RM starts in the Standby mode by default, and transitions to
      the Active mode when prompted to.
      (2) The nodes in the RM ensemble are listed in
      yarn.resourcemanager.ha.rm-ids
      (3) The id of each RM either comes from yarn.resourcemanager.ha.id
      if yarn.resourcemanager.ha.id is explicitly specified or can be
      figured out by matching yarn.resourcemanager.address.{id} with local address
      (4) The actual physical addresses come from the configs of the pattern
      - {rpc-config}.{id}</description>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>


  <property>
    <description>Name of the cluster. In a HA setting,
      this is used to ensure the RM participates in leader
      election for this cluster and ensures it does not affect
      other clusters</description>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>yarn-cluster</value>
  </property>

  <property>
    <description>The list of RM nodes in the cluster when HA is
      enabled. See description of yarn.resourcemanager.ha
      .enabled for full details on how this is used.</description>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>

  <property>
    <description>The hostname of the RM.</description>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>bigdata1</value>
  </property>
<property>
    <description>The hostname of the RM.</description>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>bigdata2</value>
  </property>
  <property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>bigdata1:2181,bigdata2:2181,bigdata3:2181</value>
</property>

<property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
</property>
</configuration>
                          

[root@bigdata1 hadoop]# vim mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
        <property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>The runtime framework for executing MapReduce jobs.
  Can be one of local, classic or yarn.
  </description>
</property>

</configuration>

[root@bigdata1 hadoop]# vim workers

bigdata1
bigdata2
bigdata3

  5. Distribute the components (Hadoop first via scp; the other pieces the remaining nodes need are sketched after these commands)
[root@bigdata1 hadoop]# scp -r /opt/module/hadoop-3.1.3 bigdata2:/opt/module/
[root@bigdata1 hadoop]# scp -r /opt/module/hadoop-3.1.3 bigdata3:/opt/module/
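bigdata2 and bigdata3 also need the JDK and the /etc/profile changes (ZooKeeper distribution is part of the ZooKeeper step above). A minimal sketch, assuming the same paths and the JDK directory name used earlier:

[root@bigdata1 hadoop]# scp -r /opt/module/jdk1.8.0_212 bigdata2:/opt/module/
[root@bigdata1 hadoop]# scp -r /opt/module/jdk1.8.0_212 bigdata3:/opt/module/
[root@bigdata1 hadoop]# scp /etc/profile bigdata2:/etc/profile
[root@bigdata1 hadoop]# scp /etc/profile bigdata3:/etc/profile

Run source /etc/profile on each node afterwards.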
  6. Start the cluster

Initialize the cluster

Start the journalnode service on each of the three nodes:

[root@bigdata1 /]# hadoop-daemon.sh start journalnode
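The same command must also be run on the other two nodes:

[root@bigdata2 /]# hadoop-daemon.sh start journalnode
[root@bigdata3 /]# hadoop-daemon.sh start journalnode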

On the first NameNode we configured (nn1 on bigdata1), format and start the NameNode:

[root@bigdata1 /]# hdfs namenode -format
[root@bigdata1 /]# hadoop-daemon.sh start namenode

On the other NameNode node (bigdata2), run the following command to synchronize it with the active NameNode's metadata:

[root@bigdata2 /]# hdfs namenode -bootstrapStandby

Next, start all remaining services with start-all.sh and force one of the NameNodes into the active state:
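[root@bigdata1 /]# start-all.sh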

[root@bigdata1 /]# hdfs haadmin -transitionToActive --forcemanual nn1

Once that is done, check nn1's state: Active means it is serving as the active NameNode, Standby means it is on standby:

[root@bigdata1 /]# hdfs haadmin -getServiceState nn1

Start the ZKFC service, which monitors the NameNode processes and handles automatic failover.

Initialize the HA state in ZooKeeper:

[root@bigdata1 /]# hdfs zkfc -formatZK

Start the service on both NameNode nodes:

[root@bigdata1 /]# hadoop-daemon.sh start zkfc
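And on the second NameNode:

[root@bigdata2 /]# hadoop-daemon.sh start zkfc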

Check the cluster state

Check the NameNode states:

[root@bigdata1 /]# hdfs haadmin -getAllServiceState

Check the ResourceManager states:

[root@bigdata1 /]# yarn rmadmin -getAllServiceState
  7. Result

Full marks are awarded once all of the services started above are present.
