- Problem description: Connected to hiveserver2 with beeline, loaded JSON-format data from HDFS into a temporary table, and then ran insert into ... select ... to move the data into the formal table. Hive failed at that last step with: ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
The full error output:
INFO : Completed compiling command(queryId=cluster_20210705153434_96f4ae92-7b40-4138-802d-61ac670caf24); Time taken: 0.809 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=cluster_20210705153434_96f4ae92-7b40-4138-802d-61ac670caf24): insert into table huayao_dwd partition(date = '2021-07-04', dev) select *, devnum from huayao_dwd_tmp
INFO : Query ID = cluster_20210705153434_96f4ae92-7b40-4138-802d-61ac670caf24
INFO : Total jobs = 3
INFO : Launching Job 1 out of 3
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : Starting Job = job_1625469997771_0001, Tracking URL = http://cdh-01:8088/proxy/application_1625469997771_0001/
INFO : Kill Command = /home/cluster/app/hadoop/bin/hadoop job -kill job_1625469997771_0001
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO : 2021-07-05 15:34:42,389 Stage-1 map = 0%, reduce = 0%
INFO : 2021-07-05 15:35:06,883 Stage-1 map = 100%, reduce = 0%
ERROR : Ended Job = job_1625469997771_0001 with errors
ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
INFO : Total MapReduce CPU Time Spent: 0 msec
INFO : Completed executing command(queryId=cluster_20210705153434_96f4ae92-7b40-4138-802d-61ac670caf24); Time taken: 35.146 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)
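For reference, here is a minimal sketch of the flow described above, driven through beeline. Only the table names huayao_dwd / huayao_dwd_tmp, the devnum column, the partition spec and the SerDe jar come from this post; the JDBC URL, HDFS path and column list are illustrative assumptions.
# Minimal sketch of the flow, run from the shell; URL, path and columns are assumptions
BEELINE="beeline -u jdbc:hive2://cdh-01:10000 -n cluster"
# Temporary table backed by the openx JSON SerDe shipped in json-serde-1.3.8-jar-with-dependencies.jar
$BEELINE -e "CREATE TABLE huayao_dwd_tmp (no string, devnum string, dat_001 int) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'"
# Moving the raw JSON files into the table is a pure file operation, no MapReduce job runs here
$BEELINE -e "LOAD DATA INPATH '/data/huayao/2021-07-04' INTO TABLE huayao_dwd_tmp"
# The insert ... select is what launches a MapReduce job, and this is the step that failed
$BEELINE -e "INSERT INTO TABLE huayao_dwd PARTITION (date = '2021-07-04', dev) SELECT *, devnum FROM huayao_dwd_tmp"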
- Attempt 1: I prepared a smaller dataset and ran the same flow; it succeeded, so I suspected the full dataset was too large and the allocated memory was insufficient.
When Hive executes complex statements this error can appear because, with hiveserver2 as the server and beeline as the client, Hive by default gives the beeline client only 256 MB of memory; if that is too small, the statement fails with the error above. Attempted fix: edit the Hadoop environment file and raise the client memory allocation, as follows:
vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export HADOOP_CLIENT_OPTS="-Xmx2g $HADOOP_CLIENT_OPTS"
But this did not help.
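Before ruling this out, a sanity check worth doing (my own suggestion, not part of the original troubleshooting) is to confirm that the -Xmx2g setting actually reaches the beeline JVM:
# With beeline open in another terminal, list Java processes together with their JVM arguments
# and check whether the -Xmx value from HADOOP_CLIENT_OPTS shows up on the BeeLine process
jps -lvm | grep -i beeline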
- Attempt 2: I then realized the small dataset was itself flawed: it was not actually in JSON format, so there was no controlled comparison, and it is not surprising that the fix above had no effect. I went to the YARN UI to inspect the failed application, but found nothing useful there:
The failure diagnostics were worthless, but since the failure happens during the MapReduce stage, and Hadoop ships with a history server that records job logs, enabling that service should expose the detailed failure logs.
- Attempt 3: Start the JobHistory service:
./mr-jobhistory-daemon.sh start historyserver
Then open cdh-01:19888 in a browser, rerun a failing job, and open that job's detail page:
Open its diagnostics page:
Open its logs page:
There, finally, was the detailed and precise cause of the failure: the JSON dependency jar could not be found!
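As a side note, when log aggregation is enabled on the cluster, the same container logs can usually be pulled from the command line with the application id shown in the Tracking URL, without going through the 19888 web UI:
# Dump the aggregated container logs of the failed job (application id taken from the Tracking URL above)
yarn logs -applicationId application_1625469997771_0001 | less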
- Solution:
$HIVE_HOME/lib already contains the JSON jar, which is why the load step did not fail and the data really did land in the temporary table. The insert from the temporary table into the formal table, however, triggers a MapReduce job, so json-serde-1.3.8-jar-with-dependencies.jar has to be copied to $HADOOP_HOME/share/hadoop/mapreduce on every server, because there is no telling which NodeManager node the job will run on:
For example, in two consecutive runs it landed once on node 05 and once on node 06, so every machine needs the jar. A distribution sketch follows below.
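A minimal distribution sketch, assuming the NodeManager hosts are cdh-02 through cdh-06 (only cdh-01 and the "05"/"06" nodes are named in this post), that passwordless ssh is available, and that $HIVE_HOME and $HADOOP_HOME point to the same paths on every machine:
# Put the JSON SerDe jar on the MapReduce classpath of every NodeManager node
for host in cdh-02 cdh-03 cdh-04 cdh-05 cdh-06; do
  scp $HIVE_HOME/lib/json-serde-1.3.8-jar-with-dependencies.jar \
      $host:$HADOOP_HOME/share/hadoop/mapreduce/
done
# Do not forget the local node either
cp $HIVE_HOME/lib/json-serde-1.3.8-jar-with-dependencies.jar $HADOOP_HOME/share/hadoop/mapreduce/
A per-session alternative would be to run ADD JAR on the beeline connection, which ships the jar to the MapReduce tasks through the distributed cache; copying it into the mapreduce directory simply makes it available to every job without any per-session setup.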
Finally, the statement ran successfully:
0: jdbc:hive2://cdh-01:2181,cdh-02:2181,cdh-0> insert into table huayao_dwd partition(date = '2021-07-03', dev) select *, devnum from huayao_dwd_tmp;
INFO : Compiling command(queryId=cluster_20210705162323_f7ab6298-e989-4d44-9108-ad082cb7454c): insert into table huayao_dwd partition(date = '2021-07-03', dev) select *, devnum from huayao_dwd_tmp
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:huayao_dwd_tmp.no, type:string, comment:null), FieldSchema(name:huayao_dwd_tmp.time, type:string, comment:null), FieldSchema(name:huayao_dwd_tmp.batch, type:string, comment:null), FieldSchema(name:huayao_dwd_tmp.pattern, type:string, comment:null), FieldSchema(name:huayao_dwd_tmp.pronum, type:string, comment:null), FieldSchema(name:huayao_dwd_tmp.code, type:string, comment:null), FieldSchema(name:huayao_dwd_tmp.devnum, type:string, comment:null), FieldSchema(name:huayao_dwd_tmp.dat_001, type:int, comment:null), FieldSchema(name:huayao_dwd_tmp.dat_002, type:int, comment:null), FieldSchema(name:huayao_dwd_tmp.dat_003, type:int, comment:null), FieldSchema(name:huayao_dwd_tmp.dat_004, type:int, comment:null), FieldSchema(name:huayao_dwd_tmp.dat_005, type:int, comment:null), FieldSchema(na