Hive:join

最新推荐文章于 2025-06-19 18:32:26 发布

原创最新推荐文章于 2025-06-19 18:32:26 发布 · 298 阅读

0 ·

CC 4.0 BY-SA版权

文章标签：

#hive

hive 专栏收录该内容

4 篇文章

订阅专栏

本文深入解析Hive中的Join操作，包括内联、外联、半联及交叉联结的使用方法，探讨了优化技巧如小表加载内存、MapJoin及分桶连接，以及执行流程与注意事项。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

Hive: Join

hive join的连接方法如下：
join_table:
    table_reference [INNER] JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition
  | table_reference CROSS JOIN table_reference [where_condition] (as of Hive 0.10)
 
table_reference:
    table_factor
  | join_table
 
table_factor:
    tbl_name [alias]
  | table_subquery alias
  | ( table_references )

4类：
[inner] join (默认)
left right full [outer] join (左连、右连、全连)
left semi join (select a.id from a where id in (select id from b where b.id = a.id))
cross join （如果没有where条件，会产生笛卡尔积）

Vesion 2.2.0+:复杂的表达式在on子句中
不支持不等式条件
join_condition:
    ON equality_expression ( AND equality_expression )*

        equality_expression:
            expression = expression

Version 0.13.0+:隐式join表示法，默认采用inner join 
select * 
  from table1 t1 ,table2 t2,table3 t3
 where t1.id = t2.id and t2.id = t3.id and t1.zipcode = '0255';

 Version 0.13.0+:唯一行引用,如果两个表的字段不一样时，可以直接在on中出现
 create table a(k1 string,v1 string);
 create table b(k2 string,v2 string);
 select k1,v1,k2,v2
   from a join b on k1 = k2;

join需要注意点：

1.复杂的join表达式是允许的
select a.* from a join b on (a.id = b.id)
select a.* from a join b on (a.id = b.id and a.department = b.department)
select a.* from a left outer join b on (a.id<>b.id)

2.超过2张表join在一个查询中
select a.val,b.val,c.val from a.join b on (a.key = b.key1) join c on (c.key = b.key2);

3.hive把多个表的join转换成一个map/reduce job如果每一个表相同的行用
在join语句中b.key1在两个on中
select a.val,b.val,c.val from a join b on (a.key = b.key1) join c on (c.key = b.key1)

4. 由于key1用在第一个join条件中，key2用在第二个join中，会生成二个map/reduce job
select a.val,b.val,c.val from a join b on (a.key = b.key1) join c on (c.key = b.key2)
执行流程
第一个是 a join b ==> result
第二个是 resutl join c ==> finally result 

5.在map/reduce的每个join阶段,最后一张表的数据是缓缓的流向reducers(其他的表是被缓存的)。因此，通
过把最的表放在最后，能减少reduce的内存去缓存符合特定join的key小表放前面，大表放后面
select a.val,b.val,c.val from a join b on (a.key = b.key1) join c on (c.key = b.key1)
生成一个map/reduce ,符合表a , b join key的值被缓存在reducers的内存中，
然后，每一行从表c 获取的，这个连接采用buffer中的rows。类似
select a.val,b.val,c.val from a join b on (a.key = b.key1) join c on (c.key = b.key2)

6.指定那个表采用stream（缓缓的流向）方式连接
select /*+ STREAMTABLE(a) */ a.val,b.val,c.val from a join b on (a.key = b.key1) join c on (c.key = b.key1)

7.连接方式
    a left join b 以a为基础表
    a right join b 以b为基础表
    a full outer join b 以a、b为基础表

8.执行流程
select a.val,b.valu from a left join b on (a.key = b.key)
where a.ds = '2009-07-07' and b.ds = '20009-07-07';
a left join b  ==> where 
优化
select a.val,b.val from a left outer join b 
on (a.key = b.key and b.ds = '2009-07-07' and a.ds = '2009-07-07')
a b where ==> join

*note
表是从左向右连接的，不管是left join 或 right join 
select a.val,a.val2,b.val,c.val
from a 
join b on (a.key = b.key)
left outer join c on (a.key = b.key)

9.left semi join 
select a.key,b.value
from a 
where a.key in (
select b.key from b);
等于
select a.key,a.val from a left semi join b on a.key = b.key;

10.小表加载到内存 不适用full / right outer join  MAPJOIN(tn)tn需要加载到内存的表名
除一个表以外都是小表join，这个表可能采用一个mapper job 
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true
如果表是分连接是排序和为join 列分桶，并且他们有相同的buckets数，将会采用sort-merge join的方式。相应的桶是join在mapper
select /*+ MAPJOIN(b) */ a.key,b.val from a join b on a.key = b.key

join总结：

1.连接的方式
join 内联
left / right / fulle outer join 左 右 外 联结
left semi join 包含联结
cross join 交叉联结

2.表的连接方式都是从左到右

3.小大表连接
    3.1 小表在前，大表在后
    3.2 通过STREAMTABLE(tablename)指定大表名
    3.3 mapjoin(tablename)指定小表名，把小表加载到内存，再进行连接操作

4.大大表连接
    4.1 在on子句中采用过滤的方式，减少数据量
    4.2 分桶操作，连接的表采用相同的分桶与排序方式，开启分桶优化
        set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
        set hive.optimize.bucketmapjoin = true;
        set hive.optimize.bucketmapjoin.sortedmerge = true