Hive:join

本文深入解析Hive中的Join操作,包括内联、外联、半联及交叉联结的使用方法,探讨了优化技巧如小表加载内存、MapJoin及分桶连接,以及执行流程与注意事项。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

Hive: Join

hive join的连接方法如下:
join_table:
    table_reference [INNER] JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition
  | table_reference CROSS JOIN table_reference [where_condition] (as of Hive 0.10)
 
table_reference:
    table_factor
  | join_table
 
table_factor:
    tbl_name [alias]
  | table_subquery alias
  | ( table_references )

4类:
[inner] join (默认)
left right full [outer] join (左连、右连、全连)
left semi join (select a.id from a where id in (select id from b where b.id = a.id))
cross join (如果没有where条件,会产生笛卡尔积)

Vesion 2.2.0+:复杂的表达式在on子句中
不支持不等式条件
join_condition:
    ON equality_expression ( AND equality_expression )*

        equality_expression:
            expression = expression

Version 0.13.0+:隐式join表示法,默认采用inner join 
select * 
  from table1 t1 ,table2 t2,table3 t3
 where t1.id = t2.id and t2.id = t3.id and t1.zipcode = '0255';

 Version 0.13.0+:唯一行引用,如果两个表的字段不一样时,可以直接在on中出现
 create table a(k1 string,v1 string);
 create table b(k2 string,v2 string);
 select k1,v1,k2,v2
   from a join b on k1 = k2;

join需要注意点:

1.复杂的join表达式是允许的
select a.* from a join b on (a.id = b.id)
select a.* from a join b on (a.id = b.id and a.department = b.department)
select a.* from a left outer join b on (a.id<>b.id)

2.超过2张表join在一个查询中
select a.val,b.val,c.val from a.join b on (a.key = b.key1) join c on (c.key = b.key2);

3.hive把多个表的join转换成一个map/reduce job如果每一个表相同的行用
在join语句中b.key1在两个on中
select a.val,b.val,c.val from a join b on (a.key = b.key1) join c on (c.key = b.key1)

4. 由于key1用在第一个join条件中,key2用在第二个join中,会生成二个map/reduce job
select a.val,b.val,c.val from a join b on (a.key = b.key1) join c on (c.key = b.key2)
执行流程
第一个是 a join b ==> result
第二个是 resutl join c ==> finally result 

5.在map/reduce的每个join阶段,最后一张表的数据是缓缓的流向reducers(其他的表是被缓存的)。因此,通
过把最的表放在最后,能减少reduce的内存去缓存符合特定join的key小表放前面,大表放后面
select a.val,b.val,c.val from a join b on (a.key = b.key1) join c on (c.key = b.key1)
生成一个map/reduce ,符合表a , b join key的值被缓存在reducers的内存中,
然后,每一行从表c 获取的,这个连接采用buffer中的rows。类似
select a.val,b.val,c.val from a join b on (a.key = b.key1) join c on (c.key = b.key2)

6.指定那个表采用stream(缓缓的流向)方式连接
select /*+ STREAMTABLE(a) */ a.val,b.val,c.val from a join b on (a.key = b.key1) join c on (c.key = b.key1)

7.连接方式
    a left join b 以a为基础表
    a right join b 以b为基础表
    a full outer join b 以a、b为基础表

8.执行流程
select a.val,b.valu from a left join b on (a.key = b.key)
where a.ds = '2009-07-07' and b.ds = '20009-07-07';
a left join b  ==> where 
优化
select a.val,b.val from a left outer join b 
on (a.key = b.key and b.ds = '2009-07-07' and a.ds = '2009-07-07')
a b where ==> join

*note
表是从左向右连接的,不管是left join 或 right join 
select a.val,a.val2,b.val,c.val
from a 
join b on (a.key = b.key)
left outer join c on (a.key = b.key)

9.left semi join 
select a.key,b.value
from a 
where a.key in (
select b.key from b);
等于
select a.key,a.val from a left semi join b on a.key = b.key;

10.小表加载到内存 不适用full / right outer join  MAPJOIN(tn)tn需要加载到内存的表名
除一个表以外都是小表join,这个表可能采用一个mapper job 
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true
如果表是分连接是排序和为join 列分桶,并且他们有相同的buckets数,将会采用sort-merge join的方式。相应的桶是join在mapper
select /*+ MAPJOIN(b) */ a.key,b.val from a join b on a.key = b.key

join总结:

1.连接的方式
join 内联
left / right / fulle outer join 左 右 外 联结
left semi join 包含联结
cross join 交叉联结

2.表的连接方式都是从左到右

3.小大表连接
    3.1 小表在前,大表在后
    3.2 通过STREAMTABLE(tablename)指定大表名
    3.3 mapjoin(tablename)指定小表名,把小表加载到内存,再进行连接操作

4.大大表连接
    4.1 在on子句中采用过滤的方式,减少数据量
    4.2 分桶操作,连接的表采用相同的分桶与排序方式,开启分桶优化
        set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
        set hive.optimize.bucketmapjoin = true;
        set hive.optimize.bucketmapjoin.sortedmerge = true

 

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值