Hive: Join
hive join的连接方法如下:
join_table:
table_reference [INNER] JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
| table_reference CROSS JOIN table_reference [where_condition] (as of Hive 0.10)
table_reference:
table_factor
| join_table
table_factor:
tbl_name [alias]
| table_subquery alias
| ( table_references )
4类:
[inner] join (默认)
left right full [outer] join (左连、右连、全连)
left semi join (select a.id from a where id in (select id from b where b.id = a.id))
cross join (如果没有where条件,会产生笛卡尔积)
Vesion 2.2.0+:复杂的表达式在on子句中
不支持不等式条件
join_condition:
ON equality_expression ( AND equality_expression )*
equality_expression:
expression = expression
Version 0.13.0+:隐式join表示法,默认采用inner join
select *
from table1 t1 ,table2 t2,table3 t3
where t1.id = t2.id and t2.id = t3.id and t1.zipcode = '0255';
Version 0.13.0+:唯一行引用,如果两个表的字段不一样时,可以直接在on中出现
create table a(k1 string,v1 string);
create table b(k2 string,v2 string);
select k1,v1,k2,v2
from a join b on k1 = k2;
join需要注意点:
1.复杂的join表达式是允许的
select a.* from a join b on (a.id = b.id)
select a.* from a join b on (a.id = b.id and a.department = b.department)
select a.* from a left outer join b on (a.id<>b.id)
2.超过2张表join在一个查询中
select a.val,b.val,c.val from a.join b on (a.key = b.key1) join c on (c.key = b.key2);
3.hive把多个表的join转换成一个map/reduce job如果每一个表相同的行用
在join语句中b.key1在两个on中
select a.val,b.val,c.val from a join b on (a.key = b.key1) join c on (c.key = b.key1)
4. 由于key1用在第一个join条件中,key2用在第二个join中,会生成二个map/reduce job
select a.val,b.val,c.val from a join b on (a.key = b.key1) join c on (c.key = b.key2)
执行流程
第一个是 a join b ==> result
第二个是 resutl join c ==> finally result
5.在map/reduce的每个join阶段,最后一张表的数据是缓缓的流向reducers(其他的表是被缓存的)。因此,通
过把最的表放在最后,能减少reduce的内存去缓存符合特定join的key小表放前面,大表放后面
select a.val,b.val,c.val from a join b on (a.key = b.key1) join c on (c.key = b.key1)
生成一个map/reduce ,符合表a , b join key的值被缓存在reducers的内存中,
然后,每一行从表c 获取的,这个连接采用buffer中的rows。类似
select a.val,b.val,c.val from a join b on (a.key = b.key1) join c on (c.key = b.key2)
6.指定那个表采用stream(缓缓的流向)方式连接
select /*+ STREAMTABLE(a) */ a.val,b.val,c.val from a join b on (a.key = b.key1) join c on (c.key = b.key1)
7.连接方式
a left join b 以a为基础表
a right join b 以b为基础表
a full outer join b 以a、b为基础表
8.执行流程
select a.val,b.valu from a left join b on (a.key = b.key)
where a.ds = '2009-07-07' and b.ds = '20009-07-07';
a left join b ==> where
优化
select a.val,b.val from a left outer join b
on (a.key = b.key and b.ds = '2009-07-07' and a.ds = '2009-07-07')
a b where ==> join
*note
表是从左向右连接的,不管是left join 或 right join
select a.val,a.val2,b.val,c.val
from a
join b on (a.key = b.key)
left outer join c on (a.key = b.key)
9.left semi join
select a.key,b.value
from a
where a.key in (
select b.key from b);
等于
select a.key,a.val from a left semi join b on a.key = b.key;
10.小表加载到内存 不适用full / right outer join MAPJOIN(tn)tn需要加载到内存的表名
除一个表以外都是小表join,这个表可能采用一个mapper job
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true
如果表是分连接是排序和为join 列分桶,并且他们有相同的buckets数,将会采用sort-merge join的方式。相应的桶是join在mapper
select /*+ MAPJOIN(b) */ a.key,b.val from a join b on a.key = b.key
join总结:
1.连接的方式
join 内联
left / right / fulle outer join 左 右 外 联结
left semi join 包含联结
cross join 交叉联结
2.表的连接方式都是从左到右
3.小大表连接
3.1 小表在前,大表在后
3.2 通过STREAMTABLE(tablename)指定大表名
3.3 mapjoin(tablename)指定小表名,把小表加载到内存,再进行连接操作
4.大大表连接
4.1 在on子句中采用过滤的方式,减少数据量
4.2 分桶操作,连接的表采用相同的分桶与排序方式,开启分桶优化
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true