When joining many tables in Hive, the SQL optimizer is still not very strong, so a different join order can easily produce a different number of MapReduce jobs. In that case you have to analyze the execution plan and adjust the SQL by hand to reach the minimum number of jobs. Here is an example (based on Hive 0.6.0):

create table ljn1(k1 bigint, k2 string, v1 int);
create table ljn2(k1 bigint, v2 int);
create table ljn3(k1 bigint, v3 int);
create table ljn4(k1 bigint, v4 int);
create table ljn5(k1 bigint, v5 int);
create table ljn6(k2 string, v6 int);

Now look at the execution plan of the following query:

explain
select a.v1
from ljn1 a
left outer join ljn2 b on (a.k1 = b.k1)
left outer join ljn3 c on (a.k1 = c.k1)
left outer join ljn4 d on (a.k1 = d.k1)
left outer join ljn6 e on (a.k2 = e.k2)
left outer join ljn5 f on (a.k1 = f.k1);

STAGE DEPENDENCIES:
  Stage-5 is a root stage
  Stage-1 depends on stages: Stage-5
  Stage-2 depends on stages: Stage-1
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-5
    Map Reduce
      Alias -> Map Operator Tree:
        a
          TableScan
            alias: a
            Reduce Output Operator
              key expressions:
                    expr: k1
                    type: bigint
              sort order: +
              Map-reduce partition columns:
                    expr: k1
                    type: bigint
              tag: 0
              value expressions:
                    expr: k1
                    type: bigint
                    expr: k2
                    type: string
                    expr: v1
                    type: int
        b
          TableScan
            alias: b
            Reduce Output Operator
              key expressions:
                    expr: k1
                    type: bigint
              sort order: +
              Map-reduce partition columns:
                    expr: k1
                    type: bigint
              tag: 1
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
            1
          handleSkewJoin: false
          outputColumnNames: _col0, _col1, _col2
          File Output Operator
            compressed: true
            GlobalTableId: 0
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        $INTNAME
            Reduce Output Operator
              key expressions:
                    expr: _col0
                    type: bigint
              sort order: +
              Map-reduce partition columns:
                    expr: _col0
                    type: bigint
              tag: 0
              value expressions:
                    expr: _col1
                    type: string
                    expr: _col2
                    type: int
        c
          TableScan
            alias: c
            Reduce Output Operator
              key expressions:
                    expr: k1
                    type: bigint
              sort order: +
              Map-reduce partition columns:
                    expr: k1
                    type: bigint
              tag: 1
        d
          TableScan
            alias: d
            Reduce Output Operator
              key expressions:
                    expr: k1
                    type: bigint
              sort order: +
              Map-reduce partition columns:
                    expr: k1
                    type: bigint
              tag: 2
        f
          TableScan
            alias: f
            Reduce Output Operator
              key expressions:
                    expr: k1
                    type: bigint
              sort order: +
              Map-reduce partition columns:
                    expr: k1
                    type: bigint
              tag: 3
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
               Left Outer Join0 to 2
               Left Outer Join0 to 3
          condition expressions:
            0 {VALUE._col3} {VALUE._col4}
            1
            2
            3
          handleSkewJoin: false
          outputColumnNames: _col3, _col4
          File Output Operator
            compressed: true
            GlobalTableId: 0
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        $INTNAME
            Reduce Output Operator
              key expressions:
                    expr: _col3
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: _col3
                    type: string
              tag: 0
              value expressions:
                    expr: _col4
                    type: int
        e
          TableScan
            alias: e
            Reduce Output Operator
              key expressions:
                    expr: k2
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: k2
                    type: string
              tag: 1
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col10}
            1
          handleSkewJoin: false
          outputColumnNames: _col10
          Select Operator
            expressions:
                  expr: _col10
                  type: int
            outputColumnNames: _col0
            File Output Operator
              compressed: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

On the face of it this query is very simple: a is the driving table, and it is left-outer-joined to the other tables on only two join keys, k1 and k2, so two MapReduce jobs ought to be enough. Yet the execution plan produces three jobs (Stage-0 merely fetches the final result and can be ignored): the first job (Stage-5) joins a with b; the second job (Stage-1) joins the intermediate result of the first job with c, d and f; the third job (Stage-2) joins the intermediate result of the second job with e.
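Since Hive can evaluate consecutive joins that share the same join key inside a single MapReduce job (as Stage-1 shows for c, d and f), one plausible adjustment, sketched below with the same tables and columns as above, is to list every k1 join before the single k2 join so that no k1 join is separated from the others. This is only a sketch; whether it really collapses the plan should be confirmed by running explain again on your Hive version.

-- Sketch: keep all joins on k1 adjacent and push the lone k2 join (ljn6) to the end,
-- so the k1 joins can share one shuffle; verify the resulting plan with explain.
explain
select a.v1
from ljn1 a
left outer join ljn2 b on (a.k1 = b.k1)
left outer join ljn3 c on (a.k1 = c.k1)
left outer join ljn4 d on (a.k1 = d.k1)
left outer join ljn5 f on (a.k1 = f.k1)
left outer join ljn6 e on (a.k2 = e.k2);

The expectation, to be checked against the new plan, is one job joining a with b, c, d and f on k1, one job joining that intermediate result with e on k2, plus the trivial fetch stage.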