When joining many tables in Hive, the SQL optimizer is still not very strong, so a different join order can easily produce a different number of MapReduce jobs. In that case you have to analyze the execution plan and adjust the SQL by hand to reach the minimum number of jobs. Here is an example (based on Hive 0.6.0):

create table ljn1(k1 bigint, k2 string, v1 int);
create table ljn2(k1 bigint, v2 int);
create table ljn3(k1 bigint, v3 int);
create table ljn4(k1 bigint, v4 int);
create table ljn5(k1 bigint, v5 int);
create table ljn6(k2 string, v6 int);

Now look at the execution plan of the following query:

explain
select a.v1
from ljn1 a
left outer join ljn2 b on (a.k1 = b.k1)
left outer join ljn3 c on (a.k1 = c.k1)
left outer join ljn4 d on (a.k1 = d.k1)
left outer join ljn6 e on (a.k2 = e.k2)
left outer join ljn5 f on (a.k1 = f.k1);

STAGE DEPENDENCIES:
  Stage-5 is a root stage
  Stage-1 depends on stages: Stage-5
  Stage-2 depends on stages: Stage-1
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-5
    Map Reduce
      Alias -> Map Operator Tree:
        a
          TableScan
            alias: a
            Reduce Output Operator
              key expressions:
                    expr: k1
                    type: bigint
              sort order: +
              Map-reduce partition columns:
                    expr: k1
                    type: bigint
              tag: 0
              value expressions:
                    expr: k1
                    type: bigint
                    expr: k2
                    type: string
                    expr: v1
                    type: int
        b
          TableScan
            alias: b
            Reduce Output Operator
              key expressions:
                    expr: k1
                    type: bigint
              sort order: +
              Map-reduce partition columns:
                    expr: k1
                    type: bigint
              tag: 1
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
            1
          handleSkewJoin: false
          outputColumnNames: _col0, _col1, _col2
          File Output Operator
            compressed: true
            GlobalTableId: 0
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        $INTNAME
            Reduce Output Operator
              key expressions:
                    expr: _col0
                    type: bigint
              sort order: +
              Map-reduce partition columns:
                    expr: _col0
                    type: bigint
              tag: 0
              value expressions:
                    expr: _col1
                    type: string
                    expr: _col2
                    type: int
        c
          TableScan
            alias: c
            Reduce Output Operator
              key expressions:
                    expr: k1
                    type: bigint
              sort order: +
              Map-reduce partition columns:
                    expr: k1
                    type: bigint
              tag: 1
        d
          TableScan
            alias: d
            Reduce Output Operator
              key expressions:
                    expr: k1
                    type: bigint
              sort order: +
              Map-reduce partition columns:
                    expr: k1
                    type: bigint
              tag: 2
        f
          TableScan
            alias: f
            Reduce Output Operator
              key expressions:
                    expr: k1
                    type: bigint
              sort order: +
              Map-reduce partition columns:
                    expr: k1
                    type: bigint
              tag: 3
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
               Left Outer Join0 to 2
               Left Outer Join0 to 3
          condition expressions:
            0 {VALUE._col3} {VALUE._col4}
            1
            2
            3
          handleSkewJoin: false
          outputColumnNames: _col3, _col4
          File Output Operator
            compressed: true
            GlobalTableId: 0
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        $INTNAME
            Reduce Output Operator
              key expressions:
                    expr: _col3
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: _col3
                    type: string
              tag: 0
              value expressions:
                    expr: _col4
                    type: int
        e
          TableScan
            alias: e
            Reduce Output Operator
              key expressions:
                    expr: k2
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: k2
                    type: string
              tag: 1
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col10}
            1
          handleSkewJoin: false
          outputColumnNames: _col10
          Select Operator
            expressions:
                  expr: _col10
                  type: int
            outputColumnNames: _col0
            File Output Operator
              compressed: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

On the face of it this query is very simple: a is the driving table, and it is left-outer-joined to the other tables on only two join keys, k1 and k2, so two MapReduce jobs ought to be enough. Yet the execution plan produces three jobs (Stage-0 merely fetches the final result and can be ignored): the first job (Stage-5) joins a with b; the second job (Stage-1) joins the intermediate result of the first job with c, d and f; the third job (Stage-2) joins the intermediate result of the second job with e.
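Since Hive can evaluate consecutive joins that share the same join key inside a single MapReduce job (as Stage-1 shows for c, d and f), one plausible adjustment, sketched below with the same tables and columns as above, is to list every k1 join before the single k2 join so that no k1 join is separated from the others. This is only a sketch; whether it really collapses the plan should be confirmed by running explain again on your Hive version.

-- Sketch: keep all joins on k1 adjacent and push the lone k2 join (ljn6) to the end,
-- so the k1 joins can share one shuffle; verify the resulting plan with explain.
explain
select a.v1
from ljn1 a
left outer join ljn2 b on (a.k1 = b.k1)
left outer join ljn3 c on (a.k1 = c.k1)
left outer join ljn4 d on (a.k1 = d.k1)
left outer join ljn5 f on (a.k1 = f.k1)
left outer join ljn6 e on (a.k2 = e.k2);

The expectation, to be checked against the new plan, is one job joining a with b, c, d and f on k1, one job joining that intermediate result with e on k2, plus the trivial fetch stage.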