Hadoop Hello World: Sorting

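Time to get started with Hadoop. Hadoop: The Definitive Guide is a good book (download link below), but it is not a good introduction. It opens with a pile of Hadoop architecture, which is unsettling for a beginner: reading about architecture without having written a single line of Hadoop code never feels solid. After searching around, the only demo I could find was WordCount. Luckily Xia, from my lab, recommended chapter 9 of the "细细品味Hadoop" series written by 虾皮工作室. It contains several demos that go from simple to complex. I recommend it, and I would like to thank the author here.

Hadoop: The Definitive Guide [PDF]: http://www.linuxidc.com/Linux/2012-01/51182.htm

A small rant: Hadoop's official documentation should learn from the DirectX SDK documentation. That one has demos ranging from simple to complex, the later ones implement quite a few classic papers and look very cool, and the explanations are thorough enough that working through them one by one makes you feel you are on solid ground and improving every day. Hadoop's official documentation is rather bare by comparison.

This demo does a simple sort. Once you have worked through WordCount, you know that between the map function and reduce there is a shuffle and sort phase, which sorts the keys that map outputs. We use that phase to sort our own data. The idea is simple: in the map phase, emit each number as the key, with any value you like; in the reduce phase, write the keys received from the map phase straight out. To make it more interesting, add a variable count that records the rank of each key.

The input data can be:

//data1.txt:
123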
12
87
150
22
23423
9874
9876

//data2.txt:
29347
9877
27985
98776
989
767
2345
1532
8702
8702

The full code is as follows:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DataSort {

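    // Emits each input number as the key so that the shuffle phase sorts it.
    public static class SortMapper extends Mapper<Object, Text, IntWritable, IntWritable> {

        // Placeholder value; only the key matters for sorting.
        private final IntWritable one = new IntWritable(1);

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Each input line holds one integer.
            context.write(new IntWritable(Integer.parseInt(value.toString().trim())), one);
        }
    }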

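    // Keys arrive already sorted; writes each one together with its rank.
    public static class SortReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

        // Running rank of the current key. It is only a global rank when the
        // job runs with a single reduce task, which is Hadoop's default.
        private final IntWritable count = new IntWritable(0);

        @Override
        public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // One output line per occurrence, so duplicate numbers are kept.
            for (IntWritable val : values) {
                context.write(key, count);
                count.set(count.get() + 1);
            }
        }
    }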

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // On Hadoop 2 and later, Job.getInstance(conf, "DataSort") replaces this deprecated constructor.
        Job job = new Job(conf, "DataSort");
        job.setJarByClass(DataSort.class);

        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);

        // Map and reduce share the same output types, so setting them once is enough.
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);

        // args[0] is the input directory, args[1] the output directory.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The final output:

12 0
22 1
87 2
123 3
150 4
767 5
989 6
1532 7
2345 8
8702 9
8702 10
9874 11
9876 12
9877 13
23423 14
27985 15
29347 16
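98776 17

Finally, a silly mistake of mine worth sharing. Mapper and Reducer are ordinary classes, not abstract ones: their default map and reduce methods simply pass records through unchanged, so the compiler does not force you to override them. I accidentally named my reduce method reducer, so my code was never called, the default pass-through reduce ran in its place, and the output was always just the intermediate results of map. Luckily Xia came over and spotted the mistake. From now on, put @Override on these methods so that the compiler reports the error if you misspell a name.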
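To see why the misspelled method fails silently, here is a minimal sketch in plain Java with no Hadoop dependency. BaseReducer, TypoReducer, FixedReducer and run are hypothetical names made up for this illustration; BaseReducer only imitates the pass-through default of Hadoop's Reducer and is not the real class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OverrideDemo {

    // Hypothetical stand-in for Hadoop's Reducer: the default reduce()
    // passes every value through unchanged.
    static class BaseReducer {
        public void reduce(int key, List<Integer> values, List<String> out) {
            for (int v : values) {
                out.add(key + " " + v);
            }
        }
    }

    // The method is misspelled as "reducer", so it is a new, unrelated
    // method instead of an override, and nothing ever calls it.
    static class TypoReducer extends BaseReducer {
        // Putting @Override on this method would make compilation fail,
        // because no method named reducer exists in BaseReducer.
        public void reducer(int key, List<Integer> values, List<String> out) {
            out.add(key + " custom");
        }
    }

    // Correctly spelled, and checked by the compiler thanks to @Override.
    static class FixedReducer extends BaseReducer {
        @Override
        public void reduce(int key, List<Integer> values, List<String> out) {
            out.add(key + " custom");
        }
    }

    // Plays the role of the framework, which only ever calls reduce().
    static List<String> run(BaseReducer r) {
        List<String> out = new ArrayList<>();
        r.reduce(12, Arrays.asList(1), out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(new TypoReducer()));   // prints [12 1]
        System.out.println(run(new FixedReducer()));  // prints [12 custom]
    }
}
```

TypoReducer compiles and runs without complaint, yet its output is the untouched map-style record, which matches the symptom described above. With @Override in place, the same typo is caught at compile time.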