The new Configuration problem in Mahout algorithms

2015-10-04. Versions: Hadoop 2.4 + Mahout 0.9

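When a web application calls Mahout algorithms on the cluster, the job sometimes fails because a path cannot be found. One example is the following method in the class org.apache.mahout.clustering.classify.ClusterClassifier: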

public void readFromSeqFiles(Configuration conf, Path path) throws IOException {
  // A fresh Configuration is created here; the conf argument is not used for reading.
  Configuration config = new Configuration();
  List<Cluster> clusters = Lists.newArrayList();
  for (ClusterWritable cw : new SequenceFileDirValueIterable<ClusterWritable>(path, PathType.LIST,
      PathFilters.logsCRCFilter(), config)) {
    Cluster cluster = cw.getValue();
    // The conf argument is only used to configure each cluster.
    cluster.configure(conf);
    clusters.add(cluster);
  }
  this.models = clusters;
  modelClass = models.get(0).getClass().getName();
  this.policy = readPolicy(path);
}

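This method is used in the setup function of CIMapper. If the k-means algorithm is invoked from a web application, the job fails at this point because the path to the cluster centers cannot be found. The cause is that readFromSeqFiles reads with a Configuration it creates itself via new Configuration(), while the path passed in is /path/to/center rather than a fully qualified one such as hdfs://host:port/path/to/center.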

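The error appears when the Job is submitted from a web application (or directly from a main method in Eclipse on Windows). When the Job is submitted from a terminal (tested on the NameNode host; other nodes were not tested), the path can be read.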
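To see why an unqualified path such as /path/to/center is a problem, here is a minimal stdlib-only sketch of how a path without a scheme is resolved against a default filesystem URI. It is not Hadoop code: the class PathQualifySketch, the method qualify, and the address hdfs://namenode:8020 are hypothetical, and the sketch assumes that a Configuration without cluster settings falls back to Hadoop's built-in default filesystem, file:///.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Stdlib-only sketch (not Hadoop code): resolves a path without a scheme
// against a default filesystem URI. All names here are hypothetical.
class PathQualifySketch {

    // Returns the path unchanged if it already carries a scheme; otherwise
    // borrows scheme and authority from the default filesystem URI.
    static URI qualify(URI defaultFs, String path) {
        URI p = URI.create(path);
        if (p.getScheme() != null) {
            return p;
        }
        try {
            return new URI(defaultFs.getScheme(), defaultFs.getAuthority(),
                    p.getPath(), null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String centers = "/path/to/center";
        // Assumed default when no cluster settings are loaded.
        URI local = URI.create("file:///");
        // Hypothetical cluster address, for illustration only.
        URI cluster = URI.create("hdfs://namenode:8020");

        // Same path string, two different filesystems.
        System.out.println(qualify(local, centers).getScheme());  // file
        System.out.println(qualify(cluster, centers));            // hdfs://namenode:8020/path/to/center
    }
}
```

The same string ends up on the local filesystem in one case and on HDFS in the other, so the centers are only found when the Configuration used for reading knows about the cluster, or when the path itself is fully qualified.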

Solutions:

1. Fix the cluster in code: right after Configuration config = new Configuration(); add a config.set(...) call that sets the cluster.

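This still requires modifying the source code, and if the cluster changes, the class has to be recompiled and uploaded to every node of the cluster again.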

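2. Pass the Configuration into the method and actually use it. readFromSeqFiles above already takes a Configuration parameter, yet it still creates a new one for reading; I do not understand why the Mahout source is written this way.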

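This approach requires changing how the calling classes invoke the method, but if the cluster changes, there is no need to recompile and repackage these classes or upload them to every node; it is enough to set the cluster when submitting the Job.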
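The trade-off between the two solutions can be modeled with a small stdlib-only sketch. A plain Map stands in for Hadoop's Configuration so the code runs without a cluster; the class ConfigFixSketch, its methods, and both cluster addresses are hypothetical. The key fs.defaultFS is the Hadoop 2.x property for the default filesystem, and file:/// is assumed as the value a fresh Configuration has when no cluster settings are loaded.

```java
import java.util.HashMap;
import java.util.Map;

// Stdlib-only model of the two fixes; a Map stands in for Hadoop's
// Configuration. All names and addresses here are hypothetical.
class ConfigFixSketch {
    static final String FS_KEY = "fs.defaultFS";

    // Stand-in for `new Configuration()` without cluster settings (assumption).
    static Map<String, String> newConfiguration() {
        Map<String, String> c = new HashMap<>();
        c.put(FS_KEY, "file:///");
        return c;
    }

    // Original behaviour: the caller's conf is ignored for reading.
    static String fsUsedByOriginal(Map<String, String> conf) {
        Map<String, String> config = newConfiguration();
        return config.get(FS_KEY);
    }

    // Solution 1: still creates its own config, then pins the cluster in source.
    static String fsUsedByFix1(Map<String, String> conf) {
        Map<String, String> config = newConfiguration();
        config.put(FS_KEY, "hdfs://namenode:8020"); // hard-coded, needs a rebuild to change
        return config.get(FS_KEY);
    }

    // Solution 2: reads with the configuration supplied by the caller.
    static String fsUsedByFix2(Map<String, String> conf) {
        return conf.get(FS_KEY);
    }

    public static void main(String[] args) {
        // The submitter points the job at a different cluster.
        Map<String, String> jobConf = newConfiguration();
        jobConf.put(FS_KEY, "hdfs://other-cluster:8020");

        System.out.println(fsUsedByOriginal(jobConf)); // file:///
        System.out.println(fsUsedByFix1(jobConf));     // hdfs://namenode:8020
        System.out.println(fsUsedByFix2(jobConf));     // hdfs://other-cluster:8020
    }
}
```

Only the second variant follows whatever the submitter sets, which is why a cluster change needs no rebuild there, while the first variant keeps pointing at the address compiled into the class.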

Author: fansy1990 (CSDN blog)