Indexing and Searching Chinese Files with Lucene 3.0.0

2011-01-22, 博客园 (cnblogs), LeftNotEasy

1. My original program

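My original program is actually quite simple: it is adapted directly from SearchFiles and IndexFiles in the Lucene demo. The only difference is that it uses the SmartCN analyzer for word segmentation.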

Here are the parts I changed.

IndexChinese.java:

Date start = new Date();
try {
   // Same as the demo, except that SmartChineseAnalyzer replaces StandardAnalyzer
   IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR),
           new SmartChineseAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
   indexDocs(writer, docDir);
   System.out.println("Indexing to directory '" + INDEX_DIR + "'...");
   System.out.println("Optimizing...");
   // writer.optimize();
   writer.close();

   Date end = new Date();
   System.out.println(end.getTime() - start.getTime() + " total milliseconds");

} catch (IOException e) {
   System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
}
SearchChinese.java:
Analyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_CURRENT);

BufferedReader in = null;
if (queries != null) {
   in = new BufferedReader(new FileReader(queries));
} else {
   in = new BufferedReader(new InputStreamReader(System.in, "GBK"));
}

Here I specified that queries typed on the console are decoded as GBK.
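As a rough sketch of what that charset argument does, the snippet below feeds the GBK bytes of a Chinese query through the same `BufferedReader` / `InputStreamReader` wrapping, once with the matching charset and once with a mismatched one. The class name `ConsoleCharsetDemo` and the helper `readLine` are made up for this illustration, a byte array stands in for `System.in`, and it assumes the JVM ships the GBK charset (standard JDKs do).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

class ConsoleCharsetDemo {
    // Hypothetical helper: reads one line from raw bytes, decoding them with
    // the named charset, the same way SearchChinese.java wraps System.in.
    static String readLine(byte[] rawBytes, String charsetName) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(rawBytes), charsetName));
        try {
            return in.readLine();
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        String query = "\u4e2d\u6587";        // the two characters of "中文"
        byte[] typed = query.getBytes("GBK"); // what a GBK console would send

        // Matching charset: the query survives the round trip.
        System.out.println(query.equals(readLine(typed, "GBK")));   // prints true
        // Mismatched charset: the bytes decode to different characters.
        System.out.println(query.equals(readLine(typed, "UTF-8"))); // prints false
    }
}
```

The decoder only produces the intended characters when the charset name matches the encoding that actually produced the bytes. With a mismatch, the analyzer is handed different characters from the ones that were typed.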

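Then I ran it, full of confidence... and found that Chinese terms could not be retrieved at all, while searching for the English words in the same files worked fine.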

2. Finding the problem

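At this point I was stumped. I am not very familiar with either Java or Lucene, and there is not much discussion of version 3.0.0 out there, so I fumbled around for a while. Eventually I found that if I re-saved the files as ANSI (which on Chinese-language Windows means GBK; they had been UTF-8 before), Chinese search worked. So it looked like a file-encoding problem. After some digging, I found the following code in IndexChinese.java: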

static void indexDocs(IndexWriter writer, File file)
   throws IOException {
   // do not try to index files that cannot be read
   if (file.canRead()) {
     if (file.isDirectory()) {
       String[] files = file.list();
       // an IO error could occur
       if (files != null) {
         for (int i = 0; i < files.length; i++) {
           indexDocs(writer, new File(file, files[i]));
         }
       }
     } else {
       System.out.println("adding " + file);
       try {
         writer.addDocument(FileDocument.Document(file));
       }
       // at least on Windows, some temporary files raise this exception with an "access denied" message
       // checking if the file can be read doesn't help
       catch (FileNotFoundException fnfe) {
         ;
       }
     }
   }