Indexing and Searching Chinese Files with Lucene 3.0.0

2011-01-22  博客园 (cnblogs)  LeftNotEasy

1. My original program

My original program was actually quite simple: it was adapted entirely from the SearchFiles and IndexFiles classes in the Lucene demo. The only difference is that it uses the SmartCN analyzer. Here are the parts I changed.

IndexChinese.java:

    Date start = new Date();
    try {
        IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR),
                new SmartChineseAnalyzer(Version.LUCENE_CURRENT),
                true, IndexWriter.MaxFieldLength.LIMITED);
        indexDocs(writer, docDir);
        System.out.println("Indexing to directory '" + INDEX_DIR + "'...");
        System.out.println("Optimizing...");
        //writer.optimize();
        writer.close();

        Date end = new Date();
        System.out.println(end.getTime() - start.getTime() + " total milliseconds");
    }

SearchChinese.java:

    Analyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_CURRENT);

    BufferedReader in = null;
    if (queries != null) {
        in = new BufferedReader(new FileReader(queries));
    } else {
        in = new BufferedReader(new InputStreamReader(System.in, "GBK"));
    }

Here I specified that queries typed on the console are read as GBK. Then I ran the program, full of confidence... and found that Chinese terms could not be found at all, while English terms in the same files were found normally.

2. Finding the problem

That left me frustrated. I am not very familiar with either Java or Lucene, and there is not much discussion of version 3.0.0 out there, so I fumbled around for a while. Eventually I found that if I re-saved the files as ANSI (they had been UTF-8; on a Simplified Chinese Windows system, ANSI typically means GBK), Chinese searches worked. So it looked like a file-encoding problem. After some digging, I found the following code in IndexChinese.java:

    static void indexDocs(IndexWriter writer, File file) throws IOException {
        // do not try to index files that cannot be read
        if (file.canRead()) {
            if (file.isDirectory()) {
                String[] files = file.list();
                // an IO error could occur
                if (files != null) {
                    for (int i = 0; i < files.length; i++) {
                        indexDocs(writer, new File(file, files[i]));
                    }
                }
            } else {
                System.out.println("adding " + file);
                try {
                    writer.addDocument(FileDocument.Document(file));
                }
                // at least on windows, some temporary files raise this exception with an "access denied" message
                // checking if the file can be read doesn't help
                catch (FileNotFoundException fnfe) {
                    ;
                }
            }
        }
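
The symptom above (UTF-8 files fail, ANSI files work, English always works) is what you would expect if the file bytes are decoded with a charset other than the one they were saved in. The following is a minimal sketch of that effect using only the JDK, with no Lucene involved. The class name CharsetReadDemo and its helper methods are hypothetical, made up for illustration, and whether the demo's FileDocument really reads files with the platform default charset is an assumption here, not something shown so far.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

// Hypothetical demo class: not part of Lucene or of the original program.
class CharsetReadDemo {

    // Reads the whole file, decoding its bytes with the named charset.
    static String readAll(File file, String charsetName) throws IOException {
        Reader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charsetName));
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } finally {
            reader.close();
        }
    }

    // Writes the text to a temporary file, encoding it with the named charset.
    static File writeTemp(String text, String charsetName) throws IOException {
        File file = File.createTempFile("charset-demo", ".txt");
        file.deleteOnExit();
        Writer writer = new OutputStreamWriter(new FileOutputStream(file), charsetName);
        try {
            writer.write(text);
        } finally {
            writer.close();
        }
        return file;
    }

    public static void main(String[] args) throws IOException {
        // "lucene" followed by four Chinese characters, written as Unicode
        // escapes so the source file's own encoding does not matter.
        String text = "lucene \u4e2d\u6587\u68c0\u7d22";
        File utf8File = writeTemp(text, "UTF-8");

        // Decoding with the charset the file was saved in recovers the text.
        System.out.println(text.equals(readAll(utf8File, "UTF-8")));
        // Decoding the same bytes as GBK yields different characters.
        System.out.println(text.equals(readAll(utf8File, "GBK")));
        // The ASCII part survives either way, which matches English
        // searches working while Chinese ones failed.
        System.out.println(readAll(utf8File, "GBK").startsWith("lucene "));
    }
}
```

If that assumption holds, one possible fix is to open each file with an explicit charset, for example new InputStreamReader(new FileInputStream(file), "UTF-8"), instead of a reader that relies on the platform default.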