Getting Started with Lucene (Part 1)

Date: 2007-07-29  Source: sdwsyjp

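Getting started

As mentioned earlier, Lucene is only a component, not a complete application, so getting it to run requires some development on top of it.

Step 1: Download and install

(1) Download Lucene from its official website, http://jakarta.apache.org/lucene/. The download is an archive named lucene-1.4-final.zip. Unpack it and you will find a file named lucene-1.4-final.jar, which is the Lucene library itself. To use Lucene in a project, simply put lucene-1.4-final.jar on the classpath. The other files in the archive are for reference only.

Step 2: Using Lucene in a project

(1) I created a project in Eclipse that uses Lucene to create a full-text index, load records into it and query them. The finished project contains three source files:

CreateDataBase.java: creates the index
InsertRecords.java: loads records into the index
QueryRecords.java: searches the index

Each of the three files is analysed below.

Creating the index: source and notes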
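A quick way to confirm that the jar really is on the classpath is to try loading one of its classes. The sketch below uses only the JDK; the class name ClasspathCheck and the method isAvailable are illustrative names, not part of Lucene, and the code assumes Java 7 or later.

```java
/** Illustrative helper: checks whether a class can be loaded from the classpath. */
final class ClasspathCheck {

    // Returns true if the named class is visible to this class's class loader.
    static boolean isAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // IndexWriter is the Lucene class used by the listings in this article
        String probe = "org.apache.lucene.index.IndexWriter";
        if (isAvailable(probe)) {
            System.out.println(probe + " found");
        } else {
            System.out.println(probe + " NOT found; check that lucene-1.4-final.jar is on the classpath");
        }
    }
}
```

Run it with the same classpath as the project, for example java -cp .;lucene-1.4-final.jar ClasspathCheck on Windows.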

CreateDataBase.java

package com.holen.part1;

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

/**
 * @author Holen Chen
 * Initializes the index
 */
public class CreateDataBase {

    public CreateDataBase() {
    } // constructor

    public int createDataBase(File file) {
        int returnValue = 0;
        if (!file.isDirectory()) {
            file.mkdirs(); // create the directory if it does not exist
        }
        try {
            IndexWriter indexWriter = new IndexWriter(file, new StandardAnalyzer(), true);
            indexWriter.close(); // create the index
            returnValue = 1;
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        return returnValue;
    }

    /**
     * Initializes the index at the given path
     * @param file
     * @return
     */
    public int createDataBase(String file) {
        return this.createDataBase(new File(file)); // delegate to the method above
    }

    public static void main(String[] args) {
        CreateDataBase temp = new CreateDataBase();
        if (temp.createDataBase("e:\\lucene\\holendb") == 1) {
            System.out.println("db init succ");
        }
    }
}
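Because the true passed as the third constructor argument in the listing above wipes any index already stored at that path, it can help to check the directory before deciding what to pass. The following is a minimal sketch using only the JDK. It assumes that an initialized Lucene 1.4 index directory contains a file named segments (other Lucene versions may lay the directory out differently), and the names IndexDirGuard, looksLikeExistingIndex and shouldCreate are hypothetical.

```java
import java.io.File;

/** Illustrative guard against accidentally re-initializing an existing index. */
final class IndexDirGuard {

    // Assumption: an initialized Lucene 1.4 index directory holds a file named "segments".
    static boolean looksLikeExistingIndex(File dir) {
        return dir.isDirectory() && new File(dir, "segments").isFile();
    }

    // Suggested value for the third IndexWriter constructor argument:
    // create only when no index appears to be present.
    static boolean shouldCreate(File dir) {
        return !looksLikeExistingIndex(dir);
    }

    public static void main(String[] args) {
        // Default path taken from the article's main method
        File dir = new File(args.length > 0 ? args[0] : "e:\\lucene\\holendb");
        if (shouldCreate(dir)) {
            System.out.println("No index found in " + dir + "; safe to pass create=true");
        } else {
            System.out.println("Index already present in " + dir + "; pass create=false to keep its records");
        }
    }
}
```

The value returned by shouldCreate could then be passed in place of the hard-coded true, so that rerunning the initialization does not silently discard loaded records.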

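Notes: the key statement here is IndexWriter indexWriter = new IndexWriter(file,new StandardAnalyzer(),true).

The first argument is the index path, that is, where the full-text index will be stored, for example the "e:\\lucene\\holendb" set in the main method. Lucene supports multiple indexes, and each one may live in a different location.

The second argument is the analyzer. Here it is the standard analyzer that ships with Lucene. An analyzer tokenizes the full text of a document. The standard analyzer handles English and other Latin-script text (anything written with letters and separated by spaces), cutting the text into individual words essentially at the spaces. In full-text search this is called tokenization, or word segmentation, and it is one of the core techniques of full-text search. By default Lucene can segment only English and other Latin-script text; it does not provide word segmentation for double-byte scripts such as Chinese, Japanese and Korean. Chinese word segmentation is discussed in detail in later chapters.

The third argument controls whether the index is initialized. I pass true here: true creates a new index or overwrites an existing one, while false appends to an existing one. Since this is a new index it has to be initialized. After initialization the index directory contains a single file named segments, about 1 KB in size. If initialization is run on an index that already holds records, however, all of its contents are lost and it returns to its initial state, exactly as if it had just been created, so use this option with great care in real projects.

Loading records: source and notes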

InsertRecords.java

package com.holen.part1;

import java.io.File;
import java.io.FileReader;
import java.io.Reader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/**
 * @author Holen Chen
 * Loads records
 */
public class InsertRecords {

    public InsertRecords() {
    } // constructor

    public int insertRecords(String dbpath, File file) {
        int returnValue = 0;
        try {
            // false: append to the existing index instead of re-creating it
            IndexWriter indexWriter = new IndexWriter(dbpath, new StandardAnalyzer(), false);
            this.addFiles(indexWriter, file);
            returnValue = 1;
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        return returnValue;
    }

    /**
     * Loads the file with the given name
     * @param file
     * @return
     */
    public int insertRecords(String dbpath, String file) {
        return this.insertRecords(dbpath, new File(file));
    }

    public void addFiles(IndexWriter indexWriter, File file) {
        Document doc = new Document();
        try {
            doc.add(Field.Keyword("filename", file.getName()));
            // Use only one of the following two statements:
            // the first indexes without storing, the second indexes and stores
            //doc.add(Field.Text("content",new FileReader(file)));
            doc.add(Field.Text("content", this.chgFileToString(file)));
            indexWriter.addDocument(doc);
            indexWriter.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    /**
     * Reads the content of a text file
     * @param file
     * @return
     */
    public String chgFileToString(File file) {
        String returnValue = null;
        StringBuffer sb = new StringBuffer();
        char[] c = new char[4096];
        try {
            Reader reader = new FileReader(file);
            int n = 0;
            while (true) {
                n = reader.read(c);
                if (n > 0) {
                    sb.append(c, 0, n);
                } else {
                    break;
                }
            }
            reader.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        returnValue = sb.toString();
        return returnValue;
    }

    public static void main(String[] args) {
        InsertRecords temp = new InsertRecords();
        String dbpath = "e:\\lucene\\holendb";
        // holen1.txt contains the keywords "holen" and "java"
        if (temp.insertRecords(dbpath, "e:\\lucene\\holen1.txt") == 1) {
            System.out.println("add file1 succ");
        }
        // holen2.txt contains the keywords "holen" and "chen"
        if (temp.insertRecords(dbpath, "e:\\lucene\\holen2.txt") == 1) {
            System.out.println("add file2 succ");
        }
    }
}
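The chgFileToString method in the listing above reads the file with the platform default character encoding (that is what FileReader uses), leaves the reader open if read throws, and returns whatever was read so far after an error. A possible alternative is sketched below under two assumptions that are not in the original: Java 7 or later is available, and the input files are UTF-8. The names TextFileReader and readAll are illustrative.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustrative replacement for chgFileToString with explicit charset and safe closing. */
final class TextFileReader {

    // Reads the whole file into a String, decoding it with the given charset.
    static String readAll(File file, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        // try-with-resources closes the reader even when read() throws
        try (Reader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset))) {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java TextFileReader <file>");
            return;
        }
        // Assumption: the file is UTF-8 encoded
        System.out.println(readAll(new File(args[0]), StandardCharsets.UTF_8));
    }
}
```

Letting the IOException propagate, instead of printing it and carrying on, means a file that could not be read is not indexed as an empty or truncated document.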

Notes: this class has three main methods: insertRecords(String dbpath,File file), addFiles(IndexWriter indexWriter,File file) and chgFileToString(File file).

(1) chgFileToString reads a text file into a String variable.

/**
 * Reads the content of a text file
 * @param file
 * @return
 */
public String chgFileToString(File file) {
    String returnValue = null;
    StringBuffer sb = new StringBuffer();
    char[] c = new char[4096];
    try {
        Reader reader = new FileReader(file);
        int n = 0;
        while (true) {
            n = reader.read(c);
            if (n > 0) {
                sb.append(c, 0, n);
            } else {
                break;
            }
        }
        reader.close();
    } catch (Exception ex) {
        ex.printStackTrace();
    }
    returnValue = sb.toString();
    return returnValue;
}

(2) insertRecords loads one record; here that means adding a single file to the full-text index. The first argument is the index path and the second is the file to be added.

public int insertRecords(String dbpath, File file) {
    int returnValue = 0;
    try {
        IndexWriter indexWriter = new IndexWriter(dbpath, new StandardAnalyzer(), false);
        this.addFiles(indexWriter, file);
        returnValue = 1;
    } catch (Exception ex) {
        ex.printStackTrace();
    }
    return returnValue;
}

insertRecords calls addFiles.

(3) addFiles is the method that actually adds the file to the index. It contains the following key line:

doc.add(Field.Keyword("filename",file.getName()));

Note that Lucene has no tables in the strict sense. The equivalent of a table is built dynamically through the methods of the Field class. For example, Field.Keyword("filename",file.getName()) adds a field to a record: the field is named filename and its content is file.getName(). The commonly used Field methods are listed below:

Method	Tokenized	Indexed	Stored	Typical use
Field.Text(String name, String value)	Y	Y	Y	Title, article body
Field.Text(String name, Reader value)	Y	Y	N	META information
Field.Keyword(String name, String value)	N	Y	Y	Author
Field.UnIndexed(String name, String value)	N	N	Y	File path
Field.UnStored(String name, String value)	Y	Y	N	Similar to the second
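To make the three columns of the table concrete, here is a toy in-memory model written with the JDK only. It is not how Lucene is implemented: ToyIndex and all of its methods are hypothetical names, the tokenizer is a crude lower-case-and-split stand-in for a real analyzer, and the code assumes Java 8 or later. "Indexed" means the value can be found by searching, "tokenized" means it is cut into terms before indexing, and "stored" means the original value can be read back from a hit.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the tokenized / indexed / stored flags; not Lucene's implementation. */
final class ToyIndex {
    // "field:term" -> ids of the documents that contain the term in that field
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // per document: field name -> stored original value
    private final List<Map<String, String>> stored = new ArrayList<>();

    int newDocument() {
        stored.add(new HashMap<>());
        return stored.size() - 1;
    }

    void addField(int doc, String name, String value,
                  boolean tokenized, boolean indexed, boolean keepValue) {
        if (indexed) {
            // Crude tokenizer: lower-case, then split on anything that is not a letter or digit
            String[] terms = tokenized
                    ? value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")
                    : new String[] { value };
            for (String term : terms) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(name + ":" + term, k -> new TreeSet<>()).add(doc);
                }
            }
        }
        if (keepValue) {
            stored.get(doc).put(name, value);
        }
    }

    // Shortcuts that mirror the rows of the table above
    void text(int doc, String n, String v)      { addField(doc, n, v, true,  true,  true);  }
    void keyword(int doc, String n, String v)   { addField(doc, n, v, false, true,  true);  }
    void unIndexed(int doc, String n, String v) { addField(doc, n, v, false, false, true);  }
    // Same flags as Field.Text(String, Reader) in the table: searchable, not retrievable
    void unStored(int doc, String n, String v)  { addField(doc, n, v, true,  true,  false); }

    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, Collections.<Integer>emptySet());
    }

    String storedValue(int doc, String field) {
        return stored.get(doc).get(field);
    }

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex();
        int d = idx.newDocument();
        idx.keyword(d, "filename", "holen1.txt");
        idx.text(d, "content", "Holen likes Java");
        idx.unIndexed(d, "path", "e:\\lucene\\holen1.txt");

        System.out.println("content:java      -> " + idx.search("content", "java"));
        System.out.println("filename:holen1   -> " + idx.search("filename", "holen1"));
        System.out.println("filename:holen1.txt -> " + idx.search("filename", "holen1.txt"));
        System.out.println("stored path       -> " + idx.storedValue(d, "path"));
    }
}
```

In this model a keyword field matches only on its exact, whole value, a text field matches on individual lower-cased words, and an un-indexed field cannot be searched at all but its value can still be read back from a document found through another field.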

To gain a deeper understanding of the full-text index,