Lucene indexing gets stuck after about 1 million files (.nrm file keeps getting larger)

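Problem description

Does anyone know why this is happening? I'm doing basic indexing plus SAX parsing of XML files, adding each path as a new field in the document. I got to about 1.5 million files, and it has been stuck on one file for 30 minutes while the .nrm (normalization file?) gets larger and larger. I don't know why this is happening. My IndexWriter is of the form: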

writer = new IndexWriter(dir,new StandardAnalyzer(Version.LUCENE_30),IndexWriter.MaxFieldLength.UNLIMITED)

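Is this not the best choice for large indexes? Why does it freeze on this one file? I've run it several times with over 1 million XML files, and it keeps getting stuck on different XML files (not just this one, whose structure is fine).

Edit: So let's say I index the files 2000 at a time with separate java commands. After indexing completes and I call the IndexWriter close method, do I lose anything if I want to write to that index again? Should I optimize the index? I think I recall Lucene in Action saying to optimize if you won't be writing to the index for a while. Actually, this method worked for 1.8 million files, but after I tried adding more files in batches of 2000, this NRM file and another file grew to about 70GB! Why would the JVM run out of memory if the Java Lucene indexing function is only called in batches of 2000? This doesn't look like a garbage collection problem, unless you need to add something to the Lucene code before closing the index writer.

Edit 2: I have about 4 million XML files that look like this: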

<?xml version="1.0" encoding="UTF-8"?>
<person>
   <name>Candice Archie
   </name>
   <filmography>
      <direct>
         <movie>
            <title>Lucid (2006) (V)
            </title>
            <year>2006
            </year>
         </movie>
      </direct>
      <write>
         <movie>
            <title>Lucid (2006) (V)
            </title>
            <year>2006
            </year>
         </movie>
      </write>
      <edit>
         <movie>
            <title>Lucid (2006) (V)
            </title>
            <year>2006
            </year>
         </movie>
      </edit>
      <produce>
         <movie>
            <title>Lucid (2006) (V)
            </title>
            <year>2006
            </year>
         </movie>
      </produce>
   </filmography>
</person>

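I parse these XML files and add the content to a field named after the path, e.g. /person/produce/filmography/movie/title, Lucid (2006) (V). The thing is, I want statistics on the frequency of a given term across the field instances of a document, for each document in the index (and then the sum of that value over all documents). So if there are two instances of /person/produce/filmography/movie/title and both contain "Lucid", I want two. The tf(t in d) that Lucene gives would be 3 if there were another path (e.g. /person/name: Lucid), but it does not do this for terms within like-named fields in a document. The core of the Lucene indexing does this: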

public void endElement( String namespaceURI,String localName,String qName ) throws SAXException {
  if(this.ignoreTags.contains(localName)){
      ignoredArea = false;
      return;
  }
  String newContent = content.toString().trim();
  if(!empty.equals(newContent) && newContent.length()>1)
  {
      StringBuffer stb = new StringBuffer();
      for(int i=0; i<currpathname.size();i++){
      //System.out.println(i + "th iteration of loop. value:" + currpathname.get(i).toString() + k + "th call.");
      stb.append("/");
      stb.append(currpathname.get(i));
      }
      //stb.append(\"0\");
      if(big.get(stb.toString())==null){
          big.put(stb.toString(),1);
      }
      else{
          big.put(stb.toString(),big.get(stb.toString())+1);
      }
      if(map.get(stb.toString())==null){
          map.put(stb.toString(),0);
          stb.append(map.get(stb.toString())); //ADDED THIS FOR ZERO
      }
      else
      {
          map.put(stb.toString(),map.get(stb.toString())+1);
          stb.append(map.get(stb.toString()));
      }
      doc.add(new Field(stb.toString(),newContent,Field.Store.YES,Field.Index.ANALYZED));
      seenPaths.add(stb);
      //System.out.println(stb.toString()); // This will print all fields indexed for each document (separating non-unique ones, e.g. /person/name0 /person/name1)
      //System.out.println(newContent);
  }
  currpathname.pop();   
  content.delete(0,content.length()); //clear content
  //This method adds to the Lucene index the field of the unfolded Stack variable currpathname and the value in content (whitespace trimmed).

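}

`map` and `big` are hash maps (don't worry about `big`; it is used for something else). `map` is re-instantiated whenever a new XML file (Document object) is instantiated. There is an endDocument() method that adds the document after all the startElement, endElement, and characters methods have been called (these are Xerces parser methods):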

public void endDocument() throws SAXException {
  try {
      numIndexed++;
      writer.addDocument(doc);
  } catch (CorruptIndexException e) {
      e.printStackTrace();
  } catch (IOException e) {
      e.printStackTrace();
  }
}
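To make the field-naming scheme in endElement easier to follow, here is a minimal standalone sketch of the same idea. `FieldNamer` and `next` are hypothetical names used only for illustration; they mimic the per-document `map` counter from the handler above and do not touch Lucene at all.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original handler: it mimics how the
// per-document counter map turns a repeated path into numbered field names.
class FieldNamer {
    // Number of times each path has been seen in the current document.
    private final Map<String, Integer> seen = new HashMap<>();

    // Returns path + index, where index is the count of earlier occurrences.
    String next(String path) {
        int index = seen.merge(path, 1, Integer::sum) - 1;
        return path + index;
    }

    public static void main(String[] args) {
        FieldNamer namer = new FieldNamer();
        List<String> names = new ArrayList<>();
        String title = "/person/filmography/direct/movie/title";
        for (int i = 0; i < 3; i++) {
            names.add(namer.next(title)); // same path repeated three times
        }
        names.add(namer.next("/person/name"));
        System.out.println(names);
    }
}
```

Each repetition of a path inside one document produces a new field name (title0, title1, title2, ...), so the number of distinct field names in the whole index is driven by the document with the most repetitions of any path.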

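Sorry for the long post, and thanks for any help! Also, I don't think the server is the problem. I ran the code on all 4 million files at once and it ran out of heap memory even with -Xmx12000M -Xms12000M. It is a powerful server, so it can definitely handle this...

Edit 3: Hello again! Thanks, you are right. Lucene probably was not made to do this. We are actually going to run other experiments, but I think I have solved the problem with the help of your ideas and some others. First, I stopped normalizing the fields, which shrank the index many times over. I also played with the mergedocs and rambuffer methods and bumped them up. Indexing improved greatly. I will mark the question as answered with your help :) Thanks.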
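A back-of-the-envelope calculation suggests why dropping norms helped so much. The sketch below assumes the Lucene 3.x index format, where norms take one byte per document for every field that has norms enabled, whether or not the document contains that field. The field count is a hypothetical figure, not one measured from this index.

```java
// Rough estimate only. Assumption: in the Lucene 3.x index format, norms
// occupy one byte per document for every field that has norms enabled.
class NormsEstimate {
    // Approximate norms size in bytes for a fully merged index.
    static long normsBytes(long numDocs, long fieldsWithNorms) {
        return numDocs * fieldsWithNorms;
    }

    public static void main(String[] args) {
        long docs = 4_000_000L;  // roughly the corpus size in the question
        long fields = 10_000L;   // hypothetical count of distinct field names
        double gib = normsBytes(docs, fields) / (1024.0 * 1024.0 * 1024.0);
        System.out.printf("~%.1f GiB of norms%n", gib);
    }
}
```

With 4 million documents and 10,000 distinct field names this comes to 4 × 10^10 bytes, about 40 GB, which is the same order of magnitude as the 70GB reported above. In Lucene 3.0, norms can be left out per field by indexing with `Field.Index.ANALYZED_NO_NORMS` instead of `Field.Index.ANALYZED`.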

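Solution

Try indexing in batches. The code below should give you an idea of how to do it. I'd also recommend checking out the latest edition of Lucene in Action. Most likely you are overloading the garbage collector (assuming there is no hard-to-find memory leak), which will eventually give you an out-of-memory error.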

    private static final int FETCH_SIZE = 100;
    private static final int BATCH_SIZE = 1000;

    //Scrollable results will avoid loading too many objects in memory
    ScrollableResults scroll = query.scroll(ScrollMode.FORWARD_ONLY);
    int batch = 0;
    scroll.beforeFirst();
    while (scroll.next()) {
        batch++;

        index(scroll.get(0)); //index each element

        if (batch % BATCH_SIZE == 0) {
            //flushToIndexes(); //apply changes to indexes
            //optimize();
            //clear(); //free memory since the queue is processed
        }
    }
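The snippet above scrolls over database query results. For a directory of XML files, the same batching idea might look like the following plain-Java sketch. `BatchIndexer`, `indexOne`, and `flush` are hypothetical names: `indexOne` stands in for the parse-and-addDocument step, and `flush` for whatever commit call the index writer provides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical driver: 'indexOne' and 'flush' stand in for the real
// parse-and-addDocument step and for a commit on the index writer.
class BatchIndexer {
    // Indexes every file and calls flush after each full batch, plus once
    // more for a trailing partial batch. Returns the number of flushes.
    static int indexInBatches(List<String> files, int batchSize,
                              Consumer<String> indexOne, Runnable flush) {
        int flushes = 0;
        int inBatch = 0;
        for (String file : files) {
            indexOne.accept(file);
            inBatch++;
            if (inBatch == batchSize) {
                flush.run();
                flushes++;
                inBatch = 0;
            }
        }
        if (inBatch > 0) { // do not lose the last partial batch
            flush.run();
            flushes++;
        }
        return flushes;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a.xml", "b.xml", "c.xml", "d.xml", "e.xml");
        List<String> indexed = new ArrayList<>();
        int flushes = indexInBatches(files, 2, indexed::add,
                () -> System.out.println("flush after " + indexed.size() + " files"));
        System.out.println(flushes + " flushes");
    }
}
```

Keeping a single writer open and flushing periodically avoids the cost of starting a new JVM per batch, while still bounding how much is buffered in memory between flushes.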
