Pattern matching with tregex in Stanza's CoreNLP implementation doesn't seem to find the right subtrees

Problem description

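I'm fairly new to NLP and am currently trying to extract different phrase structures from German text. For this I'm using Stanza's Stanford CoreNLP client together with its tregex feature for pattern matching in trees.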

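So far I haven't had any problems, and I was able to match simple patterns like "NP" or "S > CS". Now I'm trying to match S nodes that are immediately dominated either by ROOT or by a CS node that is itself immediately dominated by ROOT. For that I use the pattern "S > (CS > TOP) | > TOP", but it doesn't seem to work properly. I'm using the following code: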

from stanza.server import CoreNLPClient

text = "Peter kommt und Paul geht."

def linguistic_units(_client, _text, _pattern):
    # Run the Tregex pattern against the parse of the text on the CoreNLP server.
    matches = _client.tregex(_text, _pattern)
    sentences = matches['sentences']
    print('+++++Tree++++')
    print(sentences[0])
    count_units = 0  # number of matched subtrees across all sentences
    for sentence in sentences:
        for match_id in sentence:
            print(sentence[match_id]['spanString'])
            count_units += 1
    return count_units


with CoreNLPClient(properties='./corenlp/StanfordCoreNLP-german.properties',
                   annotators=['tokenize','ssplit','pos','lemma','ner','parse','depparse','coref'],
                   timeout=300000, be_quiet=True,
                   endpoint='http://localhost:9001', memory='16G') as client:

    result = linguistic_units(client, text, 'S > (CS > ROOT) | > ROOT')
    print(result)

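In the example with the text "Peter kommt und Paul geht.", the pattern I use should match the two phrases "Peter kommt" and "Paul geht", but it doesn't. Afterwards I looked at the tree itself; the parser output looks like this: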

constituency parse of first sentence
child {
  child {
    child {
      child {
        child {
          value: "Peter"
        }
        value: "PROPN"
      }
      child {
        child {
          value: "kommt"
        }
        value: "VERB"
      }
      value: "S"
    }
    child {
      child {
        value: "und"
      }
      value: "CCONJ"
    }
    child {
      child {
        child {
          value: "Paul"
        }
        value: "PROPN"
      }
      child {
        child {
          value: "geht"
        }
        value: "VERB"
      }
      value: "S"
    }
    value: "CS"
  }
  child {
    child {
      value: "."
    }
    value: "PUNCT"
  }
  value: "NUR"
}
value: "ROOT"
score: 5466.83349609375

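I now suspect that this is due to the ROOT node, since it is the last node of the tree. Shouldn't the ROOT node be at the beginning of the tree? Does anyone know what I'm doing wrong?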

Solution

A few comments:

1.) Assuming you are using a recent version of CoreNLP (4.0.0+), you need to use the mwt annotator with German. So your annotators list should be tokenize,ssplit,mwt,pos,parse

2.) For clarity, here is your sentence in PTB format:

(ROOT
  (NUR
    (CS
      (S (PROPN Peter) (VERB kommt))
      (CCONJ und)
      (S (PROPN Paul) (VERB geht)))))

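As you can see, ROOT is the root node of the tree, so your pattern will not match in this sentence: both S nodes are immediately dominated by CS, but CS is immediately dominated by NUR rather than by ROOT. I personally find the PTB format easier for seeing the tree structure and for writing Tregex patterns. You can get it via the json or text output formats (instead of the serialized object). In the client request, set output_format="text"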

3.) Here is the latest documentation on using the Stanza client: https://stanfordnlp.github.io/stanza/client_properties.html
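
To see why the pattern from comment 2 cannot match, it can help to list the chain of parents above each S node. The following is a minimal sketch using only the Python standard library; it does not talk to CoreNLP at all. The helper names parse_ptb and ancestor_chains are made up for this illustration, and the tree string is copied from the PTB output above.

```python
import re

# PTB tree copied from comment 2 above.
PTB = """(ROOT
  (NUR
    (CS
      (S (PROPN Peter) (VERB kommt))
      (CCONJ und)
      (S (PROPN Paul) (VERB geht)))))"""


def parse_ptb(text):
    """Parse a bracketed tree into nested (label, children) tuples.

    Words (leaves) are kept as plain strings. Hypothetical helper,
    not part of Stanza or CoreNLP.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def ancestor_chains(tree, target, path=()):
    """Yield, for every node labelled `target`, its label followed by
    the labels of its ancestors, nearest parent first."""
    label, children = tree
    if label == target:
        yield [label] + list(reversed(path))
    for child in children:
        if isinstance(child, tuple):
            yield from ancestor_chains(child, target, path + (label,))


for chain in ancestor_chains(parse_ptb(PTB), "S"):
    print(" > ".join(chain))
```

Both S nodes come out with the chain S > CS > NUR > ROOT, so the sub-pattern "CS > ROOT" has nothing to match. Based on Tregex's relation syntax, where ">" means "immediately dominated by" and ">>" means "dominated by at any depth", a pattern such as 'S > (CS > (NUR > ROOT))' or 'S > (CS >> ROOT)' would be the natural thing to try next; I have not run either against a German CoreNLP server.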
