I'm using the Stanford dependency parser with NLTK in Python 3 to parse a sentence, and it returns a dependency graph.
import pickle
from nltk.parse.stanford import StanfordDependencyParser
parser = StanfordDependencyParser('stanford-parser-full-2015-12-09/stanford-parser.jar','stanford-parser-full-2015-12-09/stanford-parser-3.6.0-models.jar')
sentences = ["I am going there","I am asking a question"]
with open("save.p","wb") as f:
    pickle.dump(parser.raw_parse_sents(sentences),f)
AttributeError: Can't pickle local object 'DependencyGraph.__init__.<locals>.<lambda>'
I'd like to know whether there is a way, with or without pickle, to save the dependency graphs.
Best Answer
Following the instructions to get a parsed output:
1. Output the DependencyGraph in CONLL format and write it to a file
(See What is CoNLL data format? and What does the dependency-parse output of TurboParser mean?)
$export STANFORDTOOLSDIR=$HOME
$export CLASSPATH=$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-models.jar
$python
>>> from nltk.parse.stanford import StanfordDependencyParser
>>> dep_parser=StanfordDependencyParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
>>> sent = "The quick brown fox jumps over the lazy dog."
>>> output = next(dep_parser.raw_parse("The quick brown fox jumps over the lazy dog."))
>>> type(output)
<class 'nltk.parse.dependencygraph.DependencyGraph'>
>>> output.to_conll(4) # the "4" parameter just means that we want 4 columns in the CONLL format
u'The\tDT\t4\tdet\nquick\tJJ\t4\tamod\nbrown\tJJ\t4\tamod\nfox\tNN\t5\tnsubj\njumps\tVBZ\t0\troot\nover\tIN\t9\tcase\nthe\tDT\t9\tdet\nlazy\tJJ\t9\tamod\ndog\tNN\t5\tnmod\n'
>>> with open('sent.conll','w') as fout:
... fout.write(output.to_conll(4))
...
>>> exit()
$cat sent.conll
The DT 4 det
quick JJ 4 amod
brown JJ 4 amod
fox NN 5 nsubj
jumps VBZ 0 root
over IN 9 case
the DT 9 det
lazy JJ 9 amod
dog NN 5 nmod
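As a quick sanity check on the 4-column format — each line is word, POS tag, head address (1-based, with 0 pointing at an artificial root node), and dependency relation — the file contents can be read back with plain Python, no NLTK required. A stdlib-only sketch (the variable names are my own):

```python
# Read a 4-column CONLL string back with plain Python:
# each line is word, POS tag, head address (0 = root), relation.
conll = """\
The\tDT\t4\tdet
quick\tJJ\t4\tamod
brown\tJJ\t4\tamod
fox\tNN\t5\tnsubj
jumps\tVBZ\t0\troot
over\tIN\t9\tcase
the\tDT\t9\tdet
lazy\tJJ\t9\tamod
dog\tNN\t5\tnmod
"""

rows = [line.split("\t") for line in conll.splitlines()]

# The token whose head address is 0 is the root of the sentence.
root = next(word for word, tag, head, rel in rows if head == "0")
print(root)  # jumps

# Dependents of "jumps" (address 5): tokens whose head column is 5.
children = [word for word, tag, head, rel in rows if head == "5"]
print(children)  # ['fox', 'dog']
```

Because every node's head and relation survive the round trip, this plain-text format preserves the whole dependency structure, which is why it is a safe way to persist the graph.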
2. Read the CONLL file back into a DependencyGraph in NLTK
>>> from nltk.parse.dependencygraph import DependencyGraph
>>> output = DependencyGraph.load('sent.conll') # loads any CONLL file that might contain 1 or more sentences
>>> output # list of DependencyGraphs
[<DependencyGraph with 10 nodes>]
>>> output[0].nodes # nodes of the first DependencyGraph, the one we have saved
defaultdict(<function <lambda> at 0x...>,
{0: {u'ctag': u'TOP', u'head': None, u'deps': defaultdict(<class 'list'>, {u'root': [5]}), u'tag': u'TOP', u'address': 0, u'word': None, u'lemma': None, u'rel': None, u'feats': None},
 1: {u'ctag': u'DT', u'head': 4, u'deps': defaultdict(<class 'list'>, {}), u'tag': u'DT', u'address': 1, u'word': u'The', u'lemma': u'The', u'rel': u'det', u'feats': u''},
 2: {u'ctag': u'JJ', u'head': 4, u'deps': defaultdict(<class 'list'>, {}), u'tag': u'JJ', u'address': 2, u'word': u'quick', u'lemma': u'quick', u'rel': u'amod', u'feats': u''},
 3: {u'ctag': u'JJ', u'head': 4, u'deps': defaultdict(<class 'list'>, {}), u'tag': u'JJ', u'address': 3, u'word': u'brown', u'lemma': u'brown', u'rel': u'amod', u'feats': u''},
 4: {u'ctag': u'NN', u'head': 5, u'deps': defaultdict(<class 'list'>, {u'det': [1], u'amod': [2, 3]}), u'tag': u'NN', u'address': 4, u'word': u'fox', u'lemma': u'fox', u'rel': u'nsubj', u'feats': u''},
 5: {u'ctag': u'VBZ', u'head': 0, u'deps': defaultdict(<class 'list'>, {u'nsubj': [4], u'nmod': [9]}), u'tag': u'VBZ', u'address': 5, u'word': u'jumps', u'lemma': u'jumps', u'rel': u'root', u'feats': u''},
 6: {u'ctag': u'IN', u'head': 9, u'deps': defaultdict(<class 'list'>, {}), u'tag': u'IN', u'address': 6, u'word': u'over', u'lemma': u'over', u'rel': u'case', u'feats': u''},
 7: {u'ctag': u'DT', u'head': 9, u'deps': defaultdict(<class 'list'>, {}), u'tag': u'DT', u'address': 7, u'word': u'the', u'lemma': u'the', u'rel': u'det', u'feats': u''},
 8: {u'ctag': u'JJ', u'head': 9, u'deps': defaultdict(<class 'list'>, {}), u'tag': u'JJ', u'address': 8, u'word': u'lazy', u'lemma': u'lazy', u'rel': u'amod', u'feats': u''},
 9: {u'ctag': u'NN', u'head': 5, u'deps': defaultdict(<class 'list'>, {u'case': [6], u'det': [7], u'amod': [8]}), u'tag': u'NN', u'address': 9, u'word': u'dog', u'lemma': u'dog', u'rel': u'nmod', u'feats': u''}})
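As an aside, the reason the original pickle call fails at all is visible in the dump above: DependencyGraph stores its nodes in a collections.defaultdict whose default factory is a lambda defined inside __init__, and pickle cannot serialize such local functions. A minimal stdlib-only reproduction of the same failure (the Graph class here is a stand-in, not NLTK's code):

```python
import pickle
from collections import defaultdict


class Graph:
    def __init__(self):
        # Like DependencyGraph, keep nodes in a defaultdict whose
        # factory is a lambda defined inside __init__ -- a local
        # object that pickle cannot serialize by reference.
        self.nodes = defaultdict(lambda: {"address": None})


g = Graph()
try:
    pickle.dumps(g)
except AttributeError as e:
    print(e)  # Can't pickle local object 'Graph.__init__.<locals>.<lambda>'
```

This is why serializing the graph via a plain-text format like CONLL sidesteps the problem entirely: no function objects need to be stored, only the per-token rows.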
Note that the output of StanfordParser (as opposed to StanfordDependencyParser) is an nltk.tree.Tree, not a DependencyGraph (this part is included only because people have posted similar questions about trees).
$export STANFORDTOOLSDIR=$HOME
$export CLASSPATH=$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-models.jar
$python
>>> from nltk.parse.stanford import StanfordParser
>>> parser=StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
>>> list(parser.raw_parse("the quick brown fox jumps over the lazy dog"))
[Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['brown']), Tree('NN', ['fox'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['dog'])])])])])])]
>>> output = list(parser.raw_parse("the quick brown fox jumps over the lazy dog"))
>>> type(output[0])
<class 'nltk.tree.Tree'>
For an nltk.tree.Tree, you can output it as a bracketed parse string and then read the string back into a Tree object:
>>> from nltk import Tree
>>> output[0]
Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['brown']), Tree('NN', ['fox'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['dog'])])])])])])
>>> str(output[0])
'(ROOT\n (NP\n (NP (DT the) (JJ quick) (JJ brown) (NN fox))\n (NP\n (NP (NNS jumps))\n (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))))))'
>>> parsed_sent = str(output[0])
>>> type(parsed_sent)
<class 'str'>
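The read-back step promised above is nltk.Tree.fromstring(parsed_sent), which rebuilds the Tree from the bracketed string. The same idea can also be illustrated without NLTK installed: a bracketed parse is just an s-expression, and a small recursive parser recovers the nested structure. This is my own stdlib-only sketch, not NLTK's implementation:

```python
import re


def parse_sexpr(s):
    """Parse a bracketed parse string like '(NN fox)' into
    nested (label, children) tuples; leaves are plain strings."""
    # Tokenize into parentheses and runs of non-space, non-paren chars.
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())  # nested constituent
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return parse()


tree = parse_sexpr("(ROOT (NP (DT the) (NN dog)))")
print(tree)  # ('ROOT', [('NP', [('DT', ['the']), ('NN', ['dog'])])])
```

Since str(tree) and the parser are inverses on well-formed input, the bracketed string is, like CONLL for dependency graphs, a complete plain-text serialization of the constituency tree.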