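Why does lxml cut out part of an XML file?

Problem description

I am using the pyspellchecker spell-checking library to post-correct the OCR output of French texts.

I use lxml to extract only the raw text from TEI-XML files so that the spell checker can be applied afterwards.

The correction itself works fine, but lxml cuts off a whole part of the XML file after nested markup (here, after <hi rend="i">emblèmes</hi>), which means that: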

qui,par le moyen des <hi rend="i">emblèmes</hi>,explique ou repréfente la doétrine
des anciens temps fur les diverfes
opérations de la nature,fur les différents
états de la vie humaine,fur
les vertus &amp; fur les vices,fur
les sorts heureux ou malheureux.
Ainfi,par exemple,des montagnes
sous terre fignifîent l’humilité,&amp; la
difpolîtion ou la longueur de différentes
lignes combinées fervent à exprimer
les effets de cette vertu ( i).</p><p rend="small">(i) Notice de l’Y-king,par M. Vifdeîau,à la fin de la Traduction du Chufcing.</p>

when parsed and converted to a .txt file, becomes:

qui par le moyen des
emblèmes
(i) notice de l’y-king par m vifdeîau à la fin de la Traduction du chancing

So the whole "explique ou repréfente [...] ( i)." part is missing.

How can I recover it?

Python code:

import os,re,glob,csv
from spellchecker import SpellChecker
from lxml import etree
from collections import Counter 

# ignore hidden files in the directory with the input XML files (e.g. '._5419000_r.xml') 
def listdir_nohidden(path):
    return glob.glob(os.path.join(path,'*'))

# specify the input files to be corrected
directory_in = listdir_nohidden("./sample_in/")

# remove the .xml extension 
for file_in in directory_in:
    tree = etree.parse(file_in)
    root = tree.getroot()
    file_in = os.path.basename(file_in)
    file_in = os.path.splitext(file_in)[0]
    # print(file_in) # 5419000_r,test
    
    # create new .txt files on which the corrections will be applied
    file_out = '{}'.format(file_in)+'.txt'
    # print(file_out) # 5419000_r.txt,test.txt
    directory_out = os.path.join("./sample_out/",file_out)
  
    # create new .csv files with the errors,corrections and error frequencies
    corr_out = os.path.join('./csv/',file_in+'.csv')
     
    # remove special characters
    car_spec = ['■','•','%','*','#','+','^','\\','$','>','<','£','{','}'] 
    
    # generate a .csv sheet
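    with open(directory_out,'w',encoding='utf-8') as f,open(corr_out,'w',encoding='utf-8') as fout:
        # tab-separated header, matching the tab-separated rows written further down
        writer = csv.writer(fout,delimiter='\t',lineterminator='\n')
        writer.writerow(["Erreur","Correction","Fréquence"])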
        
        # remove the XML tags in order to get the text only
        for elem in root.iter('*'):
            if elem.text is not None:
                text = elem.text.strip()
                if text: 
                    for c in car_spec:
                        text = text.replace(c,'')
                    
                    # preprocessing
                    text = re.sub('&','et',text) 
                    text = re.sub('« \n','',text) # concatenate the words separated by the hyphen,represented as a quotation mark 
                                                    
                    text = re.sub(" +"," ",text)  # reduce the multiple spaces into one simple space
                    text = text.lower() # lowercase the text 
                    text = text.replace("\n"," ") # so that each line starts from the very beginning,and not after a space
                                                   
                    # replace the quotation marks in order to avoid the parsing problem
                    text = text.replace("'","’") 
                    
                    # delete the space before certain special characters
                    text = text.replace(' ,',',')
                    text = text.replace(' .','. ')
                    text = text.replace(' :',':')
                    text = text.replace(' ;',';')
                    text = text.replace(' !','!')
                    text = re.sub(r'\s\?','?',text)
                    text = text.replace(' "','"')
                    text = text.replace('( ','(')
                    text = text.replace(' )',')') 
                    text = text.replace(' –','-')
                    
                    # replace long and middle dashes with a short one
                    text = text.replace('–','-') 
                    
                    # remove the punctuation marks at the end of a token because
                    # they prevent the corrector from correcting the sequence
                    # 'token + punctuation mark', even if the token is indeed written incorrectly
                    # e.g.: 'jeuneffe,' (with comma) > 'jeuneffe' (incorrect)
                    # instead of 'jeuneffe' (without comma) > 'jeunesse' (correct)
                    text = re.sub(r'(?<=\w)[,;:?!.]','',text)
                    
                    # define the french spell checker 
                    # pyspellchecker
                    spell = SpellChecker(language='fr')

                    # tokenise the text with str.split(), which keeps e.g. 'l’empire' whole,
                    # because pyspellchecker's own tokeniser splits it badly (e.g. 'l', 'empire')
                    token_list = text.split()

                    for t in token_list:
                        # do not correct the tokens with an apostrophe (e.g. l’empire, d’art, s’étend...)
                        # or those in parentheses (e.g. (1716-1790))
                        r1 = re.findall(r"(l’\w+|l’\w+-\w+|d’\w+|d’\w+|qu’\w+|c’\w+|n’\w+|j’\w+|lorfqu’\w+|eft|\w+.*?\)|\(.*?.\)|\(.*$)",t)
                        spell.word_frequency.load_words(r1)
                        a = spell.known(r1)  # words such as l’empire, s’étend are not
                                             # in the dictionary of correct words
                        
                    # correct the tokens in the .txt file
                    # extract the errors,their frequencies and their corrections in a .csv
                    misspelled = spell.unknown(token_list)
                 
                    for m in misspelled:
                        # assumption: newer pyspellchecker releases may return None when
                        # no candidate is found, so fall back to the original token
                        corrected = spell.correction(m) or m
                        m_freq = token_list.count(m)
                        text = text.replace(m,corrected)

                        fout.write(m+'\t' + corrected+'\t' + str(m_freq)+' \n')
                    # print(text)
                    f.write(text + "\n")
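
A side note on the punctuation step in the code above: the lookbehind pattern removes any of `,;:?!.` that directly follows a word character, not only punctuation at the end of a token. The short check below is a sketch that assumes the replacement is the empty string; the helper name `strip_punct_after_word` is illustrative and not part of the script.

```python
import re

def strip_punct_after_word(text):
    """Drop , ; : ? ! . wherever they directly follow a word character."""
    return re.sub(r"(?<=\w)[,;:?!.]", "", text)

# Punctuation glued to a token is removed, as intended
print(strip_punct_after_word("jeuneffe, av. j.-c."))
# → jeuneffe av j-c

# Punctuation inside a token is removed as well, which alters URLs
print(strip_punct_after_word("http://gallica.bnf.fr/ark:/12148"))
# → http//gallicabnffr/ark/12148
```

This is consistent with the URL line in the output text, where the colon and the dots have disappeared.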

Input XML:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="fr" n="5419000" xml:id="cb30263946g">
  <teiHeader>
<fileDesc>
<titleStmt>
<title>Les livres classiques de l'empire de la Chine</title>
<author role="Auteur du texte" key="11909957">Confucius (0551?-0479? av. J.-C.)</author>
<respStmt>
  <resp key="40">Annotateur</resp>
  <name key="12176450">Pluquet,François-André-Adrien (1716-1790)</name>
</respStmt>
<respStmt>
  <resp key="680">Traducteur</resp>
  <name key="16653645">Noël,François (1651-1729)</name>
</respStmt>
</titleStmt>
<publicationStmt>
<publisher>TGB (BnF – OBVIL)</publisher>
</publicationStmt>
<seriesstmt>
<title level="s">Les livres classiques de l'empire de la Chine</title>
<title level="a">Tome 2</title>
<biblScope unit="volumes" n="6"/>
<idno>cb30263946g</idno>
</seriesstmt>
<sourceDesc>
<bibl>
<idno>http://gallica.bnf.fr/ark:/12148/bpt6k54190001</idno>
<publisher>Barrois aîné et Barrois jeune</publisher>
<date when="1784">1784</date>
</bibl>
</sourceDesc>
</fileDesc>
</teiHeader>
  <text>
    <body><pb xml:id="PAG_00000001" n="" corresp="http://gallica.bnf.fr/ark:/12148/bpt6k54190001/f1.image"/>
<pb xml:id="PAG_00000002" n="" corresp="http://gallica.bnf.fr/ark:/12148/bpt6k54190001/f2.image"/>
<pb xml:id="PAG_00000003" n="" corresp="http://gallica.bnf.fr/ark:/12148/bpt6k54190001/f3.image"/>
<pb xml:id="PAG_00000004" n="" corresp="http://gallica.bnf.fr/ark:/12148/bpt6k54190001/f4.image"/><div><head>Livres classiques</head><p rend="left">
DE L’EMPIRE .
</p></div><div><head>De la chine.</head><pb xml:id="PAG_00000005" n="" corresp="http://gallica.bnf.fr/ark:/12148/bpt6k54190001/f5.image"/>
<pb xml:id="PAG_00000006" n="" corresp="http://gallica.bnf.fr/ark:/12148/bpt6k54190001/f6.image"/>
<pb xml:id="PAG_00000007" n="" corresp="http://gallica.bnf.fr/ark:/12148/bpt6k54190001/f7.image"/>
<pb xml:id="PAG_00000008" n="" corresp="http://gallica.bnf.fr/ark:/12148/bpt6k54190001/f8.image"/></div><div><head>Observations</head><p rend="left small">SUR</p><p rend="center small">LES LIVRES CLASSIQUES</p><p rend="center small">DE L’EMPIRE</p><p rend="center small">DE LA CHINE.</p><p rend="small">.LES Chinois ont deux sortes de
livres clafliques ou canoniques : les
Kings,ou les livres canoniques du
premier ordre ; &amp; les Ssée-chu,ou
livres canoniques dusecond ordre.</p><p rend="small">Les Kings sont au nombre de
cinq ; l’Y-king,le Chu-king,lc
Chi-king,le Tchun-tfIoU &amp; le Lild.</p><p rend="left small">L’Y-king remonte à la plus haute
<hi rend="i">Tome II. a</hi></p><p rend="left"><hi rend="i">'\</hi>
<pb xml:id="PAG_00000009" n="" corresp="http://gallica.bnf.fr/ark:/12148/bpt6k54190001/f9.image"/>ij O B S E K.VATI ON S.</p><p rend="small">antiquité ; on l’attribue en grande
partie à Fo - hi : c’eft un ouvrage
qui,à la fin de la Traduction du Chufcing.</p>
</div></body>
  </text>
</TEI>

Output text:

les livres classiques de l’empire de la chine
confucius (0551-0479 av j-c)
innovateur
paquet françois-andré-adrien (1716-1790)
Traducteur
noël françois (1651-1729)
rgb (bnf- obvil)
les livres classiques de l’empire de la chine
tome 2
cb30263946g
http//gallicabnffr/ark/12148/bpt6k54190001
barrons aîné et barrons jeune
1784
livres classiques
de l’empire 
de la chine
observations
sur
les livres classiques
de l’empire
de la chine
les chinois ont deux sortes de livres classiques ou canonique les kings ou les livres canonique du premier ordre et les ssée-chu ou livres canonique second ordre
les kings sont au nombre de cinq l’y-king le chu-kinglc thinking le tchun-tfIoU et le lily
l’y-king remonte à la plus haute
tome ii a
a
antiquité on l’attribue en grande partie à fo - hi c’eft un ouvrage qui par le moyen des
emblèmes
(i) notice de l’y-king par m vifdeîau à la fin de la Traduction du chancing

Solution

No solution was posted on the source page.
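
A likely cause, offered as a sketch rather than a confirmed answer: in lxml (as in the standard ElementTree model), `elem.text` holds only the text between an element's start tag and its first child. Text that follows a child's closing tag is stored on that child as `elem.tail`. The loop in the question reads `elem.text` only, so everything after `</hi>` in the paragraph is never visited. The snippet below reproduces this on one line of the sample and recovers the full text with `itertext()`. The names `SAMPLE`, `text_only` and `full_text` are illustrative, and the stdlib fallback import is an assumption made so the sketch runs without lxml installed.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # assumption: the stdlib module shares the same text/tail model
    import xml.etree.ElementTree as etree

# One line of the question's sample, reduced to a single paragraph
SAMPLE = (
    '<p>qui,par le moyen des <hi rend="i">emblèmes</hi>,'
    'explique ou repréfente la doétrine</p>'
)

def text_only(root):
    """Mimic the question's loop: collect elem.text and ignore elem.tail."""
    return [el.text for el in root.iter() if el.text]

def full_text(elem):
    """Join every text node under elem, including the tails of its children."""
    return "".join(elem.itertext())

p = etree.fromstring(SAMPLE)
hi = p[0]

print(text_only(p))
# → ['qui,par le moyen des ', 'emblèmes']

print(repr(hi.tail))
# → ',explique ou repréfente la doétrine'

print(full_text(p))
# → qui,par le moyen des emblèmes,explique ou repréfente la doétrine
```

In the question's script, this would mean iterating over the text-bearing elements (for instance `root.iter('{http://www.tei-c.org/ns/1.0}p')`, using the namespace declared in the input file, plus `head` if headings are wanted) and processing `"".join(elem.itertext())` instead of `elem.text`. Calling `itertext()` on every element returned by `root.iter('*')` would repeat nested text, because a parent's result already includes the text of its children.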

