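Using a regular expression as a tokenizer?

Problem description

I am trying to tokenize a corpus into sentences. I tried spaCy and NLTK, but they don't work well because my text is a bit tricky. Below is an artificial sample I made that covers all the edge cases I know of: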

It is relevant to point that Case No. 778 - Martin H. v. The Woods,it was mentioned that death
 to one cannot be generalised. However,the High Court while enhancing the same from life to 
death,in our view,has not assigned adequate and acceptable reasons. In our opinion,it is not a 
rarest of rare case where extreme penalty of death is called for instead sentence of 
imprisonment for life as ordered by the trial Court would be appropriate.15) In the light of the 
above discussion,while maintaining the conviction of the appellant-accused for the offence under Section 302. IPC,award of extreme penalty of death by the High Court is set aside and we restore the sentence of
 life imprisonment as directed by the trial Court.

How I want the sentences to be tokenized

1) It is relevant to point that Case No. 778 - Martin H. v. The Woods,it was mentioned that death to one cannot be generalised.
2) However,the High Court while enhancing the same from life to death,in our view,has not assigned adequate and acceptable reasons.
3) In our opinion,it is not a rarest of rare case where extreme penalty of death is called for instead sentence of imprisonment for life as ordered by the trial Court would be appropriate.
4)15. In the light of the above discussion,while
 maintaining the conviction of the appellant-accused for the offence under Section 302. IPC,award of extreme penalty of death by the High Court is set aside and we restore the sentence of
 life imprisonment as directed by the trial Court.

This is the regex I am using:

sent = re.split('(?<!\w\.\w.)(?<![A-Z]\.)(?<![1-9]\.)(?<![1-9]\.)(?<![v]\.)(?<![vs]\.)(?<=\.|\?) ',j)

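I'm not very proficient with regex, but I manually added the conditions for v and vs. I also skip the case where a number comes before the period, e.g. 15.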

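Problems I'm facing:

  1. If there is no gap between two sentences, they are not split correctly.
  2. I also want the period to be skipped when the word before it starts with a capital letter, e.g. No. or Mr.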
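The first problem can be reproduced in isolation. The snippet below is a small sketch that runs the regex from the question (only rewritten as a raw string) on made-up sample strings; the name PATTERN and the samples are illustrative assumptions. It also shows a side effect that may matter here: [vs] is a character class that matches a single v or s, so a sentence ending in a word like "reasons." is not split either.

```python
import re

# The regex from the question, unchanged apart from the raw-string prefix.
PATTERN = r'(?<!\w\.\w.)(?<![A-Z]\.)(?<![1-9]\.)(?<![1-9]\.)(?<![v]\.)(?<![vs]\.)(?<=\.|\?) '

spaced = "It rained. We stayed home."          # period followed by a space
unspaced = "It rained.We stayed home."         # no gap between the sentences
ends_in_s = "He gave reasons. We left."        # word before the period ends in "s"

# The split point is the space itself, so a period with no space after it
# can never be a split point.
print(re.split(PATTERN, spaced))     # ['It rained.', 'We stayed home.']
print(re.split(PATTERN, unspaced))   # ['It rained.We stayed home.']

# (?<![vs]\.) rejects any period preceded by "v" or "s", not only "vs.".
print(re.split(PATTERN, ends_in_s))  # ['He gave reasons. We left.']
```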

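Solution

In general, you can't rely on one single, great, infallible regex. You have to write a function that uses several regexes (both positive and negative), together with a dictionary of abbreviations and some basic language parsing that knows, for example, that "I", "USA", "FCC" and "TARP" are capitalized in English. (Reference)

Following this guideline, the function below uses several regexes to parse your sentences. It is a modification of D Greenberg's answer.

Code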

import re

def split_into_sentences(text):
    # Regex pattern
    alphabets= "([A-Za-z])"
    prefixes = "(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt)[.]"
    suffixes = "(Inc|Ltd|Jr|Sr|Co)"
    starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
    acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
    # website regex from https://www.geeksforgeeks.org/python-check-url-string/
    websites = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    digits = "([0-9])"
    section = r"(Section \d+)([.])(?= \w)"
    item_number = r"(^|\s\w{2})([.])(?=[-+ ]?\d+)"
    abbreviations = r"(^|[\s\(\[]\w{1,2}s?)([.])(?=[\s\)\]]|$)"
    parenthesized = r"\((.*?)\)"
    bracketed = r"\[(.*?)\]"
    curly_bracketed = r"\{(.*?)\}"
    enclosed = '|'.join([parenthesized,bracketed,curly_bracketed])
    # text replacement
    # replace unwanted stop period with <prd>
    # actual stop periods with <stop>
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,lambda m: m.group().replace('.','<prd>'),text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    if "..." in text: text = text.replace("...","<prd><prd><prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
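    # each replacement keeps group 1 and swaps the period for <prd>
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    text = re.sub(section,"\\1<prd>",text)
    text = re.sub(item_number,"\\1<prd>",text)
    text = re.sub(abbreviations,"\\1<prd>",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    # protect every period inside (...), [...] or {...}
    text = re.sub(enclosed,lambda m: m.group().replace('.','<prd>'),text)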
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")

    # Tokenize sentence based upon <stop>
    sentences = text.split("<stop>")
    if sentences[-1].isspace():
        # remove last since only whitespace
        sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]

    return sentences
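The core of the function above is its two-marker scheme: periods that must not end a sentence become <prd>, real terminators get <stop> appended, and the text is split on <stop>. The sketch below isolates that idea in a much smaller form. The abbreviation set and the name split_with_abbreviations are hypothetical and only for illustration; this version does not handle decimals, URLs or initials the way the full function does.

```python
import re

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"No", "Mr", "Mrs", "Dr", "v", "vs"}

def split_with_abbreviations(text):
    # Pass 1: protect the period after a known abbreviation with <prd>.
    def protect(match):
        word = match.group(1)
        return word + "<prd>" if word in ABBREVIATIONS else match.group(0)
    text = re.sub(r"\b(\w+)\.", protect, text)

    # Pass 2: every remaining terminator ends a sentence,
    # whether or not a space follows it.
    text = re.sub(r"([.?!])", r"\1<stop>", text)

    # Pass 3: restore the protected periods, then split on <stop>.
    text = text.replace("<prd>", ".")
    return [s.strip() for s in text.split("<stop>") if s.strip()]

print(split_with_abbreviations("Mr. Smith took the No. 2 bus.He left."))
# ['Mr. Smith took the No. 2 bus.', 'He left.']
```

Because the split happens on the marker rather than on whitespace, sentences with no gap between them are still separated.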

Usage

for index,token in enumerate(split_into_sentences(s),start = 1):
    print(f'{index}) {token}')

Tests

1. Input

s='''It is relevant to point that Case No. 778 - Martin H. v. The Woods,it was mentioned that death
 to one cannot be generalised. However,the High Court while enhancing the same from life to 
death,in our view,has not assigned adequate and acceptable reasons. In our opinion,it is not a 
rarest of rare case where extreme penalty of death is called for instead sentence of 
imprisonment for life as ordered by the trial Court would be appropriate.15) In the light of the 
above discussion,while maintaining the conviction of the appellant-accused for the offence under Section 302. IPC,award of extreme penalty of death by the High Court is set aside and we restore the sentence of
 life imprisonment as directed by the trial Court.
'''

Output

1) It is relevant to point that Case No. 778 - Martin H. v. The Woods,it was mentioned that death  to one cannot be generalised.
2) However,the High Court while enhancing the same from life to  death,in our view,has not assigned adequate and acceptable reasons.
3) In our opinion,it is not a  rarest of rare case where extreme penalty of death is called for instead sentence of  imprisonment for life as ordered by the trial Court would be appropriate.
4) 15) In the light of the  above discussion,while maintaining the conviction of the appellant-accused for the offence under Section 302. IPC,award of extreme penalty of death by the High Court is set aside and we restore the sentence of  life imprisonment as directed by the trial Court.

2. Input

s = '''Mr. or Mrs. or Dr. (not sure of their title) Smith will be here in the morning at eight.He's arriving on flight No. 48213 out of Denver.He'll take the No. 2 bus from the airport.However,he may grab a taxi instead.'''

Output

1) Mr. or Mrs. or Dr. (not sure of their title) Smith will be here in the morning at eight.
2) He's arriving on flight No. 48213 out of Denver.
3) He'll take the No. 2 bus from the airport.
4) However,he may grab a taxi instead.

3. Input

s = '''The respondent, in his statement Ex.-73, which is accepted and found to be truthful. The passcode is either No.5, No. 5, No.-5, No.+5.'''

Output

1) The respondent, in his statement Ex.-73, which is accepted and found to be truthful.
2) The passcode is either No.5, No. 5, No.-5, No.+5.
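One detail of the item_number pattern is worth checking on its own: it only fires when whitespace comes directly before the two-character token, so a "No." glued to a preceding comma is not protected. The snippet below is a minimal sketch of that behavior; the helper name protect_item_numbers and the sample strings are assumptions made for this illustration.

```python
import re

# Same pattern as item_number in split_into_sentences.
item_number = r"(^|\s\w{2})([.])(?=[-+ ]?\d+)"

def protect_item_numbers(text):
    # Keep group 1 and swap the period for the <prd> placeholder.
    return re.sub(item_number, r"\1<prd>", text)

print(protect_item_numbers("bus No. 5"))   # 'bus No<prd> 5'
print(protect_item_numbers("bus No.-5"))   # 'bus No<prd>-5'
print(protect_item_numbers("bus,No. 5"))   # 'bus,No. 5' (unchanged: no whitespace before "No")
```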

4. Input

s = '''He went to New York. He is 10 years old.'''

Output

1) He went to New York.
2) He is 10 years old.

5. Input

s = '''15) In the light of  Ex. P the above discussion,while maintaining the conviction of the appellant-accused for the offence under Section 302 IPC,award of extreme penalty of death by the High Court is set aside and we restore the sentence of life imprisonment as directed by the trial Court. The appeal is allowed in part to the extent mentioned above.'''

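Output

1) 15) In the light of  Ex. P the above discussion,while maintaining the conviction of the appellant-accused for the offence under Section 302 IPC,award of extreme penalty of death by the High Court is set aside and we restore the sentence of life imprisonment as directed by the trial Court.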
2) The appeal is allowed in part to the extent mentioned above.

Are you looking for the following regex?

'(?<=[^A-Z][a-z]\w)[/.] '

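Explanation:

  • [^A-Z][a-z]\w)[/.] -> matches every word that does not start with a capital letter, followed by a '.' and a space.
  • (?<= -> resets what has been matched so far and selects only what comes next, i.e. only the '. '.

You can now split with: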

sent = re.split(r'(?<=[^A-Z][a-z]\w)[/.] ', j)
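To see what this split does in practice, here is a short sketch that applies the same pattern to a few made-up strings; the name split_simple and the sample sentences are assumptions for illustration. Two things stand out: the matched '. ' is consumed, so every sentence except the last loses its final period, and sentences with no space between them are still not separated.

```python
import re

# Same pattern as above, written as a raw string.
PATTERN = r'(?<=[^A-Z][a-z]\w)[/.] '

def split_simple(text):
    # re.split drops the matched ". ", so the period is not kept.
    return re.split(PATTERN, text)

print(split_simple("He went to New York. He is 10 years old."))
# ['He went to New York', 'He is 10 years old.']

# "No." is skipped because the character before the final "o" is uppercase.
print(split_simple("Case No. 778 was cited. It was upheld."))
# ['Case No. 778 was cited', 'It was upheld.']

# No space after the period, so there is nothing to split on.
print(split_simple("It rained.We left."))
# ['It rained.We left.']
```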