Problem description
I have been looking for a solution to this problem for a while. I am writing a custom function to count the number of sentences in a text. I tried nltk and textstat for this, but the two give me different counts.
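An example of a sentence looks like this:
Annie said, "Are you sure? How is it possible? you are joking, right?"
NLTK gives me --> count=3.
['Annie said, "Are you sure?', 'How is it possible?', 'you are joking, right?"']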
Another example:
Annie said, "It will work like this! you need to go and confront your friend. Okay!"
NLTK gives me --> count=3.
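Please suggest an approach. The expected count is 1 in both cases, because each example is a single sentence containing direct speech.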
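For reference, a count of 3 is what a splitter that looks only at terminal punctuation would produce. The sketch below uses a regular expression as a stand-in to reproduce it on both examples. It is not NLTK's actual algorithm, and `naive_sentence_split` is a hypothetical helper introduced only for illustration.

```python
import re

def naive_sentence_split(text: str) -> list:
    # Assumption: a sentence ends at ., ?, ! or … followed by whitespace.
    # Quotation marks are ignored entirely, which is the source of the overcount.
    parts = re.split(r'(?<=[.?!…])\s+', text.strip())
    return [part for part in parts if part]

examples = [
    'Annie said, "Are you sure? How is it possible? you are joking, right?"',
    'Annie said, "It will work like this! you need to go and confront your friend. Okay!"',
]

for example in examples:
    pieces = naive_sentence_split(example)
    print(len(pieces), pieces)
```

Both examples come back as 3 pieces, because every question mark, exclamation mark, or period inside the quote is treated as a sentence boundary.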
Solution
I wrote a simple function that does what you want:
def sentences_counter(text: str):
    end_of_sentence = ".?!…"
    # complete with whatever end-of-sentence punctuation mark I might have forgotten;
    # you might for instance want to add '\n'.
    sentences_count = 0
    sentences = []
    inside_a_quote = False
    start_of_sentence = 0
    last_end_of_sentence = -2
    for i, char in enumerate(text):
        # quote management, to solve your issue
        if char == '"':
            inside_a_quote = not inside_a_quote
            if not inside_a_quote and text[i-1] in end_of_sentence:  # 🚩
                last_end_of_sentence = i                             # 🚩
        elif inside_a_quote:
            # punctuation inside a quote never ends the outer sentence
            continue
        # basic management of sentences with the punctuation marks in `end_of_sentence`
        if char in end_of_sentence:
            last_end_of_sentence = i
        elif last_end_of_sentence == i-1:
            # first character after an end mark: close the current sentence
            sentences.append(text[start_of_sentence:i].strip())
            sentences_count += 1
            start_of_sentence = i
    # same as the last block, in case the text has no end punctuation mark;
    # stripped before the check so trailing whitespace is not counted as a sentence
    last_sentence = text[start_of_sentence:].strip()
    if last_sentence:
        sentences.append(last_sentence)
        sentences_count += 1
    return sentences_count, sentences
Consider the following:
text = '''Annie said, "Are you sure? How is it possible? you are joking, right?" No, I'm not... I thought you were'''
To generalize your problem a little, I added two more sentences: one ending with an ellipsis, and a last one with no end punctuation at all. Now, if I run this:
sentences_count, sentences = sentences_counter(text)
print(f'{sentences_count} sentences detected.')
print(f'The detected sentences are: {sentences}')
I get this:
3 sentences detected.
The detected sentences are: ['Annie said, "Are you sure? How is it possible? you are joking, right?"', "No, I'm not...", 'I thought you were']
I think it works pretty well.
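Note: the quote management in my solution follows American-style quotation, where the end punctuation of a sentence can sit inside the quote. To disable this behavior, remove the two lines I marked with the flag emoji 🚩 (the check on text[i-1] and the assignment under it).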
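To make the effect of the two flagged lines concrete, here is a sketch that puts them behind a parameter instead of deleting them. The name `count_sentences` and the `quote_can_end_sentence` parameter are assumptions introduced for this illustration; the rest mirrors the logic of the function above.

```python
def count_sentences(text: str, quote_can_end_sentence: bool = True):
    """Variant of sentences_counter with the flagged behavior made optional."""
    end_of_sentence = ".?!…"
    sentences = []
    inside_a_quote = False
    start_of_sentence = 0
    last_end_of_sentence = -2
    for i, char in enumerate(text):
        if char == '"':
            inside_a_quote = not inside_a_quote
            # Flagged behavior: a closing quote that directly follows an end
            # mark is itself treated as the end of the sentence.
            if (quote_can_end_sentence and not inside_a_quote
                    and i > 0 and text[i - 1] in end_of_sentence):
                last_end_of_sentence = i
        elif inside_a_quote:
            continue  # ignore everything inside the quote
        if char in end_of_sentence:
            last_end_of_sentence = i
        elif last_end_of_sentence == i - 1:
            sentences.append(text[start_of_sentence:i].strip())
            start_of_sentence = i
    last_sentence = text[start_of_sentence:].strip()
    if last_sentence:
        sentences.append(last_sentence)
    return len(sentences), sentences

text = '''Annie said, "Are you sure? How is it possible? you are joking, right?" No, I'm not... I thought you were'''
print(count_sentences(text))                                # American style
print(count_sentences(text, quote_can_end_sentence=False))  # flagged lines disabled
```

With the default, the text is split into 3 sentences as before. With `quote_can_end_sentence=False`, the closing quote no longer ends the sentence, so the quoted part and "No, I'm not..." are merged and the count drops to 2. Both examples from the question give 1 under either setting, since each consists of a single quoted passage.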