Problem description
I need to split a piece of text into sentences. Here is an example.
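The court in Beijing said that the tycoon, Ren Zhiqiang, had used his former posts to take bribes and embezzle public funds, and accused him of illegally enriching himself by about $2.9 million. But Mr. Ren’s supporters are sure to see the long sentence as punishment for his cutting comments about Mr. Xi — and as a warning to other potential critics of Mr. Xi’s rule. The U.S. Government would welcome the prompt response of the DRV to this suggestion. In 2016, the Communist Party had already warned Mr. Ren and put him on probation after he publicly scoffed at Mr. Xi’s comments that Chinese news outlets must serve the party. “When did the people’s government turn into the party’s government?” Mr. Ren wrote.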
The following cases need to be taken into account:
- Mr.
- Mrs.
- Dr.
- U.S.
- $2.9
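The expected result is the following list of sentences:
- The court in Beijing said that the tycoon, Ren Zhiqiang, had used his former posts to take bribes and embezzle public funds, and accused him of illegally enriching himself by about $2.9 million.
- But Mr. Ren’s supporters are sure to see the long sentence as punishment for his cutting comments about Mr. Xi — and as a warning to other potential critics of Mr. Xi’s rule.
- The U.S. Government would welcome the prompt response of the DRV to this suggestion.
- In 2016, the Communist Party had already warned Mr. Ren and put him on probation after he publicly scoffed at Mr. Xi’s comments that Chinese news outlets must serve the party.
- “When did the people’s government turn into the party’s government?”
- Mr. Ren wrote.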
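Can this split be done with a single regular expression in JavaScript? I can't get it to work. For now, I have the following regular expression as a starting point: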
[^.!?;:。!?]+?(?!Mr|Mrs|\$\d+\.)[.!?;:。!?]
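A quick check suggests why this starting point cannot protect abbreviations: the negative lookahead sits directly before the punctuation character, so it inspects the punctuation itself rather than the word in front of it. The short input below is a hypothetical example chosen only for illustration, and the g flag is added so that match() returns every piece.

```javascript
// The starting-point regex from the question, with the g flag added for match().
const attempt = /[^.!?;:。!?]+?(?!Mr|Mrs|\$\d+\.)[.!?;:。!?]/g;

// Hypothetical short input, used only for illustration.
const sample = "But Mr. Ren left.";

// The lookahead runs at the position of the punctuation character, so it asks
// whether "." starts "Mr", "Mrs" or "$<digits>.", which can never be true.
const pieces = sample.match(attempt);
console.log(pieces); // → ["But Mr.", " Ren left."]
```

The text is still cut right after "Mr.", which is the behavior the answers below try to avoid.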
Solution
I think this is the best we can come up with. For the reasons already discussed it isn't perfect, but maybe it's a starting point?
let s = "The court in Beijing said that the tycoon, Ren Zhiqiang, had used his former posts to take bribes and embezzle public funds, and accused him of illegally enriching himself by about $2.9 million. But Mr. Ren’s supporters are sure to see the long sentence as punishment for his cutting comments about Mr. Xi — and as a warning to other potential critics of Mr. Xi’s rule. The U.S. Government would welcome the prompt response of the DRV to this suggestion. In 2016, the Communist Party had already warned Mr. Ren and put him on probation after he publicly scoffed at Mr. Xi’s comments that Chinese news outlets must serve the party. “When did the people’s government turn into the party’s government?” Mr. Ren wrote.";
// Array of known abbreviations or other dot-ended text that ***probably*** isn't the end of a sentence
const ok = ["Mr.","Mrs.","Dr.","U.S.","Inc."];
function findSentences() {
  // split the entire string into words, separated by a space
  let words = s.split(" ");
  // an array to hold all of the sentences the code constructs
  let sentences = [];
  // start with a blank sentence array
  let newsentence = [];
  words.forEach(function(w) {
    if (!w.endsWith(".")) {
      // the word does NOT end with a dot, so just add it to the sentence
      newsentence.push(w);
    } else if (ok.includes(w) || w.length == 2) {
      // it does, but it's a known abbreviation, so add it as normal
      // Also allow for single-letter abbreviations, e.g. in "Samuel L. Jackson"
      newsentence.push(w);
    } else {
      // it does, and it's NOT a known abbreviation: finish the sentence and start a new one
      newsentence.push(w);
      sentences.push(newsentence.join(" "));
      newsentence.length = 0;
    }
  });
  // keep any trailing words that did not end with a dot
  if (newsentence.length > 0) {
    sentences.push(newsentence.join(" "));
  }
  // Output the sentences
  let ul = document.createElement("ul");
  sentences.forEach(function(sentence) {
    let li = document.createElement("li");
    // textContent avoids interpreting the sentence as HTML
    li.textContent = sentence;
    ul.appendChild(li);
  });
  document.body.appendChild(ul);
}
findSentences();
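The word-by-word approach above only treats a trailing dot as a sentence end and writes its result into the page. As a rough sketch of the same idea without the DOM, the variant below returns an array and also treats "?" and "!" (optionally followed by a closing quote or bracket) as sentence ends. The names splitSentences and ABBREVIATIONS, the short sample text, and the choice of closing characters are assumptions made for illustration; they are not part of the original answer.

```javascript
// Known abbreviations that end with a dot but probably do not end a sentence.
// (Same list as above; extend as needed.)
const ABBREVIATIONS = new Set(["Mr.", "Mrs.", "Dr.", "U.S.", "Inc."]);

// Hypothetical helper: split text into sentences, word by word.
function splitSentences(text, abbreviations = ABBREVIATIONS) {
  const sentences = [];
  let current = [];
  for (const word of text.split(/\s+/).filter(Boolean)) {
    current.push(word);
    // Ignore trailing closing quotes/brackets when looking at the punctuation.
    const core = word.replace(/[”"’')\]]+$/, "");
    const endsSentence = /[.!?]$/.test(core);
    // Known abbreviations and single uppercase initials do not end a sentence.
    const isAbbreviation = abbreviations.has(core) || /^[A-Z]\.$/.test(core);
    if (endsSentence && !isAbbreviation) {
      sentences.push(current.join(" "));
      current = [];
    }
  }
  // Keep trailing words that never reached a sentence end.
  if (current.length > 0) sentences.push(current.join(" "));
  return sentences;
}

// Shortened sample, for illustration only.
const demoText =
  "But Mr. Ren’s supporters see it as punishment. The U.S. Government would welcome the response. “When did the people’s government turn into the party’s government?” Mr. Ren wrote.";
const demoSentences = splitSentences(demoText);
console.log(demoSentences);
```

Like the original, this still depends on a hand-maintained abbreviation list, so a sentence that genuinely ends in "U.S." would not be split.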
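Another option (certainly not a perfect one either) is to match what you don't want to change, and to capture in a group what you want to keep, so that you can add a newline after it.
You can extend the first alternatives with further patterns that you don't want to change in the text.
In the replacement, check whether group 1 exists. If it does, use it in the replacement and add the newline. If it does not, return the match unchanged.
Explanation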
\b(?:Mrs?|Dr)\.|\bU\.S\.|\$\d+(?:\.\d+)?(?: million)?\b|“[^“”]+”|([.!?;:。!?])\s*(?!$)
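- \b(?:Mrs?|Dr)\.
  Match Mr., Mrs. or Dr.
- |
  Or
- \bU\.S\.
  Match U.S.
- |
  Or
- \$\d+(?:\.\d+)?(?: million)?\b
  Match a dollar sign, 1 or more digits with an optional decimal part, then an optional space and the word million
- |
  Or
- “[^“”]+”
  Match from an opening “ to the closing ”, for example to prevent splitting on a question mark inside the quotes
- |
  Or
- ([.!?;:。!?])\s*
  Capture group 1, which matches one of the characters listed in the character class, followed by optional whitespace outside the group
- (?!$)
  Negative lookahead, asserting that this is not the end of the string, to prevent adding a superfluous newline at the end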
For example
let pattern = /\b(?:Mrs?|Dr)\.|\bU\.S\.|\$\d+(?:\.\d+)?(?: million)?\b|“[^“”]+”|([.!?;:。!?])\s*(?!$)/g;
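let s = `The court in Beijing said that the tycoon, Ren Zhiqiang, had used his former posts to take bribes and embezzle public funds, and accused him of illegally enriching himself by about $2.9 million. But Mr. Ren’s supporters are sure to see the long sentence as punishment for his cutting comments about Mr. Xi — and as a warning to other potential critics of Mr. Xi’s rule. The U.S. Government would welcome the prompt response of the DRV to this suggestion. In 2016, the Communist Party had already warned Mr. Ren and put him on probation after he publicly scoffed at Mr. Xi’s comments that Chinese news outlets must serve the party. “When did the people’s government turn into the party’s government?” Mr. Ren wrote.`;
s = s.replace(pattern, (m, g1) => undefined !== g1 ? g1 + "\n\n" : m);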
console.log(s);
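If an array of sentences is wanted rather than one string with blank lines, one possible follow-up is to insert a single newline and then split on it. This is only a sketch: the helper name toSentences and the short input are hypothetical, and it assumes the input text does not already contain newlines of its own.

```javascript
// Same pattern as above: protected alternatives first, sentence-ending
// punctuation captured in group 1 last.
const sentencePattern = /\b(?:Mrs?|Dr)\.|\bU\.S\.|\$\d+(?:\.\d+)?(?: million)?\b|“[^“”]+”|([.!?;:。!?])\s*(?!$)/g;

// Hypothetical helper: returns the sentences as an array.
function toSentences(text) {
  return text
    .trim() // avoid a whitespace-only last entry
    // Group 1 present: keep the punctuation, replace the whitespace after it
    // with one newline. Group 1 absent: a protected match, return it as is.
    .replace(sentencePattern, (m, g1) => (g1 !== undefined ? g1 + "\n" : m))
    .split("\n");
}

// Illustrative input only.
const parts = toSentences("But Mr. Ren paid $2.9 million. The U.S. Government agreed.");
console.log(parts);
```

Because the character class also contains ";" and ":", this sketch breaks after those characters as well, exactly as the replacement above does.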
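An example of extending it, here with one more alternative that leaves a single uppercase initial (such as the L. in Samuel L. Jackson) untouched: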
\b(?:Mrs?|Dr)\.|\bU\.S\.|\$\d+(?:\.\d+)?(?: million)?\b|“[^“”]+”|(?: |^)[A-Z]\.(?!\S)|([.!?;:。!?])\s*(?!$)
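To see what the extra alternative changes, the two patterns can be run side by side. The sketch below uses a made-up input and the hypothetical names base, extended and breakAfter; the replacement callback is the same one used in the example above.

```javascript
// Pattern without the single-initial alternative.
const base = /\b(?:Mrs?|Dr)\.|\bU\.S\.|\$\d+(?:\.\d+)?(?: million)?\b|“[^“”]+”|([.!?;:。!?])\s*(?!$)/g;
// Extended pattern: (?: |^)[A-Z]\.(?!\S) protects a lone uppercase initial.
const extended = /\b(?:Mrs?|Dr)\.|\bU\.S\.|\$\d+(?:\.\d+)?(?: million)?\b|“[^“”]+”|(?: |^)[A-Z]\.(?!\S)|([.!?;:。!?])\s*(?!$)/g;

// Add a blank line after captured punctuation, leave protected matches alone.
const breakAfter = (m, g1) => (g1 !== undefined ? g1 + "\n\n" : m);

// Illustrative input only.
const text = "Samuel L. Jackson spoke. Mr. Ren left.";

const withBase = text.replace(base, breakAfter);
const withExtended = text.replace(extended, breakAfter);

console.log(JSON.stringify(withBase));
// → "Samuel L.\n\nJackson spoke.\n\nMr. Ren left."
console.log(JSON.stringify(withExtended));
// → "Samuel L. Jackson spoke.\n\nMr. Ren left."
```

The new alternative has to come before the capturing group so that the initial is consumed as a protected match before its dot can be treated as a sentence end.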