Double list comprehension for occurrences of a string in a list of strings

Problem description

I have two lists:

text = [['hello this is me'],['oh you kNow u']]
phrases = [['this is','u'],['oh you','me']]

I need to split each text so that word combinations that occur in phrases stay together as single strings:

result = [['hello','this is','me'],['oh you','kNow','u']]

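I tried using zip(), but it walks through the lists in lockstep, whereas I need to check every list against every other. I also tried a find() approach, but in this example it also finds every letter 'u' and turns it into its own string (for instance, the word 'you' becomes 'yo','u'). I wish replace() also worked when replacing a string with a list, because that would let me do something like: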

for line in text:
        line = line.replace('this is',['this is'])

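But having tried everything, I still haven't found anything that works for me in this case. Can you help me solve this?

Solution

Clarified with the original poster:

Given the text 'pack my box with five dozen liquor jugs' and the phrase 'five dozen'

the result should be: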

['pack','my','box','with','five dozen','liquor','jugs']

not:

['pack my box with','liquor jugs']

Each text and phrase is converted to a Python list of words, such as ['this','is','an','example'], which prevents 'u' from being matched inside a word.
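To illustrate why comparing word lists helps, here is a minimal sketch. The helper name contains_phrase is made up for this illustration and is not part of the answer's code; it only contrasts a plain substring test with a whole-word comparison.

```python
def contains_phrase(text, phrase):
    """Return True if `phrase` occurs in `text` as a run of whole words."""
    words, target = text.split(), phrase.split()
    return any(
        words[i:i + len(target)] == target
        for i in range(len(words) - len(target) + 1)
    )

# A substring test finds the letter 'u' inside the word 'you'
print('u' in 'oh you know')                   # True
# A whole-word comparison does not
print(contains_phrase('oh you know', 'u'))    # False
print(contains_phrase('oh you know u', 'u'))  # True
```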

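All possible subphrases of a text are generated by compile_subphrases(). Longer phrases (more words) are generated first, so they are matched before shorter ones. 'five dozen jugs' is always matched in preference to 'five dozen' or 'five'.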
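The longest-first ordering can be sketched on its own. The function below is a simplified stand-in written for this illustration (the name subphrases_longest_first is an assumption, not the answer's compile_subphrases()); it only shows the order in which candidate subphrases come out.

```python
def subphrases_longest_first(text, minwords=1):
    """Yield every run of consecutive words in `text`, longest runs first."""
    words = text.split()
    for size in range(len(words), minwords - 1, -1):
        for start in range(len(words) - size + 1):
            yield ' '.join(words[start:start + size])

print(list(subphrases_longest_first('five dozen jugs')))
# ['five dozen jugs', 'five dozen', 'dozen jugs', 'five', 'dozen', 'jugs']
```

Because the three-word run is produced before any two-word or one-word run, a matcher that takes the first hit will prefer the longest phrase.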

Phrases and subphrases are compared using list slices, roughly like this:

    text = ['five','dozen','liquor','jugs']
    phrase = ['liquor','jugs']
    if text[2:4] == phrase:
        print('matched')

Using this way of comparing phrases, the script walks through the original text and rewrites it with the phrases picked out.

texts = [['hello this is me'],['oh you know u']]
phrases_to_match = [['this is','u'],['oh you','me']]
from itertools import chain

def flatten(list_of_lists):
    return list(chain(*list_of_lists))

def compile_subphrases(text,minwords=1,include_self=True):
    words = text.split()
    text_length = len(words)
    max_phrase_length = text_length if include_self else text_length - 1
    # NOTE: longest phrases first
    for phrase_length in range(max_phrase_length,minwords - 1,-1):
        n_length_phrases = (' '.join(words[r:r + phrase_length])
                            for r in range(text_length - phrase_length + 1))
        yield from n_length_phrases
        
def match_sublist(mainlist,sublist,i):
    if i + len(sublist) > len(mainlist):
        return False
    return sublist == mainlist[i:i + len(sublist)]

phrases_to_match = list(flatten(phrases_to_match))
texts = list(flatten(texts))
results = []
for raw_text in texts:
    print(f"Raw text: '{raw_text}'")
    matched_phrases = [
        subphrase.split()
        for subphrase
        in compile_subphrases(raw_text)
        if subphrase in phrases_to_match
    ]
    phrasal_text = []
    index = 0
    text_words = raw_text.split()
    while index < len(text_words):
        for matched_phrase in matched_phrases:
            if match_sublist(text_words,matched_phrase,index):
                phrasal_text.append(' '.join(matched_phrase))
                index += len(matched_phrase)
                break
        else:
            phrasal_text.append(text_words[index])
            index += 1
    results.append(phrasal_text)
print(f'Phrases to match: {phrases_to_match}')
print(f"Results: {results}")

Result:

$python3 main.py
Raw text: 'hello this is me'
Raw text: 'oh you know u'
Phrases to match: ['this is','u','oh you','me']
Results: [['hello','this is','me'],['oh you','know','u']]

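To test this and other answers with a larger dataset, try this at the start of the code. It builds variations of a single long sentence (500 randomly chosen five-word combinations) to simulate a large number of texts.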

from itertools import chain,combinations
import random

#texts = [['hello this is me'],['oh you know u']]
theme = ' '.join([
    'pack my box with five dozen liquor jugs said','the quick brown fox as he jumped over the lazy dog'
])
variations = list([
    ' '.join(combination)
    for combination
    in combinations(theme.split(),5)
])
texts = random.choices(variations,k=500)
#phrases_to_match = [['this is','me']]
phrases_to_match = [
    ['pack my box','quick brown','the quick','brown fox'],['jumped over','lazy dog'],['five dozen','jugs']
]

Try this.

import re

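def filter_phrases(phrases):
    # Drop every phrase that occurs as whole words inside a longer phrase
    phrase_l = sorted(phrases,key=len)
    
    for i,v in enumerate(phrase_l):
        for j in phrase_l[i + 1:]:
            if re.search(rf'\b{v}\b',j):
                # discard() does not fail if `v` was already dropped
                phrases.discard(v)
    
    return phrases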


text = [
    ['hello this is me'],['oh you know u'],['a quick brown fox jumps over the lazy dog']
]
phrases = [
    ['this is','u'],['oh you','me'],['fox','brown fox']
]

# Flatten the `text` and `phrases` list
text = [
    line for l in text 
    for line in l
]
phrases = {
    phrase for l in phrases 
    for phrase in l
}

# If you're quite sure that your phrase
# list doesn't have any overlapping 
# zones,then I strongly recommend 
# against using this `filter_phrases()` 
# function.
phrases = filter_phrases(phrases)

result = []

for line in text:
    # This is the pattern to match the
    # 'space' before the phrases 
    # in the line on which the split
    # is to be done.
    l_phrase_1 = '|'.join([
        f'(?={phrase})' for phrase in phrases
        if re.search(rf'\b{phrase}\b',line)
    ])
    # This is the pattern to match the
    # 'space' after the phrases 
    # in the line on which the split
    # is to be done.
    l_phrase_2 = '|'.join([
        f'(?<={phrase})' for phrase in phrases
        if re.search(rf'\b{phrase}\b',line)
    ])
    
    # Now,we combine the both patterns
    # `l_phrase_1` and `l_phrase_2` to
    # create our master regex. 
    result.append(re.split(
        rf'\s(?:{l_phrase_1})|(?:{l_phrase_2})\s',line
    ))
    
print(result)

# OUTPUT (PRETTY FORM)
#
# [
#     ['hello','this is','me'],
#     ['oh you','know','u'],
#     ['a quick','brown fox','jumps over the lazy dog']
# ]

Here, I used re.split to split the text just before and just after each phrase.
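As a minimal sketch of that idea with a single phrase: the lookahead matches the whitespace right before the phrase and the lookbehind matches the whitespace right after it. The use of re.escape() here is an addition for safety and is not in the answer's code.

```python
import re

line = 'a quick brown fox jumps over the lazy dog'
# Escape the phrase in case it contains regex metacharacters
phrase = re.escape('brown fox')

# Split on whitespace that is followed by the phrase, or preceded by it
pattern = rf'\s(?={phrase})|(?<={phrase})\s'
parts = re.split(pattern, line)
print(parts)
# ['a quick', 'brown fox', 'jumps over the lazy dog']
```

The lookbehind works here because the phrase is a fixed-width literal; Python's re module rejects variable-width lookbehind patterns.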

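This uses Python's best-in-class list slicing. phrase[::2] creates a list slice consisting of the 0th, 2nd, 4th, 6th... elements of a list. This is the basis of the solution below.

For each phrase, a | symbol is placed on either side of every occurrence found in the text. The following shows 'this is' being marked in 'hello this is me':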

'hello this is me' -> 'hello|this is|me'

When the text is split on |:

['hello','this is','me']

the even elements [::2] are the unmatched text and the odd elements [1::2] are the matched phrases:

                   0         1       2
unmatched:     ['hello',            'me']
matched:                 'this is',
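The marking and splitting step can be tried in isolation. This is a small sketch for one phrase that is surrounded by spaces; the variable names are chosen for this illustration only.

```python
text = 'hello this is me'
phrase = 'this is'

# Put a | on either side of the phrase, then split on |
marked = text.replace(f' {phrase} ', f'|{phrase}|')
segments = marked.split('|')

print(marked)          # hello|this is|me
print(segments)        # ['hello', 'this is', 'me']
print(segments[::2])   # unmatched: ['hello', 'me']
print(segments[1::2])  # matched:   ['this is']
```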

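If a segment list has different numbers of matched and unmatched elements, zip_longest is used to fill the gap with an empty string, so that there is always a balanced pair of unmatched and matched text: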

                   0         1       2     3
unmatched:     ['hello',            'me',    ]
matched:                 'this is',       ''
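A short sketch of the padding and re-interleaving, using itertools.zip_longest and chain.from_iterable (the latter stands in for the answer's flatten() helper here):

```python
from itertools import chain, zip_longest

unmatched = ['hello', 'me']
matched = ['this is']

# Pair each unmatched segment with the matched phrase that follows it,
# padding the shorter list with an empty string
pairs = list(zip_longest(unmatched, matched, fillvalue=''))
print(pairs)   # [('hello', 'this is'), ('me', '')]

# Interleave the pairs back into one flat list
merged = list(chain.from_iterable(pairs))
print(merged)  # ['hello', 'this is', 'me', '']
```

Unmatched text stays at even indices and matched phrases at odd indices, which is what the later slicing relies on.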

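For each phrase, the previously unmatched (even-numbered) elements of the text are scanned, the phrase (if found) is delimited with |, and the results are merged back into the segmented text.

The matched and unmatched segments are merged back into the segmented text using zip_longest() followed by flatten(), taking care to maintain the even (unmatched) and odd (matched) indices of the new and existing text segments. Newly matched phrases are merged back in as odd elements, so they are not scanned again for embedded phrases. This prevents conflicts between phrases with similar wording, such as 'this is' and 'this'.

flatten() is used everywhere. It finds sublists embedded in a larger list and flattens their contents into the main list: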

['outer list 1',['inner list 1','inner list 2'],'outer list 2']

becomes:

['outer list 1','inner list 1','inner list 2','outer list 2']

This is useful for collecting phrases from several embedded lists, and for merging split or zipped sublists back into the segmented text:

[['the quick brown fox says',''],['hello','this is','me','']] ->

['the quick brown fox says','','hello','this is','me',''] ->

                   0                        1       2        3          4     5
unmatched:     ['the quick brown fox says',     'hello',              'me',     ]
matched:                                    '',          'this is',          ''

Finally, the empty-string elements, which were only there for even-odd alignment, can be removed:

['the quick brown fox says','','hello','this is','me',''] ->
['the quick brown fox says','hello','this is','me']

texts = [['hello this is me'],['the quick brown fox says hello this is me']]
phrases_to_match = [['this is','you','me']]
from itertools import zip_longest

def flatten(string_list):
    flat = []
    for el in string_list:
        if isinstance(el,list) or isinstance(el,tuple):
            flat.extend(el)
        else:
            flat.append(el)
    return flat

phrases_to_match = flatten(phrases_to_match)
# longer phrases are given priority to avoid problems with overlapping
phrases_to_match.sort(key=lambda phrase: -len(phrase.split()))
segmented_texts = []
for text in flatten(texts):
    segmented_text = text.split('|')
    for phrase in phrases_to_match:
        new_segments = segmented_text[::2]
        delimited_phrase = f'|{phrase}|'
        for match in [f' {phrase} ',f' {phrase}',f'{phrase} ']:
            new_segments = [
                segment.replace(match,delimited_phrase)
                for segment
                in new_segments
            ]
        new_segments = flatten([segment.split('|') for segment in new_segments])
        segmented_text = new_segments if len(segmented_text) == 1 else \
            flatten(zip_longest(new_segments,segmented_text[1::2],fillvalue=''))
    segmented_text = [segment for segment in segmented_text if segment.strip()]
    # option 1: unmatched text is split into words
    segmented_text = flatten([
        segment if segment in phrases_to_match else segment.split()
        for segment
        in segmented_text
    ])
    segmented_texts.append(segmented_text)
print(segmented_texts)

Result:

[['hello','this is','me'],['the','quick','brown','fox','says','hello','this is','me']]

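Note that when both are among the phrases to match, the phrase 'oh you' takes priority over its subset phrase 'you' (longer phrases are processed first), and there is no conflict.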
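The priority comes from sorting the phrases by word count before matching. A minimal sketch of that sort, with an example phrase list assumed for illustration:

```python
phrases = ['you', 'this is', 'oh you', 'me']

# More words first; list.sort() is stable, so phrases with the same
# word count keep their original relative order
phrases.sort(key=lambda phrase: -len(phrase.split()))
print(phrases)  # ['this is', 'oh you', 'you', 'me']
```

Since 'oh you' is handled before 'you', it has already been marked as a matched segment by the time 'you' is tried.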

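This is a quasi-complete answer. Something to get you started:

Assumption: looking at your example, I don't see why the phrases have to stay split into sublists, since your second text is split on 'u', which is in the first list item of phrases.

Preparation

Flatten the phrases list-of-lists into a single list. I have seen this done in an example: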

flatten = lambda t: [item for sublist in t for item in sublist if item != '']

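Main code:

My strategy is to look at each item in the text list (at the start it is just a single item) and try to split it on one of the phrases. If a split is found, a change has occurred (which I track with a flag), so I replace the item with its split counterpart and then flatten (so that it is all one list again). Then, if a change occurred, I loop again from the start (starting over because there is no way to tell whether something later in the phrases list could also have split an earlier item).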

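import re

def flatten(t):
    # Flatten one level; plain strings are kept whole and empty strings dropped
    flat = []
    for item in t:
        flat.extend(item if isinstance(item,list) else [item])
    return [item for item in flat if item != '']

text =[['hello this is me'],['oh you know u']]
phrases = ['this is','me']

output = []
for t in text:
    t_copy = list(t) # work on a copy so `text` is not modified
    changed = True
    while changed: # start over after every split
        changed = False
        for i,tc in enumerate(t_copy):
            if tc in phrases: # already split out as a phrase
                continue
            for p in phrases:
                before = [tc] # each item is a string,my output is a list,must change to list to "compare apples to apples"
                found = re.split(rf'\b({re.escape(p)})\b',tc)
                found = [f.strip() for f in found if f.strip()]
                if found != before:
                    t_copy[i] = found
                    t_copy = flatten(t_copy) # flatten so it is all one list again
                    changed = True
                    break
            if changed:
                break
    output.append(t_copy)
print(output)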
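The splitting step relies on re.split() keeping the text matched by a capturing group in its result. A small sketch of that behaviour, using the same variable names as the code above for readability:

```python
import re

tc = 'hello this is me'
p = 'this is'

# Without a group the phrase is consumed by the split
print(re.split(p, tc))         # ['hello ', ' me']
# With a capturing group the phrase is kept as its own item
print(re.split(f'({p})', tc))  # ['hello ', 'this is', ' me']

found = [f.strip() for f in re.split(f'({p})', tc)]
print(found)                   # ['hello', 'this is', 'me']
```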
