Using pyparsing, how do I group the expressions matched by OneOrMore(expr1 | expr2)?

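Problem description

My website lets users post a string containing several questions with multiple-choice answers. An enforced style guide allows the input to be parsed with regular expressions; the questions and their MCQ choices are then stored in a database and later returned in randomized practice exams.

I want to move to pyparsing because the regular expressions are not immediately readable and I feel somewhat locked in by them. I would like to be able to extend my question parser easily, and doing that with regex feels very cumbersome.

User input has the form: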

quiz = [<question-answer>,<q-start>]
<question-answer> = <question> + <answer>
<question> = [<q-text>,\n] ?!= <a-start>
<answer> = [<answer>,<a-start>]  ?!= <q-start>
<q-start> = <nums> + "." | ")"
<a-start> = <alphas> + "." | ")" 

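The long user-input string is split into question-answer blocks, each delimited by the q-start of the next question-answer block. A question is all the text between a q-start and the first a-start. The answers are a list of all the text between an a-start and either the next a-start or the following q-start.

Sample text: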

3. A lesion that affects N. Solitarius will result in the patient having problems related to:
a. taste and blood pressure regulation
c. swallowing and respiration
b. smell and taste
d. voice quality and taste
e. whistling and chewing

4. A patient comes to your office complaining of weakness on the right side of their body. You notice that their head is
turned slightly to the left and their right shoulder droops. When asked to protrude their tongue,it deviates to the right. Eye
movements and eye-related reflexes appear to be normal. The lesion most likely is located in the:
c. left ventral medulla
a. left ventral midbrain
b. right dorsal medulla
d. left ventral pons
e. right ventral pons

5. A colleague {...}

The regular expressions I have been using:

# matches a question-answer block. Matching q-start until an empty line.
regex1 = r"(^[\t ]*[0-9]+[\)\.][\t ]+[\s\S]*?(?=^[\n\r]))" 

# Within question-answer block,matches everything that does not start with a-start
regex6 = r"(^(?!(^[a-fA-F][\)\.]\s+[\s\S]+)).*)"

# Matches all text between a-start and the following a-start,or until the question-answer substring block ends.
regex5 = r"(^[a-fA-F][\)\.]\s+[\s\S]+)"       

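Then I use a bit of Python and re to trim off the question number and the MCQ letter, join any question that is broken across lines, and append the MCQs to a list.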
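A minimal sketch of that kind of post-processing, using only the standard library. It assumes each block ends with a blank line, as regex1 requires. The sample text and the names BLOCK, Q_NUMBER, A_LETTER and split_block are illustrative and are not taken from the original code, which is not shown.

```python
import re

# Hypothetical sample; every block is followed by a blank line.
sample = (
    "1. First question\n"
    "wraps here:\n"
    "a. alpha\n"
    "b. beta\n"
    "\n"
    "2. Second question?\n"
    "a. gamma\n"
    "b. delta\n"
    "\n"
)

# regex1 from above: one question-answer block, up to the next empty line.
BLOCK = re.compile(r"(^[\t ]*[0-9]+[\)\.][\t ]+[\s\S]*?(?=^[\n\r]))", re.MULTILINE)
Q_NUMBER = re.compile(r"^[\t ]*[0-9]+[\)\.][\t ]+")  # leading "1. "
A_LETTER = re.compile(r"^[a-fA-F][\)\.]\s+")         # leading "a. "


def split_block(block):
    """Return (question, [answers]) for one question-answer block."""
    question_lines, answers = [], []
    for line in Q_NUMBER.sub("", block, count=1).splitlines():
        line = line.strip()
        if not line:
            continue
        if A_LETTER.match(line):
            # A new option: drop the letter and start a new answer.
            answers.append(A_LETTER.sub("", line, count=1))
        elif answers:
            # Continuation line of the previous option.
            answers[-1] += " " + line
        else:
            # Still inside the (possibly wrapped) question text.
            question_lines.append(line)
    return " ".join(question_lines), answers


quiz = [split_block(block) for block in BLOCK.findall(sample)]
for question, answers in quiz:
    print(question, answers)
```

Even in this small form, the question/answer association lives in hand-written loop logic rather than in the patterns themselves, which is the part that becomes awkward to extend.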

In pyparsing I tried this:

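from pyparsing import (Char, Group, LineEnd, LineStart, OneOrMore, Optional,
                       SkipTo, StringEnd, Suppress, Word, alphas, nums, oneOf)

EOL = Suppress(LineEnd())
delim = oneOf(". )")
q_start = LineStart() + Word(nums) + delim
a_start = LineStart() + Char(alphas) + delim

# A question: the text lines after a q-start, up to the first a-start.
question = Optional(EOL) + Group(Suppress(q_start) + OneOrMore(SkipTo(LineEnd()) + EOL, stopOn=a_start)).setResultsName('question', listAllMatches=True)

# An answer: the text lines after an a-start, up to the next a-start, q-start or end of input.
answer = Optional(EOL) + Group(Suppress(a_start) + OneOrMore(SkipTo(LineEnd()) + EOL, stopOn=(a_start | q_start | StringEnd()))).setResultsName('answer', listAllMatches=True)

# test holds the user-input string to parse.
qi = Group(OneOrMore(question | answer)).setResultsName('group', listAllMatches=True)
t = qi.parseString(test)
print(t.dump())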

Result:

[[['The tectum of the midbrain comprises the:'],['superior and inferior colliculi'],['reticular formation'],['internal arcuate fibers'],['cerebellar peduncles'],['pyramids'],['damage to the dorsal columns on one side of the spinal cord would results in:'],['loss of MVP ipsilaterally below the level of the lesion'],['hypertonicity of the contralateral limbs'],['loss of pain and temperature contralaterally below the level of the lesion'],['loss of MVP contralaterally above the level of the lesion'],['loss of pain and temperature ipsilaterally above the level of the lesion']]]
- group: [[['The tectum of the midbrain comprises the:'],['loss of pain and temperature ipsilaterally above the level of the lesion']]]
  [0]:
    [['The tectum of the midbrain comprises the:'],['loss of pain and temperature ipsilaterally above the level of the lesion']]
    - answer: [['superior and inferior colliculi'],['loss of pain and temperature ipsilaterally above the level of the lesion']]
      [0]:
        ['superior and inferior colliculi']
      [1]:
        ['reticular formation']
      [2]:
        ['internal arcuate fibers']
      [3]:
        ['cerebellar peduncles']
      [4]:
        ['pyramids']
      [5]:
        ['loss of MVP ipsilaterally below the level of the lesion']
      [6]:
        ['hypertonicity of the contralateral limbs']
      [7]:
        ['loss of pain and temperature contralaterally below the level of the lesion']
      [8]:
        ['loss of MVP contralaterally above the level of the lesion']
      [9]:
        ['loss of pain and temperature ipsilaterally above the level of the lesion']
    - question: [['The tectum of the midbrain comprises the:'],['damage to the dorsal columns on one side of the spinal cord would results in:']]
      [0]:
        ['The tectum of the midbrain comprises the:']
      [1]:
        ['damage to the dorsal columns on one side of the spinal cord would results in:']

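This does match the questions and the answers, and it correctly handles the line breaks that can interrupt a question or an answer. The problem is that they are not grouped the way I expected. I was expecting something like group[0] = question, answers[1:4]; group[2] = question, answers[1:4].

Does anyone have any suggestions?

Thanks!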

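Solution

I think you were on the right track. I wrote a parser for your text separately and came up with a very similar structure, with just a few differences.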

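# Reuses q_start, a_start and EOL from the question; Combine must also be imported from pyparsing.
question = Combine(q_start.suppress() + SkipTo(EOL + a_start))
answer = Combine(a_start.suppress() + SkipTo(EOL + (a_start | q_start | StringEnd())))
# One group per question: the question followed by its one or more answers.
q_a = Group(question("question") + answer[1, ...]("answers"))

for t in q_a[...].parseString(test):
    print(t.dump())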

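The biggest difference is that the expression I used to parse your text is not just OneOrMore(question | answer); instead I defined Group(question + OneOrMore(answer)). This creates a group for each question together with its related answers. In your parser, listAllMatches simply creates one results name for all the questions and another for all the answers, so every association between them is lost. Grouping "a question plus one or more answers" preserves those associations.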
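To make the grouping idea concrete, here is a small self-contained sketch. It is not the exact parser shown above. It assumes every question and every answer fits on a single line, that markers start in column 0, and it anchors the markers with a multiline Regex instead of LineStart(). The two-question sample and the names text, answers and quiz are made up for illustration.

```python
import pyparsing as pp

# Hypothetical sample: one line per question and per answer.
sample = """\
1. First question?
a. alpha
b. beta

2. Second question?
a. gamma
b. delta
"""

# "(?m)^" anchors each marker to the start of a line.
q_start = pp.Regex(r"(?m)^\d+[.)]").suppress()
a_start = pp.Regex(r"(?m)^[A-Za-z][.)]").suppress()

# The rest of the current line (leading whitespace is skipped by default).
text = pp.Regex(r".+")

question = q_start + text("question")
answer = a_start + text

# Collect the answers of one question into their own sub-list ...
answers = pp.Group(pp.OneOrMore(answer))("answers")
# ... and keep each question together with its answers in one group.
q_a = pp.Group(question + answers)
quiz = pp.OneOrMore(q_a)

result = quiz.parseString(sample)
for group in result:
    print(group["question"], group["answers"].asList())
```

Because the results names are set inside each Group, "question" and "answers" are looked up per group, so the answers stay attached to the question they follow.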

If you want to strip out the '\n's, you can do that more easily with a parse action than with all the EOL handling.
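A parse action along the following lines could do that. This is only a sketch under stated assumptions: collapse_whitespace and the sample text are hypothetical, markers are assumed to start in column 0, and they are anchored with a multiline Regex rather than LineStart() to keep the example short.

```python
import pyparsing as pp

# Hypothetical sample in which the question wraps across two lines.
sample = """\
4. A question that wraps
onto a second line:
a. first option
b. second option
"""

q_start = pp.Regex(r"(?m)^\d+[.)]").suppress()
a_start = pp.Regex(r"(?m)^[A-Za-z][.)]").suppress()


def collapse_whitespace(tokens):
    # Replace the matched text with a copy in which every run of whitespace,
    # including embedded newlines, becomes a single space.
    return " ".join(tokens[0].split())


# SkipTo returns everything up to the next marker, newlines included;
# the parse action then tidies that text up.
question = q_start + pp.SkipTo(a_start).setParseAction(collapse_whitespace)("question")
answer = a_start + pp.SkipTo(a_start | q_start | pp.StringEnd()).setParseAction(collapse_whitespace)

q_a = pp.Group(question + pp.Group(pp.OneOrMore(answer))("answers"))

parsed = q_a.parseString(sample)[0]
print(parsed["question"])
print(parsed["answers"].asList())
```

The clean-up happens in one small function, so the grammar no longer needs separate EOL expressions just to drop line breaks.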
