问题描述
这是提取患者数据和任何匹配的Gleason数据的示例。
from pyparsing import *
num = Word(nums)
accessionDate = Combine(num + "/" + num + "/" + num)("accDate")
accessionNumber = Combine("S" + num + "-" + num)("accNum")
patMedicalRecordNum = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
gleason = Group("GLEASON" + Optional("score:") + num("left") + "+" + num("right") + "=" + num("total"))
assert 'GLEASON 5+4=9' == gleason
assert 'GLEASON score: 3 + 3 = 6' == gleason
patientData = Group(accessionDate + accessionNumber + patMedicalRecordNum)
assert '01/02/11 S11-4444 20/111-22-3333' == patientData
partMatch = patientData("patientData") | gleason("gleason")
lastPatientData = None
for match in partMatch.searchString(data):
if match.patientData:
lastPatientData = match
elif match.gleason:
if lastPatientData is None:
print "bad!"
continue
print "{0.accDate}: {0.accNum} {0.patientNum} Gleason({1.left}+{1.right}={1.total})".format(
lastPatientData.patientData, match.gleason
)
印刷品:
01/01/11: S11-55555 20/444-55-6666 Gleason(5+4=9)
01/02/11: S11-4444 20/111-22-3333 Gleason(3+3=6)
解决方法
我需要从前列腺切除术最终诊断报告的平面文件中提取格里森分数。这些分数始终包含格里森(Gleason)字和两个数字,这些数字加起来等于另一个数字。在过去的二十多年中,人类一直在打字。包括空格和修饰符的各种约定。以下是到目前为止我的Backus-
Naur表格和两个示例记录。仅针对前列腺切除术,我们正在研究一千多例病例。
我之所以使用pyparsing是因为我正在学习python,并且对我对正则表达式写作的有限接触并没有美好的回忆。
我的问题:如何在不解析最终诊断中可能存在或可能不存在的所有其他可选数据的情况下剔除这些格里森分数?
num = Word(nums)
record ::= accessionDate + accessionNumber + patMedicalRecordNum + finalDxText
accessionDate ::= num + "/" + num + "/" num
accessionNumber ::= "S" + num + "-" + num
patMedicalRecordNum ::= num + "/" + num + "-" + num + "-" + num
finalDxText ::= listOfParts + optionalComment + optionalpTNMStage
listOfParts ::= OneOrMore(part)
part ::= <multiline idiosyncratic freetext which may contain a Gleason score I want> + optionalpTNMStage
optionalComment ::= <multiline idiosyncratic freetext which may contain a Gleason score I don't want>
optionalpTNMStage ::= <multiline idiosyncratic freetext which may contain a Gleason score I don't want>
01/01/11 S11-55555 20/444-55-6666 A. PROSTATE AND SEMINAL VESICLES,PROSTATECTOMY:
- ADENOCARCINOMA.
TOTAL GLEASON SCORE: GLEASON 5+4=9
TUMOR LOCATION: BILATERAL
TUMOR QUANTITATION: 15% OF PROSTATE INVOLVED BY TUMOR
EXTRAPROSTATIC EXTENSION: PRESENT AT RIGHT POSTERIOR
SEMINAL VESICLE INVASION: PRESENT
MARGINS: UNINVOLVED
LYMPHOVASCULAR INVASION: PRESENT
PERINEURAL INVASION: PRESENT
LYMPH NODES (SPECIMENS B AND C):
NUMBER EXAMINED: 25
NUMBER INVOLVED: 1
DIAMETER OF LARGEST METASTASIS: 1.7 mm
ADDITIONAL FINDINGS: HIGH-GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA,ACUTE AND CHRONIC INFLAMMATION,INTRADUCTAL EXTENSION OF INVASIVE
CARCINOMA
PATHOLOGIC STAGE: pT3b N1 MX
B. LYMPH NODES,RIGHT PELVIC,EXCISION:
- ONE OF SEVENTEEN LYMPH NODES POSITIVE FOR METASTASIS (1/17).
C. LYMPH NODES,LEFT PELVIC,EXCISION:
- EIGHT LYMPH NODES NEGATIVE FOR METASTASIS (0/8).
01/02/11 S11-4444 20/111-22-3333 PROSTATE AND SEMINAL VESICLES,PROSTATECTOMY:
- ADENOCARCINOMA.
GLEASON SCORE: 3 + 3 = 6 WITH TERTIARY PATTERN OF 5.
TUMOR QUANTITATION: APPROXIMATELY 10% BY VOLUME.
TUMOR LOCATION: BILATERAL.
EXTRAPROSTATIC EXTENSION: NOT IDENTIFIED.
MARGINS: NEGATIVE.
PERINEURAL INVASION: IDENTIFIED.
LYMPH-VASCULAR INVASION: NOT IDENTIFIED.
SEMINAL VESICLE/VASA DEFERENTIA INVASION: NOT IDENTIFIED.
LYMPH NODES: NONE SUBMITTED.
OTHER: HIGH GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA.
PATHOLOGIC STAGE (pTNM): pT2c NX.
全面披露:我是一名从事研究的医师;这是我第一次使用python。我已经阅读了Lutz的《学习Python》,Shaw的《艰难的学习Python》,并研究了各种问题。我在该论坛,pyparsing
Wiki上审查了许多与pyparsing有关的问题,并且我购买并阅读了McGuire先生的“
Pyparsing入门”。也许我是在问一个问题,什么时候应该真正告诉我我站在“沮丧的死亡螺旋,当您必须编写解析器时,这种螺旋非常普遍”(McGuire,17岁)?我不知道。到目前为止,我很高兴能够从事实际上可能是一个真正的项目。