Entity attribute extraction from unstructured medical text

Question

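I am working on extracting named entities and their attributes. My goal is to extract the attributes associated with a particular entity in a sentence.

For example: "The Patient report is Positive for ABC disease"

In the sentence above, ABC is the entity and Positive is the attribute that describes ABC.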

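I have already built a solution for extracting the entities, and it works with respectable accuracy. I am now working on the second part of the problem, extracting the attributes associated with those entities, and I am looking for a clean way to do it.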

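I tried a rule-based approach to extract the attributes. It gives decent results but has the following drawbacks:

The source code is unmanageable.
It is not generic at all, and new scenarios are hard to handle.
It is time-consuming.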

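To arrive at a more generic solution, I explored different NLP techniques and found dependency tree parsing to be a potential fit.

I am looking for suggestions on how to solve this problem with dependency tree parsing in Python or Java.

Feel free to suggest any other technique that you think might help.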

Answer

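I suggest using the spaCy Python library, because it is easy to use and has a decent dependency parser.

A baseline solution starts from the entity you are interested in and traverses the dependency tree breadth-first, following both the head and the children of each token, until it either meets a token that looks like an attribute or gets too far from the entity.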

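Further improvements to this solution include:

Some rules for handling negation, such as "not Positive"
A better attribute classifier (here I simply look for adjectives)
Some rules about which dependency types and tokens should be followed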

Here is my baseline code:

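import spacy
from collections import deque

nlp = spacy.load("en_core_web_sm")
text = "The Patient report is Positive for ABC disease"
doc = nlp(text)
# Look up tokens by their text; a repeated word keeps only its last occurrence
tokens = {token.text: token for token in doc}

def is_attribute(token):
    # todo: use a classifier to determine whether the token is an attribute
    return token.pos_ == 'ADJ'

def bfs(token, predicate, max_distance=3):
    # Breadth-first search over the dependency tree, treating edges as undirected
    queue = deque([(token, 0)])
    visited = {token.i}  # indices of tokens already queued, so none is expanded twice
    while queue:
        t, dist = queue.popleft()
        if max_distance and dist > max_distance:
            return None
        if predicate(t):
            return t
        # todo: maybe, consider only specific types of dependencies or tokens
        neighbors = [t.head] + list(t.children)
        for n in neighbors:
            if n.text and n.i not in visited:
                visited.add(n.i)
                queue.append((n, dist + 1))
    return None

print(bfs(tokens['ABC'], is_attribute))  # Positive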
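
Two of the improvements listed above, negation handling and restricting which dependencies are followed, could be sketched roughly as follows. This is only an illustration under assumptions, not a tested clinical rule set: `Node` is a hypothetical stand-in that mimics the spaCy token attributes `text`, `pos_`, `dep_`, `head` and `children` so that the example runs without a model, the parse of the sentence is written by hand, and the label set `SKIP_DEPS` and the rule "a `neg` child on the attribute or on its head" are guesses that would need tuning on real data.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a spaCy token (same attribute names)."""
    text: str
    pos_: str
    dep_: str
    head: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)


def link(head, *children):
    """Attach children to a head node (only needed for the hand-made tree)."""
    for child in children:
        child.head = head
        head.children.append(child)


# Assumed set of dependency labels that the search should not walk into.
SKIP_DEPS = {"punct", "det"}


def find_attribute(entity, max_distance=3):
    """BFS from the entity; returns the nearest adjective or None."""
    seen = {id(entity)}  # with real spaCy tokens, token.i would be the natural key
    queue = deque([(entity, 0)])
    while queue:
        tok, dist = queue.popleft()
        if dist > max_distance:
            return None
        if tok.pos_ == "ADJ":
            return tok
        for n in [tok.head, *tok.children]:
            if n is None or id(n) in seen or n.dep_ in SKIP_DEPS:
                continue
            seen.add(id(n))
            queue.append((n, dist + 1))
    return None


def is_negated(attr):
    """Assumed rule: a 'neg' child on the attribute itself or on its head."""
    candidates = list(attr.children)
    if attr.head is not None and attr.head is not attr:
        candidates += list(attr.head.children)
    return any(c.dep_ == "neg" for c in candidates)


# Hand-written parse of "The report is not Positive for ABC disease".
the = Node("The", "DET", "det")
report = Node("report", "NOUN", "nsubj")
is_ = Node("is", "AUX", "ROOT")
not_ = Node("not", "PART", "neg")
positive = Node("Positive", "ADJ", "acomp")
for_ = Node("for", "ADP", "prep")
abc = Node("ABC", "PROPN", "compound")
disease = Node("disease", "NOUN", "pobj")
link(report, the)
link(is_, report, not_, positive)
link(positive, for_)
link(for_, disease)
link(disease, abc)

attr = find_attribute(abc)
print(attr.text, is_negated(attr))  # Positive True
```

Because the two functions only read `pos_`, `dep_`, `head` and `children`, the same logic should carry over to real spaCy tokens once the visited-set key is switched to `token.i`; the labels an actual parser assigns to a given sentence may differ from the hand-made tree above.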

dependency-parsing nlp python spacy stanford-nlp