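Fast named entity recognition: replacing names with something else in a large corpus

Problem description

I have collected a large amount of data and want to de-identify it by removing every possible name. I used the Python module "ner_anonymizer" but got some strange results. Here is the code, run on a small sample of my df: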

import ner_anonymizer
import pandas as pd
df = pd.DataFrame({
    "text": [
        "Jessi assisted with all cars. She moved to another City,called Balla.",
        "Jennifer is a friend of Jillian. jillian is happy with his Job.",
    ]
})

anonymizer = ner_anonymizer.DataAnonymizer(df)
anonymized_df, hash_dictionary = anonymizer.anonymize(
    free_text_columns=["text"],
    pretrained_model_name="dslim/bert-base-NER",
    label_list=["O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"],
    labels_to_anonymize=["B-PER", "I-LOC"],
)

anonymized_df['text'][0]
output: '2809a05a22a4a9c1882a580bcc0ad8a6i assisted with all cars. She moved to 0cc175b9c0f1b6a831c399e269772661nother 57d056ed0984166336b7879c2af3657f,c0cc175b9c0f1b6a831c399e269772661lled 353df421c4fc976e2637061d7a83f6010cc175b9c0f1b6a831c399e269772661.'

anonymized_df['text'][1]
output: 'e1f6a14cd07069692017b53a8ae881f6 is a friend of 2ab45b80a312bb97190187c6f66fdd58ian. jillian is happy with his Job.'

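In the first output, Jessi, another, City, called and Balla were replaced with hashes, although only two of them, Jessi and Balla, are names; the other replacements are wrong. In the second output, Jillian was replaced with a hash followed by a trailing "ian", while the second, lowercase jillian was not replaced at all.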
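One observation that may help with diagnosis: the digest 0cc175b9c0f1b6a831c399e269772661, which recurs in the first output, is the MD5 digest of the single letter "a". That suggests pieces smaller than a word (possibly WordPiece subword tokens such as "Jill" + "##ian") are being hashed instead of whole names. It may also matter that labels_to_anonymize lists only "B-PER" and "I-LOC", leaving out "I-PER" and "B-LOC". The check below is a minimal sketch and assumes the library hashes with MD5, which the question does not confirm:

```python
import hashlib

# Digest that recurs in the first anonymized output from the question.
repeated_digest = "0cc175b9c0f1b6a831c399e269772661"

# Assumption: the anonymizer hashes the matched text with MD5.
digest_of_a = hashlib.md5("a".encode("utf-8")).hexdigest()

# True means a lone "a" was treated as an entity fragment and hashed.
matches = digest_of_a == repeated_digest
print(matches)
```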

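I would appreciate any suggestions on how to improve this code and, if possible, on the following as well:

1) use "X" instead of a hash, and 2) make it faster, since a large corpus takes far too long.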
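One possible direction, sketched below, is to bypass ner_anonymizer and call the Hugging Face transformers pipeline directly. Replacing by character offsets avoids partial-word replacements, word-level aggregation keeps subword pieces together, and batching the texts may help with speed. The helper names mask_spans, mask_names and anonymize_texts are hypothetical, and the sketch has not been benchmarked on a large corpus:

```python
import re


def mask_spans(text, spans, mask="X"):
    """Replace every (start, end) character span in text with mask."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlapping or touching spans collapse into one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    pieces, cursor = [], 0
    for start, end in merged:
        pieces.append(text[cursor:start])
        pieces.append(mask)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def mask_names(text, names, mask="X"):
    """Mask whole-word, case-insensitive occurrences of already known names."""
    names = sorted({n for n in names if n.strip()}, key=len, reverse=True)
    if not names:
        return text
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(mask, text)


def anonymize_texts(texts, labels=("PER", "LOC"), mask="X", batch_size=32, device=-1):
    """Hypothetical driver: needs `pip install transformers torch`.

    Assumes a recent transformers version, where a list input yields one
    list of entity dicts per text.
    """
    from transformers import pipeline  # imported lazily on purpose

    ner = pipeline(
        "ner",
        model="dslim/bert-base-NER",
        aggregation_strategy="first",  # one label per whole word
        device=device,                 # -1 = CPU, 0 = first GPU
    )
    texts = list(texts)
    results = ner(texts, batch_size=batch_size)

    names, all_spans = set(), []
    for text, entities in zip(texts, results):
        spans = [(e["start"], e["end"]) for e in entities if e["entity_group"] in labels]
        all_spans.append(spans)
        # Skip very short strings so stray letters are not masked everywhere.
        names.update(text[s:e] for s, e in spans if e - s >= 3)

    # Second pass catches occurrences the model missed, e.g. lowercase "jillian".
    return [mask_names(mask_spans(t, s, mask), names, mask) for t, s in zip(texts, all_spans)]
```

With a DataFrame this could be applied as df["text"] = anonymize_texts(df["text"].tolist()). The second pass trades precision for recall: a detected name that is also an ordinary word would be masked everywhere, so it may be worth restricting it to a single document or dropping it.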


anonymize hash named-entity-recognition python replace