Python Pandas: any ideas for text extraction?

Problem description

I have thousands of txt files that look like this (the values are made up):

Date :  [ 2010-01-01 XX:XX:XX ]  Age :  [ 22 ]  Sex :  [ M ]   :  [ XXX ]
Height(cm) :  [ 145 ]  Weight(kg) :  [ 56.4 ]  Race :  [ Hispanic ]
Spirometry :  [ restrictive pattern ]
Treatment response :  [ Negative ]
Tissue volume :  [ normal ]
Tissue volume
[ normal RV ] 
Diffusing capacity :  [ normal capacity ]
FVC Liters : [ 2.22 ] FVC Liters :  [ 67 ] FVC Liters :  [ 3.35 ] 
FEV1 Liters :  [ 1.96 ] FEV1 Liters :  [ 66 ] FEV1 Liters :  [ 2.06 ] 
FEV1 / FVC % :  [ 58 ] FEV1 / FVC % :  [ 62 ]
DLCO mL/mmHg/min :  [ 21.5 ] DLCO mL/mmHg/min :  [ 102 ]
DLCO Adj mL/mmHg/min :  [ 21.5 ] DLCO Adj mL/mmHg/min :  [ 102 ]
RV/TLC % :  [ 22 ]

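I want to extract the variable names and their corresponding values into CSV format. Fortunately, as you may have noticed, all the txt files share a similar format: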

variable : [ value ]

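My first question is how to write code that extracts data with the structure above.
My second question is that I don't know how to separate the pairs when a single line contains several "variable : [ value ]" groups. (They are not comma-separated!)

I have only managed to come up with the following code, but I'm going around in circles now. Any ideas?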

df = pd.read_csv(filename,sep='\n')
df = df[0].str.split(':',expand=True)

Thanks in advance

Solution

It looks like you need a regular expression. Let's try this.

First, load the sample data

text = \
"""Date :  [ 2010-01-01 XX:XX:XX ]  Age :  [ 22 ]  Sex :  [ M ]   :  [ XXX ]
Height(cm) :  [ 145 ]  Weight(kg) :  [ 56.4 ]  Race :  [ Hispanic ]
Spirometry :  [ restrictive pattern ]
Treatment response :  [ Negative ]
Tissue volume :  [ Normal ]
Tissue volume
[ Normal RV ] 
Diffusing capacity :  [ Normal capacity ]
FVC Liters : [ 2.22 ] FVC Liters :  [ 67 ] FVC Liters :  [ 3.35 ] 
FEV1 Liters :  [ 1.96 ] FEV1 Liters :  [ 66 ] FEV1 Liters :  [ 2.06 ] 
FEV1 / FVC % :  [ 58 ] FEV1 / FVC % :  [ 62 ]
DLCO mL/mmHg/min :  [ 21.5 ] DLCO mL/mmHg/min :  [ 102 ]
DLCO Adj mL/mmHg/min :  [ 21.5 ] DLCO Adj mL/mmHg/min :  [ 102 ]
RV/TLC % :  [ 22 ]
"""

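Next, use a regular expression to find all matching 'blah : [ blahblah ]' pairs and put them into a dictionary (strip removes the surrounding whitespace; this could be done in the regex itself, but I wanted to avoid overcomplicating things)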

import re
parsed = re.findall(r'(.*?):\s*?\[(.*?)\]', text)  # raw string avoids the invalid '\:' escape warning
res = {g[0].strip() : g[1].strip() for g in parsed}
res

Result:

{'Date': '2010-01-01 XX:XX:XX','Age': '22','Sex': 'M','': 'XXX','Height(cm)': '145','Weight(kg)': '56.4','Race': 'Hispanic','Spirometry': 'restrictive pattern','Treatment response': 'Negative','Tissue volume': 'Normal','Diffusing capacity': 'Normal capacity','FVC Liters': '3.35','FEV1 Liters': '2.06','FEV1 / FVC %': '62','DLCO mL/mmHg/min': '102','DLCO Adj mL/mmHg/min': '102','RV/TLC %': '22'}
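
Note that a dictionary keeps only the last value for a repeated key, which is why the result above shows 'FVC Liters': '3.35' and not 2.22 or 67. If every value is needed, one option is to number the repeated keys. This is a minimal sketch, not part of the original answer; the `_1`, `_2` suffix scheme and the one-line `sample` are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Hypothetical one-line sample in the same "variable : [ value ]" layout.
sample = "FVC Liters : [ 2.22 ] FVC Liters :  [ 67 ] FVC Liters :  [ 3.35 ]"

pairs = re.findall(r'(.*?):\s*?\[(.*?)\]', sample)

# Collect every value seen for a key instead of overwriting it.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key.strip()].append(value.strip())

# Flatten to one entry per occurrence: "FVC Liters_1", "FVC Liters_2", ...
# Keys that occur only once keep their original name.
res = {}
for key, values in grouped.items():
    if len(values) == 1:
        res[key] = values[0]
    else:
        for i, value in enumerate(values, start=1):
            res[f"{key}_{i}"] = value

print(res)
# {'FVC Liters_1': '2.22', 'FVC Liters_2': '67', 'FVC Liters_3': '3.35'}
```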

If needed, you can put this into a dataframe:

import pandas as pd
df = pd.DataFrame.from_records([res])  # one dict becomes one row
df

which gives

    Date                   Age  Sex           Height(cm)    Weight(kg)  Race      Spirometry           Treatment response    Tissue volume    Diffusing capacity      FVC Liters    FEV1 Liters    FEV1 / FVC %    DLCO mL/mmHg/min    DLCO Adj mL/mmHg/min    RV/TLC %
--  -------------------  -----  -----  ---  ------------  ------------  --------  -------------------  --------------------  ---------------  --------------------  ------------  -------------  --------------  ------------------  ----------------------  ----------
 0  2010-01-01 XX:XX:XX     22  M      XXX           145          56.4  Hispanic  restrictive pattern  Negative              Normal           Normal capacity               3.35           2.06              62                 102                     102          22
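
Since the question mentions thousands of files and a CSV as the goal, the same idea can be applied per file, with one row per file. The following is only a sketch under assumptions: the helper names `parse_text` and `folder_to_frame`, the `source_file` column, and the paths in the usage comment are hypothetical, and repeated keys still keep only their last value.

```python
import re
from pathlib import Path

import pandas as pd

PAIR = re.compile(r'(.*?):\s*?\[(.*?)\]')


def parse_text(text):
    """Return a dict of stripped key/value pairs (last value wins on repeats)."""
    return {k.strip(): v.strip() for k, v in PAIR.findall(text)}


def folder_to_frame(folder):
    """Parse every *.txt file in `folder` into one row of a DataFrame."""
    records = []
    for path in sorted(Path(folder).glob("*.txt")):
        # read_text() uses the platform default encoding here.
        record = parse_text(path.read_text())
        record["source_file"] = path.name  # remember where the row came from
        records.append(record)
    return pd.DataFrame.from_records(records)


# Usage (hypothetical paths):
# df = folder_to_frame("reports")
# df.to_csv("reports.csv", index=False)
```

Files that lack a given variable simply get a missing value in that column, because `from_records` takes the union of all keys.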

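Note that the sample you provided does not follow the pattern in the top line, at Sex : [ M ] : [ XXX ], but the code handles it by using the empty string '' as the key. I assume this is a copy-paste issue rather than a problem in the original data, but if you have many cases like this you may need to handle them more carefully.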
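
If such nameless values do turn up in the real files, one way to handle them more carefully is to give each one a placeholder name, so that several of them do not overwrite each other under ''. This is a sketch only; the `unnamed_N` naming is an assumption, not something from the original data.

```python
import re

# One line from the sample that contains a value with no variable name.
line = "Sex :  [ M ]   :  [ XXX ]"
pairs = re.findall(r'(.*?):\s*?\[(.*?)\]', line)

res = {}
unnamed = 0
for key, value in pairs:
    key, value = key.strip(), value.strip()
    if not key:
        # Assumed convention: number nameless values instead of using ''.
        unnamed += 1
        key = f"unnamed_{unnamed}"
    res[key] = value

print(res)
# {'Sex': 'M', 'unnamed_1': 'XXX'}
```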

For the sample data, to get the keys and values without leading and trailing whitespace, you can use 2 capture groups.

([^\s:][^:]*)\s+\:\s+\[\s*([^][]*)\s+]

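( Capture group 1
- [^\s:][^:]* Match any char except a whitespace char or :, followed by optional chars other than :
) Close group 1
\s+\:\s+ Match : with 1 or more whitespace chars on both sides
\[\s* Match [ and optional whitespace chars
( Capture group 2
- [^][]* Match 0 or more chars other than [ and ]
) Close group 2
\s+] Match 1 or more whitespace chars and ]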

Output

[('Date','2010-01-01 XX:XX:XX'),('Age','22'),('Sex','M'),('Height(cm)','145'),('Weight(kg)','56.4'),('Race','Hispanic'),('Spirometry','restrictive pattern'),('Treatment response','Negative'),('Tissue volume','Normal'),('Diffusing capacity','Normal capacity'),('FVC Liters','2.22'),('FVC Liters','67'),('FVC Liters','3.35'),('FEV1 Liters','1.96'),('FEV1 Liters','66'),('FEV1 Liters','2.06'),('FEV1 / FVC %','58'),('FEV1 / FVC %','62'),('DLCO mL/mmHg/min','21.5'),('DLCO mL/mmHg/min','102'),('DLCO Adj mL/mmHg/min','21.5'),('DLCO Adj mL/mmHg/min','102'),('RV/TLC %','22')]
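
The output above appears consistent with applying the pattern to each line separately. Because [^:]* can also match newlines, running findall over the whole multi-line text lets group 1 swallow a preceding line that has no colon (for example the stray [ XXX ] or the bare "Tissue volume" line). A minimal sketch of both behaviours, using a shortened sample chosen here for illustration:

```python
import re

pattern = re.compile(r'([^\s:][^:]*)\s+\:\s+\[\s*([^][]*)\s+]')

# Shortened sample that keeps the irregular lines from the question.
text = """Sex :  [ M ]   :  [ XXX ]
Height(cm) :  [ 145 ]
Tissue volume
[ Normal RV ] 
FVC Liters : [ 2.22 ] FVC Liters :  [ 67 ]
"""

# Line by line: a key can never span more than one line.
pairs = []
for line in text.splitlines():
    pairs.extend(pattern.findall(line))
print(pairs)

# Whole text at once: some keys pick up text from the previous lines.
whole = pattern.findall(text)
print(whole)
```

Unlike the dictionary approach, the list of tuples keeps every occurrence of a repeated variable such as FVC Liters.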
