Parsing strings from EDIFACT files into a data frame

Problem description

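I am looking for advice on best practices for parsing EDIFACT files in R. I have the file below, which has no file extension but can be read as text with read.delim(). My goal is to parse thousands of such files in a folder, extract information from certain segments, and write it into a data frame.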

This is the raw format, in which every segment is terminated by '.

UNB+UNOC:3+4388901324577+_GLN_supplier__+180101:0050+10870'UNH+10870+DELFOR:D:96A:UN'BGM+241+10870'DTM+137:20180101:102'NAD+BY+4399901361577::92++Customername+Address+City++ZIP+Country'NAD+SU+_GLN_supplier__::92++suppliername.+Address+City++ZIP+Country'UNS+D'NAD+XX'LIN+++55040203121:BP'PIA+1+Product_ID_1+Product_name_1'IMD+F++:::Product_details'QTY+113:3:PCE'SCC+4'DTM+2:20180115:102'QTY+113:1:PCE'SCC+4'DTM+2:20180122:102'QTY+113:4:PCE'SCC+4'DTM+2:20180129:102'QTY+113:3:PCE'SCC+4'DTM+2:20180205:102'LIN+++55040203121:BP'PIA+1+Product_ID_2+Product_name_2'IMD+F++:::Product_details'QTY+113:9:PCE'SCC+4'DTM+2:20180115:102'QTY+113:5:PCE'SCC+4'DTM+2:20180122:102'QTY+113:5:PCE'SCC+4'DTM+2:20180129:102'QTY+113:4:PCE'SCC+4'DTM+2:20180205:102'LIN+++55040203121:BP'PIA+1+Product_ID_3+Product_name_3'IMD+F++:::Product_details'QTY+113:4:PCE'SCC+4'DTM+2:20180115:102'QTY+113:5:PCE'SCC+4'DTM+2:20180122:102'QTY+113:10:PCE'SCC+4'DTM+2:20180129:102'QTY+113:4:PCE'SCC+4'DTM+2:20180205:102'UNS+S'UNT+549+10870'UNZ+1+10870'

For readability, here is the same message split into one segment per line.

UNB+UNOC:3+4388901324577+_GLN_supplier__+180101:0050+10870
UNH+10870+DELFOR:D:96A:UN
BGM+241+10870
DTM+137:20180101:102
NAD+BY+4399901361577::92++Customername+Address+City++ZIP+Country
NAD+SU+_GLN_supplier__::92++suppliername.+Address+City++ZIP+Country
UNS+D
NAD+XX
LIN+++55040203121:BP
PIA+1+Product_ID_1+Product_name_1
IMD+F++:::Product_details
QTY+113:3:PCE
SCC+4
DTM+2:20180115:102
QTY+113:1:PCE
SCC+4
DTM+2:20180122:102
QTY+113:4:PCE
SCC+4
DTM+2:20180129:102
QTY+113:3:PCE
SCC+4
DTM+2:20180205:102
LIN+++55040203121:BP
PIA+1+Product_ID_2+Product_name_2
IMD+F++:::Product_details
QTY+113:9:PCE
SCC+4
DTM+2:20180115:102
QTY+113:5:PCE
SCC+4
DTM+2:20180122:102
QTY+113:5:PCE
SCC+4
DTM+2:20180129:102
QTY+113:4:PCE
SCC+4
DTM+2:20180205:102
LIN+++55040203121:BP
PIA+1+Product_ID_3+Product_name_3
IMD+F++:::Product_details
QTY+113:4:PCE
SCC+4
DTM+2:20180115:102
QTY+113:5:PCE
SCC+4
DTM+2:20180122:102
QTY+113:10:PCE
SCC+4
DTM+2:20180129:102
QTY+113:4:PCE
SCC+4
DTM+2:20180205:102
UNS+S
UNT+549+10870
UNZ+1+10870
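
A split listing like the one above can be produced in base R with strsplit(). The following is a minimal sketch, assuming the message uses the default ' terminator and contains no release character (?) that escapes a literal '. The sample string is shortened from the message above. For a real file, the text could be read with paste(readLines(path, warn = FALSE), collapse = ""), where path is a placeholder.

```r
# Short sample in the same raw layout as the message above (segments end with ').
raw <- "UNH+10870+DELFOR:D:96A:UN'BGM+241+10870'DTM+137:20180101:102'"

# Split on the segment terminator; fixed = TRUE treats ' as a literal character.
segments <- strsplit(raw, "'", fixed = TRUE)[[1]]

# Drop stray whitespace and empty pieces (e.g. from line breaks in the file).
segments <- trimws(segments)
segments <- segments[nzchar(segments)]

# The segment tag is everything before the first +.
tags <- sub("\\+.*$", "", segments)

print(segments)
print(tags)
```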

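I would build a file_list with list.files() and iterate over its items, parsing each file in the following order:

  1. Open the file
  2. Split it on the delimiter
  3. Find the segment UNH+10870 and return the value between + and : (here DELFOR)
  4. Find the segment DTM+137 and return the value between : and : (here 20180101)
  5. Find the PIA+1 segments. I am not sure how to do this, because every QTY+113 segment that follows a PIA+1 needs its own row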
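
The steps above, including the one-row-per-QTY+113 requirement, can be sketched without explicit loops. This is one possible approach, not a full EDIFACT parser. It assumes one message per file, no release character (?), and that every QTY+113 segment is followed by exactly one DTM+2 segment in the same order. The function name parse_delfor and the column names are illustrative. Each quantity is assigned to the most recent PIA+1 by taking a running count of PIA+1 segments with cumsum().

```r
parse_delfor <- function(raw) {
  seg <- trimws(strsplit(raw, "'", fixed = TRUE)[[1]])
  seg <- seg[nzchar(seg)]

  # n-th data element (split on +) and n-th component (split on :) of each string.
  element <- function(x, n) {
    vapply(strsplit(x, "+", fixed = TRUE), function(e) e[n], character(1))
  }
  component <- function(x, n) {
    vapply(strsplit(x, ":", fixed = TRUE), function(e) e[n], character(1))
  }

  is_pia  <- startsWith(seg, "PIA+1+")
  is_qty  <- startsWith(seg, "QTY+113:")
  is_dtm2 <- startsWith(seg, "DTM+2:")
  # Assumption: quantities and delivery dates come in matching pairs.
  stopifnot(sum(is_qty) == sum(is_dtm2))

  # Header values: first match only, since one message per file is assumed.
  message_type  <- component(element(seg[startsWith(seg, "UNH+")], 3), 1)[1]
  document_date <- component(element(seg[startsWith(seg, "DTM+137:")], 2), 2)[1]

  # Running count of PIA+1 segments: each QTY belongs to the latest product.
  group <- cumsum(is_pia)[is_qty]
  n <- sum(is_qty)

  data.frame(
    message_type  = rep(message_type, length.out = n),
    document_date = rep(document_date, length.out = n),
    product_id    = element(seg[is_pia], 3)[group],
    product_name  = element(seg[is_pia], 4)[group],
    quantity      = as.integer(component(element(seg[is_qty], 2), 2)),
    delivery_date = component(element(seg[is_dtm2], 2), 2),
    stringsAsFactors = FALSE
  )
}

# Shortened version of the message above: two products, three schedule lines.
raw <- paste0(
  "UNH+10870+DELFOR:D:96A:UN'BGM+241+10870'DTM+137:20180101:102'",
  "LIN+++55040203121:BP'PIA+1+Product_ID_1+Product_name_1'",
  "QTY+113:3:PCE'SCC+4'DTM+2:20180115:102'",
  "QTY+113:1:PCE'SCC+4'DTM+2:20180122:102'",
  "LIN+++55040203121:BP'PIA+1+Product_ID_2+Product_name_2'",
  "QTY+113:9:PCE'SCC+4'DTM+2:20180115:102'"
)
result <- parse_delfor(raw)
print(result)
```

The date columns are kept as character strings. They could be converted afterwards with as.Date(x, format = "%Y%m%d") if real dates are needed.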

Overall, this approach is very cumbersome: it needs many loops over every file and may cause performance problems.
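
One way to limit the looping cost is to parse each file into its own small data frame and bind them all once at the end, rather than growing a single data frame row by row. The sketch below assumes that pattern. parse_folder, toy_parser, the source_file column, and the demo file names are all hypothetical. The toy parser only extracts QTY+113 quantities so the example stays short, and any function that maps a raw string to a data frame could be passed instead.

```r
# Read every file in `dir`, parse each one, and stack the results once.
parse_folder <- function(dir, parse_one) {
  paths <- list.files(dir, full.names = TRUE)
  paths <- paths[!dir.exists(paths)]  # skip subdirectories
  parts <- lapply(paths, function(path) {
    raw <- paste(readLines(path, warn = FALSE), collapse = "")
    out <- parse_one(raw)
    # Record where each row came from; rep() also handles zero-row results.
    out$source_file <- rep(basename(path), nrow(out))
    out
  })
  do.call(rbind, parts)
}

# Toy parser for the demo: one row per QTY+113 segment.
toy_parser <- function(raw) {
  seg <- strsplit(raw, "'", fixed = TRUE)[[1]]
  qty <- seg[startsWith(seg, "QTY+113:")]
  data.frame(
    quantity = as.integer(sub("^QTY\\+113:([0-9]+):.*$", "\\1", qty)),
    stringsAsFactors = FALSE
  )
}

# Demo with two small files (no extension) in a temporary folder.
demo_dir <- file.path(tempdir(), "edifact_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("UNH+1+DELFOR:D:96A:UN'QTY+113:3:PCE'QTY+113:1:PCE'",
           file.path(demo_dir, "msg_a"))
writeLines("UNH+2+DELFOR:D:96A:UN'QTY+113:9:PCE'",
           file.path(demo_dir, "msg_b"))

all_rows <- parse_folder(demo_dir, toy_parser)
print(all_rows)
```

For very large folders, data.table::rbindlist() or dplyr::bind_rows() may bind the pieces faster than do.call(rbind, ...), though that is worth measuring on the actual data.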

The desired output is shown below. Each column header names the segment its values come from.

+-----------+----------+--------------+----------------+---------+----------+
| UNH+10870 | DTM+137  |   PIA+1_ID   |   PIA+1_NAME   | QTY+113 |  DTM+2   |
+-----------+----------+--------------+----------------+---------+----------+
| DELFOR    | 20180101 | Product_ID_1 | Product_name_1 |       3 | 20180115 |
| DELFOR    | 20180101 | Product_ID_1 | Product_name_1 |       1 | 20180122 |
| DELFOR    | 20180101 | Product_ID_1 | Product_name_1 |       4 | 20180129 |
| DELFOR    | 20180101 | Product_ID_1 | Product_name_1 |       3 | 20180205 |
| DELFOR    | 20180101 | Product_ID_2 | Product_name_2 |       9 | 20180115 |
| DELFOR    | 20180101 | Product_ID_2 | Product_name_2 |       5 | 20180122 |
| DELFOR    | 20180101 | Product_ID_2 | Product_name_2 |       5 | 20180129 |
| DELFOR    | 20180101 | Product_ID_2 | Product_name_2 |       4 | 20180205 |
| DELFOR    | 20180101 | Product_ID_3 | Product_name_3 |       4 | 20180115 |
| DELFOR    | 20180101 | Product_ID_3 | Product_name_3 |       5 | 20180122 |
| DELFOR    | 20180101 | Product_ID_3 | Product_name_3 |      10 | 20180129 |
| DELFOR    | 20180101 | Product_ID_3 | Product_name_3 |       4 | 20180205 |
+-----------+----------+--------------+----------------+---------+----------+

I hope I have explained the task well enough, and I am grateful for any suggestion that helps me accomplish it.
