CSV manipulation: unique entries

Problem description

I have several CSV files in the following format (without headers):

file1.csv:

firstname,lastname
andrew,harling
brad,dominic
pete,petey

file2.csv:

firstname,lastname,blood_group
andy,robbins,O+
brad,dominic,AB-
pete,petey,B+

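Goal: merge the two files, subject to a few constraints:

Make sure every unique entry appears in the output CSV (see the sample below).
If an entry is found in both files (matched on (firstname, lastname)), take only the entry from file2.csv (the one with the blood group).

Sample output.csv: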

firstname,lastname,blood_group
andrew,harling
andy,robbins,O+
brad,dominic,AB-
pete,petey,B+

I tried the following approach:

Copy file2.csv to file2_2col.csv (first two columns only).
Remove from file1 the entries that are also found in file2_2col, using the following command:

comm -23 file1.csv file2_2col.csv > file1_without_duplicates.csv
Merge file1_without_duplicates.csv and file2.csv with the following command:

cat *.csv | sort -u > unique.csv

But this still contains duplicates:

Incorrect output.csv:

firstname,blood_group
brad,petey
andrew,O+
brad,dominic,AB-
pete,B+

Any suggestions?

Solution

A short awk program:

awk '
  BEGIN {FS = OFS = ","} 
  {name = $1 OFS $2}
  NR == FNR {f1[name] = 1; next}
  name in f1 {delete f1[name]}
  {print}
  END { for (name in f1) print name }
' file{1,2}.csv

firstname,lastname,blood_group
andy,robbins,O+
brad,dominic,AB-
pete,petey,B+
andrew,harling
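
To try the program above without touching real data, here is a minimal sketch that recreates the two sample files from the question in a scratch directory and runs the same logic. The scratch directory and the function name `merge_names` are assumptions made for illustration; they are not part of the original answer.

```shell
#!/bin/sh
# Recreate the sample inputs from the question in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/file1.csv" <<'EOF'
firstname,lastname
andrew,harling
brad,dominic
pete,petey
EOF
cat > "$workdir/file2.csv" <<'EOF'
firstname,lastname,blood_group
andy,robbins,O+
brad,dominic,AB-
pete,petey,B+
EOF

# merge_names FILE1 FILE2  (hypothetical helper name)
# Prints every row of FILE2, then the rows of FILE1 whose
# (firstname,lastname) key never showed up in FILE2.
merge_names() {
  awk '
    BEGIN { FS = OFS = "," }
    { name = $1 OFS $2 }                 # key: first two columns
    NR == FNR { f1[name] = 1; next }     # first file: remember each key
    name in f1 { delete f1[name] }       # second file: key matched, drop it
    { print }                            # second file: always print the row
    END { for (name in f1) print name }  # leftovers exist only in FILE1
  ' "$1" "$2"
}

merge_names "$workdir/file1.csv" "$workdir/file2.csv"
```

The argument order matters: the two-column file has to come first, because `NR == FNR` is only true while awk is reading the first file.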

Updated solution:

$ cat file{1,2}.csv
firstname,lastname
andrew,harling
brad,dominic
pete,petey
firstname,lastname,blood_group
andy,robbins,O+
brad,dominic,AB-
pete,petey,B+

$ gawk -F "," '{a[$1 "," $2] = $3} END{for(i in a){(a[i]=="") ? x=i : x=i "," a[i]; print x}}' file{1,2}.csv
andy,robbins,O+
firstname,lastname,blood_group
pete,petey,B+
andrew,harling
brad,dominic,AB-
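
Because `for (i in a)` visits array keys in an unspecified order, the rows from the one-liner can come out in any order, header included. If a stable order is wanted, one possible variation is sketched below. It assumes both files start with a header row (as in the listings above), takes the header from the second file, and sorts the remaining rows; the function name `merge_sorted` and the scratch directory are hypothetical.

```shell
#!/bin/sh
# Recreate the sample inputs from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'firstname,lastname' 'andrew,harling' \
  'brad,dominic' 'pete,petey' > "$tmp/file1.csv"
printf '%s\n' 'firstname,lastname,blood_group' 'andy,robbins,O+' \
  'brad,dominic,AB-' 'pete,petey,B+' > "$tmp/file2.csv"

# merge_sorted FILE1 FILE2  (hypothetical helper name)
# Assumes each file begins with a header row.
merge_sorted() {
  # The header comes from FILE2, which carries the extra column.
  head -n 1 "$2"
  awk -F, '
    FNR == 1 { next }                    # skip the header row of each file
    { row[$1 "," $2] = $0 }              # a later file overwrites an earlier one
    END { for (k in row) print row[k] }  # order is unspecified here...
  ' "$1" "$2" | LC_ALL=C sort            # ...so sort to make it stable
}

merge_sorted "$tmp/file1.csv" "$tmp/file2.csv"
```

As with the one-liner, the file with the blood group has to be passed last so that its rows win for names present in both files.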
