vlookup with awk and bash: slow when processing large amounts of data

Problem description

I have a master data set of IDs and names with almost 13,000 entries. The file name is master.txt.

id   name
1: name1
2: test
3: fin
4: miar

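I also have another list of data with an id and some property. Each ID can appear multiple times. This list has 74,000 entries and is stored in person_entries.txt. Sample data: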

id  property
1: somevalue001
2: somevalue002
2: somevalue003
1: somevalue004

Now I have to do a vlookup-like operation that outputs the name in place of the id.

For example:

name    property
name1: somevalue001
test: somevalue002
test: somevalue003
name1: somevalue004

I am trying the following script, vlookup.sh:

#!/bin/bash
while IFS='' read -r line || [[ -n "$line" ]]; do
    IFS=$'\n';
    myarr=(`echo $line | awk -f break_data.awk`)
    #This will break each data into two lines (id and property which then can be stored as array)

    awk -v var="${myarr[0]}:" -v var2="${myarr[1]}" -f find_data.awk master.txt
    # Here we pass the id and property to awk as variables. It searches for the id in master.txt and prints the name and property.
done < "person_entries.txt"

break_data.awk

# INPUT
# 1: name1

# OUTPUT
# 1
# name1

BEGIN{
    FS=": "
}
{
    for(i=1;i<NF+1;i++)
    {
        print $i
    }
}
END{
}

find_data.awk

#THIS WILL SEARCH THE ID: IN EACH LINE OF break_data2.awk
#WHEN IT FINDS IT, IT WILL PRINT THE NAME AND PROPERTY

BEGIN{
    FS=": "
    #print(var)
}
{
    s=index($0,var)
    if(s != 0){
        print $2": "var2
    }
    else{
        next
    }
}
END{
}

When I run sh vlookup.sh

it takes a very long time.

Excel can do this faster.

The answer's code, written out in long form for my own understanding:

$ awk '                # use awk
{ 
  if(NR==FNR) 
  {              # process first file
    a[$1]=$2           # store in array a: id is the key, name is the value
    next               # process next record without executing following code
  } else
  {                      # process second file
    print a[$1]":",$2  # output name (the value of) from array a and property
  }

}' master person 

Solution

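Something like this should do it. You may need to make some adjustments for the : (see the header line), and decide what to do when there is no match: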

$ awk  'NR==FNR{a[$1]=$2;next}{print a[$1]":",$2}' master person

Output:

name: property
name1: somevalue001
test: somevalue002
test: somevalue003
name1: somevalue004

Explanation:

$ awk '                # use awk
NR==FNR {              # process first file
    a[$1]=$2           # store in array a: id is the key, name is the value
    next               # process next record without executing following code
}
{                      # process second file
    print a[$1]":",$2  # output name (the value of) from array a and property
}' master person       # of the second file,colon in the middle
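
The two caveats mentioned above (the colon and unmatched ids) could be handled along the following lines. This is only a sketch: the function name vlookup and the placeholder string UNKNOWN are made up for illustration, and it assumes names and properties are single words, as in the sample data.

```shell
#!/usr/bin/env bash
# Sketch: awk lookup with the colon stripped from the key and an explicit
# fallback for ids that are missing from the master file.
# Assumptions (not from the original answer): the function name "vlookup"
# and the placeholder "UNKNOWN" are illustrative only.

vlookup() {   # usage: vlookup MASTER_FILE ENTRIES_FILE
  awk -v missing="UNKNOWN" '
    FNR == 1 && NR == FNR { head = $2; next }      # master header: keep "name"
    FNR == 1              { print head ": " $2; next }  # entries header
    NR == FNR {                    # first file: build the id -> name table
      id = $1; sub(/:$/, "", id)   # strip the trailing colon from the id
      name[id] = $2
      next
    }
    {                              # second file: look each id up
      id = $1; sub(/:$/, "", id)
      n = (id in name) ? name[id] : missing
      print n ": " $2
    }
  ' "$1" "$2"
}

# Small demo on throwaway files; id 9 has no entry in the master data.
tmp=$(mktemp -d)
printf '%s\n' 'id   name' '1: name1' '2: test' > "$tmp/master.txt"
printf '%s\n' 'id  property' '1: somevalue001' '2: somevalue002' \
              '9: somevalue009' > "$tmp/person_entries.txt"
vlookup "$tmp/master.txt" "$tmp/person_entries.txt"
```

Using `id in name` for the test avoids creating empty array entries for ids that are not in the master file, which a plain `name[id]` reference would do.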

Not as fast as awk, but faster than the bash code in the question.

#!/usr/bin/env bash

IFS= read -r master_head < master.txt
IFS= read -r person_head < person_entries.txt
printf '%s: %s\n' "${master_head##* }" "${person_head##* }"

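# Outer loop: one line of person_entries.txt at a time (fd 8).
while IFS= read -ru8 person; do
  # Inner loop: scan master.txt (fd 9) for a line with the same id.
  while IFS= read -ru9 master; do
    if [[ ${person%% *} == "${master%% *}" ]]; then
      printf '%s: %s\n' "${master##* }" "${person##* }"
    fi
  done 9< <(tail -n+2 master.txt)
done 8< <(tail -n+2 person_entries.txt)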
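
The nested loops above re-read master.txt once per entry. A variant that reads each file only once is sketched below, using a bash associative array as the lookup table. It is not part of the original answer: it assumes bash 4 or later, the function name vlookup_bash is hypothetical, and an id missing from the master file simply yields an empty name.

```shell
#!/usr/bin/env bash
# Sketch: the same lookup in pure bash with an associative array (bash 4+).
# Assumption: the function name "vlookup_bash" is illustrative only.

vlookup_bash() {   # usage: vlookup_bash MASTER_FILE ENTRIES_FILE
  local -A name_of                 # lookup table: id -> name
  local id rest master_head entries_head

  {
    read -r _ master_head          # header line: keep the second column title
    while read -r id rest || [[ -n $id ]]; do
      if [[ -n $id ]]; then        # skip blank lines
        name_of[$id]=$rest
      fi
    done
  } < "$1"

  {
    read -r _ entries_head
    printf '%s: %s\n' "$master_head" "$entries_head"
    while read -r id rest || [[ -n $id ]]; do
      if [[ -n $id ]]; then
        printf '%s: %s\n' "${name_of[$id]}" "$rest"
      fi
    done
  } < "$2"
}

# Small demo on throwaway files with the sample data from the question.
tmp=$(mktemp -d)
printf '%s\n' 'id   name' '1: name1' '2: test' '3: fin' '4: miar' \
  > "$tmp/master.txt"
printf '%s\n' 'id  property' '1: somevalue001' '2: somevalue002' \
              '2: somevalue003' '1: somevalue004' > "$tmp/person_entries.txt"
vlookup_bash "$tmp/master.txt" "$tmp/person_entries.txt"
```

Here the key keeps its trailing colon ("1:"), which is harmless because both files use the same form.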