Optimization problem: converting file 1 using a dictionary from file 2

Question

I am trying to convert file 1 (a GTF file, in case any bioinformaticians are passing by), which contains lines like these:

1   X   exon    1   300000  1000    -   .   gene_id "Z.633"; transcript_id "Z.633.mrna1"; exon_number "1"; 
1   X   transcript  1   300000  1000    -   .   gene_id "Z.633"; transcript_id "Z.633.mrna1"; 
1   X   exon    300005  300500  1000    -   .   gene_id "Y.6330"; transcript_id "Y.6330.mrna1"; exon_number "2";
1   X   exon    300500  310000  1000    +   .   gene_id "Y.6330"; transcript_id "Y.6330.mrna1"; exon_number "1"; 
1   X   transcript  30005   310000  1000    +   .   gene_id "Y.6330"; transcript_id "Y.6330.mrna1";

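into a file in which every "Z" ID is replaced by its "F" counterpart, and likewise for the other correspondences. All the correspondences are in file 2, which I use as a dictionary: column 1 is the key and column 2 is the value.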

Example file 2:

Z.633 F.633
Y.6330 U.6330

Example result:

1   X   exon    1   300000  1000    -   .   gene_id "F.633"; transcript_id "F.633.mrna1"; exon_number "1"; 
1   X   transcript  1   300000  1000    -   .   gene_id "F.633"; transcript_id "F.633.mrna1"; 
1   X   exon    300005  300500  1000    -   .   gene_id "U.6330"; transcript_id "U.6330.mrna1"; exon_number "2";
1   X   exon    300500  310000  1000    +   .   gene_id "U.6330"; transcript_id "U.6330.mrna1"; exon_number "1"; 
1   X   transcript  30005   310000  1000    +   .   gene_id "U.6330"; transcript_id "U.6330.mrna1";

file1 has about 200,000 lines and file2 has about 20,000.

To do this, I used an awk script:

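# First file on the command line (file2): load the dictionary.
NR == FNR {
  rep[ $1 ] = $2   # column 1 is the key, column 2 is the value
  next
}

# Second file (file1): try every dictionary key on every line.
{
  for (key in rep)
    gsub(key, rep[key])
  print
}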

Then:

awk -f dict.awk file2 file1 > newfile

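My problem is that the script actually runs for days... Is there any way to improve it? Is there a programming language better suited to this problem? (I tried Python, but the run time was even worse.)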

(The Python script I used:)

import re

def replaceWithDictionnary(dictFileName, filetoReplaceName, newFileName):
    # Build the lookup table from the dictionary file: column 1 -> column 2.
    with open(dictFileName, 'r') as d:
        dictFile = dict(line.split() for line in d if line.strip())

    with open(newFileName, 'w') as g, open(filetoReplaceName, 'r') as f:
        for line in f:
            # Slow part: every line is tested against every dictionary key.
            for d_key, d_value in dictFile.items():
                # Require a non-digit after the key so "Z.633" does not match "Z.6330".
                if re.search(re.escape(d_key) + r"\D", line):
                    line = line.replace(d_key, d_value)
                    break
            # Write every line, whether or not a key was found in it.
            g.write(line)

I can tell that it is not the dictionary part that takes so long, because I tested it with a smaller file1 and it ran quickly...

Answer

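Assuming there is only one gene_id to replace per record and that it can always be found in field 10, as in your sample file, you can look it up directly and call gsub once per record, instead of calling it 20K times per record (once for every dictionary entry).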

> cat tst.awk
NR==FNR {
  r[$1] = $2
  next
}

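{
  x = $10                 # e.g. "Z.633"; (quotes and semicolon included)
  gsub(/"|;/,"",x)        # strip them to get the bare gene ID
  if (x in r)             # leave records with no dictionary entry untouched
    gsub(x,r[x])          # x is used as a regex, so "." matches any character
  print
}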

awk -f tst.awk file2 file1
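
Since the question also asks about Python, here is a rough sketch of the same idea there: read the gene ID from field 10 and do a single dictionary lookup per line, rather than testing all 20K keys against each of the 200K lines. The function names (load_mapping, convert_line, convert_file) are made up for this example, and it relies on the same assumption as the awk script, namely one gene ID per record, found in field 10.

```python
def load_mapping(lines):
    """Build {old_id: new_id} from 'old new' pairs, one pair per line."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            mapping[parts[0]] = parts[1]
    return mapping


def convert_line(line, mapping):
    """Rename the gene ID found in field 10, if the mapping has an entry for it."""
    fields = line.split()
    if len(fields) < 10:
        return line                      # not a full record: leave it alone
    gene_id = fields[9].strip('";')      # '"Z.633";' -> 'Z.633'
    new_id = mapping.get(gene_id)
    if new_id is None:
        return line                      # no dictionary entry: keep the line as is
    # Anchor on the opening quote so only quoted attribute values that
    # begin with the old ID (gene_id and transcript_id) are rewritten.
    return line.replace('"' + gene_id, '"' + new_id)


def convert_file(dict_path, gtf_path, out_path):
    """Stream the GTF file once, rewriting each line with one lookup."""
    with open(dict_path) as d:
        mapping = load_mapping(d)
    with open(gtf_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(convert_line(line, mapping))
```

With the files from the question, this would be called as convert_file("file2", "file1", "newfile"). Using plain str.replace rather than a regex also sidesteps the issue of "." in an ID being treated as a wildcard.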