如何将基因组SNP转换为二进制格式? script.awk 运行脚本script.awk 运行单班轮awk脚本

问题描述

我正在努力找出如何转换(SNP)文件,如下所示:

pos,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10,sample11,sample12,sample13,sample14
79107,C,G,G
79115,A,T,C
79116,A
79124,C
79128,G

为这样的二进制格式:

pos,1,1
79115,0
79116,0
79124,0
79128,0

有人知道R或命令行工具awk代码中可用于此目的的软件包吗?

以下解决方案的一个问题是全部都这样做:

1097023,A
1097027,C
4363243,A 
4363270,T 
4363275,G 
1097023,0
1097027,0
4363243,1 
4363270,0 
4363275,1 

应为:

1097023,1
1097027,0 
4363270,0

因此,我认为解决方案应该包括大多数样本(即> 7个样本相同,则为0,与多数不匹配的样本为1。

解决方法

这是一个简单的awk(标准Linux awk/gawk)脚本:

script.awk

BEGIN {FS = ","} # set field seperator to ","
NR>1{     # every line except the first line
  zeroIndc=$2; # identify the zero indicator and save the variable
  for (i = 2; i <= NF; i++) { # for each DNA letter
    if ($i == zeroIndc) { # if DNA letter match zero indicator
      $i = 0; # set DNA letter to 0
    } else { # if DNA letter not match zero indicator
      $i = 1; # set DNA letter to 1
    }
  }
  print; # print new line at end 
}

运行脚本script.awk

awk -f scirpt.awk input.txt

运行单班轮awk脚本

awk 'BEGIN{FS=","}NR>1{z=$2;for(i=2;i<=NF;i++)$i=($i==z)?0:1;print;}' input.txt
,

使用基数R,您可以将每个sample列与第一个示例列进行比较,并在值不匹配的地方返回1。

cols <- grep('sample',names(df))
df[cols] <- +(df$sample1 != df[cols])
df

#    pos sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9
#1 79107       0       0       0       0       0       0       0       0       0
#2 79115       0       0       0       0       0       0       0       1       0
#3 79116       0       0       0       0       0       0       0       0       0
#4 79124       0       0       1       1       0       0       0       0       0
#5 79128       0       0       0       0       1       0       0       0       0

#  sample10 sample11 sample12 sample13 sample14
#1        0        0        0        1        1
#2        0        1        0        0        0
#3        0        0        0        0        0
#4        0        0        0        0        0
#5        1        0        0        0        0

尽管上面的方法在大型数据集上效率更高,但这是dplyr库的替代方法。

library(dplyr)
df <- df %>% mutate(across(contains('sample'),~+(sample1 != .)))

数据

df <- structure(list(pos = c(79107L,79115L,79116L,79124L,79128L
),sample1 = c("C","C","A","G"),sample2 = c("C",sample3 = c("C","T",sample4 = c("C",sample5 = c("C","A"),sample6 = c("C",sample7 = c("C",sample8 = c("C",sample9 = c("C",sample10 = c("C","C"),sample11 = c("C",sample12 = c("C",sample13 = c("G",sample14 = c("G","G")),class = "data.frame",row.names = c(NA,-5L))
,
$ awk -F,'NR>1{gsub($2,0); gsub(/[ACTG]/,1)} 1' file
pos,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10,sample11,sample12,sample13,sample14
79107,1,1
79115,0
79116,0
79124,0
79128,0
,

假设您想将“引用”等位基因编码为0,其中“引用”是最常见的,这是使用Python 3(其pandas数据处理库(使用pip3 install pandasCounter specialized dictionary安装。

从包含示例数据的文件snps_letters.csv开始:

$ cat snps_letters.csv
pos,C,G,G
79115,A,T,C
79116,A
79124,C
79128,G

我们将打开它,对其进行转换并将其保存在脚本binarize.py中:

#!/usr/bin/env python3

from collections import Counter
import pandas as pd

# Load the comma-separated data in a pandas DataFrame object:
snps = pd.read_csv("snps_letters.csv",index_col="pos")

def binarize(column):
    # Extracting the most common element as ref
    # (using list and tuple unpacking,# because the most_common method returns
    # a list of (element,count) pairs):
    [(ref,_)] = Counter(column).most_common(1)
    # We can now define an auxiliary function
    # to recode individual elements in the column:
    def to_bin(elem):
        # int(True) -> 1,int(False) -> 0
        return 1 - int(elem == ref)
    # Transform the column by applying the recoding function to its elements
    # (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html#pandas.Series.apply):
    return column.apply(to_bin)

# Now apply our column-transforming function to the columns (axis=0)
# (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html):
binarized = snps.apply(binarize,axis=0)
# Write the result to a file:
binarized.to_csv("snps_binary.csv")

在shell中执行脚本:

$ ./binarize.py 

检查保存的结果:

$ cat snps_binary.csv 
pos,0
79115,1
79116,1
79124,1
79128,0

要拥有更可重用的脚本,您可以将输入和输出文件路径作为参数传递给命令行,它们可以在python代码中以sys.argv[1]sys.argv[2](首先为您提供了import sys)。


一个版本,它在命令行上获取输入和输出文件,并将参考核苷酸编码为1而不是0:

#!/usr/bin/env python3

import sys
from collections import Counter
import pandas as pd

# Load the comma-separated data in a pandas DataFrame object:
snps = pd.read_csv(sys.argv[1],int(False) -> 0
        return int(elem == ref)
    # Transform the column by applying the recoding function to its elements
    # (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html#pandas.Series.apply):
    return column.apply(to_bin)

# Now apply our column-transforming function to the columns (axis=0)
# (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html):
binarized = snps.apply(binarize,axis=0)
# Write the result to a file:
binarized.to_csv(sys.argv[2])

它的用法如下:

./binarize.py snps_letters.csv snps_binary.csv
,

使用data from Ronak's answer识别参考等位基因,在这种情况下,基于最频繁的等位基因,我们有5个SNP REF等位基因:

REF <- apply(df[,-1],function(i) names(which.max(table(i)))) 
# [1] "C" "C" "A" "C" "G"

现在,将每个样本的列值与 REF 进行比较,然后通过将逻辑TRUE / FALSE乘以1转换为整数,然后将其分配回原始的 dataframe

df[,-1] <- (df[,-1] != REF) * 1
df
#     pos sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 sample11 sample12 sample13 sample14
# 1 79107       0       0       0       0       0       0       0       0       0        0        0        0        1        1
# 2 79115       0       0       0       0       0       0       0       1       0        0        1        0        0        0
# 3 79116       0       0       0       0       0       0       0       0       0        0        0        0        0        0
# 4 79124       0       0       1       1       0       0       0       0       0        0        0        0        0        0
# 5 79128       0       0       0       0       1       0       0       0       0        1        0        0        0        0