如何将基因组SNP转换为二进制格式？ script.awk 运行脚本script.awk 运行单班轮awk脚本

问题描述

我正在努力找出如何转换（SNP）文件，如下所示：

pos,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10,sample11,sample12,sample13,sample14
79107,C,G,G
79115,A,T,C
79116,A
79124,C
79128,G

为这样的二进制格式：

pos,1,1
79115,0
79116,0
79124,0
79128,0

有人知道R或命令行工具awk代码中可用于此目的的软件包吗？

以下解决方案的一个问题是全部都这样做：

应为：

1097023,1
1097027,0 
4363270,0

因此，我认为解决方案应该包括大多数样本（即> 7个样本相同，则为0，与多数不匹配的样本为1。

解决方法

这是一个简单的awk（标准Linux awk/gawk）脚本：

script.awk

BEGIN {FS = ","} # set field seperator to ","
NR>1{     # every line except the first line
  zeroIndc=$2; # identify the zero indicator and save the variable
  for (i = 2; i <= NF; i++) { # for each DNA letter
    if ($i == zeroIndc) { # if DNA letter match zero indicator
      $i = 0; # set DNA letter to 0
    } else { # if DNA letter not match zero indicator
      $i = 1; # set DNA letter to 1
    }
  }
  print; # print new line at end 
}

运行脚本`script.awk`

awk -f scirpt.awk input.txt

运行单班轮awk脚本

awk 'BEGIN{FS=","}NR>1{z=$2;for(i=2;i<=NF;i++)$i=($i==z)?0:1;print;}' input.txt

使用基数R，您可以将每个sample列与第一个示例列进行比较，并在值不匹配的地方返回1。

cols <- grep('sample',names(df))
df[cols] <- +(df$sample1 != df[cols])
df

#    pos sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9
#1 79107       0       0       0       0       0       0       0       0       0
#2 79115       0       0       0       0       0       0       0       1       0
#3 79116       0       0       0       0       0       0       0       0       0
#4 79124       0       0       1       1       0       0       0       0       0
#5 79128       0       0       0       0       1       0       0       0       0

#  sample10 sample11 sample12 sample13 sample14
#1        0        0        0        1        1
#2        0        1        0        0        0
#3        0        0        0        0        0
#4        0        0        0        0        0
#5        1        0        0        0        0

尽管上面的方法在大型数据集上效率更高，但这是dplyr库的替代方法。

library(dplyr)
df <- df %>% mutate(across(contains('sample'),~+(sample1 != .)))

数据

df <- structure(list(pos = c(79107L,79115L,79116L,79124L,79128L
),sample1 = c("C","C","A","G"),sample2 = c("C",sample3 = c("C","T",sample4 = c("C",sample5 = c("C","A"),sample6 = c("C",sample7 = c("C",sample8 = c("C",sample9 = c("C",sample10 = c("C","C"),sample11 = c("C",sample12 = c("C",sample13 = c("G",sample14 = c("G","G")),class = "data.frame",row.names = c(NA,-5L))

$ awk -F,'NR>1{gsub($2,0); gsub(/[ACTG]/,1)} 1' file
pos,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10,sample11,sample12,sample13,sample14
79107,1,1
79115,0
79116,0
79124,0
79128,0

假设您想将“引用”等位基因编码为0，其中“引用”是最常见的，这是使用Python 3（其pandas数据处理库（使用pip3 install pandas和Counter specialized dictionary安装。

从包含示例数据的文件snps_letters.csv开始：

$ cat snps_letters.csv
pos,C,G,G
79115,A,T,C
79116,A
79124,C
79128,G

我们将打开它，对其进行转换并将其保存在脚本binarize.py中：

#!/usr/bin/env python3

from collections import Counter
import pandas as pd

# Load the comma-separated data in a pandas DataFrame object:
snps = pd.read_csv("snps_letters.csv",index_col="pos")

def binarize(column):
    # Extracting the most common element as ref
    # (using list and tuple unpacking,# because the most_common method returns
    # a list of (element,count) pairs):
    [(ref,_)] = Counter(column).most_common(1)
    # We can now define an auxiliary function
    # to recode individual elements in the column:
    def to_bin(elem):
        # int(True) -> 1,int(False) -> 0
        return 1 - int(elem == ref)
    # Transform the column by applying the recoding function to its elements
    # (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html#pandas.Series.apply):
    return column.apply(to_bin)

# Now apply our column-transforming function to the columns (axis=0)
# (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html):
binarized = snps.apply(binarize,axis=0)
# Write the result to a file:
binarized.to_csv("snps_binary.csv")

在shell中执行脚本：

$ ./binarize.py

检查保存的结果：

$ cat snps_binary.csv 
pos,0
79115,1
79116,1
79124,1
79128,0

要拥有更可重用的脚本，您可以将输入和输出文件路径作为参数传递给命令行，它们可以在python代码中以sys.argv[1]和sys.argv[2]（首先为您提供了import sys）。

一个版本，它在命令行上获取输入和输出文件，并将参考核苷酸编码为1而不是0：

#!/usr/bin/env python3

import sys
from collections import Counter
import pandas as pd

# Load the comma-separated data in a pandas DataFrame object:
snps = pd.read_csv(sys.argv[1],int(False) -> 0
        return int(elem == ref)
    # Transform the column by applying the recoding function to its elements
    # (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html#pandas.Series.apply):
    return column.apply(to_bin)

# Now apply our column-transforming function to the columns (axis=0)
# (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html):
binarized = snps.apply(binarize,axis=0)
# Write the result to a file:
binarized.to_csv(sys.argv[2])

它的用法如下：

./binarize.py snps_letters.csv snps_binary.csv

使用data from Ronak's answer识别参考等位基因，在这种情况下，基于最频繁的等位基因，我们有5个SNP REF等位基因：

REF <- apply(df[,-1],function(i) names(which.max(table(i)))) 
# [1] "C" "C" "A" "C" "G"

现在，将每个样本的列值与 REF 进行比较，然后通过将逻辑TRUE / FALSE乘以1转换为整数，然后将其分配回原始的 dataframe ：

df[,-1] <- (df[,-1] != REF) * 1
df
#     pos sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 sample11 sample12 sample13 sample14
# 1 79107       0       0       0       0       0       0       0       0       0        0        0        0        1        1
# 2 79115       0       0       0       0       0       0       0       1       0        0        1        0        0        0
# 3 79116       0       0       0       0       0       0       0       0       0        0        0        0        0        0
# 4 79124       0       0       1       1       0       0       0       0       0        0        0        0        0        0
# 5 79128       0       0       0       0       1       0       0       0       0        1        0        0        0        0

awk bioinformatics python r r