在Snakemake中合并多个通配符

问题描述

├── DIR1
│   ├── smp1.fastq.gz
│   ├── smp1_fastqc/
│   ├── smp2.fastq.gz
│   └── smp2_fastqc/
└── DIR2
    ├── smp3.fastq.gz
    ├── smp3_fastqc/
    ├── smp4.fastq.gz
    └── smp4_fastqc/

我想按样本计数读取的次数,然后按目录合并所有计数。
我创建了一个字典,将示例1和2链接到目录1,并将示例3和4链接到目录2

Dirs,SAMPLES = glob_wildcards(INDIR+'/{dir}/{smp}.fastq.gz')

# Create samples missing
def filter_combinator(combinator,authlist):
    def filtered_combinator(*args,**kwargs):
        for wc_comb in combinator(*args,**kwargs):
            if frozenset(wc_comb) in authlist:
                yield wc_comb
    return filtered_combinator

# Authentification
combine_dir_samples = []

for dir in Dirs:
    samples,= glob_wildcards(INDIR+'/'+dir+'/{smp}.fastq.gz')
    for smp in samples:
        combine_dir_samples.append( { "dir" : dir,"smp" : smp} )
       
combine_dir_samples = { frozenset( x.items() ) for x in combine_dir_samples }
dir_samples = filter_combinator(product,combine_dir_samples)

然后,我创建一个规则以按样本对我的阅读次数进行计数

rule all:
    input:
        expand(INDIR+'/{dir}/{smp}_Nreads.txt',dir_samples,dir=Dirs,smp=SAMPLES)

rule countReads:
    input:
        INDIR+'/{dir}/{smp}_fastqc/fastqc_data.txt'
    output:
        INDIR+'/{dir}/{smp}_Nreads.txt'
    shell:
        "grep 'Total\ Sequences' {input} | awk '{{print {wildcards.dir},$3}}' > {output}"

---------------------------------------------------------------
# result ok

├── DIR1
│   ├── smp1_Nreads.txt
│   └── smp2_Nreads.txt
└── DIR2
    ├── smp3_Nreads.txt
    └── smp4_Nreads.txt

> cat smp1_Nreads.txt
DIR1 15082186

但是,我想添加一条规则来按目录串联我的smp_Nreads.txt文件

rule concatNreads:
    input:
        expand(INDIR+'/{dir}/{smp}_Nreads.txt',smp=SAMPLES)
    output:
        INDIR+'/{dir}/Nreads_{dir}.txt'
    shell:
        "cat {input} > {output}"
------------------------------------------------------------------
# result

├── DIR1
│   └── Nreads_DIR1.txt
└── DIR2
    └── Nreads_DIR2.txt

# but both files are identical
> cat Nreads_DIR1.txt
DIR1 15082186
DIR1 22326081
DIR2 11635831
DIR2 45924459

# I would like to have
> cat Nreads_DIR1.txt
DIR1 15082186
DIR1 22326081
> cat Nreads_DIR2.txt
DIR2 11635831
DIR2 45924459

我为concat规则尝试了不同的输入语法

expand(OUTFastq+'/{dir}/FastQC/{{smp}}_Nreads.txt',dir=Dirs)
lambda wildcards: expand(OUTFastq+'/{dir}/FastQC/{wildcards.smp}_Nreads.txt',smp=SAMPLES)
expand(OUTFastq+'/{dir}/FastQC/{wildcards.smp}_Nreads.txt',smp=SAMPLES)

我没有找到任何解决方案,就像它不在乎我的词典是否适用此规则。


编辑

我尝试使用字典而不是组合filter_combinator,并尝试使用函数作为规则输入以获取样本。

dir_to_samples = {"DIR1": ["smp1","smp2"],"DIR2": ["smp3","smp4"]}

def func(dir):
    return dir_to_samples[dir]

rule all:
    input:
        lambda wildcards: expand(OUTDIR+'/{dir}/FastQC/{smp}_fastqc.zip',dir=wildcards.dir,smp=func(wildcards.dir))

rule fastQC:
    input:
        lambda wildcards: expand(INDIR+'/{dir}/{smp}.fastq.gz',smp=func(wildcards.dir))
    output:
        OUTDIR+'/{dir}/FastQC/{smp}_fastqc.zip'
    shell:
        "fastqc {input} -o {OUTDIR}/{wildcards.dir}/FastQC/" 

> AttributeError: 'Wildcards' object has no attribute 'dir'

解决方法

首先,我认为您已经使解决方案过于复杂,从而使其对于Snakemake而言不那么惯用。结果,您遇到了实施规则的问题。无论如何,让我以您提出的形式回答问题。

两个Nreads_DIRx.txt文件完全相同也就不足为奇了,因为输入不依赖于输出中的任何通配符:

rule concatNreads:
    input:
        expand(INDIR+'/{dir}/{smp}_Nreads.txt',dir_samples,dir=DIRS,smp=SAMPLES)

此处,expand函数同时解析dirsmp变量,从而生成完全指定的文件名列表。您需要的东西实际上取决于您输出中的通配符:

rule concatNreads:
    input:
        lambda wildcards: ...

{dir}是使用输出中的通配符完全指定的,因此您不需要从DIRS变量中为其分配值:

rule concatNreads:
    input:
        lambda wildcards: expand(INDIR+'/{dir}/{smp}_Nreads.txt',dir=wildcards.dir,smp=func(wildcards.dir))

现在的问题是如何实现此func函数,该函数生成目录的样本列表。我花了一些时间来了解您使用combine_dir_samplesfilter_combinator的技巧,所以我留给您使用该代码来实现func功能。但是,您真正需要的是DIR中的地图->示例:

dir_to_samples = {"DIR1": ["smp1","smp2"],"DIR2": ["smp3","smp4"]}

def func(dir):
    return dir_to_samples[dir]

这个dir_to_samples可能更容易评估,但这是您修改后的解决方案:

for dir in DIRS:
    samples,= glob_wildcards(INDIR+'/'+dir+'/{smp}.fastq.gz')
    dir_to_samples.append({dir: samples})

相关问答

Selenium Web驱动程序和Java。元素在(x,y)点处不可单击。其...
Python-如何使用点“。” 访问字典成员?
Java 字符串是不可变的。到底是什么意思?
Java中的“ final”关键字如何工作?(我仍然可以修改对象。...
“loop:”在Java代码中。这是什么,为什么要编译?
java.lang.ClassNotFoundException:sun.jdbc.odbc.JdbcOdbc...