Using multiple wildcards in Snakemake to distinguish replicates from a TSV file

6kkfgxo0 · posted 2022-09-21

I have a working Snakemake workflow and now want to add the option of merging specific samples, whether they are technical or biological replicates. I was told that I should use multiple wildcards for this, together with an appropriately shaped data frame. However, I am quite unsure what the syntax should look like and how to distinguish the entries correctly.

My samples.tsv used to look like this:

Sample|FQ1|FQ2
-|-|-
BCalAnn1_1_1|Path/to/file|Path/to/file
BCalAnn1_1_2|Path/to/file|Path/to/file
BCalAnn2_1_1|Path/to/file|Path/to/file
BCalAnn2_1_2|Path/to/file|Path/to/file
BCalAnn2_2_1|Path/to/file|Path/to/file
BCalAnn2_3_1|Path/to/file|Path/to/file

Now it looks like this:

Sample|BIO_UNIT|TECH_UNIT|FQ1|FQ2
-|-|-|-|-
BCalAnn1|1|1|Path/to/file|Path/to/file
BCalAnn1|1|2|Path/to/file|Path/to/file
BCalAnn2|1|1|Path/to/file|Path/to/file
BCalAnn2|1|2|Path/to/file|Path/to/file
BCalAnn2|2|1|Path/to/file|Path/to/file
BCalAnn2|3|1|Path/to/file|Path/to/file

where BIO_UNIT is the library index for a sample and TECH_UNIT is the lane index for a library.
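A sheet shaped like this can be loaded with pandas so that every unit column survives as a plain string column, which makes the later wildcard lookups straightforward. A minimal sketch on toy data (the inline TSV string stands in for the real samples.tsv):

```python
import io
import pandas as pd

# Toy stand-in for samples.tsv with the new layout.
tsv = """sample\tunit_bio\tunit_tech\tfq1\tfq2
BCalAnn1\t1\t1\tPath/to/file\tPath/to/file
BCalAnn1\t1\t2\tPath/to/file\tPath/to/file
"""

# dtype=str keeps unit_bio/unit_tech as strings, matching Snakemake
# wildcards, which are always strings.
samples = pd.read_table(io.StringIO(tsv), dtype=str)
print(list(samples.columns))  # ['sample', 'unit_bio', 'unit_tech', 'fq1', 'fq2']
```

Note that no `index_col` is set here, so `sample`, `unit_bio` and `unit_tech` all remain ordinary columns.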

My Snakefile looks like this:

import pandas as pd
import os
import yaml

configfile: "config.yaml"

samples = pd.read_table(config["samples"], index_col="sample")

rule all:
    input:
        expand(config["arima_mapping"] + "{sample}_{unit_bio}_{unit_tech}_R1.bam", 
               sample=samples.sample, unit_bio=samples.unit_bio, unit_tech=samples.unit_tech)

rule r1_mapping:
    input:
        read1 = lambda wc: 
                samples[(samples.sample == wc.sample) & (samples.unit_bio == wc.unit_bio) & (samples.unit_tech == wc.unit_tech)].fq1.iloc[0],
        ref = config["PacBio_assembly"],
        linker = config["index"] + "pac_bio_assembly.fna.amb"
    output:
        config["arima_mapping"] + "unprocessed_bam/({sample},{unit_bio},{unit_tech})_R1.bam"
    params:
    conda:
        "../envs/arima_mapping.yaml"
    log:
        config["logs"] + "arima_mapping/mapping/({sample},{unit_bio},{unit_tech})_R1.log"
    threads:
        12
    shell:
        "bwa mem -t {threads} {input.ref} {input.read1} | samtools view --threads {threads} -h -b -o {output} 2> {log}"

r1_mapping is basically my first rule, and it does not need to distinguish replicates, so I want to run it once per row. I experimented with the syntax, but ended up with the following error:

MissingInputException in rule r1_mapping in line 20 of /home/path/to/assembly_downstream/workflow/rules/arima_mapping.smk:
Missing input files for rule r1_mapping:
    output: /home/path/to/assembly_downstream/intermed/arima_mapping/unprocessed_bam/('bCalAnn1', 1, 1)_R1.bam
    wildcards: sample='bCalAnn1', unit_bio= 1, unit_tech= 1
    affected files:
        /home/path/to/assembly_downstream/data/arima_HiC/('bCalAnn1', 1, 1)_R1.fastq.gz

Am I even reading the table correctly? I am quite confused now; can anyone give me a hint on where to start? The error above was fixed by changing

samples = pd.read_table(config["samples"], index_col=["sample","unit_bio","unit_tech"])

to

samples = pd.read_table(config["samples"], index_col="sample")

**Edit:** new error

Missing input files for rule all:
    affected files:
        /home/path/to/assembly_downstream/intermed/arima_mapping/<bound method NDFrame.sample of           unit_bio  ...                                                fq2
sample              ...                                                   
bCalAnn1         1  ...  /home/path/to/assembly_downstream/data/ari...
bCalAnn1         1  ...  /home/path/to/assembly_downstream/data/ari...
bCalAnn2         1  ...  /home/path/to/assembly_downstream/data/ari...
bCalAnn2         1  ...  /home/path/to/assembly_downstream/data/ari...
bCalAnn2         2  ...  /home/path/to/assembly_downstream/data/ari...
bCalAnn2         3  ...  /home/path/to/assembly_downstream/data/ari...

[6 rows x 4 columns]>_3_2_R1.bam
        /home/path/to/assembly_downstream/intermed/arima_mapping/<bound method NDFrame.sample of           unit_bio  ...                                                fq2
sample              ...                                                   
bCalAnn1         1  ...  /home/path/to/assembly_downstream/data/ari...
bCalAnn1         1  ...  /home/path/to/assembly_downstream/data/ari...
bCalAnn2         1  ...  /home/path/to/assembly_downstream/data/ari...
bCalAnn2         1  ...  /home/path/to/assembly_downstream/data/ari...
bCalAnn2         2  ...  /home/path/to/assembly_downstream/data/ari...
bCalAnn2         3  ...  /home/path/to/assembly_downstream/data/ari...

[6 rows x 4 columns]>_3_1_R1.bam
        /home/path/to/assembly_downstream/intermed/arima_mapping/<bound method NDFrame.sample of           unit_bio  ...

This appears six times in total, so it looks like every row I have is now reported six times. I suppose this has to do with how expand() works? I will keep investigating.
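The `<bound method NDFrame.sample of ...>` fragment in the paths is a pandas gotcha: once `sample` is consumed as the index column, the attribute access `samples.sample` no longer returns a column but the built-in `DataFrame.sample` method, and `expand()` happily interpolates its repr into the path. A small sketch on toy data:

```python
import io
import pandas as pd

tsv = "sample\tunit_bio\nA\t1\nB\t2\n"

# With index_col="sample" the column is moved into the index, so
# attribute access falls back to the DataFrame.sample *method*:
df = pd.read_table(io.StringIO(tsv), index_col="sample", dtype=str)
print(callable(df.sample))     # True -> repr is "<bound method NDFrame.sample ...>"
print("sample" in df.columns)  # False; the values now live in df.index

# Without index_col, bracket access returns the column as expected:
df2 = pd.read_table(io.StringIO(tsv), dtype=str)
print(df2["sample"].tolist())  # ['A', 'B']
```

This is why the answer below reads the sheet without `index_col` and always uses `samples['sample']` rather than `samples.sample`.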

tyu7yeag1#

Here's what I would do, slightly simplified to keep things short - not tested, so there may be mistakes!

import pandas as pd
import os
import re
import yaml

samples = pd.read_table('samples.tsv', dtype=str)

wildcard_constraints:
    # Constraints maybe not needed but in my opinion better to set them
    sample='|'.join([re.escape(x) for x in samples['sample']]),
    unit_bio='|'.join([re.escape(x) for x in samples.unit_bio]),
    unit_tech='|'.join([re.escape(x) for x in samples.unit_tech]),

rule all:
    input:
        # Assume for now you just want to produce these bam files,
        # one per row of the sample sheet:
        expand("scaffolds/{sample}_{unit_bio}_{unit_tech}_R1.bam", zip,
            sample=samples['sample'], unit_bio=samples.unit_bio, unit_tech=samples.unit_tech),

rule r1_mapping:
    input:
        read1 = lambda wc: 
            samples[(samples['sample'] == wc.sample) & (samples.unit_bio == wc.unit_bio) & (samples.unit_tech == wc.unit_tech)].fq1.iloc[0],
        # Other input/params that don't need wildcards...
    output:
        "scaffolds/{sample}_{unit_bio}_{unit_tech}_R1.bam",

The main point is to use an input function (here a lambda) to query the sample sheet and get the FASTQ file corresponding to a given SAMPLE, UNIT_BIO and UNIT_TECH.
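Both halves of the pattern can be sketched outside Snakemake: the lambda is just a three-column boolean filter, and `expand(..., zip, ...)` pairs the columns row-wise instead of taking their cross product (which is what produced six copies of everything above). A toy sketch, where `read1_for` is a hypothetical helper mirroring the lambda:

```python
import io
import pandas as pd

tsv = """sample\tunit_bio\tunit_tech\tfq1
A\t1\t1\ta_1_1.fastq.gz
A\t1\t2\ta_1_2.fastq.gz
B\t2\t1\tb_2_1.fastq.gz
"""
samples = pd.read_table(io.StringIO(tsv), dtype=str)

def read1_for(sample, unit_bio, unit_tech):
    """Return the fq1 path of the one row matching all three wildcards,
    exactly like the lambda in the rule above."""
    hit = samples[(samples["sample"] == sample)
                  & (samples.unit_bio == unit_bio)
                  & (samples.unit_tech == unit_tech)]
    return hit.fq1.iloc[0]

print(read1_for("A", "1", "2"))  # a_1_2.fastq.gz

# What expand(..., zip, ...) does: one target per sheet row,
# not the cross product of all column values.
targets = [f"{s}_{b}_{t}_R1.bam"
           for s, b, t in zip(samples["sample"], samples.unit_bio, samples.unit_tech)]
print(targets)  # ['A_1_1_R1.bam', 'A_1_2_R1.bam', 'B_2_1_R1.bam']
```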

If this doesn't work, or you can't make sense of it, please post a fully reproducible example that we can easily run ourselves.
