python: Restricting wildcard expansion to certain combinations in Snakemake

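In my workflow I want to run some rules based on different wildcards. I have a **{group}** wildcard, derived from directories, and a **{sample}** wildcard, derived from the files inside each of those directories.
For example: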

group1/
group1/sample1.fastq.gz
group1/sample2.fastq.gz

group2/
group2/sample3.fastq.gz
group2/sample4.fastq.gz

Here is how I currently collect these wildcards:

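import os
from glob import glob

# One entry per sub-directory of 'directory/'
GROUP = [ dir for dir in os.listdir('directory/')
         if os.path.isdir(os.path.join('directory/', dir)) ]

# One entry per read file found in any group directory
SAMPLE = [os.path.basename(fn).replace("_M_1.fastq.gz", "")
            for fn in glob(f"directory/*/*_1.fastq.gz")]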

Some rules in my Snakefile operate only on the {group} wildcard, for example:

rule group_task:
    input:
        "directory/{group}"
    output:
        "other_directory/{group}/{group}_contig_file.fasta"

However, other rules would be better optimized if I could process the jobs for the {sample} wildcard in parallel for a given {group}, for example:

rule sample_per_group_task:
    input:
        sample_reads = "directory/{group}/{sample}",
        group_contigs = "other_directory/{group}/{group}_contig_file.fasta"
    output:
        "other_directory/{group}/{group}{sample}.bam"

However, as expected, expanding these wildcards produces combinations that do not exist:

group1/sample3.fastq.gz
group1/sample4.fastq.gz

group2/sample1.fastq.gz
group2/sample2.fastq.gz

So my question is: **Is it possible to constrain these wildcards so that nonexistent combinations are not expanded?**
Thanks in advance for any help!

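Setting aside that you may have bigger issues, as Maarten pointed out in his comment, and addressing the title and main question of your post: in your example group1 pairs with samples 1 and 2 and group2 pairs with samples 3 and 4, so zip gets you close. Using zip with expand is covered in the documentation of the expand function. In that section of the current documentation, look for "by inserting a second positional argument this can be replaced by any combinatoric function, e.g. zip".

However, I could not readily find a clever combination that made it work, so I fell back to Python. Keep in mind that Snakefile code is a superset of Python, so you can use Python above your rules to get the granularity you want and build a list to pass as input. Another reply has a simple example that creates a Python list and then passes it to input in a rule.

Snakemake also offers its own mechanisms (such as input functions) that let you use wildcards to fetch lists of files more dynamically. But if you want to use pure Python to build the list of file names, as I just suggested, you can be more direct. I like the way Troy Comi summarizes it:
"Remember that snakemake is basically python with a little extra syntax. Any way you want to use python to automatically list the ... files you are requesting will work."

In your case, using Python's zip directly and letting Python build the required combinations could work like this: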

import os
import sys

'''
GROUP = [ dir for dir in os.listdir('directory/')
         if os.path.isdir(os.path.join('directory/', dir)) ]

SAMPLE = [os.path.basename(fn).replace("_M_1.fastq.gz", "")
            for fn in glob(f"directory/*/*_1.fastq.gz")]
'''

# hardcode those for demo
GROUP = ["group1/","group2/"]
SAMPLE = [["sample1.fastq.gz","sample2.fastq.gz"],["sample3.fastq.gz","sample4.fastq.gz"]]

# SAMPLE should be a list of lists matching the order of groups. Your code for making `SAMPLE` will need to be changed to use GROUP to iterate and collect the files inside. You may want to use `fnmatch` instead of `glob`. Post in the comments if you are struggling with that.

grouped_by_elements_in_combo_wanted = list(zip(GROUP,SAMPLE))
combos_wanted = []
for having_combos in grouped_by_elements_in_combo_wanted:
    directory = having_combos[0]
    files_list = having_combos[1]
    for file in files_list:
        combos_wanted.append(directory+file)

rule sample_per_group_task:
    input: combos_wanted

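This gives the following. The files do not exist in my demo environment, so the error lists exactly the inputs that were requested, and only the valid group/sample combinations appear: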

Building DAG of jobs...
MissingInputException in rule sample_per_group_task  in line 29 of /home/jovyan/Snakefile:
Missing input files for rule sample_per_group_task:
    affected files:
        group2/sample3.fastq.gz
        group1/sample2.fastq.gz
        group1/sample1.fastq.gz
        group2/sample4.fastq.gz

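Note that you will have to remove the hardcoding I did for the demo and change how SAMPLE is collected. SAMPLE needs to be a list of lists, where each inner list holds the file names found in one particular group, in the same order as the groups appear in GROUP. I added a comment in the code suggesting how to do that.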
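One possible way to collect GROUP and SAMPLE in that shape is sketched below. The helper names `collect_samples_by_group` and `existing_combinations`, and the `.fastq.gz` suffix filter, are assumptions made for illustration and are not part of the original post or of Snakemake.

```python
import os


def collect_samples_by_group(base_dir, suffix=".fastq.gz"):
    """Return (groups, samples) where samples[i] lists the files in groups[i].

    Hypothetical helper: the name and the suffix filter are illustrative.
    """
    # Sort so the order is deterministic across runs and file systems.
    groups = sorted(
        name for name in os.listdir(base_dir)
        if os.path.isdir(os.path.join(base_dir, name))
    )
    # Walk the groups in the same order so both lists stay aligned.
    samples = [
        sorted(
            fn for fn in os.listdir(os.path.join(base_dir, group))
            if fn.endswith(suffix)
        )
        for group in groups
    ]
    return groups, samples


def existing_combinations(base_dir, groups, samples):
    """Pair each group only with its own files, so every path exists."""
    return [
        os.path.join(base_dir, group, fn)
        for group, files in zip(groups, samples)
        for fn in files
    ]
```

In a Snakefile, something like `GROUP, SAMPLE = collect_samples_by_group("directory")` could then replace the hardcoded demo lists, and the result of `existing_combinations` could be passed to `input`.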
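I am sure there are cleverer ways to do this with wildcards, using something from itertools, such as the zip-with-expand approach in the Snakemake documentation, but I know some Python, and plain Python was more direct for me.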
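To illustrate the difference between the default product-style expansion and a zip-style expansion, here is a small pure-Python sketch. It only emulates the pairing behavior with `str.format`; the flattened lists `GROUPS` and `SAMPLES` and the name `PATTERN` are assumptions for this demo, with sample names shown without their file extension.

```python
from itertools import product

# Parallel, flattened lists: GROUPS[i] is the directory that holds SAMPLES[i].
GROUPS = ["group1", "group1", "group2", "group2"]
SAMPLES = ["sample1", "sample2", "sample3", "sample4"]
PATTERN = "other_directory/{group}/{group}{sample}.bam"

# Cartesian product over the unique groups: includes nonexistent pairs
# such as group1 with sample3.
all_pairs = [
    PATTERN.format(group=g, sample=s)
    for g, s in product(sorted(set(GROUPS)), SAMPLES)
]

# Element-wise zip: only the pairs that actually exist on disk.
wanted = [PATTERN.format(group=g, sample=s) for g, s in zip(GROUPS, SAMPLES)]

# Inside a Snakefile, the corresponding call would be along the lines of:
#   expand(PATTERN, zip, group=GROUPS, sample=SAMPLES)
```

The key design point is that zip requires the two lists to be the same length and aligned index by index, which is why the group name is repeated once per sample.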
