PostgreSQL query: how do I count each base in a DNA sequence?

5hcedyr0  posted on 2021-07-24  in Java

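I have 5 random DNA sequences, each 20 bases long, and I want to count the bases in each one.
First I prepared a dna_length function that generates a random sequence of a given length; the second query below generates 5 sequences of 20 bases each. What I still need is the base counts: how many 'A', 'C', 'G', and 'T' bases each sequence contains.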

prepare dna_length(int) as
with t1 as (select chr(65) as s union select chr(67) union select chr(71) union select chr(84) )
, t2 as ( select s, row_number() over() as rn from t1)
, t3 as ( select generate_series(1,$1) as i,round(random() * 4 + 0.5) as rn )
, t4 as ( select t2.s from t2 join t3 on (t2.rn=t3.rn))
select array_to_string(array(select s from t4),'') as dna;

with t1 as (
    select 1 as rn, 'A' as s
    union select 2, 'C' 
    union select 3, 'T' 
    union select 4, 'G' 
), t2 as (
    select generate_series(1, 5) as sample
), t3 as ( 
    select t2.sample, generate_series(1,20) as i,
           round(random() * 4 + 0.5) as rn 
      from t2
), t4 as (
    select t3.sample, t3.i, t3.rn, t1.s
      from t3 
      join t1 on t1.rn = t3.rn
) 
select sample, string_agg(s, '' order by i) 
  from t4
 group by sample
 order by sample;

The output currently looks like this:

id          DNA          
1   ACTGCTGCAGTCGTACGTAC 
2   TGCAGTCGTAGCTGACGTAG 
3   GCAGTGACCAACGTGTGACA 
4   TGACGTGTCGAGACGAAGAG 
5   CGTGTGAGAGTCGTAGAGTG

The result should look like this:

id          DNA            A   C   G   T
1   ACTGCTGCAGTCGTACGTAC   4   6   5   5
2   TGCAGTCGTAGCTGACGTAG   4   4   7   5
3   GCAGTGACCAACGTGTGACA   6   5   6   3
4   TGACGTGTCGAGACGAAGAG   6   3   8   3
5   CGTGTGAGAGTCGTAGAGTG   4   2   9   5

Answer #1, by e0uiprwp

You can do conditional counting in the final query:

with ...
select 
    sample, 
    string_agg(s, '' order by i) dna,
    count(*) filter(where s = 'A') a,
    count(*) filter(where s = 'C') c,
    count(*) filter(where s = 'G') g,
    count(*) filter(where s = 'T') t
from t4
group by sample
order by sample;
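
The snippet above elides the CTEs as `with ...`. As a rough sketch of how the full statement might look, the query below combines the CTEs from the question with the conditional counts. One detail is my own substitution rather than something from the question or the answer: the random index uses `floor(random() * 4)::int + 1` in place of `round(random() * 4 + 0.5)`, on the assumption that an index always in 1..4 is what was intended.

```sql
with t1 as (
    -- lookup table: index -> base (same mapping as in the question)
    select 1 as rn, 'A' as s
    union select 2, 'C'
    union select 3, 'T'
    union select 4, 'G'
), t2 as (
    -- one row per sample (5 sequences)
    select generate_series(1, 5) as sample
), t3 as (
    -- 20 positions per sample, each with a random index in 1..4
    -- (assumption: floor(...) + 1 swapped in for the question's round(...))
    select t2.sample,
           generate_series(1, 20) as i,
           floor(random() * 4)::int + 1 as rn
      from t2
), t4 as (
    -- translate each random index into its base letter
    select t3.sample, t3.i, t1.s
      from t3
      join t1 on t1.rn = t3.rn
)
select sample,
       string_agg(s, '' order by i)    as dna,
       count(*) filter (where s = 'A') as a,
       count(*) filter (where s = 'C') as c,
       count(*) filter (where s = 'G') as g,
       count(*) filter (where s = 'T') as t
  from t4
 group by sample
 order by sample;
```

Because the sequences are random, the output differs on every run, but the four counts in each row should add up to 20. The `filter` clause needs PostgreSQL 9.4 or later; on older versions, `sum(case when s = 'A' then 1 else 0 end)` gives the same counts.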
