Why does reading a group from an HDF5 file in R throw "H5Identifier not valid"?

jdzmm42g  posted on 2023-03-20 in Other

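I have downloaded the human HDF5 file (28 GB) from the ARCHS4 RNA-seq data. I want to access the expression data and the group information. I am using the code below: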

h5_exprs <- h5read("archs4_gene_human_v2.1.2.h5", "data/expression")

It throws:
Error (scratch_11.R#9): Error in h5checktype(). H5Identifier not valid.
What additional steps should I take to troubleshoot this?
When I run h5ls("archs4_gene_human_v2.1.2.h5"), the output looks like this:

group                  name       otype  dclass            dim
0              /                  data   H5I_GROUP                       
1          /data            expression H5I_DATASET INTEGER 620825 x 62548
2              /                  meta   H5I_GROUP                       
3          /meta                 genes   H5I_GROUP                       
4    /meta/genes           gene_symbol H5I_DATASET  STRING          62548
5          /meta               samples   H5I_GROUP                       
6  /meta/samples         aligned_reads H5I_DATASET INTEGER         620825
7  /meta/samples         channel_count H5I_DATASET  STRING         620825
8  /meta/samples   characteristics_ch1 H5I_DATASET  STRING         620825
9  /meta/samples       contact_address H5I_DATASET  STRING         620825
10 /meta/samples          contact_city H5I_DATASET  STRING         620825
11 /meta/samples       contact_country H5I_DATASET  STRING         620825
12 /meta/samples     contact_institute H5I_DATASET  STRING         620825
13 /meta/samples          contact_name H5I_DATASET  STRING         620825
14 /meta/samples           contact_zip H5I_DATASET  STRING         620825
15 /meta/samples       data_processing H5I_DATASET  STRING         620825
16 /meta/samples  extract_protocol_ch1 H5I_DATASET  STRING         620825
17 /meta/samples         geo_accession H5I_DATASET  STRING         620825
18 /meta/samples      instrument_model H5I_DATASET  STRING         620825
19 /meta/samples      last_update_date H5I_DATASET  STRING         620825
20 /meta/samples     library_selection H5I_DATASET  STRING         620825
21 /meta/samples        library_source H5I_DATASET  STRING         620825
22 /meta/samples      library_strategy H5I_DATASET  STRING         620825
23 /meta/samples          molecule_ch1 H5I_DATASET  STRING         620825
24 /meta/samples          organism_ch1 H5I_DATASET  STRING         620825
25 /meta/samples           platform_id H5I_DATASET  STRING         620825
26 /meta/samples              relation H5I_DATASET  STRING         620825
27 /meta/samples             series_id H5I_DATASET  STRING         620825
28 /meta/samples singlecellprobability H5I_DATASET   FLOAT         620825
29 /meta/samples       source_name_ch1 H5I_DATASET  STRING         620825
30 /meta/samples                sra_id H5I_DATASET  STRING         620825
31 /meta/samples                status H5I_DATASET  STRING         620825
32 /meta/samples       submission_date H5I_DATASET  STRING         620825
33 /meta/samples             taxid_ch1 H5I_DATASET  STRING         620825
34 /meta/samples                 title H5I_DATASET  STRING         620825
35 /meta/samples                  type H5I_DATASET  STRING         620825
7gs2gvoe  1#

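I'm not sure what causes this error. I haven't downloaded the whole 28 GB file, but I am able to read a subset of the /data/expression dataset directly from the S3 storage, for example: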

library(rhdf5)

h5file <- 'https://s3.dev.maayanlab.cloud/archs4/archs4_gene_human_v2.1.2.h5'

h5read(file = h5file, 
       name = "/data/expression", 
       index = list(1:10, 1:12),
       s3 = TRUE)

#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#>  [1,]  353 1110    3    0    0   51    0    2    0     0   467     0
#>  [2,]  342  873    2    0    1   33    1    5    0     0   388     0
#>  [3,]  358 1171    1    0    0   41    0    5    0     0   391     0
#>  [4,]  393  849    1    0    0   40    0    0    0     0   148     0
#>  [5,]  427  821    0    0    0   30    0    0    0     0   112     0
#>  [6,]  293  613    1    0    0   22    3    3    0     0   112     0
#>  [7,]    0    0    0    1    0    0    0    0    0     0     0     0
#>  [8,]    0    0    0    3    0    0    0    0    0     0     0     0
#>  [9,]    1    0    0    5    0    0    0    0    0     0     0     0
#> [10,]    0    0    0    3    0    0    0    0    0     0     0     0
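
The same subsetting idea should carry over to a local copy of the file. The following is only a sketch: it assumes the file sits in the current working directory, and it assumes (based on the dimensions in the h5ls() output above) that the first dimension indexes samples and the second indexes genes. The variable names are illustrative.

```r
# Assumption: the downloaded file is in the current working directory
h5_path <- "archs4_gene_human_v2.1.2.h5"

if (requireNamespace("rhdf5", quietly = TRUE) && file.exists(h5_path)) {
  # Read only a 10 x 12 corner of the matrix rather than the whole dataset
  exprs_subset <- rhdf5::h5read(h5_path, "/data/expression",
                                index = list(1:10, 1:12))

  # Read the matching slices of the metadata vectors
  samples <- rhdf5::h5read(h5_path, "/meta/samples/geo_accession",
                           index = list(1:10))
  genes <- rhdf5::h5read(h5_path, "/meta/genes/gene_symbol",
                         index = list(1:12))

  # Label the subset; this relies on the samples-by-genes assumption above
  dimnames(exprs_subset) <- list(as.character(samples), as.character(genes))
  print(exprs_subset)
}
```

If this small read works locally while the full h5read() call fails, that would point towards the size of the request rather than the file itself.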

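Some thoughts:

  • I assume the h5read() command you show is indeed the one on line 9 of scratch_11.R.
  • You could try running h5errorHandling(type = "verbose") before h5read(). This prints a longer HDF5 error stack and may help narrow down the problem.
  • Reading the whole dataset would need roughly 150 GB of RAM, although if that were the problem I would expect R to produce an unable to allocate vector of size ... error.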
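
As a rough illustration of the last two points, the sketch below first works out the memory estimate from the dimensions reported by h5ls(), then turns on verbose error handling and retries with a smaller block. The file location and the block size of 1000 samples are assumptions chosen for the example, not values from the question.

```r
# Dimensions of /data/expression as reported by h5ls()
n_samples <- 620825
n_genes   <- 62548

# R stores each integer in 4 bytes
bytes_needed <- n_samples * n_genes * 4
gb_needed <- bytes_needed / 1e9
cat(sprintf("Full matrix: about %.0f GB\n", gb_needed))

# Assumption: the downloaded file is in the current working directory
h5_path <- "archs4_gene_human_v2.1.2.h5"

if (requireNamespace("rhdf5", quietly = TRUE) && file.exists(h5_path)) {
  # Ask the HDF5 library to print its full error stack on failure
  rhdf5::h5errorHandling(type = "verbose")

  # Retry with a modest block: the first 1000 samples, all genes
  # (a NULL entry in index selects everything along that dimension)
  block <- rhdf5::h5read(h5_path, "/data/expression",
                         index = list(1:1000, NULL))
  print(dim(block))
}
```

The estimate comes out at about 155 GB, in line with the figure quoted above, while a block of 1000 samples needs only around 250 MB.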
