从R中的一列中选择与另一列中的值共享的值的最大数量

nkcskrwz  于 2023-07-31  发布在  其他
关注(0)|答案(6)|浏览(86)

bounty 15小时后到期。回答此问题可获得+300声望奖励。tmfmnk希望引起更多的注意这个问题。

我有一个大型数据集,其中包含40多年来不规则采样的站点。我想选择共享的网站的最大数量,比方说,至少5年的数据。
任何指针将不胜感激。
下面是一个示例数据集:

library(tidyverse)

set.seed(123)

(DF <- tibble(
  Sites = 1:100,
  NYears =rbinom(100, 40, .2)
  ) %>%
  rowwise() %>%
  mutate(Years = list(sample(1982:2021, NYears))) %>%
  unnest(Years) %>%
  select(-NYears))

字符串

6ljaweal

6ljaweal1#

免责声明

下面的方法 * 不 * 像solution by @jblood94一样有效(所以,如果你追求速度,不要使用我的解决方案来处理大型数据集),但是***只是为了改变思维方式,以图论的方式思考,并探索使用igraph解决问题的可能性***。

Brief Idea(图论视角)

一般来说,我认为这个问题可以用图论的方法来处理,用igraph来解决。如果你追求效率,你可能需要探索隐藏在图形背后的潜在属性。举例来说:

  • 共享Years的数量可以解释为与两个Sites顶点相关联的边权重。
  • 此外,由于在搜索集团时可以跳过具有权重<=4的一些边,因此可以进一步简化图。修剪网络并在之后进行搜索应该比迭代所有可能的组合更有效。
    如果您对细节感兴趣,请参考code breakdowns的后续回答。

一种igraph方式

下面可能是一个igraph选项来解决这个问题(有关详细信息,请参阅代码的注解):您可以在Sites上尝试graph_from_adjacency_matrix,并使用cliques()查找集团,例如,

res <- DF %>%
    table() %>%
    tcrossprod() %>%
    # build a graph based on the adjacency matrix of `Sites`, where the "weight" attribute denotes the number of shared `Years`
    graph_from_adjacency_matrix(
        "undirected",
        diag = FALSE,
        weighted = TRUE
    ) %>%
    # prune the graph by keeping only the arcs that meet the condition, i.e., >= 5 (share at least 5 years of data)
    subgraph.edges(E(.)[E(.)$weight > 4]) %>%
    # find all cliques
    cliques(min = 2) %>%
    # double check if `Sites` in each clique meet the condition, using full info from `DF`
    Filter(
        \(q) {
            sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) > 4
        }, .
    ) %>%
    # pick the clique that consists of the maximum number of `Sites`
    `[`(lengths(.) == max(lengths(.)))

字符串
或替代方案

res <- DF %>%
    table() %>%
    tcrossprod() %>%
    `>=`(5) %>%
    graph_from_adjacency_matrix(mode = "undirected", diag = FALSE) %>%
    # find all cliques
    cliques(min = 2) %>%
    # double check if `Sites` in each clique meet the condition, using full info from `DF`
    Filter(
        \(q) {
            sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) >= 5
        }, .
    ) %>%
    # pick the clique that consists of the maximum number of `Sites`
    `[`(lengths(.) == max(lengths(.)))


这给了

> res
[[1]]
+ 3/57 vertices, named, from d7ac134:
[1] 31 59 67

[[2]]
+ 3/57 vertices, named, from d7ac134:
[1] 26 53 84


如果您想进一步显示共享年份,可以在res之上执行其他操作,例如,

lapply(
    res,
    \(q) {
        list(
            sites = as.integer(names(q)),
            sharedYears = as.integer(names(which(table(with(DF, Years[Sites %in% names(q)])) == length(q))))
        )
    }
)


这给了

[[1]]
[[1]]$sites
[1] 31 59 67

[[1]]$shared_years
[1] 1989 1992 1999 2002 2005

[[2]]
[[2]]$sites
[1] 26 53 84

[[2]]$shared_years
[1] 1989 1991 1998 2001 2011

讨论

在上面的igraph选项中,cliques()将是性能瓶颈,特别是当条件是“共享Years的数量应该是>=k”时,对于小型k,例如k=1k=2。在这些情况下,cliques()Filter()之前枚举的集团明显更多。您可以参考@jblood94的基准测试结果。

qgelzfjb

qgelzfjb2#

这里是一种方法,从满足标准的站点对开始,然后迭代地将站点添加到每个组。它非常高效,几乎立即解决了OP数据集上的5个共享年问题。
几乎可以肯定的是,这种方法的效率还有待提高,但这应该给予一个很好的起点。
作为一个函数实现:

library(Matrix)
library(data.table)
library(Rfast) # for rowSort()

f <- function(df, shared) {
  if (shared == 1) {
    # special case
    # get the yeear(s) with the maximum number of sites
    dt <- setorder(setDT(df), Sites, Years)[
      Years %in% df[,.(.N), Years][N == max(N)][[1]],
      setDT(setNames(as.list(Sites), paste0("Site", 1:.N))), Years
    ]
    # combine years with the same set of sites
    setcolorder(
      dt[
        ,.(nShared = .N, Years = .(Years)), eval(names(dt)[-1])
      ], "nShared"
    )
  } else if (shared == 2) {
    # special case
    # create a sparse matrix with year-site pairs
    m <- sparseMatrix(
      match(df$Sites, u2 <- sort(unique(df$Sites))), # sites along rows
      match(df$Years, u1 <- sort(unique(df$Years))), # years along columns
      x = 1L
    )
    # shared sites between year pairs
    m2 <- as(triu(crossprod(m), 1), "TsparseMatrix")
    # return an empty data.table if no pairs meet the criterion
    if (!length(m2@x)) return(
      data.table(nShared = integer(0), Site1 = integer(0), Years = list())
    )
    # find the year pairs that share the maximum number of sites
    i <- which(m2@x == (mx <- max(m2@x)))
    setcolorder(
      # initialize the output
      as.data.table(
        matrix(
          u2[(m[,m2@i[i] + 1L, drop = FALSE]*m[,m2@j[i] + 1L, drop = FALSE])@i + 1L],
          length(i), mx, 1, list(NULL, paste0("Site", 1:mx))
        )
      )[
        , Years := as.list(
          as.data.frame(
            matrix(u1[c(m2@i[i] + 1L, m2@j[i] + 1L)], 2, byrow = TRUE)
          )
        )
        # combine year pairs with the same set of sites
      ][,.(Years = .(unique(unlist(Years)))), eval(paste0("Site", 1:mx))][
        ,nShared := lengths(Years)
      ], "nShared"
    )
  } else {
    # create a sparse matrix with year-site pairs
    m <- sparseMatrix(
      match(df$Years, u1 <- sort(unique(df$Years))), # years along rows
      match(df$Sites, u2 <- sort(unique(df$Sites))), # sites along columns
      x = 1L
    )
    # shared years between site pairs
    m2 <- as(triu(crossprod(m), 1), "TsparseMatrix")
    i <- m2@x >= shared # index of pairings with at least "shared" years shared
    # initialize the output
    dt <- data.table(nShared = m2@x[i], Site1 = m2@i[i] + 1L, Site2 = m2@j[i] + 1L)
    # return an empty data.table if no pairs meet the criterion
    if (!nrow(dt)) return(dt[,Years := list()])
    n <- 2L # current number of sites all sharing "shared" number of years
    
    while (1) {
      # find additional sites to that can be added to each group
      m2 <- as(crossprod(Reduce("*", lapply(2:(n + 1L), \(i) m[,dt[[i]], drop = FALSE])), m), "TsparseMatrix")
      # don't add the same site twice!
      m2[cbind(rep(1:nrow(dt), n), unlist(dt[,2:(n + 1L)]))] <- 0
      i <- m2@x >= shared
      
      if (any(i)) { # an additional site can be added to at least one group
        n <- n + 1L
        # update the outupt
        dt <- unique(dt[m2@i[i] + 1L][,paste0("Site", n) := m2@j[i] + 1L][
          ,nShared := m2@x[i]
        ][
          ,paste0("Site", 1:n) := as.data.frame(rowSort(as.matrix(.SD))),
          .SDcols = 2:(n + 1L)
        ])
      } else break
    }
    
    # convert site indices back to the original values and record the common
    # years for each group
    dt[
      ,Years := .(.(u1[as.logical(Reduce("*", lapply(.SD, \(i) m[,i])))])),
      1:nrow(dt), .SDcols = 2:(n + 1L)
    ][
      ,paste0("Site", 1:n) := as.data.frame(matrix(u2[unlist(.SD)], .N)),
      .SDcols = 2:(n + 1L)
    ]
  }
}

字符串
测试:

f(DF, 1)[]
#>    nShared Site1 Site2 Site3 Site4 Site5 Site6 Site7 Site8 Site9 Site10 Site11 Site12 Site13 Site14 Site15 Site16 Site17 Site18 Site19 Site20 Site21 Site22 Site23 Site24 Site25 Site26 Site27 Site28 Site29 Site30 Site31 Site32 Years
#> 1:       1     1     2     4     5    11    12    13    14    22     38     41     42     46     47     48     54     58     59     62     65     66     69     71     72     75     80     87     88     90     92     93    100  2004
f(DF, 2)[]
#>    nShared Site1 Site2 Site3 Site4 Site5 Site6 Site7 Site8 Site9 Site10 Site11 Site12 Site13     Years
#> 1:       2     3    11    26    31    37    38    49    53    55     64     65     68     84 1989,1991
f(DF, 3)[]
#>    nShared Site1 Site2 Site3 Site4 Site5 Site6          Years
#> 1:       3    24    48    61    87    95    96 1984,2017,2018
#> 2:       3    26    31    37    53    65    84 1989,1991,2001
#> 3:       3    26    31    53    55    64    84 1989,1991,1998
f(DF, 4)[]
#>    nShared Site1 Site2 Site3 Site4               Years
#> 1:       4    24    48    87    95 1984,2014,2017,2018
#> 2:       4    26    31    53    84 1989,1991,1998,2001
#> 3:       4    26    37    53    84 1989,1991,2001,2011
f(DF, 5)[]
#>    nShared Site1 Site2 Site3                    Years
#> 1:       5    26    53    84 1989,1991,1998,2001,2011
#> 2:       5    31    59    67 1989,1992,1999,2002,2005
f(DF, 6)[]
#>     nShared Site1 Site2                             Years
#>  1:       6     5    13     1988,2004,2007,2010,2013,2021
#>  2:       6    11    13     1988,2004,2006,2007,2010,2020
#>  3:       7    26    31 1989,1991,1998,1999,2001,2017,...
#>  4:       6     5    32     1986,1987,1991,2007,2010,2013
#>  5:       6    31    59     1989,1992,1993,1999,2002,2005
#>  6:       6    31    67     1989,1990,1992,1999,2002,2005
#>  7:       6    59    67     1985,1989,1992,1999,2002,2005
#>  8:       6    31    68     1989,1991,1992,2002,2009,2021
#>  9:       6    10    69     1983,1987,1989,1994,1995,2013
#> 10:       6     4    84     1998,2001,2003,2005,2011,2018
#> 11:       6    31    84     1989,1990,1991,1998,2001,2005
#> 12:       6    20    87     1994,2001,2010,2014,2015,2021
#> 13:       7    24    87 1984,2002,2006,2014,2015,2017,...
#> 14:       7    58    87 1986,2004,2005,2006,2014,2017,...
#> 15:       6    68    87     1984,1994,2002,2014,2015,2021
#> 16:       6    72    87     1986,2001,2002,2004,2010,2017
#> 17:       6    31    88     1989,1992,1993,2005,2009,2021
#> 18:       6    33    88     2005,2008,2009,2011,2013,2021
#> 19:       6    82    88     1989,1997,2005,2011,2013,2021
#> 20:       6     8    89     1982,2000,2003,2006,2011,2017
#> 21:       6    24    89     1984,1996,2000,2006,2011,2017
#> 22:       6     5    92     1987,1991,1997,2004,2007,2013
#> 23:       6    21    94     1984,1985,1987,1993,1999,2020
#> 24:       6     8    97     1992,1997,2003,2006,2011,2020
#>     nShared Site1 Site2                             Years
f(DF, 8)[]
#> Empty data.table (0 rows and 4 cols): nShared,Site1,Site2,Years

标杆管理

基准测试将把这种方法与@ThomasIsCoding的igraph解决方案进行比较。但是,当为最小共享年数选择较小的值(OP的数据集为1或2)时,igraph会变得非常慢。这些被排除在基准之外。
igraph解决方案作为函数:

library(igraph)

f2 <- function(DF, shared) {
  lapply(
    DF %>%
      table() %>%
      tcrossprod() %>%
      # build a graph based on the adjacency matrix of `Sites`, where the "weight" attribute denotes the number of shared `Years`
      graph_from_adjacency_matrix(
        "undirected",
        diag = FALSE,
        weighted = TRUE
      ) %>%
      # prune the graph by keeping only the arcs that meet the condition, i.e., >= 5 (share at least 5 years of data)
      subgraph.edges(E(.)[E(.)$weight >= shared]) %>%
      # find all cliques
      cliques() %>%
      # double check if `Sites` in each clique meet the condition, using full info from `DF`
      Filter(
        \(q) {
          sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) >= shared
        }, .
      ) %>%
      # pick the clique that consists of the maximum number of `Sites`
      `[`(lengths(.) == max(lengths(.))),
    \(q) {
      list(
        sites = as.integer(names(q)),
        sharedYears = as.integer(names(which(table(with(DF, Years[Sites %in% names(q)])) == length(q))))
      )
    }
  )
}


时间:

# The following will run for several minutes without returning a
# result. They will be excluded from the benchmark.
# system.time(f2(DF, 1))
# system.time(f2(DF, 2))

microbenchmark::microbenchmark(
  matrix1 = f(DF, 1),
  # igraph1 = f2(DF, 1),
  matrix2 = f(DF, 2),
  # igraph2 = f2(DF, 2),
  matrix3 = f(DF, 3),
  igraph3 = f2(DF, 3),
  matrix4 = f(DF, 4),
  igraph4 = f2(DF, 4),
  matrix5 = f(DF, 5),
  igraph5 = f2(DF, 5),
  matrix6 = f(DF, 6),
  igraph6 = f2(DF, 6),
  matrix8 = f(DF, 8),
  igraph8 = f2(DF, 8),
  times = 10
)

#> Unit: milliseconds
#>     expr       min        lq       mean     median        uq       max neval
#>  matrix1    4.6677    6.1859    6.61501    6.83370    7.0784    8.2880    10
#>  matrix2    2.4428    2.5075    2.84573    2.80790    3.1315    3.5438    10
#>  matrix3   35.9857   37.0604   43.86820   41.89480   49.0314   61.3006    10
#>  igraph3 5934.9483 6136.9761 6240.44407 6278.98705 6362.0348 6471.6932    10
#>  matrix4   11.7949   12.2703   13.64224   13.99900   14.4769   16.3989    10
#>  igraph4  184.7337  191.0718  210.06073  203.10085  218.6591  255.5529    10
#>  matrix5    6.3713    6.8209    7.23380    7.27330    7.6825    7.9681    10
#>  igraph5   26.8565   29.4616   32.98676   33.06670   35.5976   39.0243    10
#>  matrix6    6.3562    6.5861    7.14332    6.86025    7.6393    8.6200    10
#>  igraph6   10.9108   11.6408   14.39365   12.39585   14.8864   28.6385    10
#>  matrix8    1.1340    1.2327    1.33831    1.33730    1.4175    1.5043    10
#>  igraph8    2.9498    3.6359    3.95806    3.72040    4.3448    5.2509    10


igraph方法始终较慢,有时甚至非常慢。

数据

set.seed(123)

(DF <- tibble(
  Sites = 1:100,
  NYears =rbinom(100, 40, .2)
) %>%
    rowwise() %>%
    mutate(Years = list(sample(1982:2021, NYears))) %>%
    unnest(Years) %>%
    select(-NYears))

2vuwiymt

2vuwiymt3#

这是一个简单的解决方案,虽然它可能对您的大型数据集不可行,但可能是寻找更好解决方案的良好起点:

library(tidyverse)

set.seed(123)

# I create the dataset myself, because I don't want it to be unnested
DF <- tibble(
  Sites = 1:100,
  NYears =rbinom(100, 40, .2)
  ) %>%
  rowwise() %>%
  mutate(Years = list(sort(sample(1982:2021, NYears)))) # sorting the years is good for later when I want to find the combinations, I can be sure that they will be in the same order

# basically, we're doing a crossjoin, filtering to overlaps larger than 5, then generating all possible combinations of those overlaps
overlaps <- cross_join(DF, DF) %>%
  filter(Sites.x < Sites.y) %>%
  mutate(Overlap = list(intersect(Years.x, Years.y))) %>%
  filter(length(Overlap) >= 5) %>%
  mutate(combinations = list(combn(Overlap, 5, simplify = FALSE))) %>%
  select(combinations, Sites.x, Sites.y) %>% 
  unnest(combinations)

most_common_fives <- overlaps %>%
  count(combinations) %>%
  slice_max(n) %>%
  pull(combinations)

overlaps %>%
    filter(combinations %in% most_common_fives) %>%
    group_by(combinations) %>%
    summarise(values = (list(unique(c(Sites.x, c(Sites.y)))))) %>%
    pull(combinations, values) 

$`c(26, 53, 84)`
[1] 1989 1991 1998 2001 2011

$`c(31, 59, 67)`
[1] 1989 1992 1999 2002 2005

字符串

jum4pzuy

jum4pzuy4#

这里是我最初的蛮力方法的一个(有点)优化版本,它现在不会遍历所有可能的年份组合,而是迭代地构建在 n+1年内采样的网站列表。

yr.sites <- tapply(DF$Sites, DF$Years, identity, simplify=F)
n.yrs <- length(yr.sites)
# year indices of each combination are stored in an attribute 'yrs'
for (i in seq_len(n.yrs)) attr(yr.sites[[i]], 'yrs') <- i
  
add_year <- function(ycs) {
  ycs <- lapply(ycs, \(sites) {
    yrs.comb <- attr(sites, 'yrs')
    last.yr <- tail(yrs.comb, 1)
    if (last.yr < n.yrs) { 
      # find subsets of sites sampled also in another year (for all years 
      #   after the last year already considered)
      lapply(seq(last.yr+1, n.yrs), \(yr) {
        inter <- intersect(sites, yr.sites[[yr]])
        # keep only intersections containing two or more sites
        if (length(inter) > 1) {
          attr(inter, 'yrs') <- c(yrs.comb, yr)
          inter
        } else NULL
      })
    } else NULL
  })
  ycs <- unlist(ycs, recursive=F)
  ycs[lengths(ycs) > 0]
}

yrs.comb.sites <- yr.sites
for (i in 2:5) {
  yrs.comb.sites <- add_year(yrs.comb.sites)
}

字符串
仍然需要大约一秒钟的时间来计算,因此其他答案可能更有效,但与蛮力方法相比,显着减少了访问的组合数量。
获奖者是:

best <- unname(yrs.comb.sites[lengths(yrs.comb.sites)==max(lengths(yrs.comb.sites))])
for (i in seq_along(best)) {
  attr(best[[i]], 'yrs') <- as.numeric(names(yr.sites))[attr(best[[i]], 'yrs')]
}
best
# [[1]]
# [1] 26 53 84
# attr(,"yrs")
# [1] 1989 1991 1998 2001 2011
# 
# [[2]]
# [1] 31 59 67
# attr(,"yrs")
# [1] 1989 1992 1999 2002 2005

原始答案

  • 这是一个非常缓慢的暴力破解方法,@Mark和@jblood94都提供了一个更有效的解决方案。在我的机器上(老实说,这台机器并不强大),这需要大约80秒的时间来计算。
yr.site <- tapply(DF$Sites, DF$Years, identity, simplify=F)
year.comb.sites <- combn(yr.site, 5, Reduce, f=intersect, simplify=F)
max.groups <- year.comb.sites[lengths(year.comb.sites) == max(lengths(year.comb.sites))]
max.groups
# [[1]]
# [1] 26 53 84
# 
# [[2]]
# [1] 31 59 67


代码很短,但背后有大量的计算。yr.sites是一个列表,存储每年采样的站点。combn线生成长度为5年的所有组合(有658008个这样的组合),并为每个组合找到在所有这些年中采样的站点。最后,有2组最多三个站点满足初始条件。
我们可以用

sort(table(subset(DF, Sites %in% max.groups[[1]])$Years))
# 1984 1988 1990 1993 1994 1996 1999 2000 2003 2005 2010 2017 2018 2021 1989 1991 1998 2001 2011 
#    1    1    1    1    1    1    1    1    1    1    1    1    1    1    3    3    3    3    3


这表明,例如第一组中的所有三个地点实际上都是在1989年、1991年、1998年、2001年和2011年取样的。
正如我已经说过的,这个解决方案是无效的,如果你想要求超过5年的共享,它将无法很好地扩展。例如,在6年的情况下,组合(choose(40, 6))的数量将增加到3838380,并且更高的年数也不是好消息。

nnt7mjpx

nnt7mjpx5#

igraph solution后续:分解代码

我不想让我的igraph答案太长而难以阅读,所以我创建了一个新的答案来详细说明该解决方案的工作原理。

虚拟示例

为了简单起见,我们可以从一个较小的虚拟示例开始,我们假设我们的目标是 * 找到共享至少3年 * 的Sites的最大数量。

set.seed(0)
DF <- data.frame(
    Sites = rep(1:10, each = 10),
    Years = sample(2000:2020, 100, replace = TRUE)
) %>%
    unique()

字符串
并且DF在图中的可视化看起来像可以通过

DF %>%
    graph_from_data_frame() %>%
    set_vertex_attr(name = "color", value = names(V(.)) %in% DF$Sites) %>%
    plot()


的数据

步骤

  • 1). * 正如我们所看到的,这个图可以被视为一个 * 二分图 *(SitesYears是两种不同类型的顶点),并且共享的Years实际上可以投影到边。由于我们需要跟踪共享Years的 * 数量 *,因此可以使用边属性"weight"来标识共享Years的计数。在这种情况下,在投影之前需要Sites的邻接矩阵,并且通过table + tcrossprod来获得Sites,例如,
adjmat <- DF %>%
    table() %>%
    tcrossprod()


这给了

> adjmat
     Sites
Sites 1 2 3 4 5 6 7 8 9 10
   1  8 2 3 0 4 4 4 2 5  3
   2  2 7 5 2 3 1 2 3 2  4
   3  3 5 9 4 4 3 2 3 2  4
   4  0 2 4 6 3 2 1 2 1  3
   5  4 3 4 3 9 5 4 4 4  4
   6  4 1 3 2 5 7 2 3 2  3
   7  4 2 2 1 4 2 9 4 3  4
   8  2 3 3 2 4 3 4 9 2  5
   9  5 2 2 1 4 2 3 2 7  4
   10 3 4 4 3 4 3 4 5 4  9

  • 2). * 如目标中所述,我们希望找出具有>=3共享年的组,这意味着边的"weight"至少应为3。在步骤1)中获得的邻接矩阵adjmat之上,我们可以进一步应用过滤器(>=3)来简化矩阵,这相当于 * 修剪 * 网络,即,
g <- adjmat %>%
    `>=`(3) %>%
    graph_from_adjacency_matrix(mode = "undirected", diag = FALSE)


plot(g)显示的项目如下所示

  • 3). * 我们已经知道,共享相同YearsSites应该产生一个完全子图,即 * 团 *。因此,我们由此可以枚举图g中的所有团,这由函数cliques()完成,即,
clq <- g %>%
    cliques(min = 2)


其中min = 2指定每个团的最小大小。如果您事先没有关于大小的信息,使用cliques()也可以,不需要任何额外的参数。现在,clq是一个列表,看起来像

> clq 
[[1]]
+ 2/10 vertices, named, from 3b71441:
[1] 5  10

[[2]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 5

[[3]]
+ 3/10 vertices, named, from 3b71441:
[1] 3  5  10

[[4]]
+ 2/10 vertices, named, from 3b71441:
[1] 3  10

[[5]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 7

[[6]]
+ 3/10 vertices, named, from 3b71441:
[1] 5  7  10

[[7]]
+ 2/10 vertices, named, from 3b71441:
[1] 7  10

[[8]]
+ 2/10 vertices, named, from 3b71441:
[1] 7 8

[[9]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 7 8

[[10]]
+ 4/10 vertices, named, from 3b71441:
[1] 5  7  8  10

[[11]]
+ 3/10 vertices, named, from 3b71441:
[1] 7  8  10

[[12]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 8

[[13]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 5 8

[[14]]
+ 4/10 vertices, named, from 3b71441:
[1] 3  5  8  10

[[15]]
+ 3/10 vertices, named, from 3b71441:
[1] 3  8  10

[[16]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 8

[[17]]
+ 3/10 vertices, named, from 3b71441:
[1] 5  8  10

[[18]]
+ 2/10 vertices, named, from 3b71441:
[1] 8  10

[[19]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 7

[[20]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 5 7

[[21]]
+ 4/10 vertices, named, from 3b71441:
[1] 1  5  7  10

[[22]]
+ 3/10 vertices, named, from 3b71441:
[1] 1  7  10

[[23]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 3

[[24]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 3 5

[[25]]
+ 4/10 vertices, named, from 3b71441:
[1] 1  3  5  10

[[26]]
+ 3/10 vertices, named, from 3b71441:
[1] 1  3  10

[[27]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 5

[[28]]
+ 3/10 vertices, named, from 3b71441:
[1] 1  5  10

[[29]]
+ 2/10 vertices, named, from 3b71441:
[1] 1  10

[[30]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 4

[[31]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 4 5

[[32]]
+ 4/10 vertices, named, from 3b71441:
[1] 3  4  5  10

[[33]]
+ 3/10 vertices, named, from 3b71441:
[1] 3  4  10

[[34]]
+ 2/10 vertices, named, from 3b71441:
[1] 4 5

[[35]]
+ 3/10 vertices, named, from 3b71441:
[1] 4  5  10

[[36]]
+ 2/10 vertices, named, from 3b71441:
[1] 4  10

[[37]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 9

[[38]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 7 9

[[39]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 5 7 9

[[40]]
+ 5/10 vertices, named, from 3b71441:
[1] 1  5  7  9  10

[[41]]
+ 4/10 vertices, named, from 3b71441:
[1] 1  7  9  10

[[42]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 5 9

[[43]]
+ 4/10 vertices, named, from 3b71441:
[1] 1  5  9  10

[[44]]
+ 3/10 vertices, named, from 3b71441:
[1] 1  9  10

[[45]]
+ 2/10 vertices, named, from 3b71441:
[1] 7 9

[[46]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 7 9

[[47]]
+ 4/10 vertices, named, from 3b71441:
[1] 5  7  9  10

[[48]]
+ 3/10 vertices, named, from 3b71441:
[1] 7  9  10

[[49]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 9

[[50]]
+ 3/10 vertices, named, from 3b71441:
[1] 5  9  10

[[51]]
+ 2/10 vertices, named, from 3b71441:
[1] 9  10

[[52]]
+ 2/10 vertices, named, from 3b71441:
[1] 1 6

[[53]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 3 6

[[54]]
+ 4/10 vertices, named, from 3b71441:
[1] 1 3 5 6

[[55]]
+ 5/10 vertices, named, from 3b71441:
[1] 1  3  5  6  10

[[56]]
+ 4/10 vertices, named, from 3b71441:
[1] 1  3  6  10

[[57]]
+ 3/10 vertices, named, from 3b71441:
[1] 1 5 6

[[58]]
+ 4/10 vertices, named, from 3b71441:
[1] 1  5  6  10

[[59]]
+ 3/10 vertices, named, from 3b71441:
[1] 1  6  10

[[60]]
+ 2/10 vertices, named, from 3b71441:
[1] 6 8

[[61]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 6 8

[[62]]
+ 4/10 vertices, named, from 3b71441:
[1] 3 5 6 8

[[63]]
+ 5/10 vertices, named, from 3b71441:
[1] 3  5  6  8  10

[[64]]
+ 4/10 vertices, named, from 3b71441:
[1] 3  6  8  10

[[65]]
+ 3/10 vertices, named, from 3b71441:
[1] 5 6 8

[[66]]
+ 4/10 vertices, named, from 3b71441:
[1] 5  6  8  10

[[67]]
+ 3/10 vertices, named, from 3b71441:
[1] 6  8  10

[[68]]
+ 2/10 vertices, named, from 3b71441:
[1] 3 6

[[69]]
+ 3/10 vertices, named, from 3b71441:
[1] 3 5 6

[[70]]
+ 4/10 vertices, named, from 3b71441:
[1] 3  5  6  10

[[71]]
+ 3/10 vertices, named, from 3b71441:
[1] 3  6  10

[[72]]
+ 2/10 vertices, named, from 3b71441:
[1] 5 6

[[73]]
+ 3/10 vertices, named, from 3b71441:
[1] 5  6  10

[[74]]
+ 2/10 vertices, named, from 3b71441:
[1] 6  10

[[75]]
+ 2/10 vertices, named, from 3b71441:
[1] 2 8

[[76]]
+ 3/10 vertices, named, from 3b71441:
[1] 2 3 8

[[77]]
+ 4/10 vertices, named, from 3b71441:
[1] 2 3 5 8

[[78]]
+ 5/10 vertices, named, from 3b71441:
[1] 2  3  5  8  10

[[79]]
+ 4/10 vertices, named, from 3b71441:
[1] 2  3  8  10

[[80]]
+ 3/10 vertices, named, from 3b71441:
[1] 2 5 8

[[81]]
+ 4/10 vertices, named, from 3b71441:
[1] 2  5  8  10

[[82]]
+ 3/10 vertices, named, from 3b71441:
[1] 2  8  10

[[83]]
+ 2/10 vertices, named, from 3b71441:
[1] 2 3

[[84]]
+ 3/10 vertices, named, from 3b71441:
[1] 2 3 5

[[85]]
+ 4/10 vertices, named, from 3b71441:
[1] 2  3  5  10

[[86]]
+ 3/10 vertices, named, from 3b71441:
[1] 2  3  10

[[87]]
+ 2/10 vertices, named, from 3b71441:
[1] 2 5

[[88]]
+ 3/10 vertices, named, from 3b71441:
[1] 2  5  10

[[89]]
+ 2/10 vertices, named, from 3b71441:
[1] 2  10

**4).需要注意的是,并非所有的集团都满足共享Years>=3要求,因为 “权重” 表示的是总共享Years的计数,而不是区分共享Years的计数。换句话说,2000, 2001, 2002是有效的,但2000, 2000, 2002不是,尽管后者的计数为3。因此,我们需要检查每个团中共享Years的分布。

例如,如果我们查看第三个团,即clq[[3]],并检查共享Years的分布

> q <- clq[[3]]

> table(subset(DF, Sites %in% names(q)))
     Years
Sites 2000 2001 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
   3     1    1    1    0    0    0    1    1    0    1    0    1    1    0
   5     0    1    0    1    1    1    0    1    0    1    1    1    0    0
   10    1    0    0    0    1    1    0    1    1    1    0    0    1    1
     Years
Sites 2018 2019 2020
   3     0    1    0
   5     1    0    0
   10    0    0    1

我们看到所有1的列是20092011,这意味着它们只有 * 2 * 共享Years,因此无效。为了挑选有效的集团,我们可以使用Filter和过滤标准,例如,

validclq <- clq %>%
    Filter(
        \(q) {
            sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) >= 3
        }, .
    )

等着瞧吧

> validclq
[[1]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5  10

[[2]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 5

[[3]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3  10

[[4]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 7

[[5]]
+ 2/10 vertices, named, from 9bd9430:
[1] 7  10

[[6]]
+ 2/10 vertices, named, from 9bd9430:
[1] 7 8

[[7]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 8

[[8]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 8

[[9]]
+ 2/10 vertices, named, from 9bd9430:
[1] 8  10

[[10]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 7

[[11]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 3

[[12]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 5

[[13]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1  10

[[14]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 4

[[15]]
+ 3/10 vertices, named, from 9bd9430:
[1] 3  4  10

[[16]]
+ 2/10 vertices, named, from 9bd9430:
[1] 4 5

[[17]]
+ 2/10 vertices, named, from 9bd9430:
[1] 4  10

[[18]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 9

[[19]]
+ 3/10 vertices, named, from 9bd9430:
[1] 1 5 9

[[20]]
+ 2/10 vertices, named, from 9bd9430:
[1] 7 9

[[21]]
+ 3/10 vertices, named, from 9bd9430:
[1] 7  9  10

[[22]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 9

[[23]]
+ 2/10 vertices, named, from 9bd9430:
[1] 9  10

[[24]]
+ 2/10 vertices, named, from 9bd9430:
[1] 1 6

[[25]]
+ 2/10 vertices, named, from 9bd9430:
[1] 6 8

[[26]]
+ 2/10 vertices, named, from 9bd9430:
[1] 3 6

[[27]]
+ 2/10 vertices, named, from 9bd9430:
[1] 5 6

[[28]]
+ 2/10 vertices, named, from 9bd9430:
[1] 6  10

[[29]]
+ 2/10 vertices, named, from 9bd9430:
[1] 2 8

[[30]]
+ 2/10 vertices, named, from 9bd9430:
[1] 2 3

[[31]]
+ 2/10 vertices, named, from 9bd9430:
[1] 2 5

[[32]]
+ 2/10 vertices, named, from 9bd9430:
[1] 2  10

其中89个团中的32个满足要求。

  • 5). * 作为最后一步,我们从有效的团validclq中选择具有最大大小的团(因为我们正在寻找Sites的 * 最大 * 数量)。使用lenghts来获得每个团的大小,我们可以过滤具有最大大小的团,例如,
res <- validclq %>%
    `[`(lengths(.) == max(lengths(.)))

我们终于得到了

[[1]]
+ 3/10 vertices, named, from 75803c6:
[1] 3  4  10

[[2]]
+ 3/10 vertices, named, from 75803c6:
[1] 1 5 9

[[3]]
+ 3/10 vertices, named, from 75803c6:
[1] 7  9  10
mtb9vblg

mtb9vblg6#

这里有两个使用动态规划技术的基本R选项,它们通过一棵“树”搜索所需的输出,当添加新的可行Sites时,“树”可以扩展。

Base R实现

  • 选项1x 1 m1n1x,checker从矩阵中检索行,并检查共享年份数是否满足要求
ftic1 <- function(DF, kmin) {
    # create a `Sites`-to-`Years` table
    M <- table(DF)
    nr <- nrow(M)

    # helper function that checks the validity
    checker <- function(idx, kmin) {
        sum(colSums(M[idx, , drop = FALSE]) == length(idx)) >= kmin
    }

    # start from each site
    res <- as.list(seq(nr))
    repeat {
        lst <- list()
        for (l in res) {
            u <- unlist(l)
            uend <- u[length(u)]
            if (uend < nr) {
                for (p in (uend + 1):nr) {
                    ks <- c(u, p)
                    if (checker(ks, kmin)) {
                        lst[length(lst) + 1] <- list(ks)
                    }
                }
            }
        }
        if (length(lst) == 0) {
            break
        } else {
            res <- lst
        }
    }
    res
}

字符串

  • 选项2 ftic2,其中我们应用掩码msk来提升checker,因为它只需要当前掩码和新行进行检查(* 因为colSums一次又一次地使用冗余信息,这导致效率低下 *)。因此,ftic2应该比ftic1性能更好,尽管它看起来不那么整洁!
ftic2 <- function(DF, kmin) {
    # create a `Sites`-to-`Years` table
    M <- table(DF)
    nr <- nrow(M)

    # helper function that checks the validity
    checker <- function(msk, v, kmin) {
        sum(msk & v) >= kmin
    }

    # start from each site
    res <- lapply(1:nr, \(x) list(idx = x, msk = M[x, ]))
    repeat {
        lst <- list()
        for (l in res) {
            u <- l$idx
            msk <- l$msk
            uend <- u[length(u)]
            if (uend < nr) {
                for (p in (uend + 1):nr) {
                    v <- M[p, ]
                    if (checker(msk, v, kmin)) {
                        lst[[length(lst) + 1]] <- list(idx = c(u, p), msk = msk & v)
                    }
                }
            }
        }
        if (length(lst) == 0) {
            break
        } else {
            res <- lst
        }
    }
    res
}

输出示例

> ftic1(DF, 5)
[[1]]
[1] 26 53 84

[[2]]
[1] 31 59 67

> ftic2(DF, 5)
[[1]]
[[1]]$idx
[1] 26 53 84

[[1]]$msk
 1982  1983  1984  1985  1986  1987  1988  1989  1990  1991  1992  1993  1994
FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
 1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007
FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
 2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020
FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 2021
FALSE

[[2]]
[[2]]$idx
[1] 31 59 67

[[2]]$msk
 1982  1983  1984  1985  1986  1987  1988  1989  1990  1991  1992  1993  1994
FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
 1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007
FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
 2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 2021
FALSE


正如我们所看到的,在ftic2的结果中,msk字段指示哪些Yearsidx字段提供的Sites索引共享。从这个意义上说,当您打算获取共享Years的信息时,ftic2ftic1更方便。

对标

在这里,我们将@jblood94的解决方案作为基线,因为该解决方案是迄今为止最有效的。

  • >=3
microbenchmark(
    ftic1(DF, 3),
    ftic2(DF, 3),
    f(DF, 3)[],
    times = 20L,
    unit = "relative"
)
Unit: relative
         expr      min       lq      mean   median        uq       max neval
 ftic1(DF, 3) 22.44911 21.49605 19.464670 21.41181 19.943792 10.261857    20
 ftic2(DF, 3) 10.68673 10.27488  9.004385 10.18160  9.215856  3.757149    20
   f(DF, 3)[]  1.00000  1.00000  1.000000  1.00000  1.000000  1.000000    20

  • >=4
microbenchmark(
    ftic1(DF, 4),
    ftic2(DF, 4),
    f(DF, 4)[],
    times = 20L,
    unit = "relative"
)

Unit: relative
         expr       min       lq      mean    median        uq      max neval
 ftic1(DF, 4) 20.135953 19.60166 15.052466 19.268706 17.753250 4.583222    20
 ftic2(DF, 4)  9.843199  9.47094  7.098557  9.405802  8.107661 1.985759    20
   f(DF, 4)[]  1.000000  1.00000  1.000000  1.000000  1.000000 1.000000    20

  • >=5
microbenchmark(
    ftic1(DF, 5),
    ftic2(DF, 5),
    f(DF, 5)[],
    times = 20L,
    unit = "relative"
)

Unit: relative
         expr       min        lq     mean    median        uq       max neval
 ftic1(DF, 5) 16.953898 16.060962 9.512281 15.488721 13.738142 1.5683931    20
 ftic2(DF, 5)  7.793642  7.567225 4.546013  7.409521  6.658683 0.7780443    20
   f(DF, 5)[]  1.000000  1.000000 1.000000  1.000000  1.000000 1.0000000    20

  • >=6
microbenchmark(
    ftic1(DF, 6),
    ftic2(DF, 6),
    f(DF, 6)[],
    times = 20L,
    unit = "relative"
)

Unit: relative
         expr      min       lq     mean    median       uq       max neval
 ftic1(DF, 6) 9.031968 9.819239 5.744555 10.091242 9.760496 1.0847252    20
 ftic2(DF, 6) 4.417896 4.552029 2.721448  4.861791 4.782947 0.4812401    20
   f(DF, 6)[] 1.000000 1.000000 1.000000  1.000000 1.000000 1.0000000    20

相关问题