Is there an alternative to "ifelse(any(startsWith" in the data.table package?

Posted by byqmnocz on 2023-06-19 in Other

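I'm trying to convert my dplyr code to data.table to speed up processing, but I can't work out how to translate my ifelse(any(startsWith... statements. Whatever I try, it always goes to one extreme or the other, or, in the case of "Tag", it just says the object doesn't exist. Maybe the problem is rowwise, but I can't figure it out. Thanks in advance!
Here is my dplyr code: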

df <- df %>% 
  rowwise() %>%
  mutate(
    'Position' = coalesce( 
      ifelse(any(c_across(starts_with("Tag")) == "goalkeeper"), "Goalkeeper", NA),
      ifelse(any(c_across(starts_with("Tag")) == "striker"), "Striker", NA),
    ),
    Favorite = ifelse(any(c_across(starts_with("Tag")) == "favorite"), TRUE, FALSE),
    across(starts_with("Tag"), ~ifelse(. %in% c("goalkeeper", "striker", "favorite"), NA_character_, .))
)

My data.table attempt:

df[, `Position` := coalesce(
  ifelse(any(startsWith(Tag, "goalkeeper")), "Goalkeeper", NA_character_), #tried this
  ifelse(grepl("striker", "^Tag"), "Striker", NA_character_), #and this
)]

df[, Favorite := any(startsWith(Tag1, "favorite"))]

df[, (grep("Tag", names(df), value = TRUE)) :=
             lapply(.SD, function(x) ifelse(x %in% c("goalkeeper", "striker", "favorite"), NA_character_, x)),
           .SDcols = patterns("Tag")]

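Data:
| Name | Tag1 | Tag2 | Tag3 |
| ------ | ------ | ------ | ------ |
| A | goalkeeper | NA | NA |
| B | NA | striker | favorite |
Expected output:
| Name | Position | Favorite |
| ------ | ------ | ------ |
| A | Goalkeeper | FALSE |
| B | Striker | TRUE |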


Answer 1 (tvz2xvvm)

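Apply a function, tidyData, to find the position/favorite for each row. To do this across rows, use transpose. The second transpose puts the result back in the form of two columns.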

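tidyData <- function(vec){
  # Drop missing tags so only the row's actual values remain
  vec <- vec[!is.na(vec)]
  # Note: c() coerces to a common type, so `favorite` comes back as the
  # character "TRUE"/"FALSE"; this also assumes one non-favorite tag per row
  c(position = vec[vec != "favorite"], favorite = any(vec == "favorite"))
}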

dt[
  , 
  (c("position", "favorite")) := transpose(lapply(transpose(.SD), tidyData)),
  .SDcols = startsWith(names(dt), "Tag")
][, .(name, position, favorite)]

Data:

dt <- data.table(
  name = c("A", "B"),
  Tag1 = c("goalkeeper", NA),
  Tag2 = c(NA, "striker"),
  Tag3 = c(NA, "favorite")
)
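
One thing worth checking with this approach: because tidyData combines its results with c(), the favorite flag ends up as character ("TRUE"/"FALSE") rather than logical. Below is a minimal sketch, not part of the original answer, that reruns the same steps on the toy data and converts the column back with as.logical. The tag_cols variable is my own naming.

```r
library(data.table)

# Same toy data as in the answer above
dt <- data.table(
  name = c("A", "B"),
  Tag1 = c("goalkeeper", NA),
  Tag2 = c(NA, "striker"),
  Tag3 = c(NA, "favorite")
)

tidyData <- function(vec) {
  vec <- vec[!is.na(vec)]
  c(position = vec[vec != "favorite"], favorite = any(vec == "favorite"))
}

# tag_cols is a hypothetical helper variable holding the Tag column names
tag_cols <- grep("^Tag", names(dt), value = TRUE)

dt[, c("position", "favorite") := transpose(lapply(transpose(.SD), tidyData)),
   .SDcols = tag_cols]

# c() coerced the flag to character, so convert it back to logical
dt[, favorite := as.logical(favorite)]
```

This still assumes every row has exactly one non-favorite tag; rows with none or several would need extra handling.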

Answer 2 (ovfsdjhp)

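Since you're creating multiple columns at once in a row-wise manner, I don't know if there's a great way to do this, but perhaps this is sufficient?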

tags <- grep("Tag", names(df), value=TRUE)
tags
# [1] "Tag1" "Tag2" "Tag3"

df[, c("Position", "Favorite") := .(
  apply(.SD, 1, function(z) intersect(c("goalkeeper", "striker"), z)[1]), 
  apply(.SD, 1, function(z) "favorite" %in% z)), .SDcols = tags]
df
#      Name       Tag1    Tag2     Tag3   Position Favorite
#    <char>     <char>  <char>   <char>     <char>   <lgcl>
# 1:      A goalkeeper    <NA>     <NA> goalkeeper    FALSE
# 2:      B       <NA> striker favorite    striker     TRUE

(And you can easily remove the tags.)
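
For completeness, here is a rough sketch of what removing the tags could look like. It is an illustration on toy data, not the answerer's code; the `known` vector and the choice of base `replace()` are my own assumptions. It blanks out the matched values, similar to the across() step in the question.

```r
library(data.table)

# Toy data mirroring the question
df <- data.table(
  Name = c("A", "B"),
  Tag1 = c("goalkeeper", NA),
  Tag2 = c(NA, "striker"),
  Tag3 = c(NA, "favorite")
)
tags <- grep("^Tag", names(df), value = TRUE)

# Hypothetical list of tag values that have been moved into their own columns
known <- c("goalkeeper", "striker", "favorite")

# Set matched values to NA in every Tag column, leaving other values untouched
df[, (tags) := lapply(.SD, function(x) replace(x, x %in% known, NA_character_)),
   .SDcols = tags]
```

If the Tag columns are no longer needed at all, `df[, (tags) := NULL]` drops them instead.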
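Using apply is somewhat costly because it causes the frame (.SD, in this case just the Tag# columns) to be converted internally to a matrix. It is because of this conversion that using apply in a frame-row context can be expensive, and justifiably so.
Another option: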

fun <- function(...) {
  dots <- unlist(list(...))
  list(Position = intersect(c("goalkeeper", "striker"), dots)[1], Favorite = "favorite" %in% dots)
}
df[, c("Position", "Favorite") := rbindlist(do.call(Map, c(list(f=fun), .SD))), .SDcols = tags]

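The two run at about the same speed (median, itr/sec), but the first has a lower mem_alloc, *perhaps* suggesting it may be better suited to larger data. But don't be too hasty to draw conclusions from benchmarks on small data...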

bench::mark(
  a = df[, c("Position", "Favorite") := .(
    apply(.SD, 1, function(z) intersect(c("goalkeeper", "striker"), z)[1]), 
    apply(.SD, 1, function(z) "favorite" %in% z)), .SDcols = tags],
  b = df[, c("Position", "Favorite") := rbindlist(do.call(Map, c(list(f=fun), .SD))), .SDcols = tags],
  min_iterations=10000)
# # A tibble: 2 × 13
#   expression     min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr> <bch:t> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 a            243µs  288µs     3262.    16.4KB     12.1  9963    37      3.05s <dt>   <Rprofmem> <bench_tm> <tibble>
# 2 b            253µs  293µs     3109.    48.7KB     10.6  9966    34      3.21s <dt>   <Rprofmem> <bench_tm> <tibble>

Scaling up to a larger dataset,

dfbig <- rbindlist(replicate(10000, df, simplify=FALSE))

we get these benchmark results:

bench::mark(
  a = dfbig[, c("Position", "Favorite") := .(
    apply(.SD, 1, function(z) intersect(c("goalkeeper", "striker"), z)[1]), 
    apply(.SD, 1, function(z) "favorite" %in% z)), .SDcols = tags],
  b = dfbig[, c("Position", "Favorite") := rbindlist(do.call(Map, c(list(f=fun), .SD))), .SDcols = tags], 
  min_iterations = 500)
# # A tibble: 2 × 13
#   expression     min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr> <bch:t> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 a            202ms  257ms      3.78    2.69MB    12.5    500  1655      2.21m <dt>   <Rprofmem> <bench_tm> <tibble>
# 2 b            218ms  398ms      2.56  908.43KB     6.19   500  1210      3.26m <dt>   <Rprofmem> <bench_tm> <tibble>

mem_alloc is lower for the second implementation (Map), though its median and itr/sec are somewhat slower. I don't know which is better in your case.
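
Since the cost of apply discussed above comes from working row by row, one more thing you could try is a column-wise version that never iterates over rows. This is only a sketch under my own assumptions: any_tag is a hypothetical helper (not part of data.table), and I have not benchmarked it against the two implementations above. It relies on data.table's fcase, which takes the first matching condition, much like the coalesce in the question.

```r
library(data.table)

# Toy data mirroring the question
df <- data.table(
  Name = c("A", "B"),
  Tag1 = c("goalkeeper", NA),
  Tag2 = c(NA, "striker"),
  Tag3 = c(NA, "favorite")
)
tags <- grep("^Tag", names(df), value = TRUE)

# Hypothetical helper: for each row, TRUE if any of the columns equals `value`.
# `%in%` treats NA as "no match", so missing tags never produce NA here.
any_tag <- function(cols, value) {
  Reduce(`|`, lapply(cols, function(x) x %in% value))
}

df[, Favorite := any_tag(.SD, "favorite"), .SDcols = tags]

# fcase returns the value paired with the first TRUE condition in each row
df[, Position := fcase(
  any_tag(.SD, "goalkeeper"), "Goalkeeper",
  any_tag(.SD, "striker"),    "Striker",
  default = NA_character_
), .SDcols = tags]
```

Each Tag column is scanned once per value as a whole vector, so there is no per-row function call and no matrix conversion; whether that is actually faster on your data is something to measure.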
