R: Remove/collapse consecutive duplicate values in a sequence

n7taea2i, posted on 2023-01-28 in Other

I have the following data frame:

a a a b c c d e a a b b b e e d d

The desired result is:

a b c d e a b e d

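In other words, no two consecutive rows should have the same value. How can this be done without a loop?
My dataset is very large, so looping over it takes a long time.
The data frame is structured as follows: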

a 1 
a 2
a 3
b 2
c 4
c 1
d 3
e 9
a 4
a 8
b 10
b 199
e 2
e 5
d 4
d 10

Result:

a 1 
b 2
c 4
d 3
e 9
a 4
b 10
e 2
d 4

The entire row should be removed, not just the value in the first column.

tnkciper1#

One simple approach is to use rle.
Here is your sample data:

x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
# Read 17 items

rle returns a list with two components: the run lengths ("lengths") and the value that is repeated in each run ("values").

rle(x)$values
# [1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
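
To make the two components concrete, here is a small sketch in base R that inspects both parts of the result and rebuilds the original vector with inverse.rle. The variable names runs and rebuilt are only illustrative.

```r
x <- c("a", "a", "a", "b", "c", "c", "d", "e",
       "a", "a", "b", "b", "b", "e", "e", "d", "d")

runs <- rle(x)

# Length of each run of identical neighbours
runs$lengths
# [1] 3 1 2 1 1 2 3 2 2

# The value that each run repeats
runs$values
# [1] "a" "b" "c" "d" "e" "a" "b" "e" "d"

# The two components are enough to reconstruct the input
rebuilt <- inverse.rle(runs)
identical(rebuilt, x)
# [1] TRUE
```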

Update: for a data.frame

If you are working with a data.frame, try the following:

## Sample data
mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e", 
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9, 
         4, 8, 10, 199, 2, 5, 4, 10)
)

## Use rle, as before
X <- rle(mydf$V1)
## Identify the rows you want to keep
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
Y
# [1]  1  4  5  7  8  9 11 13 15
mydf[Y, ]
#    V1 V2
# 1   a  1
# 4   b  2
# 5   c  4
# 7   d  3
# 8   e  9
# 9   a  4
# 11  b 10
# 13  e  2
# 15  d  4
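
The index arithmetic above (each run starts one position after the previous run ends) could be wrapped in a small helper. This is only a sketch: first_of_runs is a hypothetical name, not part of base R or of the original answer, and the as.character() call is an assumption added so that factor columns also work, since rle() rejects factors.

```r
# Hypothetical helper: keep the first row of every run of equal values in `col`
first_of_runs <- function(df, col) {
  if (nrow(df) == 0) return(df)               # nothing to collapse
  runs <- rle(as.character(df[[col]]))        # as.character() so factors work too
  # A run starts at 1 plus the total length of all earlier runs
  starts <- cumsum(c(1L, head(runs$lengths, -1)))
  df[starts, , drop = FALSE]
}

mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
         4, 8, 10, 199, 2, 5, 4, 10)
)

res <- first_of_runs(mydf, "V1")
rownames(res)
# [1] "1"  "4"  "5"  "7"  "8"  "9"  "11" "13" "15"
```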

Update 2

The "data.table" package has a function, rleid, that makes this task easy. Using mydf from above, try:

library(data.table)
as.data.table(mydf)[, .SD[1], by = rleid(V1)]
#    rleid V2
# 1:     1  1
# 2:     2  2
# 3:     3  4
# 4:     4  3
# 5:     5  9
# 6:     6  4
# 7:     7 10
# 8:     8  2
# 9:     9  4

bq3bfh9z2#

library(dplyr)
x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")
# default = "" matches the character type of x, as recent dplyr versions require
x[x != lag(x, default = "")]
#[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"

Edit: for a data.frame

mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
         4, 8, 10, 199, 2, 5, 4, 10),
  stringsAsFactors = FALSE)

The dplyr solution is a one-liner:

mydf %>% filter(V1!= lag(V1, default="1"))
#  V1 V2
#1  a  1
#2  b  2
#3  c  4
#4  d  3
#5  e  9
#6  a  4
#7  b 10
#8  e  2
#9  d  4

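P.S.

lead(x, 1), suggested by @Carl Witthoft, compares each value with the next one instead of the previous one, so it keeps the last row of each run rather than the first.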

leadit<-function(x) x!=lead(x, default="what")
rows <- leadit(mydf[ ,1])
mydf[rows, ]

#   V1  V2
#3   a   3
#4   b   2
#6   c   1
#7   d   3
#8   e   9
#10  a   8
#12  b 199
#14  e   5
#16  d  10
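
The same first-versus-last distinction can be reproduced without dplyr. Below is a minimal base R sketch; mydf is rebuilt so the block runs on its own, and keep_first and keep_last are illustrative names rather than anything from the answer. It assumes the column contains no NA values, because a comparison with NA yields NA.

```r
mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
         4, 8, 10, 199, 2, 5, 4, 10)
)

v <- mydf$V1
n <- length(v)

# Compare with the previous element (like lag): TRUE at the first row of each run
keep_first <- c(TRUE, v[-1] != v[-n])
# Compare with the next element (like lead): TRUE at the last row of each run
keep_last <- c(v[-n] != v[-1], TRUE)

which(keep_first)
# [1]  1  4  5  7  8  9 11 13 15
which(keep_last)
# [1]  3  4  6  7  8 10 12 14 16

first_rows <- mydf[keep_first, ]
last_rows <- mydf[keep_last, ]
```

Both selections contain the same V1 sequence; only the accompanying V2 values differ.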

baubqpgj3#

In base R (I like quirky algorithms):

x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")

x[x!=c(x[-1], FALSE)]
#[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"

u4dcyp6a4#

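As much as I like, er, *love* rle, here is a shootout:
Edit: I couldn't work out what was going on with dplyr's lag, so I used dplyr::lead instead. I'm on OS X, R 3.1.2, with the latest dplyr from CRAN.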

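library(dplyr)
library(microbenchmark)

xlet <- sample(letters, 1e5, replace = TRUE)
rleit <- function(x) rle(x)$values
# default = "" matches the character type of x, as recent dplyr versions require
lagit <- function(x) x[x != lead(x, default = "")]
# caveat: the last element is compared with itself, so the final run is dropped
tailit <- function(x) x[x != c(tail(x, -1), tail(x, 1))]

microbenchmark(rleit(xlet), lagit(xlet), tailit(xlet), times = 20)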
Unit: milliseconds
         expr      min       lq   median       uq      max neval
  rleit(xlet) 27.43996 30.02569 30.20385 30.92817 37.10657    20
  lagit(xlet) 12.44794 15.00687 15.14051 15.80254 46.66940    20
 tailit(xlet) 12.48968 14.66588 14.78383 15.32276 55.59840    20
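
Before reading much into the timings, it may be worth checking that the candidates return the same result. The base R sketch below suggests that tailit, as written, loses the final run, because its last element is compared with itself. fixedit is a hypothetical variant added here for comparison, not one of the benchmarked functions.

```r
rleit <- function(x) rle(x)$values
tailit <- function(x) x[x != c(tail(x, -1), tail(x, 1))]
# Hypothetical variant: compare each element with the next and always keep the last
fixedit <- function(x) x[c(x[-length(x)] != x[-1], TRUE)]

x <- c("a", "a", "b", "b")

rleit(x)
# [1] "a" "b"
tailit(x)
# [1] "a"
fixedit(x)
# [1] "a" "b"
```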

xxe27gdn5#

A tidyverse solution:

library(dplyr)  # consecutive_id() requires dplyr >= 1.1.0

x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
x <- tibble(x)
x |>
  mutate(id = consecutive_id(x)) |>
  distinct(x, id)

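In addition, if another column y is associated with the column of consecutive values, this approach gives you some flexibility in choosing which row of each run to keep: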

x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
x <- tibble(x, y = runif(length(x)))
x |>
  group_by(id = consecutive_id(x)) |>
  slice_min(y)

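Other slice functions can be used in the same way, such as slice_max, slice_min, slice_head, and slice_tail.
This Stack Overflow thread is referenced in the second edition of R4DS, in the book's chapter on numbers.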
