R - Cut data from a data frame to balance it

Posted by omqzjyyz on 2023-02-10 in Other


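I have a data frame with 2600 entries spread across 249 factor levels (persons). The data set is unbalanced.

I want to remove all entries that belong to a factor level occurring fewer than 5 times. I also want to trim levels that occur more than 5 times down to 5 entries. In the end I want a data frame with fewer entries overall, but balanced across the person factor.
The data set is built as follows: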

library(jpeg)  # provides readJPEG(), used below

file_list <- list.files("path/to/image/folder", full.names = TRUE)
# the folder contains 2600 images, which include information about the
# person factor in their file name

# basename() drops the directory part so only the file name is parsed
file_names <- sapply(strsplit(basename(file_list), split = "_"), "[", 1)
person_list <- substr(file_names, 1, 3)
person_class <- as.factor(person_list)

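count <- length(file_list)  # number of images (assumed; not defined in the original snippet)
imageWidth <- 320   # uniform pixel width of all images
imageHeight <- 280  # uniform pixel height of all images
variableCount <- imageHeight * imageWidth + 2

images <- as.data.frame(matrix(seq(count), nrow = count, ncol = variableCount))
images[1] <- person_class  # the column keeps its default name V1
images[2] <- eyepos_class  # eyepos_class is not defined in the snippet shown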

for(i in 1:count) {
  img <- readJPEG(file_list[i])
  image <- c(img)
  images[i, 3:variableCount] <- image
}

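So basically I need to get the sample size per factor level (e.g. as shown by summary(images[1])) and then perform an operation that trims the data set. I really don't know how to start, and any help would be appreciated.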

r

Source: https://stackoverflow.com/questions/37779289/r-cut-data-from-data-frame-to-balance-it

2 Answers


inb24sb21#

An option using data.table:

library(data.table)
res <- setDT(images)[, if (.N >= 5) head(.SD, 5), by = V1]
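
To see what that one-liner does, here is a minimal sketch on a made-up data frame. The object `toy` and its column `x` are invented for illustration; only the column name `V1` mirrors the default name the first column of `images` gets in the question.

```r
library(data.table)

# Hypothetical toy data: level "a" has 7 rows, "b" has 3, "c" has 5
toy <- data.frame(
  V1 = c(rep("a", 7), rep("b", 3), rep("c", 5)),
  x  = 1:15
)

# Per group: if the group has at least 5 rows, keep its first 5 rows;
# otherwise the expression returns NULL and the group is dropped
res <- setDT(toy)[, if (.N >= 5) head(.SD, 5), by = V1]

# Rows per remaining level
res[, .N, by = V1]
```

Level "b" disappears because it has fewer than 5 rows, and "a" is cut back to its first 5 rows. Note that setDT() converts `toy` to a data.table by reference rather than making a copy.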


jhiyze9q2#

Using dplyr:

library(dplyr)
group_by(images, V1) %>%  # group by the V1 column
    filter(n() >= 5) %>%  # keep only groups with 5 or more rows
    slice(1:5)            # keep only the first 5 rows in each group

You can assign the result to an object, e.g. my_desired_result = group_by(images, ...
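
If you would rather avoid extra packages, the same idea can be sketched in base R. This is not from either answer, just one possible approach under the same assumptions: it keeps the first 5 rows per level, and `toy` with its column `x` is invented example data. table() also gives the per-level sample sizes the question asks about.

```r
# Hypothetical toy data: level "a" has 7 rows, "b" has 3, "c" has 5
toy <- data.frame(
  V1 = factor(c(rep("a", 7), rep("b", 3), rep("c", 5))),
  x  = 1:15
)

# Sample size per factor level
counts <- table(toy$V1)

# Keep only rows whose level occurs at least 5 times, then drop unused levels
keep_levels <- names(counts)[counts >= 5]
kept <- droplevels(toy[toy$V1 %in% keep_levels, ])

# Number the rows within each level and keep the first 5 of each
rank_in_group <- ave(seq_len(nrow(kept)), kept$V1, FUN = seq_along)
balanced <- kept[rank_in_group <= 5, ]

table(balanced$V1)
```

Replacing the first-5 rule with a random draw per level would be a small change, but that goes beyond what the answers above do.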
