One-hot encoding in R when data is stored across multiple columns

Posted by jyztefdp on 2023-02-20 in Other

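Suppose I have a data frame:

| Primary color | Secondary color | Tertiary color |
| --- | --- | --- |
| red | blue | green |
| yellow | red | NA |

I want to encode it by checking whether each color appears in any of the three columns (1) or in none of them (0):

| red | blue | green | yellow |
| --- | --- | --- | --- |
| 1 | 1 | 1 | 0 |
| 1 | 0 | 0 | 1 |

I'm working in R. I know I could do this by writing a bunch of ifelse statements, one per color, but my actual problem has many more colors. Is there a more concise way to do it?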

Source: https://stackoverflow.com/questions/75497485/one-hot-encoding-when-data-is-stored-across-multiple-columns

5 Answers

sczxawaw1#

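You can add a column that tracks each row, reshape the data to long format, and then return to wide format by counting how often each color occurs in each row.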

library(dplyr)
library(tidyr)

df %>%
  mutate(row = row_number()) %>%
  pivot_longer(cols = -row) %>%
  pivot_wider(names_from = value, values_from = value, id_cols = row, 
              values_fn = length, values_fill = 0) %>%
  select(-row)

#    red  blue green yellow
#  <int> <int> <int>  <int>
#1     1     1     1      0
#2     1     1     0      1

Data

df <- structure(list(primary_color = c("red", "yellow"), secondary_color = 
c("blue", "red"), tertiary_color = c("green", "blue")), row.names = c(NA, 
-2L), class = "data.frame")
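The data in this answer has no missing values, whereas the question's data has an NA in the third column. With that input, the missing cell is likely to surface as a column of its own after widening. The following is a minimal sketch of one way to adapt the pipeline, not the answerer's code: `df_na`, `encoded`, and the helper column `present` are names made up for illustration, and the sketch assumes that dropping NA cells and counting a repeated color only once per row is the desired behavior.

```r
library(dplyr)
library(tidyr)

# The question's data, including the missing tertiary color (illustrative name)
df_na <- data.frame(
  primary_color   = c("red", "yellow"),
  secondary_color = c("blue", "red"),
  tertiary_color  = c("green", NA)
)

encoded <- df_na %>%
  mutate(row = row_number()) %>%
  # values_drop_na = TRUE removes the NA cells before widening
  pivot_longer(cols = -row, values_drop_na = TRUE) %>%
  # keep one (row, color) pair even if a color repeats within a row
  distinct(row, value) %>%
  mutate(present = 1L) %>%
  pivot_wider(id_cols = row, names_from = value,
              values_from = present, values_fill = 0L) %>%
  select(-row)

encoded
```

Because each (row, color) pair is reduced to a single indicator before widening, no `values_fn` is needed here, and the columns come out in order of first appearance.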

Answered 2023-02-20

xurqigkl2#

In base R, you can use sapply with a function that checks each row against a vector of the desired names:

nnames <- c("red", "blue", "green", "yellow")

new_df <- t(sapply(seq_len(nrow(df)),
                   function(x)(nnames %in% df[x, ]) * 1))

colnames(new_df) <- nnames

new_df
#      red blue green yellow
# [1,]   1    1     1      0
# [2,]   1    0     0      1

Note that if you don't care about the column order of the resulting table, you can generalize nnames to nnames <- unique(unlist(df[!is.na(df)]))

Data

df <- read.table(text = "primary_color  secondary_color tertiary_color
red blue    green
yellow  red NA", h = TRUE)
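The note above about deriving nnames from the data rather than typing it out can be sketched as follows. This is an illustrative variant, not the answerer's exact code: it works on a character matrix `m` (a name introduced here) and uses apply over rows instead of sapply over row indices, on the assumption that every column holds color labels.

```r
df <- data.frame(
  primary_color   = c("red", "yellow"),
  secondary_color = c("blue", "red"),
  tertiary_color  = c("green", NA)
)

# Collect every non-missing value that occurs anywhere in the data frame
m <- as.matrix(df)
nnames <- unique(m[!is.na(m)])

# For each row, flag which of those values it contains (TRUE/FALSE -> 1/0)
new_df <- t(apply(m, 1, function(r) as.integer(nnames %in% r)))
colnames(new_df) <- nnames

new_df
```

The columns follow the order in which values are first met when the data is read column by column (red, yellow, blue, green for this input), which is why this form only suits cases where column order does not matter.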

Answered 2023-02-20

t0ybt7op3#

Using outer:

uc <- unique(unlist(dat))[c(1, 3, 4, 2)]
t(+outer(uc, asplit(dat, 1), Vectorize(`%in%`))) |> `colnames<-`(uc)
#      red blue green yellow
# [1,]   1    1     1      0
# [2,]   1    0     0      1

Data:

dat <- structure(list(primary_color = c("red", "yellow"), secondary_color = c("blue", 
"red"), tertiary_color = c("green", NA)), class = "data.frame", row.names = c(NA, 
-2L))

Answered 2023-02-20

jum4pzuy4#

In base R:

table(row(df), as.matrix(df))
   
    blue green red yellow
  1    1     1   1      0
  2    0     0   1      1

If you want the result as a data frame:

as.data.frame.matrix(table(row(df), as.matrix(df)))

  blue green red yellow
1    1     1   1      0
2    0     0   1      1

If the same color can appear in more than one column of a row, cap the counts at 1:

+(table(row(df), as.matrix(df))>0)
   
    blue green red yellow
  1    1     1   1      0
  2    0     0   1      1
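The three variants above can be combined into one small helper. This is a sketch under stated assumptions: the function name `one_hot_rows` and the example data frame `dup` are hypothetical, and the helper relies on table() dropping NA cells by default.

```r
# Hypothetical helper: one row per input row, one 0/1 column per distinct value
one_hot_rows <- function(data) {
  m <- as.matrix(data)
  counts <- table(row(m), m)           # NA cells are dropped by table() by default
  as.data.frame.matrix(+(counts > 0))  # cap counts at 1, then convert to a data frame
}

# Example input in which "red" appears twice in the first row
dup <- data.frame(
  a = c("red", "yellow"),
  b = c("red", "red"),
  c = c("green", NA)
)

res <- one_hot_rows(dup)
res
```

As with the table() calls above, the columns come out in alphabetical order rather than in order of appearance.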

Answered 2023-02-20

bfnvny8b5#

Using mtabulate:

library(qdapTools)
mtabulate(as.data.frame(t(df1)))
   blue green red yellow
V1    1     1   1      0
V2    1     0   1      1

Or with base R:

table(c(row(df1)), unlist(df1))
     blue green red yellow
  1    1     1   1      0
  2    1     0   1      1

Answered 2023-02-20
