R语言 笛卡尔积使用数据表包

ie3xauqp  于 2023-10-13  发布在  其他
关注(0)|答案(4)|浏览(115)

使用R中的data.table包,我尝试使用合并方法来创建两个data.tables的carbohydrate产品,就像在基R中一样。
在基地以下工程:

#assume this order data
orders <- data.frame(date = as.POSIXct(c('2012-08-28','2012-08-29','2012-09-01')),
                     first.name = as.character(c('John','George','Henry')),
                     last.name = as.character(c('Doe','Smith','Smith')),
                     qty = c(10,50,6))

#and these dates
dates <- data.frame(date = seq(from = as.POSIXct('2012-08-28'),
                               to = as.POSIXct('2012-09-07'), by = 'day'))

#get the unique customers
cust<-unique(orders[,c('first.name','last.name')])

#using merge from base R, get the cartesian product
merge(dates, cust, by = integer(0))

然而,相同的技术在使用data.table时不起作用,并抛出以下错误:

"Error in merge.data.table(dates.dt, cust.dt, by = integer(0)) : 
  A non-empty vector of column names for `by` is required."
#data.table approach
library(data.table)

orders.dt <- data.table(orders)

dates.dt <- data.table(dates)

cust.dt <- unique(orders.dt[, list(first.name, last.name)])

#try to use merge (data.table) in the same manner as base
merge(dates.dt, cust.dt, by = integer(0))
Error in merge.data.table(dates.dt, cust.dt, by = integer(0)) : 
  A non-empty vector of column names for `by` is required.

我希望结果能反映所有日期的所有客户名称,就像在base中一样,但要以数据.表为中心。这可能吗?

nfs0ujit

nfs0ujit1#

如果您首先从cust-嵌套框架中的第一个和最后一个构造全名,则可以使用CJ(交叉连接)。你不能使用所有三个向量,因为会有99个项目,名字会不适当地与姓氏混合。

> nrow(CJ(dates$date, cust$first.name, cust$last.name ) )
[1] 99

这将返回所需的数据。table对象:

> CJ(dates$date,paste(cust$first.name, cust$last.name) )
            V1           V2
 1: 2012-08-28 George Smith
 2: 2012-08-28  Henry Smith
 3: 2012-08-28     John Doe
 4: 2012-08-29 George Smith
 5: 2012-08-29  Henry Smith
 6: 2012-08-29     John Doe
 7: 2012-08-30 George Smith
 8: 2012-08-30  Henry Smith
 9: 2012-08-30     John Doe
10: 2012-08-31     John Doe
11: 2012-08-31 George Smith
12: 2012-08-31  Henry Smith
13: 2012-09-01     John Doe
14: 2012-09-01 George Smith
15: 2012-09-01  Henry Smith
16: 2012-09-02 George Smith
17: 2012-09-02  Henry Smith
18: 2012-09-02     John Doe
19: 2012-09-03  Henry Smith
20: 2012-09-03     John Doe
21: 2012-09-03 George Smith
22: 2012-09-04  Henry Smith
23: 2012-09-04     John Doe
24: 2012-09-04 George Smith
25: 2012-09-05 George Smith
26: 2012-09-05  Henry Smith
27: 2012-09-05     John Doe
28: 2012-09-06 George Smith
29: 2012-09-06  Henry Smith
30: 2012-09-06     John Doe
31: 2012-09-07 George Smith
32: 2012-09-07  Henry Smith
33: 2012-09-07     John Doe
            V1           V2
kkbh8khc

kkbh8khc2#

merge.data.table(x, y)是一个方便的函数,它 Package 了对x[y]的调用,因此合并需要基于两个data.table中的列。(这就是错误消息试图告诉你的)。
一种解决方法是向两个data.tables添加一个虚拟列,其唯一目的是使合并成为可能:

## Add a column "k", and append it to each data.table's vector of keyed columns.
setkeyv(cust.dt[,k:=1], c(key(cust.dt), "k"))
setkeyv(dates.dt[,k:=1], c(key(dates.dt), "k"))

## Merge and then remove the dummy column
res <- merge(dates.dt, cust.dt, by="k")
head(res[,k:=NULL])
#          date first.name last.name
# 1: 2012-08-28     George     Smith
# 2: 2012-08-28      Henry     Smith
# 3: 2012-08-28       John       Doe
# 4: 2012-08-29     George     Smith
# 5: 2012-08-29      Henry     Smith
# 6: 2012-08-29       John       Doe

## Maybe also clean up cust.dt and dates.dt    
# cust.dt[,k:=NULL]
# dates.dt[,k=NULL]
fhity93d

fhity93d3#

来自@JoshO'Brien的解决方案使用merge,但下面是一个类似的替代方案,它不使用(AFAIK)。
如果我正确理解了?data.table::merge中的文档,那么X[Y]应该比data.table::merge(X,Y)稍快(从1.8.7版开始)。它引用了FAQ 2.12来解决这个问题,但FAQ有点令人困惑。首先,正确的参考应该是1.12,而不是2.12。它们也没有指出它们是引用merge的基本版本还是data. tableone,或者两者都引用。所以,这可能最终只是一个看起来更混乱的解决方案,它是等价的,或者它可能更快。
[Edit from Matthew]谢谢:现在在v1.8.7中进行了改进(?merge.data.table,FAQ 1.12并添加了新的FAQ 2.24)

DT_orders<-data.table(date=as.POSIXct(c('2012-08-28','2012-08-29','2012-08-29','2012-09-01')),
                      first.name=as.character(c('John','John','George','Henry')),
                      last.name=as.character(c('Doe','Doe','Smith','Smith')),
                      qty=c(10,2,50,6),
                      key="first.name,last.name")

# Note that I added a second record to the orders table for John Doe, to make sure it could handle duplicate first/last name combinations.

DT_dates<-data.table(date=seq(from=as.POSIXct('2012-08-28'),
                              to=as.POSIXct('2012-09-07'),by='day'),
                     key="date")

DT_custdates<-data.table(k=1,unique(DT_dates),key="k")[unique(DT_orders)[,list(k=1,first.name,last.name)]][,k:=NULL]
lc8prwob

lc8prwob4#

还有两个选择:
1.力:base::merge(..)

base::merge.data.frame(dates.dt, cust.dt, by = integer(0))
#          date first.name last.name
# 1  2012-08-28       John       Doe
# 2  2012-08-29       John       Doe
# 3  2012-08-30       John       Doe
# 4  2012-08-31       John       Doe
# 5  2012-09-01       John       Doe
# 6  2012-09-02       John       Doe
# 7  2012-09-03       John       Doe
# 8  2012-09-04       John       Doe
# 9  2012-09-05       John       Doe
# 10 2012-09-06       John       Doe
# 11 2012-09-07       John       Doe
# 12 2012-08-28     George     Smith
# 13 2012-08-29     George     Smith
# 14 2012-08-30     George     Smith
# 15 2012-08-31     George     Smith
# 16 2012-09-01     George     Smith
# 17 2012-09-02     George     Smith
# 18 2012-09-03     George     Smith
# 19 2012-09-04     George     Smith
# 20 2012-09-05     George     Smith
# 21 2012-09-06     George     Smith
# 22 2012-09-07     George     Smith
# 23 2012-08-28      Henry     Smith
# 24 2012-08-29      Henry     Smith
# 25 2012-08-30      Henry     Smith
# 26 2012-08-31      Henry     Smith
# 27 2012-09-01      Henry     Smith
# 28 2012-09-02      Henry     Smith
# 29 2012-09-03      Henry     Smith
# 30 2012-09-04      Henry     Smith
# 31 2012-09-05      Henry     Smith
# 32 2012-09-06      Henry     Smith
# 33 2012-09-07      Henry     Smith

1.另一个data.table-技巧:

dates.dt[, as.list(cust.dt), by = dates.dt]
#           date first.name last.name
#         <POSc>     <char>    <char>
#  1: 2012-08-28       John       Doe
#  2: 2012-08-28     George     Smith
#  3: 2012-08-28      Henry     Smith
#  4: 2012-08-29       John       Doe
#  5: 2012-08-29     George     Smith
#  6: 2012-08-29      Henry     Smith
#  7: 2012-08-30       John       Doe
#  8: 2012-08-30     George     Smith
#  9: 2012-08-30      Henry     Smith
# 10: 2012-08-31       John       Doe
# ---                                
# 24: 2012-09-04      Henry     Smith
# 25: 2012-09-05       John       Doe
# 26: 2012-09-05     George     Smith
# 27: 2012-09-05      Henry     Smith
# 28: 2012-09-06       John       Doe
# 29: 2012-09-06     George     Smith
# 30: 2012-09-06      Henry     Smith
# 31: 2012-09-07       John       Doe
# 32: 2012-09-07     George     Smith
# 33: 2012-09-07      Henry     Smith

相关问题