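Have: purchase records. I used sparklyr::sdf_sql to read a Spark table into a tbl_spark.
The table is too large (10^8 rows and 30+ columns) to pull into a local R data.frame on the driver node with DBI::dbGetQuery.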
Person_id  Trip_date   Product  Price
1          30-01-2021  Apple    $1.50
1          30-01-2021  Banana   $0.89
1          27-01-2021  Orange   $1.00
1          27-01-2021  Apple    $1.20
2          29-01-2021  Apple    $2.00
2          29-01-2021  Apple    $1.20
2          28-01-2021  Apple    $2.50
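Want: a table of unique buyers, keeping only the row from each buyer's most recent trip.
Person_id  Trip_date   Product  Price
1          30-01-2021  Apple    $1.50
2          29-01-2021  Apple    $2.00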
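To pin down the semantics I am after, here is a minimal local sketch on a plain data.frame, not on Spark. The product names are my reading of the sample rows and should be treated as illustrative. It relies on dplyr's arrange() keeping tied rows in their original order, which is what makes the first 30-01-2021 row win for buyer 1.

```r
library(dplyr)

# Local stand-in for the Spark table, using the sample rows from the question
purchases <- data.frame(
  Person_id = c(1, 1, 1, 1, 2, 2, 2),
  Trip_date = as.Date(
    c("30-01-2021", "30-01-2021", "27-01-2021", "27-01-2021",
      "29-01-2021", "29-01-2021", "28-01-2021"),
    format = "%d-%m-%Y"
  ),
  Product = c("Apple", "Banana", "Orange", "Apple", "Apple", "Apple", "Apple"),
  Price = c(1.50, 0.89, 1.00, 1.20, 2.00, 1.20, 2.50),
  stringsAsFactors = FALSE
)

# Newest trip first within each buyer, then keep the first row per buyer
buyers <- purchases %>%
  arrange(Person_id, desc(Trip_date)) %>%
  group_by(Person_id) %>%
  slice(1) %>%
  ungroup()

print(buyers)
```

This is the behaviour I would like to reproduce on the tbl_spark, where slice() is not available to me.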
library(sparklyr)
library(dplyr)
spark_version <- "3.1.0"
sc <- spark_connect(method = "databricks")
Attempts:
1.
Fruit_buyers <- Fruit_purchases %>% distinct(Person_id, .keep_all=TRUE)
# Error : Can only find distinct value of specified columns if .keep_all is FALSE
2.
Fruit_buyers <- Fruit_purchases %>%
  arrange(Person_id, desc(Trip_date)) %>%
  group_by(Person_id) %>%
  slice(1)
# Error in slice_.tbl_spark(.data, .dots = compat_as_lazy_dots(...)) : Slice is not supported in this version of sparklyr
3.
Fruit_buyers <- Fruit_purchases %>%
  arrange(Person_id, desc(Trip_date)) %>%
  group_by(Person_id) %>%
  slice_head()
# Error in slice_head(.) : could not find function "slice_head"
4.
Fruit_buyers <- Fruit_purchases %>%
  arrange(Person_id, desc(Trip_date)) %>%
  group_by(Person_id) %>%
  top_n(1)
# Databricks log: Selecting by Person_id
# Then later error when printing:
# org.apache.spark.sql.AnalysisException: Undefined function: 'top_n_rank'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'
5.
Fruit_buyers <- Fruit_purchases %>%
  arrange(Person_id, desc(Trip_date)) %>%
  group_by(Person_id) %>%
  summarise_all(first)
# Error in nth(x, 1L, order_by = order_by, default = default) : object 'Product' not found
6. Index after grouping
Fruit_buyers <- Fruit_purchases %>%
  arrange(Person_id, desc(Trip_date)) %>%
  group_by(Person_id) %>%
  mutate(RowN = row_number()) %>%
  filter(RowN == 1)
# Error : java.lang.IllegalArgumentException: invalid method count for object 44/java.lang.Class fields 0 selected 0
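A variant of attempt 6 that I can at least inspect without a cluster: pass the ordering to row_number() directly and let dbplyr show the SQL it would generate. This is only a sketch. It uses dbplyr's lazy_frame() with its generic simulated connection as a stand-in for the tbl_spark, so the SQL sparklyr actually sends may differ, and it does not explain the Java error above.

```r
library(dplyr)
library(dbplyr)

# Stand-in for the tbl_spark: a lazy table with the same column names.
# No database or cluster is contacted; dbplyr only generates SQL.
purchases <- lazy_frame(
  Person_id = 1L,
  Trip_date = as.Date("2021-01-30"),
  Product   = "Apple",
  Price     = 1.5
)

latest <- purchases %>%
  group_by(Person_id) %>%
  # Rank rows within each buyer, newest trip first, and keep rank 1
  filter(row_number(desc(Trip_date)) == 1) %>%
  ungroup()

# Render the generated SQL as a string and print it
latest_sql <- as.character(sql_render(latest))
cat(latest_sql, "\n")
```

The printed query should contain a ROW_NUMBER() window partitioned by Person_id. Rows tied on Trip_date (both buyers have ties in the sample) would be ranked in an arbitrary order, so a second sort key is needed if a specific row must win.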
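I can't use summarise_all(min), because that would scramble the Price column: each column's minimum is taken independently, so the values in a result row no longer come from the same purchase.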
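A small local illustration of that problem, run on a plain data.frame and not on Spark. The rows are modelled on the sample data for one buyer; the product names are illustrative.

```r
library(dplyr)

# Three purchases by the same buyer
purchases <- data.frame(
  Person_id = c(1, 1, 1),
  Trip_date = as.Date(c("2021-01-30", "2021-01-30", "2021-01-27")),
  Product = c("Apple", "Banana", "Orange"),
  Price = c(1.50, 0.89, 1.00),
  stringsAsFactors = FALSE
)

# Column-wise minimum per buyer: each column is reduced on its own
mixed <- purchases %>%
  group_by(Person_id) %>%
  summarise_all(min)

print(mixed)
# One row: Trip_date 2021-01-27, Product "Apple", Price 0.89.
# No purchase in the input has that combination of values.
```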
Does sparklyr support PostgreSQL and its DISTINCT ON syntax?
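As far as I know, Spark SQL has no DISTINCT ON clause, so the closest equivalent I can think of is a ROW_NUMBER() window passed through sdf_sql. A minimal sketch follows. The table name fruit_purchases and both helper function names are hypothetical, and Trip_date is assumed to be stored as a date or timestamp, because a dd-mm-yyyy string would not sort chronologically.

```r
# Build the Spark SQL text. `table_name` is whatever name the table is
# registered under in Spark (the default here is a hypothetical name).
build_latest_trip_sql <- function(table_name = "fruit_purchases") {
  paste0(
    "SELECT * FROM (",
    "SELECT *, ROW_NUMBER() OVER (",
    "PARTITION BY Person_id ORDER BY Trip_date DESC) AS rn ",
    "FROM ", table_name,
    ") ranked WHERE rn = 1"
  )
}

# Run the query through sparklyr and drop the helper column.
# Needs an open connection `sc`; defined here but not called.
latest_trip_per_buyer <- function(sc, table_name = "fruit_purchases") {
  ranked <- sparklyr::sdf_sql(sc, build_latest_trip_sql(table_name))
  dplyr::select(ranked, -rn)
}

# Print the query text for inspection
cat(build_latest_trip_sql(), "\n")
```

With a live connection this would be called as Fruit_buyers <- latest_trip_per_buyer(sc), and the result stays in Spark as a tbl_spark, with nothing collected to the driver.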
Further question: how much data can be collected? Is the size of the driver node the only limit?
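For a sense of scale, a back-of-the-envelope estimate of the full table as an R data.frame. It assumes exactly 30 columns, all stored as 8-byte doubles; character columns would change the figure. Spark also documents a spark.driver.maxResultSize setting that caps results returned to the driver, though I have not checked how it interacts with collect() in sparklyr on Databricks.

```r
n_rows <- 1e8
n_cols <- 30
bytes_per_value <- 8  # assumption: every column is a double

# Total bytes converted to GiB
est_gib <- n_rows * n_cols * bytes_per_value / 1024^3
cat(sprintf("Estimated size: %.1f GiB\n", est_gib))
# prints "Estimated size: 22.4 GiB"
```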
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.8.5 sparklyr_1.4.0