Scalding: keeping all fields after groupBy

fcg9iug3  posted on 2021-06-21  in  Pig

I am using groupBy to compute a value, but after grouping I seem to lose every field that is not part of the grouping key:

filtered.filterNot('site) {s:String => ...}
        .filterNot('date) {s:String => ...}
aggr = filtered.groupBy('id, 'contentHost) { group =>
    group.min('timestamp -> 'min)
    //how do I keep original fields? (eg: site, date)
}

aggr.store(Tsv(...)) //eg: field "site" won't be here

In Pig, it would look like this:

aggr = group filtered by concat('id, 'contentHost);

result = foreach aggr {
  generate flatten(filtered), //how to do this in scalding?
           min(filtered.timestamp) as min;
}

cgyqldqp1#

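I had the same problem with the tuple (fields-based) API and could only solve it by switching to the typed API.
You can use Scala tuples, or define your own case class outside the job. For example: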

case class Data(id: String, site: String, date: String, contentHost: String)

Then you would process it like this:

val filtered: TypedPipe[Data] = TypedPipe.from(Seq(Data("...", "...", "2014-04-14", "..."))) // argument order: id, site, date, contentHost

filtered
  .filterNot ( data => data.site == "fr" )
  .filterNot ( data => data.date == "2014-02-01" )
  .groupBy (data => (data.id, data.contentHost)) // (String,String) -> Data
  .min // needs an implicit Ordering[Data] in scope; otherwise use .minBy { ... }
  .toTypedPipe
  .write(TypedTsv[((String, String), Data)]("/path/"))
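
To make the difference between "one full record per key" and "every record plus the group minimum" concrete, here is a minimal sketch using plain Scala collections in place of a TypedPipe, so it runs without Scalding. The `Row` class, its `timestamp` field, the object name and the sample rows are all assumptions added for illustration; the `Data` class above has no timestamp. In the typed API, `.minBy(_.timestamp)` on the grouped pipe should play the role of `group.minBy` here.

```scala
// Plain Scala collections stand in for a TypedPipe; no Scalding dependency.
// `Row`, its `timestamp` field and the sample data are hypothetical.
case class Row(id: String, site: String, date: String, contentHost: String, timestamp: Long)

object KeepFieldsSketch {
  // One full record per (id, contentHost): the record with the smallest timestamp.
  // All other fields (site, date) survive because the whole Row is kept.
  def earliestPerKey(rows: Seq[Row]): Map[(String, String), Row] =
    rows.groupBy(r => (r.id, r.contentHost))
      .map { case (key, group) => key -> group.minBy(_.timestamp) }

  // Every record, each paired with the minimum timestamp of its group.
  // This is closer to the Pig form `generate flatten(filtered), min(...)`.
  def withGroupMin(rows: Seq[Row]): Seq[(Row, Long)] = {
    val minByKey = rows.groupBy(r => (r.id, r.contentHost))
      .map { case (key, group) => key -> group.map(_.timestamp).min }
    rows.map(r => (r, minByKey((r.id, r.contentHost))))
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Row("a", "us", "2014-04-14", "h1", 30L),
      Row("a", "de", "2014-04-15", "h1", 10L),
      Row("b", "us", "2014-04-14", "h2", 20L)
    )
    earliestPerKey(rows).values.foreach(println)
    withGroupMin(rows).foreach(println)
  }
}
```

Note that the first form drops the non-minimal rows, while the second keeps all of them. On a real pipe, the second form would probably be written as a join of the grouped minimum back onto the original pipe.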
