Join based on a condition, filter by time range, and limit to just the first row in Pig

Posted by ccgok5k5 on 2021-05-30 in Hadoop


I have a relation A and a relation B. For each row in A, there may be multiple matching rows in relation B.
For example:

A = (id1, type, location, gender, startDateTime)
B = (id2, type, location, gender, registerStartDateTime, registerEndDateTime, value)

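I need to join A and B on (type, location, gender), with the additional conditions (startDateTime > registerStartDateTime) and (startDateTime < registerEndDateTime).
This join can return multiple rows from B with different values. I only want to pick the first row returned and output it.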

output = Join A by (type, location, gender), B by (type, location, gender)

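How can I add the date-time range condition to the join above? And how can I limit the result to just one row from B when the join is performed?
In SQL: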

SELECT 
a.id, b.value
FROM
    a, b
WHERE
    a.type = b.type
        AND a.location = b.location
        AND a.gender = b.gender
        AND a.startDateTime between b.registerStartDateTime and b.registerEndDateTime 
limit 1;

How can I do the same thing in Pig?

hadoop apache-pig

Source: https://stackoverflow.com/questions/30886055/join-based-on-condition-and-filter-by-timerange-limit-to-just-the-first-row-in

1 Answer


bkhjykvo1#

Try this:

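-- Schemas as given in the question:
-- A = (id1, type, location, gender, startDateTime)
-- B = (id2, type, location, gender, registerStartDateTime, registerEndDateTime, value)

-- Pig's JOIN only takes equality keys, so the range condition goes in a FILTER afterwards.
-- The alias is named "joined" because "output" is a reserved keyword in Pig.
joined = JOIN A BY (type, location, gender), B BY (type, location, gender);

filtered = FILTER joined BY (startDateTime > registerStartDateTime) AND (startDateTime < registerEndDateTime);

/* Optional: sort first so that the "first row" is well defined.
sorted  = ORDER filtered BY startDateTime;
limited = LIMIT sorted 1;
*/

limited = LIMIT filtered 1;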
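
Note that a top-level LIMIT 1 returns a single row for the whole relation, which matches the SQL in the question. If the goal is instead one matching B row per row of A, a nested FOREACH over a GROUP is one possible approach. The following is only a sketch under several assumptions that are not in the original post: the file names `a.csv` and `b.csv`, the comma delimiter, the chararray types, timestamps stored in a sortable format such as ISO-8601 (so that string comparison agrees with chronological order), `id1` uniquely identifying rows of A, and "first" meaning the earliest `registerStartDateTime`.

```pig
-- Hypothetical inputs: paths, delimiter, and types are assumptions.
A = LOAD 'a.csv' USING PigStorage(',')
    AS (id1:chararray, type:chararray, location:chararray,
        gender:chararray, startDateTime:chararray);
B = LOAD 'b.csv' USING PigStorage(',')
    AS (id2:chararray, type:chararray, location:chararray, gender:chararray,
        registerStartDateTime:chararray, registerEndDateTime:chararray,
        value:chararray);

-- Equi-join on the three shared keys.
joined = JOIN A BY (type, location, gender), B BY (type, location, gender);

-- Range condition applied after the join. Fields whose names are unique
-- across A and B can be referenced without the A:: / B:: prefix.
in_range = FILTER joined BY (startDateTime > registerStartDateTime)
                        AND (startDateTime < registerEndDateTime);

-- One group per row of A (assumes id1 is unique in A).
by_a = GROUP in_range BY id1;

-- Inside each group, order the matches and keep only the first one.
first_match = FOREACH by_a {
    ordered = ORDER in_range BY registerStartDateTime;
    top1    = LIMIT ordered 1;
    GENERATE group AS id1, FLATTEN(top1.value) AS value;
};

DUMP first_match;
```

SQL's BETWEEN is inclusive at both ends, so to mirror the query in the question exactly the comparisons would be `>=` and `<=` rather than `>` and `<`.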
