Join on a condition, filter by a time range, and limit to the first row in Pig

ccgok5k5 posted on 2021-05-30 in Hadoop

I have relation A and relation B. For each row in A, there can be multiple matching rows in B.
For example:

A = (id1, type, location, gender, startDateTime)
B = (id2, type, location, gender, registerStartDateTime, registerEndDateTime, value)

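I need to join A and B on (type, location, gender), with the additional conditions (startDateTime > registerStartDateTime) and (startDateTime < registerEndDateTime).
This join can return multiple rows from B with different values. I want to pick only the first row returned and use it in the final output.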

output = Join A by (type, location, gender), B by (type, location, gender)

How can I add the date-time range condition to the join above? And how can I limit the result to a single row from B when the join runs?
In SQL:

SELECT 
a.id, b.value
FROM
    a, b
WHERE
    a.type = b.type
        AND a.location = b.location
        AND a.gender = b.gender
        AND a.startDateTime between b.registerStartDateTime and b.registerEndDateTime 
limit 1;

How can I do the same thing in Pig?


bkhjykvo #1

Try this:

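-- Schemas of the two relations, as given in the question:
-- A = (id1, type, location, gender, startDateTime)
-- B = (id2, type, location, gender, registerStartDateTime, registerEndDateTime, value)

-- Pig joins are equi-joins, so join on the three shared keys first.
-- "output" is a reserved word in Pig, so the alias is named "joined" instead.
joined = JOIN A BY (type, location, gender), B BY (type, location, gender);

-- Apply the range condition afterwards. Use >= and <= to match SQL BETWEEN, which is inclusive.
filteroutput = FILTER joined BY (A::startDateTime > B::registerStartDateTime) AND (A::startDateTime < B::registerEndDateTime);

/* Optional: sort first so that the row kept by LIMIT is well defined.
sortoutput = ORDER filteroutput BY A::startDateTime;
limitoutput = LIMIT sortoutput 1;
*/

limitoutput = LIMIT filteroutput 1;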
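
Note that LIMIT applied to the whole relation keeps a single row overall. If the goal is one row of B for each row of A, one possible sketch is to group the filtered join by A's key and take the first row inside each group. This assumes that id1 uniquely identifies a row of A and that "first" means the earliest registerStartDateTime. Both are assumptions, since the question does not say how "first" is defined. The aliases grouped, sorted, top1 and first_per_a are made up for illustration.

```pig
-- Assumption: id1 is unique per row of A, so each group holds the B matches of one A row.
grouped = GROUP filteroutput BY A::id1;

first_per_a = FOREACH grouped {
    -- Assumption: "first" means the earliest registration window.
    sorted = ORDER filteroutput BY B::registerStartDateTime;
    top1   = LIMIT sorted 1;
    GENERATE FLATTEN(top1);
};
```

Without the nested ORDER, the row that LIMIT keeps in each group is arbitrary, so the sort key should be chosen to match whatever "first" is meant to be.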
