Pig 0.11.1 - Count groups in a time range

Posted by jexiocij on 2021-06-03 in Hadoop


I have a dataset, A, with a timestamp, a visitor, and a URL:

(2012-07-21T14:00:00.000Z, joe, hxxp://www.aaa.com)
(2012-07-21T14:01:00.000Z, mary, hxxp://www.bbb.com)
(2012-07-21T14:02:00.000Z, joe, hxxp://www.aaa.com)

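I want to count how many times each user visits each URL within a 10-minute time window, but as a rolling window that advances in one-minute increments. The output would be: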

(2012-07-21T14:00 to 2012-07-21T14:10, joe, hxxp://www.aaa.com, 2)
(2012-07-21T14:01 to 2012-07-21T14:11, joe, hxxp://www.aaa.com, 1)

To simplify the calculation, I converted the timestamp to the minute of the day, like this:

(840, joe, hxxp://www.aaa.com) /* 840 = 14 hrs x 60 + 00 mins */
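
The conversion to minute-of-day can be sketched in Python as follows. This is only an illustration, assuming the timestamps are strings in exactly the layout shown in the sample data; the helper name `minute_of_day` is hypothetical and does not appear in the original post.

```python
from datetime import datetime

def minute_of_day(ts):
    """Convert a timestamp such as '2012-07-21T14:00:00.000Z' to minutes since midnight."""
    # Assumes the exact layout of the sample data (millisecond precision, trailing 'Z').
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.hour * 60 + parsed.minute

print(minute_of_day("2012-07-21T14:00:00.000Z"))  # 840
```

Because the date is discarded, a window that crosses midnight would need extra handling that this sketch does not attempt.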

To iterate over A with a moving time window, I created a dataset B containing the minutes of the day:

(0)
(1)
(2)
.
.
.
.
(1440)

Ideally, I would like to do something like this:

A = load 'dataset1' AS (ts, visitor, uri)
B = load 'dataset2' as (minute)

foreach B {
C = filter A by ts > minute AND ts < minute + 10;
D = GROUP C BY (visitor, uri);
foreach D GENERATE group, count(C) as mycnt;
}

DUMP B;

I know that GROUP is not allowed inside a FOREACH loop, but is there a workaround that achieves the same result?
Thanks!

hadoop mapreduce apache-pig range

Source: https://stackoverflow.com/questions/18004054/pig-0-11-1-count-groups-in-a-time-range

2 Answers


gblwokeq #1

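-- ExtractHour ships with the Pig tutorial jar, which must be REGISTERed first.
A = load 'dataSet1' as (ts, visitor, uri);
-- Field names must match the schema declared above (ts, visitor, uri).
houred = FOREACH A GENERATE visitor, org.apache.pig.tutorial.ExtractHour(ts) as hour, uri;
hour_frequency1 = GROUP houred BY (hour, visitor);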

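Something like this should help. ExtractHour is a UDF; you can write a similar one for the duration you need. Then group by hour and user, and use GENERATE to compute the counts.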
http://pig.apache.org/docs/r0.7.0/tutorial.html
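
As a rough illustration of what a UDF "for the duration you need" might compute, here is a Python sketch that assigns each minute-of-day to a 10-minute bucket and counts per bucket. The name `bucket_start` and the minute-of-day input are assumptions made for this example. Note that this produces fixed, non-overlapping buckets (14:00-14:09, 14:10-14:19, and so on), not the minute-by-minute rolling window the question asks for.

```python
from collections import Counter

def bucket_start(minute, width=10):
    """Return the first minute of the fixed-width bucket containing `minute`."""
    return (minute // width) * width

# Sample rows from the question as (minute_of_day, visitor, uri).
rows = [
    (840, "joe", "hxxp://www.aaa.com"),
    (841, "mary", "hxxp://www.bbb.com"),
    (842, "joe", "hxxp://www.aaa.com"),
]

# Equivalent of GROUP BY (bucket, visitor, uri) followed by COUNT.
counts = Counter((bucket_start(m), visitor, uri) for m, visitor, uri in rows)
for key, n in sorted(counts.items()):
    print(key, n)
```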

2021-06-04

bzzcjhmw #2

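Maybe you can do something like this?
Note: this relies on the minutes in the logs being integers. If they are not, you can round to the nearest minute.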

myudf.py


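#!/usr/bin/python

# Pig makes the outputSchema decorator available when this file is registered via Jython.
@outputSchema('expanded: {(num:int)}')
def expand(start, end):
        # A bag is returned as a list of tuples, so each minute is wrapped in a 1-tuple.
        return [ (x,) for x in range(start, end) ]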

myscript.pig

register 'myudf.py' using jython as myudf ;

-- A1 is the minutes. Schema:
-- A1: {minute: int}
-- A2 is the logs. Schema:
-- A2: {minute: int,name: chararray}
-- These schemas should change to fit your needs.

B = FOREACH A1 GENERATE minute, 
                        FLATTEN(myudf.expand(minute, minute+10)) AS matchto ;
-- B is in the form:
-- 1 1
-- 1 2
-- ....
-- 2 2
-- 2 3
-- ....
-- 100 100
-- 100 101
-- etc.

-- Now we join on the minute in the second column of B with the 
-- minute in the log, then it is just grouping by the minute in
-- the first column and name and counting
C = JOIN B BY matchto, A2 BY minute ;
D = FOREACH (GROUP C BY (B::minute, name)) 
            GENERATE FLATTEN(group), COUNT(C) as count ;

I'm a little worried about the speed with larger logs, but it should work. Let me know if you need me to explain anything.
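
To make the expand / join / group steps easier to follow, here is a plain-Python sketch of the same logic outside of Pig. It is an illustration only: `rolling_counts` is a hypothetical helper, the log rows are reduced to (minute, name) as in the A2 schema above, and the sample minutes are taken from the question.

```python
from collections import Counter

def expand(start, end):
    """Mirror of the UDF: one single-field tuple per minute in [start, end)."""
    return [(x,) for x in range(start, end)]

def rolling_counts(window_starts, logs, width=10):
    """Count log rows per (window_start, name) for windows [start, start + width)."""
    # B: (window_start, matchto) pairs, as produced by FLATTEN(expand(...)).
    pairs = [(start, m) for start in window_starts for (m,) in expand(start, start + width)]
    # Index logs by minute so the join on matchto becomes a dictionary lookup.
    by_minute = {}
    for minute, name in logs:
        by_minute.setdefault(minute, []).append(name)
    # C and D: join on matchto, then group by (window_start, name) and count.
    counts = Counter()
    for start, matchto in pairs:
        for name in by_minute.get(matchto, []):
            counts[(start, name)] += 1
    return counts

logs = [(840, "joe"), (841, "mary"), (842, "joe")]
result = rolling_counts(range(835, 845), logs)
print(result[(840, "joe")])  # 2
print(result[(841, "joe")])  # 1
```

The two printed values correspond to the 14:00 and 14:01 windows in the question's expected output. Each log row is matched by up to `width` windows, so the joined relation is roughly ten times the size of the log, which is where the speed concern comes from.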

2021-06-04
