Alternative to GlobalWindow for rolling aggregation

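I am wondering whether Flink is a good fit for the following use case. Suppose I have a stream of measurements (device id, value), e.g.
(1, 10.2), (2, 3.4), (3, 9.1), (1, 7.0), (3, 6.3), (5, 17.8)
Every minute, I want to report the latest value for every device id seen so far.
Given the data: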

data:  (1, 10.2), (2, 3.4), (3, 9.1), (1, 7.0), (3, 6.3), (5, 17.8)
time: 0 ----------------- 1min -------------- 2min ------------------ 3min

I would like to get this result:
1: { (1, 10.2), (2, 3.4) }
2: { (1, 7.0), (2, 3.4), (3, 9.1) }
3: { (1, 7.0), (2, 3.4), (3, 6.3), (5, 17.8) }
I came up with a solution involving

.windowAll(GlobalWindows.create()).trigger(CountTrigger.of(1)).apply( ... )

but it does not look good on a large data set (memory-wise). Is there another way?

You might want to consider the following as a starting point:

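import java.io.IOException;
import java.util.concurrent.TimeUnit;
import javax.annotation.Nullable;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.util.Collector;

public class StreamingJob {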
  private static final TimeUnit windowTimeUnit = TimeUnit.SECONDS;
  private static final long windowLength = 10;
  private static long getNearestRightBoundaryFor(Long timestamp, Long duration, TimeUnit unit){
    Long durationEpoch = unit.toMillis(duration);
    Long quotient = timestamp / durationEpoch;
    return (quotient + 1) * durationEpoch - 1;
  }
  public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    env.fromElements(
            Tuple3.of(1000L, 1L, 3.8f), Tuple3.of(2003L, 2L, 82.3f), Tuple3.of(3006L, 1L, 4.2f), // 0 - 09
            Tuple3.of(11120L, 2L, 10f), Tuple3.of(12140L, 2L, 7.15f), Tuple3.of(13150L, 3L, 3.33f), // 10 - 19
            Tuple3.of(21200L, 2L, 1.09f), Tuple3.of(22270L, 1L, 2.22f), Tuple3.of(23280L, 2L, 3.8f), // 20 - 29
            Tuple3.of(31310L, 3L, 3.12f), Tuple3.of(32330L, 2L, 9.2f), Tuple3.of(33390L, 1L, 4.0f) // 30 - 39
    )
    .assignTimestampsAndWatermarks(
            new AssignerWithPunctuatedWatermarks<Tuple3<Long,Long,Float>>() {
                @Nullable
                @Override
                public Watermark checkAndGetNextWatermark(Tuple3<Long, Long, Float> lastElement, long extractedTimestamp) {
                    return new Watermark(extractedTimestamp);
                }
                @Override
                public long extractTimestamp(Tuple3<Long, Long, Float> element, long previousElementTimestamp) {
                    return element.f0;
                }
            })
    .keyBy(new KeySelector<Tuple3<Long,Long,Float>, Long>() {
        @Override
        public Long getKey(Tuple3<Long, Long, Float> value) throws Exception {
            return value.f1;
        }
    })
    .process(new KeyedProcessFunction<Long, Tuple3<Long,Long,Float>, Tuple4<Long, Long, Long, Float>>() {
        private ValueState<Tuple3<Long, Long, Float>> state;
        @Override
        public void open(Configuration parameters) {
            ValueStateDescriptor<Tuple3<Long, Long, Float>> descriptor = new ValueStateDescriptor<>(
                    "state",
                    TypeInformation.of(new TypeHint<Tuple3<Long, Long, Float>>() {
                    }));
            state = getRuntimeContext().getState(descriptor);
        }
        @Override
        public void processElement(Tuple3<Long, Long, Float> value, Context ctx, Collector<Tuple4<Long, Long, Long, Float>> out) throws Exception {
            Tuple3<Long, Long, Float> currentValue = state.value();
            if (currentValue == null) {
                Long ts = getNearestRightBoundaryFor(value.f0, windowLength, windowTimeUnit);
                ctx.timerService().registerEventTimeTimer(ts);
                state.update(value);
            }
            else if (value.f0 > currentValue.f0) { // ignore out-of-order events
                state.update(value);
            }
        }
        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple4<Long, Long, Long, Float>> out) throws IOException {
            Tuple3<Long, Long, Float> currentValue = state.value();
            out.collect(Tuple4.of(timestamp, currentValue.f0, currentValue.f1, currentValue.f2));
            Long newTs = timestamp + windowTimeUnit.toMillis(windowLength);
            if (ctx.timerService().currentWatermark() < Long.MAX_VALUE) {
                ctx.timerService().registerEventTimeTimer(newTs);
            }
        }
    })
    .print();
    env.execute("Flink FTW!");
  }
}
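To see which timer the first event for a key registers, here is a standalone sketch of the same boundary arithmetic as `getNearestRightBoundaryFor`, with no Flink dependency. The class name `BoundaryDemo` and the sample timestamps are my own, chosen for illustration, and the sketch assumes non-negative timestamps.

```java
import java.util.concurrent.TimeUnit;

class BoundaryDemo {
    // Same arithmetic as getNearestRightBoundaryFor above: returns the last
    // millisecond of the fixed-size interval that contains the timestamp.
    static long rightBoundary(long timestamp, long duration, TimeUnit unit) {
        long size = unit.toMillis(duration);
        return (timestamp / size + 1) * size - 1;
    }

    public static void main(String[] args) {
        // Hypothetical sample timestamps in milliseconds.
        long[] samples = {1000L, 9999L, 10000L, 13150L};
        for (long ts : samples) {
            System.out.println(ts + " -> " + rightBoundary(ts, 10, TimeUnit.SECONDS));
        }
        // Prints:
        // 1000 -> 9999
        // 9999 -> 9999
        // 10000 -> 19999
        // 13150 -> 19999
    }
}
```

Because of the `- 1`, an event stamped exactly at 10000 belongs to the interval that ends at 19999, not to the one that ends at 9999.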

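A few notes:
I would not recommend doing this with windows. With GlobalWindows, managing state expiry becomes complicated.
I used a punctuated watermark assigner rather than an ascending timestamp extractor, for three reasons: (1) once you switch to running in parallel, it may be hard to guarantee that events arrive in order; (2) AscendingTimestampExtractor generates watermarks periodically (every 200 ms of wall-clock time by default), and in this example the application has consumed all of its input before the first watermark is generated; (3) a simple check in the processElement method is all that is needed to handle out-of-order events. In production, however, it may be better to use AscendingTimestampExtractor if the events really are in order, or BoundedOutOfOrdernessTimestampExtractor if they are out of order by a bounded amount.
The output looks like this: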
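The out-of-order check from point (3) can be tried outside Flink. The following is a minimal sketch, not the job's actual state handling: a plain `HashMap` stands in for the keyed `ValueState`, and the names `LatestPerKey`, `Reading`, `accept`, and `latest` are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class LatestPerKey {
    static final class Reading {
        final long timestamp;
        final float value;
        Reading(long timestamp, float value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Stand-in for keyed state: at most one reading is kept per device id.
    private final Map<Long, Reading> state = new HashMap<>();

    // Mirrors the check in processElement: the first reading for a key is always
    // stored; a later one replaces it only if its event timestamp is newer.
    void accept(long timestamp, long deviceId, float value) {
        Reading current = state.get(deviceId);
        if (current == null || timestamp > current.timestamp) {
            state.put(deviceId, new Reading(timestamp, value));
        }
    }

    // Returns the stored value for the device, or null if none has been seen.
    Float latest(long deviceId) {
        Reading r = state.get(deviceId);
        if (r == null) {
            return null;
        }
        return r.value;
    }

    public static void main(String[] args) {
        LatestPerKey s = new LatestPerKey();
        s.accept(3006L, 1L, 4.2f);
        s.accept(1000L, 1L, 3.8f); // arrives late, so it is ignored
        System.out.println(s.latest(1L)); // prints 4.2
    }
}
```

The state holds one entry per device id, so its size grows with the number of devices rather than with the number of events.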

(9999,11120,2,10.0)
(19999,21200,2,1.09)
(19999,13150,3,3.33)
(29999,23280,2,3.8)
(29999,31310,3,3.12)
(39999,32330,2,9.2)
(39999,31310,3,3.12)
(9999,3006,1,4.2)
(19999,3006,1,4.2)
(29999,22270,1,2.22)
(39999,33390,1,4.0)

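The reason (11120,2,10.0) is emitted by the timer at 9999 is that it is the arrival of this very event, with timestamp 11120, that advances the watermark past 9999 and causes the timer to fire. By the time onTimer is called, processElement has already been called for that event.
The check for ctx.timerService().currentWatermark() < Long.MAX_VALUE in onTimer is there so that this finite example does not run forever. When a streaming job reaches the end of its input, a final watermark with timestamp Long.MAX_VALUE is injected to cause one last firing of any remaining timers. In that case we should not register another timer.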
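Both effects can be reproduced without Flink. The sketch below is a single-threaded simulation under my own assumptions, not how Flink schedules timers internally: it uses one global watermark that advances after every element, fires timers in (timestamp, key) order, and injects a final Long.MAX_VALUE watermark at the end. The names `TimerSimulation`, `Event`, `run`, and `advanceWatermark` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class TimerSimulation {
    static final long WINDOW_MS = 10_000L;

    static final class Event {
        final long ts; final long key; final float value;
        Event(long ts, long key, float value) { this.ts = ts; this.key = key; this.value = value; }
    }

    static long rightBoundary(long ts) {
        return (ts / WINDOW_MS + 1) * WINDOW_MS - 1;
    }

    // Same twelve elements as in the job above.
    static List<Event> sampleEvents() {
        return Arrays.asList(
            new Event(1000L, 1L, 3.8f), new Event(2003L, 2L, 82.3f), new Event(3006L, 1L, 4.2f),
            new Event(11120L, 2L, 10f), new Event(12140L, 2L, 7.15f), new Event(13150L, 3L, 3.33f),
            new Event(21200L, 2L, 1.09f), new Event(22270L, 1L, 2.22f), new Event(23280L, 2L, 3.8f),
            new Event(31310L, 3L, 3.12f), new Event(32330L, 2L, 9.2f), new Event(33390L, 1L, 4.0f));
    }

    static List<String> run(List<Event> events) {
        Map<Long, Event> latest = new HashMap<>();              // per-key state
        TreeMap<Long, TreeSet<Long>> timers = new TreeMap<>();  // timer timestamp -> keys
        List<String> out = new ArrayList<>();
        for (Event e : events) {
            // Step 1: the element is processed first (processElement).
            Event current = latest.get(e.key);
            if (current == null) {
                timers.computeIfAbsent(rightBoundary(e.ts), t -> new TreeSet<>()).add(e.key);
                latest.put(e.key, e);
            } else if (e.ts > current.ts) {
                latest.put(e.key, e);
            }
            // Step 2: only then does the punctuated watermark advance and fire timers.
            advanceWatermark(e.ts, latest, timers, out);
        }
        // End of input: a final watermark fires whatever timers remain.
        advanceWatermark(Long.MAX_VALUE, latest, timers, out);
        return out;
    }

    static void advanceWatermark(long watermark, Map<Long, Event> latest,
                                 TreeMap<Long, TreeSet<Long>> timers, List<String> out) {
        while (!timers.isEmpty() && timers.firstKey() <= watermark) {
            Map.Entry<Long, TreeSet<Long>> due = timers.pollFirstEntry();
            long timerTs = due.getKey();
            for (long key : due.getValue()) {
                Event e = latest.get(key);
                out.add("(" + timerTs + "," + e.ts + "," + e.key + "," + e.value + ")");
                // Without this guard the loop would keep re-registering timers
                // under the final watermark and never terminate.
                if (watermark < Long.MAX_VALUE) {
                    timers.computeIfAbsent(timerTs + WINDOW_MS, t -> new TreeSet<>()).add(key);
                }
            }
        }
    }

    public static void main(String[] args) {
        for (String line : run(sampleEvents())) {
            System.out.println(line);
        }
    }
}
```

The simulation yields the same eleven records as the output listed above, including (9999,11120,2,10.0), though in a different order, since a parallel Flink job does not guarantee ordering across keys.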
