I have a use case where multiple types of Avro records come in on a single Kafka topic, because we use TopicRecordNameStrategy for the subject on this topic in the schema registry.
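For context, this is roughly how the producer side is configured to allow several record types on one topic (a minimal sketch; the property key and class names are Confluent's, the bootstrap and registry URLs are placeholders, and `ProducerConfigSketch` is just an illustrative wrapper):

```java
import java.util.Properties;

class ProducerConfigSketch {
    static Properties producerConfig() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");      // placeholder
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serializer writes the schema id into each message
        p.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        p.put("schema.registry.url", "http://localhost:8081"); // placeholder
        // TopicRecordNameStrategy registers subjects as <topic>-<record full name>,
        // which is what lets multiple record types share a single topic
        p.put("value.subject.name.strategy",
              "io.confluent.kafka.serializers.subject.TopicRecordNameStrategy");
        return p;
    }
}
```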
I have written a consumer to read this topic and build a DataStream of GenericRecord. I cannot sink this stream to HDFS/S3 in Parquet format as-is, because the stream contains records with different schemas. So I filter out each record type by applying a filter, creating a separate stream per type, and then sink each stream individually.
Here is the code I am using:

```java
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.InputStream;
import java.util.Properties;
public class EventStreamProcessor {

    private static final Logger LOGGER = LoggerFactory.getLogger(EventStreamProcessor.class);
    private static final String KAFKA_TOPICS = "events";

    private static Properties properties = new Properties();
    private static String schemaRegistryUrl = "";
    private static CachedSchemaRegistryClient registryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 1000);

    public static void main(String[] args) throws Exception {
        ParameterTool para = ParameterTool.fromArgs(args);
        InputStream inputStreamProperties = EventStreamProcessor.class.getClassLoader().getResourceAsStream(para.get("properties"));
        properties.load(inputStreamProperties);
        int numSlots = para.getInt("numslots", 1);
        int parallelism = para.getInt("parallelism");
        String outputPath = para.get("output");

        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(parallelism);
        env.enableCheckpointing(60000);

        ExecutionConfig executionConfig = env.getConfig();
        executionConfig.disableForceKryo();
        executionConfig.enableForceAvro();

        FlinkKafkaConsumer<GenericRecord> kafkaConsumer010 = new FlinkKafkaConsumer<>(KAFKA_TOPICS,
                new KafkaGenericAvroDeserializationSchema(schemaRegistryUrl),
                properties);

        Path path = new Path(outputPath);
        DataStream<GenericRecord> dataStream = env.addSource(kafkaConsumer010).name("bike_flow_source");

        try {
            final StreamingFileSink<GenericRecord> sink = StreamingFileSink
                    .forBulkFormat(path, ParquetAvroWriters.forGenericRecord(SchemaUtils.getSchema("events-com.events.search_list")))
                    .withBucketAssigner(new EventTimeBucketAssigner())
                    .build();
            dataStream.filter((FilterFunction<GenericRecord>) genericRecord ->
                    genericRecord.get(Constants.EVENT_NAME).toString().equals("search_list"))
                    .addSink(sink).name("search_list_sink").setParallelism(parallelism);

            final StreamingFileSink<GenericRecord> sink_search_details = StreamingFileSink
                    .forBulkFormat(path, ParquetAvroWriters.forGenericRecord(SchemaUtils.getSchema("events-com.events.search_details")))
                    .withBucketAssigner(new EventTimeBucketAssigner())
                    .build();
            dataStream.filter((FilterFunction<GenericRecord>) genericRecord ->
                    genericRecord.get(Constants.EVENT_NAME).toString().equals("search_details"))
                    .addSink(sink_search_details).name("search_details_sink").setParallelism(parallelism);
        } catch (Exception e) {
            // Log the actual exception instead of swallowing it at info level
            LOGGER.error("exception while setting up the sinks", e);
        }

        env.execute("event_stream_processor");
    }
}
```
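For completeness, the `KafkaGenericAvroDeserializationSchema` used above is not shown; it is essentially a thin wrapper around Confluent's `KafkaAvroDeserializer`. A minimal sketch of what such a class typically looks like (my assumption, not the exact implementation from this job):

```java
import io.confluent.kafka.serializers.KafkaAvroDeserializer;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;

import java.util.Collections;

public class KafkaGenericAvroDeserializationSchema implements DeserializationSchema<GenericRecord> {

    private final String schemaRegistryUrl;
    // KafkaAvroDeserializer is not serializable, so create it lazily on the task manager
    private transient KafkaAvroDeserializer inner;

    public KafkaGenericAvroDeserializationSchema(String schemaRegistryUrl) {
        this.schemaRegistryUrl = schemaRegistryUrl;
    }

    @Override
    public GenericRecord deserialize(byte[] message) {
        if (inner == null) {
            inner = new KafkaAvroDeserializer();
            // isKey = false: we are deserializing record values
            inner.configure(Collections.singletonMap("schema.registry.url", schemaRegistryUrl), false);
        }
        // The deserializer reads the schema id from the message bytes themselves,
        // so the topic argument is only used for error reporting here
        return (GenericRecord) inner.deserialize("events", message);
    }

    @Override
    public boolean isEndOfStream(GenericRecord nextElement) {
        return false;
    }

    @Override
    public TypeInformation<GenericRecord> getProducedType() {
        return TypeInformation.of(GenericRecord.class);
    }
}
```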
This seems very inefficient to me:
1. Every time a new event type is added, I have to change the code.
2. I have to create a separate filtered stream and sink for each type.

So please suggest: is it possible to write a single GenericRecord stream without creating multiple streams? If not, how can I drive this code from a configuration file, so that I don't have to write the same code again for every new event type?

Please suggest a better way to approach this problem.
2 Answers
Answer 1
You can simply pass the list of possible message types as a configuration parameter and then just iterate over it. You would end up with something like this:
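A reconstruction of what that loop might look like, written against the names already defined in the question's code (`para`, `path`, `dataStream`, `parallelism`, `SchemaUtils`, `EventTimeBucketAssigner`, `Constants`); the `--events` job parameter and the `events-com.events.<name>` subject pattern are assumptions based on the question:

```java
// Hypothetical job parameter, e.g. --events search_list,search_details
String[] eventNames = para.get("events").split(",");

for (String eventName : eventNames) {
    // One Parquet sink per event type, built from that type's registered schema
    final StreamingFileSink<GenericRecord> eventSink = StreamingFileSink
            .forBulkFormat(path, ParquetAvroWriters.forGenericRecord(
                    SchemaUtils.getSchema("events-com.events." + eventName)))
            .withBucketAssigner(new EventTimeBucketAssigner())
            .build();

    // One filtered branch per event type, all fed from the same source stream
    dataStream
            .filter((FilterFunction<GenericRecord>) genericRecord ->
                    genericRecord.get(Constants.EVENT_NAME).toString().equals(eventName))
            .addSink(eventSink)
            .name(eventName + "_sink")
            .setParallelism(parallelism);
}
```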
This means you only need to restart the job with an updated configuration whenever a new message type arrives.
Answer 2
I tried it like this, but it is not working....
Please suggest the correct way to achieve this.
Thanks.