Integrating Apache Spark and Kafka with PySpark

mmvthczy · posted 2021-07-12 in Spark

This is the development environment in which I am integrating Kafka and Spark:

  1. IDE: Eclipse 2020-12
  2. Python: Anaconda 2020.02 (Python 3.7)
  3. Kafka: 2.13-2.7.0
  4. Spark: 3.0.1-bin-hadoop3.2

I set up Eclipse by following a reference site; the image below shows the Eclipse PySpark configuration.

Simple PySpark code runs successfully without errors, but combining Kafka with Spark Structured Streaming produces an error. Here is the code:

  from pyspark.sql import SparkSession
  spark = SparkSession.builder.master("local[*]").appName("appName").getOrCreate()
  df = spark.read.format("kafka")\
      .option("kafka.bootstrap.servers", "localhost:9092")\
      .option("subscribe", "topicForMongoDB")\
      .option("startingOffsets", "earliest")\
      .load()\
      .selectExpr("CAST(value AS STRING) as column")
  df.printSchema()
  df.show()

The error thrown is:

  pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
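
(For context, this error generally means the Kafka connector jar is not on the classpath when the session starts. One common way to attach it from code is the spark.jars.packages config on the session builder; the sketch below is only an illustration under the assumption of a Spark 3.0.1 / Scala 2.12 build, not necessarily the fix for this particular setup.)

  from pyspark.sql import SparkSession

  # Sketch: request the Kafka connector before the SparkSession (and its JVM) is created.
  # The coordinate assumes Spark 3.0.1 built with Scala 2.12; adjust it to the installed version.
  spark = SparkSession.builder \
      .master("local[*]") \
      .appName("appName") \
      .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1") \
      .getOrCreate()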

So I added Python code to pull in the related jar packages:

  import os
  os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.0,org.apache.spark:spark-streaming-kafka-0-10_2.12:3.1.0'

But this time a different error appeared:

  Error: Missing application resource.
  Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
  Usage: spark-submit --kill [submission ID] --master [spark://...]
  Usage: spark-submit --status [submission ID] --master [spark://...]
  Usage: spark-submit run-example [options] example-class [example args]
  Options:
    --master MASTER_URL          spark://host:port, mesos://host:port, yarn,
                                 k8s://https://host:port, or local (Default: local[*]).
    --deploy-mode DEPLOY_MODE    Whether to launch the driver program locally ("client") or
                                 on one of the worker machines inside the cluster ("cluster")
                                 (Default: client).
    --class CLASS_NAME           Your application's main class (for Java / Scala apps).
    --name NAME                  A name of your application.
    --jars JARS                  Comma-separated list of jars to include on the driver
                                 and executor classpaths.
    --packages                   Comma-separated list of maven coordinates of jars to include
                                 on the driver and executor classpaths. Will search the local
                                 maven repo, then maven central and any additional remote
                                 repositories given by --repositories. The format for the
                                 coordinates should be groupId:artifactId:version.

I am stuck here. There is evidently some problem with my Eclipse configuration or my PySpark code, but I cannot tell what is causing these errors. Please let me know how to configure the integration of Kafka with Spark/PySpark. Any reply is welcome.
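
(Two things may be worth checking here, stated as assumptions rather than a confirmed fix: when PYSPARK_SUBMIT_ARGS is set from Python for an IDE-launched session, the value usually has to end with the pyspark-shell token, otherwise spark-submit reports "Missing application resource"; and the connector version generally should match the installed Spark build, which is 3.0.1 with Scala 2.12 here, while the snippet above requests 3.1.0. A minimal sketch follows.)

  import os

  # Sketch: set the env var before pyspark creates its JVM / SparkSession.
  # "pyspark-shell" at the end names the application resource that spark-submit should run;
  # the 3.0.1 package version is an assumption chosen to match the installed Spark build.
  os.environ['PYSPARK_SUBMIT_ARGS'] = (
      '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 pyspark-shell'
  )

  from pyspark.sql import SparkSession
  spark = SparkSession.builder.master("local[*]").appName("appName").getOrCreate()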
