I am trying to run a simple Spark program that reads an Avro file in my PyCharm environment. I keep hitting an error that I cannot resolve. I'd appreciate your help.
from environment_variables import *
import avro.schema
from pyspark.sql import SparkSession
Schema = avro.schema.parse(open(SCHEMA_PATH, "rb").read())
print(Schema)
spark = SparkSession.builder.appName("indu").getOrCreate()
df = spark.read.format("avro").load(list(Schema))
print(df)
The printed schema looks like this:
{"type": "record", "name": "DefaultEventRecord", "namespace": "io.divolte.record", "fields": [{"type": "boolean", "name": "detectedDuplicate"}, {"type": "boolean", "name": "detectedCorruption"}, {"type": "boolean", "name": "firstInSession"}, {"type": "long", "name": "clientTimestamp"}, {"type": "long", "name": "timestamp"}, {"type": "string", "name": "remoteHost"}, {"type": ["null", "string"], "name": "referer", "default": null}, {"type": ["null", "string"], "name": "location", "default": null}, {"type": ["null", "int"], "name": "devicePixelRatio", "default": null}, {"type": ["null", "int"], "name": "viewportPixelWidth", "default": null}, {"type": ["null", "int"], "name": "viewportPixelHeight", "default": null}, {"type": ["null", "int"], "name": "screenPixelWidth", "default": null}, {"type": ["null", "int"], "name": "screenPixelHeight", "default": null}, {"type": ["null", "string"], "name": "partyId", "default": null}, {"type": ["null", "string"], "name": "sessionId", "default": null}, {"type": ["null", "string"], "name": "pageViewId", "default": null}, {"type": ["null", "string"], "name": "eventId", "default": null}, {"type": "string", "name": "eventType", "default": "unknown"}, {"type": ["null", "string"], "name": "userAgentString", "default": null}, {"type": ["null", "string"], "name": "userAgentName", "default": null}, {"type": ["null", "string"], "name": "userAgentFamily", "default": null}, {"type": ["null", "string"], "name": "userAgentVendor", "default": null}, {"type": ["null", "string"], "name": "userAgentType", "default": null}, {"type": ["null", "string"], "name": "userAgentVersion", "default": null}, {"type": ["null", "string"], "name": "userAgentDeviceCategory", "default": null}, {"type": ["null", "string"], "name": "userAgentOsFamily", "default": null}, {"type": ["null", "string"], "name": "userAgentOsVersion", "default": null}, {"type": ["null", "string"], "name": "userAgentOsVendor", "default": null}, {"type": ["null", "int"], "name": "cityIdField", 
"default": null}, {"type": ["null", "string"], "name": "cityNameField", "default": null}, {"type": ["null", "string"], "name": "countryCodeField", "default": null}, {"type": ["null", "int"], "name": "countryIdField", "default": null}, {"type": ["null", "string"], "name": "countryNameField", "default": null}]}
The error I get is:
21/03/02 16:06:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
File "X:\Git_repo\Project_Red\spark_streaming\spark_scripting.py", line 15, in <module>
df = spark.read.format("avro").load(list(jsonFormatSchema))
TypeError: 'RecordSchema' object is not iterable
1 Answer
There are three corrections to make in your code:

1. You do not need to load the schema file separately, because every Avro data file already carries its schema in the file header.
2. The load() method in spark.read.format("avro").load(list(Schema)) expects the path to the Avro file(s), not a schema object. That is what the TypeError: 'RecordSchema' object is not iterable is telling you.
3. print(df) will not produce anything meaningful. Use df.show() if you want to inspect the data in the Avro file.

With that said, you probably already have an idea of what needs to change in your code: