Azure: adding an array list to a PySpark DataFrame

gblwokeq, posted on 2023-03-31 in Spark

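I am very new to PySpark. I am trying to add an array list (a list of files) that I collected with the "mssparkutils.fs.ls(WORK_FOLDER)" command to a DataFrame, but I get the error "TypeError: StructType can not accept object '20230205' in type <class 'str'>".
The code is as follows: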

from pyspark.sql.types import StructType, StructField, StringType

# Validation Id Checking
columns = StructType([StructField('Name', StringType())])
FileList = []
files = mssparkutils.fs.ls(WORK_FOLDER)
for file in files:
    if file.name.endswith('csv'):
        fileName = file.name
        array = fileName.split("_")
        for word in array:
            # keep every part of the name that does not start with 'Exchange'
            index = word.find('Exchange')
            if index != 0:
                FileList.append(str(word))
print(FileList)
df = spark.createDataFrame(data=FileList, schema=columns)

========================================================================================
The print(FileList) command gives the following output: ['20230205', '001040.csv', '20230205', '200005.csv', '20230206', '200006.csv', '20230207', '200021.csv', '20230208', '200007.csv', '20230209', '200010.csv', '20230210', '200009.csv']
I am trying to add the "FileList" values to the DataFrame df, under a StringType column named 'Name'.

9wbgstp7 #1

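Before creating the DataFrame, make sure the input file list has a 2D structure. With a StructType schema, each row must itself be a list or tuple of field values, not a bare string: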

spark.createDataFrame([[item] for item in FileList], schema=columns)

Result

+----------+
|      Name|
+----------+
|  20230205|
|001040.csv|
|  20230205|
|200005.csv|
|  20230206|
|200006.csv|
|  20230207|
|200021.csv|
|  20230208|
|200007.csv|
|  20230209|
|200010.csv|
|  20230210|
|200009.csv|
+----------+
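
The reshaping step above can be sketched on its own, as a minimal example rather than a definitive implementation. The helper names to_rows and names_to_dataframe are hypothetical and were chosen for illustration; the sample list is a shortened copy of the FileList output from the question. The Spark import is done lazily inside the second helper, so the row-wrapping part can be run without a Spark session.

```python
from typing import Iterable, List, Tuple


def to_rows(names: Iterable[str]) -> List[Tuple[str]]:
    """Wrap each string in a one-element tuple so it matches a one-field StructType."""
    return [(name,) for name in names]


def names_to_dataframe(spark, names, column="Name"):
    """Build a single-column string DataFrame (hypothetical helper, not a Spark API)."""
    # Imported here so that to_rows() stays usable without PySpark installed.
    from pyspark.sql.types import StringType, StructField, StructType

    schema = StructType([StructField(column, StringType())])
    return spark.createDataFrame(to_rows(names), schema=schema)


# Shortened sample of the FileList printed in the question.
file_list = ["20230205", "001040.csv", "20230205", "200005.csv"]
rows = to_rows(file_list)
print(rows[:2])  # [('20230205',), ('001040.csv',)]
```

In a notebook with an active session, names_to_dataframe(spark, FileList) should give the same DataFrame as the one-liner above. Another option that avoids the wrapping is spark.createDataFrame(FileList, StringType()).toDF('Name'): with a plain StringType schema the single column is named value, so it is renamed afterwards.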
