My source file looks like the following, and I am trying to read it in PySpark for further transformation.
"ID","FNAME","LNAME","AGE","DESIGNATION"
"1","John","Denver","34","Tech Staff"
"2","Philip","Spencer","30","Tech Staff "CONTRACT""
The code is as follows:
%pyspark
df = spark.read.csv("s3://emp_bucket/test_files/emp.csv", sep=",", quote='"', header=True)
df.show(truncate=False)
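For readers without a Spark cluster at hand, the quote handling on the well-formed rows can be reproduced with Python's stdlib csv module (used here purely as an illustration; it is not part of the original Spark setup). A quoted field with no embedded quotes parses cleanly, outer quotes stripped:

```python
import csv
import io

# The well-formed portion of the source file.
data = '''"ID","FNAME","LNAME","AGE","DESIGNATION"
"1","John","Denver","34","Tech Staff"
'''

rows = list(csv.reader(io.StringIO(data)))
print(rows[0])  # ['ID', 'FNAME', 'LNAME', 'AGE', 'DESIGNATION']
print(rows[1])  # ['1', 'John', 'Denver', '34', 'Tech Staff']
```

The second row of the real file is where parsers start to disagree, because its embedded quotes are not escaped.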
I expect the result to look like this:
+---+------+-------+---+-----------------------+
|ID |FNAME |LNAME |AGE|DESIGNATION |
+---+------+-------+---+-----------------------+
|1 |John |Denver |34 |Tech Staff |
|2  |Philip|Spencer|30 |Tech Staff "CONTRACT"  |
+---+------+-------+---+-----------------------+
But the actual result is unexpected, as shown below:
+---+------+-------+---+-----------------------+
|ID |FNAME |LNAME |AGE|DESIGNATION |
+---+------+-------+---+-----------------------+
|1 |John |Denver |34 |Tech Staff |
|2 |Philip|Spencer|30 |"Tech Staff "CONTRACT""|
+---+------+-------+---+-----------------------+
I tried using an escape character, but PySpark would not strip the outer double quotes from "Tech Staff "CONTRACT"".
Can someone confirm whether this is the correct behavior?
1 Answer
pgpifvop1#
If you look at this line of the file:

"2","Philip","Spencer","30","Tech Staff "CONTRACT""

you will see that the quotes in the last column are not escaped, so what Spark returns is the expected behavior for malformed input. Per the usual CSV convention (RFC 4180), a double quote inside a quoted field must be doubled. The line should be:

"2","Philip","Spencer","30","Tech Staff ""CONTRACT"""

or the last field can even be left unquoted, since its content contains no comma:

2,Philip,Spencer,30,Tech Staff "CONTRACT"
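As a quick sanity check of the doubled-quote form (plain Python with the stdlib csv module, standing in for any RFC 4180-style reader/writer), writing the intended value and reading it back round-trips cleanly, with the outer quotes stripped:

```python
import csv
import io

# The value we actually want in the DESIGNATION column.
row = ["2", "Philip", "Spencer", "30", 'Tech Staff "CONTRACT"']

# Write it with RFC 4180-style quoting: every field quoted,
# embedded quotes doubled.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(row)
line = buf.getvalue().strip()
print(line)        # "2","Philip","Spencer","30","Tech Staff ""CONTRACT"""
# Reading it back recovers the original value, outer quotes stripped.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed[-1])  # Tech Staff "CONTRACT"
```

If you cannot fix the file itself, one commonly used workaround on the Spark side is to change the reader's escape character, which defaults to a backslash, to a double quote (e.g. `escape='"'`), so that doubled quotes inside quoted fields are interpreted the RFC 4180 way.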