PySpark: reading a CSV file that contains quotes

Asked by yk9xbfzb on 2021-05-27, in Spark

I have a CSV file like the following:

name|age|county|state|country
"alex"john"|"30"|"burlington"|"nj"|"usa"

I read the CSV file with Spark:

input_df = spark.read.format('csv').options(header='true', inferSchema='false', sep='|').load('s3://path_to_file')

display(input_df)

Output (not sure why there are quotes around alex"john but not around the other fields):

name        age county     state    country
"alex"john" 30  burlington  nj      usa

Expected output:

name        age county     state    country
alex"john   30  burlington  nj      usa
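(A minimal local reproduction sketch, in case it helps anyone: the sample row is written to a hypothetical path /tmp/people.csv and read back with the same options, assuming Spark can read from the local filesystem.)

# Reproduction sketch: write the sample row to a hypothetical local path
# and read it back with the same options as above.
sample = 'name|age|county|state|country\n"alex"john"|"30"|"burlington"|"nj"|"usa"\n'
with open("/tmp/people.csv", "w") as f:
    f.write(sample)

input_df = (spark.read.format('csv')
            .options(header='true', inferSchema='false', sep='|')
            .load('/tmp/people.csv'))
input_df.show(truncate=False)   # name keeps its quotes; the other columns do not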

frebpwbc #1

Spark ends up reading the entire name value as a single string, quotes and all, because the unescaped inner quote throws the parser off; the other fields are properly quoted, so their surrounding quotes are stripped as usual. Just remove the first and last double quotes after reading, like this:

from pyspark.sql import functions as F

# Drop the first and last characters of name (the surrounding double quotes)
df.withColumn("name", F.expr("substring(name, 2, length(name) - 2)")).show()

# +---------+---+----------+-----+-------+
# |name     |age|county    |state|country|
# +---------+---+----------+-----+-------+
# |alex"john|30 |burlington|nj   |usa    |
# +---------+---+----------+-----+-------+

To do this dynamically for all columns, I suggest a regex like this:

from pyspark.sql import functions as F

# Strip a double quote at the start (^") or at the end ("$) of every column
df.select(*[F.regexp_replace(x, '^"|"$', '').alias(x) for x in df.columns]).show()

# +---------+---+----------+-----+-------+
# |name     |age|county    |state|country|
# +---------+---+----------+-----+-------+
# |alex"john|30 |burlington|nj   |usa    |
# +---------+---+----------+-----+-------+
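As a quick sanity check (a hypothetical one-row DataFrame, not part of the original answer), the anchors ^ and $ ensure only quotes at the string boundaries are removed, so the inner quote survives:

from pyspark.sql import functions as F

# Hypothetical one-row example: only the leading and trailing quotes match
spark.createDataFrame([('"alex"john"',)], ["name"]) \
    .select(F.regexp_replace("name", '^"|"$', "").alias("name")) \
    .show(truncate=False)

# +---------+
# |name     |
# +---------+
# |alex"john|
# +---------+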

5vf7fwbs #2

This is a tricky case because nothing escapes the inner quote (such as a backslash \).
If you can't find a way to get the inner quote escaped, I suggest reading the data as-is and trimming the surrounding quotes with the regexp_replace function, like this:

from pyspark.sql.functions import regexp_replace

df = spark.read.option("delimiter", "|").option("inferSchema", "true").option("header", "true").csv("tmp.csv")

# Remove a leading or trailing double quote from name into a new column
df.withColumn("formatted_name", regexp_replace(df.name, '^"|"$', "")).show()

Output:

+-----------+---+----------+-----+-------+--------------+
|       name|age|    county|state|country|formatted_name|
+-----------+---+----------+-----+-------+--------------+
|"alex"john"| 30|burlington|   nj|    usa|     alex"john|
+-----------+---+----------+-----+-------+--------------+
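For completeness (this goes beyond the original answer): if the producer of the file could double the inner quote, as standard CSV escaping does (alex""john), Spark would parse it directly via its escape option. A sketch assuming a corrected file at a hypothetical local path:

# Sketch assuming the inner quote were doubled ("alex""john") in the source file
fixed = 'name|age|county|state|country\n"alex""john"|"30"|"burlington"|"nj"|"usa"\n'
with open("/tmp/fixed.csv", "w") as f:
    f.write(fixed)

df = (spark.read
      .option("delimiter", "|")
      .option("header", "true")
      .option("quote", '"')
      .option("escape", '"')   # treat "" inside a quoted field as a literal "
      .csv("/tmp/fixed.csv"))
df.show()   # name then parses cleanly as: alex"john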
