PySpark: reading a CSV file that contains quotes

Asked by yk9xbfzb on 2021-05-27, in Spark

I have a CSV file like the following:

name|age|county|state|country
"alex"john"|"30"|"burlington"|"nj"|"usa"

I read the CSV file with Spark:

input_df = spark.read.format('csv').options(header='true', inferSchema='false', sep='|').load('s3://path_to_file')

display(input_df)

Output (not sure why there are quotes around alex"john but not around the other fields):

name        age county     state    country
"alex"john" 30  burlington  nj      usa

Expected output:

name        age county     state    country
alex"john   30  burlington  nj      usa
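(A minimal local reproduction sketch, in case it helps anyone: the sample row is written to a hypothetical path /tmp/people.csv and read back with the same options, assuming Spark can read from the local filesystem.)

# Reproduction sketch: write the sample row to a hypothetical local path
# and read it back with the same options as above.
sample = 'name|age|county|state|country\n"alex"john"|"30"|"burlington"|"nj"|"usa"\n'
with open("/tmp/people.csv", "w") as f:
    f.write(sample)

input_df = (spark.read.format('csv')
            .options(header='true', inferSchema='false', sep='|')
            .load('/tmp/people.csv'))
input_df.show(truncate=False)   # name keeps its quotes; the other columns do not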

frebpwbc #1

Spark ends up reading the entire name value as a single string, quotes and all, because the unescaped inner quote throws the parser off; the other fields are properly quoted, so their surrounding quotes are stripped as usual. Just remove the first and last double quotes after reading, like this:

from pyspark.sql import functions as F

# Drop the first and last characters of name (the surrounding double quotes)
df.withColumn("name", F.expr("substring(name, 2, length(name) - 2)")).show()

# +---------+---+----------+-----+-------+
# |name     |age|county    |state|country|
# +---------+---+----------+-----+-------+
# |alex"john|30 |burlington|nj   |usa    |
# +---------+---+----------+-----+-------+

To do this dynamically for all columns, I suggest a regex like this:

from pyspark.sql import functions as F

# Strip a double quote at the start (^") or at the end ("$) of every column
df.select(*[F.regexp_replace(x, '^"|"$', '').alias(x) for x in df.columns]).show()

# +---------+---+----------+-----+-------+
# |name     |age|county    |state|country|
# +---------+---+----------+-----+-------+
# |alex"john|30 |burlington|nj   |usa    |
# +---------+---+----------+-----+-------+
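As a quick sanity check (a hypothetical one-row DataFrame, not part of the original answer), the anchors ^ and $ ensure only quotes at the string boundaries are removed, so the inner quote survives:

from pyspark.sql import functions as F

# Hypothetical one-row example: only the leading and trailing quotes match
spark.createDataFrame([('"alex"john"',)], ["name"]) \
    .select(F.regexp_replace("name", '^"|"$', "").alias("name")) \
    .show(truncate=False)

# +---------+
# |name     |
# +---------+
# |alex"john|
# +---------+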

5vf7fwbs #2

This is a tricky case because nothing escapes the inner quote (such as a backslash \).
If you can't find a way to get the inner quote escaped, I suggest reading the data as-is and trimming the surrounding quotes with the regexp_replace function, like this:

from pyspark.sql.functions import regexp_replace

df = spark.read.option("delimiter", "|").option("inferSchema", "true").option("header", "true").csv("tmp.csv")

# Remove a leading or trailing double quote from name into a new column
df.withColumn("formatted_name", regexp_replace(df.name, '^"|"$', "")).show()

Output:

+-----------+---+----------+-----+-------+--------------+
|       name|age|    county|state|country|formatted_name|
+-----------+---+----------+-----+-------+--------------+
|"alex"john"| 30|burlington|   nj|    usa|     alex"john|
+-----------+---+----------+-----+-------+--------------+
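For completeness (this goes beyond the original answer): if the producer of the file could double the inner quote, as standard CSV escaping does (alex""john), Spark would parse it directly via its escape option. A sketch assuming a corrected file at a hypothetical local path:

# Sketch assuming the inner quote were doubled ("alex""john") in the source file
fixed = 'name|age|county|state|country\n"alex""john"|"30"|"burlington"|"nj"|"usa"\n'
with open("/tmp/fixed.csv", "w") as f:
    f.write(fixed)

df = (spark.read
      .option("delimiter", "|")
      .option("header", "true")
      .option("quote", '"')
      .option("escape", '"')   # treat "" inside a quoted field as a literal "
      .csv("/tmp/fixed.csv"))
df.show()   # name then parses cleanly as: alex"john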
