pyspark

hi3rlvi2 · posted 2021-05-27 in Spark

I am reading a CSV file into a Spark DataFrame. Many of the columns contain blank spaces (" ") that I want to remove. The CSV has 500 columns, so I cannot specify each column manually in the code.
Sample data:

ADVANCE_TYPE   CHNG_DT     BU_IN
A              20190718    1
               20190728    2
               20190714
B              20190705
               20190724    4

Code:

```
from pyspark.sql.functions import col, when, regexp_replace, trim

df_csv = spark.read.options(header='true').options(delimiter=',').options(inferSchema='true').options(nullValue="None").csv("test41.csv")

for col_name in df_csv.columns:
    df_csv = df_csv.select(trim(col(col_name)))
```

But this code does not remove the blank spaces. Please help!

b4wnujal1#

You can use a list comprehension to apply trim to all of the columns at once. (The loop in the question fails because each `select` returns a DataFrame containing only that single trimmed column, so the remaining columns are no longer there on the next iteration.) Example:
```
from pyspark.sql.functions import col, length, trim

# sample strings padded with leading/trailing spaces
df = spark.createDataFrame([("   ", "12343", "   ", "9  ", "   0")])

# finding the length of each column
expr = [length(col(col_name)).name('length' + col_name) for col_name in df.columns]

df.select(expr).show()
# +--------+--------+--------+--------+--------+
# |length_1|length_2|length_3|length_4|length_5|
# +--------+--------+--------+--------+--------+
# |       3|       5|       3|       3|       4|
# +--------+--------+--------+--------+--------+

# trim on all the df columns
expr = [trim(col(col_name)).name(col_name) for col_name in df.columns]

df1 = df.select(expr)
df1.show()
# +---+-----+---+---+---+
# | _1|   _2| _3| _4| _5|
# +---+-----+---+---+---+
# |   |12343|   |  9|  0|
# +---+-----+---+---+---+

# length on df1 columns
expr = [length(col(col_name)).name('length' + col_name) for col_name in df.columns]
df1.select(expr).show()
# +--------+--------+--------+--------+--------+
# |length_1|length_2|length_3|length_4|length_5|
# +--------+--------+--------+--------+--------+
# |       0|       5|       0|       1|       1|
# +--------+--------+--------+--------+--------+
```
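To apply the same pattern to the 500-column CSV from the question, build a single select over df_csv.columns instead of trimming in a loop. The sketch below is an assumption built on the question's file name and read options, not part of the original answer; since inferSchema='true' will type some columns as integers, it trims only the string columns:

```
from pyspark.sql.functions import col, trim

# read options and file name taken from the question
df_csv = (spark.read
          .options(header='true', delimiter=',', inferSchema='true', nullValue='None')
          .csv("test41.csv"))

# one select over all 500 columns: trim string columns, pass others through
df_csv = df_csv.select([
    trim(col(c)).alias(c) if dtype == 'string' else col(c)
    for c, dtype in df_csv.dtypes
])
```

The alias(c) call keeps the original column names, which would otherwise become trim(colname) in the result.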
