Removing NULL, NaN and empty space from a PySpark DataFrame

Posted by m3eecexj on 2022-11-01 in Spark


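I have a DataFrame in PySpark that contains empty strings, nulls and NaN. I want to remove the rows that contain any of these. I tried the commands below, but none of them seems to have any effect.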

myDF.na.drop().show()
myDF.na.drop(how='any').show()

Here is the DataFrame:

+---+----------+----------+-----+-----+
|age|  category|      date|empId| name|
+---+----------+----------+-----+-----+
| 25|electronic|17-01-2018|  101|  abc|
| 24|    sports|16-01-2018|  102|  def|
| 23|electronic|17-01-2018|  103|  hhh|
| 23|electronic|16-01-2018|  104|  yyy|
| 29|       men|12-01-2018|  105| ajay|
| 31|      kids|17-01-2018|  106|vijay|
|   |       Men|       nan|  107|Sumit|
+---+----------+----------+-----+-----+

What am I missing? What is the best way to handle NULL, NaN or empty values so that they cause no problems in the actual calculations?

pyspark

Source: https://stackoverflow.com/questions/48421651/removing-null-nan-empty-space-from-pyspark-dataframe

2 Answers


tcbh2hod1#

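NaN (not a number) has a different meaning from NULL, and an empty string is just an ordinary value (it can be converted to NULL automatically by the CSV reader), so na.drop will not match these values.
You can convert all of them to null and then drop the rows: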

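from pyspark.sql import SparkSession
from pyspark.sql.functions import col, isnan, when, trim

# Reuse the active session (or create one when run outside the pyspark shell).
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ("", 1, 2.0), ("foo", None, 3.0), ("bar", 1, float("NaN")),
    ("good", 42, 42.0)])

def to_null(c):
    # Keep the value only if it is not null, not NaN and not blank.
    # when() without otherwise() yields null for every other row.
    return when(~(col(c).isNull() | isnan(col(c)) | (trim(col(c)) == "")), col(c))

df.select([to_null(c).alias(c) for c in df.columns]).na.drop().show()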

# +----+---+----+
# |  _1| _2|  _3|
# +----+---+----+
# |good| 42|42.0|
# +----+---+----+
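
To make the distinction between the three cases concrete, here is a minimal plain-Python sketch (no Spark involved) applied to the same four rows. The helper name `is_missing` is hypothetical and only mirrors the idea behind `to_null`; it is an illustration, not part of the original answer.

```python
import math

def is_missing(value):
    """Return True for None, a float NaN, or a blank/whitespace-only string."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return False

nan = float("nan")

# NaN never compares equal to anything, itself included,
# so it needs its own test (math.isnan) rather than ==.
assert nan != nan
assert math.isnan(nan)

# Same sample rows as in the Spark example.
rows = [("", 1, 2.0), ("foo", None, 3.0), ("bar", 1, nan), ("good", 42, 42.0)]

# Keep a row only if none of its values is missing in any of the three senses.
clean = [row for row in rows if not any(is_missing(v) for v in row)]
print(clean)  # [('good', 42, 42.0)]
```

Each of the three checks catches exactly one of the dropped rows, which is why a single null test is not enough.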


rpppsulh2#

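Maybe it does not matter in your case, but this code (modified from Alper t. Turker's answer) handles the different data types accordingly. Of course, the data types can vary depending on your DataFrame. (Tested on Spark version 2.4.)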

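from pyspark.sql.functions import col, isnan, when, trim

# Build a "value is valid" condition that depends on the column's data type.
def to_null_bool(c, dt):
    if dt in ("double", "float"):
        # Floating-point columns: reject both null and NaN.
        return ~c.isNull() & ~isnan(c)
    elif dt == "string":
        # String columns: reject null and blank strings.
        return ~c.isNull() & (trim(c) != "")
    else:
        return ~c.isNull()

# Keep a value only when it is valid; everything else becomes null.
def to_null(c, dt):
    c = col(c)
    return when(to_null_bool(c, dt), c)

# df.dtypes is a list of (column name, type name) pairs, so dt[1] is the type name.
df.select([to_null(c, dt[1]).alias(c) for c, dt in zip(df.columns, df.dtypes)]).na.drop(how="any").show()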
