Filtering null values from struct fields in a PySpark dynamic frame

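I have a deeply nested JSON. After filtering down to only the category fields I want to process, my column 'data' is left with the same complex structure/schema as the original dataframe: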

+--------------------+...
|                data|    
+--------------------+...
|{null, null, 833,...|...

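The data struct contains roughly 250 nested fields, 95% of which are null.
My goal is to turn the data column into an aggregate that contains only the non-null fields, ideally only a subset of them. However, the struct schema is inherited from when the data was read, and I cannot find a way to rebuild the schema afterwards.
What I have tried: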

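1. filtered4.filter(f.col('data').isNotNull()) / isNull(), but this either removes the entire row or does nothing.
2. concat_ws and coalesce(), something like: df.withColumn("data", concat_ws(", ", coalesce(col("data.street"), ""), coalesce(col("data.city.neighborhood"), ""), coalesce(col("address.state"), ""))) ..., but this is not an option because I have hundreds of fields.
3. Converting the 'data' column to a string and then cleaning it up with regex, but that loses the structure/names of the fields I want to keep.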
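For context, these are dynamic struct fields (e.g. the request headers of a website) whose structure changes depending on the data, so I would like to keep everything in one place and only access it when I need to. I think it would be best to keep it as a string type without all those nulls ("{null, null, 833, .."), but I am very open to suggestions from more experienced PySpark users.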

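Regarding the second option you tried (and the difficulty with it), you can use the following approach to collect all the fields in the struct column.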

import typing
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
def construct(prefix: str, fields: typing.List) -> typing.List:
   '''
   Recursive function to find all nested fields in the struct field
   '''
   # List holding all nested fields found in this call
   found_fields = []
   # Loop over the list of fields passed as an argument
   for field in fields:
      if isinstance(field.dataType, StructType):
         found_fields += \
            [
               prefix+'.'+item for item in
               construct(field.name, field.dataType.fields)
            ]
      else:
         found_fields.append(prefix+'.'+field.name)
   return found_fields
def all_nested_fields(df: DataFrame, col: str) -> typing.List[str]:
   '''
   Collect all nested fields in a struct column in a Spark data frame
   '''
   if isinstance(df.schema[col].dataType, StructType):
      print('%s is a struct column. Finding all nested fields' % (col, ))
      # Use the data frame's own schema rather than a global variable
      return construct(col, df.schema[col].dataType.fields)
   else:
      print('%s is not a struct column.' % (col, ))
      # Return an empty list so callers can always iterate over the result
      return []
spark = SparkSession.builder.getOrCreate()
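# Each row is (name, city, attribute); name is (details, firstname, lastname);
# details is (nationality,); nationality is (country, age)
data = [
    (((("Iran", 30),), "Sajad", "Safarveisi"), "Tehran", "Persian Golf"),
    (((("USA", 40),), "James", "Baker"), "Washington", "Wall street"),
    (((("Germany", 25),), "Patrick", "Gottmann"), "Colon", "Black forest")]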
schema = StructType([
StructField("name", StructType([
        StructField("details", StructType([
                StructField("nationality", StructType([
                    StructField("country", StringType(), True),
                    StructField("age", IntegerType(), True)
                ]), False)]), False),
        StructField("firstname", StringType(), True),
        StructField("lastname", StringType(), True),
    ]), False),
StructField("city", StringType(), True),
StructField("attribute", StringType(), True)])
df = spark.createDataFrame(data=data, schema=schema)
# All nested fields in the column 'name'
nested_fields = all_nested_fields(df, 'name')
# Create a spark data frame from them (example operation)
df.select(*[F.col(field) for field in nested_fields]).show()
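
The code above only discovers the nested field names. A struct column has one fixed schema for every row, so the nulls cannot be dropped while keeping the struct type. Since you mention that a string would be acceptable, one option is to serialise each row's struct to a JSON string that keeps only the non-null fields and their names. Here is a minimal sketch in plain Python, assuming the struct has already been turned into a nested dict (for example via `Row.asDict(recursive=True)`). The helper names `drop_nulls` and `to_compact_json` and the sample record are made up for illustration.

```python
import json
from typing import Any


def drop_nulls(value: Any) -> Any:
    """Recursively remove None entries from nested dicts and lists."""
    if isinstance(value, dict):
        cleaned = {}
        for key, item in value.items():
            item = drop_nulls(item)
            # Skip nulls and nested structs that became empty after cleaning
            if item is None or item == {}:
                continue
            cleaned[key] = item
        return cleaned
    if isinstance(value, list):
        return [drop_nulls(item) for item in value if item is not None]
    return value


def to_compact_json(record: dict) -> str:
    """Serialise a nested record to JSON, keeping only non-null fields."""
    return json.dumps(drop_nulls(record), sort_keys=True)


# Hypothetical record shaped like one row of the 'data' struct column
record = {
    "street": None,
    "city": {"neighborhood": None, "zip": 833},
    "headers": {"referer": None},
}
print(to_compact_json(record))
```

In PySpark this could be wrapped in a UDF, for example `F.udf(lambda row: to_compact_json(row.asDict(recursive=True)), StringType())` applied to `F.col('data')`. Before going that route it may be worth checking whether the built-in `F.to_json('data')` is already enough: as far as I know, Spark's JSON writer omits null struct fields by default, which would avoid the Python UDF overhead.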
