Creating rows of a table using the columns of another table in pyspark

Posted by piok6c0g on 2023-01-01 in Spark


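I have a table (df) with multiple columns: Column1, Column2, Column3, and so on.

| Column1 | Column2 | Column3 | ... | ColumnN |
| ------- | ------- | ------- | --- | ------- |
| 1       | ABC     | 1       |     | Q       |
| 1       | XYZ     |         |     |         |
| 2       |         | 3       |     |         |
| 3       | ABC     | 6       |     | Q       |

I want my final table (df) to have the following columns: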

- attribute_name: contains the names of the columns of the previous table
- count: contains the total row count of the table
- distinct_count: contains the distinct count of each column of the previous table
- null_count: contains the count of null values in each column of the previous table

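The final table should look like this:

| attribute_name | count | distinct_count | null_count |
| -------------- | ----- | -------------- | ---------- |
| Column1        | 4     | 3              | 0          |
| Column2        | 4     | 2              | 1          |
| Column3        | 4     | 3              | 1          |
| ColumnN        | 4     | 1              | 2          |

Can someone help me with how to achieve this in pyspark?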

pyspark

Source: https://stackoverflow.com/questions/74961081/creating-rows-of-a-table-using-the-columns-of-another-table-in-pyspark

1 Answer


ubby3x7f #1

I haven't tested it or checked whether it's correct, but something like this should work:

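    from functools import reduce

    attr_df_list = []
    for column_name in df.columns:
        # Build one single-row DataFrame of statistics per source column.
        attr_df_list.append(
            df.selectExpr(
                # Quote the name so it is a string literal, not the column's values.
                f"'{column_name}' AS attribute_name",
                "COUNT(*) AS count",
                f"COUNT(DISTINCT `{column_name}`) AS distinct_count",
                # COUNT_IF is available in Spark SQL 3.0 and later.
                f"COUNT_IF(`{column_name}` IS NULL) AS null_count",
            )
        )
    # Stack the per-column rows into the final table.
    result_df = reduce(lambda df1, df2: df1.union(df2), attr_df_list)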
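
The loop above builds one aggregate query per column and unions them, so the source table may be read once per column. As a rough alternative, the sketch below computes every statistic in a single aggregation and then reshapes the one wide result row on the driver. The helper names `build_profile_exprs` and `profile_columns`, the `__distinct` / `__nulls` alias suffixes, and the `total_count` alias are made up for illustration. It assumes an existing `SparkSession` is passed in as `spark`, that column names contain no backticks, and that Spark 3.0 or later is used (for `COUNT_IF`). I have not benchmarked it.

```python
def build_profile_exprs(columns):
    """Return Spark SQL aggregate expressions: one total count, then two per column."""
    exprs = ["COUNT(*) AS total_count"]
    for c in columns:
        exprs.append(f"COUNT(DISTINCT `{c}`) AS `{c}__distinct`")
        exprs.append(f"COUNT_IF(`{c}` IS NULL) AS `{c}__nulls`")
    return exprs


def profile_columns(spark, df):
    """Aggregate df once, then turn the single wide row into one row per column."""
    wide = df.selectExpr(*build_profile_exprs(df.columns)).first()
    rows = [
        (c, wide["total_count"], wide[f"{c}__distinct"], wide[f"{c}__nulls"])
        for c in df.columns
    ]
    # Column types are inferred from the Python values (str and int).
    return spark.createDataFrame(
        rows, ["attribute_name", "count", "distinct_count", "null_count"]
    )
```

With a live session this would be called as `result_df = profile_columns(spark, df)`. Several `COUNT(DISTINCT ...)` aggregates in one query can be expensive on wide tables, so the per-column loop may still be the better choice in some cases.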

Answered 2023-01-01
