Validating PySpark DataFrame columns with the same data quality rule

wydwbb8l  posted on 2023-06-21 in Spark

I created a dummy PySpark DataFrame.

I am trying to apply the following rules:
rules = [
    {"column": "last_name", "value": "NA", "name": "Percentage of 'NA' Values in Last Name"},
    {"column": "first_name", "value": "NA", "name": "Percentage of 'NA' Values in First Name"},
]
I would like a single dictionary entry for the 'NA' rule, since it applies to both the first name and the last name, instead of listing the same rule twice:
rules = [
    {"columns": ["last_name", "first_name"], "value": "NA", "name": "Percentage of 'NA' Values in Last Name and First Name"},
    {"column": "country", "value": "USA", "name": "Percentage of 'USA' Values in Country"},
]
Here is what I have so far. Any suggestions on how to get the same results using the second set of rules?

from pyspark.sql.functions import col

percentages = []
total_count = df.count()  # total rows; the same for every rule

for rule in rules:
    column = rule["column"]
    value = rule["value"]
    name = rule["name"]
    # Number of rows where the column equals the rule's value
    count = df.filter(col(column) == value).count()
    percentage = (count / total_count) * 100
    percentages.append({"name": name, "percentage": percentage})

for result in percentages:
    print("{}: {:.2f}%".format(result["name"], result["percentage"]))

axr492tv  #1

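I think the cleanest approach is to change your rules so that every rule uses a "columns" list, even when it covers a single column: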

rules = [
    {"columns": ["last_name", "first_name"], "value": "NA", "name": "Percentage of 'NA' Values in Last Name and First Name"},
    {"columns": ["country"], "value": "USA", "name": "Percentage of 'USA' Values in Country"},
]

Then you can change your code to use this standard format:

from pyspark.sql.functions import col

percentages = []
total_count = df.count()  # total rows; computed once, not per column

for rule in rules:
    columns = rule["columns"]
    value = rule["value"]
    name = rule["name"]
    # One result per column; every column of a rule shares the rule's name
    for column in columns:
        count = df.filter(col(column) == value).count()
        percentage = (count / total_count) * 100
        percentages.append({"name": name, "percentage": percentage})

for result in percentages:
    print("{}: {:.2f}%".format(result["name"], result["percentage"]))

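Note that both columns of the first rule are printed under the same name. You can adjust this further by giving each rule a list of names, one per column, and printing them accordingly.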
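One possible reading of that adjustment, as a rough sketch: each rule carries a `names` list that runs parallel to `columns`, and the two are paired with `zip`. The `names` key and the `expand_rule` helper are hypothetical and not part of the original rules. Each resulting (column, value, name) triple could then be passed to the `df.filter(col(column) == value).count()` step shown earlier.

```python
def expand_rule(rule):
    """Yield one (column, value, name) triple per column in the rule.

    Assumes a hypothetical "names" list that runs parallel to "columns".
    """
    columns = rule["columns"]
    names = rule["names"]
    if len(columns) != len(names):
        raise ValueError("'columns' and 'names' must have the same length")
    for column, name in zip(columns, names):
        yield column, rule["value"], name


rules = [
    {
        "columns": ["last_name", "first_name"],
        "value": "NA",
        "names": [
            "Percentage of 'NA' Values in Last Name",
            "Percentage of 'NA' Values in First Name",
        ],
    },
    {
        "columns": ["country"],
        "value": "USA",
        "names": ["Percentage of 'USA' Values in Country"],
    },
]

# Flatten all rules into a single list of per-column checks
checks = [triple for rule in rules for triple in expand_rule(rule)]
for column, value, name in checks:
    print(column, value, name)
```

The length check is a design choice: `zip` would otherwise silently drop the extra entries when the two lists differ in length.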
If you would rather keep your original format, you can check which key each rule uses:

for rule in rules:
    if "columns" in rule:
        # process as multiple columns
        columns = rule["columns"]
    else:  # "column" in rule
        # process as one column
        columns = [rule["column"]]
