I have the following Spark DataFrame:
import pandas as pd

datalake_spark_dataframe_downsampled = pd.DataFrame(
{'id' : ['001', '001', '001', '001', '001', '002', '002', '002'],
'OuterSensorConnected':[0, 0, 0, 1, 0, 0, 0, 1],
'OuterHumidity':[31.784826, 32.784826, 33.784826, 43.784826, 23.784826, 54.784826, 31.784826, 31.784826],
'EnergyConsumption': [70, 70, 70, 70, 70, 70, 70, 70],
'DaysDeploymentDate': [10, 20, 21, 31, 41, 11, 19, 57],
'label': [0, 0, 1, 1, 1, 0, 0, 1]}
)
datalake_spark_dataframe_downsampled = spark.createDataFrame(datalake_spark_dataframe_downsampled)
# printSchema() of datalake_spark_dataframe_downsampled (Spark df):
root
 |-- id: string (nullable = true)
 |-- OuterSensorConnected: integer (nullable = false)
 |-- OuterHumidity: float (nullable = true)
 |-- EnergyConsumption: float (nullable = true)
 |-- DaysDeploymentDate: integer (nullable = true)
 |-- label: integer (nullable = false)
As you can see, the first id "001" has 5 rows and the second id "002" has 3 rows. What I want is to filter out the rows belonging to any id whose total count of positive labels ('1') is less than 2. Since the first id "001" has 3 positive labels (three rows with label 1), while the second id "002" has only one row with a positive label, I want to drop all rows related to id "002". So my final df should look like:
datalake_spark_dataframe_downsampled_filtered = pd.DataFrame(
{'id' : ['001', '001', '001', '001', '001'],
'OuterSensorConnected':[0, 0, 0, 1, 0],
'OuterHumidity':[31.784826, 32.784826, 33.784826, 43.784826, 23.784826],
'EnergyConsumption': [70, 70, 70, 70, 70],
'DaysDeploymentDate': [10, 20, 21, 31, 41],
'label': [0, 0, 1, 1, 1]}
)
datalake_spark_dataframe_downsampled_filtered = spark.createDataFrame(datalake_spark_dataframe_downsampled_filtered)
How can I achieve this with a spark.sql() query? Here is my attempt:
datalake_spark_dataframe_downsampled.createOrReplaceTempView("df_filtered")
spark_dataset_filtered = spark.sql("""SELECT *, count(label) AS counted_label FROM df_filtered GROUP BY id HAVING counted_label >= 2""")  # how do I count only the positive values here?
1 Answer
How about using a window:
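A minimal sketch of that window approach in the PySpark DataFrame API (the helper column name positive_label_count is an assumption; because label is already 0/1, summing it per id counts the positive labels):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Sum the 0/1 label over each id: the result equals the number of positive labels.
w = Window.partitionBy("id")

spark_dataset_filtered = (
    datalake_spark_dataframe_downsampled
    .withColumn("positive_label_count", F.sum("label").over(w))
    .filter(F.col("positive_label_count") >= 2)  # keep only ids with at least 2 positive labels
    .drop("positive_label_count")                # remove the helper column
)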
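Since the question asks specifically for spark.sql(), the same window can be written in SQL. This is a sketch, where the view name df and the subquery alias t are assumptions; if label could take values other than 0/1, SUM(CASE WHEN label = 1 THEN 1 ELSE 0 END) would count only the positives:

datalake_spark_dataframe_downsampled.createOrReplaceTempView("df")

spark_dataset_filtered = spark.sql("""
    SELECT id, OuterSensorConnected, OuterHumidity,
           EnergyConsumption, DaysDeploymentDate, label
    FROM (
        SELECT *,
               SUM(label) OVER (PARTITION BY id) AS positive_label_count
        FROM df
    ) t
    WHERE positive_label_count >= 2
""")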