What does this error mean: "pyspark.sql.utils.AnalysisException: Resolved attribute(s) id#3238 missing from userId#19..."?

j7dteeu8 asked on 2023-05-16 in Spark

I have a DataFrame cluster1_users with a column "id", which I believe is a subset of the "userId" column of another DataFrame df. I want to create a new DataFrame cluster1_df containing the subset of df whose userId values appear in cluster1_users.id, because df has other columns I need that cluster1_users does not.
df

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|85252 |19     |3.0   |822873600|
|99851 |19     |4.0   |822873600|
|99851 |32     |5.0   |822873600|
|99851 |39     |5.0   |822873600|
|124035|17     |4.0   |823185222|
|124035|12     |1.0   |823185223|
|124035|41     |4.0   |823185232|
|124035|14     |4.0   |823185239|
|46380 |39     |5.0   |823255318|
|46380 |4      |5.0   |823255319|
|46380 |17     |5.0   |823255319|
|113947|61     |3.0   |823264571|
|113947|62     |5.0   |823264576|
|113947|46     |3.0   |823264578|
|113947|48     |4.0   |823264586|
|113947|70     |3.0   |823264587|
|113947|14     |3.0   |823264592|
|113947|12     |1.0   |823264594|
|113947|19     |3.0   |823264596|
|113947|27     |3.0   |823264613|
+------+-------+------+---------+
only showing top 20 rows

transformed

+---+--------------------+----------+
| id|            features|prediction|
+---+--------------------+----------+
| 10|[0.1974308, 0.359...|         1|
| 40|[0.72038186, 0.11...|         5|
| 70|[0.09885423, 0.18...|        10|
| 80|[0.36078414, 0.61...|         5|
|100|[0.3984223, 0.304...|        15|
|120|[0.36285698, 0.53...|        12|
|130|[0.33797824, 0.53...|        20|
|140|[0.42769185, 0.38...|         9|
|160|[0.35105795, 0.43...|         7|
|170|[0.36995363, 0.55...|         9|
|200|[0.3042391, 0.371...|         1|
|210|[0.6970617, 0.799...|         4|
|230|[0.5531783, 0.731...|         8|
|270|[0.3898772, 0.653...|         9|
|290|[0.19119799, 0.29...|        10|
|300|[0.44038358, 0.51...|        24|
|310|[0.53891087, 0.56...|        17|
|330|[0.32053632, 0.46...|         6|
|360|[0.43974763, 0.55...|         6|
|380|[0.29152408, 0.51...|         1|
+---+--------------------+----------+
only showing top 20 rows

Using .contains():

cluster1_pred = transformed.groupBy(["prediction"]).count().sort("count", ascending=False).first().prediction  # = 24
cluster1_users = transformed.filter(transformed.prediction == cluster1_pred)["id"]
cluster1_df = df.filter(cluster1_users.contains(df.userId)).cache()

I then got this error:

pyspark.sql.utils.AnalysisException: Resolved attribute(s) id#3238 missing from userId#19,movieId#20,rating#21,timestamp#22 in operator !Filter Contains(cast(id#3238 as string), cast(userId#19 as string)$
!Filter Contains(cast(id#3238 as string), cast(userId#19 as string))
+- Sample 0.0, 0.2, false, 220249759
   +- Project [cast(split(userId,movieId,rating,timestamp#17, ,, -1)[0] as int) AS userId#19, cast(split(userId,movieId,rating,timestamp#17, ,, -1)[1] as int) AS movieId#20, cast(split(userId,movieId,rat$
      +- Relation [userId,movieId,rating,timestamp#17] csv

What is this error about, and how do I fix it?


xzv2uavs1#

I think you should use contains like this:

cluster1_df = df.filter(cluster1_users.id.contains(df.userId)).cache()

contains is a method on Column, not on DataFrame.
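Even so, a filter condition that references a column belonging to a different DataFrame (here, cluster1_users comes from transformed, while the filter runs on df) can raise the same "Resolved attribute(s) missing" error, because Spark cannot resolve an attribute from an unrelated plan. A common alternative is a left-semi join; here is a minimal sketch, assuming transformed and df are defined as in the question:

# Keep cluster1_users as a one-column DataFrame instead of a Column.
cluster1_pred = transformed.groupBy("prediction").count().sort("count", ascending=False).first().prediction
cluster1_users = transformed.filter(transformed.prediction == cluster1_pred).select("id")

# A left-semi join keeps only the rows of df whose userId has a match
# in cluster1_users; no columns from cluster1_users appear in the result.
cluster1_df = df.join(cluster1_users, df.userId == cluster1_users.id, how="left_semi").cache()

If the cluster is small, collecting the ids into a Python list and filtering with df.userId.isin(...) would also work.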
