I have a DataFrame cluster1_users with a column "id", which I believe is a subset of the "userId" column of another DataFrame df. I want to create a new DataFrame cluster1_df from the subset of df whose userId values appear in cluster1_users.id, because df has other columns I need that cluster1_users does not.
df:
+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|85252 |19 |3.0 |822873600|
|99851 |19 |4.0 |822873600|
|99851 |32 |5.0 |822873600|
|99851 |39 |5.0 |822873600|
|124035|17 |4.0 |823185222|
|124035|12 |1.0 |823185223|
|124035|41 |4.0 |823185232|
|124035|14 |4.0 |823185239|
|46380 |39 |5.0 |823255318|
|46380 |4 |5.0 |823255319|
|46380 |17 |5.0 |823255319|
|113947|61 |3.0 |823264571|
|113947|62 |5.0 |823264576|
|113947|46 |3.0 |823264578|
|113947|48 |4.0 |823264586|
|113947|70 |3.0 |823264587|
|113947|14 |3.0 |823264592|
|113947|12 |1.0 |823264594|
|113947|19 |3.0 |823264596|
|113947|27 |3.0 |823264613|
+------+-------+------+---------+
only showing top 20 rows
transformed:
+---+--------------------+----------+
| id| features|prediction|
+---+--------------------+----------+
| 10|[0.1974308, 0.359...| 1|
| 40|[0.72038186, 0.11...| 5|
| 70|[0.09885423, 0.18...| 10|
| 80|[0.36078414, 0.61...| 5|
|100|[0.3984223, 0.304...| 15|
|120|[0.36285698, 0.53...| 12|
|130|[0.33797824, 0.53...| 20|
|140|[0.42769185, 0.38...| 9|
|160|[0.35105795, 0.43...| 7|
|170|[0.36995363, 0.55...| 9|
|200|[0.3042391, 0.371...| 1|
|210|[0.6970617, 0.799...| 4|
|230|[0.5531783, 0.731...| 8|
|270|[0.3898772, 0.653...| 9|
|290|[0.19119799, 0.29...| 10|
|300|[0.44038358, 0.51...| 24|
|310|[0.53891087, 0.56...| 17|
|330|[0.32053632, 0.46...| 6|
|360|[0.43974763, 0.55...| 6|
|380|[0.29152408, 0.51...| 1|
+---+--------------------+----------+
only showing top 20 rows
Using .contains():
cluster1_pred = transformed.groupBy(["prediction"]).count().sort("count", ascending=False).first().prediction #=24
cluster1_users = transformed.filter(transformed.prediction==cluster1_pred)["id"]
cluster1_df = df.filter(cluster1_users.contains(df.userId)).cache()
Then I got this error:
pyspark.sql.utils.AnalysisException: Resolved attribute(s) id#3238 missing from userId#19,movieId#20,rating#21,timestamp#22 in operator !Filter Contains(cast(id#3238 as string), cast(userId#19 as string)$
!Filter Contains(cast(id#3238 as string), cast(userId#19 as string))
+- Sample 0.0, 0.2, false, 220249759
+- Project [cast(split(userId,movieId,rating,timestamp#17, ,, -1)[0] as int) AS userId#19, cast(split(userId,movieId,rating,timestamp#17, ,, -1)[1] as int) AS movieId#20, cast(split(userId,movieId,rat$
+- Relation [userId,movieId,rating,timestamp#17] csv
What is this error about, and how can I fix it?
1 Answer
I think you should do this differently: contains is a method of Column, not of DataFrame.
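A minimal sketch of one way to do it, assuming the df, transformed and cluster1_pred names from the question (not tested against your data): keep cluster1_users as a DataFrame by using select("id") instead of ["id"], then either join it back to df on userId == id, or collect the ids to the driver and filter with Column.isin.

# Most frequent cluster label, as in the question.
cluster1_pred = (transformed.groupBy("prediction").count()
                 .sort("count", ascending=False)
                 .first()["prediction"])

# Keep the matching ids as a DataFrame (select, not []), then inner-join
# so that only rows of df whose userId appears in the cluster survive.
cluster1_users = transformed.filter(transformed.prediction == cluster1_pred).select("id")
cluster1_df = df.join(cluster1_users, df.userId == cluster1_users.id, "inner").drop("id").cache()

# Alternative, fine when the cluster is small: collect the ids to the
# driver and filter with Column.isin.
ids = [row.id for row in cluster1_users.collect()]
cluster1_df_alt = df.filter(df.userId.isin(ids)).cache()

The join keeps everything distributed, so it scales to large clusters; isin is convenient when the list of ids comfortably fits on the driver.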