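I am trying to build a model in PySpark, and I am clearly doing a lot of things wrong.
What I need to do:
I have a list of products. I have to compare each product against a corpus that returns the most similar products, and then filter the results by product type.
Here is an example for one product:
Step 1: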
product_id = 'HZM-1914'
type = get_type(product_id) #takes 8s. I'll show this function below
similar_list = [(product_id , 1.0)] + model.wv.most_similar(positive=product_id, topn=5) #takes 0.04s
#the similar list shows the product_id and the similarity, it looks like this:
[('HZM-1914', 1.0), ('COL-8430', 0.9951900243759155), ('D23-2178', 0.9946870803833008), ('J96-0611', 0.9943861365318298), ('COL-7719', 0.9930003881454468), ('HZM-1912', 0.9926838874816895)]
Step 2:
# I want to filter by type, so I turn the list into a DataFrame; this is the slowest part (and probably what is wrong)
rdd = sc.parallelize([(id, get_type(id), similarity) for (id, similarity) in similar_list]) #takes 55s
products = rdd.map(lambda x: Row(name=str(x[0]), type=str(x[1]), similarity=float(x[2]))) #takes 0.02s
df_recs = sqlContext.createDataFrame(products) #takes 0.02s
df_recs.show() #takes 0.43s
+--------+----------------+------------------+
| name| type| similarity |
+--------+----------------+------------------+
|HZM-1914| Chuteiras| 1.0|
|COL-8430| Chuteiras|0.9951900243759155|
|D23-2178| Bolas|0.9946870803833008|
|J96-0611|Luvas de Goleiro|0.9943861365318298|
|COL-7719| Bolas|0.9930003881454468|
|HZM-1912| Chuteiras|0.9926838874816895|
+--------+----------------+------------------+
Step 3:
# Then comes the filter:
df_recs = df_recs.filter(df_recs.type == type) #takes 0.09s
df_recs.show() #takes 0.5s
+--------+---------+------------------+
| name| type| similarity |
+--------+---------+------------------+
|HZM-1914|Chuteiras| 1.0|
|COL-8430|Chuteiras|0.9951900243759155|
|HZM-1912|Chuteiras|0.9926838874816895|
+--------+---------+------------------+
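The get_type() function is:
from pyspark.sql.functions import col
def get_type(product_id):
    # Each call runs a Spark job: filter the whole DataFrame, then collect one row
    return df.filter(col("ID") == product_id).select("TYPE").collect()[0]["TYPE"]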
The DataFrame that get_type() reads the ID and type from is:
+----------+--------------------+--------------------+
|ID | NAME | TYPE |
+----------+--------------------+--------------------+
| 7983 |SNEAKERS 01 | Sneakers|
| 7034 |SHIRT 13 | Shirt|
| 3360 |SHORTS 15 | Short|
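The get_type() function and the DataFrame creation are the main problems. If you have any idea how to make this work better, it would be very helpful. I come from Python and I am struggling a lot with PySpark. Thanks a lot in advance.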
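The timings above suggest that most of the cost comes from calling get_type() once per product, since each call launches its own Spark job. One possible direction, sketched here under the assumption that the ID/TYPE table is small enough to collect to the driver once, is to build a plain dict and do the type filter in ordinary Python. The helper names `build_type_lookup` and `filter_same_type` are hypothetical, and the sample rows reuse the values shown in the output tables above.

```python
# Sketch only: replace one Spark job per product with a single in-memory lookup.
# Assumption: the ID/TYPE table fits comfortably in driver memory.

def build_type_lookup(rows):
    """Build {product_id: product_type} from an iterable of (id, type) pairs."""
    return {product_id: product_type for product_id, product_type in rows}


def filter_same_type(similar_list, type_lookup, product_id):
    """Keep the (id, similarity) pairs whose type matches product_id's type."""
    wanted = type_lookup[product_id]
    return [
        (pid, wanted, sim)
        for pid, sim in similar_list
        if type_lookup.get(pid) == wanted
    ]


# With Spark, the pairs would come from one collect instead of many, e.g.
#   type_lookup = build_type_lookup(df.select("ID", "TYPE").collect())
# (assumes `df` is the ID/NAME/TYPE DataFrame; Row objects unpack like tuples).
# Here the values from the question stand in for that result.
sample_rows = [
    ("HZM-1914", "Chuteiras"),
    ("COL-8430", "Chuteiras"),
    ("D23-2178", "Bolas"),
    ("J96-0611", "Luvas de Goleiro"),
    ("COL-7719", "Bolas"),
    ("HZM-1912", "Chuteiras"),
]
type_lookup = build_type_lookup(sample_rows)

similar_list = [
    ("HZM-1914", 1.0),
    ("COL-8430", 0.9951900243759155),
    ("D23-2178", 0.9946870803833008),
    ("J96-0611", 0.9943861365318298),
    ("COL-7719", 0.9930003881454468),
    ("HZM-1912", 0.9926838874816895),
]

recs = filter_same_type(similar_list, type_lookup, "HZM-1914")
for name, product_type, similarity in recs:
    print(name, product_type, similarity)
```

For six rows there is little reason to go back through sc.parallelize and createDataFrame at all. If the ID/TYPE table were too large to collect, joining a small DataFrame of the similar products against `df` on ID would be the alternative to consider.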