Looking for better performance on PySpark

Posted by nxowjjhe on 2021-05-27 in Spark

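I am trying to build a model in PySpark, and I am clearly doing a lot of things wrong.
What I need to do:
I have a list of products. For each product, I query a corpus that returns the most similar products, and then I have to filter those results by product type.
Here is an example for one product:
Step 1: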

product_id = 'HZM-1914'
type = get_type(product_id)  # takes 8s; get_type() is shown below
similar_list = [(product_id, 1.0)] + model.wv.most_similar(positive=[product_id], topn=5)  # takes 0.04s
# similar_list holds (product_id, similarity) pairs and looks like this:
# [('HZM-1914', 1.0), ('COL-8430', 0.9951900243759155), ('D23-2178', 0.9946870803833008), ('J96-0611', 0.9943861365318298), ('COL-7719', 0.9930003881454468), ('HZM-1912', 0.9926838874816895)]

Step 2:


# I want to filter by type, so I turn the list into a DataFrame. This is the slowest part (and probably where I am going wrong).
from pyspark.sql import Row

rdd = sc.parallelize([(pid, get_type(pid), similarity) for (pid, similarity) in similar_list])  # takes 55s
products = rdd.map(lambda x: Row(name=str(x[0]), type=str(x[1]), similarity=float(x[2])))  # takes 0.02s
df_recs = sqlContext.createDataFrame(products)  # takes 0.02s
df_recs.show()  # takes 0.43s
+--------+----------------+------------------+
|    name|            type|        similarity|
+--------+----------------+------------------+
|HZM-1914|       Chuteiras|               1.0|
|COL-8430|       Chuteiras|0.9951900243759155|
|D23-2178|           Bolas|0.9946870803833008|
|J96-0611|Luvas de Goleiro|0.9943861365318298|
|COL-7719|           Bolas|0.9930003881454468|
|HZM-1912|       Chuteiras|0.9926838874816895|
+--------+----------------+------------------+

Step 3:


# Now apply the filter:

df_recs = df_recs.filter(df_recs.type == type)  # takes 0.09s
df_recs.show()  # takes 0.5s
+--------+---------+------------------+
|    name|     type|        similarity|
+--------+---------+------------------+
|HZM-1914|Chuteiras|               1.0|
|COL-8430|Chuteiras|0.9951900243759155|
|HZM-1912|Chuteiras|0.9926838874816895|
+--------+---------+------------------+
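
Because the candidate list has only six entries, the filter in step 3 could also be done in plain Python, without building an RDD or DataFrame. This is a minimal sketch under assumptions: `type_by_id` is a hypothetical dict from product id to type (values copied from the df_recs output above), and `filter_by_type` is an illustrative name, not part of the original code.

```python
# Hypothetical lookup table; values copied from the df_recs output above.
type_by_id = {
    "HZM-1914": "Chuteiras",
    "COL-8430": "Chuteiras",
    "D23-2178": "Bolas",
    "J96-0611": "Luvas de Goleiro",
    "COL-7719": "Bolas",
    "HZM-1912": "Chuteiras",
}

# (product_id, similarity) pairs, as shown in step 1.
similar_list = [
    ("HZM-1914", 1.0),
    ("COL-8430", 0.9951900243759155),
    ("D23-2178", 0.9946870803833008),
    ("J96-0611", 0.9943861365318298),
    ("COL-7719", 0.9930003881454468),
    ("HZM-1912", 0.9926838874816895),
]


def filter_by_type(similar, lookup, wanted_type):
    """Keep the (id, similarity) pairs whose type in `lookup` equals wanted_type."""
    return [(pid, sim) for pid, sim in similar if lookup.get(pid) == wanted_type]


# Keep only the products that share the type of the query product.
same_type = filter_by_type(similar_list, type_by_id, type_by_id["HZM-1914"])
print(same_type)
```

Ids missing from the lookup are dropped rather than raising an error, since `dict.get` returns None for them.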

The get_type() function is:

from pyspark.sql.functions import col

def get_type(product_id):
    return df.filter(col("ID") == product_id).select("TYPE").collect()[0]["TYPE"]

The DataFrame that get_type() looks up the id and type in is:

+----------+--------------------+--------------------+
|ID        |        NAME        |           TYPE     |
+----------+--------------------+--------------------+
|    7983  |SNEAKERS 01         |            Sneakers|
|    7034  |SHIRT 13            |               Shirt|
|    3360  |SHORTS 15           |               Short|
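
Every get_type() call filters and collects this DataFrame, so each call runs its own Spark job. That fits the timings above: about 8s for one call and 55s for six. One possible alternative, sketched here under assumptions, is to collect the ID/TYPE pairs once and look types up in a dict. `build_type_lookup` is a hypothetical helper, and the approach assumes the ID/TYPE table is small enough to fit in driver memory. Plain dicts stand in for Spark Row objects so the sketch runs without a cluster.

```python
def build_type_lookup(rows):
    """Build an {ID: TYPE} dict from row-like objects indexable by column name."""
    return {row["ID"]: row["TYPE"] for row in rows}


# With Spark, the rows would come from a single collect (assumption: df is the
# ID/NAME/TYPE DataFrame shown above):
#   rows = df.select("ID", "TYPE").collect()
# Plain dicts stand in for Row objects here. Values are from the sample table;
# IDs are written as strings because the actual column type is not stated.
rows = [
    {"ID": "7983", "TYPE": "Sneakers"},
    {"ID": "7034", "TYPE": "Shirt"},
    {"ID": "3360", "TYPE": "Short"},
]

type_by_id = build_type_lookup(rows)

# Each later lookup is a dict access instead of a Spark job.
print(type_by_id["7983"])
```

If the table were too large to collect, a DataFrame join on the ID column would be the usual alternative; that variant is not shown here.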

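The get_type() function and the DataFrame creation are the main problems. Any ideas on how to make this work better would be very helpful. I come from plain Python and have been struggling a lot with PySpark. Thanks in advance.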

apache-spark pyspark apache-spark-sql

Source: https://stackoverflow.com/questions/63746694/looking-for-a-better-performance-on-pyspark
