PySpark: cosine similarity in Spark

tcbh2hod · posted 2022-12-28 in Spark
Follow (0) | Answers (1) | Views (197)

I'm converting my dataset's strings to arrays and then to vectors, like this:

    from pyspark.ml.feature import HashingTF, IDF

    # Create a HashingTF object to convert the "combined_features" column to feature vectors
    hashing_tf = HashingTF(inputCol="combined_features", outputCol="raw_features")
    # Transform the DataFrame to create the raw feature vectors
    df = hashing_tf.transform(combarray)
    # Create an IDF object to compute the inverse document frequency for the raw feature vectors
    idf = IDF(inputCol="raw_features", outputCol="features")
    # Fit the IDF on the DataFrame and transform it to create the final feature vectors
    df = idf.fit(df).transform(df)
    # View the resulting feature vectors
    df.select("features").show(truncate=False)
Output:
    +-------------------------------------+
    |features                             |
    +-------------------------------------+
    |(262144,[243082],[7.785305182539862])|
    |(262144,[90558],[7.785305182539862]) |
    |(262144,[9277],[7.785305182539862])  |
    |(262144,[55279],[7.785305182539862]) |
    |(262144,[114098],[7.785305182539862])|
    |(262144,[106982],[7.785305182539862])|
    |(262144,[248513],[7.785305182539862])|
    +-------------------------------------+
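
(Side note: HashingTF requires its input column to be an array of tokens, not a raw string, so the string column has to be tokenized first. A minimal sketch, assuming `combarray` still holds the raw string column `combined_features`:)

    from pyspark.ml.feature import Tokenizer, HashingTF, IDF

    # Split the raw string into an array of lowercase tokens;
    # HashingTF only accepts array<string> input
    tokenizer = Tokenizer(inputCol="combined_features", outputCol="words")
    words = tokenizer.transform(combarray)

    hashing_tf = HashingTF(inputCol="words", outputCol="raw_features")
    df = hashing_tf.transform(words)
    df = IDF(inputCol="raw_features", outputCol="features").fit(df).transform(df)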

How can I compute cosine similarity between my feature vectors in PySpark?
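
(For reference: the cosine similarity of two vectors u and v is their dot product divided by the product of their L2 norms, cos_sim(u, v) = u·v / (‖u‖₂ · ‖v‖₂); the answer below implements exactly this as a UDF over the TF-IDF vectors.)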

Update:

I combined the columns:

    from pyspark.sql.functions import concat, col

    # Concatenate the text columns into a single string column
    # (note: concat joins the values with no separator)
    selected_feature = selected_feature.withColumn(
        'combined_features',
        concat(col('genres'),
               col('keywords'),
               col('tagline'),
               col('cast'),
               col('director')))
    combine = selected_feature.select("combined_features")

The data looks like this:

    +--------------------------------------------------+
    |                                 combined_features|
    +--------------------------------------------------+
    |Action Adventure Fantasy Science Fictionculture...|
    |Adventure Fantasy Actionocean drug abuse exotic...|
    |Action Adventure Crimespy based on novel secret...|
    +--------------------------------------------------+
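
(Note that `concat` joins the columns with no separator, which is why words run together in the sample above, e.g. "Fictionculture" and "Actionocean". A minimal sketch of a fix using `concat_ws`, assuming the same column names:)

    from pyspark.sql.functions import concat_ws, col

    # concat_ws inserts the given separator (a space here) between columns,
    # so the last word of one column no longer fuses with the next column
    selected_feature = selected_feature.withColumn(
        'combined_features',
        concat_ws(' ', col('genres'), col('keywords'),
                  col('tagline'), col('cast'), col('director')))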

I wrote the same code as in the answer and I still get the same error as mentioned in the comments:

    import pyspark.sql.functions as F
    from pyspark.ml.feature import RegexTokenizer, CountVectorizer, IDF
    from pyspark.ml.feature import HashingTF, Tokenizer
    from sklearn.pipeline import Pipeline

    regex_tokenizer = RegexTokenizer(gaps=False, pattern="\w+", inputCol="combined_features", outputCol="tokens")
    count_vectorizer = CountVectorizer(inputCol="tokens", outputCol="tf")
    idf = IDF(inputCol="tf", outputCol="idf")
    tf_idf_pipeline = Pipeline(stages=[regex_tokenizer, count_vectorizer, idf])
    combine = tf_idf_pipeline.fit(combine).transform(combine).drop("news", "tokens", "tf")
    combine = combarray.crossJoin(combine.withColumnRenamed("idf", "idf2"))

    @F.udf(returnType=FloatType())
    def cos_sim(u, v):
        return float(u.dot(v) / (u.norm(2) * v.norm(2)))

    df.withColumn("cos_sim", cos_sim(F.col("idf"), F.col("idf2")))

2wnc66cl (answer 1):

Your code needs several corrections:

  • You are importing the wrong Pipeline. The correct import is from pyspark.ml import Pipeline.
  • Several DataFrames are referenced but never defined; I assume they are all meant to be versions of the same DataFrame (e.g. df, combarray).

Here is the working code:

    import pyspark.sql.functions as F
    from pyspark.sql.types import FloatType
    from pyspark.ml.feature import RegexTokenizer, CountVectorizer, IDF
    from pyspark.ml import Pipeline

    # Tokenize the combined text and compute TF-IDF vectors
    regex_tokenizer = RegexTokenizer(gaps=False, pattern=r"\w+", inputCol="combined_features", outputCol="tokens")
    count_vectorizer = CountVectorizer(inputCol="tokens", outputCol="tf")
    idf = IDF(inputCol="tf", outputCol="idf")
    tf_idf_pipeline = Pipeline(stages=[regex_tokenizer, count_vectorizer, idf])
    combine = tf_idf_pipeline.fit(combine).transform(combine).drop("tokens", "tf")

    # Pair every row with every other row (including itself)
    combine = combine.crossJoin(combine.withColumnRenamed("idf", "idf2"))

    # Cosine similarity: dot product divided by the product of the L2 norms
    @F.udf(returnType=FloatType())
    def cos_sim(u, v):
        return float(u.dot(v) / (u.norm(2) * v.norm(2)))

    combine = combine.withColumn("cos_sim", cos_sim(F.col("idf"), F.col("idf2")))
    combine.drop("idf", "idf2").show(truncate=False)
Output:

    +-----------------------------------------------+-----------------------------------------------+-----------+
    |combined_features                              |combined_features                              |cos_sim    |
    +-----------------------------------------------+-----------------------------------------------+-----------+
    |Action Adventure Fantasy Science Fictionculture|Action Adventure Fantasy Science Fictionculture|1.0        |
    |Action Adventure Fantasy Science Fictionculture|Adventure Fantasy Actionocean drug abuse exotic|0.05507607 |
    |Action Adventure Fantasy Science Fictionculture|Action Adventure Crimespy based on novel secret|0.049466185|
    |Adventure Fantasy Actionocean drug abuse exotic|Action Adventure Fantasy Science Fictionculture|0.05507607 |
    |Adventure Fantasy Actionocean drug abuse exotic|Adventure Fantasy Actionocean drug abuse exotic|1.0        |
    |Adventure Fantasy Actionocean drug abuse exotic|Action Adventure Crimespy based on novel secret|0.0        |
    |Action Adventure Crimespy based on novel secret|Action Adventure Fantasy Science Fictionculture|0.049466185|
    |Action Adventure Crimespy based on novel secret|Adventure Fantasy Actionocean drug abuse exotic|0.0        |
    |Action Adventure Crimespy based on novel secret|Action Adventure Crimespy based on novel secret|1.0        |
    +-----------------------------------------------+-----------------------------------------------+-----------+
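
As a side note, an equivalent approach is to L2-normalize the TF-IDF vectors once up front, after which cosine similarity reduces to a plain dot product. A sketch assuming `combine` as it is right after the pipeline transform (before the cross join), using pyspark.ml.feature.Normalizer:

    from pyspark.ml.feature import Normalizer
    import pyspark.sql.functions as F
    from pyspark.sql.types import FloatType

    # For unit-length vectors, cos_sim(u, v) == u.dot(v), so the norms
    # do not need to be recomputed for every pair in the cross join
    normed = Normalizer(inputCol="idf", outputCol="norm", p=2.0).transform(combine)

    @F.udf(returnType=FloatType())
    def dot(u, v):
        return float(u.dot(v))

    pairs = normed.crossJoin(normed.select(F.col("norm").alias("norm2")))
    pairs = pairs.withColumn("cos_sim", dot(F.col("norm"), F.col("norm2")))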

Sample dataset used:

    combine = spark.createDataFrame(
        data=[["Action Adventure Fantasy Science Fictionculture"],
              ["Adventure Fantasy Actionocean drug abuse exotic"],
              ["Action Adventure Crimespy based on novel secret"]],
        schema=["combined_features"])
    combine.show(truncate=False)

    +-----------------------------------------------+
    |combined_features                              |
    +-----------------------------------------------+
    |Action Adventure Fantasy Science Fictionculture|
    |Adventure Fantasy Actionocean drug abuse exotic|
    |Action Adventure Crimespy based on novel secret|
    +-----------------------------------------------+
