Spark ML CountVectorizer output explanation

Posted by 8mmmxcuj on 2021-07-09 in Spark

Please help me understand the output of the Spark ML CountVectorizer, and suggest which documentation explains it.

import org.apache.spark.ml.feature.CountVectorizer

val cv = new CountVectorizer()
  .setInputCol("Tokens")
  .setOutputCol("Frequencies")
  .setVocabSize(5000)
  .setMinTF(1)
  .setMinDF(2)
val fittedCV = cv.fit(tokenDF.select("Tokens"))
fittedCV.transform(tokenDF.select("Tokens")).show(false)

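2374 should be the number of terms (words) in the dictionary. What is "[2,6,328,548,1234]"?
Are they the indices of the words "[airline, bag, vintage, world, champion]" in the dictionary? If so, why does the same word "airline" have a different index "0" in the second row?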

+------------------------------------------+----------------------------------------------------------------+
|Tokens                                    |Frequencies                                                     |
+------------------------------------------+----------------------------------------------------------------+
...
|[airline, bag, vintage, world, champion]  |(2374,[2,6,328,548,1234],[1.0,1.0,1.0,1.0,1.0])                 |
|[airline, bag, vintage, jet, set, brown]  |(2374,[0,2,6,328,405,620],[1.0,1.0,1.0,1.0,1.0,1.0])            |
+------------------------------------------+----------------------------------------------------------------+
  [1]: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer

apache-spark-mllib

Source: https://stackoverflow.com/questions/66948001/what-are-each-of-the-indices-in-the-list-returned-in-countvectorizermodel

1 answer

tvz2xvvm1#

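There is some documentation that explains the basics. However, it is pretty bare.
Yes, the numbers are the indices of the words in the vocabulary. However, the order in the Frequencies vector does not correspond to the order in the Tokens array. airline, bag and vintage appear in both rows, so they correspond to the indices [2,6,328], but you cannot rely on them being listed in the same order as the tokens. Index 0 in the second row therefore belongs to one of the other tokens (jet, set or brown), not to airline.
The row datatype is SparseVector. The number in front is the vector size, the first array holds the indices, and the second array holds the values.
For example: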

vector[328] 
   => 1.0
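To make the sparse layout concrete, here is a minimal plain-Python sketch of how a (size, indices, values) triple can be read. The helper name `sparse_get` is made up for illustration and is not part of Spark; it only mimics what indexing a SparseVector returns.

```python
def sparse_get(size, indices, values, i):
    """Return the value at position i of a sparse vector given as (size, indices, values)."""
    if not 0 <= i < size:
        raise IndexError(i)
    # Only non-zero positions are stored; every other position is implicitly 0.0.
    for idx, val in zip(indices, values):
        if idx == i:
            return val
    return 0.0

# First row of the output shown in the question.
row = (2374, [2, 6, 328, 548, 1234], [1.0, 1.0, 1.0, 1.0, 1.0])

print(sparse_get(*row, 328))  # 1.0 (stored entry)
print(sparse_get(*row, 0))    # 0.0 (index 0 is not stored for this row)
```

Index 0 is simply absent from the first row, which is why it only shows up in the second one.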

The mapping could look like this:

vocabulary
airline 328
bag 6
vintage 2
Frequencies
2374, [2, 6, 328], [99, 5, 7]
# counts
vintage x 99
bag x 5
airline x 7
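The following toy sketch, in plain Python rather than Spark, shows why the index order differs from the token order. It assumes the vocabulary is ranked by term frequency across the corpus, which is how the CountVectorizer documentation describes it. The alphabetical tie-break, the helper names `vectorize` and `decode`, and the omission of `minDF` and `vocabSize` are simplifications of this sketch, so the resulting indices will not match the real model's.

```python
from collections import Counter

docs = [
    ["airline", "bag", "vintage", "world", "champion"],
    ["airline", "bag", "vintage", "jet", "set", "brown"],
]

# Assumption of this sketch: rank terms by total count, ties broken alphabetically.
totals = Counter(token for doc in docs for token in doc)
vocabulary = sorted(totals, key=lambda t: (-totals[t], t))
index_of = {token: i for i, token in enumerate(vocabulary)}

def vectorize(doc):
    """Return (size, indices, values) with indices sorted ascending, like a sparse vector."""
    counts = Counter(index_of[token] for token in doc)
    indices = sorted(counts)
    values = [float(counts[i]) for i in indices]
    return len(vocabulary), indices, values

def decode(vec):
    """Map a (size, indices, values) triple back to {token: count} via the vocabulary."""
    _, indices, values = vec
    return {vocabulary[i]: v for i, v in zip(indices, values)}

print(vectorize(docs[1]))          # (8, [0, 1, 2, 3, 5, 6], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
print(decode(vectorize(docs[1])))
```

In the second document "brown" is the last token, yet its index 3 is listed before those of "jet" and "set", because the indices are sorted numerically and carry no information about token position.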

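To get the words back, you can do a lookup in the vocabulary. The vocabulary needs to be broadcast to the workers. You will most likely also want to explode the counts per document into separate rows.
Here is a Python snippet that uses UDFs to extract the 25 most frequent words per document into separate rows and then computes the mean count for each word: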

import pyspark.sql.types as T
import pyspark.sql.functions as F

# `sc` is the active SparkContext, `fittedCV` the fitted CountVectorizerModel
vocabulary = sc.broadcast(fittedCV.vocabulary)

def _top_scores(v):
    # pair each stored index of the sparse vector (v) with its count
    # `.tolist()` converts the numpy values to plain Python int/float;
    # in Scala they would already be Int/Double
    counts = list(zip(v.indices.tolist(), v.values.tolist()))
    # => [(2, 30.0), (362, 40.0)]
    # return the 25 entries with the highest counts
    counts = sorted(counts, reverse=True, key=lambda x: x[1])
    return counts[:25]

top_scores = F.udf(_top_scores, T.ArrayType(T.StructType().add('i', T.IntegerType()).add('count', T.DoubleType())))

def _vecToWord(i):
    # look up the token for a vocabulary index in the broadcast vocabulary
    return vocabulary.value[i]

vec_to_word = F.udf(_vecToWord, T.StringType())

# one row per (document, top word) pair
res = df.withColumn('word_count', F.explode(top_scores('Frequencies')))
=>
+-----+-----+----------+ 
doc_id, ..., word_count
             (i,  count)
+-----+-----+----------+
4711, ...,   (2, 30.0)
4711, ...,   (362, 40.0)
+-----+-----+----------+
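# mean count per vocabulary index, highest mean first
res = res \
    .groupBy('word_count.i') \
    .agg(F.avg('word_count.count').alias('mean')) \
    .orderBy('mean', ascending=False)
# translate each vocabulary index back into its token
res = res.withColumn('token', vec_to_word('i'))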
=>
+---+---------+----------+ 
 i,   token,    mean
+---+---------+----------+ 
 2,   vintage,  15
 328, airline,  30  
+---+---------+----------+
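To check the logic without a cluster, the same two steps (top N per document, then mean per index) can be sketched in plain Python. The input data and the function names `top_scores` and `mean_per_index` are hypothetical and only mirror the shape of the snippet above.

```python
from collections import defaultdict

def top_scores(indices, values, n=25):
    """Return the n (index, count) pairs with the highest counts for one document."""
    pairs = sorted(zip(indices, values), key=lambda p: p[1], reverse=True)
    return pairs[:n]

def mean_per_index(docs, n=25):
    """Average the counts per vocabulary index over the documents where it is in the top n."""
    sums = defaultdict(float)
    seen = defaultdict(int)
    for indices, values in docs:
        for i, count in top_scores(indices, values, n):
            sums[i] += count
            seen[i] += 1
    means = {i: sums[i] / seen[i] for i in sums}
    # Highest mean first, like orderBy('mean', ascending=False).
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical sparse rows given as (indices, values).
docs = [([2, 362], [30.0, 40.0]), ([2, 328], [10.0, 5.0])]
print(mean_per_index(docs))  # [(362, 40.0), (2, 20.0), (328, 5.0)]
```

Note that the mean is taken only over documents in which the word made it into the top N, not over all documents, which is also how the groupBy on the exploded rows behaves.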
