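I have the following question about a tf-idf measure. My goal is to get the top 10 words for each date (top 10 for 20160401, top 10 for 20160402, and so on), but my query returns more than 10 rows per date. As the output below shows, the row numbering is wrong: for each date there should be exactly one 1, one 2, one 3, and so on, up to at most 10 rows per date.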
date word rowNumber
20160401 word 1
20160401 word 2
..
20160401 word 10
-------------------
20160402 word 1
20160402 word 2
..
20160402 word 10
Below is my query for the tf-idf measure. The newsdetails table has the following columns and values:
date word wordCount
20160401 yürürlük 1
20160401 mülteci 1
20160401 anlaşma 2
20160401 pazartesi 1
20160401 itibaren 1
20160401 uygula 2
-------------------
SELECT t.newsCreatedTime, t.word, t.tf_idf as tweetRatio, rn
FROM (
    select w.newsCreatedTime, w.word, w.wordCount, t1.topic_sum,
           w.wordCount/t1.topic_sum as tf,
           t2.idf,
           (w.wordCount/t1.topic_sum)*(t2.idf) as tf_idf,
           (@rn := if(@t = w.newsCreatedTime, @rn + 1,
                      if(@t := w.newsCreatedTime, 1, 1)
                     )
           ) as rn
    FROM newsdetails w
    JOIN (
        select newsCreatedTime, sum(wordCount) as topic_sum
        from newsdetails
        where trim(word)<>'limit' and length(trim(word))>3 and newsCreatedTime>=20160401
        group by newsCreatedTime
    ) t1
      on w.newsCreatedTime=t1.newsCreatedTime
     and trim(w.word)<>'limit' and length(trim(w.word))>3 and w.newsCreatedTime>=20160401
    join (
        select w.word, log10(t_cnts.cnts/count(*)) as idf
        from newsdetails w,
             (select count(distinct newsCreatedTime) as cnts
              from newsdetails
              where trim(word)<>'limit' and length(trim(word))>3 and newsCreatedTime>=20160401) t_cnts
        group by w.word
    ) t2
      on w.word=t2.word
    CROSS JOIN (SELECT @t := '', @rn := 0) params
    ORDER BY tf_idf desc
) t
WHERE rn <=10
The output is:
Date word ratio rowNumber
20160401 kadın -0.0005375613007471833 1
20160401 beyaz -0.00006167450024777626 1
20160401 alışveriş -0.000015563777946653284 1
20160401 ösym 0 1
20160401 yürürlük 0.00004586823255826781 1
20160401 haber -0.0003304889014365445 2
20160401 taraf -0.00019041987363584725 2
20160401 merkezi -0.00013800691483414033 2
20160401 mülteci -0.000056636107108483694 2
20160401 perde 0.00004941253963307586 2
20160401 sınav -0.00017806007150496475 3
But I only want the top 10 words for each date, together with their ratios.
1 Answer
Something like this works in SQL Server as well; the syntax may need adjusting for MySQL.
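As a rough illustration of the "top N per date" part, here is a minimal sketch that numbers rows with the ROW_NUMBER() window function instead of user variables. It is run through Python's built-in sqlite3 module only so that it is self-contained. The scores table, its sample values, and the helper top_words_per_date are made up for illustration; in the real query the inner SELECT would read from the tf-idf subquery rather than from a table of precomputed scores.

```python
import sqlite3

# Hypothetical per-(date, word) scores standing in for the tf-idf subquery.
rows = [
    (20160401, "anlaşma", 0.30),
    (20160401, "uygula", 0.20),
    (20160401, "mülteci", 0.10),
    (20160402, "haber", 0.50),
    (20160402, "taraf", 0.40),
    (20160402, "sınav", 0.05),
]

TOP_N = 2  # the question would use 10

QUERY = """
SELECT newsCreatedTime, word, tf_idf, rn
FROM (
    SELECT newsCreatedTime, word, tf_idf,
           ROW_NUMBER() OVER (
               PARTITION BY newsCreatedTime   -- numbering restarts for each date
               ORDER BY tf_idf DESC           -- highest score gets rn = 1
           ) AS rn
    FROM scores
) ranked
WHERE rn <= ?
ORDER BY newsCreatedTime, rn
"""


def top_words_per_date(rows, top_n):
    """Load the sample rows into an in-memory table and return the top_n words per date."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE scores (newsCreatedTime INTEGER, word TEXT, tf_idf REAL)"
    )
    conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY, (top_n,)).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in top_words_per_date(rows, TOP_N):
        print(row)
```

The ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) clause is also accepted by SQL Server and by MySQL 8.0 and later; older MySQL versions have no window functions. One likely reason the user-variable version misnumbers rows is that MySQL does not guarantee the order in which expressions involving user variables are evaluated relative to ORDER BY, whereas the window function ties the numbering to an explicit partition and ordering.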