I'm building a DIY Twitter sentiment analyzer, and I have a tweet index that looks like this:
"_source" : {
"id" : 26930655,
"status" : 1,
"title" : "Here’s 5 underrated #BTC and realistic crypto accounts that everyone should follow: @Quinnvestments , @JacobOracle , @jevauniedaye , @ginsbergonomics , @InspoCrypto",
"hashtags" : null,
"created_at" : 1622390229,
"category" : null,
"language" : 50
},
{
"id" : 22521897,
"status" : 1,
"title" : "#bulls gonna overtake the #bears soon #ATH coming #ALTSEASON #BSCGem #eth #btc #memecoin #100xgems #satyasanatan 🙏🚩🚩🇮🇳",
"hashtags" : null,
"created_at" : 1620045296,
"category" : null,
"language" : 50
}
The index mapping is set up like this:
"sentiment-en" : {
"mappings" : {
"properties" : {
"category" : {
"type" : "text"
},
"created_at" : {
"type" : "integer"
},
"hashtags" : {
"type" : "text"
},
"id" : {
"type" : "long"
},
"language" : {
"type" : "integer"
},
"status" : {
"type" : "integer"
},
"title" : {
"type" : "text",
"fields" : {
"raw" : {
"type" : "keyword"
},
"raw_text" : {
"type" : "text"
},
"stop" : {
"type" : "text",
"index_options" : "docs",
"analyzer" : "stop_words_filter"
},
"syn" : {
"type" : "text",
"index_options" : "docs",
"analyzer" : "synonyms_filter"
}
},
"index_options" : "docs",
"analyzer" : "all_ok_filter"
}
}
}
}
},
"settings" : {
"index" : {
"number_of_shards" : "10",
"provided_name" : "sentiment-en",
"creation_date" : "1627975717560",
"analysis" : {
"filter" : {
"stop_words" : {
"type" : "stop",
"stopwords" : [ ]
},
"synonyms" : {
"type" : "synonym",
"synonyms" : [ ]
}
},
"analyzer" : {
"stop_words_filter" : {
"filter" : [ "stop_words" ],
"tokenizer" : "standard"
},
"synonyms_filter" : {
"filter" : [ "synonyms" ],
"tokenizer" : "standard"
},
"all_ok_filter" : {
"filter" : [ "stop_words", "synonyms" ],
"tokenizer" : "standard"
}
}
},
"number_of_replicas" : "0",
"uuid" : "Q5yDYEXHSM-5kvyLGgsYYg",
"version" : {
"created" : "7090199"
}
}
Now the problem: I want to extract all the hashtags and mentions into a separate field.
The output I want:
"id" : 26930655,
"status" : 1,
"title" : "Here’s 5 underrated #BTC and realistic crypto accounts that everyone should follow: @Quinnvestments , @JacobOracle , @jevauniedaye , @ginsbergonomics , @InspoCrypto",
"hashtags" : "BTC",
"created_at" : 1622390229,
"category" : null,
"language" : 50
},
{
"id" : 22521897,
"status" : 1,
"title" : "#bulls gonna overtake the #bears soon #ATH coming #ALTSEASON #BSCGem #eth #btc #memecoin #100xgems #satyasanatan 🙏🚩🚩🇮🇳",
"hashtags" : "bulls,bears,ATH,ALTSEASON,BSCGem,eth,btc,memecoin,100xgems,satyasanatan",
"created_at" : 1620045296,
"category" : null,
"language" : 50
}
What I have tried so far:
1. Creating a pattern-based tokenizer that emits only the hashtags and mentions (and no other tokens) into a hashtags-and-mentions field, without much success.
2. Writing an n-gram tokenizer without any analyzer, also without much success.
Any help would be appreciated; I'm open to reindexing my data. Thanks in advance!
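For context, the pattern-tokenizer idea from attempt 1 can be expressed with Elasticsearch's `pattern` tokenizer, using a capture group so that only the tag/mention body becomes a token (the index and analyzer names here are illustrative, not from my actual setup):

```json
PUT tags-test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "tags_only": {
          "type": "pattern",
          "pattern": "[#@](\\w+)",
          "group": 1
        }
      },
      "analyzer": {
        "tags_only": {
          "type": "custom",
          "tokenizer": "tags_only"
        }
      }
    }
  }
}
```

With `group: 1`, the tokenizer emits each regex match's first capture group as a token instead of splitting on the pattern, so only hashtag and mention bodies survive analysis.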
1 Answer
You can index the data with the Logstash Twitter input plugin and run a Ruby script in the filter plugin, as described in the blog post.
Alternatively, you can use the Logstash Elasticsearch input plugin to read the source index, apply the same Ruby code in the filter plugin, and write to the destination index with the Logstash Elasticsearch output plugin.
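The Ruby script the answer refers to is not reproduced here; below is a minimal sketch of the extraction logic such a filter would need. The `extract_tags` helper and the commented Logstash pipeline wiring are my own illustration, not taken from the original answer:

```ruby
# Hypothetical extraction logic for a Logstash ruby filter.
# Pulls #hashtags and @mentions out of a tweet's title text.
def extract_tags(title)
  hashtags = title.scan(/#(\w+)/).flatten  # capture group drops the '#'
  mentions = title.scan(/@(\w+)/).flatten  # capture group drops the '@'
  [hashtags, mentions]
end

# Inside a Logstash pipeline the same logic could be wired up roughly as:
#   filter {
#     ruby {
#       code => '
#         t = event.get("title").to_s
#         event.set("hashtags", t.scan(/#(\w+)/).flatten.join(","))
#         event.set("mentions", t.scan(/@(\w+)/).flatten.join(","))
#       '
#     }
#   }
```

Note that `\w` matches only ASCII word characters, which is enough for tags like `100xgems` but would truncate non-Latin hashtags.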
Another option is the reindex API with an ingest pipeline, but ingest pipelines do not support Ruby, so you would need to convert the Ruby logic above into a Painless script.
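A Painless version could look roughly like this; the pipeline name and destination index name are made up for illustration, and the script simply splits the title on spaces rather than using a full regex:

```json
PUT _ingest/pipeline/extract-tags
{
  "description": "Sketch: copy #hashtags and @mentions from title into their own fields",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "if (ctx.title != null) { def h = []; def m = []; for (def t : ctx.title.splitOnToken(' ')) { if (t.startsWith('#')) { h.add(t.substring(1)); } else if (t.startsWith('@')) { m.add(t.substring(1)); } } ctx.hashtags = String.join(',', h); ctx.mentions = String.join(',', m); }"
      }
    }
  ]
}

POST _reindex
{
  "source": { "index": "sentiment-en" },
  "dest": { "index": "sentiment-en-v2", "pipeline": "extract-tags" }
}
```

Since you said you are open to reindexing, this keeps the original index untouched and writes the enriched documents to a new one.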