R equivalent of Apache Lucene's "proximity search"

eivnm1vs  posted on 2023-04-06  in Lucene

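I am working with a corpus of documents (clinical narratives written during hospital stays), mainly using the quanteda package. The goal is to classify documents by the presence or absence of a feature, such as "spastic cough".
I would like to use R to reproduce the behavior of Apache Lucene's "proximity search" (https://lucene.apache.org/core/8_11_2/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Proximity_Searches).
Take this example: "*spastic* and productive *cough* in a 91-year-old patient following femoral neck surgery"
I would start by tokenizing the phrase as follows: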

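library(quanteda)
library(magrittr)  # provides the %>% pipe

# Tokenize, then drop English stopwords (NLTK list)
toks <- tokens(
  c(text1 = "spastic and productive cough in a 91-year-old patient following femoral neck surgery"),
  remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE, padding = TRUE
) %>%
  tokens_remove(pattern = stopwords("en", source = "nltk"))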

which produces the following output:

Tokens consisting of 1 document.
text1 :
[1] "spastic"     "productive"  "cough"       "91-year-old" "patient"     "following"   "femoral"    
[8] "neck"        "surgery"

I can then go on to generate n-grams and skip-grams:

toks <- tokens_ngrams(toks, n = 4, skip = 0:3)

toks
  [1] "spastic_productive_cough_91-year-old"     "spastic_productive_cough_patient"        
  [3] "spastic_productive_cough_following"       "spastic_productive_cough_femoral"        
  [5] "spastic_productive_91-year-old_patient"   "spastic_productive_91-year-old_following"
  [7] "spastic_productive_91-year-old_femoral"   "spastic_productive_91-year-old_neck"     
  [9] "spastic_productive_patient_following"     "spastic_productive_patient_femoral"      
 [11] "spastic_productive_patient_neck"          "spastic_productive_patient_surgery"      
 [13] "spastic_productive_following_femoral"     "spastic_productive_following_neck"       
 [15] "spastic_productive_following_surgery"     "spastic_cough_91-year-old_patient"       
 [17] "spastic_cough_91-year-old_following"      "spastic_cough_91-year-old_femoral"       
 [19] "spastic_cough_91-year-old_neck"           "spastic_cough_patient_following"         
 [21] "spastic_cough_patient_femoral"            "spastic_cough_patient_neck"              
 [23] "spastic_cough_patient_surgery"            "spastic_cough_following_femoral"         
 [25] "spastic_cough_following_neck"             "spastic_cough_following_surgery"         
 [27] "spastic_cough_femoral_neck"               "spastic_cough_femoral_surgery"           
 [29] "spastic_91-year-old_patient_following"    "spastic_91-year-old_patient_femoral"     
 [31] "spastic_91-year-old_patient_neck"         "spastic_91-year-old_patient_surgery"     
.........

At this point I figured I could simply do:

any(stringr::str_detect(as.character(toks), "spastic_cough"))
[1] TRUE

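But I am not sure this is the right approach, because it feels clumsy compared with how a Lucene query works. If I were querying the corpus with Apache Lucene to identify patients with "spastic cough", I would probably use something like "spastic cough"~3, where "~3" means that any skip-gram with a skip of 0:3 would match.
Any suggestions on how and where I could improve my approach?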
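
For comparison, the proximity idea can also be checked directly on token positions, without materializing every skip-gram. The following is a minimal base-R sketch, not Lucene's actual slop algorithm: `has_proximity()` is a hypothetical helper, and it assumes an ordered match (the first term must come before the second) with at most `max_skip` tokens in between.

```r
# Hypothetical helper (not part of quanteda or Lucene): TRUE if `first` is
# followed by `second` with at most `max_skip` other tokens in between.
has_proximity <- function(tokens, first, second, max_skip = 3) {
  pos_first  <- which(tokens == first)
  pos_second <- which(tokens == second)
  if (length(pos_first) == 0 || length(pos_second) == 0) return(FALSE)
  # Number of tokens between every (first, second) pair of positions;
  # negative values mean `second` occurs before `first`.
  gaps <- outer(pos_second, pos_first, "-") - 1
  any(gaps >= 0 & gaps <= max_skip)
}

# The tokens from the example above, after stopword removal
doc <- c("spastic", "productive", "cough", "91-year-old", "patient",
         "following", "femoral", "neck", "surgery")

has_proximity(doc, "spastic", "cough")    # TRUE: one token in between
has_proximity(doc, "spastic", "surgery")  # FALSE: seven tokens in between
has_proximity(doc, "cough", "spastic")    # FALSE: order matters in this sketch
```

With a quanteda tokens object, something like `sapply(as.list(toks), has_proximity, first = "spastic", second = "cough")` should give one logical value per document, provided it is applied to the plain tokens before `tokens_ngrams()`.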

Edit:

This might work: https://search.r-project.org/CRAN/refmans/corpustools/html/search_features.html
but at the moment I don't know how to fit it into my workflow.

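Edit 2:

It looks like I can query the corpus with subset_query, using a Lucene-like syntax. The biggest problem I face now is that "corpustools" does not accept a tokens object as input, and the function **tokens_to_corpus()** does not work for me. This leaves me with no control over the tokenization process.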
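
One possible way around this, sketched under assumptions: flatten the tokens into a data frame with one row per token, which is the kind of input a token-level import function would expect. The list below stands in for `as.list(toks)` on a quanteda tokens object, the column names are illustrative, and the commented-out call assumes the corpustools function is `tokens_to_tcorpus()` with `doc_col` and `token_id_col` arguments, which should be checked against the package documentation.

```r
# Stand-in for as.list(toks): a named list of character vectors, one per document
tok_list <- list(
  text1 = c("spastic", "productive", "cough", "91-year-old", "patient"),
  text2 = c("dry", "cough", "patient")
)

n <- unname(lengths(tok_list))  # number of tokens per document

# One row per token: document id, position within the document, and the token
token_df <- data.frame(
  doc_id   = rep(names(tok_list), times = n),
  token_id = sequence(n),
  token    = unlist(tok_list, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Assumption, not run here: hand the pre-tokenized data frame to corpustools
# tc <- corpustools::tokens_to_tcorpus(token_df, doc_col = "doc_id",
#                                      token_id_col = "token_id")
```

Because the tokens are produced before the data frame is built, stopword removal and the other quanteda preprocessing choices stay under your control.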

34gzjxbg1#

Actually, after digging deeper into the documentation, the "corpustools" package provides everything I need to get an Apache Lucene-like experience in R =)
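
A rough sketch of what that could look like, assuming the corpustools functions `create_tcorpus()`, `search_features()` and `subset_query()` behave as described in the package documentation. The second document is made up for illustration, and the corpus is built from raw text here, so corpustools does its own tokenization.

```r
library(corpustools)

texts <- c(
  text1 = "spastic and productive cough in a 91-year-old patient following femoral neck surgery",
  text2 = "patient admitted for femoral neck surgery, no cough reported"  # illustrative only
)

# Build a tCorpus from the raw texts, using the vector names as document ids
tc <- create_tcorpus(texts, doc_id = names(texts))

# Lucene-style proximity query: the two terms within a window of 3 tokens
hits <- search_features(tc, query = '"spastic cough"~3')
hits$hits  # matched tokens, with their document and token ids

# Keep only the documents that match the query
tc_match <- subset_query(tc, query = '"spastic cough"~3')
tc_match$meta
```

Whether the window counts stopwords depends on the tokens in the tCorpus, so the same `~3` can behave differently from the skip-gram approach, where stopwords were removed first.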
