Equivalent of Apache Lucene's "proximity search" in R

Posted by eivnm1vs on 2023-04-06 in Lucene


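I am working on a corpus of documents (clinical narratives written during hospital stays), mainly using the quanteda package. The goal is to classify documents by the presence or absence of features such as "spastic cough".
I would like to reproduce in R the behavior of Apache Lucene's "proximity search" (https://lucene.apache.org/core/8_11_2/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Proximity_Searches).
Take this example: "spastic and productive cough in a 91-year-old patient following femoral neck surgery"
I start by tokenizing the phrase as follows: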

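library(quanteda)   # tokens(), tokens_remove(), tokens_ngrams(), stopwords()
library(magrittr)   # %>%

# Tokenize the example sentence, then drop English stopwords (NLTK list)
toks <- tokens(
  c(text1 = "spastic and productive cough in a 91-year-old patient following femoral neck surgery"),
  remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE, padding = TRUE
) %>%
  tokens_remove(pattern = stopwords("en", source = "nltk"))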

which produces the following output:

Tokens consisting of 1 document.
text1 :
[1] "spastic"     "productive"  "cough"       "91-year-old" "patient"     "following"   "femoral"    
[8] "neck"        "surgery"

I can then go on to generate n-grams and skip-grams:

toks <- tokens_ngrams(toks, n = 4, skip = 0:3)

toks
[1] "spastic_productive_cough_91-year-old"     "spastic_productive_cough_patient"        
  [3] "spastic_productive_cough_following"       "spastic_productive_cough_femoral"        
  [5] "spastic_productive_91-year-old_patient"   "spastic_productive_91-year-old_following"
  [7] "spastic_productive_91-year-old_femoral"   "spastic_productive_91-year-old_neck"     
  [9] "spastic_productive_patient_following"     "spastic_productive_patient_femoral"      
 [11] "spastic_productive_patient_neck"          "spastic_productive_patient_surgery"      
 [13] "spastic_productive_following_femoral"     "spastic_productive_following_neck"       
 [15] "spastic_productive_following_surgery"     "spastic_cough_91-year-old_patient"       
 [17] "spastic_cough_91-year-old_following"      "spastic_cough_91-year-old_femoral"       
 [19] "spastic_cough_91-year-old_neck"           "spastic_cough_patient_following"         
 [21] "spastic_cough_patient_femoral"            "spastic_cough_patient_neck"              
 [23] "spastic_cough_patient_surgery"            "spastic_cough_following_femoral"         
 [25] "spastic_cough_following_neck"             "spastic_cough_following_surgery"         
 [27] "spastic_cough_femoral_neck"               "spastic_cough_femoral_surgery"           
 [29] "spastic_91-year-old_patient_following"    "spastic_91-year-old_patient_femoral"     
 [31] "spastic_91-year-old_patient_neck"         "spastic_91-year-old_patient_surgery"     
.........

At this point I figured I could simply do:

any(stringr::str_detect(as.character(toks), "spastic_cough"))
[1] TRUE
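
For comparison, the same check might be done on two-token skip-grams with an exact match, instead of a substring search over 4-grams. This is only a sketch, assuming quanteda is installed; the object names `toks_uni` and `bigrams` are illustrative and do not come from the code above, and the default stopword list is used here for brevity.

```r
library(quanteda)

# Re-tokenize the example sentence (illustrative object names)
toks_uni <- tokens(
  c(text1 = "spastic and productive cough in a 91-year-old patient following femoral neck surgery"),
  remove_punct = TRUE
)
toks_uni <- tokens_remove(toks_uni, pattern = stopwords("en"))

# Two-token n-grams allowing 0 to 3 skipped tokens between the pair
bigrams <- tokens_ngrams(toks_uni, n = 2, skip = 0:3)

# Exact match on the whole skip-gram rather than a substring search
"spastic_cough" %in% as.character(bigrams)
```

Because the n-grams contain exactly two tokens, the comparison cannot accidentally match a longer n-gram in which "spastic_cough" is only a fragment.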

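However, I am not sure this is the right approach, because it feels clunky compared with how a Lucene query works. If I were querying the corpus with Apache Lucene to identify patients with "spastic cough", I would probably use something like "spastic cough"~3, where "~3" means that any skip-gram with a skip of 0:3 would match.
Any suggestions on how and where I could improve my approach?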
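
To make the intended "~3" behavior concrete, here is a minimal base R sketch that works on token positions directly. The helper `within_slop()` is hypothetical (it is not part of quanteda or Lucene), it handles only the ordered case where the first term precedes the second, and it counts gaps on the tokens that remain after stopword removal, so it approximates rather than reproduces Lucene's slop semantics.

```r
# Hypothetical helper: TRUE if `second` occurs after `first` with at most
# `slop` tokens in between (ordered matches only).
within_slop <- function(tokens, first, second, slop = 0L) {
  pos_first  <- which(tokens == first)
  pos_second <- which(tokens == second)
  if (length(pos_first) == 0L || length(pos_second) == 0L) return(FALSE)
  # Number of tokens strictly between each (first, second) pair
  gaps <- outer(pos_second, pos_first, "-") - 1L
  any(gaps >= 0L & gaps <= slop)
}

# Tokens of the example sentence after stopword removal
toks_chr <- c("spastic", "productive", "cough", "91-year-old", "patient",
              "following", "femoral", "neck", "surgery")

within_slop(toks_chr, "spastic", "cough", slop = 3)    # one token in between
within_slop(toks_chr, "spastic", "surgery", slop = 3)  # seven tokens in between
```

Working on positions avoids materializing every skip-gram, which may matter on a large corpus; for a quanteda tokens object the same function could be applied per document, for example with `lapply(as.list(toks), within_slop, ...)`.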