regex: Extracting capitalized words with R

Posted by gab6jxml on 2023-11-20 in Other
ex03ProperNoun <- function(text) {
  assertString(text)
  words <- unique(grep('(?<![.!?]\\s)[A-Z][a-z0-9]+(?![.!?]\\s)', 
                       strsplit(text, ' ')[[1]], value = TRUE, perl = TRUE))
  words <- words[-1]
  index <- grep('\\.\\s?[A-Z][a-z0-9]+', text, perl = TRUE)[1]
  if (!is.na(index)) {
    words <- words[(-index)]  
  }
  return(words)
}

ex03ProperNoun("Proper Nouns are usually Capitalized.")
  # --> "Nouns"  "Capitalized"
  # my output: "Nouns"        "Capitalized."
ex03ProperNoun("proper nouns are usually Capitalized. This is, Proper for proper nouns.")
  # --> "Capitalized" "Proper"
  # my output: "Proper" 
ex03ProperNoun("The IBM5100 Portable Computer was one of the first portable computers.")
  # --> "IBM5100" "Portable" "Computer"
  # my output: "IBM5100"  "Portable" "Computer"
ex03ProperNoun("IBM5100 is the name of one of the first portable computers.")
  # --> character(0)
  # my output: character(0)

So my problem is that I keep getting the wrong output for some inputs. Why?
I tried changing the code, but it still does not work.

jaxagkaj #1

library(stringr)

x <- c("Proper Nouns are usually Capitalized.",
       "proper nouns are usually Capitalized. This is, Proper for proper nouns.",
       "The IBM5100 Portable Computer was one of the first portable computers.",
       "IBM5100 is the name of one of the first portable computers.",
       "The IBM5100 Portable Computer was one of the first portable computers.")

str_extract_all(x, "\\b[A-Z]\\w*") 

[[1]]
[1] "Proper"      "Nouns"       "Capitalized"

[[2]]
[1] "Capitalized" "This"        "Proper"     

[[3]]
[1] "The"      "IBM5100"  "Portable" "Computer"

[[4]]
[1] "IBM5100"

[[5]]
[1] "The"      "IBM5100"  "Portable" "Computer"


yws3nbqq #2

You can use the following base R solution:

x <- c("Proper Nouns are usually Capitalized.",
    "proper nouns are usually Capitalized. This is, Proper for proper nouns.",
    "The IBM5100 Portable Computer was one of the first portable computers.",
    "IBM5100 is the name of one of the first portable computers.",
    "The IBM5100 Portable Computer was one of the first portable computers.")
 
regmatches(x, gregexpr("[?!.]\\s*\\w+(*SKIP)(*F)|\\b(?!^)\\p{Lu}\\w*\\b", x, perl=TRUE))

See the R demo and the regex demo.

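Note: if you plan to use stringr::str_extract_all, and you know that there can never be more than 100 whitespace characters between the sentence-final punctuation and the next word, you can also use str_extract_all(x, "\\b(?!^)(?<![.?!]\\s{0,100})\\p{Lu}\\w*\\b"), because the ICU regex flavor allows lookbehind patterns of bounded width.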

Output:

[[1]]
[1] "Nouns"       "Capitalized"

[[2]]
[1] "Capitalized" "Proper"     

[[3]]
[1] "IBM5100"  "Portable" "Computer"

[[4]]
character(0)

[[5]]
[1] "IBM5100"  "Portable" "Computer"
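
The stringr variant mentioned in the note above can be tried along the following lines. This is only a sketch: it assumes the stringr package is installed, and the names `sentences` and `icu_pattern` are illustrative rather than part of the original answer.

```r
library(stringr)  # assumption: stringr is installed

sentences <- c("Proper Nouns are usually Capitalized.",
               "proper nouns are usually Capitalized. This is, Proper for proper nouns.")

# ICU allows a lookbehind as long as its length is bounded;
# here the bound is at most 100 whitespace characters after . ? or !
icu_pattern <- "\\b(?!^)(?<![.?!]\\s{0,100})\\p{Lu}\\w*\\b"

res <- str_extract_all(sentences, icu_pattern)
print(res)
```

The lookbehind plays the role that the (*SKIP)(*F) branch plays in the PCRE pattern: a capitalized word directly after sentence punctuation is rejected.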


PCRE regex details:

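  • [?!.]\s*\w+(*SKIP)(*F) - a ?, ! or ., then zero or more whitespace characters and one or more word characters; the match is then failed and the search for the next match resumes from the position of the failure
  • | - or
  • \b - a word boundary
  • (?!^) - not at the start of the string
  • \p{Lu} - any uppercase letter
  • \w* - zero or more word characters
  • \b - a word boundary.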
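
To see what the (*SKIP)(*F) branch contributes, the pattern can be compared with and without it. The following is a minimal base R sketch; the helper `extract_matches` and the variable names are hypothetical and only serve the illustration.

```r
# Hypothetical helper: return all PCRE matches of `pattern` in one string.
extract_matches <- function(text, pattern) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# Capitalized word that does not begin at the start of the string.
word_only <- "\\b(?!^)\\p{Lu}\\w*\\b"
# The same pattern, preceded by the branch that consumes and then discards
# "punctuation + optional whitespace + word".
with_skip <- paste0("[?!.]\\s*\\w+(*SKIP)(*F)|", word_only)

s <- "proper nouns are usually Capitalized. This is, Proper for proper nouns."

extract_matches(s, word_only)  # still contains the sentence-initial "This"
extract_matches(s, with_skip)  # "This" is consumed by the skipped branch
```

Without the first alternative, only a word at the very start of the string is excluded; words that open a later sentence are filtered out only once the skip branch is added.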
