Removing rows that contain above a certain share of upper-case letters in R

Posted by cld4siwp on 2023-01-18 in Other

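I have a large data frame that contains company identifiers and phrases extracted from newspapers. It is very messy, and I want to clean it up through conditional row removal.
To do this, I want to remove rows in which more than 50% of the letters are upper case.
I found this code in another post, which removes rows that consist entirely of upper-case letters: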

data <- data[!grepl("^[A-Z]+(?:[ -][A-Z]+)*$", data$text), ]

How can I express this as a share of the total number of words or letters instead?

r

Source: https://stackoverflow.com/questions/75152760/removing-rows-that-contain-above-a-certain-share-of-upper-case-letters-in-r

1 Answer

lyfkaqu11#

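You could do this with regular expressions, but the stringi function stri_count_charclass provides a highly optimised way to count the characters that belong to a given class. The package manual documents the list of Unicode General Categories; here we use L for all letters and Lu for upper-case letters.
Something like this should do what you need: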

library(stringi)

data <- data.frame(text = c("Foo",
                            "BAr",
                            "BAZ"))

data[which(stri_count_charclass(data[["text"]],"[\\p{Lu}]") / stri_count_charclass(data[["text"]],"[\\p{L}]") < 0.5),]
# [1] "Foo"
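The one-liner above returns a plain character vector because the example data frame has a single column, and it silently drops any row that contains no letters at all (0 / 0 gives NaN). The following is only a sketch of how the same idea might be applied to a data frame with several columns. The helper name upper_share, the column names, and the sample rows are made up for illustration, and keeping letter-free rows is an assumption about what the asker would want.

```r
library(stringi)

# Hypothetical helper: share of upper-case letters among all letters.
upper_share <- function(x) {
  n_upper  <- stri_count_charclass(x, "[\\p{Lu}]")
  n_letter <- stri_count_charclass(x, "[\\p{L}]")
  n_upper / n_letter  # NaN when a string contains no letters (0 / 0)
}

# Made-up example data with an identifier column next to the text
df <- data.frame(id = 1:4,
                 text = c("Foo bar", "BAr", "BREAKING NEWS", "1234"),
                 stringsAsFactors = FALSE)

share <- upper_share(df$text)

# Keep rows with at most 50% upper-case letters.
# Rows without any letters (NaN share) are kept here; that is a choice, not a rule.
keep <- is.na(share) | share <= 0.5

# drop = FALSE guarantees the result stays a data frame
cleaned <- df[keep, , drop = FALSE]
print(cleaned)
```

Changing 0.5 to another value adjusts the threshold, and replacing `is.na(share) |` with `!is.na(share) &` would discard the letter-free rows instead.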

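Note: I updated my answer here because the original version did not point out a powerful feature of stringi. My gut reaction was to use [a-z] and [A-Z] for lower-case and upper-case characters, respectively. However, using the Unicode General Categories lets the solution work for non-ASCII characters as well.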

x = c("Foo",
      "BAr",
      "BAZ",
      "Ḟoo",
      "ḂÁr",
      "ḂÁẒ")
stri_count_charclass(x,"[A-Z]")/stri_count_charclass(x,"[[a-z][A-Z]]")
# [1] 0.3333333 0.6666667 1.0000000 0.0000000 0.0000000       NaN

stri_count_charclass(x,"[\\p{Lu}]")/stri_count_charclass(x,"[\\p{L}]")
# [1] 0.3333333 0.6666667 1.0000000 0.3333333 0.6666667 1.0000000
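The answer mentions that regular expressions could do the same job. For comparison, here is a minimal base R sketch of that route, without stringi: it deletes everything that is not an upper-case letter (or not a letter) and counts what remains. The variable names are illustrative, and how the POSIX classes [:upper:] and [:alpha:] treat non-ASCII letters depends on the locale and regex engine, so this is shown for ASCII input only.

```r
x <- c("Foo", "BAr", "BAZ")

# Count upper-case letters by stripping every character that is not one
n_upper <- nchar(gsub("[^[:upper:]]", "", x))

# Count all letters the same way
n_letter <- nchar(gsub("[^[:alpha:]]", "", x))

share_regex <- n_upper / n_letter
print(share_regex)

# Keep only the strings with at most 50% upper-case letters
kept <- x[which(share_regex <= 0.5)]
print(kept)
```

For these three strings the shares match the stringi results above; the stringi version is likely the safer choice once accented or other non-ASCII letters are involved.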

Answered 2023-01-18
