R: Remove all words after a specific word in a column

8ehkhllq, posted 2023-05-04 in Other

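I want to remove every word and character in a column that appears after a specific word, including that word itself.
This is what my data looks like.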
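| | Number.of.Workers | company_name |
| --------------|--------------|--------------|
| 5195 | 82 | valley ho hotels aka kings inn |
| 5196 | 82 | aluminum precision products |
| 5197 | 79 | levity of brea dba brea improv |
| 5198 | 79 | crunch |
| 5199 | 71 | comedy club of los angeles dba hollywood improv |
| 5200 | 65 | andre-boudin bakeries inc dba boudin |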
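As a concrete example of what I want to accomplish: I want to remove the words "aka" and "dba" themselves, along with anything that follows them, from my data.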

df <- structure(list(Number.of.Workers = c("82", "82", "79", "79", 
"71", "65", "62", "58", "56", "53", "49"), company_name = c("valley ho hotels aka kings inn", 
"aluminum precision products", "levity of brea  dba brea improv", 
"crunch", "comedy club of los angeles  dba hollywood improv", 
"andre-boudin bakeries inc   dba boudin", "comedy club of san jose  dba san jose improv", 
"comedy club of brea  dba ontario improv", "sprout bost ", "culver west lp - playa provisions", 
"faa concord h dba concord honda")), row.names = 5195:5205, class = "data.frame")

vwoqyblh #1

You can use sub() as follows:

df$company_name = sub("\\s+(aka|dba|\\(formerly.*[)])\\s+.*$", "", df$company_name)

Output:

     Number.of.Workers                      company_name
5195                82                  valley ho hotels
5196                82       aluminum precision products
5197                79                    levity of brea
5198                79                            crunch
5199                71        comedy club of los angeles
5200                65         andre-boudin bakeries inc
5201                62           comedy club of san jose
5202                58               comedy club of brea
5203                56                      sprout bost 
5204                53 culver west lp - playa provisions
5205                49                     faa concord h

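Note: thanks to @Chris Ruehlemann for pointing out sub() versus gsub(). The difference is that the former replaces only the first match, whereas the latter replaces all matches.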
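
To make the difference noted above concrete, here is a minimal sketch on a made-up string (the value of x is hypothetical and not taken from the question's data):

```r
# Hypothetical example string containing the separator twice.
x <- "a dba b dba c"

# sub() replaces only the first match of the pattern.
first_only <- sub(" dba ", " / ", x)

# gsub() replaces every match of the pattern.
all_matches <- gsub(" dba ", " / ", x)

print(first_only)   # "a / b dba c"
print(all_matches)  # "a / b / c"
```

Because the pattern in the answer ends in .*$, a single match already consumes the rest of the string, so sub() is sufficient there.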

k2arahey #2

In base R:

# remove "aka" or "dba" and everything that follows
df$company_name <- gsub("(aka|dba).*", "", df$company_name)

# remove any leftover "aka"/"dba" tokens; none remain after the step above, so this call changes nothing
df$company_name <- gsub("(\\s*aka\\s*|\\s*dba\\s*)", "", df$company_name)

Output:

     Number.of.Workers                      company_name
5195                82                 valley ho hotels 
5196                82       aluminum precision products
5197                79                  levity of brea  
5198                79                            crunch
5199                71      comedy club of los angeles  
5200                65      andre-boudin bakeries inc   
5201                62         comedy club of san jose  
5202                58             comedy club of brea  
5203                56                      sprout bost 
5204                53 culver west lp - playa provisions
5205                49                    faa concord h
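
The result of this approach keeps the whitespace that preceded "aka" or "dba", and the pattern (aka|dba) would also match those letters inside a longer word. One possible variant, sketched here under the assumption that "aka" and "dba" always appear as standalone words, adds word boundaries and trims the result with trimws(). The vector x is a small sample of the question's data plus one hypothetical name ("dbase systems") added only to show the word-boundary effect:

```r
x <- c("valley ho hotels aka kings inn",
       "levity of brea  dba brea improv",
       "crunch",
       "dbase systems")  # hypothetical name, not in the question's data

# \\b requires "aka"/"dba" to be whole words;
# the leading \\s* also removes the spaces in front of them.
cleaned <- sub("\\s*\\b(aka|dba)\\b.*$", "", x, perl = TRUE)

# trimws() strips any remaining leading or trailing whitespace.
cleaned <- trimws(cleaned)

print(cleaned)
# "valley ho hotels" "levity of brea" "crunch" "dbase systems"
```

Without the word boundaries, "dbase systems" would be reduced to an empty string, because "dba" matches at its very start.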

8wtpewkr #3

Try this:

library(stringr)
str_remove(df$company_name, "\\s+(aka|dba).*$")
 [1] "valley ho hotels"                  "aluminum precision products"       "levity of brea"                    "crunch"                            "comedy club of los angeles"       
 [6] "andre-boudin bakeries inc"         "comedy club of san jose"           "comedy club of brea"               "sprout bost "                      "culver west lp - playa provisions"
[11] "faa concord h"

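Edit

If the strings from which removal should start can be any of the following:

* aka
* dba
* (formerly known as x), where x is a placeholder for a company name

then you need to alternate between two patterns: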

x <- c("valley ho hotels aka kings inn", 
        "aluminum precision products", 
        "levity of brea  (formerly known as brea improv)", 
        "crunch", 
        "comedy club of los angeles  dba hollywood improv")

str_remove(x, "\\s+(aka|dba).*$|\\s+\\(formerly[^)(]+\\)")
[1] "valley ho hotels" "aluminum precision products" "levity of brea" "crunch" "comedy club of los angeles"
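
For readers who prefer not to depend on stringr, the same alternation should also work with base R's sub(). This is a sketch on the same sample vector, not part of the original answer; the names pattern and result are introduced here for illustration:

```r
x <- c("valley ho hotels aka kings inn",
       "aluminum precision products",
       "levity of brea  (formerly known as brea improv)",
       "crunch",
       "comedy club of los angeles  dba hollywood improv")

# First alternative: whitespace, then "aka" or "dba", then the rest of the string.
# Second alternative: whitespace, then a parenthesised "(formerly ...)" note.
pattern <- "\\s+(aka|dba).*$|\\s+\\(formerly[^)(]+\\)"

# sub() removes the first match only, which mirrors str_remove().
result <- sub(pattern, "", x, perl = TRUE)

print(result)
```

If a name could contain more than one such note, gsub() (or str_remove_all() in stringr) would remove all of them rather than only the first.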
