I'm having some trouble with the str_locate syntax, as shown below:
library(stringr)
library(rvest)
library(dplyr)
library(stringi)
library(readr)
library(readstata13)
list.mst = read.dta13("mst.dta")
list.mst = list.mst$ma_thue
link.source = 'https://infodoanhnghiep.com/tim-kiem/ma-so-thue/'
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
info = c()
for (mst in list.mst) {
  link = paste0(link.source, mst,'/')
  message(link)
  search.result = read_html(link)
  all.com.link = search.result %>% html_nodes(".company-name a") %>% html_attr('href') %>% unique()
  for (com.link in all.com.link) {
    com.page = read_html(paste0("https:",com.link))
    com.info = com.page %>% html_nodes(xpath = '//*[@id="left-content"]/div[2]/div[2]/div[1]') %>% html_text()
    com.name = str_sub(com.info,
                       start = str_locate(com.info, pattern = "Tên doanh nghi???p:")[2]+1,
                       end = str_locate(com.info, pattern = "Mã s??? thu???:")[1]-1)
    com.mst <- str_sub(com.info,
                       start = str_locate(com.info, pattern = "Mã s??? thu???:")[2]+1,
                       end = str_locate(com.info, pattern = "Tình tr???ng ho???t ")[1]-1)
    com.active = str_sub(com.info,
                         start = str_locate(com.info, pattern = "Tình tr???ng ho???t ")[2]+1+5,
                         end = str_locate(com.info, pattern = "ký qu???n lý:")[1]-1-9)
    com.add = str_sub(com.info,
                      start = str_locate(com.info, pattern = "???a ch???:")[2]+1,
                      end = str_locate(com.info, pattern = "i???n tho???i:")[1]-1-1)
    com.tel = str_sub(com.info,
                      start = str_locate(com.info, pattern = "i???n tho???i:")[2] +1,
                      end = str_locate(com.info, pattern = "???i di???n pháp lu???t:")[1]-1-1)
    com.cap.phep = str_sub(com.info,
                           start = str_locate(com.info, pattern = "Ngày c???p gi???y phép:")[2]+1,
                           end = str_locate(com.info, pattern = "Ngày b???t ")[1]-1)
    com.hoat.dong = str_sub(com.info,
                            start = str_locate(com.info, pattern = "Ngày b???t ")[2]+1+14,
                            end = str_locate(com.info, pattern = "Ngày nh???n TK:")[1]-1)
    com.nganh2 = str_sub(com.info,
                         start = str_locate(com.info, pattern = "Ngành ngh??? kinh doanh:")[2])
    com.nganh = str_sub(com.info,
                        start = str_locate(com.info, pattern = "Ngành ngh??? kinh doanh:")[2] +1)
    info= rbind(info, t(c(com.name, com.mst, com.active, com.add, com.tel, com.cap.phep, com.hoat.dong,com.nganh)))
    info = trim(info)
  }
}
https://infodoanhnghiep.com/tim-kiem/ma-so-thue/0100111338/
Error in stri_locate_first_regex(string, pattern, opts_regex = opts(pattern)) :
Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=`//Tên doanh nghi???p:`)
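Here is my sample dataset: link
I really don't know what is wrong with my code, so I would appreciate any suggestions. I noticed that R does not read the Vietnamese phrases correctly; I tried escaping them, but to no avail. Thanks, everyone!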
1 Answer
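As noted in the comments, this particular error appears to be triggered by some kind of character-encoding issue, and there is not enough information to reproduce or debug it.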
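To look at the encoding problem in isolation, here is a minimal sketch. It assumes the `???` runs in the question appeared because the script was read in a non-UTF-8 encoding, which the question does not confirm, and the sample text is made up for illustration. A run of `???` is parsed as stacked regex quantifiers, which is what the syntax error reports. Writing the labels with `\u` escapes and matching them with `fixed()` keeps them independent of the source file's encoding and of regex metacharacters.

```r
library(stringr)

# The pattern fragment as R received it after the encoding went wrong.
# "???" is read as stacked quantifiers, so the regex engine rejects it.
broken <- "nghi???p:"
err <- tryCatch(
  str_locate("abc", broken),
  error = function(e) conditionMessage(e)
)

# The same labels written with Unicode escapes (ASCII-only source code).
label      <- "T\u00ean doanh nghi\u1ec7p:"    # "Tên doanh nghiệp:"
next_label <- "M\u00e3 s\u1ed1 thu\u1ebf:"      # "Mã số thuế:"

# Made-up sample text standing in for the scraped com.info string.
text <- paste0(label, " ABC Co. ", next_label, " 0100111338")

# fixed() matches the labels literally instead of compiling them as regex.
loc_start <- str_locate(text, fixed(label))
loc_end   <- str_locate(text, fixed(next_label))

name <- str_trim(str_sub(text,
                         start = loc_start[1, "end"] + 1,
                         end   = loc_end[1, "start"] - 1))
print(name)
```

If the script file itself is saved as UTF-8 and read as UTF-8, the literal Vietnamese strings may work as well; the escapes just remove that dependency.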
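NA values. The example below replaces the inner loop in question and uses a fixed set of URLs as an example; the value descriptions are taken from the first column of the table and then cleaned with janitor.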
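The answer's original code is not preserved here, so the following is only a rough sketch of the table-based idea, not the answer's implementation. It assumes the company details sit in a two-column label/value table; the inline HTML, the English labels, and all variable names are hypothetical stand-ins for a real page.

```r
library(rvest)
library(janitor)

# Hypothetical stand-in for one company page: a label/value table.
page <- minimal_html('
  <table>
    <tr><td>Company name:</td><td>ABC Co.</td></tr>
    <tr><td>Tax code:</td><td>0100111338</td></tr>
    <tr><td>Phone:</td><td></td></tr>
  </table>')

# Read the table as a table; convert = FALSE keeps every cell as text,
# so the tax code keeps its leading zero.
tbl <- html_table(html_element(page, "table"), header = FALSE, convert = FALSE)

# First column supplies the field names, second column the values.
values <- as.list(tbl[[2]])
names(values) <- make_clean_names(tbl[[1]])

# One row per company; an empty cell stays an empty field and does not
# shift the other fields, unlike position-based string slicing.
row <- as.data.frame(values, stringsAsFactors = FALSE)
print(row)
```

Rows built this way for several pages could then be combined, for example with `dplyr::bind_rows()`, instead of growing a matrix with `rbind()` inside the loop.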
Result:
Created on 2023-02-20 with reprex v2.0.2