Extracting only certain nodes in web scraping on R

Posted by xdnvmnnf 11 months ago in Other


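I am trying to extract some football data from fbref.com, specifically some dates, and I would like to understand how to filter the various nodes within the page. So far I cannot extract just one type of data. The HTML in question explains the problem better: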

<th scope="row" class="left " data-stat="date" csk="20230819"><a href="/it/partite/254420f7/Internazionale-Monza-19-Agosto-2023-Serie-A">19-08-2023</a></th>
    <a href="/it/partite/254420f7/Internazionale-Monza-19-Agosto-2023-Serie-A">19-08-2023</a>

The code I use to read the page:

library(rvest)

url <- "https://fbref.com/it/squadre/d609edc0/Statistiche-Internazionale"

html_data <- read_html(url)

html_data %>%
  html_nodes(".left ")

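This reads roughly 1266 different nodes, but I am only interested in the text of the nodes where data-stat="date". Once I have only those nodes, I should be able to extract the date that follows the "href".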

Source: https://stackoverflow.com/questions/77419574/extracting-only-certain-nodes-in-web-scraping-on-r

1 Answer


yacmzcpb1#

You can use the html_attr() function to extract the value of an attribute and then check whether it equals the value you want.

library(rvest)
url <- "https://fbref.com/it/squadre/d609edc0/Statistiche-Internazionale"

html_data <- read_html(url)

foundnodes <- html_data %>% html_nodes(".left ") 

#extract the attribute and check to see if it is equal to date
nodes_date <- which(html_attr(foundnodes, "data-stat")=="date")

#subset the foundnodes
foundnodes[nodes_date]
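
To see why the which() step is safe, here is a minimal sketch of the same approach on an inline HTML string instead of the live page. The snippet and the names `snippet`, `nodes`, `stat` and `keep` are made up for illustration. Nodes without a "data-stat" attribute return NA from html_attr(), and which() simply skips them.

```r
library(rvest)

# Hypothetical inline HTML standing in for the real page
snippet <- '<div>
  <p class="left" data-stat="date">19-08-2023</p>
  <p class="left" data-stat="comp">Serie A</p>
  <p class="left">no data-stat attribute</p>
</div>'

nodes <- read_html(snippet) %>% html_nodes(".left")

# html_attr() returns NA for nodes that lack the attribute
stat <- html_attr(nodes, "data-stat")

# which() keeps only the positions where the comparison is TRUE,
# so the NA produced by the third node is dropped rather than propagated
keep <- which(stat == "date")

html_text(nodes[keep])
```

Indexing with the logical vector directly (without which()) would carry the NA into the subset, which is why the answer wraps the comparison in which().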

Or, as a single statement:

#find the nodes with class = .left then check that attribute "data-stat" is equal to "date"    
html_data  %>% html_elements(".left[data-stat='date']")
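
The question also mentions pulling the date and the link out of the filtered nodes afterwards. A minimal sketch of that follow-up step, run on an inline copy of the row from the question rather than the live site: the second cell (data-stat="comp") is invented for contrast, and the variable names are illustrative. It assumes rvest 1.0.0 or later, where html_elements(), html_element() and html_text2() are available.

```r
library(rvest)

# Inline HTML: the <th> is copied from the question, the <td> is a made-up neighbour
snippet <- '<table><tbody><tr>
<th scope="row" class="left " data-stat="date" csk="20230819"><a href="/it/partite/254420f7/Internazionale-Monza-19-Agosto-2023-Serie-A">19-08-2023</a></th>
<td class="left " data-stat="comp">Serie A</td>
</tr></tbody></table>'

page <- read_html(snippet)

# Keep only the cells whose data-stat attribute is "date"
date_nodes <- page %>% html_elements(".left[data-stat='date']")

# Visible text of each cell, e.g. "19-08-2023"
dates <- html_text2(date_nodes)

# href of the link inside each cell (NA if a cell has no <a>)
links <- date_nodes %>% html_element("a") %>% html_attr("href")

# Convert the day-month-year strings into Date objects
parsed <- as.Date(dates, format = "%d-%m-%Y")
```

On the real page the same three calls could be applied to the result of the one-line selector above. Whether every date cell there contains a link is not something this sketch verifies.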
