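The Environmental Data Initiative (EDI) is a repository that stores datasets from many sites. I want to scrape the begin and end dates of every dataset from one site (see example link here).
- Each dataset at a site includes a link to a metadata URL that lists the dataset's begin and end dates (see example link here).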
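My code below tries to use a for loop to extract the unique ID of each dataset (i.e., the Package Id), which is then used to build the metadata page URL for each Package Id.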
However, my for loop throws an error when it tries to scrape the begin date from each metadata page.
- Error:
Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "list"
How can I adjust the for loop to extract the begin and end dates for each Package Id?
library(rvest)
library(xml2)
library(dplyr)
library(purrr)
url <- "https://portal.edirepository.org/nis/simpleSearch?defType=edismax&q=*:*&fq=-scope:ecotrends&fq=-scope:lter-landsat*&fq=scope:(knb-lter-and)&fl=id,packageid,title,author,organization,pubdate,coordinates&debug=false&start=0&rows=150"
webpage <- read_html(url)
# Initialize vectors to store the data
package_ids <- character()
time_periods_begin <- character()
time_periods_end <- character()
# Extract the Package Id
package_ids <- webpage %>%
  html_table() %>%
  .[[4]] %>%
  select(`Package Id ▵▿`) %>%
  rename(PackageId = `Package Id ▵▿`)
# Iterate over each PackageId row
for (i in 1:length(package_ids$PackageId)) {
  # Construct the URL for the "View Full Metadata" page
  package_id_link <- paste0("https://portal.edirepository.org/nis/metadataviewer?packageid=", package_ids$PackageId)
  # Navigate to the "View Full Metadata" page
  metadata_page <- map(package_id_link, read_html)
  # Extract the Begin and End (this is where the error lives)
  time_period_begin <- html_nodes(metadata_page, "tr:contains('Begin') td:nth-child(2)") %>%
    html_text() %>%
    trimws()
  time_periods_begin <- c(time_periods_begin, time_period_begin)
  time_period_end <- html_nodes(metadata_page, "tr:contains('End') td:nth-child(2)") %>%
    html_text() %>%
    trimws()
  time_periods_end <- c(time_periods_end, time_period_end)
}
The output should look like this:
# Create a data frame with Package Id, Begin, and End
data_frame <- data.frame(PackageId = package_id,
                         Begin = time_periods_begin,
                         End = time_periods_end)
data_frame
            PackageId      Begin        End
1 knb-lter-and.2719.6 1971-06-01 2002-03-11
2 knb-lter-and.2720.8 1958-01-01 1979-01-01
3 knb-lter-and.2721.6 1975-01-01 1995-01-01
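Update 1
I can get the PackageId, Begin, and End for a single dataset. With the code above, I can get the metadata URL for each dataset. Now I just need to figure out how to extract the PackageId, Begin, and End for each of the 147 metadata URLs.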
url <- "https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-and.4525.10"
webpage <- read_html(url)
package_id <- html_text(html_nodes(webpage, "td.rowodd + td.roweven")[1])
# Extract the Begin value
time_periods_begin <- html_text(html_nodes(webpage, "td:contains('Begin:') + td")[1])
# Extract the End value
time_periods_end <- html_text(html_nodes(webpage, "td:contains('End:') + td")[1])
data_frame <- data.frame(PackageId = package_id,
                         Begin = time_periods_begin,
                         End = time_periods_end)
data_frame
2 Answers
Here is how to scrape the package ID, begin date, and end date from each metadata page.
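The error in the question most likely comes from two things inside the loop: `paste0(..., package_ids$PackageId)` builds all of the URLs at once instead of the one for row `i`, and `map(package_id_link, read_html)` therefore returns a list of documents, which `html_nodes()` cannot search. One way around this is to handle a single page per call. The following is a minimal sketch, not a definitive implementation: the helper names `extract_field` and `scrape_package` are hypothetical, the `td:contains(...) + td` selector is taken from Update 1, and it assumes every metadata page uses the same "Begin:" / "End:" label cells as the example page.

```r
library(rvest)  # also re-exports read_html() and %>%

# Hypothetical helper: return the text of the cell that directly follows a
# label cell such as "Begin:" or "End:". The selector is the one from Update 1.
# If no matching cell exists, html_element() yields a missing node and the
# result is NA.
extract_field <- function(page, label) {
  page %>%
    html_element(paste0("td:contains('", label, "') + td")) %>%
    html_text(trim = TRUE)
}

# Hypothetical helper: read ONE metadata page and return a one-row data frame.
# Assumption: the Package Id from the search table can be used directly in the
# metadataviewer URL, as in the question.
scrape_package <- function(package_id) {
  url <- paste0(
    "https://portal.edirepository.org/nis/metadataviewer?packageid=",
    package_id
  )
  page <- read_html(url)  # a single document, not a list of documents
  data.frame(
    PackageId = package_id,
    Begin     = extract_field(page, "Begin:"),
    End       = extract_field(page, "End:")
  )
}
```

With `package_ids` built as in the question, the rows can then be combined with `results <- do.call(rbind, lapply(package_ids$PackageId, scrape_package))`. Note that this takes the Package Id from the search table rather than re-reading it from each metadata page, and it makes one request per dataset, so a short `Sys.sleep()` between calls may be a reasonable courtesy to the server.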