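I have been working with the rvest package and have a question about extracting URLs from a list. My goal is to build a data frame with the following columns: country, city, and the city's URL. I already have a data frame with each country and a list of the cities in each country.
My question is: how do I reference each city so that I can get its corresponding URL? I tried targeting the href inside the td cells of the "wikitable sortable jquery-tablesorter" table, but when I run links = webpage %>% html_node("href") %>% html_text() I only get the main URL.
Thanks for any suggestions!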
library(rvest)
library(tidyverse)
# Get URL
url = "https://en.wikipedia.org/wiki/List_of_towns_and_cities_with_100,000_or_more_inhabitants/country:_A-B"
# Read the HTML code from the website
page = read_html(url)
# Get name of the countries
countries = page %>% html_nodes(".mw-headline") %>% html_text()
#Remove the last two items which are not countries
countries = as_tibble(countries) %>%
  slice(1:(n()-2))
#Add row number to each Country to left_join later
countries = rowid_to_column(countries, "column_label")
# Get cities for that country
# Still working on this since it includes the first table and I get blanks when I filter the html_nodes(".jquery-tablesorter td")
tables = html_nodes(page, "table")
tables = lapply(tables, html_table)
#Remove first element which is not a city, only on the first page
tables = tables[-1]
#---WIP
# Get links for the cities, currently picks the main domain instead of the city
# Can I add a clause before the html node to indicate I want the href from "wikitable sortable jquery-tablesorter"?
links = page %>% html_attr("href") %>% html_text()
#---
#Remove the Province and Population columns and keep City and URL
tables = lapply(tables, "[", -c(2, 3))
#Standardize City as the column
tables = map(tables, set_names, "City")
# Flatten List
all <- bind_rows(tables, .id = "column_label") %>%
  mutate(column_label = as.integer(column_label)) %>%
  left_join(countries, by = "column_label")
2 Answers
Answer 1 (uqjltbpv)
Here is a fully reproducible example that gets you a list of cities with their full URLs:
Created on 2023-01-06 with reprex v2.0.2
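A minimal sketch of the idea, assuming rvest 1.0 or later; it is not the answer's own code. `href` is an attribute rather than an element, so the selector should target the `<a>` elements inside the table cells, and `html_attr("href")` then reads the link. The `jquery-tablesorter` class is added by JavaScript in the browser, so the HTML that `read_html()` receives most likely carries only `wikitable sortable`. The inline HTML is a hypothetical stand-in for the Wikipedia page so that the example runs offline.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for the Wikipedia page (illustration only)
html <- '
<h2><span class="mw-headline">Afghanistan</span></h2>
<table class="wikitable sortable">
  <tr><th>City</th><th>Province</th><th>Population</th></tr>
  <tr><td><a href="/wiki/Kabul">Kabul</a></td><td>Kabul</td><td>1</td></tr>
  <tr><td><a href="/wiki/Herat">Herat</a></td><td>Herat</td><td>2</td></tr>
</table>'
page <- read_html(html)

# Select the <a> elements in the first cell of each row,
# then read the href attribute from those elements
anchors <- page %>% html_elements("table.wikitable td:first-child a")

cities <- data.frame(
  City = html_text2(anchors),
  # Wikipedia hrefs are relative, so resolve them against the site root
  URL = url_absolute(html_attr(anchors, "href"), "https://en.wikipedia.org/"),
  stringsAsFactors = FALSE
)
print(cities)
```

Against the real page, `page <- read_html(url)` would replace the inline HTML; the selector may need adjusting if the first table on the page is not a city table.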
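Answer 2 (dtcbnfnu)
There is a way to get the result you want. I took a different approach, using a small custom function that scrapes the table rows to extract what you need: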
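The row-scraping approach could look roughly like the sketch below. It is an illustration under stated assumptions, not the answerer's original function: `scrape_rows` is a hypothetical helper, the inline HTML is a made-up stand-in for the Wikipedia page, headings and tables are assumed to line up one-to-one in page order, and rvest 1.0 or later is assumed.

```r
library(rvest)

# Hypothetical stand-in for the Wikipedia page (illustration only)
html <- '
<h2><span class="mw-headline">Afghanistan</span></h2>
<table class="wikitable sortable">
  <tr><th>City</th><th>Province</th></tr>
  <tr><td><a href="/wiki/Kabul">Kabul</a></td><td>Kabul</td></tr>
  <tr><td><a href="/wiki/Herat">Herat</a></td><td>Herat</td></tr>
</table>
<h2><span class="mw-headline">Albania</span></h2>
<table class="wikitable sortable">
  <tr><th>City</th><th>County</th></tr>
  <tr><td><a href="/wiki/Tirana">Tirana</a></td><td>Tirana</td></tr>
</table>'
page <- read_html(html)

# Hypothetical helper: takes one table node and a country name,
# returns one row per city with its absolute URL
scrape_rows <- function(tbl, country, base = "https://en.wikipedia.org") {
  # Keep only rows that have a link inside a data cell (skips header rows)
  rows <- html_elements(tbl, xpath = ".//tr[td//a]")
  # First link in each remaining row
  link <- html_element(rows, "td a")
  data.frame(
    Country = rep(country, length(rows)),
    City = html_text2(link),
    URL = paste0(base, html_attr(link, "href")),
    stringsAsFactors = FALSE
  )
}

countries <- page %>% html_elements(".mw-headline") %>% html_text2()
tables <- html_elements(page, "table.wikitable")

# Pair each table with the heading at the same position
result <- do.call(
  rbind,
  lapply(seq_along(tables), function(i) scrape_rows(tables[[i]], countries[i]))
)
print(result)
```

Working row by row keeps each city name and its link together, so the country, city, and URL columns cannot drift out of alignment the way separately scraped vectors can.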