Scraping multiple URLs with rvest (hopefully an easy question)

vxf3dgd4  posted on 2023-03-15  in Other

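I'm new to web scraping, so apologies for the basic question, but I have searched and struggled to apply previous answers here. I'm trying to scrape several related URLs from fbref.com (part of Sports Reference), and I've run into a problem that I think calls for lapply. I can successfully pull a single URL, just not all of them at once.
Here is the gist of what I'm trying to do: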

library("rvest")
library("tidyverse")

year1 <- paste0(2006:2021)
year2 <- paste0(2007:2022)

urls <- sort(rep(paste0("https://fbref.com/en/comps/Big5/", year1, "-", year2, "/stats/players/", year1, "-", year2, "-Big-5-European-Leagues-Stats")))

table <- read_html(urls) |> 
  html_nodes("table") |> 
  html_table()

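I think I just need to lapply over the last part, but I'm having trouble getting the syntax right. When I run that last part on a single URL by pasting it in directly, as shown below, I get the output I want. I just want every season from 2006-07 through 2021-22 in one CSV file.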

> url <- "https://fbref.com/en/comps/Big5/2021-2022/stats/players/2021-2022-Big-5-European-Leagues-Stats"
> table <- read_html(url) |> 
+     html_nodes("table") |> 
+     html_table()
> write.csv(table, file = "fbrefinitial.csv")

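After that, I think I just need bind_rows together with year1 or year2 to add a column for each year, since I want everything in a single sheet of one CSV file. (What is the right way to write that command?)
This is most similar to this post, but my attempts to apply that logic in different ways haven't worked.
Thanks for your help!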

Answer #1 (vuktfyat)

read_html() expects a single URL, so you can wrap the pipeline in lapply() to call it once per URL:

lapply(urls, function(url) {
  read_html(url) |> 
  html_nodes("table") |> 
  html_table()
})
#> [[1]]
#> [[1]][[1]]
#> # A tibble: 2,687 x 29
#>    ``    ``        ``    ``    ``    ``    ``    ``    Playi~1 Playi~2 Playi~3 Playi~4
#>    <chr> <chr>     <chr> <chr> <chr> <chr> <chr> <chr> <chr>   <chr>   <chr>   <chr>  
#>  1 Rk    Player    Nati~ Pos   Squad Comp  Age   Born  MP      Starts  Min     90s    
#>  2 1     Dani Aba~ es E~ FW,MF Celt~ es L~ 18    1987  1       0       13      0.1    
#>  3 2     Jacques ~ fr F~ DF    Nice  fr L~ 28    1978  30      28      2,492   27.7   
#>  4 3     Christia~ it I~ GK    Tori~ it S~ 29    1977  36      36      3,235   35.9   
#>  5 4     Pato Abb~ ar A~ GK    Geta~ es L~ 33    1972  36      36      3,215   35.7   
#>  6 5     Elvis Ab~ it I~ FW    Tori~ it S~ 25    1981  29      15      1,432   15.9   
#>  7 6     Nadjim A~ km C~ MF    Sedan fr L~ 22    1984  17      11      1,136   12.6   
#>  8 7     Nelson A~ uy U~ MF    Atal~ it S~ 33    1973  5       2       121     1.3    
#>  9 8     Mathias ~ de G~ DF    Hamb~ de B~ 25    1981  8       4       416     4.6    
#> 10 9     Éric Abi~ fr F~ DF    Lyon  fr L~ 26    1979  33      31      2,750   30.6   
#> # ... with 2,677 more rows, 17 more variables: Performance <chr>, Performance <chr>,
#> #   Performance <chr>, Performance <chr>, Performance <chr>, Performance <chr>,
#> #   Performance <chr>, Performance <chr>, Progression <chr>, Progression <chr>,
#> #   Progression <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>,
#> #   `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `` <chr>,
#> #   and abbreviated variable names 1: `Playing Time`, 2: `Playing Time`,
#> #   3: `Playing Time`, 4: `Playing Time`
#> # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
#> 
#> 
#> [[2]]
#> [[2]][[1]]
#> # A tibble: 2,770 x 29
#>    ``    ``        ``    ``    ``    ``    ``    ``    Playi~1 Playi~2 Playi~3 Playi~4
#>    <chr> <chr>     <chr> <chr> <chr> <chr> <chr> <chr> <chr>   <chr>   <chr>   <chr>  
#>  1 Rk    Player    Nati~ Pos   Squad Comp  Age   Born  MP      Starts  Min     90s    
#>  2 1     Jacques ~ fr F~ DF    Nice  fr L~ 29    1978  10      4       434     4.8    
#>  3 2     Jacques ~ fr F~ DF    Nürn~ de B~ 29    1978  10      9       820     9.1    
#>  4 3     Ignazio ~ it I~ DF,MF Empo~ it S~ 20    1986  24      9       1,167   13.0   
#>  5 4     Christia~ it I~ GK    Atlé~ es L~ 30    1977  21      20      1,804   20.0   
#>  6 5     Pato Abb~ ar A~ GK    Geta~ es L~ 34    1972  34      34      3,046   33.8   
#>  7 6     Yacine A~ ma M~ MF    Stra~ fr L~ 26    1981  23      17      1,549   17.2   
#>  8 7     Damià Ab~ es E~ DF,MF Betis es L~ 25    1982  26      24      2,230   24.8   
#>  9 8     Éric Abi~ fr F~ DF    Barc~ es L~ 27    1979  30      28      2,523   28.0   
#> 10 9     Ahmed Ab~ eg E~ DF,MF Stra~ fr L~ 26    1981  2       1       91      1.0    
#> # ... with 2,760 more rows, 17 more variables: Performance <chr>, Performance <chr>,
#> #   Performance <chr>, Performance <chr>, Performance <chr>, Performance <chr>,
#> #   Performance <chr>, Performance <chr>, Progression <chr>, Progression <chr>,
#> #   Progression <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>,
#> #   `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `` <chr>,
#> #   and abbreviated variable names 1: `Playing Time`, 2: `Playing Time`,
#> #   3: `Playing Time`, 4: `Playing Time`
#> # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
#> 
#> 
#> [[3]]
#> [[3]][[1]]
#> # A tibble: 2,796 x 29
#>    ``    ``        ``    ``    ``    ``    ``    ``    Playi~1 Playi~2 Playi~3 Playi~4
#>    <chr> <chr>     <chr> <chr> <chr> <chr> <chr> <chr> <chr>   <chr>   <chr>   <chr>  
#>  1 Rk    Player    Nati~ Pos   Squad Comp  Age   Born  MP      Starts  Min     90s    
#>  2 1     Jacques ~ fr F~ DF    Vale~ fr L~ 30    1978  18      14      1,252   13.9   
#>  3 2     Ignazio ~ it I~ DF,MF Tori~ it S~ 21    1986  25      21      1,913   21.3   
#>  4 3     Christia~ it I~ GK    Milan it S~ 31    1977  28      28      2,441   27.1   
#>  5 4     Pato Abb~ ar A~ GK    Geta~ es L~ 35    1972  13      13      1,083   12.0   
#>  6 5     Elvis Ab~ it I~ FW    Tori~ it S~ 27    1981  10      2       388     4.3    
#>  7 6     Djamel A~ dz A~ MF    Nant~ fr L~ 22    1986  22      12      1,139   12.7   
#>  8 7     Damià Ab~ es E~ DF,MF Betis es L~ 26    1982  25      20      1,788   19.9   
#>  9 8     Éric Abi~ fr F~ DF    Barc~ es L~ 28    1979  25      25      2,116   23.5   
#> 10 9     Fabrice ~ fr F~ MF    Lori~ fr L~ 29    1979  35      35      3,060   34.0   
#> # ... with 2,786 more rows, 17 more variables: Performance <chr>, Performance <chr>,
#> #   Performance <chr>, Performance <chr>, Performance <chr>, Performance <chr>,
#> #   Performance <chr>, Performance <chr>, Progression <chr>, Progression <chr>,
#> #   Progression <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>,
#> #   `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `` <chr>,
#> #   and abbreviated variable names 1: `Playing Time`, 2: `Playing Time`,
#> #   3: `Playing Time`, 4: `Playing Time`
#> # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
#>
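
The question also asks how to get every season into one CSV with a season column. A minimal sketch of one way to do that follows; it is not part of the original answer. It assumes the first table on each page is the one wanted, and that the real column names sit in the first data row ("Rk", "Player", ...) as in the output above. The names `seasons`, `promote_header`, `scrape_season`, and `scrape_all`, the output file name, and the 5-second pause between requests are all hypothetical choices made for illustration.

```r
# Build one URL per season, 2006-2007 through 2021-2022 (same pattern as the question).
seasons <- paste0(2006:2021, "-", 2007:2022)
urls <- paste0(
  "https://fbref.com/en/comps/Big5/", seasons,
  "/stats/players/", seasons, "-Big-5-European-Leagues-Stats"
)

# Hypothetical helper: use the first data row ("Rk", "Player", ...) as the
# column names, then drop that row and any later rows that repeat it.
promote_header <- function(tbl) {
  header <- vapply(tbl, function(col) as.character(col[[1]]), character(1))
  body <- lapply(tbl, function(col) col[-1])
  new_names <- make.unique(unname(header))  # a repeated name "x" becomes "x.1"
  out <- as.data.frame(unname(body), stringsAsFactors = FALSE)
  names(out) <- new_names
  out[!(out[[1]] %in% header[[1]]), , drop = FALSE]
}

# Hypothetical helper: read one page, clean its first table, tag it with the season.
scrape_season <- function(url, season, pause = 5) {
  Sys.sleep(pause)  # assumed pause between requests; adjust to the site's rules
  page <- rvest::read_html(url)
  raw <- rvest::html_table(rvest::html_element(page, "table"))
  out <- promote_header(raw)
  out$season <- season
  out
}

# Scrape every season, stack the results, and write a single CSV.
scrape_all <- function(path = "fbref_2006_2022.csv") {
  all_seasons <- dplyr::bind_rows(Map(scrape_season, urls, seasons))
  write.csv(all_seasons, path, row.names = FALSE)
  invisible(all_seasons)
}
```

Calling `scrape_all()` fetches the 16 pages and writes the combined file. Promoting the header row first matters because the raw tibbles have blank and duplicated column names, which bind_rows cannot match up across seasons. Every column comes back as character (values such as "2,492" contain commas), so numeric conversion is left as a separate step.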
