How do I write two for loops when scraping web pages with Python?

2j4z5cfb posted on 2023-02-02 in Python

I want to write a script that scrapes multiple web pages.
The problem is that two numbers in the URLs vary at the same time:

000/BBSDD0002/93976?page=1&
000/BBSDD0002/93975?page=1&
000/BBSDD0002/93970?page=1&
000/BBSDD0002/93964?page=1&
000/BBSDD0002/93950?page=1&
000/BBSDD0002/93946?page=1&
000/BBSDD0002/93945?page=1&
000/BBSDD0002/93930?page=2&
000/BBSDD0002/93925?page=2&
.
.
.
.
000/BBSDD0002/39045?page=536&

As you can see, the page number and the document number change together.

import requests
import re
from bs4 import BeautifulSoup
from itertools import product

headers = {"User-Agent": "Mozilla/5.0"}  # request headers (defined earlier in my script)

page = range(1, 6)
document = range(39045, 93976)

for i, j in product(page, document):
    print("Page Number:", i)
    url = "https://000.com/BBSDD0002/{}?page={}&".format(i, j)
    res = requests.get(url, headers=headers)
    res.raise_for_status()
    soup = BeautifulSoup(res.text, "lxml")

    list1 = soup.find_all("td", attrs={"class": "sbj"})
    for li in list1:
        print(li.get_text())

This is what I have written so far, but it only loops over the page numbers, so it doesn't give me anything.
Is there a way to loop over both the page number and the document number?


f0brbegy1#

I'm not sure exactly what your goal is, but you could do something like this:

page = range(1, 6)
entry_id = 39045

for p in page:
    for i in range(10):          # 10 document ids per page value
        print(f'https://000.com/BBSDD0002/{entry_id}?page={p}')
        entry_id += 1

which produces:

https://000.com/BBSDD0002/39045?page=1
https://000.com/BBSDD0002/39046?page=1
https://000.com/BBSDD0002/39047?page=1
https://000.com/BBSDD0002/39048?page=1
https://000.com/BBSDD0002/39049?page=1
https://000.com/BBSDD0002/39050?page=1
https://000.com/BBSDD0002/39051?page=1
https://000.com/BBSDD0002/39052?page=1
https://000.com/BBSDD0002/39053?page=1
https://000.com/BBSDD0002/39054?page=1
https://000.com/BBSDD0002/39055?page=2
https://000.com/BBSDD0002/39056?page=2
https://000.com/BBSDD0002/39057?page=2
https://000.com/BBSDD0002/39058?page=2
https://000.com/BBSDD0002/39059?page=2
https://000.com/BBSDD0002/39060?page=2
https://000.com/BBSDD0002/39061?page=2
https://000.com/BBSDD0002/39062?page=2
https://000.com/BBSDD0002/39063?page=2
https://000.com/BBSDD0002/39064?page=2
https://000.com/BBSDD0002/39065?page=3
https://000.com/BBSDD0002/39066?page=3
https://000.com/BBSDD0002/39067?page=3
...
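
If you then want to fetch each of those URLs, the same nested loop can be combined with the requests/BeautifulSoup part from the question. A minimal sketch, assuming the placeholder 000.com board URL and the td.sbj selector from the question, 10 documents per page as above, and a stand-in headers dict:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}   # stand-in request headers

entry_id = 39045
for p in range(1, 6):
    for _ in range(10):                   # 10 document ids per page value, as above
        url = f"https://000.com/BBSDD0002/{entry_id}?page={p}"
        entry_id += 1
        res = requests.get(url, headers=headers)
        if res.status_code != 200:        # skip ids that no longer exist
            continue
        soup = BeautifulSoup(res.text, "lxml")
        for td in soup.find_all("td", attrs={"class": "sbj"}):
            print(td.get_text(strip=True))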
  • If you are trying to scrape the posts, why not iterate over the listing pages and collect the post URLs from them instead (see the sketch below)? That would also keep you from building invalid URLs for documents that have been deleted, for example.
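
A rough sketch of that approach; the listing-page URL and the assumption that each td.sbj cell wraps an <a> link to the post are guesses and need to be adapted to the real markup:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}   # stand-in request headers
base = "https://000.com/BBSDD0002"        # placeholder board URL from the question

post_urls = []
for p in range(1, 537):                   # 536 listing pages, as in the question
    res = requests.get(f"{base}?page={p}", headers=headers)
    res.raise_for_status()
    soup = BeautifulSoup(res.text, "lxml")
    for td in soup.find_all("td", attrs={"class": "sbj"}):
        a = td.find("a")                  # assumed: the subject cell contains a link
        if a and a.get("href"):
            post_urls.append(urljoin(base, a["href"]))

print(len(post_urls), "post URLs collected")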
