Python: scrape all the pages of a website

hrirmatl · asked 2022-12-21 · Python

I'm trying to scrape this website: voxnews.info

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

web='https://voxnews.info'
def main(req, num, web):
    r = req.get(web+"/page/{}/".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    goal = [(x.time.text, x.h1.a.get_text(strip=True), x.select_one("span.cat-links").get_text(strip=True), x.p.get_text(strip=True))
           for x in soup.select("div.site-content")]

    return goal

with ThreadPoolExecutor(max_workers=30) as executor:
    with requests.Session() as req:
        fs = [executor.submit(main, req, num, web) for num in range(1, 2)] # need to scrape all the webpages in the website
        allin = []
        for f in fs:
            allin.extend(f.result())
        df = pd.DataFrame.from_records(
            allin, columns=["Date", "Title", "Category", "Content"])
        print(df)

But the code has two problems:

  • The first is that I'm not scraping all the pages (right now I have 1 and 2 in the range, but I need every page);
  • The second is that it doesn't save the date correctly.

It would be great if someone could look at the code and tell me how to improve it so that it solves these two problems.


Answer #1, by 2mbi3lxu

A few small changes.
First, you don't need requests.Session() for one-off requests -- you aren't saving any data (cookies, auth, etc.) between requests.
A small change to the with statement; I don't know if it's more correct, or just how I do it, but you don't need all of your code running while the executor is still open.
I've provided two ways to parse the date: as it's written on the site (an Italian string), and as a datetime object.
I didn't see any "p" tags within the articles, so I removed that part. It seems that to get the "content" of the articles you'd have to actually navigate to each one and scrape it individually (a sketch of that step follows the code below). I dropped that column from the code.
In your original code you weren't getting every article on the page, just the first one: there is only one "div.site-content" tag per page, but multiple "article" tags. That's what changed.
Finally, I prefer find over select, but that's just my style choice. This worked for me on the first three pages; I haven't tried the whole site. Be careful when you run it: 78 batches of 30 requests each may get you blocked...

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
import datetime

def main(num, web):
    r = requests.get(web+"/page/{}/".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    html = soup.find("div", class_="site-content")
    articles = html.find_all("article")
    
    # Option 1: date as a string, as written on the site (in Italian)
    # goal = [(x.time.get_text(), x.h1.a.get_text(strip=True), x.find("span", class_="cat-links").get_text(strip=True)) for x in articles]

    # Option 2: date as a datetime object, parsed from the time tag's datetime attribute
    goal = [(datetime.datetime.strptime(x.time["datetime"], "%Y-%m-%dT%H:%M:%S%z"), x.h1.a.get_text(strip=True), x.find("span", class_="cat-links").get_text(strip=True)) for x in articles]

    return goal

web='https://voxnews.info'

# Discover the total number of pages from the pagination links on page 1
r = requests.get(web)
soup = BeautifulSoup(r.text, "html.parser")
last_page = soup.find_all("a", class_="page-numbers")[1].get_text()  # this link held the last page number on this site
last_int = int(last_page.replace(".", ""))  # e.g. "2.320" -> 2320 (Italian thousands separator)

### BE CAREFUL HERE WITH TESTING, DON'T USE ALL 2,320 PAGES ###
with ThreadPoolExecutor(max_workers=30) as executor:
    fs = [executor.submit(main, num, web) for num in range(1, last_int + 1)]  # +1 so the last page is included

allin = []
for f in fs:
    allin.extend(f.result())
df = pd.DataFrame.from_records(
    allin, columns=["Date", "Title", "Category"])
print(df)
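
As noted above, each article's body text lives on its own page, so filling the asker's "Content" column means one extra request per article. A minimal sketch of that step, assuming the body sits in a div with class "entry-content" (an assumption about the site's markup, not verified):

import requests
from bs4 import BeautifulSoup

def article_content(url):
    # Fetch a single article page and return its body text.
    # "entry-content" is a guess at the content container's class.
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    body = soup.find("div", class_="entry-content")
    return body.get_text(strip=True) if body else ""

Inside main() you could take each article's link from x.h1.a["href"] and pass it to article_content(); just be aware this multiplies the number of requests by the number of articles per page.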

Answer #2, by kjthegm6

To get results from all pages, rather than from one or ten hard-coded ones, the best solution is an infinite while loop together with a test for something (a button, an element) that will make it exit.
This is better than a hard-coded for loop because the while loop will walk through all the pages, no matter how many there are, until a certain condition is met. In our case that condition is the presence of a button on the page (the .next CSS selector):

if soup.select_one(".next"):
    page_num += 1
else:
    break

You can also add a page limit; once it's reached, the loop will stop as well:

limit = 20       # paginate through 20 pages
if page_num == limit:
    break

Full code:

from bs4 import BeautifulSoup
import requests, json, lxml  # lxml is the parser backend passed to BeautifulSoup below

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
}

data = []

page_num = 1

limit = 20                 # page limit

while True:
    html = requests.get(f"https://voxnews.info/page/{page_num}", headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    print(f"Extracting page: {page_num}")

    print("-" * 10)
  
    for result in soup.select(".entry-header"):
        title = result.select_one(".entry-title a").text
        category = result.select_one(".entry-meta:nth-child(1)").text.strip()
        date = result.select_one(".entry-date").text
          
        data.append({
            "title": title,
            "category": category,
            "date": date
        })

    # Condition for exiting the loop when the specified number of pages is reached.
    if page_num == limit:
        break
  
    if soup.select_one(".next"):
        page_num += 1
    else:
        break  

print(json.dumps(data, indent=2, ensure_ascii=False))

Example output:

[
  {
    "title": "Italia invasa dai figli degli immigrati: “Italiani pezzi di merda” – VIDEO",
    "category": "BREAKING NEWS, INVASIONE, MILANO, VIDEO",
    "date": "Novembre 23, 2022"
  },
  {
    "title": "Soumahoro accusato di avere fatto sparire altri 200mila euro – VIDEO",
    "category": "BREAKING NEWS, POLITICA, VIDEO",
    "date": "Novembre 23, 2022"
  },
  {
    "title": "Città invase da immigrati: “Qui comandiamo noi” – VIDEO",
    "category": "BREAKING NEWS, INVASIONE, VENEZIA, VIDEO",
    "date": "Novembre 23, 2022"
  },
  # ...
]
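
The asker's second problem was the date column, and the dates above are strings with Italian month names, which strptime can't parse without an Italian locale. A minimal locale-free sketch, using a hand-written month map (the mapping is an assumption covering the standard Italian month names):

import datetime

# Italian month names -> month numbers (assumed from the sample output above)
ITALIAN_MONTHS = {
    "gennaio": 1, "febbraio": 2, "marzo": 3, "aprile": 4,
    "maggio": 5, "giugno": 6, "luglio": 7, "agosto": 8,
    "settembre": 9, "ottobre": 10, "novembre": 11, "dicembre": 12,
}

def parse_italian_date(s):
    # "Novembre 23, 2022" -> datetime.date(2022, 11, 23)
    month_name, rest = s.split(" ", 1)
    day, year = rest.split(", ")
    return datetime.date(int(year), ITALIAN_MONTHS[month_name.lower()], int(day))

print(parse_italian_date("Novembre 23, 2022"))  # 2022-11-23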

If you want to learn more about scraping websites, there's a blog post: 13 ways to scrape any public data from any website.
