Python: scrape all the pages of a website

hrirmatl · asked 2022-12-21 · Python

I'm trying to scrape this website: voxnews.info

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

web='https://voxnews.info'
def main(req, num, web):
    r = req.get(web+"/page/{}/".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    goal = [(x.time.text, x.h1.a.get_text(strip=True), x.select_one("span.cat-links").get_text(strip=True), x.p.get_text(strip=True))
           for x in soup.select("div.site-content")]

    return goal

with ThreadPoolExecutor(max_workers=30) as executor:
    with requests.Session() as req:
        fs = [executor.submit(main, req, num, web) for num in range(1, 2)] # need to scrape all the webpages in the website
        allin = []
        for f in fs:
            allin.extend(f.result())
        df = pd.DataFrame.from_records(
            allin, columns=["Date", "Title", "Category", "Content"])
        print(df)

But the code has two problems:

  • The first is that I'm not scraping all the pages (right now I have 1 and 2 in the range, but I need every page);
  • The second is that it doesn't save the date correctly.

It would be great if someone could look at the code and tell me how to improve it so that it solves these two problems.


Answer #1, by 2mbi3lxu

A few small changes.
First, you don't need requests.Session() for one-off requests -- you aren't saving any data (cookies, auth, etc.) between requests.
A small change to the with statement; I don't know if it's more correct, or just how I do it, but you don't need all of your code running while the executor is still open.
I've provided two ways to parse the date: as it's written on the site (an Italian string), and as a datetime object.
I didn't see any "p" tags within the articles, so I removed that part. It seems that to get the "content" of the articles you'd have to actually navigate to each one and scrape it individually (a sketch of that step follows the code below). I dropped that column from the code.
In your original code you weren't getting every article on the page, just the first one: there is only one "div.site-content" tag per page, but multiple "article" tags. That's what changed.
Finally, I prefer find over select, but that's just my style choice. This worked for me on the first three pages; I haven't tried the whole site. Be careful when you run it: 78 batches of 30 requests each may get you blocked...

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
import datetime

def main(num, web):
    r = requests.get(web+"/page/{}/".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    html = soup.find("div", class_="site-content")
    articles = html.find_all("article")
    
    # Option 1: date as a string, as written on the site (in Italian)
    # goal = [(x.time.get_text(), x.h1.a.get_text(strip=True), x.find("span", class_="cat-links").get_text(strip=True)) for x in articles]

    # Option 2: date as a datetime object, parsed from the time tag's datetime attribute
    goal = [(datetime.datetime.strptime(x.time["datetime"], "%Y-%m-%dT%H:%M:%S%z"), x.h1.a.get_text(strip=True), x.find("span", class_="cat-links").get_text(strip=True)) for x in articles]

    return goal

web='https://voxnews.info'

# Discover the total number of pages from the pagination links on page 1
r = requests.get(web)
soup = BeautifulSoup(r.text, "html.parser")
last_page = soup.find_all("a", class_="page-numbers")[1].get_text()  # this link held the last page number on this site
last_int = int(last_page.replace(".", ""))  # e.g. "2.320" -> 2320 (Italian thousands separator)

### BE CAREFUL HERE WITH TESTING, DON'T USE ALL 2,320 PAGES ###
with ThreadPoolExecutor(max_workers=30) as executor:
    fs = [executor.submit(main, num, web) for num in range(1, last_int + 1)]  # +1 so the last page is included

allin = []
for f in fs:
    allin.extend(f.result())
df = pd.DataFrame.from_records(
    allin, columns=["Date", "Title", "Category"])
print(df)
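
As noted above, each article's body text lives on its own page, so filling the asker's "Content" column means one extra request per article. A minimal sketch of that step, assuming the body sits in a div with class "entry-content" (an assumption about the site's markup, not verified):

import requests
from bs4 import BeautifulSoup

def article_content(url):
    # Fetch a single article page and return its body text.
    # "entry-content" is a guess at the content container's class.
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    body = soup.find("div", class_="entry-content")
    return body.get_text(strip=True) if body else ""

Inside main() you could take each article's link from x.h1.a["href"] and pass it to article_content(); just be aware this multiplies the number of requests by the number of articles per page.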

Answer #2, by kjthegm6

To get results from all pages, rather than from one or ten hard-coded ones, the best solution is an infinite while loop together with a test for something (a button, an element) that will make it exit.
This is better than a hard-coded for loop because the while loop will walk through all the pages, no matter how many there are, until a certain condition is met. In our case that condition is the presence of a button on the page (the .next CSS selector):

if soup.select_one(".next"):
    page_num += 1
else:
    break

You can also add a page limit; once it's reached, the loop will stop as well:

limit = 20       # paginate through 20 pages
if page_num == limit:
    break

Full code:

from bs4 import BeautifulSoup
import requests, json, lxml  # lxml is the parser backend passed to BeautifulSoup below

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
}

data = []

page_num = 1

limit = 20                 # page limit

while True:
    html = requests.get(f"https://voxnews.info/page/{page_num}", headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    print(f"Extracting page: {page_num}")

    print("-" * 10)
  
    for result in soup.select(".entry-header"):
        title = result.select_one(".entry-title a").text
        category = result.select_one(".entry-meta:nth-child(1)").text.strip()
        date = result.select_one(".entry-date").text
          
        data.append({
            "title": title,
            "category": category,
            "date": date
        })

    # Condition for exiting the loop when the specified number of pages is reached.
    if page_num == limit:
        break
  
    if soup.select_one(".next"):
        page_num += 1
    else:
        break  

print(json.dumps(data, indent=2, ensure_ascii=False))

Example output:

[
  {
    "title": "Italia invasa dai figli degli immigrati: “Italiani pezzi di merda” – VIDEO",
    "category": "BREAKING NEWS, INVASIONE, MILANO, VIDEO",
    "date": "Novembre 23, 2022"
  },
  {
    "title": "Soumahoro accusato di avere fatto sparire altri 200mila euro – VIDEO",
    "category": "BREAKING NEWS, POLITICA, VIDEO",
    "date": "Novembre 23, 2022"
  },
  {
    "title": "Città invase da immigrati: “Qui comandiamo noi” – VIDEO",
    "category": "BREAKING NEWS, INVASIONE, VENEZIA, VIDEO",
    "date": "Novembre 23, 2022"
  },
  # ...
]
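
The asker's second problem was the date column, and the dates above are strings with Italian month names, which strptime can't parse without an Italian locale. A minimal locale-free sketch, using a hand-written month map (the mapping is an assumption covering the standard Italian month names):

import datetime

# Italian month names -> month numbers (assumed from the sample output above)
ITALIAN_MONTHS = {
    "gennaio": 1, "febbraio": 2, "marzo": 3, "aprile": 4,
    "maggio": 5, "giugno": 6, "luglio": 7, "agosto": 8,
    "settembre": 9, "ottobre": 10, "novembre": 11, "dicembre": 12,
}

def parse_italian_date(s):
    # "Novembre 23, 2022" -> datetime.date(2022, 11, 23)
    month_name, rest = s.split(" ", 1)
    day, year = rest.split(", ")
    return datetime.date(int(year), ITALIAN_MONTHS[month_name.lower()], int(day))

print(parse_italian_date("Novembre 23, 2022"))  # 2022-11-23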

If you want to learn more about scraping websites, there's a blog post: 13 ways to scrape any public data from any website.
