Python: an efficient way to collect data (song lyrics)

bogh5gae, posted on 2023-04-19 in Python

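For an ML project, I need to collect the lyrics of about 50 songs by the band Megadeth.
I tried searching online, but no such file exists. The problem with web scraping is that I know nothing about it. ChatGPT wrote the code below, but it does not work.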

import requests
from bs4 import BeautifulSoup

# Set the URL of the Megadeth page on azlyrics.com
url = "https://www.azlyrics.com/m/megadeth.html"

# Send a GET request to the URL and store the response
response = requests.get(url)

# Parse the HTML content of the response using Beautiful Soup
soup = BeautifulSoup(response.content, "html.parser")

# Find all of the links to the Megadeth song pages on azlyrics.com
song_links = soup.find_all("a", href=True, text=True)

# Loop through the song links and scrape the lyrics from each page
for link in song_links:
    # Check if the link is a link to a Megadeth song
    if "megadeth" in link["href"].lower():
        # Send a GET request to the song page and store the response
        song_response = requests.get(link["href"])
        
        # Parse the HTML content of the song page using Beautiful Soup
        song_soup = BeautifulSoup(song_response.content, "html.parser")
        
        # Find the div containing the lyrics of the song
        lyrics_div = song_soup.find("div", {"class": "col-xs-12 col-lg-8 text-center"})
        
        # Extract the lyrics text from the div and print it
        lyrics_text = lyrics_div.get_text()
        print(lyrics_text)

It seems to crash at line 28: song_response = requests.get(link["href"])
Any other approach, or a fix for this one, would be highly appreciated.

Answer #1 (bfnvny8b)

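I ran your code, and the problem is easy to find.
During the iteration, when a link finally contains "megadeth", link["href"] is equal to /lyrics/megadeth/lastriteslovedtodeth.html. That is not a complete URL (it has no scheme or domain), so requests.get() raises an error (a MissingSchema exception). The fix is very simple.
Like this: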

import requests
from bs4 import BeautifulSoup

site_domain = "https://www.azlyrics.com"
# Set the URL of the Megadeth page on azlyrics.com
url = f"{site_domain}/m/megadeth.html"

# Send a GET request to the URL and store the response
response = requests.get(url)

# Parse the HTML content of the response using Beautiful Soup
soup = BeautifulSoup(response.content, "html.parser")

# Find all of the links to the Megadeth song pages on azlyrics.com
song_links = soup.find_all("a", href=True, text=True)

# Loop through the song links and scrape the lyrics from each page
for link in song_links:
    # Check if the link is a link to a Megadeth song
    if "megadeth" in link["href"].lower():
        # Send a GET request to the song page and store the response
        song_response = requests.get(f'{site_domain}{link["href"]}')

        # Parse the HTML content of the song page using Beautiful Soup
        song_soup = BeautifulSoup(song_response.content, "html.parser")

        # Find the div containing the lyrics of the song
        lyrics_div = song_soup.find("div", {"class": "col-xs-12 col-lg-8 text-center"})

        # Extract the lyrics text from the div and print it,
        # skipping pages where the expected div is missing
        if lyrics_div:
            lyrics_text = lyrics_div.get_text()
            print(lyrics_text)

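Basically, you create a global variable for the site domain and prepend it to each href. There are certainly better fixes, but this seems to be the simplest.
Note that this fix may not work for every website. Any site can format its hrefs in many different ways.
I hope it solves your problem. I would also suggest learning how to debug. With a debugger, it is easy to find the root of a problem like this.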
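One way to cope with hrefs that are formatted differently is to let the standard library resolve each href against the page it was found on, rather than concatenating strings. This is only a minimal sketch: `resolve_link` and `PAGE_URL` are hypothetical names introduced for illustration, and the sample hrefs are assumed to look like the one quoted above.

```python
from urllib.parse import urljoin

# URL of the page the links were scraped from (same as `url` in the code above)
PAGE_URL = "https://www.azlyrics.com/m/megadeth.html"


def resolve_link(page_url, href):
    """Return an absolute URL for an href found on page_url."""
    # urljoin handles root-relative, page-relative and already-absolute hrefs
    return urljoin(page_url, href)


# A root-relative href gets the scheme and domain of the page
print(resolve_link(PAGE_URL, "/lyrics/megadeth/lastriteslovedtodeth.html"))
# An href that is already absolute is returned unchanged
print(resolve_link(PAGE_URL, "https://www.azlyrics.com/lyrics/megadeth/lastriteslovedtodeth.html"))
```

In the loop, the request would then become requests.get(resolve_link(url, link["href"])).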

Answer #2 (wwwo4jvm)

Since you are only extracting /lyrics/megadeth/lastriteslovedtodeth.html, you need to add the base URL manually. Try defining a base_url value and adding it where it is needed:

import requests
from bs4 import BeautifulSoup

# Base URL of the site, prepended to the relative hrefs below
base_url = "https://www.azlyrics.com"
# Set the URL of the Megadeth page on azlyrics.com
url = "https://www.azlyrics.com/m/megadeth.html"

# Send a GET request to the URL and store the response
response = requests.get(url)

# Parse the HTML content of the response using Beautiful Soup
soup = BeautifulSoup(response.content, "html.parser")

# Find all of the links to the Megadeth song pages on azlyrics.com
song_links = soup.find_all("a", href=True, text=True)

# Loop through the song links and scrape the lyrics from each page
for link in song_links:
    # Check if the link is a link to a Megadeth song
    if "megadeth" in link["href"].lower():
        # Send a GET request to the song page and store the response
        song_response = requests.get(base_url + link["href"])  # prepend the base URL here
        
        # Parse the HTML content of the song page using Beautiful Soup
        song_soup = BeautifulSoup(song_response.content, "html.parser")
        
        # Find the div containing the lyrics of the song
        lyrics_div = song_soup.find("div", {"class": "col-xs-12 col-lg-8 text-center"})
        
        # Extract the lyrics text from the div and print it
        if lyrics_div:
            lyrics_text = lyrics_div.get_text()
            print(lyrics_text)
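Since the goal is about 50 songs, it may also help to narrow the link list before requesting anything: the "megadeth" in href test can match links that are not song pages, and the same song may be linked more than once. The following is a rough sketch using only the standard library. `select_song_urls` is a hypothetical helper, and the `/lyrics/megadeth/` path prefix and the limit of 50 are assumptions based on the href shown above, not something the site documents.

```python
from urllib.parse import urljoin, urlparse


def select_song_urls(page_url, hrefs, prefix="/lyrics/megadeth/", limit=50):
    """Resolve hrefs against page_url and keep unique song-page URLs."""
    selected = []
    seen = set()
    for href in hrefs:
        # Turn relative hrefs into absolute URLs
        absolute = urljoin(page_url, href)
        # Keep only URLs whose path starts with the assumed song prefix
        if not urlparse(absolute).path.startswith(prefix):
            continue
        # Skip songs that were already collected
        if absolute in seen:
            continue
        seen.add(absolute)
        selected.append(absolute)
        # Stop once enough songs have been collected
        if len(selected) >= limit:
            break
    return selected
```

With the code above, this could be called as select_song_urls(url, [a["href"] for a in song_links]), and the loop would then request each returned URL in turn.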
