Python: efficient ways to collect data (song lyrics)

Posted by bogh5gae on 2023-04-19 in Python


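For an ML project, I need to collect the lyrics of about 50 songs by the band Megadeth.
I tried looking online, but no such file exists. The problem with web scraping is that I know nothing about it. ChatGPT wrote the code below, which does not work.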

import requests
from bs4 import BeautifulSoup

# Set the URL of the Megadeth page on azlyrics.com
url = "https://www.azlyrics.com/m/megadeth.html"

# Send a GET request to the URL and store the response
response = requests.get(url)

# Parse the HTML content of the response using Beautiful Soup
soup = BeautifulSoup(response.content, "html.parser")

# Find all of the links to the Megadeth song pages on azlyrics.com
song_links = soup.find_all("a", href=True, text=True)

# Loop through the song links and scrape the lyrics from each page
for link in song_links:
    # Check if the link is a link to a Megadeth song
    if "megadeth" in link["href"].lower():
        # Send a GET request to the song page and store the response
        song_response = requests.get(link["href"])
        
        # Parse the HTML content of the song page using Beautiful Soup
        song_soup = BeautifulSoup(song_response.content, "html.parser")
        
        # Find the div containing the lyrics of the song
        lyrics_div = song_soup.find("div", {"class": "col-xs-12 col-lg-8 text-center"})
        
        # Extract the lyrics text from the div and print it
        lyrics_text = lyrics_div.get_text()
        print(lyrics_text)

It seems to crash on line 28: song_response = requests.get(link["href"])
Any other approach, or a correction to this one, would be highly appreciated.


Source: https://stackoverflow.com/questions/76044617/efficient-ways-to-collect-data-song-lyrics

2 Answers


#1 bfnvny8b

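I ran your code, and the problem was easy to find.
During the iteration, once a link contains "megadeth", link["href"] equals /lyrics/megadeth/lastriteslovedtodeth.html. That is not a valid URL on its own (it has no domain), so requests.get() raises an error. The fix is very easy.
Like this: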

import requests
from bs4 import BeautifulSoup

site_domain = "https://www.azlyrics.com"
# Set the URL of the Megadeth page on azlyrics.com
url = f"{site_domain}/m/megadeth.html"

# Send a GET request to the URL and store the response
response = requests.get(url)

# Parse the HTML content of the response using Beautiful Soup
soup = BeautifulSoup(response.content, "html.parser")

# Find all of the links to the Megadeth song pages on azlyrics.com
song_links = soup.find_all("a", href=True, text=True)

# Loop through the song links and scrape the lyrics from each page
for link in song_links:
    # Check if the link is a link to a Megadeth song
    if "megadeth" in link["href"].lower():
        # Send a GET request to the song page and store the response
        song_response = requests.get(f'{site_domain}{link["href"]}')

        # Parse the HTML content of the song page using Beautiful Soup
        song_soup = BeautifulSoup(song_response.content, "html.parser")

        # Find the div containing the lyrics of the song
        lyrics_div = song_soup.find("div", {"class": "col-xs-12 col-lg-8 text-center"})

        # Extract the lyrics text from the div and print it
        lyrics_text = lyrics_div.get_text()
        print(lyrics_text)

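Basically, you create a global variable for the site domain and prepend it to each href. There are certainly better fixes, but this seems to be the simplest.
Note that this fix may not work for every website, because a site can format its href values in many different ways.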
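Because href values can be root-relative, relative, or already absolute, one alternative to string concatenation is `urllib.parse.urljoin` from the standard library, which resolves an href against the URL of the page it was found on. A minimal sketch follows; the helper name `absolute_url` and the two `example.html` hrefs are made up for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

# URL of the artist page used in the question
PAGE_URL = "https://www.azlyrics.com/m/megadeth.html"


def absolute_url(href, page_url=PAGE_URL):
    """Resolve an href found on page_url into a full URL (hypothetical helper)."""
    return urljoin(page_url, href)


samples = [
    "/lyrics/megadeth/lastriteslovedtodeth.html",  # root-relative, as reported in this answer
    "../lyrics/megadeth/example.html",  # made-up relative href
    "https://www.azlyrics.com/lyrics/megadeth/example.html",  # made-up absolute href
]

for href in samples:
    # Each form resolves to a URL that includes the scheme and domain
    print(absolute_url(href))
```

In the loop above, `requests.get(absolute_url(link["href"]))` could then stand in for the f-string concatenation.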
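I hope this solves your problem. I would also suggest learning how to debug; with a debugger, the root of a problem like this is easy to find.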
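As a small illustration of that debugging advice, one option is to inspect each href before it reaches requests.get(). The sketch below uses `urllib.parse.urlparse` to check whether a value carries both a scheme and a host; `is_absolute` is a hypothetical helper name. A value that fails this check is the kind of URL requests rejects (it raises `requests.exceptions.MissingSchema` when the scheme is missing).

```python
from urllib.parse import urlparse


def is_absolute(href):
    """Return True when href has both a scheme (e.g. https) and a host."""
    parts = urlparse(href)
    return bool(parts.scheme and parts.netloc)


hrefs = [
    "/lyrics/megadeth/lastriteslovedtodeth.html",  # value reported in this answer
    "https://www.azlyrics.com/m/megadeth.html",  # artist page URL from the question
]

for href in hrefs:
    # Printing the raw value makes a missing domain visible at a glance
    print(f"{href!r}: absolute={is_absolute(href)}")
```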

Answered 2023-04-19

#2 wwwo4jvm

Because you only extract /lyrics/megadeth/lastriteslovedtodeth.html, you need to add the base URL yourself. Try defining a base_url value and adding it where it is needed:

import requests
from bs4 import BeautifulSoup

# Base URL of the site, prepended to the relative links below
base_url = "https://www.azlyrics.com"
# Set the URL of the Megadeth page on azlyrics.com
url = "https://www.azlyrics.com/m/megadeth.html"

# Send a GET request to the URL and store the response
response = requests.get(url)

# Parse the HTML content of the response using Beautiful Soup
soup = BeautifulSoup(response.content, "html.parser")

# Find all of the links to the Megadeth song pages on azlyrics.com
song_links = soup.find_all("a", href=True, text=True)

# Loop through the song links and scrape the lyrics from each page
for link in song_links:
    # Check if the link is a link to a Megadeth song
    if "megadeth" in link["href"].lower():
        # Send a GET request to the song page and store the response
        song_response = requests.get(base_url + link["href"])  # prepend the base URL here
        
        # Parse the HTML content of the song page using Beautiful Soup
        song_soup = BeautifulSoup(song_response.content, "html.parser")
        
        # Find the div containing the lyrics of the song
        lyrics_div = song_soup.find("div", {"class": "col-xs-12 col-lg-8 text-center"})
        
        # Extract the lyrics text from the div and print it
        if lyrics_div:
            lyrics_text = lyrics_div.get_text()
            print(lyrics_text)
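
The `if lyrics_div:` check in the code above matters because `find()` returns `None` when no matching element exists, for example if a link leads to a non-song page or the site returns an error page, and calling `.get_text()` on `None` raises `AttributeError`. A minimal offline sketch of that guard follows; it uses a made-up `lyrics` class and hand-written HTML rather than real azlyrics markup, and `extract_text` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def extract_text(html, css_class):
    """Return the text of the first <div> with css_class, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_=css_class)
    if div is None:
        # No matching element: report it instead of crashing on .get_text()
        return None
    return div.get_text(strip=True)


# Hand-written sample pages (assumptions, not real site markup)
page_with_lyrics = '<div class="lyrics">Some lyric line</div>'
page_without_lyrics = "<p>Access denied</p>"

print(extract_text(page_with_lyrics, "lyrics"))
print(extract_text(page_without_lyrics, "lyrics"))
```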

Answered 2023-04-19
