Python web scraping

ycggw6v2 · posted 2022-12-14 in Python

I'm trying to break a script into several functions, and I get the error: "AttributeError: ResultSet object has no attribute 'get'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?"
I've tried to work out how to fix it, but I keep getting the same result. Here is the original script:

#needed imports
import requests
from bs4 import BeautifulSoup as bs

#setting query
query = "stroke"

#handling white spaces
search = query.replace(' ', '+')

#setting results
results = 20

#setting the full url
url = f"https://www.google.com/search?q={search}&num={results}"

#empty list for links
link_list = []

#scraping google
requests_results = requests.get(url)
soup_link = bs(requests_results.content, "html.parser")
links = soup_link.find_all("a")

#checking each link in the soup
for link in links:
    link_href = link.get('href')
    if "url?q=" in link_href and "webcache" not in link_href:
        title = link.find_all('h3')
        if len(title) > 0:
            full_link = link.get('href').split("?q=")[1].split("&sa=U")[0]
            link_list.append(full_link)
            print(full_link)
            print(title[0].getText())
            print("------")

Here is the output:

It runs perfectly. The idea is to swap the single query for a list of phrases I want, and get the same final result format for each query.
So I broke the code into several functions, and the last one gives me the error.

query_list = ['Coronary artery disease','Stroke','Diabetes mellitus','Alzheimer','Lower respiratory infections',\
              'Lung cancer','Cirrhosis']
query_list

The first function:

def getting_links_func(queries):
    url_list = []

    #setting query
    func_queries = queries   

    #setting results
    results = 10
    
    for query in func_queries:
        #handling white spaces
        search = query.replace(' ', '+')
        
        
        #setting the full url
        url = f"https://www.google.com/search?q={search}&num={results}"

        #update list with links
        url_list.append(url)
    
    return url_list
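As a quick sanity check, the URL-building step in this first function condenses to a single comprehension (using a shortened query list for illustration):

```python
# Same URL construction as getting_links_func, as a comprehension.
results = 10
query_list = ['Coronary artery disease', 'Stroke']
url_list = [f"https://www.google.com/search?q={q.replace(' ', '+')}&num={results}"
            for q in query_list]
print(url_list[0])
# https://www.google.com/search?q=Coronary+artery+disease&num=10
```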

Output:

The second function:

def links_soup_func(url_list):
    soup_list = []
    
    
    for url in url_list:
        #scraping google
        requests_results = requests.get(url)
        soup_link = bs(requests_results.content, "html.parser")
        links = soup_link.find_all("a")
        soup_list.append(links)
        
    
    return soup_list

It seems to work fine:

The third function is the naughty one:

def urls_from_soup_func(soup_list):
    #for each soup, getting the links in the search page
    for soup in soup_list:
        link_href = soup.get('href')
        if "url?q=" in link_href and "webcache" not in link_href:
            title = soup.find_all('h3')
            if len(title) > 0:
                full_link = soup.get('href').split("?q=")[1].split("&sa=U")[0]
                link_list.append(full_link)
                print(full_link)
                print(title[0].getText())
                print("------")

This is where I get the find / find_all error. I tried breaking out of the for loop and checking just one item, but I always hit the same problem.

I hope I explained it clearly. Thanks for your help!


xdnvmnnf1#

The error is pretty clear. You are appending the ResultSet object to your list, instead of extending it with single <a> elements.

# what you are doing is this
a = [1, 2]
b = [3, 4]
a.append(b)
print(a)  # [1, 2, [3, 4]]

# you should be doing this
a.extend(b)
print(a)  # [1, 2, 3, 4]
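The consequence in the third function is that each `soup` pulled from `soup_list` is a whole ResultSet, so `.get()` fails. Since a ResultSet behaves like a list of Tags, a plain list reproduces the error without any scraping:

```python
# A plain list stands in for bs4's list-like ResultSet here.
links = [{'href': '/url?q=a'}, {'href': '/url?q=b'}]  # like find_all("a")
try:
    links.get('href')  # a Tag method called on the whole collection
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'get'
```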

The change below should solve your problem.

soup_list.extend(links)
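With that fix, `soup_list` is a flat list of individual `<a>` Tags, and the loop in the third function works like in the original script. A self-contained sketch against a small hypothetical HTML fragment (the snippet and URLs are made up for illustration):

```python
from bs4 import BeautifulSoup as bs

# Hypothetical search-page fragment standing in for a real response.
html = """
<a href="/url?q=https://example.com/stroke&sa=U&ved=x"><h3>Stroke</h3></a>
<a href="/url?q=https://webcache.googleusercontent.com/x&sa=U">cached</a>
<a href="/settings">settings</a>
"""

soup_list = []
soup_list.extend(bs(html, "html.parser").find_all("a"))  # extend, not append

link_list = []
for link in soup_list:          # each item is now a single <a> Tag
    link_href = link.get('href')
    if "url?q=" in link_href and "webcache" not in link_href:
        title = link.find_all('h3')
        if len(title) > 0:
            full_link = link_href.split("?q=")[1].split("&sa=U")[0]
            link_list.append(full_link)

print(link_list)  # ['https://example.com/stroke']
```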
