All arrays must be of the same length - creating a pandas DataFrame

Posted by ntjbwcob on 2024-01-04 in Other

This is the complete code of the program. It still gives the same error. How can I fix it?

import requests
from bs4 import BeautifulSoup
import pandas as pd

#Connecting to the url
quotes_url = "https://quotes.toscrape.com/"
response = requests.get(quotes_url)

Quotes=[]
Quotes_by=[]
Tags = []

#checking status of url
if response.status_code ==200:
    url_contents = response.text
    doc = BeautifulSoup(url_contents,'html.parser')
    
    quote_rows = doc.find_all('div',{'class':'quote'})

    for quote in quote_rows:
        quotes_tag = doc.find_all('span',{'class','text'})
        quotes_by_tag = doc.find_all('small',{'class':'author'})
        Tags_tag = doc.find_all('div',{'class':'tags'})
        
        '''Error :
        All arrays must be of the same length
        Code
        for tag_tags in Tags_tag:
            tag_tags = doc.find_all('a',{'class':'tag'})
            for tag in tag_tags:
                Tags.append(tag.text.strip())
        '''
        Tags.append({
        'quote':q.get_text() if (q:= quote.select_one('.text')) else None,
        'tags':[tag.get_text() for tag in quote.select('a.tag')]
    })
        for quote in quotes_tag:
             Quotes.append(quote.text.strip())
        #Appending to the list
        for quote_by in quotes_by_tag:
            Quotes_by.append(quote_by.text.strip())
        
    quotes_dict = {
        'Author':Quotes_by,
        'Tags':Tags,
        'Quote':Quotes
        }
    
    Quotes_df = pd.DataFrame(quotes_dict)
    print(Quotes_df)
    Quotes_df.to_csv('quote.csv',index=None)
else:
    print("Error: %s" % response.status_code)

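So I don't know what goes wrong when I try to print the DataFrame. While researching I found different ways of writing this code, and when I tried them they worked.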

yebdmbv4 (answer #1)

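You are getting a flattened list of tags because you collect all of them at once, across the whole page.
To build a nested list (one list of tags per quote), you can try the following: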

Tags = [
    [tag.get_text() for tag in quote.select(".tag")]
    for quote in doc.select(".quote")
]

Output:

[['change', 'deep-thoughts', 'thinking', 'world'],
 ['abilities', 'choices'],
 ['inspirational', 'life', 'live', 'miracle', 'miracles'],
 ['aliteracy', 'books', 'classic', 'humor'],
 ['be-yourself', 'inspirational'],
 ['adulthood', 'success', 'value'],
 ['life', 'love'],
 ['edison', 'failure', 'inspirational', 'paraphrased'],
 ['misattributed-eleanor-roosevelt'],
 ['humor', 'obvious', 'simile']]


Input used:

import requests
from bs4 import BeautifulSoup

url = "https://quotes.toscrape.com"
doc = BeautifulSoup(requests.get(url).text, "html.parser")
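
To make the difference between a page-wide search and a per-quote search concrete, here is a minimal offline sketch. The inline HTML is a hypothetical two-quote page that only imitates the class names used on quotes.toscrape.com; it is an assumption for illustration, not the site's real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical two-quote page, inlined so the sketch runs without network access.
html = """
<div class="quote">
  <span class="text">First quote</span>
  <small class="author">Author A</small>
  <div class="tags"><a class="tag" href="#">life</a><a class="tag" href="#">love</a></div>
</div>
<div class="quote">
  <span class="text">Second quote</span>
  <small class="author">Author B</small>
  <div class="tags"><a class="tag" href="#">humor</a></div>
</div>
"""
doc = BeautifulSoup(html, "html.parser")

# Page-wide search: every tag on the page lands in one flat list,
# so its length no longer matches the number of quotes.
flat_tags = [tag.get_text() for tag in doc.select("a.tag")]

# Per-quote search: each quote element is searched on its own,
# which yields exactly one entry per quote.
quotes = [q.select_one(".text").get_text() for q in doc.select(".quote")]
nested_tags = [
    [tag.get_text() for tag in q.select("a.tag")]
    for q in doc.select(".quote")
]

print(len(quotes), len(flat_tags), len(nested_tags))
```

Only the per-quote list lines up with the list of quotes, which is what a column-wise DataFrame needs.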

q1qsirdb (answer #2)

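In addition to @Timeless's comment about generating a nested list:
Always be careful with a collection of separate lists, because you can rarely guarantee that they will all have the same length when you build a DataFrame from them. You will run into exactly the same problem again as soon as an element is missing (check the tags at https://quotes.toscrape.com/page/3/).
Use dicts instead. If a key is missing from one of them, pandas simply fills that cell with NaN when it creates the DataFrame. It still does not hurt to perform a check anyway: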

for quote in soup.select('.quote'):
    data.append({
        'quote':q.get_text() if (q:= quote.select_one('.text')) else None,
        'tags':[tag.get_text() for tag in quote.select('a.tag')]
    })


Example

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://quotes.toscrape.com/page/3/'
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

data = []

# One dict per quote keeps the text and tags of the same quote together
for quote in soup.select('.quote'):
    data.append({
        'quote': q.get_text() if (q := quote.select_one('.text')) else None,
        'tags': [tag.get_text() for tag in quote.select('a.tag')]
    })

df = pd.DataFrame(data)
print(df)
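
The contrast between separate lists and one dict per row can also be checked without any scraping. The following is a small sketch with made-up values (the names `columns`, `rows` and the sample strings are assumptions for illustration): the first part passes pandas a dict of lists with different lengths, the second part passes a list of dicts where one row lacks a key.

```python
import pandas as pd

# Columns built as separate lists: pandas refuses lengths that differ.
columns = {"Author": ["A", "B"], "Quote": ["q1", "q2"], "Tags": [["life"]]}
try:
    pd.DataFrame(columns)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# One dict per row: a key that is missing in a row becomes NaN in that row.
rows = [
    {"quote": "q1", "author": "A", "tags": ["life"]},
    {"quote": "q2", "tags": []},  # no author was found for this row
]
df = pd.DataFrame(rows)
print(df)
```

With row-wise dicts a missing value costs one NaN cell instead of failing the whole DataFrame construction, which is why this layout tends to be more forgiving for scraped data.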
