So I have a piece of code that scrapes mineral names and prices from 14 pages (so far) and saves them to a .txt file. At first I tried only page 1, then I wanted to add more pages to collect more data. But then the code scraped something it shouldn't have: a random name/string. I didn't expect it to grab that, but it did, and it assigned a wrong price to it! This happens right after a mineral with this "unexpected name", and from then on the entire rest of the list gets wrong prices. See the picture below:
Because this string is different from all the others, the later code can't split it and throws an error:
cutted2 = split2.pop(1)
^^^^^^^^^^^^^
IndexError: pop index out of range
I tried to ignore these errors using one of the approaches from another Stack Overflow page:
try:
    cutted2 = split2.pop(1)
except IndexError:
    continue
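Swallowing the exception hides the symptom but not the cause: once the stray name lands in `names` without a matching entry in `prices`, `zip` pairs every later name with the previous name's price. A minimal reconstruction of the mismatch, with made-up data:

```python
# Made-up data reproducing the mismatch: the stray name has no matching
# price, so every pair after it is shifted by one.
names = ["Fluorite: EUR 90", "UNEXPECTED STRING", "Aragonite: EUR 60"]
prices = ["90", "60"]

pairs = list(zip(names, prices))
# "UNEXPECTED STRING" gets Aragonite's price; Aragonite is dropped entirely.
```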
It did work and no error appeared... but it assigned wrong prices to the wrong minerals (as I noticed)!!! How can I change the code so that it simply ignores these "weird" names and carries on with the list? Below is the whole code; it stops at URL5, as far as I remember, and gives this pop index error:
import requests
from bs4 import BeautifulSoup
import re

def collecter(URL):
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}
    soup = BeautifulSoup(requests.get(URL, headers=headers).text, "lxml")
    names = [n.getText(strip=True) for n in soup.select("table tr td font a")]
    prices = [
        p.getText(strip=True).split("Price:")[-1] for p
        in soup.select("table tr td font font")
    ]
    names[:] = [" ".join(n.split()) for n in names if not n.startswith("[")]
    prices[:] = [p for p in prices if p]
    with open("Minerals.txt", "a+", encoding='utf-8') as file:
        for name, price in zip(names, prices):
            # print(f"{name}\n{price}")
            # print("-" * 50)
            filename = str(name)+" "+str(price)+"\n"
            split1 = filename.split(' / ')
            cutted1 = split1.pop(0)
            split2 = cutted1.split(": ")
            try:
                cutted2 = split2.pop(1)
            except IndexError:
                continue
            two_prices = cutted2+" "+split1.pop(0)+"\n"
            file.write(two_prices)
URL1 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=0"
URL2 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=25"
URL3 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=50"
URL4 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=75"
URL5 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=100"
URL6 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=125"
URL7 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=150"
URL8 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=175"
URL9 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=200"
URL10 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=225"
URL11 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=250"
URL12 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=275"
URL13 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=300"
URL14 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=325"
collecter(URL1)
collecter(URL2)
collecter(URL3)
collecter(URL4)
collecter(URL5)
collecter(URL6)
collecter(URL7)
collecter(URL8)
collecter(URL9)
collecter(URL10)
collecter(URL11)
collecter(URL12)
collecter(URL13)
collecter(URL14)
EDIT: here is the full working code below, thanks for the help guys!
import requests
from bs4 import BeautifulSoup
import re

for URL in range(0, 2569, 25):
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}
    soup = BeautifulSoup(requests.get(f'https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First={URL}', headers=headers).text, "lxml")
    names = [n.getText(strip=True) for n in soup.select("table tr td font>a")]
    prices = [p.getText(strip=True).split("Price:")[-1] for p in soup.select("table tr td font>font")]
    names[:] = [" ".join(n.split()) for n in names if not n.startswith("[")]
    prices[:] = [p for p in prices if p]
    with open("MineralsList.txt", "a+", encoding='utf-8') as file:
        for name, price in zip(names, prices):
            # print(f"{name}\n{price}")
            # print("-" * 50)
            filename = str(name)+" "+str(price)+"\n"
            split1 = filename.split(' / ')
            cutted1 = split1.pop(0)
            split2 = cutted1.split(": ")
            cutted2 = split2.pop(1)
            try:
                two_prices = cutted2+" "+split1.pop(0)+"\n"
            except IndexError:
                two_prices = cutted2+"\n"
            file.write(two_prices)
But after some changes it stops on a new error: it can't find the string by the given attribute, so "IndexError: pop from empty list" appears... even soup.select("table tr td font>font") doesn't help, the way it did for "names".
2 Answers
wrrgggsh 1#
You just need to make the CSS selector a bit more specific, so that it only matches links directly inside a font element (rather than several levels down):
Adding a further criterion that the link must point to an individual item, rather than to the next/previous page links at the bottom of the page, would also help:
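For illustration, the difference between the descendant selector `font a` and the child selector `font>a` (the one the working code in the question's edit ends up using) can be seen on a made-up HTML snippet:

```python
# Hypothetical minimal demo of descendant vs. child combinator,
# using an inline HTML snippet rather than the live site.
from bs4 import BeautifulSoup

html = """
<table><tr><td>
  <font><a href="/item1">Fluorite</a></font>
  <font><i><a href="/page2">[Next page]</a></i></font>
</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# descendant combinator: matches links at any depth, including nav links
all_links = [a.getText() for a in soup.select("table tr td font a")]

# child combinator: only links that are direct children of <font>
direct_links = [a.getText() for a in soup.select("table tr td font>a")]
```

With this snippet, `all_links` picks up both the item and the navigation link, while `direct_links` keeps only the item.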
p8h8hvxi 2#
You can try the following example, which also walks through the pagination:
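In that spirit, the question's fourteen near-identical URL assignments can be generated from the `First=` offset pattern instead of being written out by hand; `page_urls` is a hypothetical helper name:

```python
def page_urls(total, page_size=25):
    """Build the search-result URLs for offsets 0, 25, 50, ... < total.

    The URL template and the page size of 25 come from the question;
    only the First= offset changes between pages.
    """
    base = ("https://www.fabreminerals.com/search_results.php?LANG=EN"
            "&SearchTerms=&submit=Buscar&MineralSpeciment=&Country="
            "&Locality=&PriceRange=&checkbox=enventa&First={}")
    return [base.format(offset) for offset in range(0, total, page_size)]

# 14 pages of 25 results, matching URL1..URL14 in the question
urls = page_urls(350)
```

Each generated URL can then be passed to `collecter` in a loop, which is essentially what the `for URL in range(0, 2569, 25)` loop in the question's edit does inline.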