Selenium scraping issue, scraping incorrect data

Asked by 9avjhtql on 2022-11-29 · 1 answer

I am trying to scrape data from: https://www.canadapharmacy.com/
Below are a few of the pages I need to scrape:
https://www.canadapharmacy.com/products/abilify-tablet
https://www.canadapharmacy.com/products/accolate
https://www.canadapharmacy.com/products/abilify-mt
I need all of the information from each page. I wrote the code below.
Using BeautifulSoup:

base_url = 'https://www.canadapharmacy.com'
data = []
for i in tqdm(range(len(test))):
    r = requests.get(base_url+test[i])
    
    soup = BeautifulSoup(r.text,'lxml')
    # Scraping medicine Name
    try:
        main_name = (soup.find('h1',{"class":"mn"}).text.lstrip()).rstrip()
    except:
        main_name = None
    
    try:
        sec_name = ((soup.find('div',{"class":"product-name"}).find('h3').text.lstrip()).rstrip()).replace('\n','')
    except:
        sec_name = None
    
    try:
        generic_name = (soup.find('div',{"class":"card product generic strength equal"}).find('div').find('h3').text.lstrip()).rstrip()
    except:
        generic_name = None
        
    # Description
    
    card = ''.join([x.get_text(' ',strip=True) for x in soup.select('div.answer.expanded')])

    try:
        des = card.split('Directions')[0].replace('Description','')
    except:
        des = None
    
    try:
        drc = card.split('Directions')[1].split('Ingredients')[0]
    except:
        drc = None
        
    try:
        ingre= card.split('Directions')[1].split('Ingredients')[1].split('Cautions')[0]
    except:
        ingre = None
    
    try:
        cau=card.split('Directions')[1].split('Ingredients')[1].split('Cautions')[1].split('Side Effects')[0]
    except:
        cau = None
    try:
        se= card.split('Directions')[1].split('Ingredients')[1].split('Cautions')[1].split('Side Effects')[1]
    except: 
        se = None

    for j in soup.find('div',{"class":"answer expanded"}).find_all('h4'):
        if 'Product Code' in j.text:
            prod_code = j.text
        
    #prod_code = soup.find('div',{"class":"answer expanded"}).find_all('h4')[5].text #//div[@class='answer expanded']//h4
    
    pharma = {"primary_name":main_name,
            "secondary_name":sec_name,
            "Generic_Name":generic_name,
            'Description':des,
            'Directions':drc,
            'Ingredients':ingre,
            'Cautions':cau,
            'Side Effects':se,
            "Product_Code":prod_code}
    
    data.append(pharma)

Using Selenium:

main_name = []
sec_name = []
generic_name = []
strength = []
quantity = []
desc = []
direc = []
ingre = []
cau = []
side_effect = []
prod_code = []

for i in tqdm(range(len(test_url))):
    card = []
    driver.get(base_url+test_url[i])
    time.sleep(1)

    try:
        main_name.append(driver.find_element(By.XPATH,"//div[@class='card product brand strength equal']//h3").text)
    except:
        main_name.append(None)

    try:
        sec_name.append(driver.find_element(By.XPATH,"//div[@class='card product generic strength equal']//h3").text)
    except:
        sec_name.append(None)

    try:
        generic_name.append(driver.find_element(By.XPATH,"//div[@class='card product generic strength equal']//h3").text)
    except:
        generic_name.append(None)

    try:
        for i in driver.find_elements(By.XPATH,"//div[@class='product-content']//div[@class='product-select']//form"):
            strength.append(i.text)

    except:
        strength.append(None)

#     try:
#         for i in driver.find_elements(By.XPATH,"//div[@class='product-select']//form//div[@class='product-select-options'][2]"):
#             quantity.append(i.text)
#     except:
#         quantity.append(None)

    card.append(driver.find_element(By.XPATH,"//div[@class='answer expanded']").text)

    try:
        desc.append(card[0].split('Directions')[0].replace('Description',''))
    except:
        desc.append(None)

    try:
        direc.append(card[0].split('Directions')[1].split('Ingredients')[0])
    except:
        direc.append(None)

    try:
        ingre.append(card[0].split('Directions')[1].split('Ingredients')[1].split('Cautions')[0])
    except:
        ingre.append(None)

    try:
        cau.append(card[0].split('Directions')[1].split('Ingredients')[1].split('Cautions')[1].split('Side Effects')[0])
    except:
        cau.append(None)
    try:
        #side_effect.append(card.split('Directions')[1].split('Ingredients')[1].split('Cautions')[1].split('Side Effects')[1])
        side_effect.append(card[0].split('Cautions')[1].split('Side Effects')[1])
    except: 
        side_effect.append(None)

    for j in driver.find_elements(By.XPATH,"//div[@class='answer expanded']//h4"):
        if 'Product Code' in j.text:
            prod_code.append(j.text)

I can scrape the data from the pages, but I ran into a problem scraping the strength and quantity boxes. I want to write the code so that these are scraped separately for each medicine and converted into a dataframe with columns such as 2mg, 5mg, 10mg, 30 tablets, 90 tablets, showing the price.
I tried this code:

medicine_name1 = []
medicine_name2 = []
strength = []
quantity = []

for i in tqdm(range(len(test_url))):
    driver.get(base_url+test_url[i])
    time.sleep(1)
    
    try:
        name1 = driver.find_element(By.XPATH,"//div[@class='card product brand strength equal']//h3").text
    except:
        name1 = None
        
    try:
        name2 = driver.find_element(By.XPATH,"//div[@class='card product generic strength equal']//h3").text
    except:
        name2 = None
        
    try:
        for i in driver.find_elements(By.XPATH,"//div[@class='product-select']//form//div[@class='product-select-options'][1]"):
            strength.append(i.text)
            medicine_name1.append(name1)
            medicine_name2.append(name2)
    except:
        strength.append(None)
        
    try:
        for i in driver.find_elements(By.XPATH,"//div[@class='product-select']//form//div[@class='product-select-options'][2]"):
            quantity.append(i.text)
    except:
        quantity.append(None)

It works, but I am still getting duplicated values for the medicines. Could someone please take a look?

Answer (by lf5gs5x2):

    • Note: it is usually more reliable to build one list of dictionaries, rather than separate parallel lists the way the Selenium version does.
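A quick illustration of that note, with made-up page data: when a field is missing on one page, a dict row just gets `None` for that key, while separate parallel lists silently drift out of sync.

```python
# Made-up stand-in for scraped pages; 'accolate' has no strength element.
pages = [
    {'name': 'abilify-tablet', 'strength': '2mg'},
    {'name': 'accolate'},
]

rows = []
for page in pages:
    rows.append({
        'name': page['name'],
        'strength': page.get('strength'),  # None when absent; row stays aligned
    })
```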

Without a sample/mock-up of the output you want I can't be sure this is the exact format you're after, but I would suggest a requests + bs4 solution like the one below (run on the 3 links you included as examples):

# import requests
# from bs4 import BeautifulSoup

rootUrl = 'https://www.canadapharmacy.com'
prodList = ['abilify-tablet', 'accolate', 'abilify-mt']
priceList = []
for prod in prodList:
    prodUrl = f'{rootUrl}/products/{prod}'
    print('', end=f'Scraping {prodUrl} ')
    resp = requests.get(prodUrl)
    if resp.status_code != 200:
        print(f'[{resp.status_code}] failed to get {prodUrl}')
        continue
    pSoup = BeautifulSoup(resp.content, 'html.parser')

    pNameSel = 'div.product-name > h3'
    for pv in pSoup.select(f'div > div.card.product:has({pNameSel})'):
        pName = pv.select_one(pNameSel).get_text('\n').strip().split('\n')[0] 
        pDet = {'product_endpt': prod, 'product_name': pName.strip()}

        brgen = pv.select_one('div.badge-container > div.badge')
        if brgen: pDet['brand_or_generic'] = brgen.get_text(' ').strip()
        rxReq = pv.select_one(f'{pNameSel} p.mn')
        if rxReq: pDet['rx_requirement'] = rxReq.get_text(' ').strip()

        mgSel = 'div.product-select-options'
        opSel = 'option[value]:not([value=""])'
        opSel = f'{mgSel} + {mgSel}  select[name="productsizeId"] {opSel}'
        for pvRow in pv.select(f'div.product-select-options-row:has({opSel})'):
            pvrStrength = pvRow.select_one(mgSel).get_text(' ').strip()

            pDet[pvrStrength] = ', '.join([
                pvOp.get_text(' ').strip() for pvOp in pvRow.select(opSel)
            ])                 

        pDet['source_url'] = prodUrl
        priceList.append(pDet)
    print(f' [total {len(priceList)} product prices]')

Then display it as a table:

# import pandas

pricesDf = pandas.DataFrame(priceList).set_index('product_name')
colOrder = sorted(pricesDf.columns, key=lambda c: c == 'source_url')
pricesDf = pricesDf[colOrder] # (just to push 'source_url' to the end)
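The `colOrder` line relies on a small trick worth spelling out: the key function maps every column name to False (0) except `'source_url'` (True = 1), and since Python's sort is stable, all other columns keep their original order while `'source_url'` sinks to the end. A self-contained demo with sample column names:

```python
# Sample column names (made up for the demo); the stable sort keeps
# their relative order and only moves 'source_url' to the end.
cols = ['source_url', 'product_endpt', 'product_name', '2mg - 30 tablets']
ordered = sorted(cols, key=lambda c: c == 'source_url')
```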

And if you remove

pDet[pvrStrength] = ', '.join([
                pvOp.get_text(' ').strip() for pvOp in pvRow.select(opSel)
            ])

and replace it with this loop:

for pvoi, pvOp in enumerate(pvRow.select(opSel)):  
                pvoTxt = pvOp.get_text(' ').strip()
                tabletCt = pvoTxt.split(' - ')[0]
                pvoPrice = pvoTxt.split(' - ')[-1]
                if not tabletCt.endswith(' tablets'): 
                    tabletCt = f'[option {pvoi + 1}]'    
                    pvoPrice = pvoTxt
                
                pDet[f'{pvrStrength} - {tabletCt}'] = pvoPrice
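The parsing step of that loop can be pulled out into a standalone function (`parse_option` is a made-up helper name, not from the answer's code): option text such as "30 tablets - $219.99" is split into a count and a price, and anything that doesn't end in " tablets" (e.g. the 1mg/ml solution) is kept whole under a positional "[option N]" label.

```python
def parse_option(pvo_txt, position):
    # Split "30 tablets - $219.99" into a tablet count and a price
    tablet_ct = pvo_txt.split(' - ')[0]
    pvo_price = pvo_txt.split(' - ')[-1]
    if not tablet_ct.endswith(' tablets'):
        # Non-tablet options keep the full text under a positional label
        return f'[option {position}]', pvo_txt
    return tablet_ct, pvo_price
```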

| index | Abilify (Aripiprazole) | Generic Equivalent - Abilify (Aripiprazole) | Generic Equivalent - Accolate (Zafirlukast) | Abilify ODT (Aripiprazole) | Generic Equivalent - Abilify ODT (Aripiprazole) |
| --- | --- | --- | --- | --- | --- |
| product_endpt | abilify-tablet | abilify-tablet | accolate | abilify-mt | abilify-mt |
| brand_or_generic | Brand | Generic | Generic | Brand | Generic |
| rx_requirement | Prescription Required | nan | nan | Prescription Required | nan |
| 2mg - 30 tablets | $219.99 | nan | nan | nan | nan |
| 2mg - 90 tablets | $526.99 | nan | nan | nan | nan |
| 5mg - 28 tablets | $160.99 | nan | nan | nan | nan |
| 5mg - 84 tablets | $459.99 | nan | nan | nan | nan |
| 10mg - 28 tablets | $116.99 | nan | nan | nan | nan |
| 10mg - 84 tablets | $162.99 | nan | nan | nan | nan |
| 15mg - 28 tablets | $159.99 | nan | nan | nan | nan |
| 15mg - 84 tablets | $198.99 | nan | nan | nan | nan |
| 20mg - 90 tablets | $745.99 | $67.99 | nan | nan | nan |
| 30mg - 28 tablets | $104.99 | nan | nan | nan | nan |
| 30mg - 84 tablets | $289.99 | $75.99 | nan | nan | nan |
| 1mg/ml Solution - [option 1] | 150ml - $239.99 | nan | nan | nan | nan |
| 2mg - 100 tablets | nan | $98.99 | nan | nan | nan |
| 5mg - 100 tablets | nan | $43.99 | nan | nan | nan |
| 10mg - 90 tablets | nan | $38.59 | nan | nan | nan |
| 15mg - 90 tablets | nan | $56.59 | nan | nan | nan |
| 10mg - 60 tablets | nan | nan | $109.00 | nan | nan |
| 20mg - 60 tablets | nan | nan | $109.00 | nan | nan |
| 10mg ODT - 84 tablets | nan | nan | nan | $499.99 | nan |
| 15mg ODT - 84 tablets | nan | nan | nan | $499.99 | nan |
| 5mg ODT - 90 tablets | nan | nan | nan | nan | $59.00 |
| 20mg ODT - 90 tablets | nan | nan | nan | nan | $89.00 |
| 30mg ODT - 150 tablets | nan | nan | nan | nan | $129.99 |
| source_url | https://www.canadapharmacy.com/products/abilify-tablet | https://www.canadapharmacy.com/products/abilify-tablet | https://www.canadapharmacy.com/products/accolate | https://www.canadapharmacy.com/products/abilify-mt | https://www.canadapharmacy.com/products/abilify-mt |
(I transposed the table because there are too many columns and too few rows; the table markdown can be copied from the output of print(pricesDf.T.to_markdown()).)
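A small stand-in for `priceList` (two made-up rows) shows what that transpose does: after `.T`, each product becomes one column, with its fields running down the index.

```python
import pandas

# Made-up rows standing in for the scraped priceList
priceList = [
    {'product_name': 'Abilify (Aripiprazole)', 'brand_or_generic': 'Brand'},
    {'product_name': 'Accolate (Zafirlukast)', 'brand_or_generic': 'Generic'},
]
pricesDf = pandas.DataFrame(priceList).set_index('product_name')
transposed = pricesDf.T  # products are now the columns
```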
