pandas: How to scrape images from a Shopify website

nbewdwxp  posted on 2023-01-19 in Other

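I am trying to extract image links from TXT files of exported HTML, mostly from Shopify sites. Most img tags on Shopify sites share the same structure. For some reason I cannot scrape the image links. I only need the first link.
Below is a sample of the HTML markup.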

<div class="grid-product__content"><a class="grid-product__link" href="/products/ayla-ring-gold">
<div class="grid-product__image-mask"><div class="grid__image-ratio grid__image-ratio--square">
<img alt="Ayla Ring | Gold - Alexa Kelley" class="lazyload grid__image-contain" data-aspectratio="1.0" data-sizes="auto" data-src="//cdn.shopify.com/s/files/1/1351/4197/products/Ayla_Ring_Gold_Hero_{width}x.jpg?v=1660506192" data-widths="[360, 540, 720, 900, 1080]"/>
</div><div class="grid-product__secondary-image small--hide"><img alt="Ayla Ring | Gold - Alexa Kelley" class="lazyload" data-aspectratio="1.0" data-sizes="auto" data-src="//cdn.shopify.com/s/files/1/1351/4197/products/Ayla_Ring_Gold_2_{width}x.jpg?v=1660506192" data-widths="[360, 540, 720, 1000]"/>
</div></div>
<div class="grid-product__meta">
<div class="grid-product__title grid-product__title--body">Ayla Ring | Gold</div><div class="grid-product__price"><span class="money">$85.00 USD</span>

The error returned is "AttributeError: 'NoneType' object has no attribute 'get'". I know what the error means; I just don't know how to get the link.
Here is my code...

baseurl = ('https://alexakelley.com')
protocol = ('https:')

dataset = []

with open(r'/run/user/759001103/gvfs/smb-share:server=192.168.0.150,share=indexserver/Country/USA/A/Alexakelley/alexakelley2.txt', "r") as f:

    soup = BeautifulSoup(f.read(), "html.parser")
for e in soup.find('div', class_='grid grid--uniform'):
        dataset.append({
            'Field_01':protocol + e.find('img', class_='grid__image-contain lazyautosizes lazyloaded').get('data-srcset'),
            'Field_02':e.find('div', class_='grid-product__title grid-product__title--body').get_text(strip=True),
            'Field_03':baseurl + e.find('a', class_='grid-product__link').get('href'),
            'Field_04':e.find('span', class_='money').get_text(strip=True)
        })
        df = pd.DataFrame(dataset).to_csv(r'/run/user/759001103/gvfs/smb-share:server=192.168.0.150,share=indexserver/Country/USA/A/Alexakelley/Alexakelley All.csv', index = False)
        print(dataset)

If I leave out Field_01, Field_02 through Field_04 return results, so the rest of my code works. How should I handle the Field_01 line?


Answer #1 (f4t66c6m)

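I changed the approach: fetch all elements that match the classes you identified (each call returns a ResultSet), then loop over the items and read the element at the same position from each ResultSet to build the dataset list.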

Note: in my test I found 23 items, but the last result set (the spans with class "money") has 24 elements, so take this inconsistency into account.

Here is the modified code:

import pandas as pd

# `soup` is the BeautifulSoup object built from the file in the question.
# Your list:
dataset = []

# All the result sets (i.e. the HTML elements needed for this task):
images = soup.find_all("img", class_="grid__image-contain")
divs_field_2 = soup.find_all("div", class_="grid-product__title grid-product__title--body")
a_field_3 = soup.find_all("a", class_="grid-product__link")
span_field_4 = soup.find_all("span", class_="money")

# Loop over the images and read the element at the same position from each result set:
for indx, item in enumerate(images):
    dataset.append({
        'Field_01': "https:" + item["data-src"],
        'Field_02': divs_field_2[indx].get_text(strip=True),
        'Field_03': a_field_3[indx]['href'],
        'Field_04': span_field_4[indx].get_text(strip=True)
    })

# Create and display the dataframe (display() exists in Jupyter/IPython; use print(df) elsewhere):
df = pd.DataFrame(dataset)
display(df)

Result:
| index | Field_01 | Field_02 | Field_03 | Field_04 |
| --- | --- | --- | --- | --- |
| 0 | https://cdn.shopify.com/s/files/1/1351/4197/products/Ayla_Ring_Gold_Hero_{width}x.jpg?v=1660506192 | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 1 | (file name garbled in source; query string ?v=1660506245) | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 2 | https://cdn.shopify.com/s/files/1/1351/4197/products/Ayla_Ring_SilverY_Hero_5ee0a758-b4f7-471d-bcda-86a556cbc3d7_{width}x.jpg?v=1661231526 | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 3 | https://cdn.shopify.com/s/files/1/1351/4197/products/Noemie_Ring_Gold_Hero_{width}x.jpg?v=1660506641 | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 4 | https://cdn.shopify.com/s/files/1/1351/4197/products/Carolina_Ring_Gold_Hero_{width}x.jpg?v=1660506629 | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 5 | https://cdn.shopify.com/s/files/1/1351/4197/products/gisele_ring_gold_hero_0b928b0a-542c-4ce5-bad5-dc535935a12f_{width}x.jpg?v=1670545956 | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 6 | https://cdn.shopify.com/s/files/1/1351/4197/products/Vera_Ring_Hero_{width}x.jpg?v=1670543130 | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 7 | https://cdn.shopify.com/s/files/1/1351/4197/products/Gisele_Ring_Silver_Hero_01fa927d-73d4-4236-87b2-28881347cd7f_{width}x.jpg?v=1670545866 | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 8 | https://cdn.shopify.com/s/files/1/1351/4197/products/Elise_Ring_Gold_Hero_{width}x.jpg?v=1670544918 | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 9 | (file name garbled in source; query string ?v=1670544859) | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 10 | (file name garbled in source; query string ?v=1660504898) | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 11 | https://cdn.shopify.com/s/files/1/1351/4197/products/Emeri_Gold_Hero_{width}x.jpg?v=1660504754 | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 12 | https://cdn.shopify.com/s/files/1/1351/4197/products/Adele_Ring_Gold_Hero_{width}x.jpg?v=1660504305 | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 13 | https://cdn.shopify.com/s/files/1/1351/4197/products/Maia_Ring_Gold_Hero_{width}x.jpg?v=1660505010 | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 14 | https://cdn.shopify.com/s/files/1/1351/4197/products/Fiona_Ring_GC_Hero_{width}x.jpg?v=1660504768 | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 15 | https://cdn.shopify.com/s/files/1/1351/4197/products/20210625_ALEXA.KELLEY_AKNG0003_0480copy_e2344c32-c893-4df9-9678-54a2d1e0f007_{width}x.jpg?v=1633294772 | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 16 | https://cdn.shopify.com/s/files/1/1351/4197/products/Alaia_Bracelet_Gold_4mm_{width}x.jpg?v=1660504482 | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 17 | https://cdn.shopify.com/s/files/1/1351/4197/products/20210625_ALEXA.KELLEY_AKNS0004_0479copy_0bdc78af-abc4-4867-bc22-1f9c517d520e_{width}x.jpg?v=1633294849 | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 18 | https://cdn.shopify.com/s/files/1/1351/4197/products/Alaia_Bracelet_Silver_4mm_{width}x.jpg?v=1660504542 | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 19 | (file name garbled in source; query string ?v=1660504875) | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 20 | https://cdn.shopify.com/s/files/1/1351/4197/products/Noemie_Ring_Silver_Hero_{width}x.jpg?v=1660505177 | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 21 | (URL not recoverable from source) | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |
| 22 | https://cdn.shopify.com/s/files/1/1351/4197/products/Natalie_Ring_Silver_Hero_{width}x.jpg?v=1660505083 | Natalie Ring \| Silver | /products/natalie-ring-silver | $80.00 USD |

Field_02 to Field_04 repeat the same product in every row of this captured output, which is what results when the other ResultSets are read at one fixed position rather than at the loop index `indx`.
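
The note about 23 images versus 24 "money" spans suggests that parallel lists can drift out of step. One possible way around that, sketched here under assumptions rather than taken from the answer, is to walk each product card and look up its fields inside that card, so a missing element yields `None` for that row only. The sketch assumes every card is wrapped in `div.grid-product__content` as in the question's sample, and that the `{width}` placeholder in `data-src` can be filled with one of the values listed in `data-widths`. The helper names `parse_cards` and `text_or_none`, the default width of 720, and the second "no image" card are hypothetical and exist only for illustration.

```python
from bs4 import BeautifulSoup

# First card: trimmed copy of the markup from the question.
# Second card: invented for illustration, to show a card that has no image.
HTML = """
<div class="grid-product__content"><a class="grid-product__link" href="/products/ayla-ring-gold">
<div class="grid-product__image-mask"><div class="grid__image-ratio grid__image-ratio--square">
<img alt="Ayla Ring | Gold - Alexa Kelley" class="lazyload grid__image-contain" data-src="//cdn.shopify.com/s/files/1/1351/4197/products/Ayla_Ring_Gold_Hero_{width}x.jpg?v=1660506192" data-widths="[360, 540, 720, 900, 1080]"/>
</div></div>
<div class="grid-product__meta">
<div class="grid-product__title grid-product__title--body">Ayla Ring | Gold</div>
<div class="grid-product__price"><span class="money">$85.00 USD</span></div>
</div></a></div>
<div class="grid-product__content"><a class="grid-product__link" href="/products/no-image">
<div class="grid-product__title grid-product__title--body">No Image</div>
<span class="money">$1.00 USD</span>
</a></div>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_cards(html, base_url="https://alexakelley.com", width=720):
    """Build one row per product card; fields missing from a card become None."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select("div.grid-product__content"):
        # select_one returns the first match inside this card only.
        img = card.select_one("img.grid__image-contain")
        link = card.select_one("a.grid-product__link")
        src = img.get("data-src") if img is not None else None
        href = link.get("href") if link is not None else None
        rows.append({
            # data-src is protocol-relative and carries a {width} placeholder.
            "Field_01": ("https:" + src.replace("{width}", str(width))) if src else None,
            "Field_02": text_or_none(card.select_one("div.grid-product__title")),
            "Field_03": (base_url + href) if href else None,
            "Field_04": text_or_none(card.select_one("span.money")),
        })
    return rows


rows = parse_cards(HTML)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` in the same way as `dataset` in the answer. Because each lookup is scoped to one card, an extra price span elsewhere on the page does not shift the other columns.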
