How do I extract a website's URLs with Scrapy?

col17t5w  posted on 2022-11-09  in  Other

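I am trying to scrape the Amazon website with Scrapy. I can scrape fields such as the product title and price, but I can't work out how to extract the product URL (marked in the picture at the bottom). My parse function currently looks like this: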

def parse(self, response):

        items = BigItem()

        all_boxes = response.css('.s-widget-spacing-small > .sg-col-inner')
        for boxes in all_boxes:
            name = boxes.css('.s-link-style .a-text-normal').css('::text').extract()
            author = boxes.css('.a-color-secondary .a-size-base:nth-child(2)').css('::text').extract()
            price = boxes.css('.s-price-instructions-style .a-price-whole').css('::text').extract()
            imagelink = boxes.css('.s-image::attr(src)').extract()
            rating = boxes.css('.a-spacing-top-small .aok-align-bottom').css('::text').extract()
            valuation = boxes.css('.a-spacing-top-small .s-link-style .s-underline-text').css('::text').extract()
            link = boxes.css('a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal::attr(href)').extract()

            items['name'] = name
            items['author'] = author
            items['price'] = price
            items['imagelink'] = imagelink
            items['rating'] = rating
            items['valuation'] = valuation
            items['link'] = link

            yield items

I also tried extracting it as ::text, and by chaining .css(::text) and .css(::href) on the outside, but neither works.

yws3nbqq


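Use the .get() method (the newer name for .extract_first()), which returns the first match as a string instead of a list. In the CSS selector, chain the class names with dots instead of separating them with spaces:

link = boxes.css('.a-link-normal.s-underline-text.s-underline-link-text.s-link-style.a-text-normal::attr(href)').get()

items['link'] = 'https://www.amazon.de' + link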

Update (full working code):

from scrapy.crawler import CrawlerProcess
import scrapy

class Test2Spider(scrapy.Spider):
    name = 'test2'
    start_urls = ['https://www.amazon.de/s?k=b%C3%BCcher+bestseller+2022&crid=PYX2JNBU03IA&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&ref=nb_sb_ss_ts-doa-p_1_22']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
    }

    def parse(self, response):
        # Each search result sits in its own box.
        all_boxes = response.css('.s-widget-spacing-small > .sg-col-inner')
        for boxes in all_boxes:
            # .get() returns the first match as a string, or None if nothing matched.
            name = boxes.css('.s-link-style .a-text-normal').css('::text').get()
            author = boxes.css('.a-color-secondary .a-size-base:nth-child(2)').css('::text').get()
            price = boxes.css('.s-price-instructions-style .a-price-whole').css('::text').get()
            imagelink = boxes.css('.s-image::attr(src)').get()
            rating = boxes.css('.a-spacing-top-small .aok-align-bottom').css('::text').get()
            valuation = boxes.css('.a-spacing-top-small .s-link-style .s-underline-text').css('::text').get()
            # Match the exact class attribute of the title link and read its href.
            link = boxes.xpath('.//*[@class="a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal"]/@href').get()

            # Build a fresh dict per result so yielded items do not share state.
            items = {}
            items['name'] = name
            items['author'] = author
            items['price'] = price
            items['imagelink'] = imagelink
            items['rating'] = rating
            items['valuation'] = valuation
            # The href is relative, so prepend the site root.
            items['link'] = 'https://www.amazon.de' + link

            yield items

if __name__ == "__main__":
    # CrawlerProcess() takes optional settings; the spider class is passed to crawl().
    process = CrawlerProcess()
    process.crawl(Test2Spider)
    process.start()
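
One caveat with concatenating the site root: .get() returns None when a box has no matching link, and adding a string to None raises a TypeError. A minimal sketch of a more defensive variant, using only the standard library, is shown below; the helper name absolute_link and the BASE_URL constant are hypothetical and not part of the code above.

```python
from urllib.parse import urljoin

# Same site root as in the spider above.
BASE_URL = 'https://www.amazon.de'


def absolute_link(href):
    """Return an absolute URL for a relative href, or None if nothing was extracted."""
    if not href:
        return None
    return urljoin(BASE_URL, href)


print(absolute_link('/-/en/Charlotte-Link-ebook/dp/B00NS3GECO/'))
# https://www.amazon.de/-/en/Charlotte-Link-ebook/dp/B00NS3GECO/
print(absolute_link(None))
# None
```

Inside a spider callback, response.urljoin(href) resolves a relative href against the URL of the current page, so the site root does not have to be hard-coded; the None check is still needed.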

Output:

{'name': 'Die letzte Spur: Kriminalroman', 'author': 'Charlotte Link', 'price': '10.99', 'imagelink': 'https://m.media-amazon.com/images/I/81DOyi3pH6L._AC_UY218_.jpg', 'rating': '4.3 out of 5 stars', 'valuation': '1,799', 'link': 'https://www.amazon.de/-/en/Charlotte-Link-ebook/dp/B00NS3GECO/ref=sr_1_3?crid=PYX2JNBU03IA&keywords=b%C3%BCcher+bestseller+2022&qid=1655756172&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&sr=8-3'}
2022-06-21 02:16:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.de/s?k=b%C3%BCcher+bestseller+2022&crid=PYX2JNBU03IA&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&ref=nb_sb_ss_ts-doa-p_1_22>
{'name': 'Tabata: Fit in 4 Minuten - Effektiv Muskeln aufbauen, Fett verbrennen und Stoffwechsel beschleunigen ohne Geräte - mit bebilderten Übungen!', 'author': 'Samira Eger', 'price': '13.90', 'imagelink': 'https://m.media-amazon.com/images/I/71XxrUy+DJL._AC_UY218_.jpg', 'rating': '4.3 out of 5 stars', 'valuation': '402', 'link': 'https://www.amazon.de/-/en/Samira-Eger/dp/1097325695/ref=sr_1_4?crid=PYX2JNBU03IA&keywords=b%C3%BCcher+bestseller+2022&qid=1655756172&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&sr=8-4'}
2022-06-21 02:16:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.de/s?k=b%C3%BCcher+bestseller+2022&crid=PYX2JNBU03IA&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&ref=nb_sb_ss_ts-doa-p_1_22>
{'name': 'Der Geschichtenbäcker: Roman | Nach "Der Buchspazierer": der berührende neue Bestseller über die Kunst sich selbst zu lieben, wie man ist', 'author': 'Carsten Henn', 'price': '15.00', 'imagelink': 'https://m.media-amazon.com/images/I/811sGXri0DL._AC_UY218_.jpg', 'rating': '4.3 out of 5 stars', 'valuation': '838', 'link': 'https://www.amazon.de/-/en/Carsten-Henn/dp/3492071341/ref=sr_1_5?crid=PYX2JNBU03IA&keywords=b%C3%BCcher+bestseller+2022&qid=1655756172&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&sr=8-5'}
2022-06-21 02:16:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.de/s?k=b%C3%BCcher+bestseller+2022&crid=PYX2JNBU03IA&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&ref=nb_sb_ss_ts-doa-p_1_22>
{'name': 'Die Vorgängerin: Thriller', 'author': 'Jess Ryder', 'price': '12.99', 'imagelink': 'https://m.media-amazon.com/images/I/616p-WzfzML._AC_UY218_.jpg', 'rating': '4.3 out of 5 stars', 'valuation': '1,388', 'link': 'https://www.amazon.de/-/en/Jess-Ryder/dp/1803142863/ref=sr_1_6?crid=PYX2JNBU03IA&keywords=b%C3%BCcher+bestseller+2022&qid=1655756172&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&sr=8-6'}
2022-06-21 02:16:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.de/s?k=b%C3%BCcher+bestseller+2022&crid=PYX2JNBU03IA&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&ref=nb_sb_ss_ts-doa-p_1_22>
{'name': 'Sommersprossen – Nur zusammen ergeben wir Sinn: Die mitreißende Roman-Neuerscheinung der SPIEGEL Bestseller Autorin', 'author': 'Cecelia Ahern', 'price': '20.00', 'imagelink': 'https://m.media-amazon.com/images/I/71pSKcguZkL._AC_UY218_.jpg', 'rating': '4.1 out of 5 stars', 'valuation': '967', 'link': 'https://www.amazon.de/-/en/Cecelia-Ahern/dp/381053045X/ref=sr_1_7?crid=PYX2JNBU03IA&keywords=b%C3%BCcher+bestseller+2022&qid=1655756172&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&sr=8-7'}
2022-06-21 02:16:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.de/s?k=b%C3%BCcher+bestseller+2022&crid=PYX2JNBU03IA&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&ref=nb_sb_ss_ts-doa-p_1_22>
{'name': 'Stöckelschuh oder Gummistiefel: Ein Sylt-Roman', 'author': 'Anni Deckner', 'price': '13.99', 'imagelink': 'https://m.media-amazon.com/images/I/81S0QN3m+aL._AC_UY218_.jpg', 'rating': '4.4 out of 5 stars', 'valuation': '401', 'link': 'https://www.amazon.de/-/en/Anni-Deckner/dp/3967142000/ref=sr_1_8?crid=PYX2JNBU03IA&keywords=b%C3%BCcher+bestseller+2022&qid=1655756172&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&sr=8-8'}
2022-06-21 02:16:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.de/s?k=b%C3%BCcher+bestseller+2022&crid=PYX2JNBU03IA&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&ref=nb_sb_ss_ts-doa-p_1_22>
{'name': 'Nächte, in denen Sturm aufzieht', 'author': 'Jojo Moyes', 'price': '14.99', 'imagelink': 'https://m.media-amazon.com/images/I/71gJerg0w+L._AC_UY218_.jpg', 'rating': '4.4 out of 5 stars', 'valuation': '1,592', 'link': 'https://www.amazon.de/-/en/Jojo-Moyes-ebook/dp/B07HBTBCP7/ref=sr_1_9?crid=PYX2JNBU03IA&keywords=b%C3%BCcher+bestseller+2022&qid=1655756172&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&sr=8-9'}
2022-06-21 02:16:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.de/s?k=b%C3%BCcher+bestseller+2022&crid=PYX2JNBU03IA&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&ref=nb_sb_ss_ts-doa-p_1_22>
{'name': 'Die Gesichterwand', 'author': 'Marion Schreiner', 'price': '0.00', 'imagelink': 'https://m.media-amazon.com/images/I/81hSwUXXIFL._AC_UY218_.jpg', 'rating': '4.4 out of 5 stars', 'valuation': '234', 'link': 'https://www.amazon.de/-/en/gp/slredirect/picassoRedirect.html/ref=pa_sp_mtf_aps_sr_pg1_1?ie=UTF8&adId=A06547802FBE7URY7KRF0&url=%2FMarion-Schreiner-ebook%2Fdp%2FB07GZ4R6RJ%2Fref%3Dsr_1_10_sspa%3Fcrid%3DPYX2JNBU03IA%26keywords%3Db%25C3%25BCcher%2Bbestseller%2B2022%26qid%3D1655756172%26sprefix%3Db%25C3%25BCcher%2Bbestseller%2B2022%252Caps%252C848%26sr%3D8-10-spons%26psc%3D1&qualifier=1655756172&id=5287020744974994&widgetName=sp_mtf'}
2022-06-21 02:16:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.de/s?k=b%C3%BCcher+bestseller+2022&crid=PYX2JNBU03IA&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&ref=nb_sb_ss_ts-doa-p_1_22>
{'name': 'Der Kastanienmann: Thriller', 'author': 'Søren Sveistrup', 'price': '11.00', 'imagelink': 'https://m.media-amazon.com/images/I/811cg6QOvdS._AC_UY218_.jpg', 'rating': '4.5 out of 5 stars', 'valuation': '1,026', 'link': 'https://www.amazon.de/-/en/S%C3%B8ren-Sveistrup/dp/344249236X/ref=sr_1_11?crid=PYX2JNBU03IA&keywords=b%C3%BCcher+bestseller+2022&qid=1655756172&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&sr=8-11'}
2022-06-21 02:16:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.de/s?k=b%C3%BCcher+bestseller+2022&crid=PYX2JNBU03IA&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&ref=nb_sb_ss_ts-doa-p_1_22>
{'name': 'Fritz und Emma: Roman | Der Bestseller. Die schönste Liebesgeschichte des Jahres', 'author': 'Barbara Leciejewski', 'price': '14.99', 'imagelink': 'https://m.media-amazon.com/images/I/81FS-OFJR0L._AC_UY218_.jpg', 'rating': '4.3 out of 5 stars', 'valuation': '3,193', 'link': 'https://www.amazon.de/-/en/Barbara-Leciejewski/dp/3864931487/ref=sr_1_12?crid=PYX2JNBU03IA&keywords=b%C3%BCcher+bestseller+2022&qid=1655756172&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&sr=8-12'}
2022-06-21 02:16:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.de/s?k=b%C3%BCcher+bestseller+2022&crid=PYX2JNBU03IA&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&ref=nb_sb_ss_ts-doa-p_1_22>
{'name': 'Das Geheimnis des Kameliengartens: Liebesroman', 'author': 'Evelyn Kühne', 'price': '13.99', 'imagelink': 'https://m.media-amazon.com/images/I/91TF+6o2FQL._AC_UY218_.jpg', 'rating': '4.5 out of 5 stars', 'valuation': '298', 'link': 'https://www.amazon.de/-/en/Evelyn-K%C3%BChne/dp/3967141519/ref=sr_1_13?crid=PYX2JNBU03IA&keywords=b%C3%BCcher+bestseller+2022&qid=1655756172&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&sr=8-13'}
2022-06-21 02:16:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.de/s?k=b%C3%BCcher+bestseller+2022&crid=PYX2JNBU03IA&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&ref=nb_sb_ss_ts-doa-p_1_22>
{'name': 'Der Donnerstagsmordclub: Kriminalroman | Der Millionenerfolg aus England (Die Mordclub-Serie 1)', 'author': 'Richard Osman', 'price': '0.00', 'imagelink': 'https://m.media-amazon.com/images/I/81L7ZSM4MiS._AC_UY218_.jpg', 'rating': '4.1 out of 5 stars', 'valuation': '2,912', 'link': 'https://www.amazon.de/-/en/Richard-Osman-ebook/dp/B08NWCLGCV/ref=sr_1_14?crid=PYX2JNBU03IA&keywords=b%C3%BCcher+bestseller+2022&qid=1655756172&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&sr=8-14'}
2022-06-21 02:16:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.de/s?k=b%C3%BCcher+bestseller+2022&crid=PYX2JNBU03IA&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&ref=nb_sb_ss_ts-doa-p_1_22>
{'name': 'Lydias Haus (Kale Hatfield Story 2)', 'author': 'Marion Schreiner', 'price': '0.00', 'imagelink': 'https://m.media-amazon.com/images/I/91lMBinl6fL._AC_UY218_.jpg', 'rating': '4.3 out of 5 stars', 'valuation': '148', 'link': 'https://www.amazon.de/-/en/gp/slredirect/picassoRedirect.html/ref=pa_sp_mtf_aps_sr_pg1_1?ie=UTF8&adId=A09597722YG20CMXOJHTI&url=%2FMarion-Schreiner-ebook%2Fdp%2FB01CW1B2HE%2Fref%3Dsr_1_15_sspa%3Fcrid%3DPYX2JNBU03IA%26keywords%3Db%25C3%25BCcher%2Bbestseller%2B2022%26qid%3D1655756172%26sprefix%3Db%25C3%25BCcher%2Bbestseller%2B2022%252Caps%252C848%26sr%3D8-15-spons%26psc%3D1&qualifier=1655756172&id=5287020744974994&widgetName=sp_mtf'}
2022-06-21 02:16:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.de/s?k=b%C3%BCcher+bestseller+2022&crid=PYX2JNBU03IA&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&ref=nb_sb_ss_ts-doa-p_1_22>
{'name': 'Die Macht der Geographie im 21. Jahrhundert: 10 Karten erklären die Politik von heute und die Krisen der Zukunft', 'author': 'Tim Marshall', 'price': '24.00', 'imagelink': 'https://m.media-amazon.com/images/I/8124vQvpZAL._AC_UY218_.jpg', 'rating': '4.5 out of 5 stars', 'valuation': '160', 'link': 'https://www.amazon.de/-/en/Tim-Marshall/dp/3423283017/ref=sr_1_16?crid=PYX2JNBU03IA&keywords=b%C3%BCcher+bestseller+2022&qid=1655756172&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&sr=8-16'}
2022-06-21 02:16:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.de/s?k=b%C3%BCcher+bestseller+2022&crid=PYX2JNBU03IA&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&ref=nb_sb_ss_ts-doa-p_1_22>
{'name': 'Belmonte (Die Belmonte-Reihe 1): Eine deutsch-italienische Familiensaga | Ein bewegender Familiengeschichten-Roman rund um Liebe, Heimat und Identität', 'author': 'Antonia Riepp', 'price': '11.00', 'imagelink': 'https://m.media-amazon.com/images/I/71y4gRqP6-L._AC_UY218_.jpg', 'rating': '4.2 out of 5 stars', 'valuation': '986', 'link': 'https://www.amazon.de/-/en/Antonia-Riepp/dp/3492317472/ref=sr_1_17?crid=PYX2JNBU03IA&keywords=b%C3%BCcher+bestseller+2022&qid=1655756172&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&sr=8-17'}
2022-06-21 02:16:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.de/s?k=b%C3%BCcher+bestseller+2022&crid=PYX2JNBU03IA&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&ref=nb_sb_ss_ts-doa-p_1_22>
{'name': 'Ein neuer Sommer in der kleinen Bäckerei (Die kleine Bäckerei am Strandweg 4): Roman | Sommerlich heiterer Frauenroman über einen Neuanfang auf einer Insel vor Cornwall', 'author': 'Jenny Colgan', 'price': '12.00', 'imagelink': 'https://m.media-amazon.com/images/I/81dkHD6ZsPL._AC_UY218_.jpg', 'rating': '4.5 out of 5 stars', 'valuation': '134', 'link': 'https://www.amazon.de/-/en/Jenny-Colgan/dp/3492318088/ref=sr_1_18?crid=PYX2JNBU03IA&keywords=b%C3%BCcher+bestseller+2022&qid=1655756172&sprefix=b%C3%BCcher+bestseller+2022%2Caps%2C848&sr=8-18'}

... and so on.
P.S.: You have to supply a user agent (the USER_AGENT entry in custom_settings).
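
Two of the links in the output above (the sponsored results) do not point at a product page directly but at a picassoRedirect.html URL whose url query parameter appears to carry the percent-encoded product path. If the direct link is wanted, that parameter can be decoded with the standard library. This is only a sketch: the function name unwrap_sponsored is hypothetical, the sample link is a shortened version of the one in the log, and it is an assumption that the url parameter always holds the product path.

```python
from urllib.parse import parse_qs, urljoin, urlsplit


def unwrap_sponsored(link, base='https://www.amazon.de'):
    """Return the target of a picassoRedirect link, or the link unchanged otherwise."""
    parts = urlsplit(link)
    # parse_qs percent-decodes the values and returns a list per parameter.
    target = parse_qs(parts.query).get('url')
    if 'picassoRedirect' in parts.path and target:
        return urljoin(base, target[0])
    return link


# Shortened version of a sponsored link from the output above.
sponsored = ('https://www.amazon.de/-/en/gp/slredirect/picassoRedirect.html/'
             'ref=pa_sp_mtf_aps_sr_pg1_1?ie=UTF8'
             '&url=%2FMarion-Schreiner-ebook%2Fdp%2FB07GZ4R6RJ%2F&widgetName=sp_mtf')

print(unwrap_sponsored(sponsored))
# https://www.amazon.de/Marion-Schreiner-ebook/dp/B07GZ4R6RJ/
print(unwrap_sponsored('https://www.amazon.de/-/en/Jess-Ryder/dp/1803142863/'))
# https://www.amazon.de/-/en/Jess-Ryder/dp/1803142863/
```

Ordinary product links pass through untouched, so the helper could be applied to every extracted link.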
