Scrapy: how to remove duplicate links when a keyword appears on a page more than once

Asked by ccrfmcuu on 2023-03-23

I am using a spider (web crawler) to search web pages for keywords, and it stores each keyword match together with the URL of the page it was found on in a CSV file. The problem is that when a keyword appears multiple times on the same page, the CSV file contains duplicate rows. How can I remove the duplicate links for a keyword?

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


# Class wrapper and spider name assumed; the question shows only the body.
class BuzzwordSpider(CrawlSpider):
    name = "buzzwords"
    allowed_domains = ["www.geo.tv"]
    start_urls = ["https://www.geo.tv/"]
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    crawl_count = 0
    words_found = 0

    def check_buzzwords(self, response):
        self.__class__.crawl_count += 1

        wordlist = [
            "Imran",
            "Hello",
            "Nauman",
        ]

        url = response.url
        # Scrapy headers are bytes, so the default must be bytes as well.
        contenttype = response.headers.get("content-type", b"").decode("utf-8").lower()
        data = response.body.decode("utf-8")

        for word in wordlist:
            substrings = find_all_substrings(data, word)
            # One row is printed per occurrence, so a word that appears
            # several times on the same page yields duplicate rows.
            for pos in substrings:
                self.__class__.words_found += 1
                print(word + ";" + url + ";")
        return scrapy.Item()
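The question does not show find_all_substrings. A minimal sketch of what it might look like, assuming it yields the start index of every occurrence of word in data:

def find_all_substrings(data, word):
    # Yield the start index of each occurrence of word in data,
    # including overlapping occurrences.
    start = 0
    while True:
        pos = data.find(word, start)
        if pos == -1:
            return
        yield pos
        start = pos + 1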

Answer 1 (3bygqnnd):

I'm not entirely sure what your problem is, but it sounds like all you need to do is stop iterating over the full iterable returned by find_all_substrings. Just break after the first iteration, since you know every further iteration would be a duplicate.
For example:

for word in wordlist:
    substrings = find_all_substrings(data, word)
    for pos in substrings:
        self.__class__.words_found += 1
        print(word + ";" + url + ";")
        # Every further position is the same word on the same page,
        # so stop after recording the first match.
        break
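If the positions themselves are never used, the same deduplication falls out of a plain substring test, which emits at most one row per word per page. A minimal sketch of the loop rewritten that way (note that, like the break above, this makes words_found count pages containing the word rather than total occurrences):

for word in wordlist:
    # "in" only asks whether the word occurs at all on this page,
    # so each (word, page) pair is printed at most once.
    if word in data:
        self.__class__.words_found += 1
        print(word + ";" + url + ";")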
