python: How to fix an error when scraping Facebook Marketplace with infinite scrolling using Selenium

Posted by dy1byipe on 2023-02-07 in Python

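My code scrapes housing data from Facebook Marketplace, but I have run into a problem. When the page first opens, only 24 listings are available to read. When I scroll down to load more listings, my code starts reading all the listings again from the beginning instead of continuing from the 25th. How can I fix this?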

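import random
from time import sleep

from selenium.webdriver.common.by import By

# driver, open_xpath and close_xpath are assumed to be defined earlier in the script
open = driver.find_elements(By.XPATH, '//div[@class="x3ct3a4"]')
# open is a list of all clickable housing listings present when the page first loads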

while True:
    for o in open:
        sleep(random.randint(1, 2))

        #Here I read the data that I need 

        close_button = driver.find_element(By.XPATH, close_xpath)
        close_button.click()
        sleep(random.randint(1, 2))
        #Here I close the listing and go to the next one
        
    #When I read all 24 listings that were in the 'open' list, I then scroll the page down and try to get new listings and then read them

    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    sleep(random.randint(2, 4))
    open = driver.find_elements(By.XPATH, open_xpath)

    #But after the scroll, my code starts reading the same listings that it already read.

Here is my output:

1
['', '2 Beds 1 Bath Apartment']
['$1,600 / Month']
2
['', '1 Bed 1 Bath Apartment']
['$1,500 / Month']

.
.
.
24
['', '2 Beds 2 Baths Apartment']
['$1,350 / Month']
25
['', '2 Beds 1 Bath Apartment']
['$1,600 / Month']
26
['', '1 Bed 1 Bath Apartment']
['$1,500 / Month']

So after opening the 24th listing, the code starts reading all the listings again from the first one.

Answer #1, by von4xj4u

There are a couple of approaches you can try.

Remove the elements from the HTML

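At the end of each iteration of the for o in open: loop, use JavaScript to remove the current element o from the HTML, so that it cannot be found again after the next scroll.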

for o in open:
    ...
    driver.execute_script('var element = arguments[0]; element.remove();', o)

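However, this approach may not work: sometimes, when you scroll down to load new elements, the page reloads all the earlier elements and adds them back to the HTML. If that happens, try the next approach.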
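If earlier elements do get re-added, a related option that is not part of the original answer is to remember a key for every listing already processed and skip anything whose key has been seen. The following is a minimal sketch under assumptions: unseen_only and its parameters are hypothetical names, and the choice of key (for example a listing link's URL) is a guess about the page that would need checking against the real markup.

```python
from typing import Callable, Hashable, Iterable, List, Set, TypeVar

T = TypeVar("T")


def unseen_only(
    items: Iterable[T],
    seen: Set[Hashable],
    key: Callable[[T], Hashable],
) -> List[T]:
    """Return the items whose key is not yet in `seen`, recording each new key."""
    fresh = []
    for item in items:
        k = key(item)
        if k in seen:
            continue  # already processed on an earlier pass
        seen.add(k)
        fresh.append(item)
    return fresh


# Hypothetical wiring with Selenium (assumes each listing contains a link
# whose href is unique per listing; verify this on the actual page):
#
#   seen = set()
#   while True:
#       found = driver.find_elements(By.XPATH, open_xpath)
#       for o in unseen_only(
#           found, seen,
#           key=lambda e: e.find_element(By.TAG_NAME, "a").get_attribute("href"),
#       ):
#           ...  # read the listing, then close it
#       driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
```

Unlike a position-based check, this does not depend on the listings keeping the same order in the DOM, at the cost of one extra lookup per element to compute the key.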

Add a counter and loop over only the new elements

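Define a counter and loop only over the elements whose index is at or above the counter. The first time the while loop runs, the counter is 0, so the for loop covers every element in open. At the end of that for loop, counter equals 24, so on the second pass of the while loop we get for o in open[24:]:, which skips the first 24 elements. This relies on open being refreshed with find_elements after each scroll, as the question's code already does.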

open = driver.find_elements(By.XPATH, '//div[@class="x3ct3a4"]')
counter = 0
while True:
    for o in open[counter:]:
        ...
        counter += 1
           
    ...
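
The counter idea above can be packaged as a small generator. This is a minimal sketch rather than the answerer's code: iter_new_items, fetch_all, load_more and max_idle_rounds are hypothetical names, and the idle-round stop condition is an added assumption so that the loop ends once scrolling stops producing listings (the original while True never exits). It also assumes the page keeps earlier listings in the same order after each scroll.

```python
from typing import Callable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_new_items(
    fetch_all: Callable[[], List[T]],
    load_more: Callable[[], None],
    max_idle_rounds: int = 2,
) -> Iterator[T]:
    """Yield each item once, skipping the first `counter` items on every pass."""
    counter = 0  # how many items have been yielded so far
    idle = 0     # consecutive passes that produced nothing new
    while idle < max_idle_rounds:
        items = fetch_all()           # full list, old items included
        new_items = items[counter:]   # only the ones not handled yet
        idle = 0 if new_items else idle + 1
        for item in new_items:
            yield item
            counter += 1
        load_more()                   # e.g. scroll to the bottom and wait


# Hypothetical wiring with Selenium, reusing the names from the question:
#
#   def fetch_all():
#       return driver.find_elements(By.XPATH, open_xpath)
#
#   def load_more():
#       driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
#       sleep(random.randint(2, 4))
#
#   for o in iter_new_items(fetch_all, load_more):
#       ...  # read the listing, then click the close button
```

Keeping the slicing logic separate from the Selenium calls makes it possible to check the skipping behaviour with plain lists before running it against the live page.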
