I'm trying to scrape this page: https://www.semi.org/en/resources/member-directory
On its own, this line seems to work fine and returns the link I want:
link = browser.find_element(By.CLASS_NAME, "member-company__title").find_element(By.TAG_NAME, 'a').get_attribute('href')
But when I nest the code inside a for loop, I get an error saying the CSS selector can't find the element. I tried an XPath instead, but it only reaches the first container.
Here is my code:
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

results_df = pd.DataFrame({'Company Name': [], 'Join Date': [], 'Company ID': [],
                           'Company Description': [], 'Link': [], 'Primary Industry': [],
                           'Primary Product Category': [], 'Primary Sub Product Category': [],
                           'Keywords': [], 'Address': []})

browser = webdriver.Chrome()

# Load the desired URL
another_url = "https://www.semi.org/en/resources/member-directory"
browser.get(another_url)
time.sleep(3)

containers = browser.find_elements(By.TAG_NAME, 'tr')
for i in range(len(containers)):
    container = containers[i]
    link = container.find_element(By.TAG_NAME, 'a').get_attribute('href')
    browser.get(link)
    print("Page navigated after click: " + browser.title)
    time.sleep(3)
    company_name = browser.find_element(By.CLASS_NAME, "page-title").text
    try:
        join_date = browser.find_element(By.CLASS_NAME, "member-company__join-date").find_element(By.TAG_NAME, 'span').text
    except NoSuchElementException:
        join_date = "None"
    try:
        c_ID = browser.find_element(By.CLASS_NAME, "member-company__company-id").find_element(By.TAG_NAME, 'span').text
    except NoSuchElementException:
        c_ID = "None"
    try:
        company_description = browser.find_element(By.CLASS_NAME, "member-company__description").text
    except NoSuchElementException:
        company_description = "None"
    try:
        company_link = browser.find_element(By.CLASS_NAME, "member-company__website").find_element(By.TAG_NAME, 'div').get_attribute('href')
    except NoSuchElementException:
        company_link = "None"
    try:
        primary_industry = browser.find_element(By.CLASS_NAME, "member-company__primary-industry").find_element(By.TAG_NAME, 'div').text
    except NoSuchElementException:
        primary_industry = "None"
    try:
        primary_product_cat = browser.find_element(By.CLASS_NAME, "member-company__primary-product-category").find_element(By.TAG_NAME, 'div').text
    except NoSuchElementException:
        primary_product_cat = "None"
    try:
        primary_sub_product_cat = browser.find_element(By.CLASS_NAME, "member-company__primary-product-subcategory").find_element(By.TAG_NAME, 'div').text
    except NoSuchElementException:
        primary_sub_product_cat = "None"
    try:
        keywords = browser.find_element(By.CLASS_NAME, "member-company__keywords").find_element(By.TAG_NAME, 'div').text
    except NoSuchElementException:
        keywords = "None"
    try:
        address = browser.find_element(By.CLASS_NAME, "member-company__address").text.replace("Street Address", "")
    except NoSuchElementException:
        address = "None"
    browser.get(another_url)
    time.sleep(5)
    result_df = pd.DataFrame({"Company Name": [company_name],
                              "Join Date": [join_date],
                              "Company ID": [c_ID],
                              "Company Description": [company_description],
                              "Company Website": [company_link],
                              "Primary Industry": [primary_industry],
                              "Primary Product Category": [primary_product_cat],
                              "Primary Sub Product Category": [primary_sub_product_cat],
                              "Keywords": [keywords],
                              "Address": [address]})
    results_df = pd.concat([results_df, result_df])

results_df.reset_index(drop=True, inplace=True)
results_df.to_csv('semi_test', index=False)
browser.close()
What is going on here?
1 Answer
This is mainly down to the statement
containers = browser.find_elements(By.TAG_NAME, 'tr')
If you print out containers, you will notice that the first row selected is the table header, which contains no link, so your script fails with exactly the exception you are seeing. You can patch that with containers = containers[1:], but you will then run into a StaleElementReferenceException, because the row elements you collected on the main page are invalidated as soon as you open a detail link and navigate back. Instead of returning to the main page again and again, you should scrape all of the links from the listing page in a single pass, then iterate over those URL strings and scrape each detail page.
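The fix can be sketched as a small helper that reads every href as a plain string before the browser ever leaves the listing page; strings cannot go stale, unlike WebElements. This is a minimal sketch, not tested against semi.org's live markup. It relies on the fact that Selenium's `By.TAG_NAME` locator is just the literal string `'tag name'`, so the helper itself has no hard dependency on a selenium import:

```python
def row_links(rows):
    """Return the href of the first <a> in each row, skipping rows
    (such as the header <tr>) that contain no link at all.

    `rows` can be any iterable of objects exposing Selenium's
    find_elements(by, value) / get_attribute(name) interface.
    Returning plain strings means nothing can go stale after the
    browser navigates away from the listing page.
    """
    links = []
    for row in rows:
        anchors = row.find_elements('tag name', 'a')  # By.TAG_NAME == 'tag name'
        if anchors:
            links.append(anchors[0].get_attribute('href'))
    return links
```

With that in place, the loop becomes `links = row_links(browser.find_elements(By.TAG_NAME, 'tr'))` followed by `for link in links: browser.get(link)` and the same per-page scraping as before; the `browser.get(another_url)` round trip at the bottom of the loop is no longer needed.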