我必须去here
在这里我必须选择申请人name =“ltd”
但是在提交页面之前,我必须解决一个验证码。如何使用网页剪贴功能以excel格式获取下一页的信息(申请号、申请标题、日期、申请状态等)?
-----------------正在运行以下脚本,出现错误-----
import csv
import json
from time import sleep, time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver import DesiredCapabilities
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.webdriver import WebDriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
def save_to_csv(data: list) -> None:
with open(file='ipindiaservices.csv', mode='a', encoding="utf-8") as f:
writer = csv.writer(f, lineterminator='\n')
writer.writerow([*data])
def start_from_page(page_number: int, driver: WebDriver) -> None:
driver.execute_script(
f"""
document.querySelector('button.next').value = {page_number};
document.querySelector('button.next').click();
"""
)
def titles_validation(driver: WebDriver) -> None:
"""replace empty title name with '_'"""
driver.execute_script(
"""
let titles = document.querySelectorAll('input+.tab-pane tr:not(:first-child)>td:last-child')
Array.from(titles).forEach((e) => {
if (!e.textContent.trim()) {
e.textContent = '_';
}
});
"""
)
def get_network_data(log: dict, driver: WebDriver) -> dict:
log = json.loads(log["message"])["message"]
if all([
"Network.responseReceived" in log["method"],
"params" in log.keys(),
'CaptchaAudio' in str(log["params"].values())
]):
return driver.execute_cdp_cmd('Network.getResponseBody', {'requestId': log["params"]["requestId"]})
def get_captcha_text(driver: WebDriver, timeout: float) -> str:
"""Return captcha text
Arguments:
- driver: WebDriver
- timeout: pause before receiving data from the web driver log
"""
driver.execute_script(
"""
// document.querySelector('img[title="Captcha"]').click()
document.querySelector('img[title="Captcha Audio"]').click()
"""
)
sleep(timeout)
logs = driver.get_log('performance')
responses = [get_network_data(log, driver) for log in logs if get_network_data(log, driver)]
if responses:
return json.loads(responses[0]['body'])['CaptchaImageText']
else:
get_captcha_text(driver)
def submit_captcha(captcha_text: str, btn_name: str) -> None:
"""Submit captcha
Arguments:
- btn_name: captcha send button name["submit" or "search"]
"""
if btn_name == 'search':
captcha_locator = (By.CSS_SELECTOR, 'input[name="submit"]')
elif btn_name == 'submit':
captcha_locator = (By.ID, 'btnSubmit')
wait.until(EC.visibility_of_element_located((By.ID, 'CaptchaText'))).send_keys(captcha_text)
wait.until(EC.visibility_of_element_located(captcha_locator)).click()
# options = webdriver.ChromeOptions()
# options.add_argument('--headless')
# options.add_experimental_option("excludeSwitches", ["enable-automation", "enable-logging"])
# capabilities = DesiredCapabilities.CHROME
# capabilities["goog:loggingPrefs"] = {"performance": "ALL"}
# service = Service(executable_path="path/to/your/chromedriver.exe")
# # driver = webdriver.Chrome(service=service, options=options, desired_capabilities=capabilities)
wait = WebDriverWait(driver, 15)
table_values_locator = (By.CSS_SELECTOR, 'input+.tab-pane tr:not(:first-child)>td:last-child')
applicant_name_locator = (By.ID, 'TextField6')
page_number_locator = (By.CSS_SELECTOR, 'span.Selected')
app_satus_locator = (By.CSS_SELECTOR, 'button.btn')
next_btn_locator = (By.CSS_SELECTOR, 'button.next')
driver.get('https://ipindiaservices.gov.in/PublicSearch/')
# sometimes an alert with an error message("") may appear, so a small pause is used
sleep(1)
wait.until(EC.visibility_of_element_located(applicant_name_locator)).send_keys('ltd')
# on the start page and the page with the table, the names of the buttons are different
captcha_text = get_captcha_text(driver, 1)
submit_captcha(captcha_text, "search")
# the page where the search starts
start_from_page(1, driver)
while True:
start = time()
# get current page number
current_page = wait.until(EC.visibility_of_element_located(page_number_locator)).text
print(f"Current page: {current_page}")
# get all application status WebElements
app_status_elements = wait.until(EC.visibility_of_all_elements_located(app_satus_locator))
for element in range(len(app_status_elements)):
print(f"App number: {element}")
# update application status WebElements
app_status_elements = wait.until(EC.visibility_of_all_elements_located(app_satus_locator))
# click on application status
wait.until(EC.visibility_of(app_status_elements[element])).click()
# wait 2 seconds for the captcha to change
sleep(2)
# get text and submit captcha
captcha_text = get_captcha_text(driver, 1)
submit_captcha(captcha_text, "submit")
try:
# get all table data values(without titles) WebElements
table_data_values = wait.until(EC.visibility_of_all_elements_located(table_values_locator))
# if there are empty rows in the table replace them with "_"
titles_validation(driver)
# save data to csv
save_to_csv([val.text.replace('\n', ' ') for val in table_data_values])
except TimeoutException:
print("Application Number does not exist")
finally:
driver.back()
# print the current page number to the console
print(f"Time per page: {round(time()-start, 3)}")
# if the current page is equal to the specified one, then stop the search and close the driver
if current_page == '3776':
break
# click next page
wait.until(EC.visibility_of_element_located(next_btn_locator)).click()
driver.quit()
2条答案
按热度按时间e5nszbig1#
在这个网站上,验证码可以被解析,而不需要求助于第三方服务。当你点击“验证码音频”按钮时,一个GET请求被发送到端点
https://ipindiaservices.gov.in/PublicSearch/Captcha/CaptchaAudio
。响应是一个字典{"CaptchaImageText":"hnnxd"}
,你可以通过"Chrome Devtools Protocol"
使用Network.getResponseBody
方法从Selenium获取它,或者你可以使用requests
库。要将数据保存为csv格式,例如,您可以使用标准库中包含的csv模块。以下是一个可能的解决方案:
此解决方案的性能约为每100页280秒。
收集所有数据大约需要2.5-3小时。
因此,添加了在特定页面(默认情况下,这是最后一页)上停止数据收集的功能:
并从指定页面开始采集数据(默认情况下,这是第一页):
输出为
ipindiaservices.csv
要在
Google Collab
中使用此解决方案,请按照下列步骤操作:1.安装Selenium和ChromeDriver
1.进行必要的导入
1.设定选项
其他的一切都保持不变。不要把所有的代码放在一个单元格里。
更新:这是一种可能的解决方案,可从每个应用程序的表中收集所有信息:
输出为
ipindiaservices.csv
此解决方案的性能约为每页
230sec
。有时,应用程序状态页上可能没有数据(例如,通过编号“00054/CAL/1998”,我们会得到“Application Number does not exist”(应用程序编号不存在))。因此,脚本会忽略此应用程序。
2 sec timeout before receiving the captcha text is due to the fact that after clicking on "Application Status" one captcha is shown and after ~1sec it changes to another one which we must enter
修复:由于点击
Application status
后出现的验证码已从站点中删除,其脚本中的解决方案也被删除。此外,此资源的开发人员为Application Number
和Application status
创建了相同的类,因此我们需要更改Application status
的css选择器。Google Colab
的解决方案:这些修正已经在
Google Colab
(python 3.7.15)上测试过了。这个解决方案的性能大约是每页55sec
。保留以前版本的脚本,以便进行直观比较。
h5qlskok2#
你可以使用任何capcta解决网站。在这些网站上,用户通常自己修复它,所以他们可能很慢,但它确实起到了作用。
Sample website (I didn't receive any ads)
您可以使用selenium来提取信息。在站点上提取和放置带有“id”标签的元素就足够了。Library for reading/writing excel in python