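I'm putting this script together for an academic research paper. I'm an absolute novice, self-taught, and I have cobbled this together!
What I want: a CSV of roughly 560 rows, with one column each for date (mdyyyy), review, rating, and username (the username is not in the script yet; it is listed here for reference).
I have gotten it to run without errors, but the output is wrong. I get thousands of rows. The script loops and writes the data in several formats: 1) 500-ish rows with month/date and review, 2) 500-ish rows with rating and review, 3) 500-ish rows with name, date, and review all in the same column, and so on.
I spent hours trying to fix that, and now I have a different problem: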
Traceback (most recent call last):
  line 49, in
    date = " ".join(date[j].text.split(" ")[-2:])
IndexError: list index out of range
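I'm running this in Python 3.9.6, if that makes a difference.
I have three questions:
How do I fix this date index-out-of-range error?
Is there any obvious mistake in the script that makes it create thousands of rows in different formats?
How can I add the username? I have tried, but I can't seem to find the right XPath. Here is the site I'm scraping: https://www.tripadvisor.com/showuserreviews-g189447-d207187-r773649540-monastery_of_st_john-patmos_dodecanese_south_aegean.html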
```python
import sys
import csv
from selenium import webdriver
import time

# default path to file to store data
path_to_file = r"D:\Documents\Archaeology\Projects\Patmos\scraped\monastery6.csv"

# default number of scraped pages
num_page = 5

# default tripadvisor website of hotel or things to do (attraction/monument)
url = "https://www.tripadvisor.com/ShowUserReviews-g189447-d207187-r773649540-Monastery_of_St_John-Patmos_Dodecanese_South_Aegean.html"
# url = "https://www.tripadvisor.com/ShowUserReviews-g189447-d207187-r773649540-Monastery_of_St_John-Patmos_Dodecanese_South_Aegean.html"

# if you pass the inputs in the command line
if (len(sys.argv) == 4):
    path_to_file = sys.argv[1]
    num_page = int(sys.argv[2])
    url = sys.argv[3]

# import the webdriver -- NMS 20210705
driver = webdriver.Chrome("C:/Users/nsusm/AppData/Local/Programs/Python/Python39/webdriver/bin/chromedriver.exe")
driver.get("https://www.tripadvisor.com/ShowUserReviews-g189447-d207187-r773649540-Monastery_of_St_John-Patmos_Dodecanese_South_Aegean.html")

# open the file to save the review
csvFile = open(path_to_file, 'a')
csvFile = open(path_to_file, 'a', encoding="utf-8")
csvWriter = csv.writer(csvFile, delimiter=',')
csvWriter.writerow([str ('title'), str ('rating'), str ('review'), str ('date')])

# change the value inside the range to save more or less reviews
for i in range(0, 48, 1):

    # expand the review
    time.sleep(2)

    # define container (this is the whole box of the Trip Advisor review, excluding the date of the review)
    container = driver.find_elements_by_xpath(".//div[@class='review-container']")

    # grab also the date of review
    date = driver.find_elements_by_xpath(".//class[@class='prw_reviews_stay_date_hsx']")

    for j in range(len(container)):
        rating = container[j].find_element_by_xpath(".//span[contains(@class, 'ui_bubble_rating bubble_')]").get_attribute("class").split("_")[3]
        title = container[j].find_element_by_xpath(".//div[contains(@class, noQuotes)]").text.replace("\n", " ")
        review = container[j].find_element_by_xpath(".//p[@class='partial_entry']").text.replace("\n", " ")
        date = " ".join(date[j].text.split(" ")[-2:])

        # write data into csv
        csvWriter.writerow([title, rating, review, date])

    # change the page
    driver.find_element_by_xpath('.//a[@class="nav next ui_button primary"]').click()

# quit selenium
driver.quit()
# FYI you need to close all windows for the file to write
```
1 Answer
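That date finder comes back empty, so `date[j]` has nothing to index and raises the IndexError. Its XPath, `.//class[...]`, asks for an element named `class`, and no such element exists on the page. The review date is inside the container, so you can look it up there along with the other fields.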
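The failure can be reproduced without a browser. The sketch below assumes the date lookup returned an empty list, which is what the traceback suggests; `safe_get` is a hypothetical helper introduced here for illustration and is not part of Selenium.

```python
# Stand-in for the result of find_elements_by_xpath when nothing matches:
# Selenium returns an empty list instead of raising an exception.
dates = []


def safe_get(items, index, default=""):
    """Return items[index], or default when the index is out of range."""
    return items[index] if 0 <= index < len(items) else default


# Indexing the empty list reproduces the error from the question.
try:
    dates[0]
    outcome = "no error"
except IndexError as exc:
    outcome = str(exc)

print(outcome)                   # the IndexError message
print(repr(safe_get(dates, 0)))  # the guarded lookup falls back to ""
```

The script in the question also reuses the name `date` for both the list of elements and the formatted string, so even with a non-empty list the second pass through the inner loop would index into a string. Giving the two values different names avoids that.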
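Changes: use just the span for the title, not the whole div; add code to find the person (stripping the location off the second line); find the date inside the container and strip off "Reviewed".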
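The text clean-up behind those changes can be sketched as plain string handling, kept separate from the Selenium lookups so it can be checked on its own. Everything here is an assumption for illustration: the function names are hypothetical, the sample strings are made up, and the real text returned by the page may be laid out differently.

```python
def parse_rating(class_attr):
    """Take the bubble score from the class string, using the same split as the question."""
    return class_attr.split("_")[3]


def parse_person(member_text):
    """Keep the first line (the username) and drop the location on the second line."""
    return member_text.split("\n")[0].strip()


def parse_date(date_text, prefix="Reviewed"):
    """Drop a leading 'Reviewed' label and any surrounding whitespace."""
    text = date_text.strip()
    if text.startswith(prefix):
        text = text[len(prefix):]
    return text.strip()


def build_row(title, class_attr, review, date_text, member_text):
    """Assemble one CSV row from the raw strings read out of a single review container."""
    return [
        title.replace("\n", " "),
        parse_rating(class_attr),
        review.replace("\n", " "),
        parse_date(date_text),
        parse_person(member_text),
    ]


# Made-up sample values standing in for what the element lookups would return.
row = build_row(
    "Worth the climb",
    "ui_bubble_rating bubble_50",
    "Beautiful\nmonastery.",
    "Reviewed October 12, 2020",
    "some_user\nAthens, Greece",
)
print(row)
```

In the scraper, each argument would come from a lookup on `container[j]`, so every field of a row is read from the same review and the rows cannot drift out of step with a separately collected list.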