I'm trying to extract data from a website with terrible html formatting, because all of the information I want is inside the same div, separated by line breaks. I'm new to web scraping, so please forgive me.
https://wsldata.com/directory/record.cfm?LibID=48
To get the parts I need, I use:
details_raw = response.xpath('/html/body/div/table/tbody/tr/td/div/div/text()').getall()
which returns:
['\r\n',
'\r\n',
'\r\n',
'\r\n \r\n ',
'\r\n\t\t\t',
'\r\n ',
'\r\n ',
'\r\n ',
'\r\n\t\t\t\r\n\t\t\t',
'\r\n \r\n ',
'\r\n\t\t\tDirector',
'\r\n Ext: 5442',
'\r\n ',
'\r\n ',
'\r\n\t\t\t\r\n\t\t\t',
'\r\n \r\n ',
'\r\n\t\t\tAssistant Library Director',
'\r\n Ext: 5433',
'\r\n ',
'\r\n ',
'\r\n\t\t\t\r\n\t\t\t',
'\r\n \r\n ',
'\r\n\t\t\tYouth Services Librarian',
'\r\n ',
'\r\n ',
'\r\n ',
'\r\n\t\t\t\r\n\t\t\t',
'\r\n \r\n ',
'\r\n\t\t\tTechnical Services Librarian',
'\r\n Ext: 2558',
'\r\n ',
'\r\n ',
'\r\n\t\t\t\r\n\t\t\t',
'\r\n \r\n ',
'\r\n\t\t\tOutreach Librarian',
'\r\n ',
'\r\n ',
'\r\n ',
'\r\n\t\t\t\r\n\t\t\t',
'\r\n \r\n ',
'\r\n\t\t\tFoundation Executive Director',
'\r\n Ext: 5456',
'\r\n ',
'\r\n ',
'\r\n\t\t\t\r\n\t\t\t',
'\r\n \r\n',
'\r\n',
' \xa0|\xa0 ',
'\r\n']
I have managed to turn this into the desired format using the following code:
import scrapy
import re

class LibspiderSpider(scrapy.Spider):
    name = 'libspider'
    allowed_domains = ['wsldata.com']
    start_urls = ['https://wsldata.com/directory/record.cfm?LibID=48']
    # Note that start_urls contains multiple links, I just simplified it here to reduce cluttering

    def parse(self, response):
        details_raw = response.xpath('/html/body/div/table/tbody/tr/td/div/div/text()').getall()
        details_clean = []
        titles = []
        details = []
        for detail in details_raw:
            # Strip tabs, newlines, runs of spaces, non-breaking spaces, and the "|" separator
            detail = re.sub(r'\t', '', detail)
            detail = re.sub(r'\n', '', detail)
            detail = re.sub(r'\r', '', detail)
            detail = re.sub(r' {2,}', '', detail)
            detail = re.sub(r' \xa0|\xa0 ', '', detail)
            detail = re.sub(r'\|', '', detail)
            detail = re.sub(r' E', 'E', detail)
            if detail == '':
                pass
            elif detail == '|':
                pass
            else:
                details_clean.append(detail)
                if detail[0:3] != 'Ext':
                    titles.append(detail)
        # Insert '-' as a placeholder phone for any title that has no extension after it
        for r in range(len(details_clean)):
            if r == 0:
                details.append(details_clean[r])
            else:
                if details_clean[r-1][0:3] != 'Ext' and details_clean[r][0:3] != 'Ext':
                    details.append('-')
                    details.append(details_clean[r])
                else:
                    details.append(details_clean[r])
        # Pair every title with the phone entry that follows it
        output = []
        for t in range(len(details)//2):
            info = {
                "title": details[(t*2)],
                "phone": details[(t*2+1)],
            }
            output.append(info)
The block of code after the response.xpath line cleans my input into a nicer output. When testing the code outside of Scrapy, using the weird input shown at the top of the post, I get:
[{'title': 'Director', 'phone': 'Ext: 5442'}, {'title': 'Assistant Library Director', 'phone': 'Ext: 5433'}, {'title': 'Youth Services Librarian', 'phone': '-'}, {'title': 'Technical Services Librarian', 'phone': 'Ext: 2558'}, {'title': 'Outreach Librarian', 'phone': '-'}, {'title': 'FoundationExecutive Director', 'phone': 'Ext: 5456'}]
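As a side note, the chain of regex substitutions can be sketched more simply with str.strip() and a filter. This is a standalone illustration (not the question's original code), with a few of the raw entries hard-coded so it runs on its own:

```python
# A few of the raw entries from the question, hard-coded so this runs standalone.
details_raw = ['\r\n', '\r\n\t\t\tDirector', '\r\n            Ext: 5442',
               ' \xa0|\xa0 ', '\r\n\t\t\tYouth Services Librarian', '\r\n           ']

# strip() with an explicit character set drops the surrounding \r, \n, \t,
# spaces, and non-breaking spaces in one pass; the filter then discards
# entries that are empty or reduce to the "|" separator.
details_clean = [d.strip(' \r\n\t\xa0') for d in details_raw]
details_clean = [d for d in details_clean if d and d != '|']
```

This avoids the per-character re.sub calls entirely, at the cost of not collapsing interior whitespace (which the directory entries don't appear to need).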
When I try to implement this code inside Scrapy's parse(), my log shows no items scraped and, unsurprisingly, I get an empty json.
yield is absent from the code above because I tried multiple ways to implement it, none of which worked. Am I missing the connection between Scrapy's response and yield, or is what I'm trying to do impossible, and should I instead extract this weird list out of Scrapy like this:
def parse(self, response):
    details_raw = response.xpath('/html/body/div/table/tbody/tr/td/div/div/text()').getall()
    yield {
        'details_in': details_raw
    }
which extracts:
[
{"details_in": ["\r\n", "\r\n", "\r\n", "\r\n \r\n ", "\r\n\t\t\t", "\r\n ", "\r\n ", "\r\n ", "\r\n\t\t\t\r\n\t\t\t", "\r\n \r\n ", "\r\n\t\t\tDirector", "\r\n Ext: 5442", "\r\n ", "\r\n ", "\r\n\t\t\t\r\n\t\t\t", "\r\n \r\n ", "\r\n\t\t\tAssistant Library Director", "\r\n Ext: 5433", "\r\n ", "\r\n ", "\r\n\t\t\t\r\n\t\t\t", "\r\n \r\n ", "\r\n\t\t\tYouth Services Librarian", "\r\n ", "\r\n ", "\r\n ", "\r\n\t\t\t\r\n\t\t\t", "\r\n \r\n ", "\r\n\t\t\tTechnical Services Librarian", "\r\n Ext: 2558", "\r\n ", "\r\n ", "\r\n\t\t\t\r\n\t\t\t", "\r\n \r\n ", "\r\n\t\t\tOutreach Librarian", "\r\n ", "\r\n ", "\r\n ", "\r\n\t\t\t\r\n\t\t\t", "\r\n \r\n ", "\r\n\t\t\tFoundation Executive Director", "\r\n Ext: 5456", "\r\n ", "\r\n ", "\r\n\t\t\t\r\n\t\t\t", "\r\n \r\n", "\r\n", " \u00a0|\u00a0 ", "\r\n"]},
{"details_in": ["\r\n", "\r\n", "\r\n", "\r\n \r\n ", "\r\n\t\t\tBranch Librarian", "\r\n ", "\r\n ", "\r\n ", "\r\n\t\t\t\r\n\t\t\t", "\r\n \r\n", "\r\n", " \u00a0|\u00a0 ", "\r\n"]},
...
...
]
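For reference, Scrapy only collects items that parse() itself yields, so the output list built at the end of the cleaning code is silently discarded. A minimal sketch of the final pairing loop rewritten as yields — plain Python with the cleaned list hard-coded so it runs standalone; inside the spider, the same loop would replace the output list at the end of parse():

```python
def emit_items(details):
    # Inside Scrapy, parse() would end with this loop; each yielded dict
    # becomes one scraped item in the feed export.
    for t in range(len(details) // 2):
        yield {
            'title': details[t * 2],
            'phone': details[t * 2 + 1],
        }

# Cleaned title/phone pairs as produced by the question's cleanup code.
details = ['Director', 'Ext: 5442', 'Youth Services Librarian', '-']
items = list(emit_items(details))
```

Because parse() is itself a generator, yielding each dict where output.append(info) currently sits is all the connection between response and yield that Scrapy needs.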
1 Answer
If you want to remove those lines from the list, you can use this (instead of regex):
You can get the desired results by using the correct XPath selectors:
The XPath selector looks like this because, as you said:
"a website with terrible html formatting"
I'm sure you can find another XPath selector to fit your needs, but this one isn't terrible =).
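A hedged sketch of the kind of selector the answer is pointing at: XPath's `text()[normalize-space()]` predicate skips whitespace-only text nodes at selection time, so most of the regex cleanup becomes unnecessary. This standalone illustration uses lxml on made-up markup — the real wsldata.com structure differs, so treat the path as illustrative only:

```python
from lxml import html

# Made-up markup standing in for the directory page; the real wsldata.com
# structure differs, so the path below is illustrative only.
snippet = (
    "<div>"
    "<div>\r\n\t\t\tDirector<br>\r\n Ext: 5442<br>\r\n </div>"
    "<div>\r\n\t\t\tYouth Services Librarian<br>\r\n </div>"
    "</div>"
)
tree = html.fromstring(snippet)

# text()[normalize-space()] keeps only text nodes containing something
# besides whitespace, so the blank entries never appear in the result.
texts = [t.strip() for t in tree.xpath("//div/div/text()[normalize-space()]")]
```

The same predicate works in Scrapy's response.xpath(), since both are XPath 1.0 engines.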