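I'm trying to run the Python script below to extract data from Google Scholar. However, when I run the code, I get an empty list as the JSON response. Note that all the necessary libraries are installed.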
import json

import requests
from bs4 import BeautifulSoup

headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
params = {
    'q': 'Machine learning',
    'hl': 'en'
}
html = requests.get('https://scholar.google.com/scholar', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# JSON data will be collected here
data = []

# Container where all needed data is located
for result in soup.select('.gs_r.gs_or.gs_scl'):
    title = result.select_one('.gs_rt').text
    title_link = result.select_one('.gs_rt a')['href']
    publication_info = result.select_one('.gs_a').text
    snippet = result.select_one('.gs_rs').text
    cited_by = result.select_one('#gs_res_ccl_mid .gs_nph+ a')['href']
    related_articles = result.select_one('a:nth-child(4)')['href']
    # select_one returns None when nothing matches, so indexing raises TypeError
    try:
        all_article_versions = result.select_one('a~ a+ .gs_nph')['href']
    except (TypeError, KeyError):
        all_article_versions = None
    try:
        pdf_link = result.select_one('.gs_or_ggsm a:nth-child(1)')['href']
    except (TypeError, KeyError):
        pdf_link = None

    data.append({
        'title': title,
        'title_link': title_link,
        'publication_info': publication_info,
        'snippet': snippet,
        'cited_by': f'https://scholar.google.com{cited_by}',
        'related_articles': f'https://scholar.google.com{related_articles}',
        # Avoid building the string 'https://scholar.google.comNone'
        'all_article_versions': f'https://scholar.google.com{all_article_versions}' if all_article_versions else None,
        'pdf_link': pdf_link
    })

print(json.dumps(data, indent=2, ensure_ascii=False))
Output: []
2 Answers
1u4esq0p #1
Your code runs fine; the problem is saving the scraped data correctly in JSON format. You can use the pandas DataFrame, a powerful and simple tool, to save the data as JSON.
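A minimal sketch of that approach, assuming the scraped records are a list of dicts like `data` in the question. The helper name `records_to_json` and the sample records are hypothetical and only for illustration; `DataFrame.to_json` with `orient='records'` is the pandas call that emits a JSON array of objects.

```python
import json

import pandas as pd


def records_to_json(records):
    """Serialize a list of dicts to a JSON array string via a DataFrame."""
    frame = pd.DataFrame(records)
    # orient='records' yields [{...}, {...}]; force_ascii=False keeps
    # non-ASCII titles readable instead of \u-escaping them.
    return frame.to_json(orient='records', force_ascii=False, indent=2)


# Hypothetical sample records shaped like the question's `data` entries.
sample_records = [
    {'title': 'First paper', 'title_link': 'https://example.com/a', 'pdf_link': None},
    {'title': 'Second paper', 'title_link': 'https://example.com/b', 'pdf_link': 'https://example.com/b.pdf'},
]

json_text = records_to_json(sample_records)
print(json_text)
```

Passing a file path as the first argument to `to_json` writes the same text to disk instead of returning it.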
Output:
mum43rcc #2
This will let you write the results to a JSON file.
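One way to do this with only the standard library, as a sketch: `json.dump` writes the list straight to an open file. The helper name `write_results`, the file name `results.json`, and the sample records are assumptions for illustration; the file goes into a temporary directory here so the sketch leaves nothing behind.

```python
import json
import os
import tempfile


def write_results(records, path):
    """Write a list of dicts to `path` as pretty-printed UTF-8 JSON."""
    with open(path, 'w', encoding='utf-8') as fh:
        json.dump(records, fh, indent=2, ensure_ascii=False)


# Hypothetical sample records shaped like the question's `data` entries.
sample_data = [
    {'title': 'First paper', 'pdf_link': None},
    {'title': 'Second paper', 'pdf_link': 'https://example.com/b.pdf'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'results.json')
    write_results(sample_data, out_path)
    # Read the file back to confirm it round-trips.
    with open(out_path, encoding='utf-8') as fh:
        loaded = json.load(fh)

print(loaded == sample_data)
```

In the question's script, the call would be `write_results(data, 'results.json')` after the loop.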