I'm trying to extract JSON data from a script tag and then pull values out of it.
My code:
import json

import requests
from bs4 import BeautifulSoup

head = {
    "Accept": "application/json, text/plain, */*",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8,km;q=0.7",
    "Connection": "keep-alive",
    "Host": "www.ixigua.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
}
url = "https://www.ixigua.com/home/58484635562"
ree = requests.get(url, headers=head)
soup = BeautifulSoup(ree.content, 'html.parser')
script = soup.find_all('script')[-2].text
print(script)
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(script, f, ensure_ascii=False)
The result looks like this:
"window._SSR_HYDRATED_DATA={\"recommendFeed\":null,\"attentionFeed\":null,\"nbaFeed\":null,\"livingFeed\":null,\"channelFeed\":[],\"homeFeed\":null,\"adBanner\":[],\"channelInfo\":null,\"ChannelFeedList\":[],\"UserDetail\":{\"enableTabs\":[],\"hotPersonList\":[],\"userInfo\":{\"name\":\"-\",\"description\":\"-\",\"avatar\":\"\",\"followersCount\":0,\"followingCount\":0,\"user_id\":\"\",\"follow\":false},\"videoData\":{\"videoList\":[],\"loading\":true},\"hotsoonData\":{\"hotsoonList\":[]},\"preview_series\":[],\"seriesData\":{\"series_list\":[],\"hasMore\":false,\"nextCursor\":\"0\"}},\"FooterLinks\":[],\"LvideoChannel\":[],\"LvideoChannelOnTcc\":[],\"LvideoCategory\":[],\"AlbumInCategory\":[],\"ChannelFeedV2\":[],\"ChannelLevelOneConfig\":[],\"ChannelLevelTwoConfig\":[],\"HighQualityFeed\":[],\"ChannelBannerConfig\":[],\"Teleplay\":null,\"Projection\":{\"video\":{},\"series\":{},\"pSeries\":{},\"playlist\":{\"item_num\":0},\"shouldReturn404\":false,\"item_id\":\"\",\"key\":undefined},\"CinemaChannelFeed\":[],\"CinemaFeedRebojiemu\":[],\"CinemaFeedFromRedis\":[],\"MyWatchHistory\":[{\"type\":\"all\",\"videoFeed\":[],\"hasMore\":true},{\"type\":\"svideo\",\"videoFeed\":[],\"hasMore\":true},{\"type\":\"lvideo\",\"videoFeed\":[],\"hasMore\":true}],\"MyFavorite\":[{\"type\":\"all\",\"videoFeed\":[],\"hasMore\":true},{\"type\":\"svideo\",\"videoFeed\":[],\"hasMore\":true},{\"type\":\"lvideo\",\"videoFeed\":[],\"hasMore\":true}],\"AuthorDetailInfo\":{\"user_id\":\"58484635562\",\"media_id\":\"1562629337991170\",\"name\":\"鼎力推鉴王鼎杰工作室\",\"introduce\":\"小细节里的大战略,大格局里的小动作。\",\"avatar\":\"https:\\u002F\\u002Fsf3-cdn-tos.bdxiguastatic....
But whenever I try to print ["AuthorDetailInfo"], I get an error.
print(script["AuthorDetailInfo"])
Error output:
print(script["AuthorDetailInfo"])
TypeError: string indices must be integers
How can I print this? And how can I remove all the backslashes from the JSON?
Code:
print(script["AuthorDetailInfo"])
Expected result:
{
"user_id":"58484635562",
"media_id":"1562629337991170",
"name":"鼎力推鉴王鼎杰工作室",
"introduce":"小细节里的大战略"...
}
1 Answer
bq3bfh9z1#
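script is JavaScript code, not JSON. Note the window._SSR_HYDRATED_DATA= in front of the {. Everything after it can be treated as JSON (although technically it is not JSON). You have to deal with the variable assignment first. One way is to call split() on the = sign with maxsplit=1 and keep the part after it. That part can then be parsed with json.loads(), and finally you can index the resulting dict to get the part you want, such as AuthorDetailInfo.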
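The steps described here can be sketched roughly as follows. The script string below is a shortened, hypothetical stand-in for the real page content, and the variable names payload, author and pretty are illustrative only.

```python
import json

# Hypothetical, shortened stand-in for the <script> text from the question.
script = 'window._SSR_HYDRATED_DATA={"AuthorDetailInfo":{"user_id":"58484635562","name":"a=b"}}'

# Split on the first "=" only, so "=" characters inside the data are left alone.
_, payload = script.split("=", maxsplit=1)

# Parse the right-hand side; json.loads raises json.JSONDecodeError
# if the text is not valid JSON.
data = json.loads(payload)

author = data["AuthorDetailInfo"]
print(author["user_id"])

# Serializing the parsed dict (rather than the raw string) produces JSON
# without backslash-escaped quotes.
pretty = json.dumps(author, ensure_ascii=False, indent=2)
print(pretty)
```

The backslashes in the question's data.json come from passing a str to json.dump(), which writes it as a single escaped JSON string literal. Dumping the parsed dict instead avoids that.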
Note: maxsplit=1 is only there in case the string contains other = characters. Also, this only works if the JavaScript object in the assignment is valid JSON.
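The valid-JSON caveat matters here: the output shown in the question contains a bare undefined (under Projection), which json.loads() rejects. Below is a minimal sketch of one possible workaround, assuming undefined only ever appears as a value directly after a colon. The helper name parse_js_assignment and the sample string are hypothetical, and the regular expression could also touch a string value that happens to contain ":undefined", so treat it as a heuristic rather than a general JavaScript parser.

```python
import json
import re


def parse_js_assignment(script_text):
    """Parse the object literal on the right-hand side of 'name={...}'."""
    # Keep everything after the first "=" and drop a trailing semicolon, if any.
    _, payload = script_text.split("=", maxsplit=1)
    payload = payload.strip().rstrip(";")
    # Assumption: a bare `undefined` only occurs as a value right after a colon.
    # JSON has no `undefined`, so map it to null before parsing.
    payload = re.sub(r":\s*undefined\b", ":null", payload)
    return json.loads(payload)


# Hypothetical, shortened stand-in for the real script text.
sample = (
    'window._SSR_HYDRATED_DATA={"Projection":{"item_id":"","key":undefined},'
    '"AuthorDetailInfo":{"user_id":"58484635562"}};'
)
data = parse_js_assignment(sample)
print(data["AuthorDetailInfo"])
```

If the page embeds other non-JSON constructs (functions, single-quoted strings, trailing commas), this substitution will not be enough and a JavaScript-aware parser would be the safer choice.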