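The web API "vip.stock.finance.sina.com.cn/q/go.php/vFinanceAnalyze/kind/profit/index.phtml" serves queries through the GET method, paginated at 40 rows per page. I wrote a function that calls the web API and prints the number of table rows in the page: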
import urllib.request
import lxml.html

def get_rows(page):
    url = "http://vip.stock.finance.sina.com.cn/q/go.php/vFinanceAnalyze/kind/profit/"\
          "index.phtml?s_i=&s_a=&s_c=&reportdate=2021&quarter=4&p={}".format(page)
    table_xpath = '//*[@id="dataTable"]'
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=headers)
    # Raw response bytes, not yet decoded
    data_string = urllib.request.urlopen(req).read()
    root = lxml.html.fromstring(data_string)
    dtable = root.xpath(table_xpath)[0]
    # All <tr> elements in the table: 1 header row plus the data rows
    rows = dtable.xpath('.//tr')
    print(len(rows))
Now call it:
get_rows(page=1)
41
get_rows(page=2)
41
get_rows(page=3)
26
get_rows(page=4)
41
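Why does my function get only part of the rows on page 3 (26), when the webpage contains 40 rows of data (41 = 1 header row + 40 data rows)? I found the same problem on many pages: the webpage contains 40 rows of data, but the number printed by get_rows() is less than 40. Please try it with my function: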
[get_rows(page) for page in [3,38,73,81,118,123]]
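2 Answers
von4xj4u1#
The encoding declared in the target page is gb2312, but using it raises an invalid-encoding error. After many attempts, setting the encoding to gbk finally worked!
ef1yzkbh2#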
The problem appears to be that the HTML meta tag corresponding to Content-Type identifies the character set as GB 2312, whereas the Content-Type header returned as part of the response identifies it as GBK.
As GBK is a superset of GB 2312, much of the content in the pages is encoded identically and can be decoded using either character set. For the third page, however, the name of the stock corresponding to code 688279 (峰岹科技) cannot be encoded using GB 2312, so attempting to decode it as GB 2312 fails. The exact symptom of this failure is odd: parsing halts at this point (hence the reduced number of matched elements), but the document returned (root) can still be worked with. This is very likely the same for the other pages you've discovered with the same problem.
More specifically, your code calls urllib.request.urlopen(req).read() and from then on operates only on the resulting byte sequence. So when that byte sequence is passed to lxml for parsing, lxml has only the meta tag to consult to determine the encoding.
The best approach seems to be the one listed here, which uses the character set declared in the Content-Type header to explicitly decode the bytes that were read.
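The original snippet for this step did not survive, so here is a minimal sketch of the idea rather than the answer's exact code. It reads the charset from the Content-Type response header with get_content_charset(), decodes the bytes itself, and only then hands the text to lxml. The helper name decode_body, the constants URL_TEMPLATE and HEADERS, and the "gbk" fallback are assumptions made for illustration.

```python
import urllib.request

# Same URL and User-Agent as in the question, pulled out as constants (names are illustrative).
URL_TEMPLATE = (
    "http://vip.stock.finance.sina.com.cn/q/go.php/vFinanceAnalyze/kind/profit/"
    "index.phtml?s_i=&s_a=&s_c=&reportdate=2021&quarter=4&p={}"
)
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0"
}


def decode_body(raw, header_charset, fallback="gbk"):
    """Decode response bytes using the charset named in the Content-Type header.

    `fallback` is an assumption for this site; it is used only when the
    header carries no charset parameter.
    """
    return raw.decode(header_charset or fallback)


def get_rows(page):
    import lxml.html  # third-party; imported here so decode_body needs only the stdlib

    req = urllib.request.Request(url=URL_TEMPLATE.format(page), headers=HEADERS)
    with urllib.request.urlopen(req) as response:
        # Charset parameter of the Content-Type response header (None if absent).
        header_charset = response.headers.get_content_charset()
        raw = response.read()
    # Pass already-decoded text so lxml does not have to trust the meta tag.
    root = lxml.html.fromstring(decode_body(raw, header_charset))
    dtable = root.xpath('//*[@id="dataTable"]')[0]
    rows = dtable.xpath(".//tr")
    print(len(rows))
    return rows
```

With this change, get_rows(3) should see the whole table, provided the server really does send charset=GBK in the header as described above. If it ever omits the charset, the fallback applies.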