python-3.x Why do I get only some of the rows when the webpage definitely contains 40 rows of data?

4urapxun  posted on 2023-03-09 in Python
Follow (0) | Answers (2) | Views (110)

The web API "vip.stock.finance.sina.com.cn/q/go.php/vFinanceAnalyze/kind/profit/index.phtml" provides a GET-based query, paginated at 40 rows per page. I wrote a function that calls the API and prints the number of table rows on a page:

def get_rows(page):
    import urllib.request
    import lxml.html
    url = "http://vip.stock.finance.sina.com.cn/q/go.php/vFinanceAnalyze/kind/profit/"\
          "index.phtml?s_i=&s_a=&s_c=&reportdate=2021&quarter=4&p={}".format(page)
    table_xpath = '//*[@id="dataTable"]'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=headers)
    data_string = urllib.request.urlopen(req).read()  # raw bytes, no explicit decode
    root = lxml.html.fromstring(data_string)
    dtable = root.xpath(table_xpath)[0]               # the page's data table
    rows = dtable.xpath('.//tr')
    print(len(rows))

Now call it:

get_rows(page=1)
41
get_rows(page=2)
41
get_rows(page=3)
26
get_rows(page=4)
41

Why does my function get only some of the rows (26) on page 3, when the webpage contains 40 rows (41 = 1 header + 40 data rows)? I found that many pages hit the same problem: the webpage contains 40 rows of data, but get_rows() prints a number smaller than 40. Please try my function on these pages:

[get_rows(page) for page in [3,38,73,81,118,123]]

von4xj4u1#

The target page's declared encoding is gb2312, but if you use it you get an invalid-encoding error. After trying many times, setting gbk finally worked:

def get_rows(page):
    import urllib.request
    import lxml.html
    url = "http://vip.stock.finance.sina.com.cn/q/go.php/vFinanceAnalyze/kind/profit/"\
          "index.phtml?s_i=&s_a=&s_c=&reportdate=2021&quarter=4&p={}".format(page)
    table_xpath = '//*[@id="dataTable"]'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=headers)
    data_string = urllib.request.urlopen(req).read().decode('gbk')  # decode explicitly before parsing
    root = lxml.html.fromstring(data_string)
    dtable = root.xpath(table_xpath)[0]
    rows = dtable.xpath('.//tr')
    print(len(rows))
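The difference between the two codecs can be reproduced offline. A minimal sketch, using 峰岹科技 (a stock name from page 3 that GB2312 cannot represent, as the other answer explains) as sample text:

```python
# 峰岹科技 contains 岹, which exists in GBK but not in GB2312,
# so bytes holding it decode with 'gbk' but fail with 'gb2312'.
raw = "峰岹科技".encode("gbk")

print(raw.decode("gbk"))          # 峰岹科技
try:
    raw.decode("gb2312")
except UnicodeDecodeError as err:
    print("gb2312 failed:", err.reason)
```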

ef1yzkbh2#

The problem seems to be that the HTML meta tag corresponding to Content-Type identifies the character set as GB2312, like this:

<meta http-equiv="Content-type" content="text/html; charset=GB2312" />

while the Content-Type header returned as part of the response identifies the character set as GBK, like this:

Content-Type: text/html; charset=gbk

As GBK is a superset of GB 2312, much of the content in the pages will be encoded identically, and so can be decoded using either character set. For the third page, however, the name of the stock corresponding to code 688279 (峰岹科技) cannot be encoded using GB 2312, and so attempting to decode it using GB 2312 will fail. The exact symptom of this failure is odd in that parsing will halt at this point (hence the small number of matched elements), but the document returned (root) can still be worked with. This is very likely the same for the other pages you've discovered with the same problem.
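This is easy to verify in the encode direction; a quick sketch:

```python
name = "峰岹科技"  # stock 688279 from page 3

print(name.encode("gbk"))  # GBK has a code point for every character here
try:
    name.encode("gb2312")
except UnicodeEncodeError as err:
    # err.start points at the offending character
    print("not encodable in GB 2312:", name[err.start])
```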
More specifically, the code calls:

urllib.request.urlopen(req).read()

which returns raw bytes, and everything afterwards operates on that byte sequence. So when it is passed to lxml for parsing:

lxml.html.fromstring(data_string)

it has only the meta tag to consult when determining the encoding.
The best approach seems to be the one described here: explicitly decode the bytes that were read using the character set declared in the Content-Type header, so that

data_string=urllib.request.urlopen(req).read()

becomes:

resource=urllib.request.urlopen(req)
data_string=resource.read().decode(resource.headers.get_content_charset())
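One caveat: get_content_charset() returns None when the response's Content-Type carries no charset parameter, so a fallback (the default name below is an assumption) is worth adding. Since HTTPMessage derives from email.message.Message, the lookup can be sketched offline:

```python
from email.message import Message

# Stand-in for resource.headers (http.client.HTTPMessage subclasses Message).
headers = Message()
headers["Content-Type"] = "text/html; charset=gbk"
print(headers.get_content_charset())                 # gbk

headers.replace_header("Content-Type", "text/html")  # no charset parameter
charset = headers.get_content_charset() or "utf-8"   # assumed fallback default
print(charset)                                       # utf-8
```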
