scrapy: How to scrape Japanese content using Python? [duplicate]

wvyml7n5 posted on 2023-11-19 in Python

This question already has answers here:

Python correct encoding of Website (Beautiful Soup) (4 answers)
Closed 8 days ago.
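I have code that uses Python requests, and I have also tried Python Scrapy.
It returns the correct HTML, but the content inside the HTML tags comes out as strange characters such as Á¶¼±°úÇбâ¼úÃÑ·Ã¸Í Áß¾ÓÀ§¿øȸ Á¶¼±ÀÚµ¿È­ÇÐȸ¿Í Á¶¼±ÀÚ¿¬¿¡³ and so on.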

import requests
from scrapy.http import HtmlResponse

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'If-Modified-Since': 'Mon, 22 May 2017 16:51:07 GMT',
    'If-None-Match': '"269-5501fad6c02b2-gzip"',
    'Referer': 'http://kcna.co.jp/',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
}

response = requests.get('http://kcna.co.jp/item2/2001/200107/news07/01.htm#10', headers=headers, verify=False)

resp = HtmlResponse(url='',body=response.text, encoding='utf8')
print(resp.css('p::text').get().encode('utf8').decode('utf8'))


mepcadol #1

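The content is Korean, not Japanese. The site is operated by the Tokyo branch of a Korean news service, which I believe is the North Korean government's news agency.
The page itself does not declare an encoding, so the HTML 4 default, ISO-8859-1, is assumed. The content is actually encoded in EUC-KR, a legacy Korean encoding.
Exploring the response: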
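To make the mismatch concrete, here is a minimal offline sketch of what happens when EUC-KR bytes are decoded as ISO-8859-1, and why the damage can be undone. The sample string is an assumption chosen for illustration; it is not fetched from the site.

```python
# Hypothetical sample text; any Korean string works for this demonstration.
original = "조선"

# The bytes as a legacy Korean page would send them.
raw = original.encode("euc-kr")

# What a client sees when it assumes ISO-8859-1:
# every single byte is turned into one Latin-1 character.
garbled = raw.decode("iso-8859-1")

# ISO-8859-1 maps each byte to exactly one code point, so nothing is lost:
# encoding back to ISO-8859-1 recovers the raw bytes, which can then be
# decoded with the correct codec.
repaired = garbled.encode("iso-8859-1").decode("euc-kr")

print(repr(garbled))
print(repaired)
```

Because the ISO-8859-1 step is lossless, text that was already decoded the wrong way can still be repaired, as long as nothing replaced or dropped characters in between.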

import requests
from encodings import euc_kr

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'If-Modified-Since': 'Mon, 22 May 2017 16:51:07 GMT',
    'If-None-Match': '"269-5501fad6c02b2-gzip"',
    'Referer': 'http://kcna.co.jp/',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
}

response = requests.get('http://kcna.co.jp/item2/2001/200107/news07/01.htm#10', headers=headers, verify=False)

# Identify the encoding that requests assumed
response.encoding
# 'ISO-8859-1'

# Reset the encoding to the one the content actually uses
response.encoding = 'euc-kr'

# response.text is now decoded as EUC-KR; the encode/decode round trip
# below is redundant and is kept only to show the text survives it
rt = response.text.encode('euc-kr').decode('euc-kr')

# Alternatively, using the encodings module:
# convert to bytes with response.text.encode('euc-kr'),
# then decode those bytes back into a Python str
rt = euc_kr.codec.decode(response.text.encode('euc-kr'))[0]
rt[0:300]
# '<html>\r\n<head><title>past news</title></head>\r\n<body bgcolor="#eeeeee">\r\n
# <small>\r\n\r\n<a name="1"><h4>김영남위원장\u3000수단대통령에게\u3000축전</a></h4>\r\n\u3000
# (평양\u30007월\u30001일발\u3000조선중앙통신)조선민주주의인민공화국\u3000최고인민회의\u3000상임위원회\u3000김영남위원장이\u3000수단혁명절\u300012돐에\u3000즈음하여\u3000수단공화국\u3000대통령\u3000오마르\u3000하싼\u3000아흐마드\u3000알\u3000바쉬르에게\u3000축전을\u3000보내였다.<br>\r\n\u30006월\u300029일부로\u3000된\u3000축전에는\u3000다음과\u3000같이\u3000지적되여\u3000있다.<br>\r\n\u3000나는\u3000수단혁명절\u300012돐에\u3000즈음하여\u3000당'

As you can see, the data is encoded in EUC-KR.
To use this with your code:
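If the encoding is not known in advance, one crude way to confirm a guess is to try strict decoding with a short list of candidates and keep the first that succeeds. The sketch below is an illustration only: the helper name `try_decode` and the candidate list are assumptions, and strict decoding is a heuristic rather than real charset detection.

```python
from typing import Optional, Sequence, Tuple


def try_decode(
    raw: bytes, candidates: Sequence[str] = ("utf-8", "euc-kr")
) -> Tuple[Optional[str], Optional[str]]:
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: the order of candidates matters, because some
    legacy encodings accept almost any byte sequence.
    """
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    return None, None


# Stand-in for response.content: a headline from the page, encoded as EUC-KR.
sample = "김영남위원장".encode("euc-kr")
text, used = try_decode(sample)
print(used, text)
```

With a real response, the same function could be given `response.content` (the undecoded bytes) instead of `sample`.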

import requests
from scrapy.http import HtmlResponse

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'If-Modified-Since': 'Mon, 22 May 2017 16:51:07 GMT',
    'If-None-Match': '"269-5501fad6c02b2-gzip"',
    'Referer': 'http://kcna.co.jp/',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
}

response = requests.get('http://kcna.co.jp/item2/2001/200107/news07/01.htm#10', headers=headers, verify=False)

response.encoding = 'euc-kr'
resp = HtmlResponse(url='',body=response.text, encoding='euc-kr')


At this stage, resp.text should be a correctly decoded Unicode string.
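A quick sanity check on the decoded text is to measure how much of it falls in the Hangul syllables block (U+AC00 to U+D7A3). This is a rough sketch; `hangul_ratio` is a hypothetical helper, and the sample strings are assumptions rather than data taken from the response.

```python
def hangul_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Hangul syllables."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hangul = sum(1 for ch in chars if "\uac00" <= ch <= "\ud7a3")
    return hangul / len(chars)


# Correctly decoded text versus the same bytes read as ISO-8859-1.
good = "조선중앙통신"
bad = good.encode("euc-kr").decode("iso-8859-1")

print(hangul_ratio(good))
print(hangul_ratio(bad))
```

On a real page the ratio will be lower than for pure Korean text because of markup and punctuation, so the value is only useful for comparing a correct decode against a wrong one.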
