When I google my problem or search Stack Overflow, half a dozen cases like mine seem to have been solved, yet I never really understand the solutions.
So: I want to scrape a .csv file from a server with Jupyter Lab, launched via Anaconda. The file definitely exists, and I can download it with a few clicks.
Now I try to execute the following:
import pandas as pd
pd.read_csv("link")
It produces this error:
---------------------------------------------------------------------------
HTTPError Traceback (most recent call last)
<ipython-input-37-aae59f2238c3> in <module>
----> 1 pd.read_csv("https://first-python-notebook.readthedocs.io/_static/committees.csv")
/Applications/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
674 )
675
--> 676 return _read(filepath_or_buffer, kwds)
677
678 parser_f.__name__ = name
/Applications/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
429 # See https://github.com/python/mypy/issues/1297
430 fp_or_buf, _, compression, should_close = get_filepath_or_buffer(
--> 431 filepath_or_buffer, encoding, compression
432 )
433 kwds["compression"] = compression
/Applications/anaconda3/lib/python3.7/site-packages/pandas/io/common.py in get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode)
170
171 if isinstance(filepath_or_buffer, str) and is_url(filepath_or_buffer):
--> 172 req = urlopen(filepath_or_buffer)
173 content_encoding = req.headers.get("Content-Encoding", None)
174 if content_encoding == "gzip":
/Applications/anaconda3/lib/python3.7/site-packages/pandas/io/common.py in urlopen(*args, **kwargs)
139 import urllib.request
140
--> 141 return urllib.request.urlopen(*args, **kwargs)
142
143
/Applications/anaconda3/lib/python3.7/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
220 else:
221 opener = _opener
--> 222 return opener.open(url, data, timeout)
223
224 def install_opener(opener):
/Applications/anaconda3/lib/python3.7/urllib/request.py in open(self, fullurl, data, timeout)
529 for processor in self.process_response.get(protocol, []):
530 meth = getattr(processor, meth_name)
--> 531 response = meth(req, response)
532
533 return response
/Applications/anaconda3/lib/python3.7/urllib/request.py in http_response(self, request, response)
639 if not (200 <= code < 300):
640 response = self.parent.error(
--> 641 'http', request, response, code, msg, hdrs)
642
643 return response
/Applications/anaconda3/lib/python3.7/urllib/request.py in error(self, proto, *args)
567 if http_err:
568 args = (dict, 'default', 'http_error_default') + orig_args
--> 569 return self._call_chain(*args)
570
571 # XXX probably also want an abstract factory that knows when it makes
/Applications/anaconda3/lib/python3.7/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
501 for handler in handlers:
502 func = getattr(handler, meth_name)
--> 503 result = func(*args)
504 if result is not None:
505 return result
/Applications/anaconda3/lib/python3.7/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
647 class HTTPDefaultErrorHandler(BaseHandler):
648 def http_error_default(self, req, fp, code, msg, hdrs):
--> 649 raise HTTPError(req.full_url, code, msg, hdrs, fp)
650
651 class HTTPRedirectHandler(BaseHandler):
HTTPError: HTTP Error 403: Forbidden
However, when I try this instead, it works fine:
import requests

f = requests.get(link)
print(f.text)
From reading other resources, it seems the problem may be that my user agent is not defined correctly, which makes the target server reject my request. The solution would be to add a correct (or fake) header that includes my user agent: https://www.whatismybrowser.com/detect/what-is-my-user-agent
So I tried this:
import http.cookiejar
from urllib.request import urlopen
site= "link"
hdr = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}
req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
content = page.read()
print(content)
But first of all, it returns:
NameError: name 'urllib2' is not defined
I could not find a working solution for that, and of course my main problem is still unsolved as well.
I really don't understand how my headers should be set. Do you have to do this for every single file you retrieve from the web? Isn't there a more general solution? Or is this even my actual problem?
2 Answers

6kkfgxo01#

Since pandas 1.2, you can tune the reader that read_csv uses by passing options as dictionary keys to its storage_options parameter. The library then includes the User-Agent header in the request itself, so you don't have to set it externally before calling read_csv.
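For example, a minimal sketch of that approach (the URL is taken from the question's traceback; "Mozilla/5.0" is just a placeholder user-agent string, not a required value):

import pandas as pd

# pandas >= 1.2: for HTTP(S) URLs, the key-value pairs in storage_options
# are forwarded as headers of the request, so the server sees a
# browser-like User-Agent instead of the default Python-urllib one.
url = "https://first-python-notebook.readthedocs.io/_static/committees.csv"
df = pd.read_csv(url, storage_options={"User-Agent": "Mozilla/5.0"})
print(df.head())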
ut6juiuv2#

This script should work with both Python 2 and Python 3 (urllib2 changed in Python 3):
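A sketch of such a version-agnostic script, assuming it reuses the site and headers from the question (the try/except import is the usual compatibility idiom):

try:
    # Python 3: urllib2 was split into urllib.request
    from urllib.request import Request, urlopen
except ImportError:
    # Python 2
    from urllib2 import Request, urlopen

site = "https://first-python-notebook.readthedocs.io/_static/committees.csv"
hdr = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       }

# Send the request with browser-like headers so the server does not
# answer with 403 Forbidden.
req = Request(site, headers=hdr)
page = urlopen(req)
content = page.read()
print(content)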