python-3.x: Unable to produce results from a webpage using the requests module

Posted by llmtgqce on 2023-02-01 in Python


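After visiting this website, when I fill in the input box (City or zip) with Miami, FL and hit the search button, I can see the related results displayed on that site.
I want to mimic the same thing using the requests module. I tried to follow the steps shown in dev tools, but for some reason the script below produces the following output: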

You are not authorized to access this request.

I've tried:

import json
import requests
from pprint import pprint
from bs4 import BeautifulSoup

URL = "https://www.realtor.com/realestateagents/"
link = 'https://www.realtor.com/realestateagents/api/v3/search'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    'Accept': 'application/json, text/plain, */*',
    'referer': 'https://www.realtor.com/realestateagents/',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,bn;q=0.8',
    'X-Requested-With': 'XMLHttpRequest',
    'x-newrelic-id': 'VwEPVF5XGwQHXFNTBAcAUQ==',
    'authorization': 'Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE2NjQ1MjU0NDQsInN1YiI6ImZpbmRfYV9yZWFsdG9yIiwiaWF0IjoxNjY0NTI0Nzk2fQ.Q2jryTAD5vgsJ37e1SylBnkaeK7Cln930Q8KL4ANqsM'
}

params = {
    'nar_only': '1',
    'offset': '',
    'limit': '20',
    'marketing_area_cities': 'FL_Miami',
    'postal_code': '',
    'is_postal_search': 'true',
    'name': '',
    'types': 'agent',
    'sort': 'recent_activity_high',
    'far_opt_out': 'false',
    'client_id': 'FAR2.0',
    'recommendations_count_min': '',
    'agent_rating_min': '',
    'languages': '',
    'agent_type': '',
    'price_min': '',
    'price_max': '',
    'designations': '',
    'photo': 'true',
    'seoUserType': "{'isBot':'false','deviceType':'desktop'}",
    'is_county_search': 'false',
    'county': ''
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link,params=params)
    print(res.status_code)
    print(res.json())

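EDIT:
For those who think using res.json() is pointless, see the image, taken straight from dev tools: if I can set the parameters and headers correctly when submitting the request, res.json() works.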

python-3.x

Source: https://stackoverflow.com/questions/73906009/unable-to-produce-results-from-a-webpage-using-requests-module

3 Answers


63lcw9qa1#

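The problem is that the authorization token becomes invalid after a few seconds, so you need to refresh (regenerate) it on every request.
First, you need to get the JWT secret that is used to create the JWT token (a regex extracts it from the HTML source):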

# Which is hardcoded in the HTML
SECRET = findall(r'"JWT_SECRET":"(.*?)"', requests.get('https://www.realtor.com/realestateagents/').text)[0]
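
Note that findall(...)[0] raises an IndexError if the page stops embedding the secret or changes its markup. A minimal sketch of a more defensive extraction, assuming the same "JWT_SECRET":"..." pattern; the sample HTML and the secret value below are made up for illustration and are not taken from the real page:

```python
import re

# Hypothetical page fragment; the real markup and secret value are not shown in this thread.
SAMPLE_HTML = '<script>window.__CONFIG__ = {"JWT_SECRET":"not-the-real-secret","OTHER":"x"}</script>'

def extract_jwt_secret(html):
    """Return the value stored under "JWT_SECRET" in the page source, or None if it is absent."""
    match = re.search(r'"JWT_SECRET":"(.*?)"', html)
    return match.group(1) if match else None

secret = extract_jwt_secret(SAMPLE_HTML)
missing = extract_jwt_secret("<html></html>")
print(secret, missing)
```

Returning None lets the caller decide whether to retry or stop, instead of failing with an unrelated-looking IndexError.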

Then use the secret to generate a new authorization token:

# Create JWT
jwt_payload = {
  "exp": int(time() + 9999), # expiry date
  "sub": "find_a_realtor",
  "iat": int(time()) # issued at
}

# Encode it with their secret
jwt = encode(jwt_payload, SECRET, algorithm="HS256")
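
To make clear what encode does here, the following is a rough standard-library sketch of the general HS256 construction: base64url-encode the header and payload, then sign them with HMAC-SHA256. The helper names b64url and encode_hs256 and the placeholder secret are my own. The output is not guaranteed to be byte-identical to what PyJWT produces, so treat it as an illustration rather than a replacement:

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    """Base64url-encode without the trailing '=' padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def encode_hs256(payload: dict, secret: str) -> str:
    """Build header.payload.signature, signed with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(
        secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return signing_input + "." + b64url(signature)

now = int(time.time())
# "placeholder-secret" stands in for the value scraped from the page.
token = encode_hs256({"exp": now + 9999, "sub": "find_a_realtor", "iat": now}, "placeholder-secret")
print(token)
```

Because the signature depends only on the payload and the secret, anyone who can read the secret from the HTML can mint a token the server accepts, which is why this approach works at all.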

Add it to the headers, then run the request just as you did before:

# Add the JWT to the headers
headers = {
    'authorization': 'Bearer ' + jwt,
}

# Attach headers to the request
response = requests.get(
    url='https://www.realtor.com/realestateagents/api/v3/search?nar_only=1&offset=&limit=20&marketing_area_cities=FL_Miami&postal_code=&is_postal_search=true&name=&types=agent&sort=recent_activity_high&far_opt_out=false&client_id=FAR2.0&recommendations_count_min=&agent_rating_min=&languages=&agent_type=&price_min=&price_max=&designations=&photo=true&seoUserType={%22isBot%22:false,%22deviceType%22:%22desktop%22}&is_county_search=false&county=',
    headers=headers
)
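
The long query string is easier to maintain as a dictionary. Here is a sketch using only the standard library to build an equivalent URL. It keeps just a subset of the parameters from the question, on the assumption (not verified against the API) that the empty ones can be omitted:

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://www.realtor.com/realestateagents/api/v3/search"

params = {
    "nar_only": "1",
    "limit": "20",
    "marketing_area_cities": "FL_Miami",
    "types": "agent",
    "sort": "recent_activity_high",
    "client_id": "FAR2.0",
    # Real JSON with a boolean false, as in the hand-written URL,
    # rather than the single-quoted pseudo-JSON string used in the question.
    "seoUserType": json.dumps({"isBot": False, "deviceType": "desktop"}, separators=(",", ":")),
}

# urlencode percent-escapes the braces, quotes, colons and commas.
url = BASE_URL + "?" + urlencode(params)
print(url)
```

With requests, the same dictionary can be passed as params=params to requests.get, which performs the encoding itself.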

Putting it all together...

import requests
from jwt import encode
from time import time
from re import findall

# First we need to get their JWT Secret... which is securely hardcoded in the HTML
SECRET = findall(r'"JWT_SECRET":"(.*?)"', requests.get('https://www.realtor.com/realestateagents/').text)[0]

# Create JWT
jwt_payload = {
  "exp": int(time() + 9999),
  "sub": "find_a_realtor",
  "iat": int(time())
}

# Encode it with their secret
jwt = encode(jwt_payload, SECRET, algorithm="HS256")

# Add the JWT to the headers
headers = {
    'authorization': 'Bearer ' + jwt,
}

# Attach headers to the request
response = requests.get(
    url='https://www.realtor.com/realestateagents/api/v3/search?nar_only=1&offset=&limit=20&marketing_area_cities=FL_Miami&postal_code=&is_postal_search=true&name=&types=agent&sort=recent_activity_high&far_opt_out=false&client_id=FAR2.0&recommendations_count_min=&agent_rating_min=&languages=&agent_type=&price_min=&price_max=&designations=&photo=true&seoUserType={%22isBot%22:false,%22deviceType%22:%22desktop%22}&is_county_search=false&county=',
    headers=headers
)

# Print the JSON output
print(response.json())
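
To confirm that expiry is the culprit, the payload of the token hardcoded in the question can be decoded without knowing the secret, since a JWT payload is only base64url-encoded, not encrypted. A small sketch (decode_payload is my own helper name):

```python
import base64
import json

# Token copied from the 'authorization' header in the question.
TOKEN = (
    "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9."
    "eyJleHAiOjE2NjQ1MjU0NDQsInN1YiI6ImZpbmRfYV9yZWFsdG9yIiwiaWF0IjoxNjY0NTI0Nzk2fQ."
    "Q2jryTAD5vgsJ37e1SylBnkaeK7Cln930Q8KL4ANqsM"
)

def decode_payload(token):
    """Decode the middle (payload) segment of a JWT without verifying the signature."""
    payload_segment = token.split(".")[1]
    # JWTs drop the '=' padding, so restore it before decoding.
    padded = payload_segment + "=" * (-len(payload_segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))

claims = decode_payload(TOKEN)
lifetime_seconds = claims["exp"] - claims["iat"]
print(claims, lifetime_seconds)
```

The claims carry the same "sub" value used above, and exp minus iat comes to 648 seconds, so any script that reuses the copied token after that window is sending an expired credential.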

Posted 2023-02-01

kq0g1dla2#

Going by your question as asked, you want to get information from that website using requests. Here is one way to do it with Python's Requests:

import requests
from tqdm.notebook import tqdm
from bs4 import BeautifulSoup as bs

headers = {
    'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
    }
s = requests.Session()
s.headers.update(headers)
for x in tqdm(range(1, 5)):
    url = f'https://www.realtor.com/realestateagents/miami_fl/pg-{x}'    
    r = s.get(url)
    soup = bs(r.text, 'html.parser')
    agent_cards = soup.select('div[data-testid="component-agentCard"]')
    for a in agent_cards:
        agent_name = a.select_one('div.agent-name').get_text()
        agent_group = a.select_one('div.agent-group').get_text()
        agent_phone = a.select_one('div.agent-phone').get_text()
        print(agent_name, '|', agent_group, '|', agent_phone)

Result:

100%
4/4 [00:05<00:00, 1.36s/it]
Edmy Gomez | Coldwell Banker Realty | (954) 434-0501
Nidia L Cortes PA | Beachfront Realty Inc | (786) 287-9268
Rodney Ward | Coldwell Banker Realty | (305) 253-2800
Onelia Hurtado | Elevate Real Estate Brokers | (954) 559-8252
Gustavo Cabrera | Belhouse Real Estate, Llc | (305) 794-8533
Hermes Pallaviccini |  Global Luxury Realty LLC | (305) 772-7232
Maria Carrillo | Keyes - Brickell Office | (305) 984-3180
Nancy Batchelor, P.A. | COMPASS | (305) 903-2850
Winnie Uricola | Keyes - Hollywood Office | (305) 915-7721
monica Deluca | Re/Max Powerpro Realty | (954) 552-1224
Maria Cristina Korman | Keller Williams Realty Partners SW | (954) 588-2850
Ines Hegedus-Garcia | Avanti Way | (305) 758-2323
Jean-Paul Figallo | Concierge Real Estate | (754) 281-9912
[...]

You may need to extend the range to cover the total number of pages.
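
If the total number of pages is not known in advance, one option is to keep requesting pages until one comes back without any agent cards. The sketch below assumes that an out-of-range page yields no cards, which I have not verified for this site. collect_all_pages and the fake data are illustrative names, not part of the code above:

```python
def collect_all_pages(fetch_page, max_pages=1000):
    """Call fetch_page(1), fetch_page(2), ... and stop at the first page with no items."""
    results = []
    for page_number in range(1, max_pages + 1):
        items = fetch_page(page_number)
        if not items:
            break
        results.extend(items)
    return results

# Stand-in for the real scraper: two pages of data, then nothing.
FAKE_SITE = {1: ["agent A", "agent B"], 2: ["agent C"]}
agents = collect_all_pages(lambda n: FAKE_SITE.get(n, []))
print(agents)
```

In the real script, fetch_page would wrap the s.get(url) call and the soup.select(...) lookup for one page and return the list of parsed cards. The max_pages cap guards against looping forever if the site never returns an empty page.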

Posted 2023-02-01

4smxwvx53#

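The error indicates that you are not authorized to access the API; you may want to check whether your token has expired.
In general, requests.get is not the best way to mimic user actions such as filling in a form and clicking a search button on a website.
Try a browser automation tool such as selenium [1].
But if you already know the structure of the website, as in your example, you may not need to fill in the form at all: you can send a GET request straight to that page and then parse the content as shown in the other answer.
For example, your example site has a page for Miami, Florida (https://www.realtor.com/realestateagents/miami_fl). You can fetch the content of that page directly with requests.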

Option 1: using browser automation

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.realtor.com/realestateagents/')
loc = driver.find_element(By.ID, 'srchHomeLocation')
loc.send_keys("Miami, FL")
search_button = driver.find_element(By.ID, 'far_search_button')
search_button.click()
# page_source is already a str, so it has no .text attribute
soup = bs(driver.page_source, 'html.parser')
# ... continue parsing the content with soup

Option 2: using requests

import requests
from bs4 import BeautifulSoup as bs

r = requests.get("https://www.realtor.com/realestateagents/miami_fl")
soup = bs(r.text, 'html.parser')
# ... continue parsing the content with soup

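In both cases you need to handle page navigation, either by clicking Next in selenium or by sending a GET request for each of the 493 pages.
Finally, res.json() does not convert HTML into JSON; it returns a JSON object only when the response body is itself written in JSON.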
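
A small guard makes that last point concrete: only attempt to parse when the response claims to be JSON, and fall back gracefully when the body turns out to be something else (an HTML error page, for instance). This is a standard-library sketch; parse_json_body is a hypothetical helper name:

```python
import json

def parse_json_body(content_type, body_text):
    """Return the parsed JSON body, or None when the body is not JSON."""
    if "json" not in content_type.lower():
        return None
    try:
        return json.loads(body_text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None

ok = parse_json_body("application/json; charset=utf-8", '{"agents": []}')
html = parse_json_body("text/html", "<html>You are not authorized</html>")
broken = parse_json_body("application/json", "<html>oops</html>")
print(ok, html, broken)
```

With requests, this could be called as parse_json_body(res.headers.get("Content-Type", ""), res.text).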

[1] https://www.selenium.dev/documentation/webdriver/

Posted 2023-02-01
