python-3.x: Unable to produce results from a webpage using the requests module

Posted by llmtgqce on 2023-02-01 in Python

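After visiting this website, when I fill in the input box (City or zip) with Miami, FL and click the search button, I can see the relevant results displayed on that site.
I'd like to mimic the same operation using the requests module. I tried to follow the steps shown in dev tools, but for some reason the script below produces the following output: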

You are not authorized to access this request.

I've tried with:

import json
import requests
from pprint import pprint
from bs4 import BeautifulSoup

URL = "https://www.realtor.com/realestateagents/"
link = 'https://www.realtor.com/realestateagents/api/v3/search'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    'Accept': 'application/json, text/plain, */*',
    'referer': 'https://www.realtor.com/realestateagents/',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,bn;q=0.8',
    'X-Requested-With': 'XMLHttpRequest',
    'x-newrelic-id': 'VwEPVF5XGwQHXFNTBAcAUQ==',
    'authorization': 'Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE2NjQ1MjU0NDQsInN1YiI6ImZpbmRfYV9yZWFsdG9yIiwiaWF0IjoxNjY0NTI0Nzk2fQ.Q2jryTAD5vgsJ37e1SylBnkaeK7Cln930Q8KL4ANqsM'
}

params = {
    'nar_only': '1',
    'offset': '',
    'limit': '20',
    'marketing_area_cities': 'FL_Miami',
    'postal_code': '',
    'is_postal_search': 'true',
    'name': '',
    'types': 'agent',
    'sort': 'recent_activity_high',
    'far_opt_out': 'false',
    'client_id': 'FAR2.0',
    'recommendations_count_min': '',
    'agent_rating_min': '',
    'languages': '',
    'agent_type': '',
    'price_min': '',
    'price_max': '',
    'designations': '',
    'photo': 'true',
    'seoUserType': "{'isBot':'false','deviceType':'desktop'}",
    'is_county_search': 'false',
    'county': ''
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link,params=params)
    print(res.status_code)
    print(res.json())

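EDIT:
For those who think using res.json() is pointless, please see this image, which is taken straight from dev tools. If I can set the parameters and headers correctly when submitting the request, I can use res.json() successfully.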

63lcw9qa  #1

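The issue is that the authorization token becomes invalid after a few seconds, so you need to refresh (regenerate) it on every request.
First, you need to get the JWT secret that is used to create the JWT token (a regex extracts it from the HTML source):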

# Which is hardcoded in the HTML
SECRET = findall(r'"JWT_SECRET":"(.*?)"', requests.get('https://www.realtor.com/realestateagents/').text)[0]
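
Because `findall(...)[0]` raises a bare IndexError if the page layout ever changes, a slightly more defensive variant may be easier to debug. This is only a sketch: the helper name `extract_jwt_secret` and the sample HTML fragment are made up for illustration, and it assumes the secret still appears as `"JWT_SECRET":"..."` in the page source, as in the snippet above.

```python
import re

# Same pattern as above: the value of the "JWT_SECRET" key embedded in the page source.
JWT_SECRET_PATTERN = re.compile(r'"JWT_SECRET":"(.*?)"')


def extract_jwt_secret(html: str) -> str:
    """Return the embedded secret, or raise a descriptive error if it is missing."""
    match = JWT_SECRET_PATTERN.search(html)
    if match is None:
        raise ValueError("JWT_SECRET not found; the page layout may have changed")
    return match.group(1)


# Hypothetical page fragment, used only to exercise the helper offline.
sample_html = '<script>window.__CONFIG__ = {"JWT_SECRET":"not-a-real-secret","env":"prod"}</script>'
print(extract_jwt_secret(sample_html))
```

With the live site, the argument would be the `.text` of the `requests.get(...)` response used above.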

Then use the secret to generate a new authorization token:

# Create JWT
jwt_payload = {
  "exp": int(time() + 9999), # expiry date
  "sub": "find_a_realtor",
  "iat": int(time()) # issued at
}

# Encode it with their secret
jwt = encode(jwt_payload, SECRET, algorithm="HS256")
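
To make it clearer what HS256 encoding involves, here is a standard-library sketch of the same idea. It is an illustration rather than a replacement for the `encode` call above; the function names and the placeholder secret are assumptions of this example. With the claims copied from the question's token, the header and payload segments come out identical to that token; only the signature differs, because the real secret is not used here.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(payload: dict, secret: str) -> str:
    """Build header.payload.signature, signing with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    segments = [
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    ]
    signing_input = ".".join(segments).encode("ascii")
    signature = hmac.new(secret.encode("utf-8"), signing_input, hashlib.sha256).digest()
    return ".".join(segments + [b64url(signature)])


# Claims copied from the token in the question; the secret is a placeholder.
claims = {"exp": 1664525444, "sub": "find_a_realtor", "iat": 1664524796}
token = sign_hs256(claims, "placeholder-secret")
print(token)
```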

Add it to the headers, then run the request just as you did before:

# Add the JWT to the headers
headers = {
    'authorization': 'Bearer ' + jwt,
}

# Attach headers to the request
response = requests.get(
    url='https://www.realtor.com/realestateagents/api/v3/search?nar_only=1&offset=&limit=20&marketing_area_cities=FL_Miami&postal_code=&is_postal_search=true&name=&types=agent&sort=recent_activity_high&far_opt_out=false&client_id=FAR2.0&recommendations_count_min=&agent_rating_min=&languages=&agent_type=&price_min=&price_max=&designations=&photo=true&seoUserType=\\{%22isBot%22:false,%22deviceType%22:%22desktop%22\\}&is_county_search=false&county=',
    headers=headers
)
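
The long hand-written URL can be hard to maintain. As a sketch, the query string could instead be built from a dict, with the nested `seoUserType` value serialised as JSON rather than escaped by hand. The empty-valued parameters are omitted here for brevity; whether the API needs them is an untested assumption, and the constant name `API_URL` is introduced only for this example.

```python
import json
from urllib.parse import parse_qs, urlencode

API_URL = "https://www.realtor.com/realestateagents/api/v3/search"

params = {
    "nar_only": "1",
    "limit": "20",
    "marketing_area_cities": "FL_Miami",
    "is_postal_search": "true",
    "types": "agent",
    "sort": "recent_activity_high",
    "client_id": "FAR2.0",
    "photo": "true",
    # Serialise the nested value as real JSON instead of escaping braces by hand.
    "seoUserType": json.dumps({"isBot": False, "deviceType": "desktop"}, separators=(",", ":")),
}

# urlencode percent-encodes the braces and quotes for us.
query = urlencode(params)
print(f"{API_URL}?{query}")
```

With requests, the same dict can be passed as `params=params` to `requests.get`, which performs the encoding itself.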

Putting it all together...

import requests
from jwt import encode
from time import time
from re import findall

# First we need to get their JWT Secret... which is securely hardcoded in the HTML
SECRET = findall(r'"JWT_SECRET":"(.*?)"', requests.get('https://www.realtor.com/realestateagents/').text)[0]

# Create JWT
jwt_payload = {
  "exp": int(time() + 9999),
  "sub": "find_a_realtor",
  "iat": int(time())
}

# Encode it with their secret
jwt = encode(jwt_payload, SECRET, algorithm="HS256")

# Add the JWT to the headers
headers = {
    'authorization': 'Bearer ' + jwt,
}

# Attach headers to the request
response = requests.get(
    url='https://www.realtor.com/realestateagents/api/v3/search?nar_only=1&offset=&limit=20&marketing_area_cities=FL_Miami&postal_code=&is_postal_search=true&name=&types=agent&sort=recent_activity_high&far_opt_out=false&client_id=FAR2.0&recommendations_count_min=&agent_rating_min=&languages=&agent_type=&price_min=&price_max=&designations=&photo=true&seoUserType=\\{%22isBot%22:false,%22deviceType%22:%22desktop%22\\}&is_county_search=false&county=',
    headers=headers
)

# Print the JSON output
print(response.json())
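
To check whether a given token has expired, its payload can be decoded without knowing the secret. The sketch below does this for the token hard-coded in the question, using only the standard library; the helper name `decode_payload` is an assumption of this example, and the signature is deliberately not verified.

```python
import base64
import json

# Token copied verbatim from the question.
TOKEN = (
    "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9"
    ".eyJleHAiOjE2NjQ1MjU0NDQsInN1YiI6ImZpbmRfYV9yZWFsdG9yIiwiaWF0IjoxNjY0NTI0Nzk2fQ"
    ".Q2jryTAD5vgsJ37e1SylBnkaeK7Cln930Q8KL4ANqsM"
)


def decode_payload(token: str) -> dict:
    """Decode the middle JWT segment WITHOUT verifying the signature."""
    segment = token.split(".")[1]
    padded = segment + "=" * (-len(segment) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


payload = decode_payload(TOKEN)
print(payload)
print("lifetime in seconds:", payload["exp"] - payload["iat"])
```

For this token, `exp` minus `iat` is 648 seconds and `exp` falls on 30 September 2022 (UTC), so any later request that reuses it will be rejected.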

kq0g1dla  #2

Based on your question as asked, you want to get information from that website using requests. Here is one way to do it with Python's Requests:

import requests
from tqdm.notebook import tqdm
from bs4 import BeautifulSoup as bs

headers = {
    'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
    }
s = requests.Session()
s.headers.update(headers)
for x in tqdm(range(1, 5)):
    url = f'https://www.realtor.com/realestateagents/miami_fl/pg-{x}'    
    r = s.get(url)
    soup = bs(r.text, 'html.parser')
    agent_cards = soup.select('div[data-testid="component-agentCard"]')
    for a in agent_cards:
        agent_name = a.select_one('div.agent-name').get_text()
        agent_group = a.select_one('div.agent-group').get_text()
        agent_phone = a.select_one('div.agent-phone').get_text()
        print(agent_name, '|', agent_group, '|', agent_phone)

Result:

100%
4/4 [00:05<00:00, 1.36s/it]
Edmy Gomez | Coldwell Banker Realty | (954) 434-0501
Nidia L Cortes PA | Beachfront Realty Inc | (786) 287-9268
Rodney Ward | Coldwell Banker Realty | (305) 253-2800
Onelia Hurtado | Elevate Real Estate Brokers | (954) 559-8252
Gustavo Cabrera | Belhouse Real Estate, Llc | (305) 794-8533
Hermes Pallaviccini |  Global Luxury Realty LLC | (305) 772-7232
Maria Carrillo | Keyes - Brickell Office | (305) 984-3180
Nancy Batchelor, P.A. | COMPASS | (305) 903-2850
Winnie Uricola | Keyes - Hollywood Office | (305) 915-7721
monica Deluca | Re/Max Powerpro Realty | (954) 552-1224
Maria Cristina Korman | Keller Williams Realty Partners SW | (954) 588-2850
Ines Hegedus-Garcia | Avanti Way | (305) 758-2323
Jean-Paul Figallo | Concierge Real Estate | (754) 281-9912
[...]

You may want to extend the range to the total number of pages.
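
If the total page count is not known in advance, one option is to keep requesting pages until one comes back with no agent cards. The sketch below shows only that stopping logic, with a stand-in for the fetch-and-parse step so that it runs offline. The names `iter_agent_cards` and `fetch_cards` are hypothetical, and the assumption that pages past the end contain no cards has not been verified against the site.

```python
from typing import Callable, Iterator, List


def iter_agent_cards(
    fetch_cards: Callable[[int], List[str]], max_pages: int = 1000
) -> Iterator[str]:
    """Yield cards page by page, stopping at the first empty page."""
    for page in range(1, max_pages + 1):
        cards = fetch_cards(page)
        if not cards:  # an empty page is taken to mean we ran past the last one
            return
        yield from cards


# Stand-in for the real fetch-and-parse step, so the sketch runs offline.
fake_site = {1: ["agent A", "agent B"], 2: ["agent C"]}
collected = list(iter_agent_cards(lambda page: fake_site.get(page, [])))
print(collected)
```

In the script above, `fetch_cards` would wrap the `s.get(url)` and `soup.select(...)` calls for a given page number; `max_pages` is a safety cap in case the site never returns an empty page.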

4smxwvx5  #3

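The error indicates that you are not authorized to access the API; you may want to check whether your token has expired.
In general, using requests.get is not the best way to mimic user actions such as filling in a form and clicking a search button on a website.
Try a browser automation tool such as selenium [1].
However, if you already know the structure of the website, as in your example, you may not need to fill in the form at all: you can make a GET request directly to that page and then parse the content as shown in the other answer.
For example, your example website has a page for Miami, Florida (https://www.realtor.com/realestateagents/miami_fl). You can fetch the content of this page directly with requests.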

Option 1: using browser automation

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.realtor.com/realestateagents/')
loc = driver.find_element(By.ID, 'srchHomeLocation')
loc.send_keys("Miami, FL")
search_button = driver.find_element(By.ID, 'far_search_button')
search_button.click()
# page_source is already a str, so it is passed to the parser directly
soup = bs(driver.page_source, 'html.parser')
# ... continue parsing the content with soup

Option 2: using requests

import requests
from bs4 import BeautifulSoup as bs

r = requests.get("https://www.realtor.com/realestateagents/miami_fl")
soup = bs(r.text, 'html.parser')
# ... continue parsing the content with soup

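In both cases you need to handle page navigation, either by clicking Next in selenium or by making GET requests for all 493 pages.
Finally, res.json() does not convert HTML to JSON; it returns a JSON object only when the response body is itself written in JSON.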
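
One way to guard against calling res.json() on a non-JSON body is to check the Content-Type first and fall back gracefully. The helper below is a minimal sketch with a made-up name (`parse_json_body`); it takes the header value and body as plain strings so that it can be tried without a network call.

```python
import json
from typing import Any, Optional


def parse_json_body(content_type: str, body: str) -> Optional[Any]:
    """Return parsed JSON, or None when the body is not JSON."""
    if "json" not in content_type.lower():
        return None
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return None


print(parse_json_body("application/json; charset=utf-8", '{"agents": []}'))
print(parse_json_body("text/html", "<html>You are not authorized</html>"))
```

With a requests response, this could be called as `parse_json_body(res.headers.get("Content-Type", ""), res.text)`.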

  1. https://www.selenium.dev/documentation/webdriver/
