javascript: Constructing a GET URL to fetch JSON from a JS button-click function

ycggw6v2 posted on 2023-01-08 in Java

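I am trying to scrape this page in Python using requests and BeautifulSoup, but the page is driven by JavaScript, so I have tagged the question with both.
https://untappd.com/v/southern-cross-kitchen/469603
I want to scrape it and others like it, but it has a "Show More" button. I would like to avoid a headless browser, so I went snooping through the JavaScript behind the button to see whether I could find a URL or a GET or POST request.
After some inspection, here is the code for the button: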

<a class="yellow button more show-more-section track-click" data-href=":moremenu" data-menu-id="78074484" data-section-id="286735920" data-track="venue" data-venue-id="469603" href="javascript:void(0);">

and it is controlled and redirected by this function:

$(document).on("click", ".show-more-menu-section", (function() {
        var e = $(this);
        $(e).hide();
        var t = $(this).attr("data-venue-id"),
            a = $(this).attr("data-menu-id"),
            n = $(".section-area .menu-section").length;
        return $(".section-loading").addClass("active"), $.ajax({
            url: "/venue/more_menu_section/" + t + "/" + n,
            type: "GET",
            data: "menu_id=" + a,
            dataType: "json",
            error: function(t) {
                $(".section-loading").removeClass("active"), $(e).show(), $.notifyBar({
                    html: "Hmm. Something went wrong. Please try again!",
                    delay: 2e3,
                    animationSpeed: "normal"
                })
            },
            success: function(t) {
                $(".section-loading").removeClass("active"), "" == t.view ? $(e).hide() : (trackMenuView("viewMoreMenuSection"), t.count >= 15 ? ($(e).show(), $(".section-area").append(t.view)) : $(".section-area").append(t.view), handleNew())
            }
        })
}));

which is contained in https://assets.untappd.com/assets/v3/js/venue/venue.menu.min.js?v=2.7.10
So, the values required by the function, t, a and n, are:

t = 469603
n = 78074484
a = 1

I am now trying to construct the URL from the url part of the function, which is:

url: "/venue/more_menu_section/" + t + "/" + n

Using https://www.untappd.com as my base URL, I have tried the following URLs without success:
/venue/more_menu_section/469603/1?data=%7B%22menu_id%22%3A%2278074484%22%7D
/venue/more_menu_section/469603/1?data%3D%7B%22menu_id%22%3A78074484%7D
/venue/more_menu_section/469603/1?data=%7B%22menu_id%22%3A78074484%7D
/venue/more_menu_section/469603/1?data={"menu_id":78074484}
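As a result, I cannot get the data programmatically. I would really like to avoid webdrivers and headless browsers for simulating clicks, so I assume this should be achievable with a GET request. Creating that URL is proving to be a challenge.
How can I build the correct URL to fetch?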

kxkpmulp #1

So, the values required by the function, t, a and n, are:

t = 469603 
n = 78074484 
a = 1

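I think you have mixed up the values of n and a, but how did you get n? In the JS code, n is the number of elements that can be CSS-selected with .section-area .menu-section [basically, the number of menu sections currently displayed]. Is it really just 1?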
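To make the role of n concrete, here is a minimal sketch that counts the elements matching .section-area .menu-section using only the standard library's html.parser (with BeautifulSoup, len(soup.select('.section-area .menu-section')) gives the same number). The sample markup and the names SectionCounter and count_menu_sections are made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SectionCounter(HTMLParser):
    """Counts elements with class 'menu-section' nested inside a 'section-area'."""

    def __init__(self):
        super().__init__()
        self.stack = []   # (tag, has_section_area_class) for every open element
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        inside_area = any(is_area for _, is_area in self.stack)
        if inside_area and "menu-section" in classes:
            self.count += 1
        if tag not in VOID_TAGS:
            self.stack.append((tag, "section-area" in classes))

    def handle_endtag(self, tag):
        # Pop up to and including the nearest matching open tag.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def count_menu_sections(html):
    parser = SectionCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical markup, only to exercise the counter.
SAMPLE = """
<div class="section-area">
  <div class="menu-section"><ul><li>one</li></ul></div>
  <div class="menu-section"><ul><li>two</li></ul></div>
</div>
<div class="menu-section">outside the area, not counted</div>
"""

n = count_menu_sections(SAMPLE)
print(n)
```

Whatever this count is for the page as first loaded is the value that belongs in the n position of the URL.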
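Also, you do not need to add data=... to your URL. The key-value pairs inside data should be joined with ampersands into a query string like ?key1=value1&key2=value2... [URL-encoded, of course], so in this case all you need is ?menu_id={a}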
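As a rough illustration of that mapping, the sketch below builds the URL the way the jQuery call would, with the path from t and n and the query string from data. It uses urllib.parse.urlencode from the standard library; the function name more_menu_section_url is an assumption for this example, and the section count of 1 is simply the value from the question, which may not be the right one.

```python
from urllib.parse import urlencode

BASE_URL = "https://untappd.com"


def more_menu_section_url(venue_id, section_count, menu_id):
    """Mirror the jQuery call: path from t and n, query string from `data`."""
    path = f"/venue/more_menu_section/{venue_id}/{section_count}"
    query = urlencode({"menu_id": menu_id})  # key=value pairs, URL-encoded
    return f"{BASE_URL}{path}?{query}"


url = more_menu_section_url(469603, 1, 78074484)
print(url)

# Several pairs are joined with "&", and special characters are escaped.
print(urlencode({"key1": "value 1", "key2": "a&b"}))
```

With requests, passing params={'menu_id': menu_id} to requests.get produces the same query string, so the URL does not have to be assembled by hand.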
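However, getting the URL right is not enough. Did you set the right headers? Looking at the network log, you need to set 'accept': 'application/json' and 'x-requested-with': 'XMLHttpRequest' to get a JSON response [with the HTML in $.view], and you should set some user agent [like 'user-agent': 'Mozilla/5.0'] so that you do not get blocked.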
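A minimal sketch of a request carrying those three headers follows. It uses urllib.request from the standard library rather than requests, purely so that it has no third-party dependency. The helper names build_ajax_request and fetch_json are assumptions for this example. Only build_ajax_request is exercised here; fetch_json would actually hit the network.

```python
import json
import urllib.request

AJAX_HEADERS = {
    "accept": "application/json",
    "user-agent": "Mozilla/5.0",
    "x-requested-with": "XMLHttpRequest",
}


def build_ajax_request(url):
    """Build (but do not send) a GET request that looks like the page's own XHR."""
    return urllib.request.Request(url, headers=AJAX_HEADERS, method="GET")


def fetch_json(url, timeout=10):
    """Send the request and decode the JSON body; raises on HTTP or JSON errors."""
    with urllib.request.urlopen(build_ajax_request(url), timeout=timeout) as resp:
        return json.load(resp)


req = build_ajax_request(
    "https://untappd.com/venue/more_menu_section/469603/1?menu_id=78074484"
)
print(req.get_method(), req.full_url)
print(req.header_items())  # urllib normalises the capitalisation of header names
```

HTTP header names are case-insensitive, so the capitalisation that urllib applies makes no difference to the server.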
Also, the URLs I noticed in the network log (specifically for the Southern Cross Kitchen page) look more like /venue/more_menu/{venue_id}/{n}?section_id={section_id}. I got an empty $.view with both

  • the link formed from the JS+HTML snippet (/venue/more_menu_section/469603/1?menu_id=78074484), and
  • the /more_menu/ link containing the sectionId from the HTML snippet (/venue/more_menu_section/469603/1?section_id=286735920), but

I did get an HTML string with 3 more items in jData['view'] with

import requests

headers = {
  'accept': 'application/json', 'user-agent': 'Mozilla/5.0', 'x-requested-with': 'XMLHttpRequest'
}
mmUrl = 'https://untappd.com/venue/more_menu/469603/15?section_id=287454773'
response = requests.get(mmUrl, headers=headers)
jData = response.json()

(mmUrl was just copied from the network log. Maybe the sectionId has changed?)

Full example

I wrapped it in a function (scrape_untappd_menu) because I find that more convenient, especially for error handling.

(If you try to run this, do not forget to include selectForList.)

import requests
from bs4 import BeautifulSoup

# def extract_..., selectForList ## PASTE FROM https://pastebin.com/ZnZ7xM6u

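## just for returning and printing a message with one statement ##
def vRet(toPrint, toReturn=None):
    print(toPrint)
    # a fresh list per call avoids sharing one mutable default between callers
    return [] if toReturn is None else toReturn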

def scrape_untappd_menu(umLink, includeVenue=False):
    rootUrl,addSlash = 'https://untappd.com','' if umLink[:1]=='/' else '/'
    if not umLink.startswith(rootUrl):umLink=f'{rootUrl}{addSlash}{umLink}'
    selRef = {
      'name': 'div.beer-details>h5>a',
      'description': 'div.beer-details>h5>em',
      'rating': ('div[data-rating]', 'data-rating'),
      'label': 'div.beer-label',
      'label_updated':('div.beer-label>span[data-update-at]','data-update-at'),
      'about': 'div.beer-details>h6>span',
      'link': ('a[data-href=":beer"][href]', 'href'),
      'brewery': 'a[data-href=":brewery"]',
      'brewery_link': ('a[data-href=":brewery"][href]', 'href'),
      'log_src': ('a[data-href=":beer"]>img[src]', 'src')
    }

    ## fetch and parse page html ##
    pgResp = requests.get(umLink, headers={'user-agent': 'Mozilla/5.0'})
    try: pgResp.raise_for_status() 
    except Exception as e: return vRet(f'failed2scrape:{type(e)} - {e}')
    pgSoup = BeautifulSoup(pgResp.content, 'html.parser')

    ## get some venue name and id ##
    vName, vn_h2, venuId = selectForList(pgSoup, [
        'div.venue-name>h1', 'div.venue-name>h2', 
        ('*[data-venue-id]', 'data-venue-id')])
    if not vName: vName =  umLink.split('/v/', 1)[-1].replace('-', ' ')
    if vn_h2: vName += f' [{vn_h2}]'
    if not venuId: print(f"could not find '*[data-venue-id]'") 
    
    ## get menu items from page ##
    mSel = 'ul.menu-section-list>li>div.beer-info'    
    mList = [selectForList(li, selRef) for li in pgSoup.select(mSel)]

    ## find moremenu button ##
    fetchMore, lmmCt = True, 1
    mmBtn_sel = [
        'a[data-href=":moremenu"][data-section-id]', 
        'a[data-href=":moremenusection"][data-menu-id]'
    ]
    mmBtn, msBtn = pgSoup.select_one(mmBtn_sel[0]), False
    if not mmBtn: 
        mmBtn, msBtn = pgSoup.select_one(mmBtn_sel[1]), True
        if not mmBtn: 
            fetchMore = False
            print(f"could not find '{', '.join(mmBtn_sel)}'")
    
    ## load more ##
    mSel = 'li>div.beer-info'   
    sectCt = len(pgSoup.select('.section-area .menu-section')) 
    sectId = mmBtn.get('data-section-id') if mmBtn else None
    menuId = mmBtn.get('data-menu-id') if mmBtn else None
    while fetchMore:
        lmmUrl = f'/venue/more_menu/{venuId}/{len(mList)}?section_id={sectId}'
        if msBtn: 
            lmmUrl = f'/venue/more_menu_section/{venuId}/{sectCt}'
            lmmUrl += f'?menu_id={menuId}'
        print(f'[{lmmCt}] loading more from {rootUrl+lmmUrl}', end='')
        lmReq = requests.get(f'{rootUrl}{lmmUrl}', headers={
            'accept': 'application/json', 
            'user-agent': 'Mozilla/5.0', 
            'x-requested-with': 'XMLHttpRequest'})
        try: 
            lmReq.raise_for_status() 
            jData = lmReq.json()
            fetchMore = jData['count']
            lmSoup = BeautifulSoup(jData['view'], 'html.parser')
            if lmmUrl: sectCt += fetchMore
            print(f'\r[{lmmCt}] loaded {fetchMore} more from {rootUrl+lmmUrl}')
        except Exception as e: 
            return vRet(f'\n{type(e)} - {e}', mList)
        
        ## get more menu items from html string inside json response ##
        mList += [selectForList(li, selRef) for li in lmSoup.select(mSel)]
        lmmCt += 1
    
    ## some cleanup [and maybe add venue name,id,link] ##
    for mi, m in enumerate(mList):
        m['about'] = m['about'].replace(m['brewery'], '').strip(' \u2022')
        for k in ['link', 'brewery_link', 'log_src']:
            if str(m[k])[:1] == '/': mList[mi][k] = f'{rootUrl}{m[k]}'
        
        mDets = {'venueId': venuId, 'venue': vName} if includeVenue else {}
        for k, v in mList[mi].items(): 
            if k=='about' and v: v = v.strip('\u2022').strip().strip('\u2022')
            mDets[k] = v.strip() if isinstance(v, str) else v
        if includeVenue: mDets['venue_link'] = umLink
        mList[mi] = mDets
    
    return vRet(f'{len(mList)} menu items from {umLink}', mList)

You can call it for a single venue

menuList = scrape_untappd_menu('/v/southern-cross-kitchen/469603')

Or, if you want to scrape menus from multiple venues:

menuList = []
for vl in venueLinks: 
    menuList += scrape_untappd_menu(vl, includeVenue=True)
    print()

You can also save the results as CSV (using pandas, as below):

import pandas

pandas.DataFrame(menuList).to_csv('untappd.csv', index=False)
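Stripped of the scraping details, the load-more loop inside scrape_untappd_menu is plain offset pagination: ask for more using the number of items already collected as the offset, and stop once the reported count is zero. The sketch below shows only that control flow. The names load_all and fake_fetch_page, the "items" key, and the max_pages safety cap are assumptions for illustration. In the real function the items come from parsing the HTML in the response's view field.

```python
def load_all(initial_items, fetch_page, max_pages=100):
    """Collect items until fetch_page reports that nothing more was returned.

    fetch_page(offset) must return a dict with a "count" key (number of new
    items) and an "items" key (the new items themselves).
    """
    items = list(initial_items)
    for _ in range(max_pages):          # safety cap against an endless loop
        page = fetch_page(len(items))   # offset = number of items already loaded
        if not page["count"]:
            break
        items.extend(page["items"])
    return items


# Stand-in for the network call: serves a fixed list in chunks of 3.
ALL_ITEMS = [f"beer {i}" for i in range(8)]


def fake_fetch_page(offset, size=3):
    chunk = ALL_ITEMS[offset:offset + size]
    return {"count": len(chunk), "items": chunk}


menu = load_all(ALL_ITEMS[:2], fake_fetch_page)
print(len(menu))
```

Passing the fetch step in as a function keeps the loop testable without touching the site.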
