html: Using BeautifulSoup to extract content under separate sections of a web page

Posted by c9x0cxw0 on 2022-12-09 in Other

The picture below shows the HTML structure of the content I want to pull from a website, and the markup is also listed below it. I can't find a way to extract the different pieces of information under the master class separately; let me explain further.

Example 1:

<li id="361821" class="list-group-item te-stream-item"><div class="te-stream-title-div"><a class="te-stream-title" href="/austria/balance-of-trade"><b>Austria Trade Deficit Widens in September</b></a><div class="pull-right"><a class="label small bg-primary te-stream-country" href="/stream?c=austria">Austria</a>&nbsp;<a class="label small te-stream-category" href="/stream?i=balance+of trade">Balance of Trade</a></div></div>The trade gap in Austria rose to EUR 1.38 billion in September of 2022 from EUR 1.02 billion a year ago. Imports surged 20.5% year-on-year to EUR 19.03 billion, mainly boosted by purchases of machinery &amp; vehicles (18.6%) while exports increased at a softer 19.6% to EUR 17.65 billion, on higher shipments of machinery &amp; vehicles (18.4%) and processed goods (17.9%). In the first nine months of the year, the country's trade deficit widened to EUR 13.84 billion from EUR 8.61 billion in the same period the year before. "However, the increase in international trade values is largely due to the rise in import and export prices, while the volumes were often declining. The value of gas imports, for example, increased by a whopping 132.7% in the period from January to September compared to the first three quarters of the previous year, although the volume of imports declined by more than 41.4 % in the same period,” said Statistics Austria Director General Tobias Thomas.<br><small>10 minutes ago</small></li>

Example 2:

<li id="361846" class="list-group-item te-stream-item te-stream-item-2"><div class="te-stream-title-div"><a class="te-stream-title-2" href="/commodity/crude-oil"><b>Crude Oil Hits Lowest Since December 2021</b></a><div class="pull-right"><a class="label small bg-primary te-stream-country" href="/stream?c=commodity">Commodity</a></div></div>WTI crude futures were trading around the $73 per barrel mark, the lowest since December 2021, as sentiment remained clouded by worries about weak global demand. Advanced economies, especially the US and Europe, are witnessing a drop in manufacturing activity due to tightening financial conditions. At the same time, sluggish Chinese customs data compounded fears about the global economy's health. Still looking for the demand side but offering some respite to investors, China has been dialing back coronavirus-related restrictions following widespread protests. On the supply front, OPEC+ decided to stick to their existing policy of reducing oil output by 2 million barrels a day from November through 2023. Investors were also assessing the impact of the latest sanctions on Russia, including a price cap and a European Union embargo on seaborne imports of Russian oil.<br><small>7 minutes ago</small></li>

The content I want to pull is:

  1. News headlines (under a class='te-stream-title')
  2. News categories (under div class='pull-right'; get_text() does not return the categories separately, and one news item can have several categories)
  3. News content (directly under the master class li class='list-group-item te-stream-item'). If I use find_all and then get_text(), it returns not only the content but also the "10 minutes ago" that follows the br. I would like the content and that "10 minutes ago" to be pulled separately.

I believe the master class is list-group-item te-stream-item, and soup.find_all('li', class_='list-group-item te-stream-item') captures everything under it, including the news headlines, contents, and categories. My question is how to take the next step and extract those pieces of information separately, so that they can later be turned into a dataframe with one row per news item and 4 columns (news headline; news content; news category; updated time).

1sbrub3j #1

If the pattern is always the same, you can use stripped_strings to extract the strings:

data = [dict(zip(['title','country','category','content','time'],e.stripped_strings)) for e in soup.select('li.list-group-item')]

pd.DataFrame(data)
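Output

| | title | country | category | content | time |
| --- | --- | --- | --- | --- | --- |
| 0 | Austria Trade Deficit Widens in September | Austria | Balance of Trade | The trade gap in Austria rose to EUR 1.38 billion in September of 2022 from EUR 1.02 billion a year ago. ... | 10 minutes ago |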
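
One caveat with this positional approach can be checked against the two examples from the question: Example 2 has no category link, so it yields four strings instead of five and zip() shifts the later fields. The sketch below uses shortened copies of the two items (the body text is abbreviated for illustration) and only counts the strings and builds the dicts:

```python
from bs4 import BeautifulSoup

# Shortened copies of the two <li> items from the question; body text abbreviated.
html = '''
<li id="361821" class="list-group-item te-stream-item"><div class="te-stream-title-div"><a class="te-stream-title" href="/austria/balance-of-trade"><b>Austria Trade Deficit Widens in September</b></a><div class="pull-right"><a class="label small bg-primary te-stream-country" href="/stream?c=austria">Austria</a>&nbsp;<a class="label small te-stream-category" href="/stream?i=balance+of trade">Balance of Trade</a></div></div>The trade gap in Austria rose.<br><small>10 minutes ago</small></li>
<li id="361846" class="list-group-item te-stream-item te-stream-item-2"><div class="te-stream-title-div"><a class="te-stream-title-2" href="/commodity/crude-oil"><b>Crude Oil Hits Lowest Since December 2021</b></a><div class="pull-right"><a class="label small bg-primary te-stream-country" href="/stream?c=commodity">Commodity</a></div></div>WTI crude futures were trading lower.<br><small>7 minutes ago</small></li>
'''
soup = BeautifulSoup(html, 'html.parser')

keys = ['title', 'country', 'category', 'content', 'time']
items = soup.select('li.list-group-item')

# Number of non-empty strings in each item (the &nbsp; separator is stripped away).
counts = [len(list(li.stripped_strings)) for li in items]

# Positional mapping: correct only when every item has exactly five strings.
rows = [dict(zip(keys, li.stripped_strings)) for li in items]

print(counts)
print(rows[1])
```

For the second item the body text ends up under 'category' and the timestamp under 'content', which is why the pattern has to be identical across items for this one-liner to be safe.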
If the pattern is not always the same, you have to select the elements more specifically:
Example

from bs4 import BeautifulSoup
import pandas as pd
html='''
<li id="361821" class="list-group-item te-stream-item">
    <div class="te-stream-title-div">
        <a class="te-stream-title" href="/austria/balance-of-trade"><b>Austria Trade Deficit Widens in September</b></a>
        <div class="pull-right">
            <a class="label small bg-primary te-stream-country" href="/stream?c=austria">Austria</a>&nbsp;<a class="label small te-stream-category" href="/stream?i=balance+of trade">Balance of Trade</a>
        </div>
    </div>The trade gap in Austria rose to EUR 1.38 billion in September of 2022 from EUR 1.02 billion a year ago. Imports surged 20.5% year-on-year to EUR 19.03 billion, mainly boosted by purchases of machinery &amp; vehicles (18.6%) while exports increased at a softer 19.6% to EUR 17.65 billion, on higher shipments of machinery &amp; vehicles (18.4%) and processed goods (17.9%). In the first nine months of the year, the country's trade deficit widened to EUR 13.84 billion from EUR 8.61 billion in the same period the year before. "However, the increase in international trade values is largely due to the rise in import and export prices, while the volumes were often declining. The value of gas imports, for example, increased by a whopping 132.7% in the period from January to September compared to the first three quarters of the previous year, although the volume of imports declined by more than 41.4 % in the same period,” said Statistics Austria Director General Tobias Thomas.<br><small>10 minutes ago</small>
</li>
'''
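# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')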

data = []
# data = [dict(zip(['title','country','category','content','time'],e.stripped_strings)) for e in soup.select('li.list-group-item')]
for e in soup.select('li.list-group-item'):
    data.append({
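        'headline': e.a.get_text(strip=True),                # first <a> is the title link
        'category': e.div.div.get_text(' ', strip=True),     # all labels in div.pull-right
        'content': e.div.next_sibling.strip(),               # text node right after the title div
        'date': e.small.get_text(strip=True)                 # e.g. "10 minutes ago"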
    })
pd.DataFrame(data)
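Output

| | headline | category | content | date |
| --- | --- | --- | --- | --- |
| 0 | Austria Trade Deficit Widens in September | Austria Balance of Trade | The trade gap in Austria rose to EUR 1.38 billion in September of 2022 from EUR 1.02 billion a year ago. ... | 10 minutes ago |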
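
Since the question mentions that one news item can carry several categories, and Example 2 uses a slightly different title class (te-stream-title-2) and has no category link, a more defensive variant might look like the sketch below. It is only one possible approach, not the answer's own method: the selectors are assumptions based on the two examples, the body text is abbreviated, and the field names ('headline', 'categories', 'content', 'updated') are chosen for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copies of the two <li> items from the question; body text abbreviated.
html = '''
<li id="361821" class="list-group-item te-stream-item"><div class="te-stream-title-div"><a class="te-stream-title" href="/austria/balance-of-trade"><b>Austria Trade Deficit Widens in September</b></a><div class="pull-right"><a class="label small bg-primary te-stream-country" href="/stream?c=austria">Austria</a>&nbsp;<a class="label small te-stream-category" href="/stream?i=balance+of trade">Balance of Trade</a></div></div>The trade gap in Austria rose.<br><small>10 minutes ago</small></li>
<li id="361846" class="list-group-item te-stream-item te-stream-item-2"><div class="te-stream-title-div"><a class="te-stream-title-2" href="/commodity/crude-oil"><b>Crude Oil Hits Lowest Since December 2021</b></a><div class="pull-right"><a class="label small bg-primary te-stream-country" href="/stream?c=commodity">Commodity</a></div></div>WTI crude futures were trading lower.<br><small>7 minutes ago</small></li>
'''
soup = BeautifulSoup(html, 'html.parser')

rows = []
for li in soup.select('li.list-group-item'):
    # Title link: the <a> that is a direct child of the title div,
    # whatever its exact class (te-stream-title or te-stream-title-2).
    title = li.select_one('div.te-stream-title-div > a')
    # Every label link inside div.pull-right, kept as a list so that
    # items with one, two, or more labels are all handled.
    labels = [a.get_text(strip=True) for a in li.select('div.pull-right a')]
    # Body text: only the text nodes that are direct children of the <li>,
    # which excludes the title, the labels, and the <small> timestamp.
    body = ''.join(li.find_all(string=True, recursive=False)).strip()
    stamp = li.find('small')
    rows.append({
        'headline': title.get_text(strip=True) if title else None,
        'categories': labels,
        'content': body,
        'updated': stamp.get_text(strip=True) if stamp else None,
    })

for row in rows:
    print(row)
```

Passing rows to pd.DataFrame should then give one row per news item; joining the list with ', '.join(labels) instead would keep the category column as plain text.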
