Python: create nested lists from another list based on a byte-size check, with dynamically incremented nested-list names

cu6pst1q  posted on 2023-02-02  in  Python

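I have a list of paragraph-tag (<p>) contents scraped from a website with beautifulsoup4.
I want to split this list into a nested list whose sublist names are incremented dynamically, with the increment triggered by a byte-size check on the current sublist. The result should then be used to build a JSON object.
An example of my current code: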

import requests
from bs4 import BeautifulSoup

def getContent():

    page = requests.get("www.example.com")
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.prettify()
    
    data = {}
    SECTION_INDEX = 1
    data_container = []
    total_article_size = 0
    article_section_data = []


    for tag in soup.find_all("p"):
        text = tag.text
        data_container.append(text)

    for p in data_container:
        article_section = "CONTENT_SECTION_" + str(SECTION_INDEX)
        article_section_data.append(p)
        data[article_section] = article_section_data

        if article_section_size >= 300:
            SECTION_INDEX = SECTION_INDEX + 1

    return(data)

def createJson():
    data = getContent()
    json_source = {
                      "ARTICLE_DATA": data
                  }

    json_object = json.dumps(json_source, indent=2)

def main():
    createJson()

Actual result:

{
  "CONTENT_DATA": {
    "CONTENT_SECTION_1": [
      "the actual paragraphs",
      "content goes there",
      "some more content",
      "even more content from the site",
      "and some even more",
      "and finally, some more"
    ],
    "CONTENT_SECTION_2": [
      "the actual paragraphs",
      "content goes there",
      "some more content",
      "even more content from the site",
      "and some even more",
      "and finally, some more"
    ],
    "CONTENT_SECTION_3": [
      "the actual paragraphs",
      "content goes there",
      "some more content",
      "even more content from the site",
      "and some even more",
      "and finally, some more"
    ]
  }
}

Expected result:

{
  "CONTENT_DATA": {
    "CONTENT_SECTION_1": [
      "the actual paragraphs",
      "content goes there"
    ],
    "CONTENT_SECTION_2": [
      "some more content",
      "even more content from the site"
    ],
    "CONTENT_SECTION_3": [
      "and some even more",
      "and finally, some more"
    ]
  }
}

How can I achieve this, and why does the actual result show the repeated pattern?

ccgok5k5

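To get the expected result, you can use the sys.getsizeof function to track the size of the current article section and split the data_container list into smaller lists based on the desired byte size. Here is the updated code: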

import sys

import requests
from bs4 import BeautifulSoup

def getContent():
    # requests needs an explicit scheme; a bare "www.example.com" raises MissingSchema
    page = requests.get("https://www.example.com")
    soup = BeautifulSoup(page.content, "html.parser")

    data = {}
    SECTION_INDEX = 1
    data_container = []
    article_section_size = 0
    article_section_data = []

    # Collect the text of every <p> tag
    for tag in soup.find_all("p"):
        text = tag.text
        data_container.append(text)

    for p in data_container:
        article_section = "CONTENT_SECTION_" + str(SECTION_INDEX)
        article_section_data.append(p)
        # Running size of the current section, as reported by sys.getsizeof
        article_section_size += sys.getsizeof(p)

        if article_section_size >= 300:
            # Section is full: store it, then start a fresh list and counter
            data[article_section] = article_section_data
            article_section_data = []
            article_section_size = 0
            SECTION_INDEX = SECTION_INDEX + 1

    # Store whatever is left over after the last full section
    if article_section_data:
        data[article_section] = article_section_data

    return data
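
One caveat on the size check: sys.getsizeof reports the in-memory size of the Python string object, which includes interpreter overhead, rather than the number of bytes the text occupies once encoded. If the limit is meant to apply to the encoded text, a variant along these lines may be closer to what you want. This is only a sketch: the helper name split_by_bytes, its parameters, and the choice of UTF-8 are assumptions for illustration, not part of the code above.

```python
def split_by_bytes(paragraphs, max_bytes=300, prefix="CONTENT_SECTION_"):
    """Group paragraphs into numbered sections.

    A section is closed as soon as its accumulated UTF-8 size
    reaches max_bytes (hypothetical helper, for illustration only).
    """
    sections = {}
    current = []
    current_size = 0

    for text in paragraphs:
        current.append(text)
        # Size of the text when encoded as UTF-8, not the object's memory size
        current_size += len(text.encode("utf-8"))

        if current_size >= max_bytes:
            sections[f"{prefix}{len(sections) + 1}"] = current
            current = []  # new list object, so sections never share data
            current_size = 0

    # Keep any paragraphs left over after the last full section
    if current:
        sections[f"{prefix}{len(sections) + 1}"] = current

    return sections
```

Calling split_by_bytes(data_container) in place of the second loop should give the same dictionary shape, with section boundaries decided by encoded length.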

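The repeated pattern in your actual result occurs because you always append each p element to the same article_section_data list and never reset it to an empty list once the desired byte size is reached. Every key in data therefore refers to that one list object, which ends up holding all of the paragraphs.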
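
The effect can be reproduced in isolation. The snippet below is a minimal illustration with made-up values (the names shared and all_same_object are only for this demo): assigning the same list to several keys stores a reference, not a copy.

```python
shared = []
data = {}

for index, text in enumerate(["a", "b", "c"], start=1):
    shared.append(text)
    # Each key receives a reference to the very same list object
    data[f"CONTENT_SECTION_{index}"] = shared

all_same_object = all(value is shared for value in data.values())
print(all_same_object)            # True
print(data["CONTENT_SECTION_1"])  # ['a', 'b', 'c']
```

Rebinding the name to a new empty list after each section is stored breaks that sharing.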
