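I have a list of paragraph tag (<p>) contents from a website, collected with beautifulsoup4.
I want to split this list into a nested list whose sub-list names increment dynamically, where the increment is triggered by a check on the byte size of the current sub-list. The result should be used to build a JSON object.
My current code, as an example: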
import json
import requests
from bs4 import BeautifulSoup

def getContent():
    page = requests.get("www.example.com")
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.prettify()
    data = {}
    SECTION_INDEX = 1
    data_container = []
    total_article_size = 0
    article_section_data = []
    # collect the text of every <p> tag
    for tag in soup.find_all("p"):
        text = tag.text
        data_container.append(text)
    for p in data_container:
        article_section = "CONTENT_SECTION_" + str(SECTION_INDEX)
        article_section_data.append(p)
        data[article_section] = article_section_data
        # article_section_size is not assigned anywhere in this snippet
        if article_section_size >= 300:
            SECTION_INDEX = SECTION_INDEX + 1
    return(data)

def createJson():
    data = getContent()
    json_source = {
        "ARTICLE_DATA": data
    }
    json_object = json.dumps(json_source, indent=2)

def main():
    createJson()
Actual result:
{
  "CONTENT_DATA": {
    "CONTENT_SECTION_1": [
      "the actual paragraphs",
      "content goes there",
      "some more content",
      "even more content from the site",
      "and some even more",
      "and finally, some more"
    ],
    "CONTENT_SECTION_2": [
      "the actual paragraphs",
      "content goes there",
      "some more content",
      "even more content from the site",
      "and some even more",
      "and finally, some more"
    ],
    "CONTENT_SECTION_3": [
      "the actual paragraphs",
      "content goes there",
      "some more content",
      "even more content from the site",
      "and some even more",
      "and finally, some more"
    ]
  }
}
Expected result:
{
  "CONTENT_DATA": {
    "CONTENT_SECTION_1": [
      "the actual paragraphs",
      "content goes there"
    ],
    "CONTENT_SECTION_2": [
      "some more content",
      "even more content from the site"
    ],
    "CONTENT_SECTION_3": [
      "and some even more",
      "and finally, some more"
    ]
  }
}
How can I achieve this, and why does the actual result show this repeated pattern?
1 Answer
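To get the expected result, you can use the sys.getsizeof function to track the size of the current article section and split the data_container list into smaller lists based on the desired byte size.
The repeated pattern in the actual result occurs because you always append each p element to the same article_section_data list, instead of resetting it to an empty list once the desired byte size is reached. Every CONTENT_SECTION_n key therefore points to that one list, so each section shows all of the paragraphs.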
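The approach described above could be sketched roughly as follows, assuming the paragraphs have already been collected into a list of strings. The function name split_into_sections and its parameters max_bytes and prefix are hypothetical. This sketch also measures size as the UTF-8 encoded length of the text rather than with sys.getsizeof, because sys.getsizeof on a list reports the size of the list object itself and not of the strings it holds.

```python
import json


def split_into_sections(paragraphs, max_bytes=300, prefix="CONTENT_SECTION_"):
    """Group paragraphs into numbered sections.

    A section is closed as soon as its accumulated UTF-8 size
    reaches max_bytes (hypothetical parameter, in bytes).
    """
    sections = {}
    index = 1
    current = []       # paragraphs of the section being filled
    current_size = 0   # accumulated UTF-8 bytes of that section

    for text in paragraphs:
        current.append(text)
        current_size += len(text.encode("utf-8"))
        if current_size >= max_bytes:
            sections[prefix + str(index)] = current
            index += 1
            current = []       # bind a NEW list; reusing the old one causes the repetition
            current_size = 0

    # keep any trailing paragraphs that never reached the threshold
    if current:
        sections[prefix + str(index)] = current
    return sections


if __name__ == "__main__":
    sample = [
        "the actual paragraphs",
        "content goes there",
        "some more content",
        "even more content from the site",
        "and some even more",
        "and finally, some more",
    ]
    # a small threshold so the short sample strings split into several sections
    result = {"ARTICLE_DATA": split_into_sections(sample, max_bytes=35)}
    print(json.dumps(result, indent=2))
```

The key change is rebinding current to a fresh list after each section is stored, so every key in the dictionary refers to its own list. The threshold of 35 is chosen only so that the short sample strings split into several sections; a real page would use a larger value such as the 300 from the question.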