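I am trying to parse an HTML document with Beautiful Soup and its findAll method, but I can't seem to isolate the information I need. I have read the documentation and a few tutorials; maybe it's because I'm a junior developer, but I can't isolate the number and the link.
Here is a dummy HTML table containing the essential information: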
<tbody>
<tr class="results_row2">
<td align="left">
Text is here ispssgjj sgdhjksgd jhsgd sgd
</td>
<td align="left">
GHJSFAGHJSFA GAFGSH AGSHSAGJH
</td>
<td align="left">
hdjk sgdhjk fdhjk sdhjk sdghjk
</td>
<td align="center">
11/10/1964
</td>
<td align="left">
</td>
<td align="center">
5
</td>
<td align="center">
<a href="javascript:confirm_delete('informatjon I need to ignore IS HERE')">Delete</a>
<br>
<a href="javascript:PBC('information I need to grab via parse comes from here ')">LINK TITLE</a>
<br>
</td>
</tr>
</tbody>
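When I run my program, I need it to pull the following for each row (this is just one row): the date (rearranged to YYMMDD, i.e. 641110) and the string inside the PBC(...) call, shown as 'information I need to grab via parse comes from here' in the dummy table (I then have to concatenate it with another string to make it a valid link).
I don't want any of the other information, such as the link text or the gibberish text (e.g. Hjkhjksgd).
Edit: I also need to be able to log in to the web page with the correct credentials (I have the username and password).
I hope my code is clear enough; I have left some prints in to help me understand the variables and so on. I'm also open to other approaches, since I can't seem to figure out Beautiful Soup, pandas, or Selenium. So far I've got this: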
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
#label the file location
file_location = r"Destination goes here"
#open the file up
with open(file_location, 'r') as f:
    file = f.read()
#create a soup
soup= BeautifulSoup(file, "html.parser")
#print(f"soup is {soup}")
#find all the tags that match what we want
script = soup.findAll('td', id='center')
print('begning loop')
#this is to find the date I am going to make a separate loop to find the print certificate
#loop through the tags and check for what we want
for i in range (0, len(script)):
    #these two variables are me trying to convert the tag to a variable to be used to check
    scriptString = str(script[i])
    scriptInt = int(script[i])
    #print(f'Starting loop i is: {i}')
    # Every 7th cell seems to be a number....
    if((i+4)%7 == 0):
        print(f'Starting IF i is: {i}')
        print(f'int test is {scriptInt}')
        #print(f'script is {script[i]} quote end')
        #this was to find out which part of the string was a number and it's 80% accurate
        #for j in range (0, len(scriptString)):
            #print(f' j is {j} and string is {scriptString[j]}')
        #this printed the YYMMDD
        print(f'Rewritten the string is: "{scriptString[41]}{scriptString[42]}{scriptString[33]}{scriptString[34]}{scriptString[36]}{scriptString[37]}" quote end')
print("end")
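I tried extracting the string from the table, but it doesn't behave like an int and the string is very messy. Because the string is so messy, I can't compare it against what I want. And because there are multiple td tags, I can't isolate it by td alone.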
1 Answer
I used the datetime module and the re module to try to do what you need; I hope it helps. Here is the code:
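As a rough illustration of how `datetime` and `re` can be combined for the sample row in the question, here is a minimal sketch. It is written under assumptions and is not necessarily the same as the answer's own listing. The date is assumed to be MM/DD/YYYY (which is what turns 11/10/1964 into 641110), `LINK_PREFIX` is a hypothetical placeholder because the question does not give the real prefix, and `extract_rows` is a name introduced here. It treats the HTML as plain text, which is fragile if the markup changes; the row and cell selection could equally be done with Beautiful Soup's find_all.

```python
import re
from datetime import datetime

# Hypothetical prefix; the question does not say what the real one is.
LINK_PREFIX = "https://example.com/view?id="

# One table row at a time; non-greedy so each <tr>...</tr> is matched separately.
ROW_RE = re.compile(r"<tr\b.*?</tr>", re.DOTALL | re.IGNORECASE)
# A date written as two digits / two digits / four digits.
DATE_RE = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")
# Only the single-quoted argument of javascript:PBC('...'); confirm_delete is not matched.
PBC_RE = re.compile(r"javascript:PBC\('([^']*)'\)")


def extract_rows(html):
    """Return (yymmdd, link_argument) tuples, one per row that contains both."""
    results = []
    for row in ROW_RE.findall(html):
        date_match = DATE_RE.search(row)
        link_match = PBC_RE.search(row)
        if not (date_match and link_match):
            continue  # skip rows missing either piece
        # Assumption: the source date is MM/DD/YYYY.
        parsed = datetime.strptime(date_match.group(1), "%m/%d/%Y")
        results.append((parsed.strftime("%y%m%d"), link_match.group(1).strip()))
    return results


sample = """<tr class="results_row2">
<td align="center">
11/10/1964
</td>
<td align="center">
<a href="javascript:confirm_delete('ignore me')">Delete</a>
<br>
<a href="javascript:PBC('abc123 ')">LINK TITLE</a>
</td>
</tr>"""

for yymmdd, argument in extract_rows(sample):
    print(yymmdd, LINK_PREFIX + argument)
# prints: 641110 https://example.com/view?id=abc123
```

Matching on the `PBC(` name rather than on the cell position avoids counting td tags, and `strptime`/`strftime` replaces the character-by-character indexing used in the question.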