Python: How do you parse HTML with Beautiful Soup to get only one specific JavaScript link and a specific date from an HTML table?

Posted by zbdgwd5y on 2023-05-05 in Python

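I'm trying to parse an HTML document using Beautiful Soup and the findAll method, but I can't seem to isolate the information I need. I've looked at the documentation and some tutorials, and maybe it's because I'm a junior developer, but I can't seem to isolate the number and the link.
Here is a dummy HTML table with the basic information: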

<tbody>                                    
                                            <tr class="results_row2">
                                                <td align="left">
                                                    Text is here ispssgjj sgdhjksgd jhsgd sgd
                                                </td>
                                                <td align="left">
                                                    GHJSFAGHJSFA GAFGSH AGSHSAGJH
                                                </td>
                                                <td align="left">
                                                    hdjk sgdhjk fdhjk sdhjk sdghjk
                                                </td>
                                                <td align="center">
                                                    11/10/1964
                                                </td>
                                                <td align="left">
                                                    
                                                </td>
                                                <td align="center">
                                                    5
                                                </td>

                                                <td align="center">
                                                    
                                                    <a href="javascript:confirm_delete('informatjon I need to ignore IS HERE')">Delete</a>
                                                    
                                                            <br>
                                                            <a href="javascript:PBC('information I need to grab via parse comes from here ')">LINK TITLE</a>
                                                            
                                                    <br>
                                        
                                                </td>
                                            </tr>
                                                                               
                                </tbody>

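When I run my program, I need it to pull the following for each row (this is just one row): the date (but rearranged as YYMMDD, i.e. 641110) and the string where it says LINK GOES HERE, that is, the argument of the PBC(...) call in the table above (I have to concatenate it with another string to make it a valid link).
I don't want any of the other information, such as the link text or the gibberish text (e.g. Hjkhjksgd).
Edit: I also need to be able to log in to the web page with the correct credentials (I have the username and password).
Hopefully my code is clear enough; I have some prints in there to help me understand the variables, etc. I'm also open to other approaches; I can't seem to figure out Beautiful Soup, pandas, or Selenium. So far I've got this: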

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

#label the file location
file_location = r"Destination goes here"

#open the file up
with open(file_location, 'r') as f:
    file = f.read() 

#create a soup
soup= BeautifulSoup(file, "html.parser")
#print(f"soup is {soup}")     

#find all the tags that match what we want
script = soup.findAll('td', id='center')

print('beginning loop')

#this is to find the date I am going to make a separate loop to find the print certificate 
#loop through the tags and check for what we want 
for i in range (0, len(script)):
        #these two variables are me trying to convert the tag to a variable to be used to check
    scriptString = str(script[i])
    scriptInt = int(script[i])
    
        #print(f'Starting loop i is: {i}')
        # Every 7th cell seems to be a number....
    if((i+4)%7 == 0):
        print(f'Starting IF i is: {i}')
        print(f'int test is {scriptInt}')
        #print(f'script is {script[i]} quote end')
                #this was to find out which part of the string was a number and it's 80% accurate 
        #for j in range (0, len(scriptString)):
            #print(f' j is {j} and string is {scriptString[j]}')
    #this printed the YYMMDD    
        print(f'Rewritten the string is: "{scriptString[41]}{scriptString[42]}{scriptString[33]}{scriptString[34]}{scriptString[36]}{scriptString[37]}" quote end')

print("end")

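I tried pulling the string out of the table, but it doesn't behave like an int, and the string is very messy. Because the string is so messy, I can't compare it with what I want. And since there are multiple td tags, I can't isolate it by td alone.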
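
The messiness described here comes from the whitespace around the cell text, and the value is a date rather than an integer. As a minimal sketch, assuming the dates are always written MM/DD/YYYY as in the sample row, the text of a cell can be stripped and parsed with datetime instead of being indexed character by character. The helper name to_yymmdd is hypothetical and not part of the original code.

```python
from datetime import datetime


def to_yymmdd(cell_text):
    """Convert messy 'MM/DD/YYYY' cell text to 'YYMMDD'.

    Returns None when the text is not a date in that format
    (for example the cell that only contains "5").
    """
    try:
        # strip() removes the newlines and indentation around the value
        parsed = datetime.strptime(cell_text.strip(), "%m/%d/%Y")
    except ValueError:
        return None
    # %y, %m and %d are zero-padded two-digit year, month and day
    return parsed.strftime("%y%m%d")


print(to_yymmdd("\n        11/10/1964\n    "))  # 641110
print(to_yymmdd("  5  "))  # None
```

With Beautiful Soup, the string passed in would typically be the cell's text (for example tag.get_text()), not str(tag), which also includes the surrounding markup.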


Answer #1 (zi8p0yeb)

I used the datetime and re modules to try to meet your requirements; I hope this helps. Here is the code:

import re
from datetime import datetime
from bs4 import BeautifulSoup

file_location = r"yourhtml.html"
with open(file_location, "r") as f:
    file = f.read()
soup = BeautifulSoup(file, "html.parser")

# Only the centered cells hold the date and the links
cells = soup.find_all("td", align="center")
# Capture the argument of PBC('...') only, so confirm_delete('...') is ignored
pattern = re.compile(r"PBC\('(.*?)'\)")

print("beginning loop")
for cell in cells:
    # Parse the javascript: hrefs of any links in this cell
    for a in cell.find_all("a", href=True):
        match = pattern.search(a["href"])
        if match:
            print(match.group(1))
    # Cells whose text is not an MM/DD/YYYY date (e.g. "5") are skipped
    try:
        date_obj = datetime.strptime(cell.get_text(strip=True), "%m/%d/%Y")
    except ValueError:
        continue
    # %y%m%d zero-pads month and day, e.g. 11/10/1964 -> 641110
    print(date_obj.strftime("%y%m%d"))
print("end")
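
If the page has many rows, it may help to keep each date together with the link from the same row instead of printing them separately. The following is only a sketch under a few assumptions: BASE_URL is a made-up placeholder (the question does not say what the link prefix is), the prefix is joined by plain string concatenation as the question describes, and the function name extract_rows and the sample HTML are hypothetical.

```python
import re
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical prefix; the real one is not given in the question.
BASE_URL = "https://example.com/certificates/"

# Captures the quoted argument of PBC('...'); confirm_delete('...') does not match.
PBC_PATTERN = re.compile(r"PBC\('(.*?)'\)")


def extract_rows(html, base_url=BASE_URL):
    """Return one (yymmdd, link) tuple per row that has both a date and a PBC link."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.find_all("tr"):
        date_str = None
        link = None
        for cell in row.find_all("td", align="center"):
            # Look for the PBC link in this cell's javascript: hrefs
            for a in cell.find_all("a", href=True):
                match = PBC_PATTERN.search(a["href"])
                if match:
                    link = base_url + match.group(1).strip()
            # Try to read the cell text as an MM/DD/YYYY date
            try:
                parsed = datetime.strptime(cell.get_text(strip=True), "%m/%d/%Y")
            except ValueError:
                continue  # not a date cell (e.g. "5" or the link cell)
            date_str = parsed.strftime("%y%m%d")
        if date_str and link:
            results.append((date_str, link))
    return results


# Shortened stand-in for the table in the question
SAMPLE_HTML = """
<table><tbody>
  <tr class="results_row2">
    <td align="left">some text</td>
    <td align="center"> 11/10/1964 </td>
    <td align="center"> 5 </td>
    <td align="center">
      <a href="javascript:confirm_delete('ignore-me')">Delete</a><br>
      <a href="javascript:PBC('cert123.pdf')">LINK TITLE</a><br>
    </td>
  </tr>
</tbody></table>
"""

print(extract_rows(SAMPLE_HTML))
# [('641110', 'https://example.com/certificates/cert123.pdf')]
```

Grouping by tr means a row that lacks either the date or the PBC link is simply left out, rather than shifting the pairing for the rows after it.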
