Python: How do you parse HTML with Beautiful Soup to get only a specific JavaScript link and a specific date from an HTML table?

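I'm trying to parse an HTML document using Beautiful Soup and the findAll method, but I can't seem to isolate the information I need. I've read the documentation and a few tutorials, and maybe it's because I'm a junior developer, but I can't isolate the number and the link.
Here is a dummy HTML table containing the essential information: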

<tbody>                                    
                                            <tr class="results_row2">
                                                <td align="left">
                                                    Text is here ispssgjj sgdhjksgd jhsgd sgd
                                                </td>
                                                <td align="left">
                                                    GHJSFAGHJSFA GAFGSH AGSHSAGJH
                                                </td>
                                                <td align="left">
                                                    hdjk sgdhjk fdhjk sdhjk sdghjk
                                                </td>
                                                <td align="center">
                                                    11/10/1964
                                                </td>
                                                <td align="left">
                                                    
                                                </td>
                                                <td align="center">
                                                    5
                                                </td>
                                                <td align="center">
                                                    
                                                    <a href="javascript:confirm_delete('informatjon I need to ignore IS HERE')">Delete</a>
                                                    
                                                            <br>
                                                            <a href="javascript:PBC('information I need to grab via parse comes from here ')">LINK TITLE</a>
                                                            
                                                    <br>
                                        
                                                </td>
                                            </tr>
                                                                               
                                </tbody>

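When I run my program, I need it to pull the following for each row (this is just one row): the date (rearranged as YYMMDD, i.e. 641110) and the string inside the `PBC(...)` link (which I then have to concatenate with another string to make it a valid link).
I don't want any other information, such as the link titles or the gibberish text (e.g. Hjkhjksgd).
Edit: I also need to be able to log in to the web page with the correct credentials (I have the username and password).
Hopefully my code is clear enough; I've left some prints in to help me understand the variables. I'm also open to other approaches, since I can't seem to figure out Beautiful Soup, pandas, or selenium. This is what I have so far: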

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
#label the file location
file_location = r"Destination goes here"
#open the file up
with open(file_location, 'r') as f:
    file = f.read() 
#create a soup
soup= BeautifulSoup(file, "html.parser")
#print(f"soup is {soup}")     
#find all the tags that match what we want
script = soup.findAll('td', id='center')
print('begning loop')
#this is to find the date I am going to make a separate loop to find the print certificate 
#loop through the tags and check for what we want 
for i in range (0, len(script)):
        #these two variables are me trying to convert the tag to a variable to be used to check
    scriptString = str(script[i])
    scriptInt = int(script[i])
    
        #print(f'Starting loop i is: {i}')
        # Every 7th cell seems to be a number....
    if((i+4)%7 == 0):
        print(f'Starting IF i is: {i}')
        print(f'int test is {scriptInt}')
        #print(f'script is {script[i]} quote end')
                #this was to find out which part of the string was a number and it's 80% accurate 
        #for j in range (0, len(scriptString)):
            #print(f' j is {j} and string is {scriptString[j]}')
    #this printed the YYMMDD    
        print(f'Rewritten the string is: "{scriptString[41]}{scriptString[42]}{scriptString[33]}{scriptString[34]}{scriptString[36]}{scriptString[37]}" quote end')
print("end")

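I tried pulling the string out of the table, but it doesn't behave like an int and the string is very messy. Because the string is so messy, I can't compare it against what I'm looking for. And because there are multiple td tags, I can't isolate it by td alone.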

I used the datetime and re modules to try to do what you need; I hope it helps. Here is the code:

import re
from datetime import datetime
from bs4 import BeautifulSoup

file_location = r"yourhtml.html"
with open(file_location, "r") as f:
    file = f.read()
soup = BeautifulSoup(file, "html.parser")
# Only the centered cells hold the date and the links
cells = soup.find_all("td", align="center")
print("beginning loop")
for cell in cells:
    # Pull the argument out of javascript:PBC('...'); confirm_delete links do not match
    for a in cell.find_all("a", href=True):
        match = re.search(r"PBC\('(.*?)'\)", a["href"])
        if match:
            print(match.group(1).strip())
    try:
        date_obj = datetime.strptime(cell.get_text(strip=True), "%m/%d/%Y")
    except ValueError:
        # Not a date cell (for example "5" or the cell holding the links)
        continue
    # %y%m%d zero-pads month and day, e.g. 11/10/1964 -> 641110
    print(date_obj.strftime("%y%m%d"))
print("end")
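
If you want the date and the link paired up per row instead of printed separately, something like the following might work. It is only a sketch: `BASE_URL`, `SAMPLE_HTML`, and `extract_rows` are names made up for illustration, the real prefix for the link is not given in the question, and I am assuming each `<tr>` holds at most one date and one `PBC(...)` link.

```python
import re
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical prefix; replace it with the string you need to prepend.
BASE_URL = "https://example.invalid/certificates/"

# Matches only javascript:PBC('...') hrefs, so confirm_delete(...) is ignored.
PBC_PATTERN = re.compile(r"PBC\('(.*?)'\)")

# Cut-down stand-in for the table in the question (assumed structure).
SAMPLE_HTML = """
<table><tbody><tr class="results_row2">
  <td align="left">some text</td>
  <td align="center"> 01/05/1964 </td>
  <td align="center">5</td>
  <td align="center">
    <a href="javascript:confirm_delete('ignore me')">Delete</a><br>
    <a href="javascript:PBC('abc123 ')">LINK TITLE</a><br>
  </td>
</tr></tbody></table>
"""


def extract_rows(html):
    """Return one (yymmdd, link) tuple per <tr> that has both a date and a PBC link."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.find_all("tr"):
        date_code = None
        link = None
        for cell in row.find_all("td", align="center"):
            text = cell.get_text(strip=True)
            try:
                # %y%m%d keeps the leading zeros: 01/05/1964 becomes 640105
                date_code = datetime.strptime(text, "%m/%d/%Y").strftime("%y%m%d")
            except ValueError:
                pass  # not a date cell
            for a in cell.find_all("a", href=True):
                match = PBC_PATTERN.search(a["href"])
                if match:
                    # Plain concatenation, as described in the question
                    link = BASE_URL + match.group(1).strip()
        if date_code and link:
            results.append((date_code, link))
    return results
```

Calling `extract_rows(SAMPLE_HTML)` gives a list with one tuple for the sample row. Going row by row also avoids the "every 7th cell" arithmetic, since the cells are matched by content rather than by position.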

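For the login mentioned in the edit, one possible approach is a `requests.Session`, which keeps the cookies from the login request for later requests. This is a rough sketch under assumptions: `fetch_protected_html` is a made-up helper, the site is assumed to use a plain form POST, and the form field names `username` and `password` are guesses; check the real login form (or use the browser's network tab) for the actual names.

```python
import requests


def fetch_protected_html(login_url, page_url, username, password, session=None):
    """Log in with a form POST, then fetch page_url with the same session cookies."""
    if session is None:
        session = requests.Session()
    # Assumed field names; the real form may use different ones or need a CSRF token.
    response = session.post(login_url, data={"username": username, "password": password})
    response.raise_for_status()
    page = session.get(page_url)
    page.raise_for_status()
    return page.text
```

The returned text can be passed to `BeautifulSoup(..., "html.parser")` in place of the file contents read from disk. If the site builds the table with JavaScript after loading, this will not be enough and a browser-driven tool such as selenium may be needed.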