python: How do you parse HTML with Beautiful Soup to get only a specific JavaScript link and a specific date from an HTML table?

Posted by zbdgwd5y on 2023-05-05 in Python

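I am trying to parse an HTML document with Beautiful Soup and its findAll method, but I can't seem to isolate the information I need. I have read the documentation and some tutorials; maybe it's because I'm a junior developer, but I can't isolate the number and the link.
Here is a dummy HTML table with the essential information: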

    <tbody>
      <tr class="results_row2">
        <td align="left">
          Text is here ispssgjj sgdhjksgd jhsgd sgd
        </td>
        <td align="left">
          GHJSFAGHJSFA GAFGSH AGSHSAGJH
        </td>
        <td align="left">
          hdjk sgdhjk fdhjk sdhjk sdghjk
        </td>
        <td align="center">
          11/10/1964
        </td>
        <td align="left">
        </td>
        <td align="center">
          5
        </td>
        <td align="center">
          <a href="javascript:confirm_delete('informatjon I need to ignore IS HERE')">Delete</a>
          <br>
          <a href="javascript:PBC('information I need to grab via parse comes from here ')">LINK TITLE</a>
          <br>
        </td>
      </tr>
    </tbody>

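When I run my program, I need it to pull the following for each row (this is just one row): the date (but rearranged as YYMMDD, i.e. 641110), and the string where it says LINK GOES HERE (in the dummy table, the argument of the javascript:PBC(...) call), which I have to concatenate with another string to make it a valid link.
I don't want any of the other information, such as the other link or the gibberish text (e.g. Hjkhjksgd).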
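The two transformations described above (MM/DD/YYYY to YYMMDD, and joining the extracted string to a prefix) can be sketched with the standard library alone. This is a minimal sketch: `LINK_PREFIX`, `to_yymmdd`, and `build_link` are hypothetical names, and the prefix is a placeholder because the question does not say what the real one is.

```python
from datetime import datetime

# Hypothetical prefix: the question does not state the real one.
LINK_PREFIX = "https://example.com/certificates/"


def to_yymmdd(date_text):
    """Turn 'MM/DD/YYYY' into 'YYMMDD'; strftime zero-pads month and day."""
    parsed = datetime.strptime(date_text.strip(), "%m/%d/%Y")
    return parsed.strftime("%y%m%d")


def build_link(argument):
    """Join the prefix with the argument taken from the PBC(...) call."""
    # The dummy data has a trailing space inside the quotes, so strip it.
    return LINK_PREFIX + argument.strip()


print(to_yymmdd("11/10/1964"))
print(build_link("abc123 "))
```

Parsing the date with `strptime` rather than indexing into the string keeps the result correct when the surrounding markup changes length.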
Edit: I also need to be able to log in to the web page with the correct credentials (I have the username and password).
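For the login requirement, one common pattern is to post the credentials once and reuse the same session for the protected page, so the cookies carry over. The sketch below assumes a plain HTML form login; the function name `fetch_protected_page` and the form field names `username` and `password` are assumptions, since the question does not describe the site. The `session` argument is expected to behave like a `requests.Session`.

```python
def fetch_protected_page(session, login_url, page_url, username, password):
    """Log in with a form POST, then fetch a page using the same session.

    `session` is expected to be a requests.Session (or anything with
    compatible post/get methods). The form field names are hypothetical;
    check the real login form for the names it uses.
    """
    payload = {"username": username, "password": password}
    login_response = session.post(login_url, data=payload)
    login_response.raise_for_status()  # stop early on an HTTP error

    page_response = session.get(page_url)
    page_response.raise_for_status()
    return page_response.text
```

With the requests library installed, this could be called as `fetch_protected_page(requests.Session(), login_url, page_url, user, password)`, and the returned HTML passed to BeautifulSoup in place of the file contents. Sites that log in through JavaScript or tokens would need a different approach.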
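I hope my code is clear enough; I left some print calls in to help me understand the variables. I'm also open to other approaches; I can't seem to figure out Beautiful Soup, Pandas, or Selenium. This is what I have so far: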

    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup
    #label the file location
    file_location = r"Destination goes here"
    #open the file up
    with open(file_location, 'r') as f:
        file = f.read()
    #create a soup
    soup= BeautifulSoup(file, "html.parser")
    #print(f"soup is {soup}")
    #find all the tags that match what we want
    script = soup.findAll('td', id='center')
    print('begning loop')
    #this is to find the date I am going to make a separate loop to find the print certificate
    #loop through the tags and check for what we want
    for i in range (0, len(script)):
        #these two variables are me trying to convert the tag to a variable to be used to check
        scriptString = str(script[i])
        scriptInt = int(script[i])
        #print(f'Starting loop i is: {i}')
        # Every 7th cell seems to be a number....
        if((i+4)%7 == 0):
            print(f'Starting IF i is: {i}')
            print(f'int test is {scriptInt}')
            #print(f'script is {script[i]} quote end')
            #this was to find out which part of the string was a number and it's 80% accurate
            #for j in range (0, len(scriptString)):
                #print(f' j is {j} and string is {scriptString[j]}')
            #this printed the YYMMDD
            print(f'Rewritten the string is: "{scriptString[41]}{scriptString[42]}{scriptString[33]}{scriptString[34]}{scriptString[36]}{scriptString[37]}" quote end')
    print("end")

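I tried to pull the string out of the table, but it doesn't behave like an int, and the string is very messy. Because the string is so messy, I can't compare it with what I want. And since there are multiple td tags, I can't isolate it by td alone.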

Answer #1, by zi8p0yeb

I used the datetime and re modules to try to meet your requirements; I hope this helps. Here is the code:

    import re
    from datetime import datetime
    from bs4 import BeautifulSoup

    file_location = r"yourhtml.html"
    with open(file_location, "r") as f:
        file = f.read()
    soup = BeautifulSoup(file, "html.parser")
    # the date and the links all sit in cells with align="center"
    script = soup.find_all("td", align="center")
    print("beginning loop")
    for i in script:
        a_tags = i.find_all("a")
        if a_tags:
            # parsing JavaScript: capture the quoted argument of the call in href
            for a in a_tags:
                pattern = r"\('(.*?)'\)"
                match = re.search(pattern, a["href"])
                if match:
                    content = match.group(1)
                    print(content)
        try:
            date_obj = datetime.strptime(i.text.strip(), "%m/%d/%Y")
            # %y%m%d zero-pads month and day, so 1/5/1964 prints as 640105
            print(date_obj.strftime("%y%m%d"))
        except ValueError:
            # this cell does not hold a date; move on to the next one
            continue
    print("end")
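The code above prints the argument of every JavaScript link, including the confirm_delete one the question wants to ignore. A possible variation, sketched below, walks the table row by row, keeps only the `javascript:PBC(...)` argument, and pairs it with that row's date. The function name `extract_rows` and the `SAMPLE` markup are illustrative assumptions modeled on the dummy table, not part of the original answer.

```python
import re
from datetime import datetime

from bs4 import BeautifulSoup

# Matches only PBC(...) links, so confirm_delete(...) is skipped.
PBC_ARGUMENT = re.compile(r"javascript:PBC\('(.*?)'\)")

# Illustrative markup modeled on the dummy table in the question.
SAMPLE = """<table><tbody><tr class="results_row2">
<td align="left">hdjk sgdhjk</td>
<td align="center">11/10/1964</td>
<td align="center">5</td>
<td align="center">
<a href="javascript:confirm_delete('ignore me')">Delete</a><br>
<a href="javascript:PBC('grab me ')">LINK TITLE</a><br>
</td>
</tr></tbody></table>"""


def extract_rows(html):
    """Return one (YYMMDD, PBC argument) tuple per table row that has both."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.find_all("tr"):
        date_code = None
        link_argument = None
        for cell in row.find_all("td", align="center"):
            text = cell.get_text(strip=True)
            try:
                parsed = datetime.strptime(text, "%m/%d/%Y")
                date_code = parsed.strftime("%y%m%d")
            except ValueError:
                pass  # not a date cell (e.g. the "5" cell or the link cell)
            for anchor in cell.find_all("a", href=True):
                match = PBC_ARGUMENT.search(anchor["href"])
                if match:
                    link_argument = match.group(1).strip()
        if date_code and link_argument:
            results.append((date_code, link_argument))
    return results


print(extract_rows(SAMPLE))
```

Grouping by `tr` keeps each date with the link from the same row, which avoids relying on cell positions such as "every 7th cell".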