Python - Parsing XML and HTML with XPath

Reposted by x33g5p2x on 2021-09-19 under HTML5

Absolute path: /html/body/div/a

Relative path: ./a

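Terminology

Tree: the entire HTML or XML structure

Node: each tag in an HTML document; in XML, every tag is likewise a node

Root node: the first node of the tree; the root node of an HTML document is the html tag

Attribute: an attribute of a node (in HTML, the tag's attributes)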

    from lxml import etree
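
The terms above map directly onto lxml objects. Here is a minimal sketch, assuming a made-up `<shop>` document that is not part of the original article:

```python
from lxml import etree

# Hypothetical two-item document, used only to illustrate the vocabulary.
root = etree.XML('<shop><item id="1">tea</item><item id="2">milk</item></shop>')

print(root.tag)                # the root node of the tree: shop
first = root[0]                # a child node (the first <item> element)
print(first.tag, first.text)   # item tea
print(dict(first.attrib))      # the node's attributes: {'id': '1'}
print(first.getparent().tag)   # walking back up the tree: shop
```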

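Parsing XML with XPath

The XML data format

JSON and XML are two general-purpose data formats used to exchange data between programs written in different languages.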

Transmitting the product data of one supermarket:

    JSON:
    {
        "name": "永辉超市",
        "address": "肖家河",
        "goods": [
            {"name": "泡面", "price": 3.5, "count": 50},
            {"name": "火腿肠", "price": 3, "count": 200},
            {"name": "矿泉水", "price": 2, "count": 30}
        ]
    }

    XML:
    <supermarket>
        <name>永辉超市</name>
        <address>肖家河</address>
        <goodsList>
            <goods name="泡面" price="3.5" count="50"></goods>
            <goods name="火腿肠" price="3" count="200"></goods>
            <goods name="矿泉水" price="2" count="30"></goods>
        </goodsList>
        <workerList>
            <cashier name="张三" pay="4000"></cashier>
            <shoppingGuide name="李四" pay="3000"></shoppingGuide>
        </workerList>
    </supermarket>
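
To make the comparison concrete, the sketch below reads the same goods from a JSON string and from an XML string and checks that both yield the same price table. The shortened `json_text` and `xml_text` values are trimmed-down assumptions based on the data above, not the full listing.

```python
import json
from lxml import etree

# Trimmed-down copies of the supermarket data (two goods only).
json_text = '{"name": "永辉超市", "goods": [{"name": "泡面", "price": 3.5}, {"name": "矿泉水", "price": 2}]}'
xml_text = ('<supermarket><name>永辉超市</name><goodsList>'
            '<goods name="泡面" price="3.5"/><goods name="矿泉水" price="2"/>'
            '</goodsList></supermarket>')

# JSON keeps numbers as numbers.
data = json.loads(json_text)
json_prices = {g["name"]: float(g["price"]) for g in data["goods"]}

# XML attribute values are always strings, so convert explicitly.
root = etree.XML(xml_text)
xml_prices = {g.get("name"): float(g.get("price")) for g in root.xpath("goodsList/goods")}

print(json_prices == xml_prices)  # True
```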
Prepare the data

    xml_data = """
    <supermarket>
        <name>永辉超市</name>
        <address>肖家河</address>
        <goodsList>
            <goods name="泡面" price="3.5" count="50"></goods>
            <goods name="火腿肠" price="3" count="200"></goods>
            <goods name="矿泉水" price="2" count="30"></goods>
        </goodsList>
        <workerList>
            <cashier name="张三" pay="4000"></cashier>
            <shoppingGuide name="李四" pay="3000"></shoppingGuide>
        </workerList>
    </supermarket>
    """

Create the tree object and get the root node of the data

    supermarket = etree.XML(xml_data)
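
`etree.XML` returns the root element itself, so it can be inspected right away. A small sketch, using a shortened copy of the data (an assumption made for brevity):

```python
from lxml import etree

# Shortened copy of the supermarket document.
xml_data = """
<supermarket>
    <name>永辉超市</name>
    <goodsList><goods name="泡面" price="3.5" count="50"/></goodsList>
    <workerList><cashier name="张三" pay="4000"/></workerList>
</supermarket>
"""

supermarket = etree.XML(xml_data)
print(supermarket.tag)                           # supermarket
child_tags = [child.tag for child in supermarket]
print(child_tags)                                # ['name', 'goodsList', 'workerList']
```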

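Getting tags (nodes)

node.xpath(path)

a. Absolute path: no matter which node object xpath is called on, the path is written starting from the root node.

Syntax: /absolute/path

    cashier = supermarket.xpath('/supermarket/workerList/cashier')

b. Relative path: . stands for the current node, which is whichever node xpath is called on.

.. stands for the parent of the current node.

Note: the leading ./ can be omitted.

    cashier = supermarket.xpath('./workerList/cashier')
    print(cashier)  # [<Element cashier at 0x1d4299ba980>]
    # supermarket is the root element, so .. has no parent element to step up to
    cashier = supermarket.xpath('../workerList/cashier')
    print(cashier)  # []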
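
Relative paths are easier to follow when xpath is called on a node other than the root. The sketch below uses a shortened copy of the worker list (an assumption for brevity) to show `.`, the omitted `./`, and `..` in turn:

```python
from lxml import etree

root = etree.XML('<supermarket><workerList>'
                 '<cashier name="张三"/><shoppingGuide name="李四"/>'
                 '</workerList></supermarket>')

worker_list = root.xpath('./workerList')[0]

cashier = worker_list.xpath('./cashier')   # child of the current node
same = worker_list.xpath('cashier')        # identical, with ./ omitted

# From the cashier, go up to workerList and back down to its sibling.
sibling = cashier[0].xpath('../shoppingGuide/@name')
print(sibling)                             # ['李四']

# The root has no parent element, so this finds nothing.
from_root = root.xpath('../workerList')
print(from_root)                           # []
```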

c. //path: a global search that starts from any position in the document.

How it searches, and what it finds, does not depend on which node xpath is called on.

    result = supermarket.xpath('//cashier')
    print(result)  # [<Element cashier at 0x1d4299ba980>]
    goods = supermarket.xpath('//goodsList/goods')
    print(goods)  # [<Element goods at 0x1d4299ba9c0>, <Element goods at 0x1d4299baa00>, <Element goods at 0x1d4299baa40>]
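
Because `//` ignores the calling node, it helps to contrast it with `.//`, which searches only below the current node. A minimal sketch on a shortened document (an assumption for brevity):

```python
from lxml import etree

root = etree.XML('<supermarket>'
                 '<goodsList><goods name="泡面"/></goodsList>'
                 '<workerList><cashier name="张三"/></workerList>'
                 '</supermarket>')

worker_list = root.xpath('workerList')[0]

# // searches the whole document, even when called on workerList.
global_hits = worker_list.xpath('//goods/@name')
print(global_hits)   # ['泡面']

# .// searches only the descendants of workerList, which has no goods.
local_hits = worker_list.xpath('.//goods/@name')
print(local_hits)    # []
```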

Getting node content

Syntax: path-to-node/text()

    name = supermarket.xpath('./name/text()')
    print(name)  # ['永辉超市']
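
Note that `text()` always yields a list, even for a single match. The sketch below shows one way to unpack it safely, along with the XPath `string()` function as an alternative. The `<phone>` tag is a deliberately missing, hypothetical tag.

```python
from lxml import etree

root = etree.XML('<supermarket><name>永辉超市</name><address>肖家河</address></supermarket>')

names = root.xpath('./name/text()')        # a list, even for one match
first_name = names[0] if names else None   # guard against an empty result

missing = root.xpath('./phone/text()')     # no such tag, so an empty list
as_string = root.xpath('string(./name)')   # string() returns one string

print(first_name, missing, as_string)
```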

Getting node attribute values

Syntax: path-to-node/@attribute-name

    goods = supermarket.xpath('//goodsList/goods/@name')
    print(goods)  # ['泡面', '火腿肠', '矿泉水']
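
Attribute values come back as strings. When several attributes of the same node are needed together, one option is to select the nodes first and read the attributes with `get`, converting types along the way. A sketch under that assumption:

```python
from lxml import etree

root = etree.XML('<supermarket><goodsList>'
                 '<goods name="泡面" price="3.5" count="50"/>'
                 '<goods name="火腿肠" price="3" count="200"/>'
                 '<goods name="矿泉水" price="2" count="30"/>'
                 '</goodsList></supermarket>')

# One dict per <goods> node, with numeric attributes converted.
records = [
    {'name': g.get('name'), 'price': float(g.get('price')), 'count': int(g.get('count'))}
    for g in root.xpath('//goodsList/goods')
]

total_value = sum(r['price'] * r['count'] for r in records)
print(total_value)  # 835.0
```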

Parsing HTML with XPath

    # Read test.html (listed further below) and build the tree
    with open('test.html', 'r', encoding='utf-8') as f:
        html = etree.HTML(f.read())
    h1 = html.xpath('/html/body/h1')
    print(h1)  # [<Element h1 at 0x273af64aa40>]
    h1 = html.xpath('//h1')
    print(h1)  # [<Element h1 at 0x273af64aa40>]
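
`etree.HTML` also accepts an HTML string directly, and it wraps fragments in `html` and `body` tags. That is why absolute paths can start at /html/body even for a partial page. A minimal sketch with an inline fragment (an assumption; the article reads test.html from disk):

```python
from lxml import etree

# A fragment with no <html> or <body> of its own.
html = etree.HTML('<h1>永辉超市</h1><p>肖家河大厦</p>')

print(html.tag)                          # html
title = html.xpath('/html/body/h1/text()')
print(title)                             # ['永辉超市']
```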

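Adding predicates (conditions)

Syntax: path-to-selected-tags[predicate]

1) [N]: selects the Nth tag among same-named tags at the same level (counting starts at 1)

    p = html.xpath('//p[1]/text()')  # text of every p that is the first p inside its parent
    print(p)  # ['肖家河大厦', '泡面', '矿泉水', '面包', '充电宝', 'p1', 'p1']
    p = html.xpath('./body/p/text()')
    print(p)  # ['肖家河大厦']
    result = html.xpath('body/ul/li[2]/p/text()')
    print(result)  # ['矿泉水', '2', '120']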
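
The index in `[N]` is applied per parent, which is why `//p[1]` can return several nodes. Wrapping the path in parentheses indexes the combined result instead. A sketch on a small made-up page:

```python
from lxml import etree

# Hypothetical page: two divs with two paragraphs each.
html = etree.HTML('<body><div><p>a</p><p>b</p></div><div><p>c</p><p>d</p></div></body>')

per_parent = html.xpath('//p[1]/text()')     # first p inside each parent
print(per_parent)                            # ['a', 'c']

overall = html.xpath('(//p)[1]/text()')      # first p in the whole document
print(overall)                               # ['a']

second = html.xpath('//div[2]/p[2]/text()')  # second p of the second div
print(second)                                # ['d']
```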

2) [last()]: selects the last tag at the same level

[last() - N]: selects the (N+1)th tag from the end at the same level

    counts = html.xpath('body/ul/li/p[last()]/text()')
    print(counts)  # ['15', '120', '42', '10']
    bread = html.xpath('body/ul/li[last() - 1]/p[last()]/text()')
    print(bread)  # ['42']
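
A quick way to check the `last()` arithmetic is a three-item list. This is a sketch on made-up data:

```python
from lxml import etree

html = etree.HTML('<ul><li>one</li><li>two</li><li>three</li></ul>')

last_item = html.xpath('//li[last()]/text()')
print(last_item)        # ['three']

second_last = html.xpath('//li[last() - 1]/text()')
print(second_last)      # ['two']
```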

3) [position() > N]

[position() >= N] (the operators < and <= work the same way)

    goods = html.xpath('body/ul/li[position() < 3]/p/text()')
    print(goods)  # ['泡面', '3.5', '15', '矿泉水', '2', '120']
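
Position tests can be combined with `and` to select a range. A sketch on a made-up four-item list:

```python
from lxml import etree

html = etree.HTML('<ul><li>one</li><li>two</li><li>three</li><li>four</li></ul>')

# Items 2 and 3 only.
middle = html.xpath('//li[position() >= 2 and position() <= 3]/text()')
print(middle)     # ['two', 'three']

# Everything after the first item.
rest = html.xpath('//li[position() > 1]/text()')
print(rest)       # ['two', 'three', 'four']
```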

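4) [@attribute-name]: selects tags that have the given attribute

    # body/div matches both div blocks in test.html
    result = html.xpath('body/div/p[@class]/text()')
    print(result)  # ['p1', 'p2', 'p4', 'p1']

[@attribute-name = value]: selects tags whose attribute has the given value

    result = html.xpath('body/div/p[@class = "c1"]/text()')
    print(result)  # ['p2']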
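
An empty attribute such as `class=""` still counts as having the attribute. The sketch below, on a snippet modelled on the first div of test.html, also shows the standard XPath `not()` function for the opposite test:

```python
from lxml import etree

html = etree.HTML('<div><p class="">p1</p><p class="c1">p2</p>'
                  '<p id="x">p3</p><p class="c2">p4</p></div>')

has_class = html.xpath('//p[@class]/text()')       # class="" is included
print(has_class)     # ['p1', 'p2', 'p4']

is_c1 = html.xpath('//p[@class = "c1"]/text()')
print(is_c1)         # ['p2']

no_class = html.xpath('//p[not(@class)]/text()')   # tags without the attribute
print(no_class)      # ['p3']
```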

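5) [tag > value] (also <, >=, <=, =): filters tags by the content of the given child tag

    # Keep each li whose second p (the price) is greater than 2; both forms are equivalent
    prices = html.xpath('body/ul/li[p[2]>2]/p/text()')
    prices = html.xpath('./body/ul/li[p[2]>2]/p/text()')
    print(prices)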
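
In XPath 1.0 the text of the child tag is converted to a number before the comparison, so no manual conversion is needed. A sketch on a shortened goods list (an assumption for brevity):

```python
from lxml import etree

html = etree.HTML('<ul>'
                  '<li><p>泡面</p><p>3.5</p></li>'
                  '<li><p>矿泉水</p><p>2</p></li>'
                  '<li><p>面包</p><p>5</p></li>'
                  '</ul>')

# Names of the goods whose price (second p) is above 2.
expensive = html.xpath('//li[p[2] > 2]/p[1]/text()')
print(expensive)      # ['泡面', '面包']

# Names of the goods whose price is exactly 2.
exactly_two = html.xpath('//li[p[2] = 2]/p[1]/text()')
print(exactly_two)    # ['矿泉水']
```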

6) Wildcard /*

Get the content of every tag under the last div

    result = html.xpath('body/div[last()]/*/text()')
    print(result)  # ['p1', 'p2', 'a1', 'span1']
    result = html.xpath('body/div[last()]/*[@class]/text()')
    print(result)  # ['p1', 'span1']
    result = html.xpath('body/div[last()]/*[@*]/text()')
    print(result)  # ['p1', 'a1', 'span1']
    result = html.xpath('//img/@*')
    print(result)  # ['https://image1.guazistatic.com/qn2107010956026670c8553db23db93154432c791292ae.jpg?imageView2/1/w/270/h/180/q/88', '']
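
The wildcard works for tag names (`*`) and attribute names (`@*`) alike. A sketch on a small made-up div:

```python
from lxml import etree

html = etree.HTML('<div><p class="x">p1</p><a href="#">a1</a><span>s1</span></div>')

tags = [e.tag for e in html.xpath('//div/*')]     # any child tag
print(tags)          # ['p', 'a', 'span']

with_attr = html.xpath('//div/*[@*]/text()')      # children with any attribute
print(with_attr)     # ['p1', 'a1']

all_values = html.xpath('//div/*/@*')             # every attribute value
print(all_values)    # ['x', '#']
```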

7) Union (selecting several paths at once)

    result = html.xpath('body/ul/li/p[1]/text()|body/ul/li/p[2]/text()')
    print(result)  # ['泡面', '3.5', '矿泉水', '2', '面包', '5', '充电宝', '150']
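
As the output above shows, the union comes back interleaved in document order rather than grouped by path. One possible way to turn it into name/price pairs is to slice it in steps of two. This is a sketch on a shortened list, and it assumes every li has both tags:

```python
from lxml import etree

html = etree.HTML('<ul>'
                  '<li><p>泡面</p><p>3.5</p></li>'
                  '<li><p>面包</p><p>5</p></li>'
                  '</ul>')

result = html.xpath('//li/p[1]/text()|//li/p[2]/text()')
print(result)   # ['泡面', '3.5', '面包', '5']

# Even indexes are names, odd indexes are prices.
pairs = list(zip(result[0::2], result[1::2]))
print(pairs)    # [('泡面', '3.5'), ('面包', '5')]
```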
test.html

    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>Title</title>
    </head>
    <body>
        <h1>永辉超市</h1>
        <p>肖家河大厦</p>
        <ul>
            <li>
                <p class="name">泡面</p>
                <p class="price">3.5</p>
                <p class="count">15</p>
            </li>
            <li>
                <p class="name">矿泉水</p>
                <p class="price">2</p>
                <p class="count">120</p>
            </li>
            <li>
                <p class="name">面包</p>
                <p class="price">5</p>
                <p class="count">42</p>
            </li>
            <li>
                <p class="name">充电宝</p>
                <p class="price">150</p>
                <p class="count">10</p>
            </li>
        </ul>
        <div>
            <p class="">p1</p>
            <p class="c1">p2</p>
            <p id="p1">p3</p>
            <p class="c2">p4</p>
        </div>
        <div id="div1">
            <p class="">p1</p>
            <p>p2</p>
            <a href="">a1</a>
            <span class="">span1</span>
            <img src="https://image1.guazistatic.com/qn2107010956026670c8553db23db93154432c791292ae.jpg?imageView2/1/w/270/h/180/q/88" alt="">
        </div>
    </body>
    </html>

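Scraping Douban movie data with XPath

Imports

    from selenium.webdriver import Chrome
    from selenium.webdriver.common.by import By
    from lxml import etree
    import csv
    import time

Loading more data (pagination)

    def get_more():
        # Selenium 4 removed find_element_by_css_selector; find_element with By replaces it
        more = browser.find_element(By.CSS_SELECTOR, '.more')
        more.click()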

Extracting the page data

    def get_message():
        # Reads the module-level movies list (one <a> element per movie)
        movie_name = []
        movie_score = []
        movie_poster = []
        movie_detail = []
        for movie in movies:
            # xpath() always returns a list, so each entry is a list of matches
            movie_name.append(movie.xpath('div/img/@alt'))
            movie_score.append(movie.xpath('p/strong/text()'))
            movie_poster.append(movie.xpath('div[@class = "cover-wp"]/img/@src'))
            movie_detail.append(movie.xpath('@href'))
        return zip(movie_name, movie_score, movie_poster, movie_detail)
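
Since every xpath call returns a list, one possible refinement is a small helper that takes the first match or falls back to a default. This keeps each row flat and the columns aligned when a field is missing. The sketch below is not the article's code: the `first` helper and the `<li>` markup are hypothetical and do not reflect Douban's real page structure.

```python
from lxml import etree

# Hypothetical markup; the second item deliberately has no score.
page = etree.HTML('''
<ul>
  <li data-url="/m/1"><img src="1.jpg" alt="Movie A"><strong>8.1</strong></li>
  <li data-url="/m/2"><img src="2.jpg" alt="Movie B"></li>
</ul>
''')

def first(node, path, default=''):
    """Return the first XPath match as a plain string, or default if none."""
    found = node.xpath(path)
    return str(found[0]) if found else default

# One flat tuple per item, with paths relative to each <li>.
rows = [
    (first(li, 'img/@alt'), first(li, 'strong/text()'),
     first(li, 'img/@src'), first(li, '@data-url'))
    for li in page.xpath('//li')
]
print(rows)
```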

Saving the data

    def writer(m_message):
        # Mode 'w' overwrites the file on every call; newline='' avoids blank rows on Windows
        with open('files/douban/movies.csv', 'w', encoding='utf-8', newline='') as file:
            csv_writer = csv.writer(file)
            csv_writer.writerow(['电影名称', '电影评分', '电影海报', '电影详情'])
            for movie in m_message:
                csv_writer.writerow(movie)
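
Opening the file in 'w' mode replaces its contents on each call, so only the last page survives when the function runs once per page. One possible alternative is to append and write the header only while the file is still empty. This is a sketch: `save_rows`, the temporary path, and the sample rows are all hypothetical.

```python
import csv
import os
import tempfile

def save_rows(path, rows, header):
    """Append rows to a CSV file, writing the header only if the file is empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', encoding='utf-8', newline='') as f:
        csv_writer = csv.writer(f)
        if write_header:
            csv_writer.writerow(header)
        csv_writer.writerows(rows)

# Two calls simulate two pages of results.
path = os.path.join(tempfile.mkdtemp(), 'movies.csv')
header = ['title', 'score']
save_rows(path, [('Movie A', '8.1')], header)
save_rows(path, [('Movie B', '7.9')], header)

with open(path, encoding='utf-8', newline='') as f:
    saved = list(csv.reader(f))
print(saved)  # [['title', 'score'], ['Movie A', '8.1'], ['Movie B', '7.9']]
```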

Running it

    browser = Chrome()
    for index in range(0, 121, 20):
        browser.get(f'https://movie.douban.com/explore#!type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start={index}')
        tree = etree.HTML(browser.page_source)
        movies = tree.xpath('body/div[@id = "wrapper"]/div[@id = "content"]/div[@*]/div[@class = "article"]/div[@class = "gaia"]/div[@class = "list-wp"]/div/a')
        writer(get_message())
        get_more()
        print(index)
        time.sleep(2)
