I have been trying hard to cut down the large number of rows I get from an XML sitemap, but I can't find a way to reduce it.
import advertools as adv
import pandas as pd
site = "https://www.halfords.com/sitemap_index.xml"
sitemap = adv.sitemap_to_df(site)
sitemap = sitemap.dropna(subset=["loc"]).reset_index(drop=True)
# Some sitemaps keep URLs with a trailing "/", some without
# If there is a trailing "/", take the second-to-last segment as the slug
# Otherwise, the last segment is the slug
ends_with_slash = sitemap['loc'].str.endswith('/')
slugs = sitemap['loc'][ends_with_slash].str.split('/').str[-2].str.replace('-', ' ')
slugs2 = sitemap['loc'][~ends_with_slash].str.split('/').str[-1].str.replace('-', ' ')
# Merge two series
slugs = list(slugs) + list(slugs2)
# adv.word_frequency automatically removes the stop words
word_counts_onegram = adv.word_frequency(slugs)
word_counts_twogram = adv.word_frequency(slugs, phrase_len=2)
competitor = pd.concat([word_counts_onegram, word_counts_twogram])\
.rename({'abs_freq':'Count','word':'Ngram'}, axis=1)\
.sort_values('Count', ascending=False)
competitor.to_csv('competitor.csv',index=False)
competitor
competitor.shape
(67758, 2)
I have gone through several blogs and Stack Overflow resources, but nothing seemed to help. I assume that is because I don't have much coding expertise.
1 Answer
Two things:
1. You can use `adv.url_to_df` to split the URLs and get the slugs (there should be a column named `last_dir`):

| | url | scheme | netloc | path | query | fragment | dir_1 | dir_2 | dir_3 | dir_4 | dir_5 | dir_6 | dir_7 | dir_8 | dir_9 | last_dir |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | https://www.halfords.com/cycling/cycling-technology/helmet-cameras/removu-k1-4k-camera-and-stabiliser-694977.html | https | www.halfords.com | /cycling/cycling-technology/helmet-cameras/removu-k1-4k-camera-and-stabiliser-694977.html | nan | nan | cycling | cycling-technology | helmet-cameras | removu-k1-4k-camera-and-stabiliser-694977.html | nan | nan | nan | nan | nan | removu-k1-4k-camera-and-stabiliser-694977.html |
| 1 | https://www.halfords.com/technology/bluetooth-car-kits/jabra-drive-bluetooth-speakerphone---white-695094.html | https | www.halfords.com | /technology/bluetooth-car-kits/jabra-drive-bluetooth-speakerphone---white-695094.html | nan | nan | technology | bluetooth-car-kits | jabra-drive-bluetooth-speakerphone---white-695094.html | nan | nan | nan | nan | nan | nan | jabra-drive-bluetooth-speakerphone---white-695094.html |
| 2 | https://www.halfords.com/tools/power-tools-and-accessories/power-tools/stanley-fatmax-v20-18v-combi-drill-kit-695102.html | https | www.halfords.com | /tools/power-tools-and-accessories/power-tools/stanley-fatmax-v20-18v-combi-drill-kit-695102.html | nan | nan | tools | power-tools-and-accessories | power-tools | stanley-fatmax-v20-18v-combi-drill-kit-695102.html | nan | nan | nan | nan | nan | stanley-fatmax-v20-18v-combi-drill-kit-695102.html |
| 3 | https://www.halfords.com/technology/dash-cams/mio-mivue-c450-695262.html | https | www.halfords.com | /technology/dash-cams/mio-mivue-c450-695262.html | nan | nan | technology | dash-cams | mio-mivue-c450-695262.html | nan | nan | nan | nan | nan | nan | mio-mivue-c450-695262.html |
| 4 | https://www.halfords.com/technology/dash-cams/mio-mivue-818-695270.html | https | www.halfords.com | /technology/dash-cams/mio-mivue-818-695270.html | nan | nan | technology | dash-cams | mio-mivue-818-695270.html | nan | nan | nan | nan | nan | nan | mio-mivue-818-695270.html |
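To make the idea of a "last directory" slug concrete, here is a minimal standard-library sketch of roughly the same step. It is not how advertools implements `url_to_df`; the helper names `last_segment` and `slug_words` are hypothetical, and the third URL is an invented trailing-slash variant added only to show that case.

```python
from collections import Counter
from urllib.parse import urlsplit


def last_segment(url):
    """Return the last non-empty path segment of a URL, or '' if there is none."""
    parts = [p for p in urlsplit(url).path.split("/") if p]
    return parts[-1] if parts else ""


def slug_words(url):
    """Split the last segment into lowercase words, dropping a '.html' suffix."""
    slug = last_segment(url).lower()
    if slug.endswith(".html"):
        slug = slug[: -len(".html")]
    return [w for w in slug.split("-") if w]


# Two URLs from the table above plus a hypothetical trailing-slash variant.
urls = [
    "https://www.halfords.com/technology/dash-cams/mio-mivue-c450-695262.html",
    "https://www.halfords.com/technology/dash-cams/mio-mivue-818-695270.html",
    "https://www.halfords.com/technology/dash-cams/",
]

# Count single words across all slugs (no stop-word removal in this sketch).
counts = Counter(word for url in urls for word in slug_words(url))
```

Because empty segments are dropped before picking the last one, URLs with and without a trailing "/" go through the same code path, so no separate branch is needed for the two cases.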
2. pandas provides some options that you can change. For example:
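One option that is likely relevant here is `display.max_rows`, which controls how many rows pandas prints. The sketch below is an assumption about what is meant, not a quote from the original answer, and the `competitor` frame in it is a made-up stand-in for the one in the question. Note that display options only change what is shown; the DataFrame itself keeps all its rows.

```python
import pandas as pd

# Hypothetical stand-in for the `competitor` table from the question.
competitor = pd.DataFrame({
    "Ngram": [f"word {i}" for i in range(200)],
    "Count": range(200, 0, -1),
})

# Change the setting for the rest of the session.
pd.set_option("display.max_rows", 20)

# Or change it only inside a block, leaving the session setting alone.
with pd.option_context("display.max_rows", 10):
    preview = repr(competitor)  # truncated text representation

# Restore the library default.
pd.reset_option("display.max_rows")
```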
As you did, you can easily create one-grams and bigrams, combine them, and display them:
| | word | abs_freq |
| --- | --- | --- |
| 0 | halfords | 2985 |
| 1 | car | 1430 |
| 2 | bike | 922 |
| 3 | kit | 829 |
| 4 | black | 777 |
| 5 | laser | 686 |
| 6 | set | 614 |
| 7 | wheel | 540 |
| 8 | pack | 524 |
| 9 | mats | 511 |
| 10 | car mats | 478 |
| 11 | thule | 453 |
| 12 | paint | 419 |
| 13 | 4 | 413 |
| 14 | spray | 382 |
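If the goal is to shrink the combined table itself rather than just its display, one possible approach is to drop rare n-grams and keep only the top few. This is a sketch under assumptions: the counts below are made up for illustration, `min_count` and `top_n` are hypothetical parameters, and the column names `Ngram` and `Count` follow the rename in the question.

```python
import pandas as pd

# Made-up counts for illustration only; not taken from the real sitemap.
competitor = pd.DataFrame({
    "Ngram": ["car", "car mats", "bike", "roof box", "spray", "wiper blade"],
    "Count": [40, 12, 25, 3, 9, 1],
})

min_count = 5  # hypothetical threshold: drop n-grams seen fewer than 5 times
top_n = 3      # hypothetical cap on how many rows to keep

trimmed = (
    competitor[competitor["Count"] >= min_count]  # remove the long tail
    .nlargest(top_n, "Count")                     # keep the most frequent rows
    .reset_index(drop=True)
)
```

The trimmed frame can then be written out with `to_csv` exactly as in the question; suitable values for the threshold and the cap depend on how long the tail of one-off n-grams is in the real data.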
Hope that helps?