Unable to verify crawled data stored in HBase

Posted by jjhzyzn0 on 2021-06-09 in HBase


I crawled a website using Nutch with HBase as the storage backend, following this tutorial: http://wiki.apache.org/nutch/Nutch2Tutorial
The versions are Nutch 2.2.1, HBase 0.90.4, and Solr 4.7.1.
These are the steps I ran:
./runtime/local/bin/nutch inject urls
./runtime/local/bin/nutch generate -topN 100 -adddays 30
./runtime/local/bin/nutch fetch -all
./runtime/local/bin/nutch fetch -all
./runtime/local/bin/nutch updatedb
./runtime/local/bin/nutch solrindex http://localhost:8983/solr/ -all

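My url/seed.txt file contains:
http://www.xyzshoppingsite.com/mobiles/

In regex-urlfilter.txt I kept only the following line (all other regexes are commented out):
+^http://([a-z0-9]*\.)*xyzshoppingsite.com/mobile/*

At the end of the crawl I can see a table named "webpage" created in HBase, but I cannot verify whether all of the data was actually crawled. Searching in Solr shows nothing; the result count is 0.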
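One quick sanity check is whether the filter rule above accepts the seed URL at all. The sketch below tests a few URLs against that rule with grep -E. It is only an approximation: it assumes Nutch's regex URL filter accepts a URL when the pattern is found anywhere in it, and that grep's extended regex behaves like Java's for this simple pattern. The helper name check_url and the laptops URL are made up for illustration.

```shell
#!/usr/bin/env bash
# The single rule kept in regex-urlfilter.txt, without the leading "+".
pattern='^http://([a-z0-9]*\.)*xyzshoppingsite.com/mobile/*'

# Hypothetical helper: print "accept" if the URL matches the rule, else "reject".
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo "accept"
  else
    echo "reject"
  fi
}

check_url 'http://www.xyzshoppingsite.com/mobiles/'   # seed URL
check_url 'http://www.xyzshoppingsite.com/mobiles'    # redirect target seen in the scan output
check_url 'http://www.xyzshoppingsite.com/laptops/'   # illustrative URL outside the section
```

Note that the trailing /* in the rule means "zero or more slashes", not "anything under /mobile/". Under the assumption above, the rule still accepts every URL that starts with the /mobile prefix, including the /mobiles pages.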
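My ultimate goal is to get the complete data from all pages under the mobiles URL above.
Could you please tell me:
How can I verify the crawled data present in HBase?
The Solr log directory contains 0 files, so I have nothing to go on. How can I resolve this?
The output of the HBase command scan "webpage" shows only timestamp data and other data such as
' value=\x0A\x0APlease Wait ... Redirecting to <a href="/mobiles"><b>http://www.xyzshoppingsite.com/mobiles</b></a>Please Wait ... Redirecting to <a href="/mobiles"><b>http://www.xyzshoppingsite.com/mobiles</b></a> '
Why was this data crawled, rather than the actual content of the page after the redirect?
Please help. Thanks in advance.
Thanks and regards!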

hbase solr nutch web-crawler

Source: https://stackoverflow.com/questions/23564206/unable-to-verify-crawled-data-stored-in-hbase

1 Answer


kgsdhlau1#

Could you use the command below instead of running all of those steps?

./bin/crawl url/seed.txt shoppingcrawl http://localhost:8080/solr 2

If it runs successfully, a table named shoppingcrawl_webpage will be created in HBase.
We can check this by running the command below in the HBase shell:

hbase> list

Then we can scan that specific table. In this case:

hbase> scan 'shoppingcrawl_webpage'
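A full scan can be hard to read on a large table. As a rough sketch, the same check can be run non-interactively by feeding a few commands to the HBase shell on standard input: a row count, then a scan limited to a couple of columns and ten rows. The column qualifiers f:st (fetch status) and f:typ (content type) are an assumption based on Nutch 2.x's default gora-hbase-mapping.xml, and the TABLE variable is introduced here only for illustration; adjust both to your setup.

```shell
#!/usr/bin/env bash
# Table name from the answer above; set TABLE=webpage if no crawl ID was used.
table="${TABLE:-shoppingcrawl_webpage}"
script="$(mktemp)"

# Write the HBase shell commands to a temporary file.
# f:st and f:typ are assumed qualifiers from Nutch's default HBase mapping.
cat > "$script" <<EOF
count '$table'
scan '$table', {COLUMNS => ['f:st', 'f:typ'], LIMIT => 10}
EOF

# Run the commands only if the hbase launcher is available on PATH.
if command -v hbase >/dev/null 2>&1; then
  hbase shell < "$script"
else
  echo "hbase not found on PATH; commands were written to $script" >&2
fi
```

If the count is 0, or every row holds only the redirect page, the crawl did not reach the real content, which points at the fetch or filter stage rather than at Solr.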

2021-06-09
