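I am currently trying to bulk index data from a text file using Cloudera Search batch indexing on the Cloudera QuickStart VM. I believe something is wrong with my schema and morphline, because when I open the Solr dashboard after the job completes, indexing appears to have worked but there are no documents. The core shows up, but it contains zero documents. I am sure the command I am running and Cloudera Search itself work, because I was able to batch index an example earlier: with the example input file, schema, and morphline file, everything runs fine and the documents are indexed and added to the core. The command I use is: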
hadoop --config /etc/hadoop/conf.cloudera.yarn jar \
/usr/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool -D \
'mapred.child.java.opts=-Xmx500m' \
--log4j '/usr/share/doc/search-1.0.0+cdh5.4.0+0/examples/solr-nrt/log4j.properties' \
--morphline-file /usr/share/doc/search-1.0.0+cdh5.4.0+0/examples/solr-nrt/test-morphlines/readMultiLine.conf \
--output-dir hdfs://quickstart.cloudera:8020/user/outdir --verbose --go-live \
--zk-host 127.0.0.1:2181/solr --collection collection1 \
hdfs://quickstart.cloudera:8020/user/indir
My schema is:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="sentences" version="1.5">
<fields>
<field name="id" type="text_general" indexed="true" stored="true" required="true" multiValued="false" />
<field name="sentence" type="text_general" indexed="true" stored="false"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
<dynamicField name="ignored_*" type="ignored"/>
</fields>
<uniqueKey>id</uniqueKey>
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldType name="random" class="solr.RandomSortField" indexed="true" />
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
</types>
</schema>
For my morphline file, I am using one I found in the examples, which just reads single lines:
morphlines : [
{
id : morphline1
importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
commands : [
{
readLine {
ignoreFirstLine : true
commentPrefix : "#"
charset : UTF-8
}
}
{ logDebug { format : "output record: {}", args : ["@{}"] } }
]
}
]
My sample input is (doc id, tab, sentence):
1 For evening wear at the North Pole, girls could dress up in handsome Nordic sweaters and full iridescent taffeta skirts, or top one of the full striped skirts with a terrific short beige trench coat.
2 But working to change the communist-run system is illegal, and the party relentlessly punishes dissent.
3 Word of the latest document first came on Sept. 1, 1987, during a meeting between the pope and Jewish leaders in Castel Gandolfo, the pontiff's summer residence in the hills southeast of Rome.
4 Anita Moen-Guidon of Norway was third, 2:28.6 behind Lazutina, and Russia's Julia Chepalova fourth, 2:53.5 behind.
5 We have been beaten, we have shed blood, we have purchased the right to meet here today with our blood,'' said John Munuve, an assembly leader.
6 The folklore Nordic knits were handsome, in sweaters, or knee-length pants, and might have been topped by something like a super taffeta full coat.
7 Several politicians have charged that the high taxes Kenyans already pay go into the pockets of government officials or wasteful projects, and not into providing essential services and repairing crumbling infrastructure.
8 independence.
1 Answer
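In your schema.xml you have id as a required field. However, readLine only reads each line into the "message" field, so the records never get an id. You need to add an id to your documents. You can use something like setValues, or change readLine to readCSV with a tab separator and column names, one of which should be id.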
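The readCSV variant could look roughly like the sketch below. It is only an illustration under a few assumptions: the column names id and sentence are taken from the schema in the question, ignoreFirstLine is set to false because the sample input has no header line, and the SOLR_LOCATOR values are copied from the --zk-host and --collection options on the question's command line. The sanitizeUnknownSolrFields and loadSolr commands at the end are not in the question's morphline; they are added here on the assumption that the morphline should hand records to Solr the way the stock Cloudera example morphlines do.

```
# Hypothetical locator; values copied from the question's command line.
SOLR_LOCATOR : {
  collection : collection1
  zkHost : "127.0.0.1:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      {
        # Split each line on the tab character and name the two columns
        # after the schema fields, so every record carries an id.
        readCSV {
          separator : "\t"
          columns : [id, sentence]
          ignoreFirstLine : false   # the sample input has no header row
          commentPrefix : "#"
          trim : true
          charset : UTF-8
        }
      }
      { logDebug { format : "output record: {}", args : ["@{}"] } }
      # Assumption: drop any record fields the Solr schema does not define.
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }
      # Assumption: pass the record to Solr, as the example morphlines do.
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
```

With logDebug left in place, the job log should show whether each record now has an id and a sentence field before it reaches Solr.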