I have two files, and I am trying to join them on a pattern match.
File1 :
weather.bbc.co.uk,112
ads.facebook.com,113
ads.amazon.co.uk,114
www.sky.com,115
news.bbc.co.uk,116
pics.facebook.com,117
File2 :
facebook.com,facebook
bbc.co.uk,bbc
netflix.com,netflix
flipkart.com,flipkart
Expected output:
weather.bbc.co.uk,112,bbc.co.uk,bbc
ads.facebook.com,113,facebook.com,facebook
news.bbc.co.uk,116,bbc.co.uk,bbc
pics.facebook.com,117,facebook.com,facebook
Script
file1 = LOAD '/file1' using PigStorage('|') as (request_domain: chararray,msisdn:int);
file2 = LOAD '/file2' using PigStorage('|') as (domain: chararray,provider: chararray);
file3 = JOIN file1 by case when (request_domain MATCHES CONCAT(CONCAT('(?i).*',file2.domain),'.*')) then file2.domain else 'Other' end LEFT OUTER,file2 by domain;
DESCRIBE file3;
dump file3;
But I get the following error:
WARN [Thread-29] org.apache.hadoop.mapred.LocalJobRunner - job_local_0006
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (facebook.com,facebook), 2nd :(bbc.co.uk,bbc)
    at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:111)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextString(POUserFunc.java:432)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:317)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:221)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:275)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextString(POUserFunc.java:432)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:317)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:221)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:275)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextString(POUserFunc.java:432)
1 Answer
The delimiter should be ',' rather than '|', i.e. PigStorage(',').
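The pattern will match more than one value. Referencing file2.domain inside the JOIN expression makes Pig treat file2 as a scalar, and a scalar must come from a relation with exactly one row, which is why the error reports two rows. Try using CROSS together with the INDEXOF UDF instead.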
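The CROSS-plus-INDEXOF approach could look roughly like the sketch below. It reuses the paths and field names from the question; the relation names crossed and matched are made up for illustration, and the LOWER calls are an assumption added to mimic the case-insensitive (?i) flag in the original pattern.

```pig
-- Load both files with a comma delimiter.
file1 = LOAD '/file1' USING PigStorage(',') AS (request_domain:chararray, msisdn:int);
file2 = LOAD '/file2' USING PigStorage(',') AS (domain:chararray, provider:chararray);

-- Pair every request with every known domain.
crossed = CROSS file1, file2;

-- Keep only the pairs where the request domain contains the known domain.
-- INDEXOF returns -1 when the substring is not found.
matched = FILTER crossed BY INDEXOF(LOWER(file1::request_domain), LOWER(file2::domain), 0) >= 0;

DESCRIBE matched;
DUMP matched;
```

On the sample data this should keep the four bbc.co.uk and facebook.com rows and drop ads.amazon.co.uk and www.sky.com, though the order of the dumped rows is not guaranteed. Note that INDEXOF is a plain substring test, so it is looser than a true suffix match, and CROSS produces every pairing of the two inputs, so it gets expensive when both files are large.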