I have a list (in a tab-separated .txt file) that looks like this:
row col value
1 1 3.2
10 2 5.3
25 3 2.2
30 1 5.3
and so on. I would like to turn it into a sparse matrix such as:
      1     2     3
1     3.2
10          5.3
25                2.2
30    5.3
and then fill in the zeros. What is the easiest way to do this with Hadoop? (I need to use Hadoop because the matrix is about 3 TB...)
1 Answer
You can use Hive or Pig. Here is an example using Pig:
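-- Load the tab-separated (row, col, value) triples.
A = load 'input.txt' USING PigStorage('\t') AS (row:long, col:int, value:float);
-- Relation aliases are case-sensitive, so reference A (not a) and pass its fields to the UDF.
B = foreach A generate SOMEUDF(row, col, value);
store B into 'output.txt';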
Then you just need to define the UDF (user-defined function):
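import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class SOMEUDF extends EvalFunc<Tuple>
{
    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            // Generate the matrix line here and return it.
            // Placeholder so the class compiles: pass the input through unchanged.
            return input;
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
    }
}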
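The "generate the matrix line" step is left open above. One way to sketch it in plain Java, independent of Pig, is shown below. It rests on two assumptions that the answer does not state: that all triples belonging to one row have already been collected together (a foreach over the raw input calls the UDF once per triple, so some grouping by row would be needed first), and that the total number of columns is known in advance. `DenseRowFormatter` and its `format` method are hypothetical names chosen for illustration.

```java
import java.util.Map;

// Hypothetical helper, not part of Pig: turns the sparse entries of one
// matrix row into a tab-separated, zero-filled line.
final class DenseRowFormatter {

    private DenseRowFormatter() {
    }

    /**
     * @param row     row index, written as the first field of the line
     * @param entries 1-based column index -> value, for the cells present in the input
     * @param numCols total number of columns in the matrix (assumed to be known)
     * @return the row index followed by numCols values, separated by tabs
     */
    static String format(long row, Map<Integer, Double> entries, int numCols) {
        StringBuilder line = new StringBuilder(Long.toString(row));
        for (int col = 1; col <= numCols; col++) {
            Double value = entries.get(col);
            line.append('\t');
            // Cells with no entry in the input are filled with zero.
            line.append(value == null ? "0" : value.toString());
        }
        return line.toString();
    }
}
```

For the second input line (row 10, column 2, value 5.3) and a matrix with three columns, `DenseRowFormatter.format(10, entries, 3)` returns the row index followed by `0`, `5.3` and `0`, separated by tabs. Inside the UDF, the same loop could fill a Pig tuple instead of a string.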