Word counts across multiple files in Hadoop MapReduce

Asked by oalqel3c on 2021-06-03 in Hadoop

I'm new to Hadoop MapReduce and am working on word count. The input is a number of text files, and I have to report the frequency of every word in each individual file and build a term vector table from the counts. I've read that term vectors are part of the Apache Lucene library. I'm new to all of this, so I don't know how to go about it. How should I do this? Thanks.

The output should look like the following table:
        an  apple  is  not  orange  the
Doc1     1      5   8   22       0   32
Doc2     0      6  10   19       0   13
Doc3     3     12  15    4       8    5

Here is my mapper class:

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class Mapper2 extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final List<String> STOPWORDS = Arrays.asList(
            "a", "about", "above", "after", "again", "against", "all", "am",
            "an", "and", "any", "are", "as", "at", "be", "by", "com", "do",
            "for", "from", "how", "i", "in", "is", "it", "not", "of", "on",
            "or", "that", "the", "this", "to", "was", "what", "when",
            "where", "who", "will", "with");

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Get the name of the input file this split came from.
        FileSplit split = (FileSplit) context.getInputSplit();
        String filename = split.getPath().getName();

        for (String token : value.toString().split("\\W+")) {
            // Lowercase so the stopword check and the counts are
            // case-insensitive ("This" and "this" become one word).
            String word = token.toLowerCase();
            if (!word.isEmpty() && !STOPWORDS.contains(word)) {
                // Use an explicit separator in the composite key so the
                // reducer can split filename and word apart again.
                context.write(new Text(filename + "\t" + word), ONE);
            }
        }
    }
}
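So far I only have the mapper. I imagine the reducer that sums the counts would look something like the sketch below (the class name Reducer2 and the tab-separated composite key are my own assumptions):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reducer2 extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the 1s the mapper emitted for this (filename, word) key.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}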

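And a driver along these lines should wire the job together (again a sketch; the Driver class name and the argument paths are placeholders). Because the reduce step is a plain sum with matching key/value types, the same class can be reused as a combiner:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "per-file word count");
        job.setJarByClass(Driver.class);
        job.setMapperClass(Mapper2.class);
        // The reduce is a plain sum, so it can safely double as a combiner.
        job.setCombinerClass(Reducer2.class);
        job.setReducerClass(Reducer2.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With the default TextOutputFormat this job writes one line per (file, word) pair, e.g. Doc1, apple, 5. Turning those lines into the Doc x word table above would then be a small post-processing step (or a second job keyed on the filename). As far as I can tell, Lucene's term vectors are a separate indexing feature and aren't actually needed just to build this table.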