Reading the first 100 lines

Posted by ffx8fchx on 2021-06-03 in Hadoop

Please see the code below:
wcmapper.php (mapper for the hadoop streaming job)


#!/usr/bin/php
<?php
//sample mapper for hadoop streaming job
$word2count = array();

// input comes from STDIN (standard input)
while (($line = fgets(STDIN)) !== false) {
   // remove leading and trailing whitespace and lowercase
   $line = strtolower(trim($line));
   // split the line into words while removing any empty string
   $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
   // increase counters
   foreach ($words as $word) {
       // start missing keys at 0 to avoid undefined-index warnings
       $word2count[$word] = ($word2count[$word] ?? 0) + 1;
   }
}

// write the results to STDOUT (standard output)

foreach ($word2count as $word => $count) {
   // tab-delimited
   echo "$word\t$count\n";
}

?>

wcreducer.php (reducer script for the sample hadoop job)


#!/usr/bin/php
<?php
//reducer script for sample hadoop job
$word2count = array();

// input comes from STDIN
while (($line = fgets(STDIN)) !== false) {
    // remove leading and trailing whitespace
    $line = trim($line);
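    // parse the input we got from wcmapper.php; skip lines without a tab
    $parts = explode("\t", $line, 2);
    if (count($parts) < 2) continue;
    list($word, $count) = $parts;
    // convert count (currently a string) to int
    $count = intval($count);
    // sum counts, starting missing keys at 0
    if ($count > 0) $word2count[$word] = ($word2count[$word] ?? 0) + $count;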
}

ksort($word2count);  // sort the words alphabetically

// write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
    echo "$word\t$count\n";
}

?>

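This code is a wordcount streaming job written in PHP for the commoncrawl dataset.
As written, the code reads the entire input. That is not what I need: I need to read the first 100 lines and write them to a text file. I am a beginner with Hadoop, commoncrawl, and PHP. So, how can I do this?
Please help.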

hadoop php common-crawl web-crawler web-services

Source: https://stackoverflow.com/questions/20854701/reading-the-first-100-lines

2 Answers

6ie5vjzr1#

I don't know how you define a "line", but if it is words you want, you could do something like this:

// $word2count is keyed by word, not by integer, so take the first 100 entries
foreach (array_slice($word2count, 0, 100, true) as $word => $count) {
      echo "$word\t$count\n";
}

2021-06-03

3b6akqbq2#

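Use a counter in the first loop and stop the loop when the counter reaches 100. Then add a dummy loop that simply reads to the end of the input, and continue with the rest of your code (writing the results to STDOUT). Writing the results can also come before the dummy loop that reads until the end of the STDIN input. Sample code follows: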

...
// input comes from STDIN (standard input)
for ($i=1; $i<=100; $i++){
   // read the line from STDIN and stop early if the input ends
   $line = fgets(STDIN);
   if ($line === false) break;
   // remove leading and trailing whitespace and lowercase
   $line = strtolower(trim($line));
   // split the line into words while removing any empty string
   $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
   // increase counters
   foreach ($words as $word) {
       // start missing keys at 0 to avoid undefined-index warnings
       $word2count[$word] = ($word2count[$word] ?? 0) + 1;
   }
}

// write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
   // tab-delimited
   echo "$word\t$count\n";
}

// Dummy loop (to consume all the mapper input; it may work
// without this loop but I am not sure if this will confuse the
// Hadoop framework; you can try it without this loop and see)
while (($line = fgets(STDIN)) !== false) {
}
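
The question asks for the first 100 lines themselves rather than word counts, so the same counter-plus-drain idea can be sketched as a pass-through mapper. This is only a sketch under assumptions: the helper name copyFirstLines and its parameters are hypothetical and do not come from the original post, and it writes to a stream you pass in (STDOUT in a streaming job) rather than to a named text file.

```php
<?php
// Hypothetical helper (not from the original post): copy at most $limit
// lines from the stream $in to the stream $out, then keep reading so the
// rest of the input is consumed, as the dummy loop above does.
function copyFirstLines($in, $out, int $limit = 100): int
{
    $copied = 0;
    while (($line = fgets($in)) !== false) {
        if ($copied < $limit) {
            // normalize the line ending so every emitted line ends in "\n"
            fwrite($out, rtrim($line, "\r\n") . "\n");
            $copied++;
        }
        // lines past the limit are read and discarded
    }
    return $copied; // number of lines actually written
}
```

In a mapper script the call would be copyFirstLines(STDIN, STDOUT, 100); and the output then lands in the job's output directory, not in a local file. Note that Hadoop Streaming starts one mapper per input split, so each mapper emits up to 100 lines; getting exactly 100 lines overall would need a further cap, for example the same helper run in a single reducer.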

2021-06-03
