Java: How to extract the titles that contain 3 or more a's?

bq3bfh9z posted on 2021-06-04 in Hadoop


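I wrote this code (http://paste.ubuntu.com/5730390/) and am trying to extract the titles that contain 3 or more a's (uppercase or lowercase) or α's (the Greek letter) from some websites. I have stored the content of the websites (there are a lot of them) in txt format on my local hard drive.
My input in the DFS is: site_1.txt, site_2.txt, site_3.txt, etc.
Assume that the following titles belong to site_1.txt, site_2.txt, and site_3.txt respectively.
Academia.edu - Share research
Google
News12.gr | Αθλητική Ενημέρωση από τα Δωδεκάνησα
Now I want the output to contain titles 1 and 3 (3 because it is a Greek website and contains the letter "α"), in a form like this:
Academia.edu - Share research, site_1.txt
News12.gr | Αθλητική Ενημέρωση από τα Δωδεκάνησα, site_3.txt
I tried regex patterns such as "?:[αa{3,}]).(?:[α", but without result. Can anyone help?
Thanks in advance!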

Java hadoop mapreduce

Source: https://stackoverflow.com/questions/16904897/how-to-extract-the-title-which-contains-3-or-more-as

3 answers


kulphzqa1#

You can use replace to achieve this:

public static int howMany(String str, char c) {
    String str2 = str.replace(c+"", "");
    return str.length() - str2.length();
}

Then you can use the method above:

for (String website : websites) {
    // Print titles with at least 3 occurrences of 'a' or at least 3 of 'α'.
    if (howMany(website, 'a') >= 3 || howMany(website, 'α') >= 3) {
        System.out.println(website);
    }
}
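The question asks for uppercase and lowercase letters alike, while the replace-based count above is case-sensitive. A minimal sketch of a case-insensitive variant follows. The class name LetterCount and the methods countIgnoreCase and qualifies are hypothetical names chosen for illustration, and the sketch assumes that 'a' and 'α' are counted separately and that accented forms such as 'ά' do not count.

```java
public class LetterCount {

    // Counts how often `letter` occurs in `text`, ignoring case.
    // Character.toLowerCase also maps Greek capitals, e.g. 'Α' (U+0391) to 'α'.
    public static int countIgnoreCase(String text, char letter) {
        char target = Character.toLowerCase(letter);
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.toLowerCase(text.charAt(i)) == target) {
                count++;
            }
        }
        return count;
    }

    // Assumption: a title qualifies if it has at least 3 Latin a's
    // or at least 3 Greek alphas (the two counts are not combined).
    public static boolean qualifies(String title) {
        return countIgnoreCase(title, 'a') >= 3
                || countIgnoreCase(title, 'α') >= 3;
    }

    public static void main(String[] args) {
        String[] titles = {
                "Academia.edu - Share research",
                "Google",
                "News12.gr | Αθλητική Ενημέρωση από τα Δωδεκάνησα"};
        for (String title : titles) {
            if (qualifies(title)) {
                System.out.println(title);
            }
        }
    }
}
```

If accented letters should count as well, the title would need to be normalized first; that step is left out here.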


u1ehiz5o2#

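This doesn't sound like a Hadoop problem, just a regex problem. You only need to match a or alpha 3 or more times. The following regular expression does that: "([aα].*){3,}".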

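import java.util.regex.Matcher;
import java.util.regex.Pattern;

String[] files = {
        "Academia.edu - Share research",
        "Google",
        "News12.gr | Αθλητική Ενημέρωση από τα Δωδεκάνησα"};
// An 'a' or 'α' followed by anything, repeated at least 3 times (case-sensitive).
String regexpattern = "([aα].*){3,}";
Pattern pattern = Pattern.compile(regexpattern);
for (String file : files) {
    Matcher matcher = pattern.matcher(file);
    // One find() is enough: a successful match already runs to the end of the title.
    if (matcher.find()) {
        System.out.println("file name matched '" + file + "'");
    }
}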
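To get the "title, file name" lines the question asks for, the same idea can be wrapped in a small helper. The following is only a sketch under assumptions: the class TitleFilter, the method matchingLines, and the map from file name to title are hypothetical, and reading the titles out of the files (or out of a Hadoop job) is not shown. The pattern is a case-insensitive variant of the one above and, like it, counts a's and α's together.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class TitleFilter {

    // Three occurrences of 'a' or 'α', not necessarily adjacent.
    // UNICODE_CASE is needed so that CASE_INSENSITIVE also covers Greek 'Α'.
    private static final Pattern THREE_OR_MORE = Pattern.compile(
            "(?:[aα].*?){3}",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // Returns one "title, fileName" line for every title that matches.
    public static List<String> matchingLines(Map<String, String> titlesByFile) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> entry : titlesByFile.entrySet()) {
            String fileName = entry.getKey();
            String title = entry.getValue();
            if (THREE_OR_MORE.matcher(title).find()) {
                lines.add(title + ", " + fileName);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumed input: file name -> title, in insertion order.
        Map<String, String> titlesByFile = new LinkedHashMap<>();
        titlesByFile.put("site_1.txt", "Academia.edu - Share research");
        titlesByFile.put("site_2.txt", "Google");
        titlesByFile.put("site_3.txt",
                "News12.gr | Αθλητική Ενημέρωση από τα Δωδεκάνησα");
        matchingLines(titlesByFile).forEach(System.out::println);
    }
}
```

In a MapReduce job the same check could sit inside the mapper, with the input file name taken from the input split; that wiring is not shown here.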


bhmjp9jg3#

To match 3 a's or alphas (not necessarily adjacent), you can use the following regular expression: