regex: How to search for one or more strings in a file using regex, and count the occurrences of each string separately?

Posted by new9mtju on 2023-04-22 in Other


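I'm trying to find one or more strings in each line of a file and count the total number of times each string appears in the file. Some lines contain only one of the strings, but others may contain several of the target strings, if that makes sense. I'm trying to do this with a regular expression.
Here is what I tried (the file has already been read and split into lines with .readlines):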

import re

count1 = 0
count2 = 0
count3 = 0

Pattern = r'(?i)(\bString1\b)|(\bString2\b)|(\bString3\b)'

i = 0
while i != len(lines):
    # re.search() returns only the first match in the line
    match = re.search(Pattern, lines[i])

    if match:
        if match.group(1):
            count1 = count1 + 1
        elif match.group(2):
            count2 = count2 + 1
        elif match.group(3):
            count3 = count3 + 1
    i = i + 1

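This works when a line contains no more than one match, but when a line has several, it obviously counts only the first one and moves on. Is there a way to scan the whole line? I know re.findall finds all matches, but it puts them in a list, and I don't know how to reliably count the matches for each word, because a given word's matches end up at different indices in the list on each pass through the loop.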

regex

Source: https://stackoverflow.com/questions/76053588/how-to-search-for-one-or-more-strings-in-a-file-using-regex-and-count-the-numbe

3 Answers


njthzxwz1#

In your example the matches are all static strings, so you can use them as the dictionary keys of a Counter object.

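import re
from collections import Counter

# Pattern and lines are defined as in the question
count = Counter()
for line in lines:
    for match in re.finditer(Pattern, line):
        # Count the whole matched string; count.update(str) would count its characters
        count[match.group(0)] += 1

for k in count:
    print(f"{count[k]} occurrences of {k}")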

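The useful change here is to use re.finditer() instead of re.findall(); it yields proper re.Match objects, from which you can extract the matched string with .group(0), along with various other attributes if you want them.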
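A minimal, self-contained sketch of the finditer() approach might look like the following. The sample `lines` are invented for illustration, and lowercasing the key is an added assumption (not part of the answer above) so that a case-insensitive pattern does not split "String1" and "STRING1" into separate counters.

```python
import re
from collections import Counter

# Hypothetical sample input standing in for the result of file.readlines().
lines = [
    "String1 String2 string2 String3\n",
    "String1 STRING1\n",
    "nothing here\n",
    "String3\n",
]

pattern = re.compile(r"\b(?:String1|String2|String3)\b", re.IGNORECASE)

count = Counter()
for line in lines:
    # finditer() yields one Match object per non-overlapping match in the line.
    for match in pattern.finditer(line):
        # Normalize case so "STRING1" and "String1" share one key (an assumption).
        count[match.group(0).lower()] += 1

for word, n in sorted(count.items()):
    print(f"{n} occurrences of {word}")
```

If the distinction between "String1" and "string1" matters, dropping the `.lower()` call keeps them as separate keys.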
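If you need to extract matches that can vary, such as r"c[ei]*ling" or r"\d+", you can't use the matched string as the dictionary key, because Counter treats each unique string as a separate entry; you would get "12 occurrences of 123" and "1 occurrence of 234" rather than "13 occurrences of \d+". In that case, I would probably try named subgroups.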

for match in re.finditer(r"(?P<ceiling>c[ei]*ling)|(?P<number>\d+)", line):
    matches = match.groupdict()
    for key in matches:
        if matches[key] is not None:
            # Count by group name; count.update(key) would count its characters
            count[key] += 1
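As a runnable variation on the named-subgroup idea, the sketch below uses Match.lastgroup, which for a pattern made of top-level alternatives gives the name of the group that matched. The sample `lines` are hypothetical, and using lastgroup instead of looping over groupdict() is a substitution of mine, not what the answer wrote.

```python
import re
from collections import Counter

# Hypothetical sample text; the group names follow the pattern in the answer.
lines = ["the ceiling is 12 feet", "cieling 7 and 42", "no match"]

pattern = re.compile(r"(?P<ceiling>c[ei]*ling)|(?P<number>\d+)")

count = Counter()
for line in lines:
    for match in pattern.finditer(line):
        # Exactly one named alternative matches, and lastgroup is its name.
        count[match.lastgroup] += 1

for name, n in sorted(count.items()):
    print(f"{n} occurrences of {name}")
```

This only works cleanly when each alternative is a single named group with no nested capturing groups inside it; with nesting, the groupdict() loop is the safer choice.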

2023-04-22

lc8prwob2#

You can use findall and count the occurrences at the end. For example:

import re

data = "String1 String2 String2 String3\nString1 String1\nString3"
Pattern = r'(?i)(\bString1\b)|(\bString2\b)|(\bString3\b)'
lines = data.split('\n')
all_matches = []
for line in lines:
    # findall() returns one tuple per match, with one slot per group
    all_matches.extend(re.findall(Pattern, line))
# Note: these comparisons are case-sensitive, unlike the (?i) pattern
count1 = len([el for el in all_matches if el[0] == 'String1'])
count2 = len([el for el in all_matches if el[1] == 'String2'])
count3 = len([el for el in all_matches if el[2] == 'String3'])

print(count1, count2, count3)

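Note: findall returns a list of tuples, where the first item of each tuple corresponds to the first group, and so on.

all_matches will be a list of tuples, each of the form (matched item for string1, matched item for string2, matched item for string3), with '' in any position whose group did not match, like this: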

[('String1', '', ''), ('', 'String2', ''), ('', 'String2', ''), ...]

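For example, to compute count1 we build a list of the elements that matched String1 (the condition being that the first element of the tuple equals 'String1'), as follows: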

first_group = [el for el in all_matches if el[0] == 'String1']

We then take its length as the value of count1:

count1 = len(first_group)
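The three near-identical list comprehensions can be collapsed into one pass over the columns of the findall result. The sketch below is one possible way to do that, not part of the original answer: it transposes the tuples with zip() and counts the non-empty entries per group, which also sidesteps the case-sensitive comparison against 'String1'.

```python
import re

data = "String1 String2 String2 String3\nString1 String1\nString3"
Pattern = r"(?i)(\bString1\b)|(\bString2\b)|(\bString3\b)"

# One tuple per match, one slot per capturing group ('' where a group did not match).
all_matches = re.findall(Pattern, data)

# zip(*rows) transposes the list of tuples into one column per group.
# If there are no matches at all, this yields an empty list.
counts = [sum(1 for item in column if item) for column in zip(*all_matches)]

print(counts)
```

With the sample data this gives one count per group, in the order the groups appear in the pattern.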

2023-04-22

ovfsdjhp3#

Another variant uses numpy and its count_nonzero function. Since there is no need to split the data into lines, let's assume all of the data is in data:

import re
import numpy as np

# Pattern and data are defined as in the previous answer
# count non-empty strings along axis 0 (the matches for each word)
count = np.count_nonzero(np.array(re.findall(Pattern, data)), 0)
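Since the data does not have to be split into lines, the whole file can be read and scanned in one go. The sketch below shows one way that might look without the numpy dependency; the helper name `count_in_file`, the UTF-8 encoding, and the lowercased keys are all assumptions made for illustration.

```python
import re
from collections import Counter
from pathlib import Path


def count_in_file(path, words):
    """Count case-insensitive whole-word occurrences of each word in a file.

    Hypothetical helper: the name, encoding and key normalization are assumptions.
    """
    # re.escape() guards against regex metacharacters in the search words.
    alternatives = "|".join(re.escape(w) for w in words)
    pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)

    text = Path(path).read_text(encoding="utf-8")

    # Start every word at zero so words with no matches still appear in the result.
    count = Counter({w.lower(): 0 for w in words})
    for match in pattern.finditer(text):
        count[match.group(0).lower()] += 1
    return count
```

A call such as `count_in_file("input.txt", ["String1", "String2", "String3"])` (with a file name of your own) would return a Counter keyed by the lowercased words. The `\b` anchors assume the search words begin and end with word characters.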

2023-04-22
