Python - Find a phrase inside a long text

Posted by qcuzuvrc on 2023-08-02 in Python


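What is the most efficient way to find a phrase inside a longer text with Python? I want to find the complete phrase, but if it isn't found, split it into smaller parts and try to find those, down to single words.
For example, I have this text:
Paragraphs are the building blocks of papers. Many students define paragraphs in terms of length: a paragraph is a group of at least five sentences, a paragraph is half a page long, etc...There are many techniques for brainstorming; whichever one you choose, this stage of paragraph development cannot be skipped.
I want to find this phrase: there is a group of students
The whole phrase as it stands will not be found, but its smaller parts will. So it would find:

there
is a group
students

Is this possible? If so, what would be the most efficient algorithm to achieve it?
I tried some recursive functions, but they fail to find these sub-parts of the phrase: they either find the whole phrase or only the single words.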

python

Source: https://stackoverflow.com/questions/76803595/python-find-a-phrase-inside-a-long-text

4 Answers


qfe3c7zg1#

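If you want a robust approach that works at the word level but also catches, e.g., "...That" vs. "that", I recommend some basic NLP with NLTK. This applies if you're dealing with a small to medium-sized dataset.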

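from nltk import ngrams, word_tokenize
# Note: word_tokenize needs NLTK's Punkt tokenizer data to be downloaded first

text = "..."  # your text
query = "there is a group of students"

def preprocess(raw):
    # Tokenize and lowercase so that "There" matches "there"
    return [token.lower() for token in word_tokenize(raw)]

def extract_ngrams(tokens, min_n, max_n):
    # Every n-gram (as a tuple of tokens) with min_n <= n <= max_n
    return set(ngram for n in range(min_n, max_n + 1) for ngram in ngrams(tokens, n))

query_tokens = preprocess(query)
min_n = 1
max_n = len(query_tokens)  # count tokens, not characters: no n-gram can be longer than the query

text_ngrams = extract_ngrams(preprocess(text), min_n, max_n)
query_ngrams = extract_ngrams(query_tokens, min_n, max_n)

# N-grams that occur in both the text and the query
print(text_ngrams & query_ngrams)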

Output:

{('a',),
 ('a', 'group'),
 ('a', 'group', 'of'),
 ('group',),
 ('group', 'of'),
 ('is',),
 ('is', 'a'),
 ('is', 'a', 'group'),
 ('is', 'a', 'group', 'of'),
 ('of',),
 ('students',),
 ('there',)}
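
The set above contains every shared n-gram, including those nested inside a longer match. If only the longest pieces are wanted, one possible follow-up is to drop each n-gram that sits inside a longer one. This is a minimal sketch in plain Python; `keep_maximal` is a hypothetical helper name, not part of NLTK, and the input is copied from the output above.

```python
def keep_maximal(ngrams_found):
    """Drop every n-gram that occurs contiguously inside a longer n-gram of the set."""
    def contains(longer, shorter):
        n = len(shorter)
        return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

    return {
        gram for gram in ngrams_found
        if not any(len(other) > len(gram) and contains(other, gram)
                   for other in ngrams_found)
    }


# Shared n-grams copied from the output above.
shared = {
    ("a",), ("a", "group"), ("a", "group", "of"), ("group",), ("group", "of"),
    ("is",), ("is", "a"), ("is", "a", "group"), ("is", "a", "group", "of"),
    ("of",), ("students",), ("there",),
}
maximal = keep_maximal(shared)
print(maximal)
```

For this input, three n-grams remain: ('is', 'a', 'group', 'of'), ('students',) and ('there',).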


2023-08-02

jdzmm42g2#

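The simplest approach is to generate every possible contiguous slice of the phrase you want to find, then check whether the text contains it with if phrase_slice in paragraph.
To get the slices, you can use a double loop: first decide how many words of the phrase to include, then shift the starting offset. For example: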

text = "Paragraphs are the building blocks of papers. Many students define paragraphs in terms of length: a paragraph is a group of at least five sentences, a paragraph is half a page long, etc...There are many techniques for brainstorming; whichever one you choose, this stage of paragraph development cannot be skipped."
phrase = ["there", "is", "a", "group", "of", "students"]

for i in range(len(phrase)):
    n_words = len(phrase) - i  # slice length, longest first
    for j in range(i + 1):  # every start offset at which a slice of n_words still fits
        phrase_slice = " ".join(phrase[j:n_words + j])
        # Case-sensitive substring test: "a" also matches inside "paragraph"
        if phrase_slice in text:
            print(phrase_slice)  # do stuff with the match
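
One caveat with the plain in test: it is case-sensitive and also matches inside other words, so "a" is found in "paragraph" while "there" misses "There". A possible workaround, sketched here with the standard re module, is to match whole words only and ignore case. `contains_words` is a hypothetical helper name, and `sample` is a shortened excerpt of the text used only for illustration.

```python
import re


def contains_words(text, words):
    """Return True if `words` occur consecutively in `text` as whole words, ignoring case."""
    # \b anchors the match at word boundaries;
    # \W+ allows spaces or punctuation between consecutive words.
    pattern = r"\b" + r"\W+".join(re.escape(word) for word in words) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


sample = "a paragraph is half a page long, etc...There are many techniques"

print("a" in "paragraph")                       # True: plain substring match
print(contains_words("paragraph", ["a"]))       # False: "a" is not a whole word here
print(contains_words(sample, ["there", "are"]))  # True: case is ignored
print(contains_words(sample, ["half", "page"]))  # False: the words are not adjacent
```

Because \W+ accepts punctuation, a slice can also match across a comma or a full stop; use \s+ instead if that is not wanted.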


2023-08-02

gv8xihay3#

import string

TEXT = "Paragraphs are the building blocks of papers. Many students define paragraphs in terms of length: a paragraph is a group of at least five sentences, a paragraph is half a page long, etc... There are many techniques for brainstorming; whichever one you choose, this stage of paragraph development cannot be skipped."

TO_FIND = "there is a group of students"


def preprocess_text(text):
    # Strings are immutable, so the cleaned copy has to be returned.
    # Lowercase and turn every punctuation character into a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table)


def find_words(words, targets):
    # (position, word) for every word of the text that also occurs in the phrase
    return [(idx, word) for idx, word in enumerate(words) if word in targets]


def find_groups(occurrences):
    # Merge occurrences whose positions in the text are consecutive
    groups = []
    for idx, word in occurrences:
        if groups and groups[-1][-1][0] + 1 == idx:
            groups[-1].append((idx, word))
        else:
            groups.append([(idx, word)])
    return groups


if __name__ == "__main__":
    words = preprocess_text(TEXT).split()
    occurrences = find_words(words, set(TO_FIND.split()))
    solutions = []
    for group in find_groups(occurrences):
        # A group of one is a word that could not be joined with a neighbour
        solution = " ".join(word for _, word in group)
        if solution not in solutions:
            solutions.append(solution)
    for solution in solutions:
        print(solution)

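The code is a bit complex, but it works well, with an execution time of 54 ms on my machine.
It first splits the input text into words and collects every word in the text that also appears in the phrase to find. It then tries to regroup adjacent words into groups, finds the remaining words that were not grouped, and prints everything.
Hope that helps!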

2023-08-02

gdrx4gfi4#

Would something like this work?

from itertools import combinations

#Load up paragraph and phrase to be searched
paragraph = "Paragraphs are the building blocks of papers. Many students define paragraphs in terms of length: a paragraph is a group of at least five sentences, a paragraph is half a page long, etc...There are many techniques for brainstorming; whichever one you choose, this stage of paragraph development cannot be skipped."

phrase = "there is a group of students"

phrase_words = phrase.split()

# Generate every order-preserving combination of the words in the phrase (including the full phrase)
phrase_sections = []
for i in range(1, len(phrase_words) + 1):
    for combination in combinations(phrase_words, i):
        phrase_sections.append(' '.join(combination))

# Search for each section in the paragraph (checking `section in phrase` is a quick and dirty way to keep only contiguous runs of words)
for section in phrase_sections:
    if section in phrase and section in paragraph:
        print(section)

This outputs:

is
a
group
of
students
is a
a group
group of
is a group
a group of
is a group of
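
The list above contains every matching piece. To get closer to the split the question asks for (the longest pieces first, down to single words), one option is a greedy longest-first scan over the words of the phrase. The following is only a sketch under some assumptions: `tokenize` and `longest_pieces` are hypothetical helper names, matching is case-insensitive and on whole words, and `text` is a shortened excerpt of the paragraph.

```python
import re


def tokenize(s):
    """Lowercase `s` and split it into runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", s.lower())


def longest_pieces(text, phrase):
    """Greedily split `phrase` into the longest word runs that also occur in `text`."""
    text_tokens = tokenize(text)
    words = tokenize(phrase)
    # Index every contiguous word run of the text, up to the length of the phrase.
    seen = {
        tuple(text_tokens[i:i + n])
        for n in range(1, len(words) + 1)
        for i in range(len(text_tokens) - n + 1)
    }
    pieces, start = [], 0
    while start < len(words):
        # Try the longest candidate beginning at `start` first.
        for end in range(len(words), start, -1):
            if tuple(words[start:end]) in seen:
                pieces.append(" ".join(words[start:end]))
                start = end
                break
        else:
            start += 1  # this word does not occur in the text at all
    return pieces


text = ("Many students define paragraphs in terms of length: a paragraph is a group "
        "of at least five sentences, etc...There are many techniques for brainstorming.")
print(longest_pieces(text, "there is a group of students"))
# ['there', 'is a group of', 'students']
```

The middle piece comes out as "is a group of" rather than "is a group" because "of" also follows in the text. Indexing all word runs costs memory proportional to the text length times the phrase length, which is acceptable for short phrases.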


2023-08-02
