Python - Find a phrase in a long text

qcuzuvrc  posted on 2023-08-02  in  Python
Answers (4) | Views (119)

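What is the most efficient way to find a phrase in a longer text with Python? I want to find the complete phrase, but if it is not found, split it into smaller parts and try to find those, all the way down to single words.
For example, I have this text:
Paragraphs are the building blocks of papers. Many students define paragraphs in terms of length: a paragraph is a group of at least five sentences, a paragraph is half a page long, etc...There are many techniques for brainstorming; whichever one you choose, this stage of paragraph development cannot be skipped.
I want to find this phrase: there is a group of students
The whole phrase as it stands will not be found, but its smaller parts will. So the search should find: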

  • there
  • is a group of
  • students

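Is this possible? If so, what is the most efficient algorithm to achieve it?
I tried a few recursive functions, but they could not find these sub-parts of the phrase: they found either the whole phrase or only single words.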


qfe3c7zg1#

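If you want a robust approach that works at the word level but also catches cases such as "...There" versus "there", I recommend some basic NLP with NLTK. This assumes you are working with a small to medium-sized dataset.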

from nltk import ngrams, word_tokenize

text = "..."  # your text
query = "there is a group of students"

def preprocess(raw):
    return [token.lower() for token in word_tokenize(raw)]
    
def extract_ngrams(tokens, min_n, max_n):
    return set(ngram for n in range(min_n, max_n + 1) for ngram in ngrams(tokens, n))

min_n = 1
max_n = len(preprocess(query))  # count tokens, not characters as len(query) would

text_ngrams = extract_ngrams(preprocess(text), min_n, max_n)
query_ngrams = extract_ngrams(preprocess(query), min_n, max_n)

print(text_ngrams & query_ngrams)

Output:

{('a',),
 ('a', 'group'),
 ('a', 'group', 'of'),
 ('group',),
 ('group', 'of'),
 ('is',),
 ('is', 'a'),
 ('is', 'a', 'group'),
 ('is', 'a', 'group', 'of'),
 ('of',),
 ('students',),
 ('there',)}
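
The intersection above lists every shared n-gram, including pieces nested inside longer ones. If only the longest pieces are wanted, as in the question, one option is a greedy pass over the query that always takes the longest run of words found in the text. The following is a minimal sketch under stated assumptions, not part of the original answer: `tokenize` is a hypothetical regex-based stand-in for `word_tokenize`, `longest_pieces` is an illustrative name, and the greedy choice is not guaranteed to be the best split in every case.

```python
import re

def tokenize(raw):
    # Hypothetical stand-in for NLTK's word_tokenize: lowercase words only
    return re.findall(r"[a-z']+", raw.lower())

def longest_pieces(text, query):
    text_tokens = tokenize(text)
    query_tokens = tokenize(query)
    max_n = len(query_tokens)
    # Every n-gram of the text up to the query length, for fast membership tests
    text_ngrams = {
        tuple(text_tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(text_tokens) - n + 1)
    }
    pieces = []
    start = 0
    while start < max_n:
        # Try the longest remaining slice first, then shrink it word by word
        for end in range(max_n, start, -1):
            if tuple(query_tokens[start:end]) in text_ngrams:
                pieces.append(" ".join(query_tokens[start:end]))
                start = end
                break
        else:
            start += 1  # this word does not occur in the text at all
    return pieces

text = "Paragraphs are the building blocks of papers. Many students define paragraphs in terms of length: a paragraph is a group of at least five sentences, a paragraph is half a page long, etc...There are many techniques for brainstorming; whichever one you choose, this stage of paragraph development cannot be skipped."
print(longest_pieces(text, "there is a group of students"))
# ['there', 'is a group of', 'students']
```

Building the n-gram set costs memory proportional to the text length times the query length, which fits the small to medium-sized datasets mentioned above.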


jdzmm42g2#

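The simplest approach is to generate every possible contiguous slice of the phrase you want to find and then check whether the text contains each one with if phrase_slice in paragraph.
To get the slices you can use a double loop: the outer loop decides how many words of the phrase to include, and the inner loop shifts the starting word. For example: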

text = "Paragraphs are the building blocks of papers. Many students define paragraphs in terms of length: a paragraph is a group of at least five sentences, a paragraph is half a page long, etc...There are many techniques for brainstorming; whichever one you choose, this stage of paragraph development cannot be skipped."
phrase = ["there", "is", "a", "group", "of", "students"]

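text_lower = text.lower()  # match case-insensitively, e.g. "There" vs "there"
for i in range(len(phrase)):
    n_words = len(phrase) - i  # slice length, longest first
    for j in range(i + 1):  # every start offset at which n_words still fit
        phrase_slice = phrase[j:n_words + j]
        # Note: `in` is a substring test, so "a" also matches inside other words
        if " ".join(phrase_slice) in text_lower:
            print(" ".join(phrase_slice))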
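
One caveat with `in` on strings is that it tests substrings, so a slice such as "a" also matches inside "paragraph". A hedged variant, assuming whole-word matches are wanted, uses the standard `re` module with `\b` word boundaries and `re.IGNORECASE`. `contains_words` is an illustrative helper name, not something from the answer, and the sample text is a shortened excerpt.

```python
import re

def contains_words(text, words):
    # \b anchors the match on word boundaries; re.escape guards against
    # regex metacharacters inside the words themselves
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

sample = "A paragraph is a group of at least five sentences, etc...There are many techniques."
print(contains_words(sample, ["is", "a", "group"]))  # True
print(contains_words(sample, ["graph"]))             # False: only inside "paragraph"
print(contains_words(sample, ["there"]))             # True: case is ignored
```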



gv8xihay3#

TEXT = "Paragraphs are the building blocks of papers. Many students define paragraphs in terms of length: a paragraph is a group of at least five sentences, a paragraph is half a page long, etc... There are many techniques for brainstorming; whichever one you choose, this stage of paragraph development cannot be skipped."

TO_FIND = "there is a group of students"

import re

word_by_index = {}  # position in the text -> word (avoids shadowing the built-in dict)

def preprocess_text(raw):
    # str methods return new strings, so the cleaned text has to be returned;
    # punctuation becomes a space so that "etc...There" splits into two words
    return re.sub(r"[.,;:!?]", " ", raw.lower())

def find_groups(occurences):
    # Collect runs of two or more matches that sit at consecutive text positions
    groups = []
    group = []
    for occurence in occurences:
        if group and occurence[0] != group[-1][0] + 1:
            if len(group) > 1:
                groups.append(group)
            group = []
        group.append(occurence)
    if len(group) > 1:
        groups.append(group)
    return groups

def make_dict():
    for idx, word in enumerate(preprocess_text(TEXT).split()):
        word_by_index[idx] = word

def find_words():
    wanted = set(TO_FIND.split())
    occurences = [(k, v) for k, v in word_by_index.items() if v in wanted]
    return occurences

if __name__ == "__main__":
    make_dict()
    occurences = find_words()
    groups = find_groups(occurences)
    solutions = []
    for group in groups:
        tmp = []
        for elem in group:
            tmp.append(elem[1])
        tmp = " ".join(tmp)
        solutions.append(tmp)
    for occurence in occurences:
        if occurence[1] not in solutions:
            solutions.append(occurence[1])
    for solution in solutions:
        print(solution)

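The code is a bit involved, but it works well and runs in 54 ms on my machine.
It first splits the input text into words and collects every word of the text that also appears in the phrase to find. It then tries to regroup adjacent words into groups, picks up the remaining words that were not grouped, and prints everything.
Hope this helps!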
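
The regrouping step described above (collecting consecutive text positions into runs) can also be sketched with `itertools.groupby`, using the observation that consecutive integers all share the same value of position minus rank. This is an illustrative alternative rather than the answer's code: `group_consecutive` is a hypothetical name, the positions in the sample list are made up for the example, and like the answer it groups by adjacency in the text only, without checking the word order of the query.

```python
from itertools import groupby

def group_consecutive(occurrences):
    # occurrences: (position, word) pairs sorted by position.
    # Within a run of consecutive positions, position - rank stays constant.
    runs = []
    ranked = enumerate(occurrences)
    for _, run in groupby(ranked, key=lambda pair: pair[1][0] - pair[0]):
        runs.append(" ".join(word for _, (_, word) in run))
    return runs

# Made-up positions, for illustration only
occurrences = [(7, "students"), (16, "is"), (17, "a"), (18, "group"), (19, "of"), (30, "there")]
print(group_consecutive(occurrences))
# ['students', 'is a group of', 'there']
```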


gdrx4gfi4#

Would something like this work?

from itertools import combinations

#Load up paragraph and phrase to be searched
paragraph = "Paragraphs are the building blocks of papers. Many students define paragraphs in terms of length: a paragraph is a group of at least five sentences, a paragraph is half a page long, etc...There are many techniques for brainstorming; whichever one you choose, this stage of paragraph development cannot be skipped."

phrase = "there is a group of students"

phrase_words = phrase.split()

#Generate every order-preserving combination of the words in the phrase, including the full phrase
phrase_sections = []
for i in range(1, len(phrase_words) + 1):
    for combination in combinations(phrase_words, i):
        phrase_sections.append(' '.join(combination))

#Search for phrase in paragraph (searching in phrase is a quick and dirty way to keep only contiguous runs of words)
for section in phrase_sections:
    if section in phrase and section in paragraph:
        print(section)

This outputs:

is
a
group
of
students
is a
a group
group of
is a group
a group of
is a group of
