regex: How to extract the sentence containing a keyword from a block of text

v09wglhw  posted on 2023-08-08  in Other

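My goal is to come up with a script that searches a folder of log files for a specific keyword and writes to a results.txt file the file name, the line number in each file that contains the keyword, the index where the keyword starts, and the full line of text containing the keyword.
I have created some code that does this, but it has a problem with examples such as:
Hi everyone, we've planned some overnight maintenance this weekend so that means you will not be able to use any device on the network between 7pm and 10am on this coming Friday evening/ Saturday morning (23/06/23 to 24/06/23). We apologise for the inconvenience this will cause but it is unavoidable. Please ensure you have logged out of the network by 6.30pm on Friday evening (23/06/23).
It correctly identifies the keyword "device" as being on line 1 and starting at character 114, and quite rightly displays the whole block of text containing the keyword "device", whereas I would like it to display only the sentence in which the keyword "device" occurs.
I was thinking of either:

  • for each "device", finding the text after the previous full stop and before the next full stop, or
  • taking n characters before and after "device"

Here is the code I have written so far: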

# Import os module
import os

# Raw string so the backslashes in the Windows path are not treated as escapes
fname2 = r"D:\X250\Python_Scripts\Search_File_for_Keyword_and_Print_Line\Results.txt"

# Directory, file type and string to search for
search_path = input("Enter directory path to search : ")
file_type = input("File Type : ")
search_str = input("Enter the search string : ")

# If path does not exist, set search path to current directory
if not os.path.exists(search_path):
    search_path = "."

# Create the output file; the with block closes it when the search is done
with open(fname2, 'w') as fw:

    # Repeat for each file in the directory
    for fname in os.listdir(path=search_path):

        # Apply file type filter
        if fname.endswith(file_type):

            # Open file for reading; os.path.join adds the separator if needed
            with open(os.path.join(search_path, fname)) as fo:

                # Read line by line, counting line numbers from 1
                for line_no, line in enumerate(fo, start=1):

                    # Search for string in line
                    index = line.find(search_str)
                    if index != -1:
                        print(fname, "[", line_no, ",", index, "] ", line, sep="")
                        # Write output file
                        fw.write(fname + " " + str(line_no) + " " + str(index) + "  ")
                        fw.write(line)


au9on6nz  #1

Something like this should work:

text = "Hi everyone, we've planned some overnight maintenance this weekend so that means you will not be able to use any device on the network between 7pm and 10am on this coming Friday evening/ Saturday morning (23/06/23 to 24/06/23). We apologise for the inconvenience this will cause but it is unavoidable. Please ensure you have logged out of the network by 6.30pm on Friday evening (23/06/23)."

#split text into sentences
sentences = text.split(".")

# filter to only sentences with "device" in them 
sentences_with_device = [sentence for sentence in sentences if "device" in sentence]

# using regex
import re
# this looks for, in order, all of the following:
# 1. anything that is not a period (.) 0 or more times
# 2. the word "device"
# 3. anything that is not a period (.) 0 or more times
# 4. a period (.)
sentences_with_device = re.findall(r'([^.]*?device[^.]*\.)', text)
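
To plug this into the per-line loop from the question, one possible sketch (not part of the original answer) finds each keyword hit and expands it to the nearest sentence boundaries. The names `sentences_with_keyword` and `SENTENCE_END` are assumptions made for illustration. A boundary here is taken to be `.`, `!` or `?` followed by whitespace or the end of the line, which keeps the period in "6.30pm" from splitting a sentence; abbreviations such as "e.g. " would still be treated as boundaries.

```python
import re

# Assumed sentence boundary: ., ! or ? followed by whitespace or end of text.
# The period inside "6.30pm" is followed by a digit, so it is not a boundary.
SENTENCE_END = re.compile(r'[.!?](?=\s|$)')

def sentences_with_keyword(line, keyword):
    """Return (index, sentence) pairs, one per occurrence of keyword in line.

    index is the position of the keyword within the whole line, the same
    value str.find() reports in the question's script.
    """
    if not keyword:
        return []
    # Positions just after every sentence-ending punctuation mark
    boundaries = [m.end() for m in SENTENCE_END.finditer(line)]
    results = []
    for match in re.finditer(re.escape(keyword), line):
        # Sentence starts after the last boundary before the keyword (or at 0)
        start = max((b for b in boundaries if b <= match.start()), default=0)
        # Sentence ends at the first boundary after the keyword (or end of line)
        end = min((b for b in boundaries if b >= match.end()), default=len(line))
        results.append((match.start(), line[start:end].strip()))
    return results
```

In the question's loop, `fw.write(line)` could then be replaced by writing each sentence returned by `sentences_with_keyword(line, search_str)`, with the index taken from the same pair.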

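The other idea from the question, taking n characters either side of the keyword, can be sketched in a similar way. The function name `keyword_context` and its `n` parameter are hypothetical, chosen only for this example, and the snippet may cut words in half at either end.

```python
def keyword_context(line, keyword, n=40):
    """Return (index, snippet) pairs with up to n characters either side of each hit."""
    if not keyword:
        return []
    results = []
    index = line.find(keyword)
    while index != -1:
        # Clamp the window to the bounds of the line
        start = max(0, index - n)
        end = min(len(line), index + len(keyword) + n)
        results.append((index, line[start:end].strip()))
        # Continue searching after the current occurrence
        index = line.find(keyword, index + len(keyword))
    return results
```

This avoids any assumption about what ends a sentence, at the cost of less readable output than the sentence-based approach.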
