How do I run a Python keyword search and word counter across all CSV files in a directory and write the results to a single CSV?

qacovj5a  posted on 2021-07-13  in Java

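I am new to Python and trying to understand certain libraries. I'm not sure how to upload a CSV to SO, but this script works with any CSV; just replace 'SwitchedProviders_TopicModel'.
My goal is to loop through all the CSVs in one directory, c:\users\jj\desktop\autotranscribe, and write the output of my Python script to a CSV, one row per file.
For example, I have these CSV files in the folder above -
'1003391793_1003391784_01bc7e411408166f7c5468f0.csv' '1003478130_1003478103_8eef05b0820cf0ffe9a9754c.csv' '1003478130_1003478103_8eef05b0820cf0ffe9a9882d.csv'
I want my Python application (below) to run a word counter on each CSV in the folder/directory and write the output to a dataframe like this -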

csvname                                            pre existing  exclusions  limitations  fourteen
1003391793_1003391784_01bc7e411408166f7c5468f0.csv    1           2           0            1

My script -

import pandas as pd
from collections import defaultdict

def search_multiple_strings_in_file(file_name, list_of_strings):
    """Get line from the file along with line numbers, which contains any string from the list"""
    line_number = 0
    list_of_results = []
    count = defaultdict(lambda: 0)
    # Open the file in read only mode
    with open(file_name, 'r') as read_obj:
        # Read all lines in the file one by one
        for line in read_obj:
            line_number += 1
            # For each line, check if line contains any string from the list of strings
            for string_to_search in list_of_strings:
                if string_to_search in line:
                    count[string_to_search] += line.count(string_to_search)
                    # If any string is found in line, then append that line along with line number in list
                    list_of_results.append((string_to_search, line_number, line.rstrip()))

    # Return list of tuples containing matched string, line numbers and lines where string is found
    return list_of_results, dict(count)

matched_lines, count = search_multiple_strings_in_file('SwitchedProviders_TopicModel.csv', [ 'pre existing ', 'exclusions','limitations','fourteen'])

df = pd.DataFrame.from_dict(count, orient='index').reset_index()
df.columns = ['Word', 'Count']

print(df)

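How can I do this? I only want counts for the specific words you can see in my script, such as 'fourteen', not a counter for every word.
Sample data from one of the CSVs (credit to user Umar H) -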

df = pd.read_csv('1003478130_1003478103_8eef05b0820cf0ffe9a9754c.csv')
print(df.head(10).to_dict())
{'transcript': {0: 'hi thanks for calling ACCA  this is many speaking could have the pleasure speaking with ', 1: 'so ', 2: 'hi ', 3: 'I have the pleasure speaking with my name is B. as in boy E. V. D. N. ', 4: 'thanks yes and I think I have your account pulled up could you please verify your email ', 5: "sure is yeah it's on _ 00 ", 6: 'I T. O.com ', 7: 'thank you how can I help ', 8: 'all right I mean I do have an insurance with you guys I just want to cancel the insurance ', 9: 'sure I can help with that what was the reason for cancellation '}, 'confidence': {0: 0.73, 1: 0.18, 2: 0.88, 3: 0.72, 4: 0.83, 5: 0.76, 6: 0.83, 7: 0.98, 8: 0.89, 9: 0.95}, 'from': {0: 1.69, 1: 1.83, 2: 2.06, 3: 2.13, 4: 2.36, 5: 2.98, 6: 3.17, 7: 3.65, 8: 3.78, 9: 3.93}, 'to': {0: 1.83, 1: 2.06, 2: 2.13, 3: 2.36, 4: 2.98, 5: 3.17, 6: 3.65, 7: 3.78, 8: 3.93, 9: 4.14}, 'speaker': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}, 'Negative': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.116, 9: 0.0}, 'Neutral': {0: 0.694, 1: 1.0, 2: 1.0, 3: 0.802, 4: 0.603, 5: 0.471, 6: 1.0, 7: 0.366, 8: 0.809, 9: 0.643}, 'Positive': {0: 0.306, 1: 0.0, 2: 0.0, 3: 0.198, 4: 0.397, 5: 0.529, 6: 0.0, 7: 0.634, 8: 0.075, 9: 0.357}, 'compound': {0: 0.765, 1: 0.0, 2: 0.0, 3: 0.5719, 4: 0.7845, 5: 0.5423, 6: 0.0, 7: 0.6369, 8: -0.1779, 9: 0.6124}}

aurhwmvo1#

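Steps -
Define the input path.
Find all the CSV files.
Extract all the words from each CSV file (removing punctuation) and pass the list to a Counter to get the counts.
Create one result dict and add the file name and its counter dict to it.
Finally, convert the result dict to a dataframe and transpose it (filling NaN values with 0 if needed).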

from collections import defaultdict
from pathlib import Path

import pandas as pd

inp_dir = Path('.')  # current dir

def search_multiple_strings_in_file(file_name, list_of_strings):
    """Get line from the file along with line numbers, which contains any string from the list"""
    list_of_results = []
    count = defaultdict(lambda: 0)
    # Open the file in read only mode
    with open(file_name, 'r') as read_obj:
        # Read all lines in the file one by one
        for line_number, line in enumerate(read_obj, start=1):
            # For each line, check if line contains any string from the list of strings
            for string_to_search in list_of_strings:
                if string_to_search in line:
                    count[string_to_search] += line.count(string_to_search)
                    # If any string is found in line, then append that line along with line number in list
                    list_of_results.append(
                        (string_to_search, line_number, line.rstrip()))

    # Return list of tuples containing matched string, line numbers and lines where string is found
    return list_of_results, dict(count)

result = {}
for csv_file in inp_dir.glob('**/*.csv'):
    matched_lines, count = search_multiple_strings_in_file(
        csv_file, ['pre existing', 'exclusions', 'limitations', 'fourteen'])
    result[csv_file.name] = count
# One row per file, one column per keyword; keywords missing from a file become 0
df = pd.DataFrame(result).T.fillna(0).astype(int)
print(df)

Output -

       exclusions  limitations  pre existing
1.csv           1            3             1
2.csv           1            3             1
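Step 3 above talks about stripping punctuation and feeding the words to a Counter, while the code counts substrings line by line. As a rough sketch of the Counter route, the hypothetical helper below (count_keywords is not part of the answer) replaces punctuation with spaces, splits the text into words, and counts word n-grams so that a two-word phrase such as 'pre existing' can be looked up as well. It assumes whole-word, case-insensitive matching is what is wanted, which differs from the substring matching used above.

```python
import string
from collections import Counter


def count_keywords(text, keywords):
    """Count whole-word occurrences of each keyword or multi-word phrase."""
    # Replace punctuation with spaces so 'limitations,0.5' still yields 'limitations'
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    words = text.lower().translate(table).split()
    counts = {}
    for keyword in keywords:
        parts = tuple(keyword.lower().split())
        # Count all n-grams with as many words as the keyword has
        ngrams = Counter(zip(*(words[i:] for i in range(len(parts)))))
        counts[keyword] = ngrams[parts]  # a Counter returns 0 for missing keys
    return counts


# Made-up sample text, only for illustration
sample = 'There are exclusions, limitations and more limitations for pre existing conditions.'
keywords = ['pre existing', 'exclusions', 'limitations', 'fourteen']
counts = count_keywords(sample, keywords)
print(counts)
```

Inside the loop over csv_file, something like count_keywords(csv_file.read_text(), keywords) could replace the call to search_multiple_strings_in_file if only the counts are needed.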

jv4diomz2#

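Since you tagged pandas, we can use .str.extractall to search for the words and the row numbers they occur on.
You could extend the function and add some error handling (for example, what happens if transcript does not exist in a given CSV file).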

from pathlib import Path
import pandas as pd

def get_files_to_parse(start_dir: str) -> list:
    """Return all CSV files directly inside start_dir."""
    return list(Path(start_dir).glob('*.csv'))


def search_multiple_files(list_of_paths: list, key_words: list) -> pd.DataFrame:
    dfs = []
    for file in list_of_paths:
        df = pd.read_csv(file)
        word_df = df['transcript'].str.extractall(f"({'|'.join(key_words)})")\
                        .droplevel(1,0)\
                        .reset_index()\
                        .rename(columns={'index' : file.stem})\
                        .set_index(0).T
        dfs.append(word_df)
    return pd.concat(dfs)
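To make the reshaping above easier to follow, here is a small sketch of what .str.extractall returns on its own. The transcript lines are made up, and the use of re.escape is an addition that is not in the answer; it only matters if a keyword ever contains a regex metacharacter.

```python
import re

import pandas as pd

# Made-up transcript lines, standing in for df['transcript']
transcript = pd.Series([
    'the exclusions and limitations apply',
    'no keywords here',
    'limitations after fourteen days',
])
key_words = ['pre existing', 'exclusions', 'limitations', 'fourteen']

# One capture group containing every keyword; re.escape keeps each one literal
pattern = '(' + '|'.join(re.escape(word) for word in key_words) + ')'

# One output row per match, indexed by (original row number, match number)
matches = transcript.str.extractall(pattern)
print(matches)

# Tally how often each keyword was matched across all rows
counts = matches[0].value_counts()
print(counts)
```

Rows without a match simply do not appear in the result, and keywords that never match (here 'pre existing') are absent from the tally rather than listed with a count of 0.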

Usage.

Using your sample dataframe (I added some keywords from your list):

files = get_files_to_parse(r'target\dir\folder')  # raw string, so '\f' is not read as a form feed

[WindowsPath('1003478130_1003478103_8eef05b0820cf0ffe9a9754c.csv'),
 WindowsPath('1003478130_1003478103_8eef05b0820cf0ffe9a9754c_copy.csv')]
search_multiple_files(files,['pre existing', 'exclusions','limitations','fourteen'])
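The question asks for one row of counts per file, written to a single CSV, and this answer suggests handling files that lack a transcript column. A minimal sketch of both follows, under some assumptions: count_keywords_in_files is a hypothetical helper (not part of either answer), files without the column are skipped rather than treated as errors, and matching is by substring as in the question's script. The demo files and the output name keyword_counts.csv are invented for illustration.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd


def count_keywords_in_files(paths, key_words, column='transcript'):
    """Return one row per CSV file with the number of matches for each keyword."""
    rows = {}
    for path in paths:
        df = pd.read_csv(path)
        if column not in df.columns:
            # Skip files without the expected text column instead of raising KeyError
            print(f'skipping {path.name}: no {column!r} column')
            continue
        text = df[column].fillna('').astype(str)
        # str.count takes a regex, so escape each keyword to match it literally
        rows[path.name] = {
            word: int(text.str.count(re.escape(word)).sum()) for word in key_words
        }
    result = pd.DataFrame.from_dict(rows, orient='index')
    result.index.name = 'csvname'
    return result


key_words = ['pre existing', 'exclusions', 'limitations', 'fourteen']

# Hypothetical demo data in a temporary folder, so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    in_dir = Path(tmp) / 'in'
    in_dir.mkdir()
    pd.DataFrame(
        {'transcript': ['exclusions and limitations', 'limitations after fourteen days']}
    ).to_csv(in_dir / 'a.csv', index=False)
    pd.DataFrame({'other': ['no transcript column']}).to_csv(in_dir / 'b.csv', index=False)

    summary = count_keywords_in_files(sorted(in_dir.glob('*.csv')), key_words)

    # Write outside the input folder so a later run does not pick up its own output
    out_file = Path(tmp) / 'keyword_counts.csv'
    summary.to_csv(out_file)
    written = pd.read_csv(out_file, index_col='csvname')
    print(written)
```

With real data, the list returned by get_files_to_parse could be passed in place of the temporary files.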
