Transforming automated text to a DataFrame

Posted by sqougxex on 2023-04-08 in Python


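I have a set of computer-generated notifications saved in a text file. It looks like this:

MY COMPANY LLC: Statement# 123456 for $10000.99 for the month of 2023 02 (FEB)
MY COMPANY LLC: Statement# 123457 for $100.01 for the month of 2022 09 (SEP)
MY COMPANY LLC: Statement# 123458 for -$51.00 for the month of 2022 10 (OCT)

Ideal output, as a DataFrame: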

Statement	Amount	Month
123456	10000.99	2023 02 (FEB)
123457	100.01	2022 09 (SEP)
123458	-51.00	2022 10 (OCT)

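I know I could convert this to CSV outside of Python and then import it with pandas.
But can I load the text file into Python as-is and convert it to a DataFrame?
One "hint": in this case, ' for ' can be used as the column separator. It reliably splits each line into the fields I want. That is a bit of luck.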

python-3.x

Source: https://stackoverflow.com/questions/75931547/transforming-automated-text-to-a-dataframe

3 Answers


ff29svar1#

Your expected DataFrame format is unclear, but here is an option using extract:

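import pandas as pd

df = (
        # "|" never appears in the file, so each line is read as a single column
        pd.read_csv("input2.txt", header=None, sep="|").squeeze("columns")
            # raw string so that \d, \w and \( reach the regex engine unchanged
            .str.extract(r"(.*): Statement# (.*) for (.*) for the month of (\d+) (\d+) \((\w+)\)")
            .set_axis(["company_name", "statement", "amount", "year", "month_number", "month_name"], axis=1)
      )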

Output:

print(df)

     company_name statement     amount  year month_number  month_name
0  MY COMPANY LLC    123456  $10000.99  2023           02         FEB
1  MY COMPANY LLC    123457    $100.01  2022           09         SEP
2  MY COMPANY LLC    123458    -$51.00  2022           10         OCT
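The extracted columns above are all strings, while the question's ideal output shows plain numbers in the amount column. As a minimal sketch, assuming that numeric columns are what is wanted, the "$" can be dropped and the text converted. The small frame below is rebuilt by hand from the output above so that the snippet runs on its own; it is not part of the original answer.

```python
import pandas as pd

# Hand-built stand-in for two columns of the extract() result shown above.
df = pd.DataFrame({
    "statement": ["123456", "123457", "123458"],
    "amount": ["$10000.99", "$100.01", "-$51.00"],
})

# Remove the literal "$"; regex=False keeps it from being read as a regex anchor.
# to_numeric then parses the remaining text, including the leading minus sign.
df["amount"] = pd.to_numeric(df["amount"].str.replace("$", "", regex=False))

# Statement numbers are all digits here, so an integer dtype is a reasonable choice.
df["statement"] = df["statement"].astype(int)

print(df.dtypes)
```

Whether the statement number should become an integer or stay a string depends on how it is used later; leading zeros, for example, would be lost by the conversion.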

Answered 2023-04-08

vktxenjb2#

Input file:

MY COMPANY LLC: Statement# 123456 for $10000.99 for the month of 2023 02 (FEB)
MY COMPANY LLC: Statement# 123457 for $100.01 for the month of 2022 09 (SEP)
MY COMPANY LLC: Statement# 123458 for -$51.00 for the month of 2022 10 (OCT)

Read the text file with readlines() and load the resulting list of lines into a DataFrame:

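import pandas as pd

with open('input.txt', 'r') as f:
    # readlines() keeps the trailing newline on each line, so strip it
    lines = [line.rstrip('\n') for line in f.readlines()]

# One row per line, in a single column named 0
df = pd.DataFrame(lines)

print(df)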

Output:

0  MY COMPANY LLC: Statement# 123456 for $10000.9...
1  MY COMPANY LLC: Statement# 123457 for $100.01 ...
2  MY COMPANY LLC: Statement# 123458 for -$51.00 ...

If you need to split each line on the ' for ' string and drop it from the DataFrame:

import pandas as pd

data = []

with open('input.txt', 'r') as f:
    lines = f.readlines()
    for line in lines:
        # Split on ' for ' with its surrounding spaces, so the fields carry no
        # stray whitespace and a word that merely contains "for" is not split
        data.append(line.strip().split(' for '))

df = pd.DataFrame(data)

print(df)

Output:

                                   0          1                           2
0  MY COMPANY LLC: Statement# 123456  $10000.99  the month of 2023 02 (FEB)
1  MY COMPANY LLC: Statement# 123457    $100.01  the month of 2022 09 (SEP)
2  MY COMPANY LLC: Statement# 123458    -$51.00  the month of 2022 10 (OCT)
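The loop above can also be written with the pandas string accessor. This is only a sketch: the lines are given as an in-memory list so that it runs without input.txt, and the column names Statement, Amount and Month are taken from the question's ideal output rather than from this answer.

```python
import pandas as pd

# Same three lines as the input file, held in memory for the example.
lines = [
    "MY COMPANY LLC: Statement# 123456 for $10000.99 for the month of 2023 02 (FEB)",
    "MY COMPANY LLC: Statement# 123457 for $100.01 for the month of 2022 09 (SEP)",
    "MY COMPANY LLC: Statement# 123458 for -$51.00 for the month of 2022 10 (OCT)",
]

# expand=True returns a DataFrame with one column per piece
# instead of a Series holding lists.
parts = pd.Series(lines).str.split(" for ", expand=True)

# Column names borrowed from the question (an assumption, not part of the answer).
parts.columns = ["Statement", "Amount", "Month"]

print(parts)
```

The first and last columns still carry the "MY COMPANY LLC: Statement# " and "the month of " prefixes, just as in the loop version.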

Answered 2023-04-08

qnakjoqk3#

Use ' for ' as the separator.
Use str.replace() to remove the extraneous terms.

import pandas as pd

## Load data
cols = ['Check', 'Amount', 'Month']
# A multi-character separator requires the python engine
df = pd.read_csv('input.txt', sep=' for ',
                 names=cols,
                 engine='python').dropna()

## Remove extraneous terms
badTerms = [
    'MY COMPANY LLC: Statement# ',
    'the month of ',
    ]
for col in df.columns:
    for term in badTerms:
        # regex=False treats each term as literal text
        df[col] = df[col].str.replace(term, '',
                                      regex=False)
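After the cleanup above, the Month column holds strings such as "2022 09 (SEP)". If a real date type is useful for sorting or grouping (an assumption, since the question only asks for the text), the leading year and month can be parsed as in this sketch. The series is typed in by hand so that the snippet is self-contained.

```python
import pandas as pd

# Hand-built stand-in for the cleaned Month column.
month = pd.Series(["2023 02 (FEB)", "2022 09 (SEP)", "2022 10 (OCT)"], name="Month")

# Keep only the leading "YYYY MM" characters and parse them;
# with no day in the format, the day defaults to the 1st.
parsed = pd.to_datetime(month.str.slice(0, 7), format="%Y %m")

# Optionally represent the values as monthly periods instead of timestamps.
periods = parsed.dt.to_period("M")

print(periods)
```

The slice relies on the year and month always taking exactly seven characters, which holds for the three sample lines but is not guaranteed by anything in the question.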

Answered 2023-04-08
