How can I use Python to convert text data whose rows contain pipe-separated cells into a specific data pattern?

krcsximq posted on 2023-01-06 in Python

I want to convert the data below into a pattern with at most 4 cells per row. Here is a sample of the data.

text = """A | B | Lorem | Ipsum | is | simply | dummy
C | D | text | of | the | printing | and
E | F | typesetting | industry. | Lorem
G | H | more | recently | with | desktop | publishing | software | like | Aldus
I | J | Ipsum | has | been | the | industry's
K | L | standard | dummy | text | ever | since | the | 1500s
M | N | took | a
O | P | scrambled | it | to | make | a | type | specimen | book"""

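I have been asked to convert it so that each row contains no more than 4 cells. Any cells after the fourth should move to a new row that starts with the same first two cells as the original row, and that new row must also contain no more than 4 cells. The converted text data should look like this: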

A | B | Lorem | Ipsum
A | B | is | simply
A | B | dummy
C | D | text | of
C | D | the | printing
C | D | and
E | F | typesetting | industry.
E | F | Lorem
G | H | more | recently
G | H | with | desktop
G | H | publishing | software
G | H | like | Aldus
.
.
and so on...

I tried something on my own, but I am not even halfway there; the code below is incomplete.

new_text = ""

for i in text.split('\n'):
    row = i.split(' | ')
    if len(row) == 4:
        new_text = new_text + i + '\n'
    elif len(row) > 4:
        for j in range(len(row)):
            if j < 3:
                new_text = new_text + row[0] + ' | ' + row[1] + ...

I can't figure out the logic for reusing the first two cells when a row has more than 4 cells.

bksxznpy1#

You can split each input line and then process the remaining cells two at a time. Possible code:

import io

for line in io.StringIO(text):
    row = line.strip().split(' | ')
    # row[:2] holds the two key cells; emit the remaining cells two at a time
    for i in range(2, len(row), 2):
        print(' | '.join(row[:2] + row[i: i+2]))

This gives the expected result:

A | B | Lorem | Ipsum
A | B | is | simply
A | B | dummy
C | D | text | of
C | D | the | printing
C | D | and
E | F | typesetting | industry.
E | F | Lorem
G | H | more | recently
G | H | with | desktop
G | H | publishing | software
G | H | like | Aldus
I | J | Ipsum | has
I | J | been | the
I | J | industry's
K | L | standard | dummy
K | L | text | ever
K | L | since | the
K | L | 1500s
M | N | took | a
O | P | scrambled | it
O | P | to | make
O | P | a | type
O | P | specimen | book
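
The same idea can be wrapped in a reusable function. The following is a minimal sketch, not part of the original answer: the name `wrap_rows` and its parameters `n_keys`, `width`, and `sep` are hypothetical, and it assumes every cell is separated by exactly `' | '`. Like the loop above, it emits nothing for a row that contains only the key cells.

```python
def wrap_rows(text, n_keys=2, width=4, sep=" | "):
    """Split each row into rows of at most `width` cells.

    The first `n_keys` cells of the input row are repeated at the
    start of every output row produced from it.
    """
    step = width - n_keys  # number of non-key cells per output row
    out = []
    for line in text.splitlines():
        cells = line.strip().split(sep)
        keys, rest = cells[:n_keys], cells[n_keys:]
        # Walk the non-key cells in chunks of `step`
        for i in range(0, len(rest), step):
            out.append(sep.join(keys + rest[i:i + step]))
    return "\n".join(out)


sample = "A | B | Lorem | Ipsum | is | simply | dummy\nM | N | took | a"
print(wrap_rows(sample))
```

Calling `wrap_rows(text)` on the question's data should print the same rows as the loop above; changing `width` or `n_keys` adapts it to other layouts.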

ivqmmu1c2#

I would use pandas for this task:

import pandas as pd
from io import StringIO

MAX     = 100   # expected maximum number of input columns
IDX_COL = 2     # number of index columns (A / B)
N_COLS  = 2     # number of desired non-index output columns

df = (pd
   .read_csv(StringIO(text), sep=r'\s*\|\s*',
             engine='python', names=range(MAX))
   .set_index(list(range(IDX_COL)))
   .pipe(lambda d: d.set_axis(
       pd.MultiIndex.from_arrays(
           [(d.columns-IDX_COL)%N_COLS,
            (d.columns-IDX_COL)//N_COLS]), axis=1)
        )
   .stack().droplevel(IDX_COL)
   .to_csv('output.csv', header=None, sep='|')
)

Output file:

A|B|Lorem|Ipsum
A|B|is|simply
A|B|dummy|
C|D|text|of
C|D|the|printing
C|D|and|
E|F|typesetting|industry.
E|F|Lorem|
G|H|more|recently
G|H|with|desktop
G|H|publishing|software
G|H|like|Aldus
I|J|Ipsum|has
I|J|been|the
I|J|industry's|
K|L|standard|dummy
K|L|text|ever
K|L|since|the
K|L|1500s|
M|N|took|a
O|P|scrambled|it
O|P|to|make
O|P|a|type
O|P|specimen|book
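
The column relabeling in the pandas chain can be hard to follow. The sketch below reproduces only the index arithmetic in plain Python for a single row, as an illustration rather than a description of how pandas implements `stack`. The helper name `relabel` is hypothetical; `IDX_COL` and `N_COLS` have the same meaning as in the answer. Each non-index column `c` is mapped to an output row `(c - IDX_COL) // N_COLS` and a position within that row `(c - IDX_COL) % N_COLS`; a missing position becomes an empty cell, which is why short rows end with a trailing `|` in the output file.

```python
IDX_COL = 2  # number of index columns (A / B)
N_COLS = 2   # number of non-index cells per output row


def relabel(row):
    """Return the output lines for one already-split input row."""
    chunks = {}
    for col, cell in enumerate(row):
        if col < IDX_COL:
            continue  # index cells are repeated on every output row
        # chunk = which output row, pos = which slot inside that row
        chunk, pos = divmod(col - IDX_COL, N_COLS)
        chunks.setdefault(chunk, {})[pos] = cell
    lines = []
    for chunk in sorted(chunks):
        # Pad missing slots with an empty string
        cells = [chunks[chunk].get(pos, "") for pos in range(N_COLS)]
        lines.append("|".join(row[:IDX_COL] + cells))
    return lines


for line in relabel(["A", "B", "Lorem", "Ipsum", "is", "simply", "dummy"]):
    print(line)
```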
