在Python中缺少的列值中的Concat字符串

6g8kf2rb  于 2022-09-21  发布在  Python
关注(0)|答案(2)|浏览(179)

我有一个数据框

import pandas as pd
data_as_dict={'CHROM': {2241743: 21, 2241744: 3, 2241745: 5, 2241746: 6, 2241747: 6, 2241748: 11, 2241749: 16, 2241750: 20, 0: 'chr1', 1: 'chr2', 2: 'chr3', 3: 'chr5', 4: 'chr5', 5: 'chr8', 6: 'chr9', 7: 'chr10', 8: 'chr14', 9: 'chr14', 10: 'chr15', 11: 'chr15'}, 'POS_GRCh38': {2241743: 41668857, 2241744: 189487243, 2241745: 33946466, 2241746: 396321, 2241747: 32641081, 2241748: 89284793, 2241749: 89960104, 2241750: 34077942, 0: '233276815', 1: '217427435', 2: '169800667', 3: '1279675', 4: '112150207', 5: '32575278', 6: '97775520', 7: '103934543', 8: '36063370', 9: '36269155', 10: '67165147', 11: '67163292'}, 'REF': {2241743: 'T', 2241744: 'A', 2241745: 'A', 2241746: 'C', 2241747: 'C', 2241748: 'G', 2241749: 'T', 2241750: 'G', 0: 'G', 1: 'G', 2: 'G', 3: 'C', 4: 'T', 5: 'T', 6: 'C', 7: 'C', 8: 'C', 9: 'C', 10: 'G', 11: 'C'}, 'Effect_allele': {2241743: 'A', 2241744: 'T', 2241745: 'G', 2241746: 'T', 2241747: 'T', 2241748: 'A', 2241749: 'C', 2241750: 'A', 0: 'A', 1: 'C', 2: 'T', 3: 'T', 4: 'A', 5: 'G', 6: 'A', 7: 'T', 8: 'G', 9: 'T', 10: 'C', 11: 'T'}, 'Effect_size': {2241743: 0.094310679, 2241744: 0.122217633, 2241745: 0.527632769, 2241746: 0.482426149, 2241747: 0.157003749, 2241748: 0.148420005, 2241749: 0.285178942, 2241750: 0.2390169, 0: 0.2776317365982795, 1: 0.3576744442718159, 2: 0.2070141693843261, 3: 0.1823215567939546, 4: 0.3148107398400336, 5: 0.2776317365982795, 6: 0.5247285289349821, 7: 0.3435897043900768, 8: 0.3293037471426003, 9: 0.5933268452777344, 10: 0.2070141693843261, 11: 0.2151113796169455}, 'TYPE': {2241743: 'Basal_cell_carcinoma_PRSWeb', 2241744: 'Squamous_cell_carcinoma_PRSWeb', 2241745: 'Squamous_cell_carcinoma_PRSWeb', 2241746: 'Squamous_cell_carcinoma_PRSWeb', 2241747: 'Squamous_cell_carcinoma_PRSWeb', 2241748: 'Squamous_cell_carcinoma_PRSWeb', 2241749: 'Squamous_cell_carcinoma_PRSWeb', 2241750: 'Squamous_cell_carcinoma_PRSWeb', 0: 'THYROID_PGS', 1: 'THYROID_PGS', 2: 'THYROID_PGS', 3: 'THYROID_PGS', 4: 'THYROID_PGS', 5: 'THYROID_PGS', 6: 'THYROID_PGS', 7: 'THYROID_PGS', 8: 'THYROID_PGS', 9: 'THYROID_PGS', 10: 'THYROID_PGS', 11: 'THYROID_PGS'}, 'Cancer': {2241743: 'NMSC', 2241744: 'NMSC', 2241745: 'NMSC', 2241746: 'NMSC', 2241747: 'NMSC', 2241748: 'NMSC', 2241749: 'NMSC', 2241750: 'NMSC', 0: 'THYROID', 1: 'THYROID', 2: 'THYROID', 3: 'THYROID', 4: 'THYROID', 5: 'THYROID', 6: 'THYROID', 7: 'THYROID', 8: 'THYROID', 9: 'THYROID', 10: 'THYROID', 11: 'THYROID'}, 'Significant_YN': {2241743: 'Y', 2241744: 'Y', 2241745: 'Y', 2241746: 'Y', 2241747: 'Y', 2241748: 'Y', 2241749: 'Y', 2241750: 'Y', 0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'Y', 5: 'Y', 6: 'Y', 7: 'Y', 8: 'Y', 9: 'Y', 10: 'Y', 11: 'Y'}}

all_cancers = pd.DataFrame.from_dict(data_as_dict)

我想在列CHROM中缺少的地方追加字符串chr。我可以在R中用greplpaste来做这件事,但我想在Python中尝试一下。我想出了这两个命令,但不确定如何对列进行索引,因为pd.Series正在生成NAN。

pd.Series(all_cancers['CHROM']).str.contains(pat = 'chr', regex = True)
"chr" + all_cancers['CHROM'].map(str)
z8dt9xmd

z8dt9xmd1#

Pandas中的字符串操作没有经过优化,所以做你想做的事情的最好方法是通过列表理解。


# check if a value contains 'chr' and prepend it if not

all_cancers['CHROM'] = [x if isinstance(x, str) and 'chr' in x else f"chr{x}" for x in all_cancers['CHROM'].tolist()]

同样的Pandas手术可以通过mask()方法进行。


# flag rows that starts with 'chr' and prepend 'chr' to values in the remaining rows

all_cancers['CHROM'] = all_cancers['CHROM'].mask(~all_cancers['CHROM'].str.startswith('chr', na=False), 'chr'+all_cancers['CHROM'].astype(str))

一个简单的timeit测试将显示列表Comp快得多。

fykwrbwg

fykwrbwg2#

另见:

例如,使用NumPy的条件过滤器np.where

import pandas as pd
import numpy as np

data_as_dict = {
        'CHROM': {2241743: 21, 2241744: 3, 2241745: 5, 2241746: 6, 2241747: 6, 2241748: 11, 2241749: 16, 2241750: 20,
                  0: 'chr1', 1: 'chr2', 2: 'chr3', 3: 'chr5', 4: 'chr5', 5: 'chr8', 6: 'chr9', 7: 'chr10', 8: 'chr14',
                  9: 'chr14', 10: 'chr15', 11: 'chr15'},
        'POS_GRCh38': {2241743: 41668857, 2241744: 189487243, 2241745: 33946466, 2241746: 396321, 2241747: 32641081,
                       2241748: 89284793, 2241749: 89960104, 2241750: 34077942, 0: '233276815', 1: '217427435',
                       2: '169800667', 3: '1279675', 4: '112150207', 5: '32575278', 6: '97775520', 7: '103934543',
                       8: '36063370', 9: '36269155', 10: '67165147', 11: '67163292'},
        'REF': {2241743: 'T', 2241744: 'A', 2241745: 'A', 2241746: 'C', 2241747: 'C', 2241748: 'G', 2241749: 'T',
                2241750: 'G', 0: 'G', 1: 'G', 2: 'G', 3: 'C', 4: 'T', 5: 'T', 6: 'C', 7: 'C', 8: 'C', 9: 'C', 10: 'G',
                11: 'C'},
        'Effect_allele': {2241743: 'A', 2241744: 'T', 2241745: 'G', 2241746: 'T', 2241747: 'T', 2241748: 'A',
                          2241749: 'C', 2241750: 'A', 0: 'A', 1: 'C', 2: 'T', 3: 'T', 4: 'A', 5: 'G', 6: 'A', 7: 'T',
                          8: 'G', 9: 'T', 10: 'C', 11: 'T'},
        'Effect_size': {2241743: 0.094310679, 2241744: 0.122217633, 2241745: 0.527632769, 2241746: 0.482426149,
                        2241747: 0.157003749, 2241748: 0.148420005, 2241749: 0.285178942, 2241750: 0.2390169,
                        0: 0.2776317365982795, 1: 0.3576744442718159, 2: 0.2070141693843261, 3: 0.1823215567939546,
                        4: 0.3148107398400336, 5: 0.2776317365982795, 6: 0.5247285289349821, 7: 0.3435897043900768,
                        8: 0.3293037471426003, 9: 0.5933268452777344, 10: 0.2070141693843261, 11: 0.2151113796169455},
        'TYPE': {2241743: 'Basal_cell_carcinoma_PRSWeb', 2241744: 'Squamous_cell_carcinoma_PRSWeb',
                 2241745: 'Squamous_cell_carcinoma_PRSWeb', 2241746: 'Squamous_cell_carcinoma_PRSWeb',
                 2241747: 'Squamous_cell_carcinoma_PRSWeb', 2241748: 'Squamous_cell_carcinoma_PRSWeb',
                 2241749: 'Squamous_cell_carcinoma_PRSWeb', 2241750: 'Squamous_cell_carcinoma_PRSWeb', 0: 'THYROID_PGS',
                 1: 'THYROID_PGS', 2: 'THYROID_PGS', 3: 'THYROID_PGS', 4: 'THYROID_PGS', 5: 'THYROID_PGS',
                 6: 'THYROID_PGS', 7: 'THYROID_PGS', 8: 'THYROID_PGS', 9: 'THYROID_PGS', 10: 'THYROID_PGS',
                 11: 'THYROID_PGS'},
        'Cancer': {2241743: 'NMSC', 2241744: 'NMSC', 2241745: 'NMSC', 2241746: 'NMSC', 2241747: 'NMSC', 2241748: 'NMSC',
                   2241749: 'NMSC', 2241750: 'NMSC', 0: 'THYROID', 1: 'THYROID', 2: 'THYROID', 3: 'THYROID',
                   4: 'THYROID', 5: 'THYROID', 6: 'THYROID', 7: 'THYROID', 8: 'THYROID', 9: 'THYROID', 10: 'THYROID',
                   11: 'THYROID'},
        'Significant_YN': {2241743: 'Y', 2241744: 'Y', 2241745: 'Y', 2241746: 'Y', 2241747: 'Y', 2241748: 'Y',
                           2241749: 'Y', 2241750: 'Y', 0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'Y', 5: 'Y', 6: 'Y', 7: 'Y',
                           8: 'Y', 9: 'Y', 10: 'Y', 11: 'Y'}}

all_cancers = pd.DataFrame.from_dict(data_as_dict)
all_cancers['CHROM'] = np.where(~all_cancers['CHROM'].str.startswith("chr", na=False), "chr", all_cancers['CHROM'])

相关问题