pandas: How do I convert bytes to UTF-8 without turning regular strings into NaNs?

Posted by h5qlskok on 2022-12-16 in Other


I have a process that runs on multiple pandas DataFrames. Sometimes the data arrives as bytes, for example:

>>> df[['x']]
        x
0  b'123'
1  b'111'
2  b'110'

Other times it arrives as positive integers:

>>> df[['x']]
     x
0   80
1  123
2  491

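I want to decode the bytes as UTF-8 without modifying the regular integers. So far I have tried df['x'].str.decode('utf-8'): it works when the DataFrame holds bytes, but when it holds integers it converts every value to NaN.
I would like the solution to be vectorized, because speed matters. For example, I cannot use a list comprehension.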
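For anyone trying to reproduce this, here is a minimal sketch of the behavior described above. It assumes the non-bytes values sit in an object-dtype column as regular strings (as the title suggests); the variable names are made up for illustration.

```python
import pandas as pd

# Object-dtype column of bytes: .str.decode behaves as expected.
bytes_col = pd.Series([b"123", b"111", b"110"], dtype=object)
decoded = bytes_col.str.decode("utf-8")
print(decoded.tolist())

# Object-dtype column of regular strings: str has no .decode method,
# so pandas fills in a missing value for every element.
str_col = pd.Series(["80", "123", "491"], dtype=object)
lost = str_col.str.decode("utf-8")
print(lost.isna().all())
```

The second column is where the data gets wiped out, so any fix has to skip the decode step for values that are not bytes.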

pandas

Source: https://stackoverflow.com/questions/74805790/how-do-i-convert-bytes-to-utf-8-without-turning-regular-strings-into-nans

2 Answers


rm5edbpk #1

You can define a function that checks the type of each value before decoding it.

import pandas as pd

# Define the decode_if_bytes function
def decode_if_bytes(input_str):
    if isinstance(input_str, bytes):
        return input_str.decode('utf-8')
    return input_str

Decoding a DataFrame with mixed values:

# Apply the function to the dataframe
df = pd.DataFrame({'x':[b'80',123,491]})
df['x'] = df['x'].apply(decode_if_bytes)

print(df)

Output:

     x
0   80
1  123
2  491

Decoding another DataFrame, this time all bytes:

df = pd.DataFrame({'x':[b'123',b'111',b'110']})
df['x'] = df['x'].apply(decode_if_bytes)

print(df)

Output:

     x
0  123
1  111
2  110
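Since the question mentions a process that runs over several DataFrames, the same check can be wrapped so it only touches object-dtype columns. This is a rough sketch, not part of the original answer: decode_object_columns is a hypothetical helper name, and the y column is invented for the example. Note that apply and map still call the function once per value in Python, so this is not vectorized in the strict sense.

```python
import pandas as pd

def decode_if_bytes(value):
    # Decode bytes values; pass everything else through unchanged.
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return value

def decode_object_columns(frame):
    # Hypothetical helper: returns a copy in which bytes values in
    # every object-dtype column are decoded. Numeric columns are skipped.
    out = frame.copy()
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].map(decode_if_bytes)
    return out

df = pd.DataFrame({"x": [b"80", 123, 491], "y": [1, 2, 3]})
result = decode_object_columns(df)
print(result)
```

Skipping numeric columns up front avoids the per-value check on data that cannot contain bytes at all.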

Answered 2022-12-16

bvpmtnay #2

One approach is to infer the column's dtype and attempt the bytes conversion only when the column is non-numeric:

if not pd.api.types.is_numeric_dtype(df['x'].infer_objects().dtypes):
    df['x'] = df['x'].str.decode('utf-8')

Test code:

import pandas as pd

df = pd.DataFrame({'x':[b'123',b'111',b'110']})
print('','before',df,sep='\n')
if not pd.api.types.is_numeric_dtype(df['x'].infer_objects().dtypes):
    df['x'] = df['x'].str.decode('utf-8')
print('','after',df,sep='\n')

df = pd.DataFrame({'x':[80,123,491]})
print('','before',df,sep='\n')
if not pd.api.types.is_numeric_dtype(df['x'].infer_objects().dtypes):
    df['x'] = df['x'].str.decode('utf-8')
print('','after',df,sep='\n')

Output:

before
        x
0  b'123'
1  b'111'
2  b'110'

after
     x
0  123
1  111
2  110

before
     x
0   80
1  123
2  491

after
     x
0   80
1  123
2  491
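A possible variation on the check above, offered as a sketch rather than as the answerer's method: instead of asking whether the column is non-numeric, ask whether it is made of bytes, using pandas.api.types.infer_dtype. That way a column of regular strings is also left alone. decode_if_bytes_column is a hypothetical name.

```python
import pandas as pd
from pandas.api.types import infer_dtype

def decode_if_bytes_column(series, encoding="utf-8"):
    # Hypothetical helper: decode only when every non-missing value
    # in the column is a bytes object; otherwise return it unchanged.
    if infer_dtype(series, skipna=True) == "bytes":
        return series.str.decode(encoding)
    return series

as_bytes = decode_if_bytes_column(pd.Series([b"123", b"111", b"110"]))
as_ints = decode_if_bytes_column(pd.Series([80, 123, 491]))
as_strs = decode_if_bytes_column(pd.Series(["80", "123", "491"]))
print(as_bytes.tolist(), as_ints.tolist(), as_strs.tolist())
```

The decode itself is still the vectorized .str.decode call; the type check happens once per column.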

UPDATE: If the column is only partly bytes, for example x holds b'80' alongside 123, you can do the following:

import pandas as pd
import numpy as np

df = pd.DataFrame({'x':[b'80',123,491]})
print('','before',df,sep='\n')
df.x = np.where(df.x.astype(np.int64) == df.x, df.x.astype(str).str.encode('utf-8'), df.x)
df.x = df.x.str.decode('utf-8')
print('','after',df,sep='\n')

Output:

before
       x
0  b'80'
1    123
2    491

after
     x
0   80
1  123
2  491
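One caveat with the snippet above: astype(np.int64) only succeeds when every bytes value looks like a number, so a value such as b'abc' would raise a ValueError. A sketch of an alternative for mixed columns, assuming you just want bytes entries decoded and everything else kept as it is, is to build a boolean mask and decode only the masked rows. decode_bytes_entries is a hypothetical name.

```python
import pandas as pd

def decode_bytes_entries(series, encoding="utf-8"):
    # Hypothetical helper: decode only the entries that are bytes.
    out = series.copy()
    is_bytes = out.map(lambda v: isinstance(v, bytes)).astype(bool)
    # Guard against columns with no bytes at all (e.g. plain int64),
    # where the .str accessor would not be available.
    if is_bytes.any():
        out.loc[is_bytes] = out.loc[is_bytes].str.decode(encoding)
    return out

mixed = decode_bytes_entries(pd.Series([b"80", 123, 491]))
text = decode_bytes_entries(pd.Series([b"abc", "def", 7]))
ints = decode_bytes_entries(pd.Series([80, 123, 491]))
print(mixed.tolist(), text.tolist(), ints.tolist())
```

Unlike the snippet above, the non-bytes entries keep their original types here, so the first result holds the string '80' next to the integers 123 and 491. The mask is built with a per-value check, but the decode itself runs through .str.decode.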

Answered 2022-12-16
