Convert bytes to a string

irlmq6kh · posted 2022-09-18 in Java
Follow (0) | Answers (24) | Views (252)

I captured the standard output of an external program into a bytes object:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>>
>>> command_stdout
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

I want to convert that to a normal Python string, so that I can print it like this:

>>> print(command_stdout)
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

I tried the binascii.b2a_qp() method, but got the same bytes object again:

>>> binascii.b2a_qp(command_stdout)
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

How do I convert the bytes object to a str with Python 3?

mf98qq941#

Decode the bytes object to produce a string:

>>> b"abcde".decode("utf-8") 
'abcde'

The example above assumes that the bytes object is in UTF-8, because it is a common encoding. However, you should use whatever encoding your data is actually in!
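As a quick illustration of why the encoding matters, decoding the same bytes with a wrong codec either raises or silently yields mojibake (a small sketch; the sample string is arbitrary):

```python
data = 'café'.encode('utf-8')      # b'caf\xc3\xa9'

print(data.decode('utf-8'))        # 'café'   (correct codec)
print(data.decode('cp1252'))       # 'cafÃ©'  (wrong codec: silent mojibake)

try:
    b'\xff'.decode('utf-8')        # invalid UTF-8 raises instead
except UnicodeDecodeError as exc:
    print(exc.reason)              # 'invalid start byte'
```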

mv1qrgav

mv1qrgav2#

Decode the bytes string and turn it into a character (Unicode) string.

Python 3:

encoding = 'utf-8'
b'hello'.decode(encoding)

str(b'hello', encoding)

Python 2:

encoding = 'utf-8'
'hello'.decode(encoding)

unicode('hello', encoding)

6mzjoqzu3#

This joins a list of byte values into a single string:

>>> bytes_data = [112, 52, 52]
>>> "".join(map(chr, bytes_data))
'p44'
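A possibly simpler route when starting from a list of integer byte values is to build a bytes object first and decode it (a small sketch; 'ascii' is assumed here because the sample values are all ASCII):

```python
bytes_data = [112, 52, 52]                 # integer byte values

text = bytes(bytes_data).decode('ascii')   # b'p44' -> 'p44'
print(text)
```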

z9smfwbn4#

If you don't know the encoding, then to read binary input into a string in a Python 3 and Python 2 compatible way, use the ancient MS-DOS CP437 encoding:

import sys

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('cp437'))

Because the encoding is unknown, expect non-English symbols to be translated into cp437 characters (English characters are not translated, because they match in most single-byte encodings and in UTF-8).

Decoding arbitrary binary input to UTF-8 is unsafe, because you may get this:

>>> b'\x00\x01\xffsd'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte

The same applies to latin-1, which was popular (the default?) for Python 2. See the missing points in Codepage Layout - that is where Python chokes with the infamous ordinal not in range error.

UPDATE 20150604: There are reports that Python 3 has the surrogateescape error handler for decoding binary data without data loss or crashes, but it needs conversion tests, [binary] -> [str] -> [binary], to validate both performance and reliability.
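The round-trip mentioned above can be sketched with surrogateescape (Python 3 only): undecodable bytes are mapped to lone surrogates and restored exactly on re-encoding:

```python
raw = b'\x00\x01\xffsd'                         # arbitrary binary input

text = raw.decode('utf-8', 'surrogateescape')   # no UnicodeDecodeError
back = text.encode('utf-8', 'surrogateescape')  # lossless round-trip

print(back == raw)  # True
```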

UPDATE 20170116: Thanks to a comment by Nearoo - it is also possible to backslash-escape all unknown bytes with the backslashreplace error handler. That works only for Python 3, so even with this workaround you will still get inconsistent output from different Python versions:

import sys

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('utf-8', 'backslashreplace'))

See Python’s Unicode Support for details.

UPDATE 20170119: I decided to implement a slash-escaping decode that works for both Python 2 and Python 3. It should be slower than the cp437 solution, but it should produce identical results on every Python version.


# --- preparation

import codecs

def slashescape(err):
    """codecs error handler. err is a UnicodeDecodeError instance.
    Returns a tuple with a replacement for the undecodable part of
    the input and the position where decoding should continue."""
    thebytes = err.object[err.start:err.end]
    # bytearray() yields ints on both Python 2 and 3
    repl = u''.join(u'\\x%02x' % b for b in bytearray(thebytes))
    return (repl, err.end)

codecs.register_error('slashescape', slashescape)

# --- processing

stream = [b'\x80abc']

lines = []
for line in stream:
    lines.append(line.decode('utf-8', 'slashescape'))

34gzjxbg5#

In Python 3, the default encoding is "utf-8", so you can directly use:

b'hello'.decode()

which is equivalent to

b'hello'.decode(encoding="utf-8")

On the other hand, in Python 2, encoding defaults to ASCII (the default string encoding). Thus, you should use:

b'hello'.decode(encoding)

where encoding is the encoding you want.

**Note:** support for keyword arguments was added in Python 2.7.


m4pnthwp6#

I think you actually want this:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> command_text = command_stdout.decode(encoding='windows-1252')

Aaron's answer was correct, except that you need to know which encoding to use. And I believe that Windows uses 'windows-1252'. It will only matter if you have some unusual (non-ASCII) characters in your content, but then it will make a difference.

By the way, the fact that it does matter is the reason that Python moved to using two different types for binary and text data: it can't convert magically between them, because it doesn't know the encoding unless you tell it! The only way YOU would know is to read the Windows documentation (or read it here).


0pizxfdo7#

Since this question is actually asking about subprocess output, you have more direct approaches available. The most modern one is using subprocess.check_output and passing text=True (Python 3.7+) to automatically decode stdout using the system default encoding:

text = subprocess.check_output(["ls", "-l"], text=True)

For Python 3.6, Popen accepts an encoding keyword:

>>> from subprocess import Popen, PIPE
>>> text = Popen(['ls', '-l'], stdout=PIPE, encoding='utf-8').communicate()[0]
>>> type(text)
str
>>> print(text)
total 0
-rw-r--r-- 1 wim badger 0 May 31 12:45 some_file.txt

The general answer to the question in the title, if you're not dealing with subprocess output, is to decode bytes to text:

>>> b'abcde'.decode()
'abcde'

With no argument, sys.getdefaultencoding() will be used, which is 'utf-8' on Python 3. If your data is not in that encoding, then you must specify the encoding explicitly in the decode call:

>>> b'caf\xe9'.decode('cp1250')
'café'

lrpiutwd8#

Set universal_newlines to True (in Python 3.7+ this parameter was renamed to the clearer text), i.e.

command_stdout = Popen(['ls', '-l'], stdout=PIPE, universal_newlines=True).communicate()[0]

brgchamk9#

To interpret a byte sequence as text, you have to know the corresponding character encoding:

unicode_text = bytestring.decode(character_encoding)

Example:

>>> b'\xc2\xb5'.decode('utf-8')
'µ'

The ls command may produce output that cannot be interpreted as text. A filename on Unix can be any sequence of bytes except the slash b'/' and the zero byte b'\0':

>>> open(bytes(range(0x100)).translate(None, b'\0/'), 'w').close()

Trying to decode such a byte soup using the UTF-8 encoding raises UnicodeDecodeError.

It can be worse: the decoding may fail silently and produce mojibake if you use a wrong incompatible encoding:

>>> '—'.encode('utf-8').decode('cp1252')
'â€”'

The data is corrupted, but your program remains unaware that a failure has occurred.

In general, the character encoding to use is not embedded in the byte sequence itself. You have to communicate this info out-of-band. Some outcomes are more likely than others, which is why the chardet module can "guess" the character encoding. A single Python script may use multiple character encodings in different places.
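Absent out-of-band info, a fallback some code uses is to try a prioritized list of candidate encodings (a rough sketch; the candidate list here is an assumption, and latin-1 comes last because it never fails):

```python
def guess_decode(data, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Try candidate encodings in order; return (text, encoding_used)."""
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError('no candidate encoding matched')

print(guess_decode('café'.encode('utf-8')))  # ('café', 'utf-8')
print(guess_decode(b'caf\xe9'))              # ('café', 'cp1252')
```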

Even for undecodable filenames, the os.fsdecode() function can convert the ls output into a Python string (it uses sys.getfilesystemencoding() and the surrogateescape error handler on Unix):

import os
import subprocess

output = os.fsdecode(subprocess.check_output('ls'))

To get the original bytes back, you can use os.fsencode().

If you pass the universal_newlines=True parameter, subprocess uses locale.getpreferredencoding(False) to decode the bytes; it can be, for example, cp1252 on Windows.

To decode the byte stream on the fly, io.TextIOWrapper() can be used.
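A minimal sketch of such on-the-fly decoding with io.TextIOWrapper, here over an in-memory BytesIO standing in for a binary pipe:

```python
import io

raw = io.BytesIO('café\nµ\n'.encode('utf-8'))          # stand-in for a binary pipe
text_stream = io.TextIOWrapper(raw, encoding='utf-8')  # decodes line by line

lines = list(text_stream)
print(lines)   # ['café\n', 'µ\n']
```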

Different commands may use different character encodings for their output; for example, the dir internal command (cmd) may use cp437. To decode its output, you can pass the encoding explicitly (Python 3.6+):

output = subprocess.check_output('dir', shell=True, encoding='cp437')

The filenames may differ from those returned by os.listdir() (which uses the Windows Unicode API); for example, '\xb6' may be substituted with '\x14': Python's cp437 codec maps b'\x14' to the control character U+0014 instead of U+00B6 (¶). To support filenames with arbitrary Unicode characters, see Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string.


prdp8dxp10#

While @Aaron Maenpaa's answer works, a user recently asked:
Is there any more simple way? 'fhand.read().decode("ASCII")' [...] is so long!

You can use:

command_stdout.decode()

decode() has standard default arguments:

codecs.decode(obj, encoding='utf-8', errors='strict')


xiozqbni11#

If you get the following result when trying decode():
AttributeError: 'str' object has no attribute 'decode'

You can also specify the encoding type directly in a cast:

>>> my_byte_str
b'Hello World'

>>> str(my_byte_str, 'utf-8')
'Hello World'

klsxnrf112#

If you get this error:

utf-8 codec can't decode byte 0x8a

then it is better to use the following code to convert the bytes to a string:

data = b"abcdefg"  # avoid shadowing the built-in name 'bytes'
string = data.decode("utf-8", "ignore")
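For comparison, a quick sketch of the common error handlers on the same invalid input; note that 'ignore' silently drops data:

```python
data = b'caf\xff'                                # 0xff is not valid UTF-8

print(data.decode('utf-8', 'ignore'))            # 'caf'      (byte dropped)
print(data.decode('utf-8', 'replace'))           # 'caf\ufffd' shown as 'caf�'
print(data.decode('utf-8', 'backslashreplace'))  # 'caf\\xff' (escaped, lossless)
```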

vbopmzt113#

Bytes:

m=b'This is bytes'

Convert to a string:

Method 1

m.decode("utf-8")

m.decode()

Method 2

import codecs
codecs.decode(m,encoding="utf-8")

import codecs
codecs.decode(m)

Method 3

str(m,encoding="utf-8")

str(m)[2:-1]

Result:

'This is bytes'
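A caution on Method 3's str(m)[2:-1] variant: it slices the repr rather than decoding, so it only looks right for pure-ASCII bytes (a quick sketch):

```python
m = b'This is bytes'
print(str(m)[2:-1])        # 'This is bytes'  (happens to look right)

m2 = b'caf\xc3\xa9'        # UTF-8 bytes for 'café'
print(str(m2)[2:-1])       # 'caf\\xc3\\xa9'  (literal escapes, not text)
print(m2.decode('utf-8'))  # 'café'           (correct)
```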

cclgggtu14#

I made a function to clean up a list:

def clean_lists(lista):
    # decode any bytes items, then strip surrounding whitespace/newlines
    lista = [x.decode('utf8') if isinstance(x, bytes) else x for x in lista]
    return [x.strip() for x in lista]

scyqe7ek15#

When handling data from Windows systems (with \r\n line endings), my answer is

String = Bytes.decode("utf-8").replace("\r\n", "\n")

Why? Try it with a multiline Input.txt:

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8")
open("Output.txt", "w").write(String)

All your line endings will be doubled (to \r\r\n), leading to extra empty lines. Python's text-mode read functions usually normalize line endings so that strings use only \n. If you receive binary data from a Windows system, Python has no chance to do that. Thus,

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8").replace("\r\n", "\n")
open("Output.txt", "w").write(String)

will reproduce your original file.
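An alternative sketch that avoids the manual replace: open the output file with newline='' so Python performs no newline translation when writing (the file name here is just the answer's example):

```python
data = b'line one\r\nline two\r\n'   # bytes captured from a Windows source
text = data.decode('utf-8')

# newline='' disables Python's \n -> \r\n translation on write
with open('Output.txt', 'w', newline='', encoding='utf-8') as f:
    f.write(text)

round_trip = open('Output.txt', 'rb').read()
print(round_trip == data)   # True: the file is byte-identical
```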
