Go: a rune's binary representation is missing the UTF-8 prefix bits

gcuhipw9 posted on 2023-02-27 in Go

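I am trying to understand UTF-8 and runes in Go. I assumed a rune was just the UTF-8 bytes packed into an int32, but when I print each byte of the UTF-8 encoded character next to the rune, the underlying binary representations do not match. More specifically, the UTF-8 prefix bits of each byte are missing from the rune.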

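package main

import "fmt"

func main() {
    s := "中"
    // Converting to []rune decodes the UTF-8 bytes into code points.
    r1 := []rune(s)[0]
    // rune is an alias for int32, so this conversion changes nothing.
    r2 := int32(r1)
    // Indexing a string yields its raw UTF-8 bytes.
    fmt.Printf("'%b %b %b'\n", s[0], s[1], s[2])
    fmt.Printf("'%b'\n", r1)
    fmt.Printf("'%b'\n", r2)
}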

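Output:

'11100100 10111000 10101101'
'100111000101101'
'100111000101101'

Shouldn't the rune be the binary concatenation of the bytes? Where are the 1110, 10, 10 UTF-8 prefixes of the three bytes in the rune's representation?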


e1xvtsh3 #1

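As tkausl answered in the comments:
A rune is a Unicode code point, not its UTF-8 encoding. The 1110 and 10 prefixes belong to the encoding only; they mark where a multi-byte sequence starts and continues, and they are dropped when the bytes are decoded into a code point.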
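To make that concrete, here is a minimal sketch of how the three bytes from the question map to the rune once the prefix bits are masked off. `decodeThreeByte` is a hypothetical helper written for illustration, not a standard-library function, and it assumes its input is already a valid 3-byte UTF-8 sequence (no validation is performed).

```go
package main

import "fmt"

// decodeThreeByte is a hypothetical helper for illustration only.
// It assumes b0, b1, b2 form a valid 3-byte UTF-8 sequence of the shape
// 1110xxxx 10xxxxxx 10xxxxxx and returns the code point built from the x bits.
func decodeThreeByte(b0, b1, b2 byte) rune {
	// Mask off the 1110 prefix (keep 4 bits) and the 10 prefixes (keep 6 bits each),
	// then concatenate the remaining payload bits.
	return rune(b0&0x0F)<<12 | rune(b1&0x3F)<<6 | rune(b2&0x3F)
}

func main() {
	s := "中"

	// Payload bits of each byte with the UTF-8 prefix removed.
	fmt.Printf("%04b %06b %06b\n", s[0]&0x0F, s[1]&0x3F, s[2]&0x3F) // 0100 111000 101101

	r := decodeThreeByte(s[0], s[1], s[2])
	fmt.Printf("%b\n", r)           // 100111000101101
	fmt.Println(r == []rune(s)[0]) // true
}
```

Concatenating the payload bits gives 0100111000101101, which is 20013 (U+4E2D), the same value the `[]rune` conversion produces. In real code, the `unicode/utf8` package should be preferred over hand-written masking.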


7uzetpgm #2

Here is what is going on:

package main

import "fmt"

func main() {
    s := "中"

    asRunes := []rune(s)
    fmt.Println(asRunes) // [20013]

    asBytes := []byte(s)
    fmt.Println(asBytes) // [228 184 173]
}

The character "中" is a single Unicode code point. From the link given by @rocka2q: "The Unicode standard uses the term “code point” to refer to the item represented by a single value."
"中" is a single value whose code point is 20013. Splitting the integer 20013 into bytes is not the same as splitting the string "中" into bytes: the string stores the UTF-8 encoding of that code point (228 184 173), while the rune stores the code point itself.
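As a rough illustration of the two views, the standard `unicode/utf8` package can convert between them. `roundTrip` below is a hypothetical helper name chosen for this sketch; the `utf8` functions it calls are part of the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// roundTrip is a hypothetical helper for illustration only.
// It decodes the first rune of s, then re-encodes that rune as UTF-8.
func roundTrip(s string) (rune, []byte) {
	// Bytes to code point: the prefix bits are stripped here.
	r, _ := utf8.DecodeRuneInString(s)

	// Code point to bytes: the prefix bits are added back here.
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	return r, buf[:n]
}

func main() {
	r, b := roundTrip("中")
	fmt.Printf("%d %U\n", r, r)  // 20013 U+4E2D
	fmt.Printf("% x\n", b)       // e4 b8 ad
	fmt.Println(utf8.RuneLen(r)) // 3
}
```

The rune holds only the number 20013; the three bytes e4 b8 ad exist only in the encoded form.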
