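I'm trying to understand runes in UTF-8/Go. I assumed a rune was just the UTF-8 bytes packed into an int32, but when I inspect each byte of the UTF-8 encoding alongside the rune, the underlying binary representations don't match. More specifically, the UTF-8 prefix bits of each byte are missing from the rune.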
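package main

import "fmt"

func main() {
	s := "中"
	r1 := []rune(s)[0] // first rune (code point) of s
	r2 := int32(r1)    // rune is an alias for int32
	fmt.Printf("'%b %b %b'\n", s[0], s[1], s[2]) // the three UTF-8 bytes
	fmt.Printf("'%b'\n", r1)
	fmt.Printf("'%b'\n", r2)
}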
'11100100 10111000 10101101' '100111000101101'
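Shouldn't the rune be the binary concatenation of the bytes? Where are the 1110, 10, 10 UTF-8 prefixes of the three bytes in the rune's representation?
2 Answers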
e1xvtsh31#
tkausl answered in the comments:
A rune is a Unicode code point, not its UTF-8 encoding.
7uzetpgm2#
Here's the thing:
The character "中" is a single Unicode code point. From the link given by @rocka2q: "The Unicode standard uses the term “code point” to refer to the item represented by a single value."
"中" is a single value whose code point is 20013 (U+4E2D). Decomposing 20013 byte by byte is not the same as decomposing the UTF-8 encoding of "中" byte by byte.
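The relationship between the two can be sketched by hand: in a three-byte UTF-8 sequence of the form 1110xxxx 10xxxxxx 10xxxxxx, dropping the prefix bits and concatenating the remaining payload bits gives the code point. The function `decode3` below is a hypothetical helper for illustration only; it assumes a well-formed three-byte sequence and does no validation.

```go
package main

import "fmt"

// decode3 rebuilds a code point from a three-byte UTF-8 sequence
// (1110xxxx 10xxxxxx 10xxxxxx) by masking off the prefix bits.
// Hypothetical helper: it assumes valid input and performs no checks.
func decode3(b0, b1, b2 byte) rune {
	return rune(b0&0x0F)<<12 | // low 4 bits of the lead byte
		rune(b1&0x3F)<<6 | // low 6 bits of the first continuation byte
		rune(b2&0x3F) // low 6 bits of the second continuation byte
}

func main() {
	s := "中"
	r := decode3(s[0], s[1], s[2])

	fmt.Printf("'%b'\n", r)        // '100111000101101'
	fmt.Println(r == []rune(s)[0]) // true
}
```

So the payload bits 0100, 111000 and 101101 from the three bytes concatenate to 100111000101101, which is exactly the rune printed in the question.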