2> CodePointAt(<<"πr²"/utf8>>, 0).
960
3> CodePointAt(<<"πr²"/utf8>>, 1).
** exception error: no match of right hand side value <<207,128,114,194,178>>
4> CodePointAt(<<"πr²"/utf8>>, 2).
114
5> CodePointAt(<<"πr²"/utf8>>, 3).
178
6> CodePointAt(<<"πr²"/utf8>>, 4).
** exception error: no match of right hand side value <<207,128,114,194,178>>
7> CodePointAt(<<"πr²"/utf8>>, 5).
** exception error: no match of right hand side value <<207,128,114,194,178>>
2 answers

zyfwsgd6 #1
You can use bit syntax pattern matching to skip the first N bytes and decode the first character of the remaining bytes as UTF-8:
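A minimal sketch of this technique, consistent with the shell session shown earlier. The module name codepoint and the function names at/2 and safe_at/2 are hypothetical, chosen only for illustration; safe_at/2 is an assumed variant that wraps the same pattern in a case expression so that a bad offset yields error instead of a crash.

```erlang
-module(codepoint).
-export([at/2, safe_at/2]).

%% Skip the first Offset bytes, then decode one UTF-8 encoded code point.
%% Raises a badmatch error if Offset is not on a character boundary
%% or lies at or beyond the end of the binary.
at(Binary, Offset) ->
    <<_:Offset/binary, Char/utf8, _/binary>> = Binary,
    Char.

%% Same pattern inside a case expression (hypothetical variant):
%% returns {ok, Char} on success and the atom error otherwise.
safe_at(Binary, Offset) ->
    case Binary of
        <<_:Offset/binary, Char/utf8, _/binary>> -> {ok, Char};
        _ -> error
    end.
```

Note that Offset counts bytes, not characters: in <<"πr²"/utf8>> the r sits at offset 2 because π occupies two bytes.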
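Test:

As you can see, if the offset is not on a valid UTF-8 character boundary, the function throws an error. If needed, you can use a case expression to handle that error differently.

jfewjypa #2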
First, keep in mind that in Erlang only binary strings use UTF-8. Plain double-quoted strings are already just lists of code points (much like UTF-32). The unicode:chardata() type covers both kinds of string, including mixed lists such as ["Hello", $\s, [<<"Filip"/utf8>>, $!]]. If needed, you can use unicode:characters_to_list(Chardata) or unicode:characters_to_binary(Chardata) to get a flattened version.

Meanwhile, the JS codePointAt function works on UTF-16 encoded strings, which is what JavaScript uses. Note that the index in this case is not a byte position, but the index of the 16-bit units of the encoding. UTF-16 is also a variable-length encoding: code points that need more than 16 bits, for example emojis like 👍, use a kind of escape sequence called "surrogate pairs". So if such characters can occur, the index is misleading: in "a👍z" (in JavaScript), the a is at 0, but the z is not at 2 but at 3.

What you want is probably what are called "grapheme clusters": the things that look like a single unit when printed (see the documentation of Erlang's string module: https://www.erlang.org/doc/man/string.html). You cannot really use a numeric index to dig grapheme clusters out of a string; you need to iterate over the string from the start, extracting one at a time. This can be done with string:next_grapheme(Chardata) (see https://www.erlang.org/doc/man/string.html#next_grapheme-1). Or, if for some reason you really do need to index them numerically, you can put the individual cluster substrings into an array (see https://www.erlang.org/doc/man/array.html). For example: array:from_list(string:to_graphemes(Chardata)).
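The three operations described in this answer can be sketched as follows. The module name graphemes and its function names are hypothetical; the library calls are the ones named above (unicode:characters_to_list/1, string:next_grapheme/1, string:to_graphemes/1, array:from_list/1). This is an illustration under those assumptions, not a definitive implementation.

```erlang
-module(graphemes).
-export([flatten/1, at/2, to_array/1]).

%% Flatten mixed chardata (lists, binaries, nested lists)
%% into a plain list of code points.
flatten(Chardata) ->
    unicode:characters_to_list(Chardata).

%% Walk the string one grapheme cluster at a time and return the
%% cluster at the zero-based Index, or error if the string is too
%% short or contains invalid data.
at(Chardata, Index) when is_integer(Index), Index >= 0 ->
    case string:next_grapheme(Chardata) of
        [Grapheme | _Rest] when Index =:= 0 -> {ok, Grapheme};
        [_Grapheme | Rest] -> at(Rest, Index - 1);
        [] -> error;
        {error, _Invalid} -> error
    end.

%% Build an array of grapheme clusters for repeated indexed access.
to_array(Chardata) ->
    array:from_list(string:to_graphemes(Chardata)).
```

With the array variant, array:get(2, graphemes:to_array("a👍z")) gives the z, because the emoji counts as one cluster rather than two UTF-16 units. Walking with at/2 costs time proportional to the index on every call, so the array is the better fit when many lookups hit the same string.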