Erlang equivalent of JavaScript's codePointAt?

brvekthn, posted on 2022-12-08 in Erlang

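Is there an Erlang equivalent of JavaScript's codePointAt? One that returns the code point starting at a given byte offset, without modifying the underlying string/binary?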

zyfwsgd6 #1

You can use bit syntax pattern matching to skip the first N bytes and decode the first character of the remaining bytes as UTF-8:

1> CodePointAt = fun(Binary, Offset) ->
  <<_:Offset/binary, Char/utf8, _/binary>> = Binary,
  Char
end.

Testing it:

2> CodePointAt(<<"πr²"/utf8>>, 0).
960
3> CodePointAt(<<"πr²"/utf8>>, 1).
** exception error: no match of right hand side value <<207,128,114,194,178>>
4> CodePointAt(<<"πr²"/utf8>>, 2).
114
5> CodePointAt(<<"πr²"/utf8>>, 3).
178
6> CodePointAt(<<"πr²"/utf8>>, 4).
** exception error: no match of right hand side value <<207,128,114,194,178>>
7> CodePointAt(<<"πr²"/utf8>>, 5).
** exception error: no match of right hand side value <<207,128,114,194,178>>

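As you can see, if the offset does not fall on a valid UTF-8 character boundary, or lies at or past the end of the binary, the match fails and the function raises a badmatch error. If you need to, you can use a case expression to handle that situation differently.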
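As a minimal sketch of that case-based variant: the module name `codepoint`, the function name `code_point_at/2`, and the `{ok, _}` / `{error, _}` return shapes are illustrative choices, not something the answer above prescribes.

```erlang
-module(codepoint).
-export([code_point_at/2]).

%% Returns {ok, CodePoint} when a well-formed UTF-8 sequence starts at
%% byte Offset, and an {error, Reason} tuple otherwise, instead of
%% raising a badmatch error.
code_point_at(Binary, Offset)
  when is_binary(Binary), is_integer(Offset), Offset >= 0 ->
    case Binary of
        <<_:Offset/binary, Char/utf8, _/binary>> ->
            {ok, Char};
        <<_:Offset/binary, _, _/binary>> ->
            %% There is a byte at Offset, but it does not begin a valid
            %% UTF-8 sequence (continuation byte, truncated or malformed).
            {error, invalid_utf8};
        _ ->
            %% Offset is at or past the end of the binary.
            {error, out_of_range}
    end.
```

With the `<<"πr²"/utf8>>` binary from the test session, offsets 0, 2 and 3 would give `{ok, 960}`, `{ok, 114}` and `{ok, 178}`, offsets 1 and 4 would give `{error, invalid_utf8}`, and offset 5 would give `{error, out_of_range}`.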

jfewjypa #2

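First, keep in mind that in Erlang only binary strings use UTF-8. Plain double-quoted strings are already just lists of code points (much like UTF-32). The unicode:chardata() type represents both kinds of strings, including mixed lists such as ["Hello", $\s, [<<"Filip"/utf8>>, $!]]. If you need a flattened version, you can use unicode:characters_to_list(Chardata) or unicode:characters_to_binary(Chardata).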
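To make the two flattened forms concrete, here is a small sketch. The module name `chardata_demo` and the helper `flatten/1` are hypothetical; only the two `unicode` functions come from the answer.

```erlang
-module(chardata_demo).
-export([flatten/1]).

%% Returns both flattened forms of Chardata: a flat list of code points
%% and a UTF-8 encoded binary.
flatten(Chardata) ->
    {unicode:characters_to_list(Chardata),
     unicode:characters_to_binary(Chardata)}.
```

For the mixed list above, `chardata_demo:flatten(["Hello", $\s, [<<"Filip"/utf8>>, $!]])` would return `{"Hello Filip!", <<"Hello Filip!">>}`. In the list form every element is one code point, so positions are code point positions; in the binary form positions are byte offsets.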
Meanwhile, the JS codePointAt function works on UTF-16 encoded strings, which is what JavaScript uses. Note that the index in this case is not a byte position, but an index into the 16-bit units of the encoding. UTF-16 is also a variable-length encoding: code points that need more than 16 bits (for example emojis like 👍) are encoded as two 16-bit units called a "surrogate pair". So if such characters can occur, the index is misleading: in "a👍z" (in JavaScript), the a is at index 0, but the z is at index 3, not 2.
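If the JavaScript indexing is what you actually need, one possible sketch is to re-encode the data as UTF-16 and skip whole 16-bit units. The module name `utf16_index` and the return shapes are assumptions made for illustration. Unlike JavaScript, which returns the lone surrogate value when the index lands on the second half of a surrogate pair, this version returns `error` there.

```erlang
-module(utf16_index).
-export([code_point_at/2]).

%% Index counts 16-bit code units, as in JavaScript's codePointAt.
%% Binaries inside Chardata are assumed to be UTF-8 encoded.
code_point_at(Chardata, Index) when is_integer(Index), Index >= 0 ->
    Utf16 = unicode:characters_to_binary(Chardata, utf8, {utf16, big}),
    Skip = Index * 2,  % each UTF-16 code unit occupies two bytes
    case Utf16 of
        <<_:Skip/binary, Char/utf16, _/binary>> ->
            {ok, Char};
        _ ->
            %% Out of range, in the middle of a surrogate pair, or the
            %% conversion itself failed and returned an error tuple.
            error
    end.
```

For the "a👍z" example, written as the code point list `[$a, 16#1F44D, $z]`, indexes 0, 1 and 3 would give `{ok, 97}`, `{ok, 128077}` and `{ok, 122}`, while indexes 2 and 4 would give `error`. Note that the whole input is converted on every call, so this is only reasonable for short strings or occasional lookups.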
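What you probably want are so-called "grapheme clusters": the things that look like a single unit when printed (see the documentation of Erlang's string module: https://www.erlang.org/doc/man/string.html). You cannot really use a numeric index to pick grapheme clusters out of a string; you need to iterate over the string from the start, extracting one at a time. This can be done with string:next_grapheme(Chardata) (see https://www.erlang.org/doc/man/string.html#next_grapheme-1). Or, if for some reason you really do need to index them numerically, you can put the individual cluster substrings into an array (see https://www.erlang.org/doc/man/array.html). For example: array:from_list(string:to_graphemes(Chardata)).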
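The iteration described above could look roughly like this. The module name `grapheme_index`, the function `grapheme_at/2`, and the zero-based index are illustrative assumptions.

```erlang
-module(grapheme_index).
-export([grapheme_at/2]).

%% Walks Chardata from the start and returns the grapheme cluster at
%% zero-based Index. A cluster is either a single code point or a list
%% of code points (for example a base letter plus a combining mark).
grapheme_at(Chardata, Index) when is_integer(Index), Index >= 0 ->
    case string:next_grapheme(Chardata) of
        [Grapheme | _Rest] when Index =:= 0 ->
            {ok, Grapheme};
        [_Grapheme | Rest] ->
            grapheme_at(Rest, Index - 1);
        [] ->
            error;               % ran out of input before reaching Index
        {error, _Invalid} ->
            error                % input is not valid Unicode chardata
    end.
```

For the input `[$a, $e, 16#301, $z]` (an "e" followed by a combining acute accent), index 1 would give `{ok, [101, 769]}` and index 2 would give `{ok, 122}`. Each lookup is linear in the index, which is why the array approach is worth considering when many lookups hit the same string.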
