Why does the strwrap function in R change some characters into byte sequences?

ygya80vv  posted 12 months ago in Other
> s <- " ‘in silico’’) and the object o"
> strwrap(s, 5)
[1] "<e2><80><98>in"                  "silico<e2><80><99><e2><80><99>)"
[3] "and"                             "the"
[5] "object"                          "o"

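The strwrap() function in base R changes some characters into those strange byte sequences in angle brackets. Is there a way to convert these sequences in the wrapped strings back to the original characters? And why does the function do this in the first place? stringi::stri_wrap() avoids the problem nicely.
In case this is related to the locale settings, here is my session info: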

R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] stringi_1.7.6     gridExtra_2.3     data.table_1.14.2

loaded via a namespace (and not attached):
[1] compiler_4.2.0 magrittr_2.0.3 tools_4.2.0    gtable_0.3.0   stringr_1.4.0

55ooxyrt  #1

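Are you sure about your sessionInfo? I can reproduce this by setting every locale category to "C". Setting Sys.setlocale("LC_ALL", "en_US.UTF-8") fixes it. Do you have locale-specific entries in your .Renviron, Renviron, Renviron.site, or Rprofile.site?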
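To see which locale the session actually ended up with, something like the following may help. It is only a sketch: the file locations assume a default R installation, and the variable names (ctype, info, startup) are just for illustration.

```r
# Which character-type locale is the session really using?
ctype <- Sys.getlocale("LC_CTYPE")
info <- l10n_info()
print(ctype)
print(info[["UTF-8"]])  # TRUE when R treats the native encoding as UTF-8

# Environment variables that commonly decide the locale at startup
print(Sys.getenv(c("LANG", "LC_ALL", "LC_CTYPE")))

# Startup files that could override the locale (default locations assumed)
startup <- c(
  user_renviron = path.expand("~/.Renviron"),
  site_renviron = file.path(R.home("etc"), "Renviron.site"),
  site_rprofile = file.path(R.home("etc"), "Rprofile.site")
)
print(file.exists(startup))
```

If LC_CTYPE reports "C" or l10n_info() says the session is not UTF-8, the escaped output from strwrap() is what I would expect.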

> s <- " ‘in silico’’) and the object o"
> Sys.setlocale("LC_ALL", "C")
[1] "LC_CTYPE=C;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C"
> strwrap(s, 5)
[1] "<U+2018>in"          "silico<U+2019><U+2019>)" "and"                 "the"                
[5] "object"              "o"                  
> stringi::stri_wrap(s, 5)
[1] "<U+2018>in"          "silico<U+2019><U+2019>)" "and"                 "the"                
[5] "object"              "o"                  
> Sys.setlocale("LC_ALL", "en_US.UTF-8")
[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C"
> strwrap(s, 5)
[1] "‘in"       "silico’’)" "and"       "the"       "object"    "o"        
> stringi::stri_wrap(s, 5)
[1] "‘in"       "silico’’)" "and"       "the"       "object"    "o"        
> sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8    LC_NUMERIC=C            LC_TIME=en_US.UTF-8    
 [4] LC_COLLATE=en_US.UTF-8  LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C              LC_NAME=C               LC_ADDRESS=C           
[10] LC_TELEPHONE=C          LC_MEASUREMENT=C        LC_IDENTIFICATION=C    

time zone: Europe/Zurich
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.3.2 tools_4.3.2    stringi_1.8.3
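As for converting the sequences back: fixing the locale is the cleaner route, but if you are stuck with strings that already contain literal text such as <e2><80><98>, a rough workaround is to parse the hex bytes and reassemble them. This is a sketch, not something base R provides; restore_bytes is a hypothetical helper name, it assumes the bytes are valid UTF-8, and it will also decode any text that merely looks like <xx>.

```r
# Hypothetical helper: turn literal "<e2><80><98>" runs back into characters.
# Assumes each run of <xx> tokens encodes valid UTF-8 bytes.
restore_bytes <- function(x) {
  pattern <- "(<[0-9a-f]{2}>)+"
  vapply(x, function(el) {
    m <- gregexpr(pattern, el)
    if (m[[1]][1] == -1L) return(el)       # nothing to decode
    runs <- regmatches(el, m)[[1]]
    decoded <- vapply(runs, function(r) {
      hex <- regmatches(r, gregexpr("[0-9a-f]{2}", r))[[1]]
      out <- rawToChar(as.raw(strtoi(hex, 16L)))
      Encoding(out) <- "UTF-8"             # mark the bytes as UTF-8
      out
    }, character(1), USE.NAMES = FALSE)
    regmatches(el, m) <- list(decoded)     # put decoded text back in place
    el
  }, character(1), USE.NAMES = FALSE)
}

wrapped <- c("<e2><80><98>in", "silico<e2><80><99><e2><80><99>)", "and")
fixed <- restore_bytes(wrapped)
```

Whether the restored strings then print as quotes or as <U+2018> still depends on the session locale, so this does not replace setting a UTF-8 locale.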
