Remove Unicode characters from a string in Java using regex

Posted by tvmytwxo on 2021-07-11 in Java


I have the following input string.

String comment = "Good morning! \u2028\u2028I am looking to purchase a new Honda car as I\u2019m outgrowing my current car. I currently drive a Hyundai Accent and I was looking for something a little bit larger and more comfortable like the Honda Civic. May I know if you have any of the models currently in stock? Thank you! Warm regards Sandra";

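I want to remove Unicode characters such as "\u2028" and "\u2019" whenever they appear in the comment. At runtime I do not know which extra characters will be present, so what is the best way to handle this?
I tried removing the Unicode characters from the string like this: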

comment = comment.replaceAll("\\P{Print}", "");

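What is the best way to match any Unicode characters in the comment, remove them if they are present, and otherwise pass the comment to the target system unchanged?
Can anyone help me with this?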
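
For reference, here is a minimal sketch of what the attempted pattern does. In Java, \p{Print} is a POSIX class that covers only printable ASCII unless the UNICODE_CHARACTER_CLASS flag is set, so its negation \P{Print} matches every non-ASCII character as well as the control characters. The class name PrintClassDemo and the method stripNonPrint are illustrative names, not part of the original post.

```java
class PrintClassDemo {
    // Applies the pattern from the question: removes everything that is
    // not a printable ASCII character (space through tilde).
    static String stripNonPrint(String s) {
        return s.replaceAll("\\P{Print}", "");
    }

    public static void main(String[] args) {
        // Shortened sample containing U+2028 (line separator) and U+2019 (right quote).
        String sample = "Good morning! \u2028\u2028I\u2019m here";
        System.out.println(stripNonPrint(sample)); // prints "Good morning! Im here"
    }
}
```

The line separators are removed as intended, but the typographic apostrophe is removed too, which is why "I’m" ends up as "Im".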

Java regex unicode non-ascii-characters

Source: https://stackoverflow.com/questions/64887216/to-remove-unicode-character-from-string-in-java-using-regex

2 Answers


qaxu7uf2 #1

You can do this in the following sequence:

public class Main {
    public static void main(final String[] args) {
        String comment = "Good morning! \u2028\u2028I am looking to purchase a new Honda car as I\u2019m outgrowing my current car. I currently drive a Hyundai Accent and I was looking for something a little bit larger and more comfortable like the Honda Civic. May I know if you have any of the models currently in stock? Thank you! Warm regards Sandra";

        // Remove all non-ASCII characters (this drops both U+2028 and U+2019)
        comment = comment.replaceAll("[^\\x00-\\x7F]", "");

        // Remove the ASCII control characters, keeping CR, LF and TAB
        comment = comment.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");

        // Remove the Unicode "Other" category (control, format, unassigned, ...).
        // Note: \p{C} also matches CR, LF and TAB, so the ones kept above are removed here.
        comment = comment.replaceAll("\\p{C}", "");
        System.out.println(comment);
    }
}
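
Because the first step above deletes every non-ASCII character, it also deletes visible punctuation such as the apostrophe in "I’m". A narrower variant, sketched below under the assumption that only invisible characters should go, replaces the Unicode line and paragraph separators with a space and drops control and format characters while leaving visible text alone. The class InvisibleCharStripper and the method stripInvisible are hypothetical names introduced for illustration.

```java
final class InvisibleCharStripper {
    // Removes invisible characters only; visible non-ASCII text is kept.
    static String stripInvisible(String s) {
        // Turn runs of line/paragraph separators (U+2028, U+2029) into one space.
        String out = s.replaceAll("[\\p{Zl}\\p{Zp}]+", " ");
        // Drop control (Cc) and format (Cf) characters, but keep CR, LF and TAB.
        return out.replaceAll("[\\p{Cc}\\p{Cf}&&[^\r\n\t]]", "");
    }

    public static void main(String[] args) {
        String sample = "morning!\u2028\u2028I\u2019m here";
        System.out.println(stripInvisible(sample)); // prints "morning! I’m here"
    }
}
```

Whether visible non-ASCII characters should survive depends on what the target system accepts; if it is strictly ASCII-only, the broader removal in the answer above may still be needed afterwards.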


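k2fxgqgv #2

If you use replace, you will lose some characters; for example, I'm will become Im. So the better approach is to convert the characters rather than delete them.
You can convert the Unicode text to UTF-8.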

import java.nio.charset.StandardCharsets;

byte[] byteComment = comment.getBytes(StandardCharsets.UTF_8);

String formattedComment = new String(byteComment, StandardCharsets.UTF_8);
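
Note that encoding a well-formed string to UTF-8 and decoding it again yields a string equal to the original, so the round trip on its own leaves "\u2019" in place. One way to act on the idea of converting instead of deleting is to map common typographic punctuation to its ASCII equivalent before any stripping. The sketch below is an assumption about what such a mapping could look like; the class PunctuationNormalizer, its methods, and the particular set of characters covered are illustrative and not taken from the answer.

```java
import java.nio.charset.StandardCharsets;

final class PunctuationNormalizer {
    // Maps a few typographic characters to ASCII look-alikes (illustrative, not exhaustive).
    static String toAsciiPunctuation(String s) {
        return s
            .replace('\u2018', '\'')   // left single quotation mark
            .replace('\u2019', '\'')   // right single quotation mark / apostrophe
            .replace('\u201C', '"')    // left double quotation mark
            .replace('\u201D', '"')    // right double quotation mark
            .replace('\u2013', '-')    // en dash
            .replace('\u2014', '-');   // em dash
    }

    // Shows that a UTF-8 encode/decode round trip does not alter the text.
    static boolean roundTripUnchanged(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8).equals(s);
    }

    public static void main(String[] args) {
        System.out.println(toAsciiPunctuation("I\u2019m outgrowing my car")); // prints "I'm outgrowing my car"
        System.out.println(roundTripUnchanged("I\u2019m"));                   // prints "true"
    }
}
```

After this mapping, a removal of the remaining non-ASCII characters no longer turns "I’m" into "Im".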
