Removing Unicode characters from a string in Java using regex

tvmytwxo  posted on 2021-07-11  in Java

I have an input string like the following.

String comment = "Good morning! \u2028\u2028I am looking to purchase a new Honda car as I\u2019m outgrowing my current car. I currently drive a Hyundai Accent and I was looking for something a little bit larger and more comfortable like the Honda Civic. May I know if you have any of the models currently in stock? Thank you! Warm regards Sandra";

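I want to remove Unicode characters such as "\u2028" and "\u2019" if they appear in the comment. At runtime I do not know in advance which extra characters will be present, so what is the best way to handle this?
I tried removing the Unicode characters from the given string as shown below.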

comment = comment.replaceAll("\\P{Print}", "");

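What is the best way to match the Unicode characters present in the comment, remove them if there are any, and otherwise pass the comment through to the target system unchanged?
Can anyone help me solve this?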
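
One thing worth checking in the attempt above: Java strings are immutable, so `replaceAll` returns a new string and its result has to be assigned. The sketch below assumes the default (non-Unicode) meaning of `\p{Print}`, which covers printable ASCII only. The class name `PrintableOnly`, the method `strip`, and the shortened sample text are made up for illustration.

```java
class PrintableOnly {
    // Removes every character that is not printable ASCII (0x20 to 0x7E).
    static String strip(String text) {
        // replaceAll does not modify 'text'; it returns a new String.
        return text.replaceAll("\\P{Print}", "");
    }

    public static void main(String[] args) {
        String comment = "Good morning! \u2028\u2028I am looking to purchase a new Honda car as I\u2019m outgrowing my current car.";
        String cleaned = strip(comment);
        System.out.println(cleaned);
    }
}
```

Note that this pattern also drops tabs and line breaks, and that U+2019 is removed rather than replaced, so `I’m` comes out as `Im`.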

qaxu7uf2  #1

You can do it in the following order:

public class RemoveUnicode {
    public static void main(String[] args) {
        String comment = "Good morning! \u2028\u2028I am looking to purchase a new Honda car as I\u2019m outgrowing my current car. I currently drive a Hyundai Accent and I was looking for something a little bit larger and more comfortable like the Honda Civic. May I know if you have any of the models currently in stock? Thank you! Warm regards Sandra";

        // Remove all non-ASCII characters (this covers U+2028 and U+2019).
        comment = comment.replaceAll("[^\\x00-\\x7F]", "");

        // Remove ASCII control characters, keeping \r, \n and \t.
        comment = comment.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");

        // Remove characters in the Unicode "Other" category (control, format, unassigned).
        // Note: \p{C} also matches \r, \n and \t, so this step removes them as well.
        comment = comment.replaceAll("\\p{C}", "");
        System.out.println(comment);
    }
}
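
If the target system can accept visible non-ASCII text, a narrower pattern may be enough. The following is a minimal sketch, not part of the answer above: it removes only invisible characters (the Unicode "Other" category plus the line and paragraph separators) and keeps `\r`, `\n`, `\t` and visible characters such as U+2019. The names `CommentCleaner`, `INVISIBLE` and `clean` are hypothetical.

```java
import java.util.regex.Pattern;

class CommentCleaner {
    // Matches control/format/unassigned characters (\p{C}) and the
    // line/paragraph separators (\p{Zl}, \p{Zp}), except \r, \n and \t.
    // Hypothetical pattern chosen for illustration.
    private static final Pattern INVISIBLE =
            Pattern.compile("[\\p{C}\\p{Zl}\\p{Zp}&&[^\r\n\t]]");

    // Returns the text with every invisible character removed.
    static String clean(String text) {
        return INVISIBLE.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String comment = "Good morning! \u2028\u2028I am looking to purchase a new Honda car as I\u2019m outgrowing my current car.";
        System.out.println(clean(comment));
    }
}
```

Compiling the pattern once avoids recompiling the regex for every comment, and with this pattern `I’m` keeps its apostrophe because U+2019 is punctuation rather than a control character.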

k2fxgqgv  #2

If you use replace, you will lose some characters; for example, I'm will become Im. So the best approach is to convert instead.
You can convert the Unicode to UTF-8.

// requires: import java.nio.charset.StandardCharsets;
byte[] byteComment = comment.getBytes(StandardCharsets.UTF_8);

String formattedComment = new String(byteComment, StandardCharsets.UTF_8);
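
For well-formed text, encoding to UTF-8 and decoding again yields a string equal to the original, so the characters are still there afterwards. If the goal is to keep `I'm` readable, one option is to map known typographic characters to ASCII equivalents before stripping whatever is left. This is a minimal sketch under that assumption. The class name `AsciiFolder`, the method `fold`, and the particular mappings are chosen for illustration and are not an exhaustive list.

```java
class AsciiFolder {
    // Maps a few common typographic characters to ASCII, then removes
    // any remaining non-ASCII characters. The mapping list is an
    // illustrative assumption, not a complete table.
    static String fold(String text) {
        return text
                .replace('\u2018', '\'')  // left single quote  -> apostrophe
                .replace('\u2019', '\'')  // right single quote -> apostrophe
                .replace('\u201C', '"')   // left double quote  -> straight quote
                .replace('\u201D', '"')   // right double quote -> straight quote
                .replace('\u2028', ' ')   // line separator      -> space
                .replace('\u2029', ' ')   // paragraph separator -> space
                .replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) {
        String comment = "Good morning! \u2028\u2028I am looking to purchase a new Honda car as I\u2019m outgrowing my current car.";
        System.out.println(fold(comment));
    }
}
```

With this sketch the apostrophe in `I’m` is replaced by `'` instead of being dropped, and each U+2028 becomes a single space.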
