Java: Reading UTF-8 - BOM marker

Posted by 5tmbdcev on 2023-05-05 in Java


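I'm reading a file through a FileReader. The file is UTF-8 encoded (with BOM). My problem: I read the file and output a string, but unfortunately the BOM marker is output too. Why does this happen?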

fr = new FileReader(file);
br = new BufferedReader(fr);
String tmp = null;
while ((tmp = br.readLine()) != null) {
    String text;
    text = new String(tmp.getBytes(), "UTF-8");
    content += text + System.getProperty("line.separator");
}

Output after the first line:

?<style>

Java

Source: https://stackoverflow.com/questions/4897876/reading-utf-8-bom-marker

9 Answers


w51jfk4q1#

Locked. There are disputes about this answer's content being resolved. It is not currently accepting new interactions.

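In Java, you have to consume the UTF-8 BOM manually if one is present. This behaviour is documented in the Java bug database, here and here. There will be no fix for now because it would break existing tools such as JavaDoc or XML parsers. Apache Commons IO provides a BOMInputStream to handle this situation.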
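For readers who want to see what "consuming the BOM manually" can look like without a third-party library, here is a minimal sketch using only the JDK. The class name Utf8BomSkipper and its methods are hypothetical names chosen for this illustration. The sketch assumes the input is UTF-8 and handles only the UTF-8 BOM (bytes EF BB BF); other encodings are not detected.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any library.
final class Utf8BomSkipper {

    // Returns a UTF-8 reader over the stream, consuming a leading EF BB BF if present.
    static Reader openUtf8(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or EOF.
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            // No BOM: push the bytes back so the reader sees them again.
            pushback.unread(head, 0, n);
        }
        return new InputStreamReader(pushback, StandardCharsets.UTF_8);
    }

    // Demo helper: decodes the bytes and returns the first line (null if empty).
    static String firstLine(byte[] data) {
        try (BufferedReader br = new BufferedReader(openUtf8(new ByteArrayInputStream(data)))) {
            return br.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = "\uFEFF<style>".getBytes(StandardCharsets.UTF_8);
        byte[] withoutBom = "<style>".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstLine(withBom));
        System.out.println(firstLine(withoutBom));
    }
}
```

In real use the argument to openUtf8 would be a FileInputStream for the file in question. The charset is passed explicitly to the InputStreamReader, so the result does not depend on the platform default.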


cgh8pdjw2#

The easiest fix is probably just to remove the resulting \uFEFF from the string, since it is extremely unlikely to appear for any other reason.

tmp = tmp.replace("\uFEFF", "");

See also this Guava bug report.
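A slightly more conservative variant of this idea removes U+FEFF only when it is the very first character, so that any U+FEFF inside the text (where it historically meant "zero width no-break space") is left alone. The sketch below is only an illustration; the class name LeadingBomStripper and its methods are hypothetical. It assumes the reader already decodes the file as UTF-8, for example one obtained from Files.newBufferedReader(path, StandardCharsets.UTF_8).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical helper for illustration; not part of any library.
final class LeadingBomStripper {

    // Removes a single leading U+FEFF, if there is one.
    static String stripLeadingBom(String s) {
        if (s != null && !s.isEmpty() && s.charAt(0) == '\uFEFF') {
            return s.substring(1);
        }
        return s;
    }

    // Reads all lines, stripping the BOM from the first line only.
    static String readContent(BufferedReader br) throws IOException {
        StringBuilder content = new StringBuilder();
        boolean first = true;
        String line;
        while ((line = br.readLine()) != null) {
            if (first) {
                line = stripLeadingBom(line);
                first = false;
            }
            content.append(line).append(System.lineSeparator());
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for a UTF-8 file reader in this demo.
        try (BufferedReader br = new BufferedReader(new StringReader("\uFEFF<style>\nbody {}\n"))) {
            System.out.print(readContent(br));
        }
    }
}
```

Using a StringBuilder also avoids the repeated string concatenation in the question's loop.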


f0brbegy3#

Use the Apache Commons library.
Class: org.apache.commons.io.input.BOMInputStream
Example usage:

String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
try {
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    //use reader
} finally {
    inputStream.close();
}


whhtz7ly4#

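Here is how I use the Apache BOMInputStream, inside a try-with-resources block. The "false" argument is the include flag: it tells the stream to exclude the listed BOMs from what it returns (we use "BOM-less" text files for safety reasons, haha):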

try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new BOMInputStream(new FileInputStream(file),
            false, ByteOrderMark.UTF_8,
            ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE,
            ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)))) {
    // use br here
} catch (Exception e) {
    // handle or log the exception
}


yruzcnhs5#

Consider UnicodeReader from Google, which does all this work for you.

Charset utf8 = StandardCharsets.UTF_8;  // default if no BOM present
try (Reader r = new UnicodeReader(new FileInputStream(file), utf8.name())) {
    ....
}

Maven dependency:

<dependency>
    <groupId>com.google.gdata</groupId>
    <artifactId>core</artifactId>
    <version>1.47.1</version>
</dependency>


t2a7ltrp6#

Use Apache Commons IO.
For example, let's look at my code (used for reading a text file containing both Latin and Cyrillic characters):

String defaultEncoding = "UTF-16";
List<String> ari = new ArrayList<>();  // collects the characters of the file
InputStream inputStream = new FileInputStream(new File("/temp/1.txt"));

BOMInputStream bomInputStream = new BOMInputStream(inputStream);

ByteOrderMark bom = bomInputStream.getBOM();
String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bomInputStream), charsetName);
int data = reader.read();
while (data != -1) {
    char theChar = (char) data;
    data = reader.read();
    ari.add(Character.toString(theChar));
}
reader.close();

So we end up with an ArrayList named "ari" that contains all characters of the file "1.txt" except the BOM.


vwkv1x7d7#

If someone wants to do it with the standard library only, this would be one way:

public static String cutBOM(String value) {
    // UTF-8 BOM is EF BB BF, see https://en.wikipedia.org/wiki/Byte_order_mark
    if (value == null || value.isEmpty())
        return value;
    if (value.charAt(0) == '\uFEFF')
        // BOM decoded correctly (UTF-8, UTF-16BE or UTF-16LE): a single U+FEFF character
        return value.substring(1);
    else if (value.startsWith("\u00EF\u00BB\u00BF"))
        // UTF-8 BOM bytes decoded as ISO-8859-1 or windows-1252: three characters
        return value.substring(3);
    else
        return value;
}
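A byte-level variant of the same standard-library idea is to inspect the raw bytes before decoding, which avoids depending on how the string was decoded. The following is a sketch under stated assumptions, not a definitive implementation: the class name BomDecoder and the fallback parameter are hypothetical, the whole file is assumed to fit in memory (for example as returned by Files.readAllBytes), and UTF-32 BOMs are deliberately ignored, so a UTF-32LE file (FF FE 00 00) would be misread as UTF-16LE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any library.
final class BomDecoder {

    // Decodes the bytes, choosing the charset from a leading BOM and
    // dropping the BOM itself; uses the fallback charset if no BOM is found.
    static String decode(byte[] data, Charset fallback) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return new String(data, 3, data.length - 3, StandardCharsets.UTF_8);
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return new String(data, 2, data.length - 2, StandardCharsets.UTF_16BE);
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            return new String(data, 2, data.length - 2, StandardCharsets.UTF_16LE);
        }
        return new String(data, fallback);
    }

    // True if data begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = "\uFEFFhi".getBytes(StandardCharsets.UTF_8);
        byte[] utf16le = "\uFEFFhi".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(decode(utf8, StandardCharsets.UTF_8));
        System.out.println(decode(utf16le, StandardCharsets.UTF_8));
    }
}
```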


chy5wohz8#

As mentioned here, this is typically a problem with files on Windows.
One possible solution is to run the file through a tool like dos2unix first.


oo7oh9g99#

The simplest way I found to bypass the BOM:

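BufferedReader br = new BufferedReader(new InputStreamReader(fis));
while ((currentLine = br.readLine()) != null) {
    // Remove the UTF-8 BOM. "ï»¿" is how the bytes EF BB BF appear when
    // decoded as windows-1252 or ISO-8859-1; with a UTF-8 reader the BOM
    // is the single character \uFEFF instead.
    currentLine = currentLine.replace("ï»¿", "");
    // process currentLine here
}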
