Encoding and Decoding: Principles and History

x33g5p2x, reposted on 2022-05-25 under "Other"

I Key points

In everyday programming we run into all kinds of encoding and decoding. So why are there so many different encodings?

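Whatever a program stores in a file on disk ends up as bytes, that is, as a byte array. Encoding therefore means converting a human-readable string into a byte array (String to byte[]); decoding is the reverse, converting a byte array back into a string (byte[] to String). Decoding only gives back the original text if it uses the same encoding that produced the bytes.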
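The String to byte[] round trip can be sketched in Java as follows. This is a minimal illustration, not code from the original article: the class name `EncodeDecodeDemo` and the helpers `encode` and `decode` are made up for the example, and the sample text is 中文abc, written with Unicode escapes so the source file itself stays ASCII.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative class and method names; only String.getBytes and new String(...) are JDK API.
class EncodeDecodeDemo {
    // Encoding: String -> byte[] with an explicitly named charset.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Decoding: byte[] -> String; the charset must match the one used for encoding.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // U+4E2D U+6587 followed by "abc"
        String text = "\u4e2d\u6587abc";
        byte[] bytes = encode(text, StandardCharsets.UTF_8);
        System.out.println("bytes: " + Arrays.toString(bytes));

        // Same charset on both sides: the text survives the round trip.
        System.out.println(decode(bytes, StandardCharsets.UTF_8).equals(text));

        // Different charset on the decoding side: the result is garbled text.
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1).equals(text));
    }
}
```

Passing the charset explicitly, instead of relying on the platform default, keeps the result the same on every machine.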

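The variety of encodings is a historical legacy. In the early days of computing, the United States was the first to define an encoding for its own language: ASCII. English needs only the 26 letters plus digits and some common symbols, so 7 bits of one byte, i.e. 128 values, are enough to represent every character. As computers spread, Western European countries needed encodings for their own languages (French, for example). ASCII can represent only 128 characters, which was clearly not enough, so a second group of encodings appeared: the ISO-8859 series (ISO-8859-1 through ISO-8859-16), of which ISO-8859-1 is the most widely used. ISO-8859-1 uses all 8 bits of a byte and can therefore represent 256 values. To avoid garbled text, ISO-8859-1 is fully compatible with ASCII: its first 128 characters are identical to ASCII, and only the upper 128 positions hold its own extensions.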
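The compatibility claim is easy to check in Java. The sketch below is only an illustration (the class `Latin1Demo` and the helper `unsignedBytes` are hypothetical names): it encodes the letter A and the French letter é (U+00E9) with both US-ASCII and ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative names; the charsets themselves come from java.nio.charset.StandardCharsets.
class Latin1Demo {
    // Encode the text and return each byte as an unsigned value (0..255) for readability.
    static int[] unsignedBytes(String text, Charset charset) {
        byte[] raw = text.getBytes(charset);
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] & 0xFF;
        }
        return out;
    }

    public static void main(String[] args) {
        // An ASCII character has the same single byte in both encodings.
        System.out.println(Arrays.toString(unsignedBytes("A", StandardCharsets.US_ASCII)));
        System.out.println(Arrays.toString(unsignedBytes("A", StandardCharsets.ISO_8859_1)));

        // U+00E9 lies in the upper half of ISO-8859-1 and has no ASCII encoding;
        // String.getBytes substitutes the charset's replacement byte ('?') for US-ASCII.
        System.out.println(Arrays.toString(unsignedBytes("\u00e9", StandardCharsets.ISO_8859_1)));
        System.out.println(Arrays.toString(unsignedBytes("\u00e9", StandardCharsets.US_ASCII)));
    }
}
```

The silent replacement with '?' is one common source of lost characters, which is why checking `CharsetEncoder.canEncode` first can be worthwhile.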

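Later, China defined GB2312, an encoding suited to Chinese. GB2312 contains 682 non-Chinese characters (Latin letters, symbols, and so on) and 6,763 commonly used simplified Chinese characters. Taiwan likewise defined an encoding for traditional Chinese, called BIG5. Both GB2312 and BIG5 are compatible with ASCII: they use one byte for the letters, digits, and common symbols of ASCII, and two bytes for each simplified (GB2312) or traditional (BIG5) Chinese character.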

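After that, to bring simplified and traditional Chinese into a single character set, China released a new encoding, GBK. GBK is an extension of GB2312 and is compatible with it. It supports both simplified and traditional Chinese, again using 1 byte for ASCII characters and 2 bytes for each Chinese character, simplified or traditional.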

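Later still, to cover Chinese, rare characters, the scripts of China's ethnic minorities, Japanese, Korean, and more in one encoding, GBK was upgraded to GB18030. GB18030 is compatible with GBK and stores a character in 1, 2, or 4 bytes.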
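The byte counts for GB2312, GBK, and GB18030 can be observed with a small Java sketch. It assumes the running JDK ships these charsets (a standard JDK does, but a trimmed-down runtime might not), and the class `GbFamilyDemo` with its helpers `encodedLength` and `canEncode` is a hypothetical name chosen for the example. The sample characters are A, the common character 中 (U+4E2D), the traditional character 體 (U+9AD4), and the rare character U+20000.

```java
import java.nio.charset.Charset;

// Illustrative names; assumes the GB2312, GBK and GB18030 charsets are installed in this JDK.
class GbFamilyDemo {
    // Number of bytes the named charset uses to encode one Unicode code point.
    static int encodedLength(int codePoint, String charsetName) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    // Whether the named charset has any encoding for the code point.
    static boolean canEncode(int codePoint, String charsetName) {
        String s = new String(Character.toChars(codePoint));
        return Charset.forName(charsetName).newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String[] charsets = {"GB2312", "GBK", "GB18030"};
        int[] samples = {0x41, 0x4E2D, 0x9AD4, 0x20000};
        for (String cs : charsets) {
            for (int cp : samples) {
                if (canEncode(cp, cs)) {
                    System.out.printf("%-8s U+%04X -> %d byte(s)%n", cs, cp, encodedLength(cp, cs));
                } else {
                    System.out.printf("%-8s U+%04X -> not representable%n", cs, cp);
                }
            }
        }
    }
}
```

The traditional character is missing from GB2312 but present in GBK, and the rare character only becomes representable in GB18030, where it takes 4 bytes.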

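Finally, to give every character in the world one unified code, the international community published Unicode, a universal character set standard. Unicode itself only assigns each character a number, called a code point (in the range U+0000 to U+10FFFF); it does not say how that number is stored as bytes, much like an interface. What is actually used are its encoding forms, such as UTF-8 and UTF-16 (much like implementing classes), each of which defines how a code point is converted to bytes. UTF-32 stores every code point in 4 bytes; UTF-16 uses 2 bytes for characters in the Basic Multilingual Plane and 4 bytes for supplementary characters; and UTF-8, the most common, uses a variable number of bytes per character. UTF-8 is also a superset of ASCII: every ASCII character has exactly the same one-byte encoding in UTF-8, while Chinese characters and other characters take 2, 3, or 4 bytes. The number of bytes UTF-8 uses for common kinds of characters is as follows.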

| Character type | Bytes per character in UTF-8 |
| --- | --- |
| English letters, digits, carriage return, common symbols | 1 |
| Common Chinese characters (those that exist in GBK) | 3 |
| Chinese characters in the very large CJK extension sets | 4 |
| Certain special symbols (for example © or é) | 2 |
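The byte counts in the table can be reproduced with a short Java sketch. The class `Utf8LengthDemo` and its helpers are hypothetical names for this example; the sample code points are A (U+0041), © (U+00A9), 中 (U+4E2D), and the rare CJK character U+20000. UTF-16 is included for comparison.

```java
import java.nio.charset.StandardCharsets;

// Illustrative names; the charsets come from java.nio.charset.StandardCharsets.
class Utf8LengthDemo {
    // Number of bytes UTF-8 needs for a single Unicode code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes UTF-16 needs for the same code point.
    // UTF_16BE is used because it does not prepend a byte order mark.
    static int utf16Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xA9, 0x4E2D, 0x20000};
        for (int cp : samples) {
            System.out.printf("U+%04X: UTF-8 %d byte(s), UTF-16 %d byte(s)%n",
                    cp, utf8Length(cp), utf16Length(cp));
        }
    }
}
```

The code point is the same in every row of the output; only the number of bytes changes with the encoding form, which is the difference between Unicode and UTF-8 or UTF-16.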

II Checking the default encoding of the current environment

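1 Code

import java.nio.charset.Charset;
import java.util.Map;
import java.util.Set;

public class CharsetTest { // class name is illustrative
    public static void test1() {
        // Default charset of the running JVM; it depends on the JDK version and platform settings.
        System.out.println("Default charset of the current environment: " + Charset.defaultCharset());
        // Look up a charset by name (case-insensitive); the result is not used further here.
        Charset.forName("utf-8");

        // Canonical charset name -> Charset, for every charset this JVM supports.
        Set<Map.Entry<String, Charset>> entries = Charset.availableCharsets().entrySet();
        System.out.println("Number of charsets supported by the current JDK: " + entries.size());
        System.out.println("All charsets supported by the current environment:");
        for (Map.Entry<String, Charset> entry : entries) {
            System.out.println("key:" + entry.getKey() + "\tvalue:" + entry.getValue());
        }
    }

    public static void main(String[] args) {
        test1();
    }
}

2 Test

Default charset of the current environment: UTF-8

Number of charsets supported by the current JDK: 170

All charsets supported by the current environment: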

key:Big5    value:Big5

key:Big5-HKSCS    value:Big5-HKSCS

key:CESU-8    value:CESU-8

key:EUC-JP    value:EUC-JP

key:EUC-KR    value:EUC-KR

key:GB18030    value:GB18030

key:GB2312    value:GB2312

key:GBK    value:GBK

key:IBM-Thai    value:IBM-Thai

key:IBM00858    value:IBM00858

key:IBM01140    value:IBM01140

key:IBM01141    value:IBM01141

key:IBM01142    value:IBM01142

key:IBM01143    value:IBM01143

key:IBM01144    value:IBM01144

key:IBM01145    value:IBM01145

key:IBM01146    value:IBM01146

key:IBM01147    value:IBM01147

key:IBM01148    value:IBM01148

key:IBM01149    value:IBM01149

key:IBM037    value:IBM037

key:IBM1026    value:IBM1026

key:IBM1047    value:IBM1047

key:IBM273    value:IBM273

key:IBM277    value:IBM277

key:IBM278    value:IBM278

key:IBM280    value:IBM280

key:IBM284    value:IBM284

key:IBM285    value:IBM285

key:IBM290    value:IBM290

key:IBM297    value:IBM297

key:IBM420    value:IBM420

key:IBM424    value:IBM424

key:IBM437    value:IBM437

key:IBM500    value:IBM500

key:IBM775    value:IBM775

key:IBM850    value:IBM850

key:IBM852    value:IBM852

key:IBM855    value:IBM855

key:IBM857    value:IBM857

key:IBM860    value:IBM860

key:IBM861    value:IBM861

key:IBM862    value:IBM862

key:IBM863    value:IBM863

key:IBM864    value:IBM864

key:IBM865    value:IBM865

key:IBM866    value:IBM866

key:IBM868    value:IBM868

key:IBM869    value:IBM869

key:IBM870    value:IBM870

key:IBM871    value:IBM871

key:IBM918    value:IBM918

key:ISO-2022-CN    value:ISO-2022-CN

key:ISO-2022-JP    value:ISO-2022-JP

key:ISO-2022-JP-2    value:ISO-2022-JP-2

key:ISO-2022-KR    value:ISO-2022-KR

key:ISO-8859-1    value:ISO-8859-1

key:ISO-8859-13    value:ISO-8859-13

key:ISO-8859-15    value:ISO-8859-15

key:ISO-8859-2    value:ISO-8859-2

key:ISO-8859-3    value:ISO-8859-3

key:ISO-8859-4    value:ISO-8859-4

key:ISO-8859-5    value:ISO-8859-5

key:ISO-8859-6    value:ISO-8859-6

key:ISO-8859-7    value:ISO-8859-7

key:ISO-8859-8    value:ISO-8859-8

key:ISO-8859-9    value:ISO-8859-9

key:JIS_X0201    value:JIS_X0201

key:JIS_X0212-1990    value:JIS_X0212-1990

key:KOI8-R    value:KOI8-R

key:KOI8-U    value:KOI8-U

key:Shift_JIS    value:Shift_JIS

key:TIS-620    value:TIS-620

key:US-ASCII    value:US-ASCII

key:UTF-16    value:UTF-16

key:UTF-16BE    value:UTF-16BE

key:UTF-16LE    value:UTF-16LE

key:UTF-32    value:UTF-32

key:UTF-32BE    value:UTF-32BE

key:UTF-32LE    value:UTF-32LE

key:UTF-8    value:UTF-8

key:windows-1250    value:windows-1250

key:windows-1251    value:windows-1251

key:windows-1252    value:windows-1252

key:windows-1253    value:windows-1253

key:windows-1254    value:windows-1254

key:windows-1255    value:windows-1255

key:windows-1256    value:windows-1256

key:windows-1257    value:windows-1257

key:windows-1258    value:windows-1258

key:windows-31j    value:windows-31j

key:x-Big5-HKSCS-2001    value:x-Big5-HKSCS-2001

key:x-Big5-Solaris    value:x-Big5-Solaris

key:x-euc-jp-linux    value:x-euc-jp-linux

key:x-EUC-TW    value:x-EUC-TW

key:x-eucJP-Open    value:x-eucJP-Open

key:x-IBM1006    value:x-IBM1006

key:x-IBM1025    value:x-IBM1025

key:x-IBM1046    value:x-IBM1046

key:x-IBM1097    value:x-IBM1097

key:x-IBM1098    value:x-IBM1098

key:x-IBM1112    value:x-IBM1112

key:x-IBM1122    value:x-IBM1122

key:x-IBM1123    value:x-IBM1123

key:x-IBM1124    value:x-IBM1124

key:x-IBM1166    value:x-IBM1166

key:x-IBM1364    value:x-IBM1364

key:x-IBM1381    value:x-IBM1381

key:x-IBM1383    value:x-IBM1383

key:x-IBM300    value:x-IBM300

key:x-IBM33722    value:x-IBM33722

key:x-IBM737    value:x-IBM737

key:x-IBM833    value:x-IBM833

key:x-IBM834    value:x-IBM834

key:x-IBM856    value:x-IBM856

key:x-IBM874    value:x-IBM874

key:x-IBM875    value:x-IBM875

key:x-IBM921    value:x-IBM921

key:x-IBM922    value:x-IBM922

key:x-IBM930    value:x-IBM930

key:x-IBM933    value:x-IBM933

key:x-IBM935    value:x-IBM935

key:x-IBM937    value:x-IBM937

key:x-IBM939    value:x-IBM939

key:x-IBM942    value:x-IBM942

key:x-IBM942C    value:x-IBM942C

key:x-IBM943    value:x-IBM943

key:x-IBM943C    value:x-IBM943C

key:x-IBM948    value:x-IBM948

key:x-IBM949    value:x-IBM949

key:x-IBM949C    value:x-IBM949C

key:x-IBM950    value:x-IBM950

key:x-IBM964    value:x-IBM964

key:x-IBM970    value:x-IBM970

key:x-ISCII91    value:x-ISCII91

key:x-ISO-2022-CN-CNS    value:x-ISO-2022-CN-CNS

key:x-ISO-2022-CN-GB    value:x-ISO-2022-CN-GB

key:x-iso-8859-11    value:x-iso-8859-11

key:x-JIS0208    value:x-JIS0208

key:x-JISAutoDetect    value:x-JISAutoDetect

key:x-Johab    value:x-Johab

key:x-MacArabic    value:x-MacArabic

key:x-MacCentralEurope    value:x-MacCentralEurope

key:x-MacCroatian    value:x-MacCroatian

key:x-MacCyrillic    value:x-MacCyrillic

key:x-MacDingbat    value:x-MacDingbat

key:x-MacGreek    value:x-MacGreek

key:x-MacHebrew    value:x-MacHebrew

key:x-MacIceland    value:x-MacIceland

key:x-MacRoman    value:x-MacRoman

key:x-MacRomania    value:x-MacRomania

key:x-MacSymbol    value:x-MacSymbol

key:x-MacThai    value:x-MacThai

key:x-MacTurkish    value:x-MacTurkish

key:x-MacUkraine    value:x-MacUkraine

key:x-MS932_0213    value:x-MS932_0213

key:x-MS950-HKSCS    value:x-MS950-HKSCS

key:x-MS950-HKSCS-XP    value:x-MS950-HKSCS-XP

key:x-mswin-936    value:x-mswin-936

key:x-PCK    value:x-PCK

key:x-SJIS_0213    value:x-SJIS_0213

key:x-UTF-16LE-BOM    value:x-UTF-16LE-BOM

key:X-UTF-32BE-BOM    value:X-UTF-32BE-BOM

key:X-UTF-32LE-BOM    value:X-UTF-32LE-BOM

key:x-windows-50220    value:x-windows-50220

key:x-windows-50221    value:x-windows-50221

key:x-windows-874    value:x-windows-874

key:x-windows-949    value:x-windows-949

key:x-windows-950    value:x-windows-950

key:x-windows-iso2022jp    value:x-windows-iso2022jp
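When only one particular encoding matters, it may be simpler to ask for it by name than to scan the whole list. The following is a small sketch under that assumption; the class `CharsetLookupDemo` and the helper `lookup` are hypothetical names, while `Charset.forName` and `Charset.isSupported` are standard JDK methods.

```java
import java.nio.charset.Charset;
import java.util.Optional;

// Illustrative names; only the java.nio.charset.Charset calls are JDK API.
class CharsetLookupDemo {
    // Returns the charset for the given name, or an empty Optional when the
    // name is illegal or the charset is not available in this JVM.
    static Optional<Charset> lookup(String name) {
        try {
            return Optional.of(Charset.forName(name));
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and UnsupportedCharsetException.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Names are case-insensitive; name() returns the canonical form.
        System.out.println(lookup("utf-8").map(Charset::name).orElse("not supported"));
        System.out.println(lookup("no-such-charset").map(Charset::name).orElse("not supported"));
        // isSupported answers the same question without creating an Optional.
        System.out.println(Charset.isSupported("GBK"));
    }
}
```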
