Gzip a single field of a JSON document in Ruby

pkln4tw6 posted on 2023-11-18 in Ruby

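Suppose I have a Ruby Hash that contains a very large string, large enough that compressing it probably makes sense. Compressing the string with ActiveSupport::Gzip.compress is trivial, but converting the hash to JSON turns out to be a problem because of encoding.
Basically, this code fails:
    { compressed: ActiveSupport::Gzip.compress('asdf') }.to_json
with the following error:
    JSON::GeneratorError: Invalid Unicode [8b 08 00 56 dd] at 1
A hash that contains no compressed data is encoded as UTF-8 when converted with to_json, but calling ActiveSupport::Gzip.compress('asdf').encode('UTF-8') fails with the following error:
    Encoding::UndefinedConversionError: "\x8B" from ASCII-8BIT to UTF-8
Am I doing something silly here? Can my goal be achieved?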

cuxqih21

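Base64-encode the gzip output so that it consists of readable characters and is therefore valid UTF-8. This expands the data by about a third (4 output characters for every 3 input bytes), cancelling some of the compression. You could also use a more efficient character encoding, such as Base-85, to reduce the impact; the expansion is then about a quarter (5 characters for every 4 bytes). With some work, you should be able to get it down to close to a 1/7 increase.
Here is example code in C that encodes bytes into symbols in the range 1..127, all of which are valid UTF-8. (JSON does not allow null bytes in strings.) The resulting expansion factor is about 1.145.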
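The Base64 approach could look roughly like the following in Ruby. This is only a sketch under a few assumptions: Zlib.gzip from the standard library stands in for ActiveSupport::Gzip.compress, the helper names compress_to_json and decompress_from_json are hypothetical, and the key name "compressed" is taken from the question.

```ruby
require 'json'
require 'zlib'

# Hypothetical helper: gzip the text, Base64-encode the binary result,
# and embed it in a JSON document under the "compressed" key.
def compress_to_json(text)
  gzipped = Zlib.gzip(text)          # binary (ASCII-8BIT) string
  encoded = [gzipped].pack('m0')     # Base64 without line breaks, ASCII only
  { compressed: encoded }.to_json
end

# Hypothetical helper: reverse the steps above.
def decompress_from_json(json)
  encoded = JSON.parse(json).fetch('compressed')
  gzipped = encoded.unpack1('m0')    # strict Base64 decode back to binary
  # Assumes the original text was UTF-8; gunzip returns a binary string.
  Zlib.gunzip(gzipped).force_encoding(Encoding::UTF_8)
end

big_string = 'asdf' * 10_000
json = compress_to_json(big_string)
puts "original: #{big_string.bytesize} bytes, JSON: #{json.bytesize} bytes"
puts decompress_from_json(json) == big_string
```

Because Base64 output uses only ASCII letters, digits, "+", "/" and "=", the JSON generator has nothing to escape, so the size of the field is the 4/3 expansion and nothing more.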

    #include <stddef.h>

    // Encode, reading six or seven bits from bin[0..len-1] to encode each symbol
    // to *enc, where a symbol is one byte in the range 1..127. enc must have room
    // for at least ceil((len * 4) / 3) symbols. The average number of encoded
    // symbols for random input is 1.1454 * len. The number of encoded symbols is
    // returned.
    size_t enc127(char *enc, unsigned char const *bin, size_t len) {
        unsigned buf = 0;
        int bits = 0;
        size_t i = 0, k = 0;
        for (;;) {
            if (bits < 7) {
                if (i == len)
                    break;
                buf = (buf << 8) | bin[i++];
                bits += 8;
            }
            unsigned sym = ((buf >> (bits - 7)) & 0x7f) + 1;
            if (sym > 0x7e) {
                enc[k++] = 0x7f;
                bits -= 6;
            }
            else {
                enc[k++] = sym;
                bits -= 7;
            }
        }
        if (bits)
            enc[k++] = ((buf << (7 - bits)) & 0x7f) + 1;
        return k;
    }

    // Decode, converting each symbol from enc, which must be in the range 1..127,
    // into 6 or 7 bits in the output, from which 8 bits at a time is written to
    // bin. bin must have room for at least floor((len * 7) / 8) bytes. The number
    // of decoded bytes is returned.
    size_t dec127(unsigned char *bin, char const *enc, size_t len) {
        unsigned buf = 0;
        int bits = 0;
        size_t k = 0;
        for (size_t i = 0; i < len; i++) {
            unsigned sym = enc[i];
            if (sym == 0x7f) {
                buf = (buf << 6) | 0x3f;
                bits += 6;
            }
            else {
                buf = (buf << 7) | (sym - 1);
                bits += 7;
            }
            if (bits >= 8) {
                bin[k++] = buf >> (bits - 8);
                bits -= 8;
            }
        }
        return k;
    }
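
Since the question is about Ruby, the same scheme could be ported roughly as follows. This is a sketch, not part of the original answer: the method names enc127 and dec127 simply mirror the C functions, Zlib.gzip from the standard library is assumed as a stand-in for ActiveSupport::Gzip.compress, and the 16-bit mask is added only because Ruby integers do not wrap the way a C unsigned does.

```ruby
require 'json'
require 'zlib'

# Encode a binary string into symbols 1..127, consuming six or seven
# input bits per symbol (a port of the C enc127).
def enc127(bin)
  bytes = bin.bytes
  out = []
  buf = 0
  bits = 0
  i = 0
  loop do
    if bits < 7
      break if i == bytes.length
      buf = ((buf << 8) | bytes[i]) & 0xffff # keep buf small; at most 14 bits are live
      i += 1
      bits += 8
    end
    sym = ((buf >> (bits - 7)) & 0x7f) + 1
    if sym > 0x7e
      out << 0x7f # stands for six 1 bits
      bits -= 6
    else
      out << sym
      bits -= 7
    end
  end
  # Flush the remaining 1..6 bits, padded with zeros on the right.
  out << (((buf << (7 - bits)) & 0x7f) + 1) if bits > 0
  out.pack('C*').force_encoding(Encoding::UTF_8)
end

# Decode symbols 1..127 back into the original bytes (a port of the C dec127).
def dec127(enc)
  out = []
  buf = 0
  bits = 0
  enc.each_byte do |sym|
    if sym == 0x7f
      buf = (buf << 6) | 0x3f
      bits += 6
    else
      buf = (buf << 7) | (sym - 1)
      bits += 7
    end
    buf &= 0xffff
    if bits >= 8
      out << ((buf >> (bits - 8)) & 0xff)
      bits -= 8
    end
  end
  out.pack('C*') # binary (ASCII-8BIT) string
end

gzipped = Zlib.gzip('asdf' * 10_000)
json = { compressed: enc127(gzipped) }.to_json
restored = dec127(JSON.parse(json).fetch('compressed'))
puts restored == gzipped
puts "gzip: #{gzipped.bytesize} bytes, symbols: #{enc127(gzipped).bytesize} bytes, JSON: #{json.bytesize} bytes"
```

One caveat when embedding these symbols in JSON: a JSON generator must still escape control characters (0x01..0x1F), the double quote, and the backslash, so the final document will be larger than the 1.145 factor quoted for the raw symbols. Whether that still beats Base64 depends on the data, which is why the example prints the sizes instead of assuming them.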
