Histogram

Documentation link first: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/histogram.html

Searching the web for "elasticsearch Histogram" turns up two different things:

type Histogram
aggregation Histogram

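Results about the aggregation are plentiful, while results about the type are scarce, so this post records how to use type Histogram and what to watch out for. P.S. Some points in this post are still not understood and need more investigation, so the post will keep being updated.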

Histogram field type

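A Histogram field is defined by two paired arrays, values and counts.
Keep the following in mind:

values is stored as double and must be in ascending order
counts must hold integers, each of them positive or 0
the two arrays must have the same length, because their elements correspond one to one by position
nested arrays are not supported (a document holds a single values/counts pair), and neither is sorting on the field

Histogram data is stored as binary doc values rather than being indexed, so the field is meant for aggregations. Its size in bytes is at most 13 * the length of the arrays.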
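
To make the rules above concrete, here is a minimal Python sketch that checks a values/counts pair before it is sent to Elasticsearch. It is not Elasticsearch's own validation code. The function names validate_histogram and max_size_bytes are made up for this post, and the sketch assumes that only a decreasing value is rejected (whether two equal neighbouring values are accepted is not covered here).

```python
from typing import Sequence


def validate_histogram(values: Sequence[float], counts: Sequence[int]) -> None:
    """Raise ValueError if the pair breaks one of the rules listed above."""
    if len(values) != len(counts):
        raise ValueError("values and counts must have the same length")
    # Assumption: only a decrease is rejected; equal neighbours are not checked.
    for previous, current in zip(values, values[1:]):
        if current < previous:
            raise ValueError(
                f"values must be in increasing order, got [{current}] "
                f"but previous value was [{previous}]"
            )
    for count in counts:
        # bool is a subclass of int in Python, so rule it out explicitly.
        if isinstance(count, bool) or not isinstance(count, int):
            raise ValueError(f"counts must be integers, got {count!r}")
        if count < 0:
            raise ValueError(f"counts elements must be >= 0 but got {count}")


def max_size_bytes(num_values: int) -> int:
    """Upper bound on the stored size quoted above: 13 bytes per array element."""
    return 13 * num_values


# A well-formed pair passes silently.
validate_histogram([0.1, 0.2, 0.3, 0.4, 0.5], [3, 7, 23, 12, 6])
print(max_size_bytes(5))  # 65
```

Running a check like this on the client side gives the same kind of feedback as the 400 responses in the Error example section, without a round trip to the cluster.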

Quick start

Create the mapping

PUT histogram_test
{
  "mappings" : {
    "properties" : {
      "my_histogram" : {
        "type" : "histogram"
      },
      "my_text" : {
        "type" : "keyword"
      }
    }
  }
}

Index documents

PUT histogram_test/_doc/1
{
  "my_text" : "histogram_1",
  "my_histogram" : {
      "values" : [0.1, 0.2, 0.3, 0.4, 0.5], 
      "counts" : [3, 7, 23, 12, 6] 
   }
}
PUT histogram_test/_doc/2
{
  "my_text" : "histogram_2",
  "my_histogram" : {
      "values" : [0.1, 0.2, 0.3, 0.4, 1], 
      "counts" : [3, 7, 23, 12, 6] 
   }
}

Error example

What not to do: index a document whose values are not in ascending order

PUT histogram_test/_doc/1
{
  "my_text" : "histogram_1",
  "my_histogram" : {
      "values" : [0.1, 0.2, 0.1, 0.4, 0.5], 
      "counts" : [3, 7, 23, 12, 6] 
   }
}
 
***********result************** 
{
  "error" : {
    "root_cause" : [
      {
        "type" : "mapper_parsing_exception",
        "reason" : "error parsing field [my_histogram], [values] values must be in increasing order, got [0.1] but previous value was [0.2]"
      }
    ],
    "type" : "mapper_parsing_exception",
    "reason" : "failed to parse field [my_histogram] of type [histogram]",
    "caused_by" : {
      "type" : "mapper_parsing_exception",
      "reason" : "error parsing field [my_histogram], [values] values must be in increasing order, got [0.1] but previous value was [0.2]"
    }
  },
  "status" : 400
}

What not to do: put a number below 0 in counts

PUT histogram_test/_doc/3
{
  "my_text" : "histogram_3",
  "my_histogram" : {
      "values" : [0.1, 0.2, 0.3, 0.4, 1], 
      "counts" : [3, 7, 23, 12, -6] 
   }
}
 
***********result**************
 
{
  "error" : {
    "root_cause" : [
      {
        "type" : "mapper_parsing_exception",
        "reason" : "error parsing field [my_histogram], [counts] elements must be >= 0 but got -6"
      }
    ],
    "type" : "mapper_parsing_exception",
    "reason" : "failed to parse field [my_histogram] of type [histogram]",
    "caused_by" : {
      "type" : "mapper_parsing_exception",
      "reason" : "error parsing field [my_histogram], [counts] elements must be >= 0 but got -6"
    }
  },
  "status" : 400
}

Aggregation

min aggregation
max aggregation
sum aggregation
value_count aggregation
avg aggregation
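percentiles aggregation (P.S. not understood yet, to be investigated)
percentile ranks aggregation (P.S. not understood yet, to be investigated)
boxplot aggregation (P.S. not understood yet, to be investigated)
histogram aggregation
range aggregation (P.S. not understood yet, to be investigated)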

min aggregation

Returns the smallest element of values across all matching histograms. The counts array is ignored.

GET /histogram_test/_search
{
  "aggs": {
    "min_latency": {
      "min": {
        "field": "my_histogram"
      }
    }
  }
}
**********************value********************
 
 "aggregations" : {
    "min_latency" : {
      "value" : 0.1
    }
  }

max

Returns the largest element of values across all matching histograms. The counts array is ignored.

GET /histogram_test/_search
{
  "aggs": {
    "max_histogram": {
      "max": {
        "field": "my_histogram"
      }
    }
  }
}
**********************value********************
"aggregations" : {
    "max_histogram" : {
      "value" : 1.0
    }
  }
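
The min and max results can be reproduced by hand. The following Python sketch recomputes them from the two documents indexed in Quick start. DOCS, histogram_min and histogram_max are names invented for this post, and the code only imitates the behaviour described here, it does not call Elasticsearch.

```python
# The two documents from Quick start, reduced to their histogram field.
DOCS = [
    {"values": [0.1, 0.2, 0.3, 0.4, 0.5], "counts": [3, 7, 23, 12, 6]},
    {"values": [0.1, 0.2, 0.3, 0.4, 1.0], "counts": [3, 7, 23, 12, 6]},
]


def histogram_min(docs):
    """Smallest element of any values array; counts is never looked at."""
    return min(v for doc in docs for v in doc["values"])


def histogram_max(docs):
    """Largest element of any values array; counts is never looked at."""
    return max(v for doc in docs for v in doc["values"])


print(histogram_min(DOCS))  # 0.1
print(histogram_max(DOCS))  # 1.0
```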

sum

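Multiplies each element of values by the count at the same position in counts, then adds up all the products. For the two documents above that is 16.4 + 19.4 = 35.8.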

GET /histogram_test/_search
{
  "aggs": {
    "sum_histogram": {
      "sum": {
        "field": "my_histogram"
      }
    }
  }
}
**********************value********************
"aggregations" : {
    "sum_histogram" : {
      "value" : 35.8
    }
  }
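
As a sanity check on the 35.8 above, here is a small Python sketch of the same arithmetic over the two documents from Quick start. DOCS and histogram_sum are names invented for this post. The result is rounded before printing because plain floating-point addition can leave a tiny error in the last digits.

```python
# The two documents from Quick start, reduced to their histogram field.
DOCS = [
    {"values": [0.1, 0.2, 0.3, 0.4, 0.5], "counts": [3, 7, 23, 12, 6]},
    {"values": [0.1, 0.2, 0.3, 0.4, 1.0], "counts": [3, 7, 23, 12, 6]},
]


def histogram_sum(docs):
    """Sum of value * count over every position of every histogram."""
    total = 0.0
    for doc in docs:
        for value, count in zip(doc["values"], doc["counts"]):
            total += value * count
    return total


print(round(histogram_sum(DOCS), 6))  # 35.8
```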

value_count

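Adds up every element of counts across all histograms (51 + 51 = 102 here), so the result is not the number of documents.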

GET /histogram_test/_search
{
  "aggs": {
    "count_histogram": {
      "value_count": {
        "field": "my_histogram"
      }
    }
  }
}
**********************value********************
  "aggregations" : {
    "count_histogram" : {
      "value" : 102
    }
  }

avg

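Multiplies each number in the values array by its associated count in the counts array, then computes the average of those values over all histograms, weighted by the counts. It can be read as sum / value_count, which is 35.8 / 102 here.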

GET /histogram_test/_search
{
  "aggs": {
    "avg_histogram": {
      "avg": {
        "field": "my_histogram"
      }
    }
  }
}
**********************value********************
"aggregations" : {
    "avg_histogram" : {
      "value" : 0.3509803921568627
    }
  }
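
The "sum divided by count" reading can be checked with a few lines of Python over the two documents from Quick start. DOCS and histogram_avg are names invented for this post, and this is only a sketch of the arithmetic, not of how Elasticsearch computes it internally.

```python
# The two documents from Quick start, reduced to their histogram field.
DOCS = [
    {"values": [0.1, 0.2, 0.3, 0.4, 0.5], "counts": [3, 7, 23, 12, 6]},
    {"values": [0.1, 0.2, 0.3, 0.4, 1.0], "counts": [3, 7, 23, 12, 6]},
]


def histogram_avg(docs):
    """Count-weighted average: sum(value * count) / sum(count)."""
    weighted_sum = 0.0  # what the sum aggregation reports
    total_count = 0     # what the value_count aggregation reports
    for doc in docs:
        for value, count in zip(doc["values"], doc["counts"]):
            weighted_sum += value * count
            total_count += count
    return weighted_sum / total_count


print(round(histogram_avg(DOCS), 6))  # 0.35098
```

Note that this is not the plain mean of the ten elements in values: a value with a large count pulls the average towards itself.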

histogram aggregation

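Assigns each element of values to a bucket and adds the matching element of counts to that bucket, so doc_count here is a sum of counts, not a number of documents.
interval is the width of each bucket.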

GET /histogram_test/_search
{
  "aggs": {
    "histogram_histogram": {
      "histogram": {
        "field": "my_histogram",
        "interval": 0.5
      }
    }
  }
}
**********************value********************
"aggregations" : {
    "histogram_histogram" : {
      "buckets" : [
        {
          "key" : 0.0,
          "doc_count" : 90
        },
        {
          "key" : 0.5,
          "doc_count" : 6
        },
        {
          "key" : 1.0,
          "doc_count" : 6
        }
      ]
    }
  }
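
The bucket numbers above can be reproduced with the bucket key formula from the histogram aggregation documentation, floor((value - offset) / interval) * interval + offset. The Python sketch below applies it to the two documents from Quick start. DOCS and bucket_counts are names invented for this post, and offset defaults to 0 because the request above does not set it.

```python
import math

# The two documents from Quick start, reduced to their histogram field.
DOCS = [
    {"values": [0.1, 0.2, 0.3, 0.4, 0.5], "counts": [3, 7, 23, 12, 6]},
    {"values": [0.1, 0.2, 0.3, 0.4, 1.0], "counts": [3, 7, 23, 12, 6]},
]


def bucket_counts(docs, interval, offset=0.0):
    """Map each bucket key to the sum of counts whose value falls in it."""
    buckets = {}
    for doc in docs:
        for value, count in zip(doc["values"], doc["counts"]):
            # Round the value down to the start of its bucket.
            key = math.floor((value - offset) / interval) * interval + offset
            buckets[key] = buckets.get(key, 0) + count
    return dict(sorted(buckets.items()))


print(bucket_counts(DOCS, 0.5))  # {0.0: 90, 0.5: 6, 1.0: 6}
```

The values 0.1 to 0.4 of both documents land in bucket 0.0 (45 + 45 = 90), the 0.5 of the first document lands in bucket 0.5, and the 1 of the second document lands in bucket 1.0.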

Query

Only the queries listed here are supported.

exists query

GET /histogram_test/_search
{
  "query": {
    "exists": {
      "field": "my_histogram"
    }
  }
}

END

The parts of this post marked as to be investigated will be filled in later. Comments and discussion are welcome.

Elasticsearch Histogram field type: usage and things to watch out for