Elasticsearch ingest pipeline: how to recursively modify values in a HashMap

Posted by mv1qrgav on 2021-06-09 in ElasticSearch

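Using an ingest pipeline, I want to iterate over a HashMap and remove the underscores from every string value that contains them, while leaving underscores in the keys untouched. Some of the values are arrays, which must be iterated in turn so the same modification is applied to their elements.
In the pipeline, I use a function to traverse and modify the values of the HashMap's Collection view.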

PUT /_ingest/pipeline/samples
{
    "description": "preprocessing of samples.json",
    "processors": [
        {
            "script": {
                "tag": "remove underscore from sample_tags values",
                "source": """
                    void findReplace(Collection collection) {
                    collection.forEach(element -> {
                        if (element instanceof String) {
                            element.replace('_',' ');
                        } else {
                            findReplace(element);
                        }
                        return true;
                        })
                    }

                    Collection samples = ctx.samples;
                    samples.forEach(sample -> { //sample.sample_tags is a HashMap
                        Collection sample_tags = sample.sample_tags.values();
                        findReplace(sample_tags);
                        return true;
                    })
                """
            }
        }
    ]
}

When I simulate the pipeline, the string values come back unmodified. Where did I go wrong?

POST /_ingest/pipeline/samples/_simulate
{
    "docs": [
        {
            "_index": "samples",
            "_id": "xUSU_3UB5CXFr25x7DcC",
            "_source": {
                "samples": [
                    {
                        "sample_tags": {
                            "Entry_A": [
                                "A_hyphentated-sample",
                                "sample1"
                            ],
                            "Entry_B": "A_multiple_underscore_example",
                            "Entry_C": [
                                        "sample2",
                                        "another_example_with_underscores"
                            ],
                            "Entry_E": "last_example"
                        }
                    }
                ]
            }
        }
    ]
}

Result:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "samples",
        "_type" : "_doc",
        "_id" : "xUSU_3UB5CXFr25x7DcC",
        "_source" : {
          "samples" : [
            {
              "sample_tags" : {
                "Entry_E" : "last_example",
                "Entry_C" : [
                  "sample2",
                  "another_example_with_underscores"
                ],
                "Entry_B" : "A_multiple_underscore_example",
                "Entry_A" : [
                  "A_hyphentated-sample",
                  "sample1"
                ]
              }
            }
          ]
        },
        "_ingest" : {
          "timestamp" : "2020-12-01T17:29:52.3917165Z"
        }
      }
    }
  ]
}

uujelgoq #1

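You are on the right track, but you are working with copies of the values and never set the modified values back onto the document context ctx, which is what the pipeline eventually returns. In particular, element.replace('_',' ') returns a new string, and that result is discarded. This means you need to keep track of the current iteration index or key, for array lists as well as hash maps and everything in between, so that you can target each field's position in the deeply nested context.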
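To illustrate the point about discarded results, here is a minimal sketch in plain Java rather than Painless (Painless follows Java's String semantics, but this block is not a pipeline script). The class and method names are hypothetical and exist only for this illustration. It contrasts a loop that calls replace() and throws the result away with one that writes the result back by index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of the original question or answers.
class DiscardedReplace {
    // Mirrors the question's loop: replace() returns a new String,
    // and nothing stores it, so the list is left unchanged.
    static List<String> withoutWriteBack(List<String> values) {
        List<String> list = new ArrayList<>(values);
        list.forEach(element -> element.replace('_', ' '));
        return list;
    }

    // Stores each replaced string back at its index in the list.
    static List<String> withWriteBack(List<String> values) {
        List<String> list = new ArrayList<>(values);
        for (int i = 0; i < list.size(); i++) {
            list.set(i, list.get(i).replace('_', ' '));
        }
        return list;
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("last_example", "sample1");
        System.out.println(withoutWriteBack(tags)); // [last_example, sample1]
        System.out.println(withWriteBack(tags));    // [last example, sample1]
    }
}
```

The same reasoning applies to map values: the new string has to be stored under its key, otherwise the document keeps the old one.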
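Below is an example that handles strings and (string-only) array lists. You will need to extend it to handle hash maps (and other types), and then perhaps extract the whole process into a separate function. A Java method declares a single return type, so returning strings, lists, and maps from one function may be challenging (Painless's dynamic def type may help here)...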

PUT /_ingest/pipeline/samples
{
  "description": "preprocessing of samples.json",
  "processors": [
    {
      "script": {
        "tag": "remove underscore from sample_tags values",
        "source": """
          ArrayList samples = ctx.samples;

          for (int i = 0; i < samples.size(); i++) {
              def sample = samples.get(i).sample_tags;

              for (def entry : sample.entrySet()) {
                  def key = entry.getKey();
                  def val = entry.getValue();
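                  def replaced_val = val; // default: leave unsupported types unchanged

                  if (val instanceof String) {
                    replaced_val = val.replace('_',' ');
                  } else if (val instanceof ArrayList) {
                    // rebuild the list with underscores removed from each string
                    replaced_val = new ArrayList();
                    for (int j = 0; j < val.size(); j++) {
                        replaced_val.add(val.get(j).replace('_',' '));
                    }
                  }
                  // else if (val instanceof HashMap) {
                    // do your thing
                  // }

                  // crucial part: write the new value back into sample_tags on ctx
                  ctx.samples[i].sample_tags[key] = replaced_val;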
              }
          }
        """
      }
    }
  ]
}
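As a sketch of how the pipeline above might be extended to nested maps, the following plain Java (not Painless) example uses a single recursive function that returns Object, which sidesteps the need for several return types. The class name UnderscoreCleaner and the method clean are hypothetical names chosen for this illustration, and a Painless port would need adapting (for example, using def in place of Object and the casts). It returns cleaned copies instead of mutating in place, so the caller still has to assign the result back.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of the original answers.
class UnderscoreCleaner {
    // Returns a cleaned copy of value: strings lose their underscores,
    // lists and maps are rebuilt recursively, and map keys stay untouched.
    static Object clean(Object value) {
        if (value instanceof String) {
            return ((String) value).replace('_', ' ');
        }
        if (value instanceof List) {
            List<Object> out = new ArrayList<>();
            for (Object item : (List<?>) value) {
                out.add(clean(item));
            }
            return out;
        }
        if (value instanceof Map) {
            Map<Object, Object> out = new LinkedHashMap<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                out.put(e.getKey(), clean(e.getValue()));
            }
            return out;
        }
        return value; // numbers, booleans, null: returned unchanged
    }

    public static void main(String[] args) {
        Map<String, Object> tags = new LinkedHashMap<>();
        tags.put("Entry_A", Arrays.asList("A_hyphentated-sample", "sample1"));
        tags.put("Entry_B", "A_multiple_underscore_example");
        // The caller assigns the cleaned copy back, as the answer stresses.
        Object cleaned = clean(tags);
        System.out.println(cleaned);
    }
}
```

Returning a fresh structure keeps the recursion simple and avoids modifying a collection while iterating over it, at the cost of copying every list and map.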

k2arahey #2

Here is a modified version of your script that handles the data you provided:

PUT /_ingest/pipeline/samples
{
  "description": "preprocessing of samples.json",
  "processors": [
    {
      "script": {
        "tag": "remove underscore from sample_tags values",
        "source": """
          String replaceString(String value) {
            return value.replace('_',' ');
          }

          void findReplace(Map map) {
            map.keySet().forEach(key -> {
              if (map[key] instanceof String) {
                  map[key] = replaceString(map[key]);
              } else {
                  map[key] = map[key].stream().map(this::replaceString).collect(Collectors.toList());
              }
            });
          }

          ctx.samples.forEach(sample -> {
              findReplace(sample.sample_tags);
              return true;
          });
          """
      }
    }
  ]
}

The result is as follows:

{
      "samples" : [
        {
          "sample_tags" : {
            "Entry_E" : "last example",
            "Entry_C" : [
              "sample2",
              "another example with underscores"
            ],
            "Entry_B" : "A multiple underscore example",
            "Entry_A" : [
              "A hyphentated-sample",
              "sample1"
            ]
          }
        }
      ]
    }
