CoreNLP: invertible annotations are incorrect for HTML tags

Posted by brvekthn 2 months ago in Other

I'm not sure whether this is the intended behavior, but it looks odd to me.
Using the basic option set
-annotators tokenize,cleanxml,ssplit,pos,lemma
to parse the sentence
This is a <b>test</b> sentence.
the output is

{
  "sentences": [
    {
      "index": 0,
      "tokens": [
        {
          "index": 1,
          "word": "This",
          "originalText": "This",
          "lemma": "this",
          "characterOffsetBegin": 0,
          "characterOffsetEnd": 4,
          "pos": "DT",
          "before": "",
          "after": " "
        },
        {
          "index": 2,
          "word": "is",
          "originalText": "is",
          "lemma": "be",
          "characterOffsetBegin": 5,
          "characterOffsetEnd": 7,
          "pos": "VBZ",
          "before": " ",
          "after": " "
        },
        {
          "index": 3,
          "word": "a",
          "originalText": "a",
          "lemma": "a",
          "characterOffsetBegin": 8,
          "characterOffsetEnd": 9,
          "pos": "DT",
          "before": " ",
          "after": "  <b>"
        },
        {
          "index": 4,
          "word": "test",
          "originalText": "test",
          "lemma": "test",
          "characterOffsetBegin": 13,
          "characterOffsetEnd": 17,
          "pos": "NN",
          "before": " <b>",
          "after": "</b>"
        },
        {
          "index": 5,
          "word": "sentence",
          "originalText": "sentence",
          "lemma": "sentence",
          "characterOffsetBegin": 22,
          "characterOffsetEnd": 30,
          "pos": "NN",
          "before": "</b> ",
          "after": ""
        },
        {
          "index": 6,
          "word": ".",
          "originalText": ".",
          "lemma": ".",
          "characterOffsetBegin": 30,
          "characterOffsetEnd": 31,
          "pos": ".",
          "before": "",
          "after": ""
        }
      ]
    }
  ],
  "sections": [
  ]
}

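For index #3, the "after" value is "  <b>" (two spaces before the tag). The previous token ends at character offset 9 and the next one begins at 13, which means "after" should be 4 characters, not 5.
Similarly, for index #5, the "before" value should be 5 characters, not 4, to match the character offsets.
Tested with version 4.3.1.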
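
The offset arithmetic above can be checked mechanically. The following is a minimal sketch, not part of CoreNLP: the helper name check_gaps is hypothetical, and it assumes that the text between two adjacent tokens should be exactly as long as the gap between their character offsets. The token list contains only the relevant fields, copied from the JSON output above.

```python
def check_gaps(tokens):
    """Compare each token's "after" and the next token's "before"
    with the gap implied by the character offsets."""
    problems = []
    for cur, nxt in zip(tokens, tokens[1:]):
        gap = nxt["characterOffsetBegin"] - cur["characterOffsetEnd"]
        if len(cur["after"]) != gap:
            problems.append((cur["index"], "after", len(cur["after"]), gap))
        if len(nxt["before"]) != gap:
            problems.append((nxt["index"], "before", len(nxt["before"]), gap))
    return problems


# Only the fields needed for the check, copied from the reported output.
tokens = [
    {"index": 1, "characterOffsetBegin": 0, "characterOffsetEnd": 4, "before": "", "after": " "},
    {"index": 2, "characterOffsetBegin": 5, "characterOffsetEnd": 7, "before": " ", "after": " "},
    {"index": 3, "characterOffsetBegin": 8, "characterOffsetEnd": 9, "before": " ", "after": "  <b>"},
    {"index": 4, "characterOffsetBegin": 13, "characterOffsetEnd": 17, "before": " <b>", "after": "</b>"},
    {"index": 5, "characterOffsetBegin": 22, "characterOffsetEnd": 30, "before": "</b> ", "after": ""},
    {"index": 6, "characterOffsetBegin": 30, "characterOffsetEnd": 31, "before": "", "after": ""},
]

problems = check_gaps(tokens)
for index, field, actual, expected in problems:
    print(f"token #{index}: {field} has {actual} characters, offsets imply {expected}")
```

On the output above, this check flags only "after" values; every "before" value agrees with the offsets.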

wb1gzix0  #1

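I actually disagree about #5: the "before" text is "</b> ", 5 characters, which is exactly what sits between "test" and "sentence".
However, the AfterAnnotation is definitely wrong. I just submitted a PR to fix it, and I'll merge it if the tests pass.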

frebpwbc  #2

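I pointed at the wrong index. In index #5, "</b> " is the correct "before", but in index #4 the "after" is wrong: it is "</b>", without the trailing space.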
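
One way to see which field breaks invertibility is to rebuild the input from the reported tokens in two ways. This is a rough sketch under the assumption that "invertible" means the original string can be recovered by concatenating token text with the surrounding whitespace fields; the function names are hypothetical and the triples are copied from the JSON output in the question.

```python
ORIGINAL = "This is a <b>test</b> sentence."

# (originalText, before, after) triples from the reported output.
TOKENS = [
    ("This", "", " "),
    ("is", " ", " "),
    ("a", " ", "  <b>"),
    ("test", " <b>", "</b>"),
    ("sentence", "</b> ", ""),
    (".", "", ""),
]


def rebuild_from_after(tokens):
    """First token's "before", then each token followed by its "after"."""
    first_before = tokens[0][1]
    return first_before + "".join(text + after for text, _, after in tokens)


def rebuild_from_before(tokens):
    """Each token preceded by its "before", then the last token's "after"."""
    last_after = tokens[-1][2]
    return "".join(before + text for text, before, _ in tokens) + last_after


print(repr(rebuild_from_before(TOKENS)))
print(repr(rebuild_from_after(TOKENS)))
```

With these values, the "before" reconstruction reproduces the input, while the "after" reconstruction gains a space before the opening tag and loses the one after the closing tag.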

rt4zxlrg  #4

Thanks, that makes more sense. I can fix that one too. Thanks for finding these!
