doccano sequence labeling / named entity recognition: Unicode characters causing incorrect spans

omqzjyyz  posted 2 months ago  in Other

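How to reproduce the problem
Accented Unicode characters such as é are displayed correctly but are counted as two characters, which produces incorrect spans.
Unicode string used in the jsonlines file: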
{"text":"Département de Médecine, Université de Sherbrooke, Centre Hospitalier, 3001 12e Ave Nord, Sherbrooke, QC J1H5N4","label":[[0,114,"affiliation"],[0,25,"org"],[27,52,"org"],[54,72,"org"],[74,91,"street-address"],[93,103,"city"],[105,107,"state"],[108,114,"region-postal-code"]]}

Code used to create the jsonl file:
doccano_df.to_json('data.json',orient='records', lines=True, force_ascii=False)
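When uploading the file in the Doccano frontend, I tried selecting the auto, utf_8, and ascii encodings from the dropdown menu. My data is utf-8.
When the Unicode string is sliced in Python 3.10, the spans appear to be correct: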

  • "De\u0301partement de Me\u0301decine, Universite\u0301 de Sherbrooke, Centre Hospitalier, 3001 12e Ave Nord, Sherbrooke, QC J1H5N4"[0:25] # output: 'Département de Médecine'
  • "Département de Médecine, Université de Sherbrooke, Centre Hospitalier, 3001 12e Ave Nord, Sherbrooke, QC J1H5N4"[0:25] # output: 'Département de Médecine'
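
The slices above only line up because the text stores each é as two code points ("e" followed by U+0301, the combining acute accent). The following is a minimal sketch, using only the standard library, of how the same visible text has different code point counts depending on its normalization form. The shortened sample string and the variable names are illustrative and not part of the original report.

```python
import unicodedata

# Decomposed form (NFD-style): "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "De\u0301partement de Me\u0301decine, Universite\u0301 de Sherbrooke"
# Precomposed form (NFC): each "é" becomes the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Both strings render identically, but Python counts code points.
len_decomposed = len(decomposed)  # 52
len_composed = len(composed)      # 49

# The span [0, 25] from the question only fits the decomposed form.
slice_decomposed = decomposed[0:25]  # ends right after "Médecine"
slice_composed = composed[0:25]      # overshoots by two characters: ", "

print(len_decomposed, len_composed)
print(repr(slice_decomposed))
print(repr(slice_composed))
```

A tool that counts each visible é as one character sees 49 units here rather than 52, so every code point span that follows an accented letter is shifted.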

Your environment

  • Operating system: macOS Monterey 12.6.2 (21G320)
  • Python version used: 3.8
  • When doccano was installed: August 14, 2023
  • How doccano was installed (Heroku button, etc.): Docker image doccano/doccano:latest
  • debian:11-slim, 11.4-slim
  • python:3.8-slim

rkkpypqq  #1

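It looks like you may have enabled the "Count grapheme clusters as one character" option for the project? If so, you need to convert your spans from standard Python code point offsets to grapheme offsets before importing. I think the following should work, using the grapheme package: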

from typing import Tuple

from grapheme import graphemes


def convert_codepoint_to_grapheme_span(
    text: str, start: int, end: int
) -> Tuple[int, int]:
    """Convert Python spans indexed by codepoint count to spans
    indexed by grapheme count.

    Args:
        text: Text containing spans.
        start: Start index, indexed by codepoints.
        end: End index, indexed by codepoints.

    Returns:
        Tuple of start, end indexed by grapheme count.
    """
    start_grapheme = 0
    end_grapheme = 0
    i = 0  # codepoint offset at which the current grapheme starts
    i_grapheme = -1  # keeps the for-else branch valid when text is empty
    for i_grapheme, grapheme_unit in enumerate(graphemes(text)):
        if i == start:
            start_grapheme = i_grapheme
        if i == end:
            end_grapheme = i_grapheme
            break
        i += len(grapheme_unit)
    else:
        # The loop never hit `end`, so the span runs to the end of the text.
        end_grapheme = i_grapheme + 1
    return start_grapheme, end_grapheme
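
For a quick sanity check without the third-party package, the same idea can be roughly sketched with the standard library alone. This version is only an approximation: it treats every combining mark (nonzero `unicodedata.combining`) as part of the preceding character, which covers accented Latin text like the sample record but is not full grapheme segmentation (emoji sequences, Hangul jamo, and similar cases are not handled). The function names `approx_grapheme_offsets` and `convert_span_approx` are hypothetical, and the sketch assumes the record's text is stored in decomposed form, as the reported spans suggest.

```python
import unicodedata
from typing import List, Tuple


def approx_grapheme_offsets(text: str) -> List[int]:
    """For each code point offset 0..len(text), return the number of
    non-combining characters that precede it (an approximate grapheme index)."""
    offsets = []
    count = 0
    for ch in text:
        offsets.append(count)
        if unicodedata.combining(ch) == 0:
            count += 1  # a base character starts a new (approximate) grapheme
    offsets.append(count)  # offset equal to len(text)
    return offsets


def convert_span_approx(text: str, start: int, end: int) -> Tuple[int, int]:
    """Convert a code point span to an approximate grapheme span."""
    offsets = approx_grapheme_offsets(text)
    return offsets[start], offsets[end]


# Sample record from the question, written with explicit combining accents.
text = (
    "De\u0301partement de Me\u0301decine, Universite\u0301 de Sherbrooke, "
    "Centre Hospitalier, 3001 12e Ave Nord, Sherbrooke, QC J1H5N4"
)
labels = [[0, 114, "affiliation"], [0, 25, "org"], [27, 52, "org"], [54, 72, "org"]]

converted = [
    [*convert_span_approx(text, start, end), name] for start, end, name in labels
]
print(converted)
# [[0, 111, 'affiliation'], [0, 23, 'org'], [25, 49, 'org'], [51, 69, 'org']]
```

Each converted offset drops by one for every combining accent that precedes it, which is the shift seen when the text is counted by visible character instead of by code point.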
