Transformers pipeline for NER returns partial words with ##s

Posted by hk8txs48 on 2023-10-20 in Other


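How should I interpret the partial words prefixed with "##" that the Transformers NER pipeline returns? Other tools such as Flair and SpaCy return whole words with their tags. I have worked with the CoNLL dataset before and never noticed anything like this. Also, why is the text split this way?
Example from HuggingFace: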

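from transformers import pipeline

# Build a named-entity-recognition pipeline with the library's default model.
nlp = pipeline("ner")

# Adjacent string literals are joined as-is, so "very" needs a trailing space.
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very " \
           "close to the Manhattan Bridge which is visible from the window."

print(nlp(sequence))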

Output:

[
    {'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
    {'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
    {'word': 'Face', 'score': 0.9982671737670898, 'entity': 'I-ORG'},
    {'word': 'Inc', 'score': 0.9994403719902039, 'entity': 'I-ORG'},
    {'word': 'New', 'score': 0.9994346499443054, 'entity': 'I-LOC'},
    {'word': 'York', 'score': 0.9993270635604858, 'entity': 'I-LOC'},
    {'word': 'City', 'score': 0.9993864893913269, 'entity': 'I-LOC'},
    {'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
    {'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
    {'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
    {'word': 'Manhattan', 'score': 0.9758241176605225, 'entity': 'I-LOC'},
    {'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
]

pytorch

Source: https://stackoverflow.com/questions/61107371/transformer-pipeline-for-ner-returns-partial-words-with-s

2 Answers


#1 pod7payv

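PyTorch Transformers and BERT use WordPiece tokenization, which produces two kinds of tokens: whole words that are in the model's vocabulary, and sub-word pieces for words that are not. A word missing from the vocabulary is split into a first piece plus continuation pieces, and each continuation piece is prefixed with "##" to show that it attaches to the previous token. The NER pipeline labels tokens, not words, which is why pieces such as "Hu" and "##gging" appear separately in its output.
Suppose you have a phrase like: I like hugging animals
Split into plain words, the tokens are: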

["I", "like", "hugging", "animals"]

With sub-word splitting, the list could look like this (the exact split depends on the model's vocabulary):

["I", "like", "hug", "##ging", "animal", "##s"]

You can read more here: https://www.kaggle.com/funtowiczmo/hugging-face-tutorials-training-tokenizer
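
To make the "##" convention concrete, here is a minimal sketch of joining sub-word pieces back into whole words. The helper name merge_wordpieces is made up for illustration and is not part of the transformers library; it only assumes that a token starting with "##" continues the token before it.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens back into whole words.

    A token starting with "##" continues the previous word, so it is
    appended to that word without the marker. Any other token starts
    a new word. (Hypothetical helper, not a transformers API.)
    """
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: strip the "##" marker and glue it on.
            words[-1] += token[2:]
        else:
            words.append(token)
    return words


# Tokens taken from the pipeline output in the question.
print(merge_wordpieces(["Hu", "##gging", "Face", "D", "##UM", "##BO"]))
# ['Hugging', 'Face', 'DUMBO']
```

This only repairs the surface strings. It says nothing about which entity label or score the merged word should get.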


#2 oknwwptz

Use aggregation_strategy to group the entities:

pipeline('ner', model="YOUR_MODEL", aggregation_strategy="average")

Read more about the aggregation strategies in the Transformers token-classification pipeline documentation.
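
For intuition about what averaging over sub-word pieces means, here is a rough pure-Python sketch that works on dictionaries shaped like the output in the question. It is an approximation under stated assumptions, not the library's implementation: the helper name merge_subword_entities is hypothetical, the label of the first piece is simply kept, and words are not grouped into multi-word entities such as "New York City", which the real pipeline can do.

```python
from statistics import mean


def merge_subword_entities(entities):
    """Merge '##' pieces into the preceding token and average their scores.

    Hypothetical helper for illustration only. It keeps the entity label
    of the first piece of each word.
    """
    merged = []
    for ent in entities:
        if ent["word"].startswith("##") and merged:
            # Continuation piece: extend the previous word, remember its score.
            merged[-1]["word"] += ent["word"][2:]
            merged[-1]["scores"].append(ent["score"])
        else:
            merged.append(
                {"word": ent["word"], "entity": ent["entity"], "scores": [ent["score"]]}
            )
    # Collapse the collected per-piece scores into one mean score per word.
    return [
        {"word": m["word"], "entity": m["entity"], "score": mean(m["scores"])}
        for m in merged
    ]


# Rounded values loosely based on the "DUMBO" pieces in the question.
sample = [
    {"word": "D", "score": 0.98, "entity": "I-LOC"},
    {"word": "##UM", "score": 0.94, "entity": "I-LOC"},
    {"word": "##BO", "score": 0.90, "entity": "I-LOC"},
]
print(merge_subword_entities(sample))
```

In practice, passing aggregation_strategy to pipeline() as shown above is the simpler route, since the library then does the grouping itself.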
