nltk Bad solving of issue #2151

lskq00tm  posted 5 months ago  in Other

nltk/nltk/tag/mapping.py
Lines 90 to 112 in 2a5aece
    if target == "universal":
        _load_universal_map(source)
        # Added the new Russian National Corpus mappings because the
        # Russian model for nltk.pos_tag() uses it.
        _MAPPINGS["ru-rnc-new"]["universal"] = {
            "A": "ADJ",
            "A-PRO": "PRON",
            "ADV": "ADV",
            "ADV-PRO": "PRON",
            "ANUM": "ADJ",
            "CONJ": "CONJ",
            "INTJ": "X",
            "NONLEX": ".",
            "NUM": "NUM",
            "PARENTH": "PRT",
            "PART": "PRT",
            "PR": "ADP",
            "PRAEDIC": "PRT",
            "PRAEDIC-PRO": "PRON",
            "S": "NOUN",
            "S-PRO": "PRON",
            "V": "VERB",
        }
This patch from #2151 doesn't work: with source == 'ru-rnc-new', the lookup fails at
nltk/nltk/tag/mapping.py
Line 90 in 2a5aece
    if target == "universal":
with a LookupError for the file 'ru-rnc-new.map'.
So why not change 'ru-rnc-new' to ru-rnc.map, or just create ru-rnc-new.map?
P.S. This is @alvations' patch, so I'm requesting the author.

2jcobegt  #1

Thanks for raising this issue. The ru-rnc-new patch was meant to hot-plug the new mapping without breaking the existing data in nltk_data.

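I also think the better approach would be to put the new mapping in a ru-rnc-new.map file and add that file to nltk_data/taggers/universal_tagset. Either way, the code as implemented does not fail and works as intended.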

>>> from nltk.tag.mapping import tagset_mapping
>>> tagset_mapping("ru-rnc", "universal")
defaultdict(<function _load_universal_map.<locals>.<lambda> at 0x101bb8e18>, {'!': '.', 'A': 'ADJ', 'AD': 'ADV', 'C': 'CONJ', 'COMP': 'CONJ', 'IJ': 'X', 'NC': 'NUM', 'NN': 'NOUN', 'P': 'ADP', 'PTCL': 'PRT', 'V': 'VERB', 'VG': 'VERB', 'VI': 'VERB', 'VP': 'VERB', 'YES_NO_SENT': 'X', 'Z': 'X'})
>>> tagset_mapping("ru-rnc-new", "universal")
{'A': 'ADJ', 'A-PRO': 'PRON', 'ADV': 'ADV', 'ADV-PRO': 'PRON', 'ANUM': 'ADJ', 'CONJ': 'CONJ', 'INTJ': 'X', 'NONLEX': '.', 'NUM': 'NUM', 'PARENTH': 'PRT', 'PART': 'PRT', 'PR': 'ADP', 'PRAEDIC': 'PRT', 'PRAEDIC-PRO': 'PRON', 'S': 'NOUN', 'S-PRO': 'PRON', 'V': 'VERB'}

I may have misunderstood the problem, so please explain if tagset_mapping as implemented in #2151 does not behave as expected =)
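A possible explanation for why the session above succeeds while the original report fails: in the quoted snippet, _load_universal_map(source) runs before the hard-coded mapping is assigned, so the result may depend on which source is requested first. The sketch below is a simplified model of that control flow, not the actual nltk code. FILES_ON_DISK, try_mapping, and the abbreviated tag tables are assumptions made for illustration, and no nltk_data is read.

```python
from collections import defaultdict

# Hypothetical stand-in for the .map files shipped in nltk_data;
# note that there is no "ru-rnc-new" entry.
FILES_ON_DISK = {"ru-rnc": {"A": "ADJ", "NN": "NOUN", "V": "VERB"}}

_MAPPINGS = defaultdict(lambda: defaultdict(dict))


def _load_universal_map(fileid):
    # Simplified loader: fails the way a missing data file would.
    if fileid not in FILES_ON_DISK:
        raise LookupError(f"Resource {fileid}.map not found")
    _MAPPINGS[fileid]["universal"].update(FILES_ON_DISK[fileid])


def tagset_mapping(source, target):
    if source not in _MAPPINGS or target not in _MAPPINGS[source]:
        if target == "universal":
            _load_universal_map(source)  # runs first and may raise
            # Hard-coded mapping, registered only after a successful load.
            _MAPPINGS["ru-rnc-new"]["universal"] = {"S": "NOUN", "V": "VERB"}
    return _MAPPINGS[source][target]


def try_mapping(source):
    # Hypothetical helper: return the mapping or a readable error string.
    try:
        return tagset_mapping(source, "universal")
    except LookupError as err:
        return f"LookupError: {err}"


fresh_call = try_mapping("ru-rnc-new")    # nothing registered yet: load fails
try_mapping("ru-rnc")                     # side effect: registers "ru-rnc-new"
after_ru_rnc = try_mapping("ru-rnc-new")  # mapping is now cached, no load

print(fresh_call)
print(after_ru_rnc)
```

In this model, asking for "ru-rnc-new" in a fresh interpreter raises LookupError, while asking for "ru-rnc" first (as in the session above) makes the later "ru-rnc-new" call succeed.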

huwehgph  #2

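The problem is that when I try to use the universal tagset, a LookupError is raised while searching for the 'ru-rnc-new.map' file. I have downloaded everything via nltk.download and checked all the files myself, but the error persists: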

>>> tokens = nltk.word_tokenize(text)
>>> nltk.pos_tag(tokens, tagset='universal', lang = 'rus')
LookupError
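
If the hot-plug approach is kept rather than shipping a ru-rnc-new.map file, one way to avoid the missing-file lookup would be to check the in-code mappings before touching the disk. This is only a sketch of that idea on a simplified model, not a patch against nltk: FILES_ON_DISK and HARD_CODED are hypothetical names, the tag tables are abbreviated, and handling of unknown tags is left out.

```python
from collections import defaultdict

# Hypothetical stand-in for the .map files in nltk_data (no "ru-rnc-new").
FILES_ON_DISK = {"ru-rnc": {"A": "ADJ", "NN": "NOUN", "V": "VERB"}}

# Hypothetical table of mappings that live in code rather than in data files.
HARD_CODED = {"ru-rnc-new": {"A": "ADJ", "S": "NOUN", "V": "VERB"}}

_MAPPINGS = defaultdict(lambda: defaultdict(dict))


def _load_universal_map(fileid):
    # Simplified loader: fails the way a missing data file would.
    if fileid not in FILES_ON_DISK:
        raise LookupError(f"Resource {fileid}.map not found")
    _MAPPINGS[fileid]["universal"].update(FILES_ON_DISK[fileid])


def tagset_mapping(source, target):
    if source not in _MAPPINGS or target not in _MAPPINGS[source]:
        if target == "universal":
            if source in HARD_CODED:
                # Register the in-code mapping without any file lookup.
                _MAPPINGS[source]["universal"] = dict(HARD_CODED[source])
            else:
                _load_universal_map(source)
    return _MAPPINGS[source][target]


# Works on the very first call, regardless of what was requested before.
first_call = tagset_mapping("ru-rnc-new", "universal")
file_backed = tagset_mapping("ru-rnc", "universal")
print(first_call)
print(file_backed)
```

With this ordering the result no longer depends on whether another tagset was loaded earlier in the session.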

0kjbasz6  #3

It looks like the bug still exists?
