nltk Splitting sentences fails on some corner cases

Posted by gblwokeq 6 months ago in Other


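I understand how sentence splitting behaves when a sentence contains abbreviations, and that adding abbreviations can cause problems, as nicely explained in #2154. However, I have run into some corner cases that I would like to ask about. It looks like using any of the following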

e.g.
i.e.
et al.

in a sentence causes the sentence to be split incorrectly.
Example for i.e. and e.g.

>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("Even though exempli gratia and id est are both Latin "
                "(and therefore italicized), no need to put e.g. or i.e. in "
                "italics when they’re in abbreviated form.")
>>> sent_tokenize_list = sent_tokenize(sentence)
>>> sent_tokenize_list
['Even though exempli gratia and id est are both Latin (and therefore italicized), no need to put e.g.',
 'or i.e.',
 'in italics when they’re in abbreviated form.']

Example for et al.

>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("If David et al. get the financing, we can move forward "
                "with the prototype. However, this is very unlikely because "
                "they did not publish sufficiently last year.")
>>> sent_tokenize_list = sent_tokenize(sentence)
>>> sent_tokenize_list
['If David et al.',
 'get the financing, we can move forward with the prototype.',
 'However, this is very unlikely because they did not publish sufficiently last year.']

On my laptop I am using nltk.__version__ 3.4.5. I think this issue is different from #2154 because these are well-known, commonly used abbreviations (especially in academia).

nltk

Source: https://github.com/nltk/nltk/issues/2376

3 Answers


z5btuh9x1#

A quick hack, following #2154:

>>> import nltk
>>> punkt = nltk.data.load('tokenizers/punkt/english.pickle')
>>> punkt._params.abbrev_types.add('al')
>>> text = 'If David et al. get the financing, we can move forward with the prototype. However, this is very unlikely because they did not publish sufficiently last year.'
>>> punkt.tokenize(text)
['If David et al. get the financing, we can move forward with the prototype.',
 'However, this is very unlikely because they did not publish sufficiently last year.']
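
The same hack appears to extend to the other abbreviations from the question. Below is a minimal sketch, assuming Punkt stores abbreviation types lower-cased and without the final period (so `e.g.` becomes `e.g`), and that for a multi-word abbreviation such as `et al.` only the last whitespace-delimited token matters. `to_abbrev_types` and `patch_tokenizer` are hypothetical helper names, not part of nltk.

```python
def to_abbrev_types(abbreviations):
    """Normalize abbreviations into the form assumed for Punkt's abbrev_types.

    Assumption: Punkt compares the lower-cased token minus its final period,
    and tokens are whitespace-delimited, so only the last word of a
    multi-word abbreviation such as "et al." is relevant.
    """
    types = set()
    for abbr in abbreviations:
        words = abbr.split()
        if not words:
            continue  # skip empty strings
        last = words[-1].lower()
        if last.endswith("."):
            last = last[:-1]  # drop only the final period, keep inner ones
        if last:
            types.add(last)
    return types


def patch_tokenizer(abbreviations):
    """Load the pretrained English Punkt model and register extra abbreviations."""
    import nltk  # imported lazily; needs nltk plus the 'punkt' data package

    punkt = nltk.data.load('tokenizers/punkt/english.pickle')
    # _params is a private attribute, used here exactly as in the quick hack.
    punkt._params.abbrev_types.update(to_abbrev_types(abbreviations))
    return punkt


print(sorted(to_abbrev_types(["e.g.", "i.e.", "et al."])))  # ['al', 'e.g', 'i.e']
```

With the Punkt data installed, `patch_tokenizer(["e.g.", "i.e.", "et al."]).tokenize(sentence)` would be the call to try on both examples from the question. The pickle path follows the nltk version used in this thread and may differ in other releases.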

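But perhaps an improved sentence tokenizer (#2008, #1214) would be a good idea, like what we did for the word tokenizer (#2355).
For example, as a first step we could easily convert all of nltk.corpus.nonbreaking_prefixes into punkt._params.abbrev_types.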
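
One way that first step might look, as a rough sketch: it assumes the `nonbreaking_prefixes` corpus has been downloaded, that `nonbreaking_prefixes.words('en')` yields one entry per line of the prefix file, and that entries tagged `#NUMERIC_ONLY#` (prefixes that only count before digits) are better skipped. `prefixes_to_abbrev_types` and `load_prefix_abbrev_types` are hypothetical helper names, and the sample list is made up for illustration.

```python
def prefixes_to_abbrev_types(entries):
    """Convert nonbreaking-prefix entries into lower-cased abbreviation types."""
    types = set()
    for entry in entries:
        entry = entry.strip()
        if not entry or entry.startswith("#"):
            continue  # blank line or comment
        if "#NUMERIC_ONLY#" in entry:
            continue  # assumption: skip prefixes that only apply before numbers
        types.add(entry.lower().rstrip("."))
    return types


def load_prefix_abbrev_types(lang="en"):
    """Read the prefix list for `lang` from nltk and convert it."""
    from nltk.corpus import nonbreaking_prefixes  # lazy; needs the corpus data

    return prefixes_to_abbrev_types(nonbreaking_prefixes.words(lang))


# Made-up sample in the style of a nonbreaking-prefix file.
sample = ["# comment", "Mr", "Dr", "No #NUMERIC_ONLY#", "e.g", ""]
print(sorted(prefixes_to_abbrev_types(sample)))  # ['dr', 'e.g', 'mr']
```

The result could then be merged into a loaded tokenizer with `punkt._params.abbrev_types.update(load_prefix_abbrev_types())`. Whether this over-suppresses real sentence breaks would need checking on real text.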

Answered 6 months ago

xam8gpfp2#

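I was about to raise a similar issue, but I see you have already covered it. If someone ends up fixing this, I would suggest reviewing a list of Latin abbreviations. (The link given in the original post was not preserved.)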

Answered 6 months ago

xxe27gdn3#

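Technically, these abbreviations stand for multi-word expressions (MWEs) in their fully written-out forms, right? I mean, OK, technically they also represent fixed phrase templates, but that does not change their MWE status. So I am wondering whether #2202 could help with this (though I have a feeling the answer will be "no") 🤔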

(P.S. Fun fact: '& al.' is the legal shorthand for 'et al.', and '&c.' is the legal shorthand for 'etc.')

Answered 6 months ago