nltk Splitting sentences fails on some corner cases

gblwokeq  posted 6 months ago in Other

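I understand how sentence splitting works when a sentence contains abbreviations, and how adding abbreviations can cause problems, as nicely explained in #2154. However, I have run into a few corner cases that I would like to ask about. It looks like using any of the following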

  • e.g.
  • i.e.
  • et al.

in a sentence causes the sentence to be split incorrectly.
Example for i.e. and e.g.

>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("Even though exempli gratia and id est are both Latin "
                "(and therefore italicized), no need to put e.g. or i.e. in "
                "italics when they’re in abbreviated form.")
>>> sent_tokenize_list = sent_tokenize(sentence)
>>> sent_tokenize_list
['Even though exempli gratia and id est are both Latin (and therefore italicized), no need to put e.g.',
 'or i.e.',
 'in italics when they’re in abbreviated form.']

Example for et al.

>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("If David et al. get the financing, we can move forward "
                "with the prototype. However, this is very unlikely because "
                "they did not publish sufficiently last year.")
>>> sent_tokenize_list = sent_tokenize(sentence)
>>> sent_tokenize_list
['If David et al.',
 'get the financing, we can move forward with the prototype.',
 'However, this is very unlikely because they did not publish sufficiently last year.']

On my laptop, I am using nltk.__version__ 3.4.5. I think this issue is different from #2154 because these are well-known and commonly used abbreviations (especially in academia).

z5btuh9x  #1

A quick hack, following #2154:

>>> import nltk
>>> punkt = nltk.data.load('tokenizers/punkt/english.pickle')
>>> punkt._params.abbrev_types.add('al')
>>> text = 'If David et al. get the financing, we can move forward with the prototype. However, this is very unlikely because they did not publish sufficiently last year.'
>>> punkt.tokenize(text)
['If David et al. get the financing, we can move forward with the prototype.',
 'However, this is very unlikely because they did not publish sufficiently last year.']
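The same idea can be sketched for all three abbreviations from the question without touching the private `_params` attribute of the loaded model. This is only a minimal sketch: the helper names `punkt_abbrev_types` and `build_tokenizer` are made up for illustration, and it assumes that Punkt stores abbreviation types lowercased and without the trailing period (which is why the hack above adds 'al' rather than 'et al.').

```python
def punkt_abbrev_types(abbreviations):
    """Turn written abbreviations into the form Punkt appears to expect.

    Assumption: Punkt compares per token, lowercased, without the final
    period, so 'et al.' contributes only 'al' and 'e.g.' becomes 'e.g'.
    """
    types = set()
    for abbreviation in abbreviations:
        for token in abbreviation.lower().split():
            # Only tokens that end in a period can be mistaken for a sentence end.
            if token.endswith("."):
                types.add(token[:-1])
    return types


def build_tokenizer(abbreviations):
    """Build a Punkt tokenizer that knows only the given abbreviations.

    Hypothetical helper: it starts from empty PunktParameters, so the
    abbreviations learned by the pre-trained English model are NOT included.
    """
    # Imported lazily so the pure-Python helper above works without NLTK.
    from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

    params = PunktParameters()
    params.abbrev_types.update(punkt_abbrev_types(abbreviations))
    # PunktSentenceTokenizer accepts a PunktParameters object in place of training text.
    return PunktSentenceTokenizer(params)
```

For example, `build_tokenizer(["e.g.", "i.e.", "et al."]).tokenize(text)` can be tried on the sentences from the question. Because this tokenizer starts from empty parameters, updating the loaded English model as in the hack above is probably the better choice for real text.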

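But perhaps an improved sentence tokenizer (#2008, #1214) would be a good idea, just as we did for the word tokenizer (#2355).
For example, as a first step we could easily convert all of nltk.corpus.nonbreaking_prefixes into punkt._params.abbrev_types.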
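A rough sketch of that first step might look like the following. It is not an existing NLTK feature: the helper names are hypothetical, and it assumes the prefix entries follow the Moses convention, where `#` starts a comment line and a `#NUMERIC_ONLY#` marker flags prefixes that only count as abbreviations before a number. Those numeric-only entries are skipped here, since Punkt's abbreviation set has no way to express that condition.

```python
def prefixes_to_abbrev_types(prefixes):
    """Convert nonbreaking-prefix entries into Punkt-style abbreviation types.

    Assumptions: entries may be blank, may be '#' comments, and may carry a
    '#NUMERIC_ONLY#' marker. Punkt types are lowercased, with no final period.
    """
    abbrev_types = set()
    for entry in prefixes:
        entry = entry.strip()
        if not entry or entry.startswith("#"):
            continue  # blank line or comment
        if "#NUMERIC_ONLY#" in entry:
            continue  # only an abbreviation before digits, so skip it
        abbrev_types.add(entry.lower().rstrip("."))
    return abbrev_types


def extend_punkt_with_prefixes(punkt, lang="en"):
    """Add the nonbreaking prefixes for `lang` to an already loaded Punkt model.

    Requires the 'nonbreaking_prefixes' data package (nltk.download).
    Like the quick hack, this relies on the private `_params` attribute.
    """
    # Imported lazily so the conversion helper can be used without NLTK data.
    from nltk.corpus import nonbreaking_prefixes

    punkt._params.abbrev_types.update(
        prefixes_to_abbrev_types(nonbreaking_prefixes.words(lang))
    )
    return punkt
```

One caveat: the prefix lists are fairly aggressive (the English list appears to include single letters, for instance), so adding all of them may suppress some genuine sentence breaks. It would be worth checking the result on real text before adopting it wholesale.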

xam8gpfp  #2

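I was about to raise a similar issue, but I see you already have it covered. If someone eventually tackles this, I suggest reviewing the list of Latin abbreviations here:

[link missing from the source]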

xxe27gdn  #3

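Technically, in their fully written-out forms these abbreviations represent multi-word expressions (MWEs), right? I mean, okay, technically they also represent fixed phrase templates, but that does not change their MWE status. So I am wondering whether #2202 could help with this (although I suspect the answer will be "no") 🤔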

(P.S. Fun fact: '& al.' is the legal shorthand for 'et al.', and '&c.' is the legal shorthand for 'etc.')
