nltk: wrong implementation in bleu_score.SmoothingFunction

Posted by bfnvny8b 5 months ago in Other

I found this problem:

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

hyp = ['a', 'a', 'a', 'a', 'a', 'a', 'a']
ref = [['a', 'a', 'a', 'd', 'a', 'a', 'a']]
print(corpus_bleu([ref], [hyp], (0, 0, 0, 1), smoothing_function=SmoothingFunction().method4))

hyp = ['a', 'b', 'b', 'b', 'b', 'b', 'b']
ref = [['a', 'a', 'a', 'd', 'a', 'a', 'a']]
print(corpus_bleu([ref], [hyp], (0, 0, 0, 1), smoothing_function=SmoothingFunction().method4))

The output is:

0.17954959837224665
0.17954959837224665
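
As a cross-check on these two numbers, the raw clipped n-gram counts can be worked out by hand. The sketch below is plain Python and does not call NLTK. `ngrams` and `clipped_counts` are hypothetical helper names, and the sketch assumes BLEU's modified precision is the clipped match count divided by the number of hypothesis n-grams.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_counts(hyp, refs, n):
    """Return (matched, total) for order n, clipping by the max reference count."""
    hyp_counts = Counter(ngrams(hyp, n))
    max_ref = Counter()
    for ref in refs:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched, sum(hyp_counts.values())


refs = [['a', 'a', 'a', 'd', 'a', 'a', 'a']]
hyp_1 = ['a', 'a', 'a', 'a', 'a', 'a', 'a']
hyp_2 = ['a', 'b', 'b', 'b', 'b', 'b', 'b']

for n in range(1, 5):
    print(n, clipped_counts(hyp_1, refs, n), clipped_counts(hyp_2, refs, n))
```

Both hypotheses have zero matching 4-grams out of four, so with weights (0, 0, 0, 1) the score depends entirely on how the smoothing function treats a zero numerator.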

It seemed strange that their smoothed 4-gram precisions are identical, so I looked at the code at https://github.com/nltk/nltk/blob/develop/nltk/translate/bleu_score.py#L576-L590.

def method4(self, p_n, references, hypothesis, hyp_len, *args, **kwargs):
        """
        Smoothing method 4:
        Shorter translations may have inflated precision values due to having
        smaller denominators; therefore, we give them proportionally
        smaller smoothed counts. Instead of scaling to 1/(2^k), Chen and Cherry
        suggests dividing by 1/ln(len(T)), where T is the length of the translation.
        """
        for i, p_i in enumerate(p_n):
            if p_i.numerator == 0 and hyp_len != 0:
                incvnt = i + 1 * self.k / math.log(
                    hyp_len
                )  # Note that this K is different from the K from NIST.
                p_n[i] = 1 / incvnt
        return p_n

p_n[i] is a Fraction object, but as soon as p_n[i].numerator == 0 it is replaced by a float constant, p_n[i] = 1 / (i + self.k / math.log(hyp_len)).
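
To make the point concrete, here is a minimal standalone sketch that mirrors the quoted loop outside of NLTK. `method4_as_quoted` is a hypothetical name, and `k=5` is an assumption (it is the value that reproduces the output printed above). Precisions are built with the standard-library `Fraction`, which normalizes `0/4` to `0/1`, but the zero-numerator check behaves the same way.

```python
import math
from fractions import Fraction


def method4_as_quoted(p_n, hyp_len, k=5):
    """Mirror of the quoted loop: a zero-match precision becomes a bare float."""
    p_n = list(p_n)  # copy so the caller's list is not modified
    for i, p_i in enumerate(p_n):
        if p_i.numerator == 0 and hyp_len != 0:
            incvnt = i + 1 * k / math.log(hyp_len)
            p_n[i] = 1 / incvnt  # the original denominator is discarded here
    return p_n


# Precisions for the two hypotheses from the post (orders 1 to 4).
first = method4_as_quoted(
    [Fraction(6, 7), Fraction(4, 6), Fraction(2, 5), Fraction(0, 4)], hyp_len=7
)
second = method4_as_quoted(
    [Fraction(1, 7), Fraction(0, 6), Fraction(0, 5), Fraction(0, 4)], hyp_len=7
)
# A made-up case with a much larger 4-gram total gives the same value.
third = method4_as_quoted([Fraction(0, 400)] * 4, hyp_len=7)

constant = 1 / (3 + 5 / math.log(7))
print(first[3], second[3], third[3], constant)
```

All four printed values are the same number, roughly 0.1795, which matches the scores reported above: the result depends only on the n-gram order and the hypothesis length.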
I found the original paper, which says:

The numerator (more precisely, the matched n-gram count) should be 1 / (i + self.k / math.log(hyp_len)), but the denominator should stay unchanged!
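
A minimal sketch of that reading, for illustration only: it follows the formula as stated in this post and is not the fix that was later merged into NLTK. `smooth_zero_match` and its parameters are hypothetical names, and `k=5` is again an assumption.

```python
import math


def smooth_zero_match(total_ngrams, order_index, hyp_len, k=5):
    """Smoothed precision for an order with zero matches.

    Only the matched count is replaced; the n-gram total stays the denominator.
    order_index is zero-based, like i in the quoted loop.
    """
    if hyp_len <= 1:
        raise ValueError("hyp_len must be greater than 1 so that log(hyp_len) > 0")
    smoothed_matches = 1 / (order_index + k / math.log(hyp_len))
    return smoothed_matches / total_ngrams


few = smooth_zero_match(total_ngrams=4, order_index=3, hyp_len=7)
many = smooth_zero_match(total_ngrams=40, order_index=3, hyp_len=7)
print(few, many)
```

With the denominator kept, an order with 40 candidate n-grams and no match scores ten times lower than one with 4 candidates, instead of both collapsing to the same constant.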

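I skimmed the other smoothing functions, and methods 4-7 all seem to have the same problem... I think this should be fixed right away, because I have already found papers whose results are wrong because of nltk...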

Source: https://github.com/nltk/nltk/issues/2341

6 answers

jk9hmnmh1#

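Thanks for raising this issue. Yes, the denominator scaling is missing. Thanks for catching that!
By the way, this smoothing method, as described in Chen and Cherry (2014), is quite unusual, and as far as we know only NLTK implements it. If it is convenient, could you tell us which papers report BLEU numbers based on Chen and Cherry (2014)?
Did they use it for segment-level BLEU and/or for machine translation tasks?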

For reference, from http://acl2014.org/acl2014/W14-33/pdf/W14-3346.pdf, method 3 is:

Method 4 is:

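By the way, judging from https://github.com/nltk/nltk/blob/develop/nltk/translate/bleu_score.py#L572, the implementation of method 3 should be correct, because the denominator is already taken into account when p_n[i] is computed.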

8iwquhpp2#

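I just found a paper whose authors used method 4 to evaluate dialogue models. I don't know whether the paper is correct, but I ran into problems reproducing its results. Suspiciously, the BLEU-4 value is larger than the BLEU-2 value.

They use sentence_bleu, and not on an MT task.

https://github.com/ricsinaruto/dialog-eval/blob/29c424fd7b9ad566cb6c65425dfe974f043ac98d/metrics/bleu_metrics.py#L23-L47

Searching GitHub turns up a lot of paper code that uses methods 4-7. By the way, this may be unrelated: in older versions of nltk, when no n-gram matched, method0 gave a warning and returned 1 (it now returns 0). That did cause problems for https://github.com/FudanNLP/Irl_gen and its experimental results. I hope other users notice this bug, but some research is still being done with old versions of nltk...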

2cmtqfgy3#

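Thanks @hzhwcmhf, #2342 resolves this issue. Yes, I think there are a lot of papers in the wild that "misuse" BLEU, with and without smoothing =)
Thanks again for raising this!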

wqlqzqxt4#

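I have looked at your PR, but I don't think the problem in method5 has been fixed yet.
It should average the numerators, not the probabilities.
There is also a problem in method7: method5 probably needs a Fraction, but method4 returns a float.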
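
For illustration, averaging the numerators could look like the sketch below. It assumes the recurrence from Chen and Cherry (2014) as I understand it, m'_0 = m_1 + 1 and m'_n = (m'_{n-1} + m_n + m_{n+1}) / 3, applied to matched counts rather than to precisions. `method5_on_counts` is a hypothetical name and this is not NLTK's code.

```python
from fractions import Fraction


def method5_on_counts(matches, totals):
    """Average matched counts across neighbouring orders, then divide by totals.

    matches: matched counts m_1 .. m_{N+1} (one more order than totals).
    totals:  hypothesis n-gram totals for orders 1 .. N.
    """
    if len(matches) != len(totals) + 1:
        raise ValueError("matches needs exactly one more entry than totals")
    previous = Fraction(matches[0] + 1)  # assumed seed: m'_0 = m_1 + 1
    smoothed = []
    for n in range(len(totals)):
        previous = (previous + matches[n] + matches[n + 1]) / 3
        smoothed.append(previous)
    precisions = [m / t for m, t in zip(smoothed, totals)]
    return smoothed, precisions


# Counts for the all-'a' hypothesis from the original post (orders 1 to 5).
smoothed, precisions = method5_on_counts([6, 4, 2, 0, 0], [7, 6, 5, 4])
print(smoothed)
print(precisions)
```

Keeping everything as `Fraction` also avoids the float-versus-Fraction mismatch mentioned for method7, since each smoothed precision still carries an exact numerator and denominator.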