CoreNLP QuoteAnnotator returns inconsistent results? (missing mention data)

rryofs0p posted 2 months ago in Other

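I am trying to understand some low-precision results we get when running QuoteAnnotator over a news corpus. I would like to inspect and tune the sieves to improve precision, but I cannot retrieve the mention and sieve data when querying the server over HTTP. When I use the sample server Jupyter notebook, however, I do get this data.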

Results from CoreNLPClient

I ran a test in the notebook, as follows:

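import os
import time
# Assumption: CoreNLPClient is imported from the Stanza Python package
from stanza.server import CoreNLPClient

# create server
os.environ["CORENLP_HOME"] = "./corenlp"
client = CoreNLPClient(annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner', 'depparse', 'coref', 'quote'], memory='8G', endpoint='http://localhost:9001')
client.start()
time.sleep(10)  # give the server time to start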
text = """Vice President Mike Pence announced Thursday that Israel's Prime Minister and opposition leader will visit the White House next week to discuss "regional issues as well as the prospect of peace." The announcement comes as reports suggest a potential reveal of the Trump administration's Middle East peace plan could be imminent. Pence, who was in Jerusalem for a gathering of world leaders to mark the 75th anniversary of the liberation of Auschwitz, extended the invitation to Prime Minister Benjamin Netanyahu from President Donald Trump. He also announced that Blue and White chairman Benny Gantz will also attend the meeting at the White House on Tuesday. In a tweet Trump appeared to dash speculation an announcement on the peace plan may be imminent. "The United States looks forward to welcoming Prime Minister @Netanyahu & Blue & White Chairman @Gantzbe to the @WhiteHouse next week. Reports about details and timing of our closely-held peace plan are purely speculative," he tweeted. The unveiling of the plan, which is being spearheaded by Trump's senior adviser and son-in-law Jared Kushner, has been delayed amid the months-long period of turmoil in Israeli politics with the country due to hold an unprecedented third national election in less than a year in March."""
document = client.annotate(text)
document.quote[1]

I received a result that includes mention, mentionSieve, and other attributes:

...
mention: "he"
mentionBegin: 171
mentionEnd: 171
mentionType: "pronoun"
mentionSieve: "trigram QPV"
canonicalMention: "Donald Trump"
...

Results from the HTTP request

In the same notebook, querying the same server, I get JSON results over HTTP, but without the mention information:

import requests
url = 'http://localhost:9001/?properties={"annotators":"tokenize,ssplit,pos,lemma,ner,depparse,coref,quote","outputFormat":"json"}'
r = requests.post(url, data=text.encode('utf-8'))
results = r.json()
results['quotes'][1]
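
The request above embeds raw JSON directly in the query string. As a minimal sketch, the same URL can be built with percent-encoding using only the standard library. The helper name `build_corenlp_url` is hypothetical; the endpoint and annotator list are copied from the snippet above.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

def build_corenlp_url(endpoint, annotators, output_format="json"):
    """Return '<endpoint>/?properties=<percent-encoded JSON>' (hypothetical helper)."""
    props = {"annotators": ",".join(annotators), "outputFormat": output_format}
    return endpoint.rstrip("/") + "/?" + urlencode({"properties": json.dumps(props)})

url = build_corenlp_url(
    "http://localhost:9001",
    ["tokenize", "ssplit", "pos", "lemma", "ner", "depparse", "coref", "quote"],
)

# Round-trip check: decoding the query string yields the same properties.
decoded = json.loads(parse_qs(urlsplit(url).query)["properties"][0])
print(decoded)
```

The resulting `url` can be passed to `requests.post(url, data=text.encode('utf-8'))` exactly as before; this only changes how the query string is assembled, not what the server returns.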

I got the following result:

{'id': 1,
 'text': '"The United States looks forward to welcoming Prime Minister @Netanyahu & Blue & White Chairman @Gantzbe to the @WhiteHouse next week. Reports about details and timing of our closely-held peace plan are purely speculative,"',
 'beginIndex': 757,
 'endIndex': 980,
 'beginToken': 133,
 'endToken': 170,
 'beginSentence': 5,
 'endSentence': 6,
 'speaker': 'Unknown',
 'canonicalSpeaker': 'Donald Trump'}
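
To make the gap concrete, here is a small sketch that lists which mention attributes from the CoreNLPClient output are absent from a JSON quote. The attribute names are taken from the client output shown earlier; that a JSON response would use the same key names is an assumption. `quote_json` is an abbreviated copy of the result above.

```python
# Mention attributes seen in the CoreNLPClient (protobuf) output above.
# Assumption: a JSON response would expose them under the same names.
MENTION_FIELDS = [
    "mention",
    "mentionBegin",
    "mentionEnd",
    "mentionType",
    "mentionSieve",
    "canonicalMention",
]

def missing_mention_fields(quote):
    """Return the mention attributes that are not keys of the quote dict."""
    return [field for field in MENTION_FIELDS if field not in quote]

# Abbreviated copy of the JSON quote returned over HTTP.
quote_json = {
    "id": 1,
    "beginToken": 133,
    "endToken": 170,
    "speaker": "Unknown",
    "canonicalSpeaker": "Donald Trump",
}

missing = missing_mention_fields(quote_json)
print(missing)
```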

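Question

I believe I am making the same request in both cases. One clue is that the output formats may differ: json vs. serialized. But why would the server return different results depending on the requested output format? Is there a way to include the mention data in the JSON output? Do I need to hook into the Java code directly to do this? (Possibly related to #616?)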

tf7tbtn2 #1

I am running into the same problem now! Have you solved it? If so, please share how you fixed it. Thanks.

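qyzbxkaa #2

No, I never received a reply. We ended up not tuning anything based on the sieves. I see that @J38 has worked on QuoteAnnotator-related functionality before; maybe they have an idea? I assume this annotator is not actively maintained.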

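ljo96ir5 #3

I just added a change that should fix this. One thing I am not sure about is whether it is necessary to put "Unknown" in the blank fields. It seems wrong to me (what if someone is actually quoted as "Unknown"?), but I do not know how easy it is to detect missing fields on the other end. I have to imagine it is usually fairly easy, though.
If this looks useful, I can build a temporary package for you to download, since the next release is still several months away.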
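
On detecting missing fields on the receiving end, a minimal client-side sketch follows. It assumes the JSON either omits a field entirely or fills it with the "Unknown" placeholder seen in the output above; the helper name `field_or_none` is hypothetical and is not part of CoreNLP or its clients.

```python
UNKNOWN = "Unknown"  # placeholder string seen in the JSON output above

def field_or_none(quote, key, sentinel=UNKNOWN):
    """Return quote[key], or None if the key is absent or holds the sentinel."""
    value = quote.get(key)
    if value is None or value == sentinel:
        return None
    return value

quote = {"speaker": "Unknown", "canonicalSpeaker": "Donald Trump"}

speaker = field_or_none(quote, "speaker")             # placeholder value
canonical = field_or_none(quote, "canonicalSpeaker")  # real value
mention = field_or_none(quote, "mention")             # key not present

print(speaker, canonical, mention)
```

Note that a sentinel check like this cannot tell the placeholder apart from a speaker who is genuinely named "Unknown", which is the ambiguity raised above; leaving the field out avoids that, and `dict.get` handles an absent key without extra work.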

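r1zk6ea1 #4

Looks great, thank you! I do not need it right now, since the project where I was working with quotes has wrapped up, but I expect to use it again, and this will be very helpful then.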
