NLTK: author names from quotations end up in WordNet definitions

If I run the following code:

from nltk.corpus import wordnet

for ss in wordnet.all_synsets():
    if ' - ' in ss.definition():
        print(ss, ss.definition())

I get a list of definitions like this:

Synset('abstemious.a.01') sparing in consumption of especially food and drink; - John Galsworthy
Synset('ascetic.s.02') practicing great self-denial; - William James
Synset('dead-on.s.01') accurate and to the point; ; - Peter S.Prescott
Synset('used_to.s.01') in the habit; ; ; - Henry David Thoreau
Synset('predaceous.s.02') living by or given to victimizing others for personal gain; ; - Peter S. Prescott; - W.E.Swinton
Synset('passive.a.01') lacking in energy or will; - George Meredith
Synset('resistless.s.02') offering no resistance; ; - Theodore Roosevelt
Synset('alcoholic.s.02') addicted to alcohol; - Carl Van Doren
Synset('reductive.s.01') characterized by or causing diminution or curtailment;  - R.H.Rovere
Synset('mounted.s.02') decorated with applied ornamentation; often used in combination; - F.V.W.Mason
Synset('coordinated.s.02') being dexterous in the use of more than one set of muscle movements; - Mary McCarthy
Synset('light-fingered.s.01') having nimble fingers literally or figuratively; especially for stealing or picking pockets; - Harry Hansen; - Time
Synset('bumbling.s.01') lacking physical movement skills, especially with the hands; ; ; ; - Mary H. Vorse
Synset('uninfluenced.s.01') not influenced or affected; - V.L.Parrington

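I'm concerned that these authors (e.g. - Theodore Roosevelt) probably shouldn't appear in the definitions. I think they are the authors of the last example in the ss.examples() list; because they sit outside the double quotes, they are not parsed as part of the example.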

rqdpfwrv1#

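"I think they are the authors of the last example in the ss.examples() list; because they sit outside the double quotes, they are not parsed as part of the example."

Yes, you're right. In the raw WNDB format data, the "gloss" comes right after the | character. A gloss may contain a definition and/or examples, but the format does not specify how they are delimited. Here is the first entry on your list (you may need to scroll right to see the gloss):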

00009046 00 a 01 abstemious 0 006 ^ 01299888 a 0000 = 04883942 n 0000 + 04883942 n 0101 ! 00009978 a 0101 & 00009346 a 0000 & 00009618 a 0000 | sparing in consumption of especially food and drink; "the pleasures of the table, never of much consequence to one naturally abstemious"- John Galsworthy

Here is the code that parses it:
nltk/nltk/corpus/reader/wordnet.py
Lines 1436 to 1441

# parse out the definitions and examples from the gloss
columns_str, gloss = data_file_line.strip().split("|")
definition = re.sub(r"[\"].*?[\"]", "", gloss).strip()
examples = re.findall(r'"([^"]*)"', gloss)
for example in examples:
    synset._examples.append(example)

This could certainly be improved.
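
To see why the author's name ends up in the definition, here is a minimal sketch that applies the same two regular expressions to the abstemious gloss quoted above. It only illustrates the behavior; the variable names are local to this example.

```python
import re

# Gloss of abstemious.a.01, copied from the WNDB line above
gloss = (
    ' sparing in consumption of especially food and drink; '
    '"the pleasures of the table, never of much consequence '
    'to one naturally abstemious"- John Galsworthy'
)

# Removing only the quoted span leaves the attribution behind
definition = re.sub(r"[\"].*?[\"]", "", gloss).strip()
# Only the text between the double quotes is captured as an example
examples = re.findall(r'"([^"]*)"', gloss)

print(definition)
print(examples)
```

The attribution follows the closing quote, so it survives the substitution and stays in the definition, while the example loses its author.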

dced5bon2#

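Preface
I'm going to split this post into multiple comments to better organize my thoughts. Apologies in advance for the text dump. Every section labeled "Breakdown" explains in detail the regular expression that precedes it and can be skipped for brevity. I still think they are important so that my logic and reasoning can be checked more easily. Thanks for reading. <3
This issue is really a continuation of #2265.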

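Assumptions

Although the structure of a gloss is not explicitly defined, I believe we can make some assumptions about it and define new regular expressions based on them.

  • Delimiters
      • The delimiters for gloss parts should be:
          • | (marks the start of the gloss)
          • ; (separates the parts of the gloss)
          • two spaces and a newline (marks the end of the gloss and also the end of the line)
      • All delimiters should be padded with one space on the right, unless the delimiter is the two spaces and a newline.
      • ;'s can also appear in the middle of an example/quote; when they do, they should not be treated as delimiters
      • ;'s can also appear in the middle of an additional note; when they do, they should not be treated as delimiters (optional, see "Extract additional notes from gloss" below)
  • Examples/Quotes
      • All examples should be completely wrapped in "'s, unless the example is a quote
      • Quotes are a special case of examples in which the example text should be completely wrapped in "'s, immediately followed by a dash, an optional space, and the name of the origin, like: "quoted_text"- origin_name or: "quoted_text"-origin_name
      • origin_name can be a string of any length and may contain whitespace, but should be terminated by a delimiter
  • Additional notes
      • All additional notes should be completely wrapped in ()
      • Additional parentheses should not be present inside an additional note
  • Definitions
      • Anything that matches none of the above should be a definition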

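They make an ass out of you and me

I keep saying "should" because the gloss portions of WordNet 3.0 are occasionally imperfect and cause problems for these assumptions. WordNet 3.1 fixes most of them, but my testing is far from complete. Unfortunately, the update to WordNet 3.1 seems to have been postponed indefinitely. One solution might be to manually fix the gloss portions of WordNet 3.0 and push the corrections to nltk_data, which hopefully would not raise the same problems as the WordNet 3.1 update. I'll open a new issue in nltk_data and see how far it goes.
Beyond that, I believe we can still improve the parsing that is being done, especially for quotes, although it may end up trading one set of problems for another until the data is corrected.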

dgsult0t3#

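Proposed solution:

Extract all gloss parts until only definitions remain
The key concept is that, given properly formatted input, we can remove everything that is clearly identifiable until whatever remains of the gloss must be a definition.

Delimiters and lookarounds

These pieces are common to the regular expressions below; they ensure that whatever we are trying to capture has a matching delimiter on both sides.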

delim = "; "
lookbehind = f"(?<={delim})"
lookahead = f"(?={delim})"

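Extract gloss from file line

We need to extract the gloss portion from the file line and do some setup to prepare for parsing. By splitting on | and stripping whitespace, we have already removed the beginning and ending delimiters. We add new delimiters to the beginning and end so that the lookarounds match correctly.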

columns_str, gloss = data_file_line.split("|")
gloss = f"{delim}{gloss.strip()}{delim}"

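Extract examples/quotes from gloss

Capture and extract everything in the text that is properly delimited and fits the assumed criteria for an example or quote. The captured text includes the wrapping "'s, which can be stripped from examples and kept on quotes.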

lazy_match_quotes = r'"[^"\n]+?"'
lazy_match_origin = r"(?:-.+?)*?"
extract_examples = f"{lookbehind}({lazy_match_quotes}{lazy_match_origin}){lookahead}"

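Breakdown

The first major part is "[^"\n]+?", which matches anything properly wrapped in "'s. Next comes a non-capturing group, so that the tokens inside it can be quantified as a unit without being captured separately. Inside this non-capturing group we have -.+? to match a dash and one or more characters (the origin_name). Since not all examples are quotes, the -origin_name part is optional, which is expressed by *?.

  • All quantifiers are lazy to prevent over-matching, so the capture ends at the first positive lookahead that matches.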
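
As a minimal sketch of the expression in action, the snippet below runs it against the WordNet 3.0 gloss of "bashful", which has a semicolon inside an attributed quote. The gloss is wrapped in delimiters by hand, as described in the setup step.

```python
import re

delim = "; "
lookbehind = f"(?<={delim})"
lookahead = f"(?={delim})"
lazy_match_quotes = r'"[^"\n]+?"'
lazy_match_origin = r"(?:-.+?)*?"
extract_examples = f"{lookbehind}({lazy_match_quotes}{lazy_match_origin}){lookahead}"

# Gloss of "bashful", with a delimiter added at both ends
gloss = (
    '; self-consciously timid; "I never laughed, being bashful; '
    'lowering my head, I looked at the wall"- Ezra Pound; '
)

examples = re.findall(extract_examples, gloss)
remainder = re.sub(extract_examples, "", gloss)

print(examples)         # one example, attribution included
print(repr(remainder))  # '; self-consciously timid; ; '
```

The semicolon inside the quote is not treated as a delimiter because the quoted span is consumed as a whole before the lookahead is tested. The remainder still contains a doubled delimiter, which a later step cleans up.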

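Extract additional notes from gloss (optional)

Capture and extract everything in the text that is properly delimited and fits the assumed criteria for an additional note. The captured text includes the parentheses.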

lazy_match_parens = r"\([^()\n]+?\)"
extract_notes = f"{lookbehind}({lazy_match_parens}){lookahead}"

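Breakdown

It is not very readable, because parentheses are special characters that must be escaped or placed in a character set, but the expression itself is simple. It matches everything properly wrapped in parentheses where no other parenthesis occurs between the outer pair. Some definitions begin with one matched set of parentheses and end with a different matched set, but also have content in between that lies entirely outside the parentheses, e.g. 'crocketed' on line 334 of WordNet 3.0: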

00058379 00 s 01 crocketed 0 001 & 00056002 a 0000 | (of a gable or spire) furnished with a crocket (an ornament in the form of curved or bent foliage); "a crocketed spire"

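If the negated character set in the regular expression did not include the opening and closing parentheses, this definition would be wrongly interpreted as an additional note.

  • All quantifiers are lazy to prevent over-matching, so the capture ends at the first positive lookahead that matches.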
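
The effect of excluding parentheses from the character set can be checked with a short sketch. The naive_match_parens pattern is a hypothetical variant written only for comparison, and the "forward" gloss is abbreviated to one example.

```python
import re

delim = "; "
lookbehind = f"(?<={delim})"
lookahead = f"(?={delim})"

lazy_match_parens = r"\([^()\n]+?\)"
extract_notes = f"{lookbehind}({lazy_match_parens}){lookahead}"

# Hypothetical variant that does NOT exclude parentheses, for comparison only
naive_match_parens = r"\(.+?\)"
extract_notes_naive = f"{lookbehind}({naive_match_parens}){lookahead}"

crocketed = (
    "; (of a gable or spire) furnished with a crocket "
    '(an ornament in the form of curved or bent foliage); "a crocketed spire"; '
)
forward = (
    '; at or to or toward the front; "he faced forward"; '
    "(`forrad' and `forrard' are dialectal variations); "
)

notes_crocketed = re.findall(extract_notes, crocketed)
notes_crocketed_naive = re.findall(extract_notes_naive, crocketed)
notes_forward = re.findall(extract_notes, forward)

print(notes_crocketed)        # [] : the definition is left alone
print(notes_crocketed_naive)  # the whole definition, mistaken for a note
print(notes_forward)          # the dialect note only
```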

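Optional?

The additional notes section is marked "optional" because I'm not sure how these parts of the gloss should be handled. As far as I know, the Synset class has no structure for holding additional notes. The original WNDB format documentation only mentions definitions and examples. In my opinion these parenthesized parts do not fit well in the definition and fit better with the examples.

Worth noting:

  • If we assume ;'s cannot appear in the middle of an additional note, then this section is optional and the notes can be extracted as definitions.
  • If we assume ;'s can appear in the middle of an additional note, then this section is not optional and the notes must be extracted in an additional step and appended.
  • If additional notes must not appear as part of the definition, they must be removed in an additional step.

Remove excess delimiters from gloss

Capture all sequences of excess delimiters.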

greedy_match_extra_delims = f"{lookbehind}({delim})+"

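Breakdown

Matches every case where one or more delimiters immediately follow another delimiter. The + quantifier is greedy so that as many excess delimiters as possible are captured at once.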
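
A small sketch of this step, using what is left of the WordNet 3.0 "gutsy" gloss once its three examples have been removed (each removed example leaves its trailing delimiter behind):

```python
import re

delim = "; "
lookbehind = f"(?<={delim})"
greedy_match_extra_delims = f"{lookbehind}({delim})+"

# The "gutsy" gloss after example removal: two definitions, then stray delimiters
gloss = (
    "; marked by courage and determination in the face of difficulties or danger; "
    "robust and uninhibited; ; ; ; "
)

cleaned = re.sub(greedy_match_extra_delims, "", gloss)
print(repr(cleaned))
```

The run of three extra delimiters is removed in a single match, leaving exactly one delimiter on each side of every remaining part.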

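Extract everything that remains in the gloss as definitions

Capture and extract everything in the text that is properly delimited as a definition. Since we match anything between delimiters without any further restriction, we must make sure the text is well formed (no excess delimiters, and every possible delimiter really is a delimiter).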

lazy_match_anything = r".*?"
extract_definitions = f"{lookbehind}({lazy_match_anything}){lookahead}"

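Breakdown

This is the simplest of all the regular expressions: it matches everything between delimiters.

  • All quantifiers are lazy to prevent over-matching, so the capture ends at the first positive lookahead that matches.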
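
The following sketch shows this last step on a well-formed remainder and on a malformed gloss from WordNet 3.0 ("certificated", where ": " appears where "; " was expected). Both strings are already wrapped in delimiters.

```python
import re

delim = "; "
lookbehind = f"(?<={delim})"
lookahead = f"(?={delim})"
lazy_match_anything = r".*?"
extract_definitions = f"{lookbehind}({lazy_match_anything}){lookahead}"

well_formed = (
    "; marked by courage and determination in the face of difficulties or danger; "
    "robust and uninhibited; "
)
# ": " is used before the example, so the example was never extracted
malformed = '; furnished with or authorized by a certificate: "certificated teachers"; '

defs_ok = re.findall(extract_definitions, well_formed)
defs_bad = re.findall(extract_definitions, malformed)

print(defs_ok)   # two separate definitions
print(defs_bad)  # the example is swallowed into the definition
```

Because this expression accepts anything between delimiters, it depends on the earlier steps having removed every example first.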

rta7y2nd4#

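Putting it all together

Now that we have worked through our assumptions and expressions, we can design a solution. I have prepared a sample implementation. In this example I also assume that ; cannot appear in additional notes and that additional notes should be treated as definitions, which makes extracting additional notes separately unnecessary.

Things to note:

  • Multiple definitions or examples in the output are separated by ■ to avoid confusion.

Implementation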

import re
import textwrap
from typing import List

# helper method for displaying results
def print_dict(dictionary, title, widest_key, display_width = 100):
    wrapper = textwrap.TextWrapper()
    wrap_width = display_width - 4
    bar_width = display_width - 2

    wrapper.width = wrap_width
    wrapper.subsequent_indent = " " * (widest_key + 8)

    box_line = "─" * bar_width
    title_line = "═" * bar_width
    print(f"╒{title_line}╕")
    print(f"│ {title.center(wrap_width)} │")
    print(f"├{box_line}┤")

    for key, val in dictionary.items():
        full_line = f"{key.ljust(widest_key)} : {val}"
        for line in wrapper.wrap(full_line):
            print(f"│ {line.ljust(wrap_width)} │")

    print(f"└{box_line}┘")

# set up regular expressions
delim = "; "
lookbehind = f"(?<={delim})"
lookahead = f"(?={delim})"

lazy_match_quotes = r'"[^"\n]+?"'
lazy_match_origin = r"(?:-.+?)*?"
extract_examples = f"{lookbehind}({lazy_match_quotes}{lazy_match_origin}){lookahead}"

lazy_match_parens = r"\([^()\n]+?\)"
extract_notes = f"{lookbehind}({lazy_match_parens}){lookahead}"

lazy_match_anything = r".*?"
extract_definitions = f"{lookbehind}({lazy_match_anything}){lookahead}"

greedy_match_extra_delims = f"{lookbehind}({delim})+"

replacement_delimiter = " ■ "

# test method
def test_parse(lines: List[str]):
    # initialization
    longest_lemma_name = 0
    definitions_dict = {}
    examples_dict = {}

    # iterate over lines
    for data_file_line in lines:

        # set up columns_str and gloss
        columns_str, gloss = data_file_line.split("|")
        columns_str = columns_str.strip()

        # get lemma name for displaying
        lemma_name = columns_str.strip().split(" ")[4]

        # attach beginning and ending delimiters
        gloss = f"{delim}{gloss.strip()}{delim}"

        # extract examples
        examples = re.findall(extract_examples, gloss)
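        for example in examples:
            # intended to strip the double quotes when they wrap the entire example
            # (i.e. the example is not a quote); note that rebinding `example` does
            # not modify the `examples` list, so the quotes are kept in the output
            if example[-1] == '"':
                example = example.strip('"')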
        examples_dict.update({lemma_name: replacement_delimiter.join(examples)})
        # if there were any examples, remove them from the gloss
        if len(examples) > 0:
            gloss = re.sub(extract_examples, "", gloss)

        # trim excess delimiters
        gloss = re.sub(greedy_match_extra_delims, "", gloss)

        # extract definitions
        definitions = re.findall(extract_definitions, gloss)
        definitions_dict.update({lemma_name: replacement_delimiter.join(definitions)})

        # get longest name to display cleanly
        if len(lemma_name) > longest_lemma_name:
            longest_lemma_name = len(lemma_name)

    # display results
    print_dict(definitions_dict, 'Definitions', longest_lemma_name)
    print()
    print_dict(examples_dict, 'Examples', longest_lemma_name)

Input

test_lines = [
    # improper gloss formatting, ": " used as delimiter
    '00342626 00 s 01 certificated 0 001 & 00342250 a 0000 | furnished with or authorized by a certificate: "certificated teachers"  \n',
    # improper gloss formatting, ":" used as delimiter
    '00015589 00 s 01 long 0 001 & 00013887 a 0000 | having or being more than normal or necessary:"long on brains"; "in long supply"  \n',
    # improper gloss formatting, missing closing " for example
    '00006885 00 s 03 assimilating 0 assimilative 0 assimilatory 0 003 & 00006336 a 0000 + 01540042 v 0301 + 01540042 v 0201 | capable of taking (gas, light, or liquids) into a solution; "an assimilative substance  \n',
    # semicolon in example
    '02298766 00 s 02 unacceptable 0 unaccepted 0 003 & 02298285 a 0000 ;c 06172789 n 0000 + 04793925 n 0102 | not conforming to standard usage; "the following use of `access\' was judged unacceptable by a panel of linguists; `You can access your cash at any of 300 automatic tellers\'"  \n',
    '00325281 00 a 01 cautious 0 011 ^ 00309021 a 0000 ^ 00066800 a 0000 + 07944900 n 0102 + 05615869 n 0101 + 04664058 n 0102 ! 00326436 a 0101 & 00325619 a 0000 & 00325840 a 0000 & 00325995 a 0000 & 00326202 a 0000 & 00326296 a 0000 | showing careful forethought; "reserved and cautious; never making swift decisions"; "a cautious driver"  \n',
    '00018435 00 s 01 unobjectionable 0 001 & 00017782 a 0000 | not objectionable; "the ends are unobjectionable; it\'s the means that one can\'t accept"  \n',
    '00478685 00 s 02 irritating 0 painful 0 002 & 00478015 a 0000 + 04720024 n 0201 | causing physical discomfort; "bites of black flies are more than irritating; they can be very painful"  \n',
    # example with attribution
    '00252130 00 s 01 bashful 0 002 & 00251809 a 0000 + 07508092 n 0102 | self-consciously timid; "I never laughed, being bashful; lowering my head, I looked at the wall"- Ezra Pound  \n',
    # multiple definitions
    '00266634 00 a 02 gutsy 0 plucky 0 004 + 04859816 n 0202 + 05032351 n 0103 + 04859816 n 0101 ! 00266985 a 0101 | marked by courage and determination in the face of difficulties or danger; robust and uninhibited; "you have to admire her; it was a gutsy thing to do"; "the gutsy...intensity of her musical involvement"-Judith Crist; "a gutsy red wine"  \n',
    # multiple definitions / Semicolon in example
    '00547641 00 s 04 crisp 0 curt 0 laconic 0 terse 0 003 & 00546646 a 0000 + 07088438 n 0401 + 07089276 n 0101 | brief and to the point; effectively cut short; "a crisp retort"; "a response so curt as to be almost rude"; "the laconic reply; `yes\'"; "short and terse and easy to understand"  \n',
    # multiple examples with attribution / Semicolon in quote
    '01179767 00 s 02 divine 2 godlike 0 002 & 01178974 a 0000 + 09505418 n 0102 | being or having the nature of a god; "the custom of killing the divine king upon any serious failure of his...powers"-J.G.Frazier; "the divine will"; "the divine capacity for love"; "\'Tis wise to learn; \'tis God-like to create"-J.G.Saxe  \n',
    # examples followed by additional notes
    '00074641 02 r 06 forward 0 forwards 1 frontward 0 frontwards 0 forrad 0 forrard 0 002 ;u 07155661 n 0000 ! 00074407 r 0102 | at or to or toward the front; "he faced forward"; "step forward"; "she practiced sewing backward as well as frontward on her new sewing machine"; (`forrad\' and `forrard\' are dialectal variations)  \n',
    # additional notes between definition and examples
    '01623360 00 s 03 busy 0 engaged 0 in_use(p) 0 002 & 01623187 a 0000 + 14008050 n 0101 | (of facilities such as telephones or lavatories) unavailable for use by anyone else or indicating unavailability; (`engaged\' is a British term for a busy telephone line); "her line is busy"; "receptionists\' telephones are always engaged"; "the lavatory is in use"; "kept getting a busy signal"  \n',
    # definition that begins and ends with parens but is not an additional note
    '00058379 00 s 01 crocketed 0 001 & 00056002 a 0000 | (of a gable or spire) furnished with a crocket (an ornament in the form of curved or bent foliage); "a crocketed spire"  \n',
]

# run test
test_parse(test_lines)

Output

╒══════════════════════════════════════════════════════════════════════════════════════════════════╕
│                                           Definitions                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ certificated    : furnished with or authorized by a certificate: "certificated teachers"         │
│ long            : having or being more than normal or necessary:"long on brains"                 │
│ assimilating    : capable of taking (gas, light, or liquids) into a solution ■ "an assimilative  │
│                        substance                                                                 │
│ unacceptable    : not conforming to standard usage                                               │
│ cautious        : showing careful forethought                                                    │
│ unobjectionable : not objectionable                                                              │
│ irritating      : causing physical discomfort                                                    │
│ bashful         : self-consciously timid                                                         │
│ gutsy           : marked by courage and determination in the face of difficulties or danger ■    │
│                        robust and uninhibited                                                    │
│ crisp           : brief and to the point ■ effectively cut short                                 │
│ divine          : being or having the nature of a god                                            │
│ forward         : at or to or toward the front ■ (`forrad' and `forrard' are dialectal           │
│                        variations)                                                               │
│ busy            : (of facilities such as telephones or lavatories) unavailable for use by anyone │
│                        else or indicating unavailability ■ (`engaged' is a British term for a    │
│                        busy telephone line)                                                      │
│ crocketed       : (of a gable or spire) furnished with a crocket (an ornament in the form of     │
│                        curved or bent foliage)                                                   │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘

╒══════════════════════════════════════════════════════════════════════════════════════════════════╕
│                                             Examples                                             │
├──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ certificated    :                                                                                │
│ long            : "in long supply"                                                               │
│ assimilating    :                                                                                │
│ unacceptable    : "the following use of `access' was judged unacceptable by a panel of           │
│                        linguists; `You can access your cash at any of 300 automatic tellers'"    │
│ cautious        : "reserved and cautious; never making swift decisions" ■ "a cautious driver"    │
│ unobjectionable : "the ends are unobjectionable; it's the means that one can't accept"           │
│ irritating      : "bites of black flies are more than irritating; they can be very painful"      │
│ bashful         : "I never laughed, being bashful; lowering my head, I looked at the wall"- Ezra │
│                        Pound                                                                     │
│ gutsy           : "you have to admire her; it was a gutsy thing to do" ■ "the gutsy...intensity  │
│                        of her musical involvement"-Judith Crist ■ "a gutsy red wine"             │
│ crisp           : "a crisp retort" ■ "a response so curt as to be almost rude" ■ "the laconic    │
│                        reply; `yes'" ■ "short and terse and easy to understand"                  │
│ divine          : "the custom of killing the divine king upon any serious failure of             │
│                        his...powers"-J.G.Frazier ■ "the divine will" ■ "the divine capacity for  │
│                        love" ■ "'Tis wise to learn; 'tis God-like to create"-J.G.Saxe            │
│ forward         : "he faced forward" ■ "step forward" ■ "she practiced sewing backward as well   │
│                        as frontward on her new sewing machine"                                   │
│ busy            : "her line is busy" ■ "receptionists' telephones are always engaged" ■ "the     │
│                        lavatory is in use" ■ "kept getting a busy signal"                        │
│ crocketed       : "a crocketed spire"                                                            │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘

t3irkdon5#

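Thanks for the great analysis, the list of examples, and the proposed solution. If NLTK's wordnet module continues to be maintained, I think your solution would be useful. If NLTK instead moves to WN-LMF based wordnets, the effort here could be applied to a WNDB-to-LMF conversion script, particularly the part about extracting examples.

I also have a couple of suggestions:

  1. Use str.partition() instead of str.split() to avoid a ValueError when a | character appears in the gloss.

  2. If an example contains two sets of quotes, this approach will not work properly, because it only picks up the start of the first quote and the end of the last one, e.g. (made-up example):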

| ...; "Yeah well, the Dude abides."- Jeffrey Lebowski, "The Dude"

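However, these problems may not even occur in the data, and new WNDB sources are unlikely to appear and challenge the robustness of the system, so it is not worth putting too much effort into this. It is probably better to detect the remaining problems and write special cases for them.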
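
The first suggestion can be illustrated with a short sketch. The data line below is hypothetical (no such line is claimed to exist in WordNet); it only exists to put a literal | inside the gloss.

```python
# Hypothetical data line whose gloss contains a literal "|" character
line = '00000000 00 n 01 example 0 000 | a made-up gloss; "a | b"  \n'

try:
    columns_str, gloss = line.split("|")
    split_ok = True
except ValueError:
    # split() returns three pieces here, so unpacking into two names fails
    split_ok = False

# partition() splits on the first "|" only and always returns three items
columns_str, _, gloss = line.partition("|")

print(split_ok)             # False
print(repr(gloss.strip()))  # 'a made-up gloss; "a | b"'
```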

dgiusagp6#

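I've thought about this some more. I'm not sure we should try to extract multiple definitions. From the WNDB docs (emphasis mine):

  • gloss

Each synset contains a gloss. A gloss is represented as a vertical bar (|), followed by a text string that continues until the end of the line. The gloss may contain a definition, one or more example sentences, or both.
In addition, there are sometimes multiple definitions, but these tend to be merged into one big definition, e.g. in the updated Open English WordNet, so maybe we should follow that precedent.
If we just say the definition is everything up to the first clear example or the end of the string, whichever comes first, then dealing with parentheses and the like becomes much easier. I only see one case in PWN 3.0 where a ; occurs inside parentheses, and it is not followed by a " :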

$ grep -P '\|.*?\([^)]*?;' data.*
data.noun:06844739 10 n 01 semicolon 0 001 @ 06841365 n 0000 | a punctuation mark (`;') used to connect independent clauses; indicates a closer relation than does a period

It also helps when a ; does not actually separate definitions but rather items in a list. For example (abbreviated):
data.noun:01725240 ... | extinct marine reptiles: plesiosaurs; nothosaurs
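
A minimal sketch of the "everything up to the first clear example" rule follows. What counts as a "clear" example is an assumption made here: a double quote that opens the gloss or follows a semicolon. The names split_gloss and FIRST_EXAMPLE are hypothetical and not part of NLTK.

```python
import re

# Assumption: a "clear" example starts at a double quote that either opens
# the gloss or follows a semicolon (with optional whitespace).
FIRST_EXAMPLE = re.compile(r'(?:^|;\s*)"')

def split_gloss(gloss):
    """Return (definition, rest), where rest starts at the first example."""
    gloss = gloss.strip()
    match = FIRST_EXAMPLE.search(gloss)
    if match is None:
        return gloss, ""
    # match.end() - 1 is the index of the opening double quote
    return gloss[:match.start()].strip(), gloss[match.end() - 1:]

semicolon = split_gloss(
    "a punctuation mark (`;') used to connect independent clauses; "
    "indicates a closer relation than does a period"
)
reptiles = split_gloss("extinct marine reptiles: plesiosaurs; nothosaurs")
unobjectionable = split_gloss(
    'not objectionable; "the ends are unobjectionable; '
    "it's the means that one can't accept\""
)

print(semicolon)
print(reptiles)
print(unobjectionable)
```

With this rule, semicolons inside the definition (in parentheses or in a list) need no special handling, since nothing before the first example is split.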
Once we have determined where the definition ends, we can allow some variation in the delimiters between examples:

$ grep -P -o '"[^"]*"[^;]? "[^"]*"' data.*
data.adj:"unfavorable comments", "unfavorable impression"
data.adj:"universal wrench", "universal chuck"
data.adj:"an advisory memorandum", "his function was purely consultative"
data.adj:"a difficult child", "an unmanageable situation"
data.adj:"a potent cup of tea", "a stiff drink"
data.adj:"interchangeable electric outlets" "interchangeable parts"
data.adj:"assimilative processes", "assimilative capacity of the human mind"
data.adj:"a rich vein of copper", "a rich gas mixture"
data.adj:"secular architecture", "children being brought up in an entirely profane environment"
data.adv:"well-done beef", "well-satisfied customers"
data.noun:"it was the swimming they enjoyed most": "they took a short swim in the pool"
data.noun:"the head of the nail", "a pinhead is the head of a pin"
data.noun:"the depth of his sighs," "the depth of his emotion"
data.noun:"dogs, foxes, and the like", "we don't want the likes of you around here"
data.noun:"in contrast to", "by contrast"
data.verb:"The efforts were intensified", "Her rudeness intensified his dislike for her"
data.verb:"How do you evaluate this grant proposal?" "We shouldn't pass judgment on other people"
data.verb:"The journalists have defamed me!" "The article in the paper sullied my reputation"
data.verb:"How do you spell this word?" "We had to spell out our names for the police officer"
data.verb:"pin the needle to the shirt". "pin the blame on the innocent man"
data.verb:"the bodybuilder's neck muscles tensed;" "the rope strained when the weight was attached"
data.verb:"The burglar jimmied the lock": "Raccoons managed to pry the lid off the garbage pail"
data.verb:"bring charges", "institute proceedings"
data.verb:"the vestiges of political democracy were soon uprooted" "root out corruption"
data.verb:"The tourists moved through the town and bought up all the souvenirs;" "Some travelers pass through the desert"
data.verb:"deliver an attack", "deliver a blow"
data.verb:"I don't dare call him", "she dares to dress differently from the others"
data.verb:"weigh heavily on the mind", "Something pressed on his mind"
data.verb:"Where is my umbrella?" "The toolshed is in the back"
data.verb:"Rivers traverse the valley floor", "The parking lot spans 3 acres"
data.verb:"Can he reach?" "The chair must not touch the wall"

That is, within the examples, the characters ;:,. and ` ` (space) are all used to separate individual examples.
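
One way to tolerate all of these separators is to ignore them and collect the quoted spans directly. This is a minimal sketch under that assumption; split_examples is a hypothetical helper, and it handles neither attributions after the closing quote nor the special cases listed in this comment.

```python
import re

def split_examples(rest):
    """Collect every double-quoted span, whatever separates the spans."""
    return re.findall(r'"([^"]*)"', rest)

comma = split_examples('"unfavorable comments", "unfavorable impression"')
period = split_examples('"pin the needle to the shirt". "pin the blame on the innocent man"')
space = split_examples('"interchangeable electric outlets" "interchangeable parts"')

print(comma)
print(period)
print(space)
```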
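There are also two special cases where a closing quote is immediately followed by a word character, and they call for different resolutions:

  • Should be split into two examples at the middle quote: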

data.noun:04203889 ... | the commodities purchased from stores; "she loaded her shopping into the car"women carrying home shopping didn't give me a second glance"

  • The middle quote should be included literally:

data.verb:01148961 ... | take sides for or against; "Who are you widing with?"; "I"m siding against the current candidate"
Some other interesting examples:

  • or as an example separator, with trailing punctuation:

data.noun:05677340 ... | an intuitive awareness; "he has a feel for animals" or "it's easy when you get the feel of it";

  • Continued quotation:

data.verb:00781000 ... | continue talking; "I know it's hard," he continued, "but there is no choice"; "carry on--pretend we are not in the room"

  • e.g. or as in e.g. before the example:

data.adj:00347707 ... | incapable of being changed or moved or undone; e.g. "frozen prices"; "living on fixed incomes"
data.noun:06469597 ... | a summary list; as in e.g. "a news roundup"
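These get into the long tail of malformed gloss strings, so generalizing the patterns to handle them would take a lot of work for little gain. Also, one criterion (which I have in another project but which may not be relevant for NLTK) is that the underlying data stays unchanged, so I have to accept some loss of precision in identifying examples.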
