nltk: load a .pickle file from an AWS S3 URL and use it with word_tokenize()

t1rydlwq · asked 5 months ago

I uploaded the german.pickle file to AWS S3 and would like to use it with word_tokenize().
First I loaded the .pickle file and used it with tokenize():

import nltk
from nltk.tokenize import word_tokenize

text = "Ich bin ein Test. Tokenisierung ist toll"

tokenizer = nltk.data.load(resource_url = 'https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle')

print(tokenizer.tokenize(text))
>> ['Ich bin ein Test.', 'Tokenisierung ist toll']

As a result I get the text tokenized into sentences.
But when I use the following code, I receive an AttributeError:

tokenizer = nltk.data.load(resource_url = 'https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle')

print(tokenizer.word_tokenize(text))

>> AttributeError: 'PunktSentenceTokenizer' object has no attribute 'word_tokenize'

How can I use nltk.tokenize.word_tokenize() with the file downloaded from S3? When I try the usual way, word_tokenize(text, language='german'), a LookupError is raised because the data is not available in my local environment.
Regards!


c2e8gylq1#

Hello!
I see you have uploaded the German Punkt model, which works for sentence splitting but not for word splitting, as you noticed.
The reason is that Punkt (PunktSentenceTokenizer) is a sentence tokenizer, while the model used for word tokenization is a different one: NLTKWordTokenizer.
I'm not 100% sure, and I don't have time to verify it right now, but I believe NLTKWordTokenizer does not require any additional data that would trigger the LookupError in your local environment. That means you should be able to use the following code:

import nltk
from nltk.tokenize import word_tokenize

text = "Ich bin ein Test. Tokenisierung ist toll"

tokenizer = nltk.data.load(resource_url = 'https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle')
sentences = tokenizer.tokenize(text)

sentences = [word_tokenize(sentence, preserve_line=True) for sentence in sentences]

Note the preserve_line=True here. The word_tokenize function looks like this:
nltk/nltk/tokenize/__init__.py, lines 114 to 132:

def word_tokenize(text, language="english", preserve_line=False):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into words
    :type text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: A flag to decide whether to sentence tokenize the text or not.
    :type preserve_line: bool
    """
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [
        token for sent in sentences for token in _treebank_word_tokenizer.tokenize(sent)
    ]
As you can see here, when preserve_line is False (the default), sent_tokenize is used, which is exactly what we want to avoid; instead, we use the sentence tokenizer we already loaded (from AWS).
I hope this helps!
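
If a single flat list of tokens is preferred (the shape word_tokenize() normally returns) rather than one list per sentence, the per-sentence results from the snippet above can simply be flattened. A minimal sketch, reusing the same placeholder S3 URL and the tokenizer/text names from that snippet; the expected output is an assumption, not verified here:

import nltk
from nltk.tokenize import word_tokenize

text = "Ich bin ein Test. Tokenisierung ist toll"

# Punkt sentence tokenizer loaded straight from the S3 URL, as above
tokenizer = nltk.data.load(resource_url='https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle')

# Sentence-split with the S3-hosted Punkt model, then word-tokenize each
# sentence with preserve_line=True so sent_tokenize (and its data lookup)
# is never triggered; the nested results are flattened into one list.
tokens = [
    token
    for sentence in tokenizer.tokenize(text)
    for token in word_tokenize(sentence, preserve_line=True)
]

print(tokens)
# expected (untested here): ['Ich', 'bin', 'ein', 'Test', '.', 'Tokenisierung', 'ist', 'toll']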


xoshrz7s2#

Hello @tomaarsen,

Thanks for the quick help! I will implement it this way, but I also noticed a difference: word_tokenize() returns a single list containing all tokens, whereas sentences = [word_tokenize(sentence, preserve_line=True) for sentence in sentences] returns a list in which each sublist contains the tokens of one sentence. It would be even better if it worked directly like this:

import nltk
from nltk.tokenize import word_tokenize

nltk.data.path.append('https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/')

text = "Ich bin ein Test. Tokenisierung ist toll"

print(word_tokenize(text, language='german'))

Although the path is appended to the locations where NLTK tries to find the file, the code returns:

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/german.pickle

The URL of the file is: https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle
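
As far as I know (worth double-checking against the NLTK docs), nltk.data.path is a list of local directories or zip files searched by nltk.data.find(), so an appended HTTPS URL is simply ignored, which matches the LookupError above. A possible workaround, sketched below under those assumptions, is to download german.pickle from the S3 URL into a local directory that mirrors the nltk_data layout and append that directory instead. The /tmp/nltk_data location is just an example path, and this assumes an NLTK version whose sent_tokenize still resolves tokenizers/punkt/<language>.pickle (newer releases look for punkt_tab):

import os
import urllib.request

import nltk
from nltk.tokenize import word_tokenize

# Placeholder URL from this thread; replace with the real bucket/key.
PICKLE_URL = 'https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle'

# nltk.data.find() searches local directories, so mirror the nltk_data layout:
# <local_dir>/tokenizers/punkt/german.pickle
local_dir = '/tmp/nltk_data'  # assumed location; any writable directory works
target = os.path.join(local_dir, 'tokenizers', 'punkt', 'german.pickle')

os.makedirs(os.path.dirname(target), exist_ok=True)
if not os.path.exists(target):
    # One-time download from the (public) S3 object
    urllib.request.urlretrieve(PICKLE_URL, target)

# Now the resource lives on a local path that NLTK can actually search
nltk.data.path.append(local_dir)

text = "Ich bin ein Test. Tokenisierung ist toll"
print(word_tokenize(text, language='german'))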
