nltk: load a .pickle file from an AWS S3 URL and use it with word_tokenize()

t1rydlwq · asked 5 months ago

I uploaded the german.pickle file to AWS S3 and would like to use it with word_tokenize().
First I loaded the .pickle file and used it with tokenize():

import nltk
from nltk.tokenize import word_tokenize

text = "Ich bin ein Test. Tokenisierung ist toll"

tokenizer = nltk.data.load(resource_url = 'https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle')

print(tokenizer.tokenize(text))
>> ['Ich bin ein Test.', 'Tokenisierung ist toll']

As a result I get the text tokenized into sentences.
But when I use the following code, I receive an AttributeError:

tokenizer = nltk.data.load(resource_url = 'https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle')

print(tokenizer.word_tokenize(text))

>> AttributeError: 'PunktSentenceTokenizer' object has no attribute 'word_tokenize'

How can I use nltk.tokenize.word_tokenize() with the file downloaded from S3? When I try the usual way, word_tokenize(text, language='german'), a LookupError is raised because the data is not available in my local environment.
Regards!


c2e8gylq1#

Hello!
I see you have uploaded the German Punkt model, which works for sentence splitting but not for word splitting, as you noticed.
The reason is that Punkt (PunktSentenceTokenizer) is a sentence tokenizer, while the model used for word tokenization is a different one: NLTKWordTokenizer.
I'm not 100% sure, and I don't have time to verify it right now, but I believe NLTKWordTokenizer does not require any additional data that would trigger the LookupError in your local environment. That means you should be able to use the following code:

import nltk
from nltk.tokenize import word_tokenize

text = "Ich bin ein Test. Tokenisierung ist toll"

tokenizer = nltk.data.load(resource_url = 'https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle')
sentences = tokenizer.tokenize(text)

sentences = [word_tokenize(sentence, preserve_line=True) for sentence in sentences]

Note the preserve_line=True here. The word_tokenize function looks like this:
nltk/nltk/tokenize/__init__.py, lines 114 to 132:

def word_tokenize(text, language="english", preserve_line=False):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into words
    :type text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: A flag to decide whether to sentence tokenize the text or not.
    :type preserve_line: bool
    """
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [
        token for sent in sentences for token in _treebank_word_tokenizer.tokenize(sent)
    ]
As you can see here, when preserve_line is False (the default), sent_tokenize is used, which is exactly what we want to avoid; instead, we use the sentence tokenizer we already loaded (from AWS).
I hope this helps!
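
If a single flat list of tokens is preferred (the shape word_tokenize() normally returns) rather than one list per sentence, the per-sentence results from the snippet above can simply be flattened. A minimal sketch, reusing the same placeholder S3 URL and the tokenizer/text names from that snippet; the expected output is an assumption, not verified here:

import nltk
from nltk.tokenize import word_tokenize

text = "Ich bin ein Test. Tokenisierung ist toll"

# Punkt sentence tokenizer loaded straight from the S3 URL, as above
tokenizer = nltk.data.load(resource_url='https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle')

# Sentence-split with the S3-hosted Punkt model, then word-tokenize each
# sentence with preserve_line=True so sent_tokenize (and its data lookup)
# is never triggered; the nested results are flattened into one list.
tokens = [
    token
    for sentence in tokenizer.tokenize(text)
    for token in word_tokenize(sentence, preserve_line=True)
]

print(tokens)
# expected (untested here): ['Ich', 'bin', 'ein', 'Test', '.', 'Tokenisierung', 'ist', 'toll']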


xoshrz7s2#

Hello @tomaarsen,

Thanks for the quick help! I will implement it this way, but I also noticed a difference: word_tokenize() returns a single list containing all tokens, whereas sentences = [word_tokenize(sentence, preserve_line=True) for sentence in sentences] returns a list in which each sublist contains the tokens of one sentence. It would be even better if it worked directly like this:

import nltk
from nltk.tokenize import word_tokenize

nltk.data.path.append('https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/')

text = "Ich bin ein Test. Tokenisierung ist toll"

print(word_tokenize(text, language='german'))

Although the path is appended to the locations where NLTK tries to find the file, the code returns:

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/german.pickle

The URL of the file is: https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle
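
As far as I know (worth double-checking against the NLTK docs), nltk.data.path is a list of local directories or zip files searched by nltk.data.find(), so an appended HTTPS URL is simply ignored, which matches the LookupError above. A possible workaround, sketched below under those assumptions, is to download german.pickle from the S3 URL into a local directory that mirrors the nltk_data layout and append that directory instead. The /tmp/nltk_data location is just an example path, and this assumes an NLTK version whose sent_tokenize still resolves tokenizers/punkt/<language>.pickle (newer releases look for punkt_tab):

import os
import urllib.request

import nltk
from nltk.tokenize import word_tokenize

# Placeholder URL from this thread; replace with the real bucket/key.
PICKLE_URL = 'https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle'

# nltk.data.find() searches local directories, so mirror the nltk_data layout:
# <local_dir>/tokenizers/punkt/german.pickle
local_dir = '/tmp/nltk_data'  # assumed location; any writable directory works
target = os.path.join(local_dir, 'tokenizers', 'punkt', 'german.pickle')

os.makedirs(os.path.dirname(target), exist_ok=True)
if not os.path.exists(target):
    # One-time download from the (public) S3 object
    urllib.request.urlretrieve(PICKLE_URL, target)

# Now the resource lives on a local path that NLTK can actually search
nltk.data.path.append(local_dir)

text = "Ich bin ein Test. Tokenisierung ist toll"
print(word_tokenize(text, language='german'))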
