Hello!
First of all, thanks for developing BERTopic, it's great! However, I've run into a problem when trying to rename my cluster representations. As long as I only use an embedding_model, everything works fine. But as soon as I add a representation_model, I keep getting the same error.
Here is some sample code, inspired by the documentation.
# Import the necessary libraries
from bertopic import BERTopic
import pandas as pd
from transformers import pipeline
from bertopic.representation import TextGeneration

# prompt = "I have a topic described by the following keywords: [KEYWORDS]. Based on the previous keywords, what is this topic about?"

# 1. Create your representation model
generator = pipeline('text2text-generation', model='google/flan-t5-base')
representation_model = TextGeneration(generator)

# 2. Get some sample data
data = pd.read_excel('testdata.xlsx')

# 3. Initialize BERTopic with the representation model
topic_model = BERTopic(
    embedding_model='paraphrase-multilingual-mpnet-base-v2',
    representation_model=representation_model  # if this argument is commented out, the code works
)

# 4. Fit BERTopic to the sample texts
topics, _ = topic_model.fit_transform(data['text'])

# 5. Get the topic information
topic_info = topic_model.get_topic_info()

# 6. Print the topic information
print(topic_info)
The error I get is:
TypeError Traceback (most recent call last)
Cell In[3], line 26
20 topic_model = BERTopic(
21 embedding_model= 'paraphrase-multilingual-mpnet-base-v2',
22 representation_model = representation_model
23 )
25 # 6. Fit BERTopic to the sample texts
---> 26 topics, _ = topic_model.fit_transform(data['Absatz'])
28 # 6. Get the topic information
29 topic_info = topic_model.get_topic_info()
File ~/Code/NDR/.venv/lib/python3.11/site-packages/bertopic/_bertopic.py:433, in BERTopic.fit_transform(self, documents, embeddings, images, y)
430 self._save_representative_docs(custom_documents)
431 else:
432 # Extract topics by calculating c-TF-IDF
--> 433 self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)
435 # Reduce topics
436 if self.nr_topics:
File ~/Code/NDR/.venv/lib/python3.11/site-packages/bertopic/_bertopic.py:3637, in BERTopic._extract_topics(self, documents, embeddings, mappings, verbose)
3635 documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
3636 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
-> 3637 self.topic_representations_ = self._extract_words_per_topic(words, documents)
3638 self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)
3639 self.topic_labels_ = {key: f"{key}_" + "_".join([word[0] for word in values[:4]])
3640 for key, values in
3641 self.topic_representations_.items()}
File ~/Code/NDR/.venv/lib/python3.11/site-packages/bertopic/_bertopic.py:3922, in BERTopic._extract_words_per_topic(self, words, documents, c_tf_idf, calculate_aspects)
3920 topics = tuner.extract_topics(self, documents, c_tf_idf, topics)
3921 elif isinstance(self.representation_model, BaseRepresentation):
-> 3922 topics = self.representation_model.extract_topics(self, documents, c_tf_idf, topics)
3923 elif isinstance(self.representation_model, dict):
3924 if self.representation_model.get("Main"):
File ~/Code/NDR/.venv/lib/python3.11/site-packages/bertopic/representation/_textgeneration.py:147, in TextGeneration.extract_topics(self, topic_model, documents, c_tf_idf, topics)
143 updated_topics = {}
144 for topic, docs in tqdm(repr_docs_mappings.items(), disable=not topic_model.verbose):
145
146 # Prepare prompt
--> 147 truncated_docs = [truncate_document(topic_model, self.doc_length, self.tokenizer, doc) for doc in docs]
148 prompt = self._create_prompt(truncated_docs, topic, topics)
149 self.prompts_.append(prompt)
TypeError: 'NoneType' object is not iterable
I'm running this on an M1 Mac, in case that helps. Any help would be greatly appreciated. I also tried copying all the code from the best practices guide and got the same error.
Best wishes!
Alex Mühlhausen
4 Answers
zour9fqk 1#
Honestly, I'm not sure what is going on here. I believe there is an identical open issue that has not been resolved yet, but it may be related to the underlying T5 model. Also, have you tried passing the documents as a list of strings instead of a pandas Series?
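For reference, that suggestion would look like this (a minimal sketch reusing the variables from the original post):
topics, _ = topic_model.fit_transform(data['text'].tolist())  # plain Python list instead of a pandas Series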
v1l68za4 2#
I'm running into the same problem, but only when using the TextGeneration representation model. I can generate other representations without any issues. I did try passing the documents as a list of strings, but the error persists. Running the same code on v0.15.0 succeeds.
Edit: I did some digging and found that the problem is in this line. It seems that whenever the default prompt is used, the top representative documents will be None. To fix this, an empty list can be assigned as the default value in the else condition at line 141. I opened a PR with this change.
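Until that fix lands, a possible user-side workaround consistent with this diagnosis (a sketch, not an official recommendation) is to pass TextGeneration a custom prompt that contains the [DOCUMENTS] placeholder, so that representative documents are actually extracted instead of being left as None:

# Hypothetical workaround: a custom prompt with [DOCUMENTS] makes BERTopic
# extract representative documents, sidestepping the None value that the
# default keywords-only prompt leaves behind.
prompt = "I have a topic described by the documents [DOCUMENTS] and the keywords [KEYWORDS]. What is this topic about?"
representation_model = TextGeneration(generator, prompt=prompt)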
puruo6ea 3#
Thanks for the PR. I just merged #1726, which should resolve the issue. Could one of you test it so that I know it works for others as well?
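If you want to verify the fix before it is part of a release, a common approach (assuming the project's public GitHub repository) is to install BERTopic directly from the main branch, e.g. pip install --upgrade git+https://github.com/MaartenGr/BERTopic.git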
z6psavjg 4#
Thanks for the update! I tested it and it runs without any errors on my end.