BERTopic: Supervised Topic Modelling cannot produce the same output with fit_transform() and transform()

Posted by jk9hmnmh 5 months ago in Other


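Hi,
I followed the code at https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html for supervised topic modelling with BERTopic. Although fit_transform(docs, y=y) gives me the correct output, transform(docs) does not give the same output, even when the input docs are identical. Most of the resulting topics are labelled -1.
What is going wrong here?
For reference, this is my model code: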
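from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction
from bertopic.vectorizers import ClassTfidfTransformer
from sklearn.linear_model import LogisticRegression

empty_dimensionality_model = BaseDimensionalityReduction()
clf = LogisticRegression()
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
topic_model = BERTopic(
    umap_model=empty_dimensionality_model,
    hdbscan_model=clf,
    ctfidf_model=ctfidf_model,
    n_gram_range=(1, 3),
)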
Thanks! Any help would be greatly appreciated.


Source: https://github.com/MaartenGr/BERTopic/issues/1270

5 answers


31moq8wy #1

Could you create a reproducible example? Without one it is hard to see exactly what is going wrong. Setting random_state when using LogisticRegression might also help.

5 months ago

yhived7q #2

Hi,
I am running into the same problem.

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.dimensionality import BaseDimensionalityReduction
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer

################## Data Preparation ##################
# Get labeled data
data = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
docs = data["data"]
y = data["target"]

################## Topic modelling ##################

# Prepare embeddings
sentence_model = SentenceTransformer("all-mpnet-base-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# Train the model: no dimensionality reduction, classifier in place of clustering
empty_dimensionality_model = BaseDimensionalityReduction()
clf = LogisticRegression(random_state=42)
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

base_topic_model = BERTopic(
    umap_model=empty_dimensionality_model,
    hdbscan_model=clf,
    ctfidf_model=ctfidf_model,
    verbose=True,
    # calculate_probabilities=True,
)
topics, probs = base_topic_model.fit_transform(docs, embeddings, y=y)

# Transform the same documents and check whether the predicted topics match
new_topics, new_probs = base_topic_model.transform(docs, embeddings)
assert topics == new_topics
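
The assertion above only reports pass or fail. To see how far the two assignments diverge, a small helper along the following lines may be useful. This is a sketch only: `compare_topics` is a hypothetical name, not part of BERTopic, and the sample lists are made up for illustration.

```python
from collections import Counter


def compare_topics(topics, new_topics):
    """Summarise how two topic assignments for the same documents differ."""
    topics, new_topics = list(topics), list(new_topics)
    if len(topics) != len(new_topics):
        raise ValueError("both assignments must cover the same documents")
    # Pairs (label from fit_transform, label from transform) that disagree
    changed = [(a, b) for a, b in zip(topics, new_topics) if a != b]
    matches = len(topics) - len(changed)
    return {
        # Fraction of documents that received the same topic both times
        "agreement": matches / len(topics) if topics else 1.0,
        # Number of documents labelled -1 in the second assignment
        "n_outliers": sum(1 for t in new_topics if t == -1),
        # Most frequent (old, new) label changes
        "top_changes": Counter(changed).most_common(3),
    }


# Tiny made-up assignments, purely for illustration
report = compare_topics([0, 1, 2, 2, 1], [0, -1, 2, -1, 1])
print(report)
```

With the reproduction above, `compare_topics(topics, new_topics)` would show what share of documents changed topic and how many ended up as -1, which is more informative than a bare assertion failure.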

5 months ago

pkln4tw6 #3

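I believe the following is happening in your case. When you run .fit_transform with a predictive model, it returns the y variable rather than actual predictions. Because the model is fitted as a supervised task and you have already assigned the documents to clusters, y is used directly as the so-called "predictions", even though they are technically not predictions at all.
When you then run .transform, the underlying model's .predict is called to generate predictions, because we assume .transform is used to produce predictions for unseen documents.
In short, BERTopic's .fit_transform only calls .fit on your LogisticRegression and returns y without running any prediction. BERTopic's .transform, by contrast, calls .predict on your LogisticRegression and returns its predictions. If the two sets of topics were actually identical, that would in fact mean your model is overfitting.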
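
The behaviour described above can be imitated with a toy stand-in. The sketch below is not BERTopic's actual code: `ToySupervisedModel` and `NearestCentroid1D` are hypothetical classes written only to illustrate the point, with a trivial nearest-centroid classifier taking the place of LogisticRegression and a handful of made-up 1-D points as data.

```python
from collections import defaultdict


class NearestCentroid1D:
    """Toy classifier on 1-D points; stands in for LogisticRegression."""

    def fit(self, X, y):
        sums = defaultdict(float)
        counts = defaultdict(int)
        for x, label in zip(X, y):
            sums[label] += x
            counts[label] += 1
        # One centroid (mean position) per class label
        self.centroids_ = {label: sums[label] / counts[label] for label in sums}
        return self

    def predict(self, X):
        # Assign each point to the class with the closest centroid
        return [
            min(self.centroids_, key=lambda label: abs(x - self.centroids_[label]))
            for x in X
        ]


class ToySupervisedModel:
    """Hypothetical wrapper mirroring the behaviour described above."""

    def __init__(self, clf):
        self.clf = clf

    def fit_transform(self, X, y):
        self.clf.fit(X, y)  # only .fit is called on the classifier
        return list(y)      # the supplied labels come back unchanged

    def transform(self, X):
        return self.clf.predict(X)  # genuine predictions from the classifier


X = [0.0, 0.2, 0.4, 2.6, 3.0, 3.2, 3.4]
y = [0, 0, 0, 0, 1, 1, 1]  # the point 2.6 is labelled 0 but lies close to class 1

model = ToySupervisedModel(NearestCentroid1D())
topics = model.fit_transform(X, y)
new_topics = model.transform(X)

# Indices of documents whose "topic" differs between the two calls
mismatches = [i for i, (a, b) in enumerate(zip(topics, new_topics)) if a != b]
print(topics)      # [0, 0, 0, 0, 1, 1, 1]
print(new_topics)  # [0, 0, 0, 1, 1, 1, 1]
print(mismatches)  # [3]
```

The two calls disagree on the one point whose label the classifier cannot reproduce, even though the inputs are identical. Any classifier that does not fit its training labels perfectly will show the same kind of gap.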