How can I use GridSearchCV for clustering (MeanShift or DBSCAN)?

aydmsdu9  published on 2023-06-24  in Flutter
Follow (0) | Answers (3) | Views (142)

I'm trying to cluster some text documents using scikit-learn. I'm trying both DBSCAN and MeanShift and want to determine which hyperparameters (e.g. bandwidth for MeanShift, eps for DBSCAN) work best for the kind of data I'm using (news articles).
I have some test data that comes with pre-labeled clusters. I have been trying to use scikit-learn's GridSearchCV, but I don't understand how (or whether) it can be applied in this case, since it splits the test data, whereas I want to run the evaluation on the whole dataset and compare the results against the pre-labeled data.
I have been trying to specify a scoring function that compares the estimator's labels to the true labels, but of course it doesn't work, because only a sample of the data is clustered, not all of it.
What would be an appropriate approach here?


z9ju0rcb1#

The following DBSCAN function may help. I wrote it to iterate over the hyperparameters eps and min_samples, and it includes optional arguments for the minimum and maximum number of clusters. Since DBSCAN is unsupervised, I did not include an evaluation metric.

def dbscan_grid_search(X_data, lst, clst_count, eps_space = (0.5,),
                       min_samples_space = (5,), min_clust = 0, max_clust = 10):

    """
Performs a hyperparameter grid search for DBSCAN.

Parameters:
    * X_data            = data used to fit the DBSCAN instance
    * lst               = a list to store the results of the grid search
    * clst_count        = a list to store the per-cluster point counts (a Counter for each qualifying combination)
    * eps_space         = the range of values for the eps parameter
    * min_samples_space = the range of values for the min_samples parameter
    * min_clust         = the minimum number of clusters required for a result to be appended to lst
    * max_clust         = the maximum number of clusters allowed for a result to be appended to lst

Example:

# Loading Libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
import numpy as np

# Loading iris dataset
iris = datasets.load_iris()
X = iris.data[:, :] 
y = iris.target

# Scaling X data
dbscan_scaler = StandardScaler()

dbscan_scaler.fit(X)

dbscan_X_scaled = dbscan_scaler.transform(X)

# Setting empty lists in global environment
dbscan_clusters = []
cluster_count   = []

# Inputting function parameters
dbscan_grid_search(X_data = dbscan_X_scaled,
                   lst = dbscan_clusters,
                   clst_count = cluster_count,
                   eps_space = np.arange(0.1, 5, 0.1),
                   min_samples_space = np.arange(1, 50, 1),
                   min_clust = 3,
                   max_clust = 6)

"""

    # Importing the dependencies used inside the function
    from collections import Counter
    from sklearn.cluster import DBSCAN

    # Starting a tally of total iterations
    n_iterations = 0

    # Looping over each combination of hyperparameters
    for eps_val in eps_space:
        for samples_val in min_samples_space:

            dbscan_grid = DBSCAN(eps = eps_val,
                                 min_samples = samples_val)

            # Fitting DBSCAN and assigning a cluster label to every point
            clusters = dbscan_grid.fit_predict(X = X_data)

            # Counting the amount of data in each cluster
            cluster_count = Counter(clusters)

            # Saving the number of clusters found (noise points, labeled -1, are excluded)
            n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)

            # Increasing the iteration tally with each run of the loop
            n_iterations += 1

            # Appending to lst each time the n_clusters criteria are met
            if min_clust <= n_clusters <= max_clust:

                lst.append([eps_val,
                            samples_val,
                            n_clusters])

                clst_count.append(cluster_count)

    # Printing grid search summary information
    print(f"""Search Complete. \nYour list is now of length {len(lst)}. """)
    print(f"""Hyperparameter combinations checked: {n_iterations}. \n""")

3bygqnnd2#

Have you considered performing the search yourself?
Implementing the for loop is not particularly difficult, even if you want to optimize both parameters at once; see the sketch below.
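
A possible form of such a manual loop, scoring each eps / min_samples combination against the pre-labeled data with adjusted_rand_score (the texts, labels, and grid values here are placeholders, not from the original answer):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

# Placeholder data: replace with your own articles and pre-labeled clusters
texts = ["first news article ...", "second news article ...", "third one ..."]
true_labels = [0, 0, 1]

# TF-IDF representation of the documents
X = TfidfVectorizer().fit_transform(texts)

best_score, best_params = -1.0, None
for eps in np.arange(0.1, 1.0, 0.1):            # arbitrary example grid
    for min_samples in (2, 3, 5):               # arbitrary example grid
        labels = DBSCAN(eps=eps, min_samples=min_samples,
                        metric="cosine").fit_predict(X)
        # Compare the clustering to the pre-labeled ground truth
        score = adjusted_rand_score(true_labels, labels)
        if score > best_score:
            best_score, best_params = score, (eps, min_samples)

print(best_params, best_score)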
However, for both DBSCAN and MeanShift, I recommend first understanding your similarity measure. It makes more sense to choose the parameters based on an understanding of that measure than to optimize the parameters to match some labels (there is a high risk of overfitting).
In other words: at what distance should two articles be clustered together?
If this distance varies too much from one data point to another, these algorithms will fail badly, and you may need to find a normalized distance function so that the actual similarity values become meaningful again. TF-IDF is standard for text, but mostly in a retrieval context; it can work much worse in a clustering context. One way to sanity-check this is sketched below.
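
As a rough check (a sketch with placeholder texts, not part of the original answer), you could look at the distribution of pairwise cosine distances in the TF-IDF space before picking eps or bandwidth:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Placeholder documents: replace with your own articles
texts = ["first news article ...", "second news article ...", "third one ..."]
X = TfidfVectorizer().fit_transform(texts)

# Pairwise cosine distances between all documents (dense matrix)
D = cosine_distances(X)

# Spread of the upper-triangle distances: if "similar" and "dissimilar" pairs
# are not clearly separated, no single eps or bandwidth will work well for
# the whole corpus.
upper = D[np.triu_indices_from(D, k=1)]
print(np.percentile(upper, [5, 25, 50, 75, 95]))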
Also note that MeanShift (similarly to k-means) needs to recompute coordinates; on text data this can produce undesirable results, where the updated coordinates actually become worse rather than better.


ej83mcc03#

You can specify the cv parameter of GridSearchCV as "An iterable yielding (train, test) splits as arrays of indices" (quoting the doc).
For DBSCAN in particular there is another problem: it has no predict method. I used the solution from this answer.
Here is some example code.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# The scorer function: counts how many predicted labels match the ground truth
def cmp(y_true, y_pred):
    return np.sum(y_pred == y_true)

class DBSCANWrapper(DBSCAN):
    # Won't work if `_X` is not the same X used in `self.fit`
    def predict(self, _X, _y=None):
        return self.labels_

# Let X be your data to cluster, e.g.:
X = np.random.rand(100, 10)
# Let y_true be the groundtruth clustering result, e.g.:
y_true = np.random.randint(5, size=100)
# hyper parameters to search, e.g.:
hyperparams_dict = {'eps': np.linspace(0.1, 1.0, 10)}

# Note the spec of `cv` here: train and test are both the full set of indices
cv = [(np.arange(X.shape[0]), np.arange(X.shape[0]))]

search = GridSearchCV(DBSCANWrapper(), hyperparams_dict, scoring=make_scorer(cmp), cv=cv)
search.fit(X, y_true)
print(search.best_params_)

"But of course it doesn't work, because only a sample of the data is clustered, not all of it."
If you don't want to fit on a train set and evaluate on a test set different from the train set (which of course would not work with DBSCAN anyway), the solution above also works: you only need to modify the cv = ... line (one possible alternative form is sketched below).
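
For instance, if one did want a conventional held-out split (for an estimator whose predict genuinely handles unseen data), a cv entry could look like the hypothetical sketch below; note that with the DBSCANWrapper above this would not score meaningfully, since its predict simply returns the labels produced by fit:

# Hypothetical alternative cv: fit on 80 randomly chosen points, score on the remaining 20
rng = np.random.RandomState(0)
perm = rng.permutation(X.shape[0])
cv = [(perm[:80], perm[80:])]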
