scipy 通过SSE确定最佳拟合分布- Python 3.8

vptzau2j 于 2023-01-26 发布在 Python

关注(0)|答案(2)|浏览(349)

我正试图提出一种方法来确定以下分布之间的"最佳拟合"：Gaussian, Multinomial, Bernoulli.
我有一个很大的pandas df，其中每一列都可以看作是一个数字的分布，我试图做的是，对于每一列，确定上面列表的分布为best fit。
我注意到this question问了一些熟悉的问题，但这些看起来都像离散分布测试，而不是连续分布测试。我知道scipy has metrics for a lot of these，但我不能确定如何正确放置输入。我的想法是：
1.对于每列，将数据保存在临时np array中
1.生成Gaussian, Multinomial, Bernoulli分布，执行SSE测试以确定提供"最佳拟合"的分布，然后继续下一列。
示例数据集（任意，我的数据集是29888 x 73231）可以是：

| could | couldnt | coupl | cours | death | develop | dialogu | differ | direct | director | done |
|:-----:|:-------:|:-----:|:-----:|:-----:|:-------:|:-------:|:------:|:------:|:--------:|:----:|
|   0   |    0    |   0   |   1   |   0   |    1    |    1    |    0   |    0   |     0    |   0  |
|   0   |    2    |   1   |   0   |   0   |    1    |    0    |    2   |    0   |     0    |   1  |
|   0   |    0    |   0   |   0   |   0   |    0    |    0    |    0   |    1   |     1    |   2  |
|   1   |    0    |   0   |   0   |   0   |    1    |    0    |    1   |    0   |     0    |   0  |
|   0   |    0    |   0   |   0   |   0   |    1    |    1    |    1   |    1   |     0    |   0  |
|   0   |    0    |   0   |   1   |   0   |    0    |    0    |    0   |    0   |     0    |   1  |
|   0   |    0    |   0   |   0   |   2   |    1    |    0    |    1   |    0   |     0    |   2  |
|   0   |    0    |   0   |   0   |   0   |    1    |    0    |    0   |    2   |     0    |   1  |
|   0   |    0    |   0   |   0   |   0   |    2    |    0    |    0   |    0   |     0    |   0  |
|   0   |    0    |   0   |   1   |   0   |    0    |    5    |    0   |    0   |     0    |   3  |
|   1   |    1    |   0   |   0   |   1   |    2    |    0    |    0   |    1   |     0    |   0  |
|   1   |    1    |   0   |   0   |   0   |    4    |    0    |    0   |    1   |     0    |   1  |
|   0   |    0    |   0   |   0   |   1   |    0    |    0    |    0   |    0   |     0    |   0  |
|   0   |    0    |   0   |   0   |   0   |    0    |    1    |    0   |    0   |     0    |   0  |
|   0   |    0    |   0   |   0   |   0   |    1    |    0    |    3   |    0   |     0    |   1  |
|   2   |    0    |   0   |   0   |   0   |    0    |    0    |    0   |    1   |     0    |   2  |
|   0   |    0    |   1   |   0   |   0   |    0    |    0    |    0   |    0   |     0    |   2  |
|   1   |    1    |   0   |   0   |   1   |    0    |    0    |    1   |    1   |     0    |   2  |
|   0   |    0    |   0   |   0   |   0   |    1    |    0    |    0   |    0   |     0    |   1  |
|   0   |    1    |   0   |   3   |   0   |    0    |    0    |    1   |    1   |     0    |   0  |

我现在有一些基本的代码which was edited from this question，它尝试这样做：

import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels as sm
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')

# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
    """Model data by finding best fit distribution to data"""
    # Get histogram of original data
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0

    # Distributions to check
    DISTRIBUTIONS = [        
        st.norm, st.multinomial, st.bernoulli
    ]

    # Best holders
    best_distribution = st.norm
    best_params = (0.0, 1.0)
    best_sse = np.inf

    # Estimate distribution parameters from data
    for distribution in DISTRIBUTIONS:

        # Try to fit the distribution
        try:
            # Ignore warnings from data that can't be fit
            with warnings.catch_warnings():
                warnings.filterwarnings('ignore')

                # fit dist to data
                params = distribution.fit(data)

                # Separate parts of parameters
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]

                # Calculate fitted PDF and error with fit in distribution
                pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                sse = np.sum(np.power(y - pdf, 2.0))

                # if axis pass in add to plot
                try:
                    if ax:
                        pd.Series(pdf, x).plot(ax=ax)
                    end
                except Exception:
                    pass

                # identify if this distribution is better
                if best_sse > sse > 0:
                    best_distribution = distribution
                    best_params = params
                    best_sse = sse

        except Exception:
            print("Error on: {}".format(distribution))
            pass

        #print("Distribution: {} | SSE: {}".format(distribution, sse))

    return best_distribution.name, best_sse

for col in df.columns:
    nm, pm = best_fit_distribution(df[col])
    print(nm)
    print(pm)

然而，我得到：

Error on: <scipy.stats._multivariate.multinomial_gen object at 0x000002E3CCFA9F40>
Error on: <scipy.stats._discrete_distns.bernoulli_gen object at 0x000002E3CCEF4040>
norm
(4.4, 7.002856560004639)

对于每列，我的预期输出如下所示：Gaussian SSE: <val> | Multinomial SSE: <val> | Bernoulli SSE: <val>

- UPDATE**捕获错误会产生：

Error on: <scipy.stats._multivariate.multinomial_gen object at 0x000002E3CCFA9F40>
'multinomial_gen' object has no attribute 'fit'
Error on: <scipy.stats._discrete_distns.bernoulli_gen object at 0x000002E3CCEF4040>
'bernoulli_gen' object has no attribute 'fit'

为什么会出现错误？我认为是因为multinomial和bernoulli没有fit方法。如何创建拟合方法，并将其集成以获得SSE？? The target output of this function or program would be, for a高斯、多项式、伯努利分布，df中每列的平均SSE是多少，对于每种分布类型（尝试并确定列的最佳拟合）。

- 2015年6月更新**：我加了赏金。
- 2016年6月更新**：由于这是一个更大应用程序的一部分，因此更大的目的是在一个非常大的 Dataframe 过程中辨别tfidf值的最常见分布。然后，基于此，应用sklearn中与最常见分布匹配的朴素贝叶斯分类器。www.example.com包含不同分类器的详细信息。因此，我需要了解的是，scikit-learn.org/stable/modules/naive_bayes.html有一个未显示的列称为class，它是positive或negative分类。我不是在寻找输入，我只是按照我的领导给我的指示。 is which distribution is the best fit across my entire dataframe, which I assumed to mean, which was the most common amongst the distribution of tfidf values in my words. From there, I will know which type of classifier to apply to my dataframe. In the example above, there is a column not shown called class which is a positive or negative classification. I am not looking for input to this, I am simply following the instructions I have been given by my lead.

scipy

来源：https://stackoverflow.com/questions/62360325/determining-best-fit-distributions-by-sse-python-3-8

2条答案

按热度按时间

von4xj4u1#

我把这个问题总结为：给定一列非负整数，我们能拟合一个概率分布，特别是高斯分布、多项式分布和伯努利分布，并比较拟合的质量吗？
对于离散量，正确项为probability mass function：P（k）是所选取的数字恰好等于整数值k的概率。伯努利分布可由p参数参数化：Be（k，p），其中0 ≤ p ≤ 1，k只能取0或1，它是binomial distribution B（k，p，n）的一个特例，参数0 ≤ p ≤ 1，整数n〉= 1。（参见维基百科相关文章，了解p和n的含义）它与伯努利分布相关，如Be（k，p）= B（k，p，n=1）。三项分布T（k1，k2，p1，p2，n）由p1，p2，n参数化，并描述对（k1，k2）的概率。例如，集合{（0，0），（0，1），（1，0），（0，1），（0，0）}可以从三项分布中提取。二项式和三项分布是multinomial distributions的特殊情况;如果数据以五元组的形式出现，例如（1，5，5，2，7），则可以从多项式（六项？）分布M6（k1，...，k5，p1，...，p5，n）中提取这些数据。该问题特别要求了解单列数字的概率分布，因此此处唯一适合的多项式分布是二项式分布。除非你指定序列[0，1，5，2，3，1]应该被解释为[（0，1），（5，2），（3，1）]或[（0，1，5），（2，3，1）]。但该问题没有指定数字可以成对或三元组累加。
因此，就离散分布而言，一个整数列表的PMF的形式为P（k），并且只能在n和p值合适的情况下拟合为二项式分布，如果在n=1时获得最佳拟合，则它是伯努利分布。
高斯分布是连续分布G（x，mu，sigma），其中mu（平均值）和sigma（标准差）是参数。它告诉你找到x 0-a/2〈x〈x 0 +a/2的概率等于G（x 0，μ，σ）*a，对于a〈〈σ严格地说，高斯分布不适用于离散变量，因为高斯分布对于非整数x值具有非零概率，而从整数分布中提取非整数的概率为零。通常，您将使用高斯分布作为二项式分布的近似值，这里你设a=1，P（k）= G（x=k，mu，sigma）*a。
对于足够大的n，根据下式，二项式分布和高斯分布看起来相似

B(k, p, n) =  G(x=k, mu=p*n, sigma=sqrt(p*(1-p)*n)).

如果要拟合高斯分布，可以使用标准scipy函数scipy.stats.norm.fit。此类拟合函数不适用于离散分布（如二项分布）。可以使用函数scipy.optimize.curve_fit拟合非整数参数（如二项分布的p参数）。为了找到最佳整数n值，您需要改变n，为每个n调整p，并选择最适合的n, p组合。
在下面的实现中，我从上面的均值和sigma值的关系中估计n和p，并围绕该值进行搜索。搜索可以做得更聪明，但对于我使用的小测试数据集来说，它已经足够快了。此外，它有助于说明一个观点;我已经提供了一个函数fit_binom，它获取一个带有实际计数的直方图，以及一个函数fit_samples，它可以从 Dataframe 中获取一列数字。

"""Binomial fit routines.

Author: Han-Kwang Nienhuys (2020)
Copying: CC-BY-SA, CC-BY, BSD, GPL, LGPL.
https://stackoverflow.com/a/62365555/6228891 
"""

import numpy as np
from scipy.stats import binom, poisson
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

class BinomPMF:
    """Wrapper so that integer parameters don't occur as function arguments."""
    def __init__(self, n):
        self.n = n
    def __call__(self, ks, p):
        return binom(self.n, p).pmf(ks)

def fit_binom(hist, plot=True, weighted=True, f=1.5, verbose=False):
    """Fit histogram to binomial distribution.
    
    Parameters:

    - hist: histogram as int array with counts, array index as bin.
    - plot: whether to plot
    - weighted: whether to fit assuming Poisson statistics in each bin.
      (Recommended: True).
    - f: try to fit n in range n0/f to n0*f where n0 is the initial estimate.
      Must be >= 1.
    - verbose: whether to print messages.
    
    Return: 
        
    - histf: fitted histogram as int array, same length as hist.
    - n: binomial n value (int)
    - p: binomial p value (float)
    - rchi2: reduced chi-squared. This number should be around 1.
      Large values indicate a bad fit; small values indicate
      "too good to be true" data.
    """ 
   
    hist = np.array(hist, dtype=int).ravel() # force 1D int array
    pmf = hist/hist.sum() # probability mass function
    nk = len(hist)
    if weighted:
        sigmas = np.sqrt(hist+0.25)/hist.sum()
    else:
        sigmas = np.full(nk, 1/np.sqrt(nk*hist.sum()))
    ks = np.arange(nk)
    mean = (pmf*ks).sum()
    variance = ((ks-mean)**2 * pmf).sum()
    
    # initial estimate for p and search range for n
    nest = max(1, int(mean**2 /(mean-variance) + 0.5))
    nmin = max(1, int(np.floor(nest/f)))
    nmax = max(nmin, int(np.ceil(nest*f)))
    nvals = np.arange(nmin, nmax+1)
    num_n = nmax-nmin+1
    verbose and print(f'Initial estimate: n={nest}, p={mean/nest:.3g}')

    # store fit results for each n
    pvals, sses = np.zeros(num_n), np.zeros(num_n)
    for n in nvals:
        # fit and plot
        p_guess = max(0, min(1, mean/n))
        fitparams, _ = curve_fit(
            BinomPMF(n), ks, pmf, p0=p_guess, bounds=[0., 1.],
            sigma=sigmas, absolute_sigma=True)
        p = fitparams[0]
        sse = (((pmf - BinomPMF(n)(ks, p))/sigmas)**2).sum()
        verbose and print(f'  Trying n={n} -> p={p:.3g} (initial: {p_guess:.3g}),'
                          f' sse={sse:.3g}')
        pvals[n-nmin] = p
        sses[n-nmin] = sse
    n_fit = np.argmin(sses) + nmin
    p_fit = pvals[n_fit-nmin]
    sse = sses[n_fit-nmin]    
    chi2r = sse/(nk-2) if nk > 2 else np.nan
    if verbose:
        print(f'  Found n={n_fit}, p={p_fit:.6g} sse={sse:.3g},'
              f' reduced chi^2={chi2r:.3g}')
    histf = BinomPMF(n_fit)(ks, p_fit) * hist.sum()

    if plot:    
        fig, ax = plt.subplots(2, 1, figsize=(4,4))
        ax[0].plot(ks, hist, 'ro', label='input data')
        ax[0].step(ks, histf, 'b', where='mid', label=f'fit: n={n_fit}, p={p_fit:.3f}')
        ax[0].set_xlabel('k')
        ax[0].axhline(0, color='k')
        ax[0].set_ylabel('Counts')
        ax[0].legend()
        
        ax[1].set_xlabel('n')
        ax[1].set_ylabel('sse')
        plotfunc = ax[1].semilogy if sses.max()>20*sses.min()>0 else ax[1].plot
        plotfunc(nvals, sses, 'k-', label='SSE over n scan')
        ax[1].legend()
        fig.show()
        
    return histf, n_fit, p_fit, chi2r

def fit_binom_samples(samples, f=1.5, weighted=True, verbose=False):
    """Convert array of samples (nonnegative ints) to histogram and fit.
    
    See fit_binom() for more explanation.
    """
    
    samples = np.array(samples, dtype=int)
    kmax = samples.max()
    hist, _ = np.histogram(samples, np.arange(kmax+2)-0.5)
    return fit_binom(hist, f=f, weighted=weighted, verbose=verbose) 

def test_case(n, p, nsamp, weighted=True, f=1.5):
    """Run test with n, p values; nsamp=number of samples."""
    
    print(f'TEST CASE: n={n}, p={p}, nsamp={nsamp}')
    ks = np.arange(n+1) # bins
    pmf = BinomPMF(n)(ks, p)
    hist = poisson.rvs(pmf*nsamp)
    fit_binom(hist, weighted=weighted, f=f, verbose=True)

if __name__ == '__main__':
    plt.close('all')
    np.random.seed(1)
    weighted = True
    test_case(10, 0.2, 500, f=2.5, weighted=weighted)
    test_case(10, 0.3, 500, weighted=weighted)
    test_case(10, 0.8, 10000, weighted)
    test_case(1, 0.3, 100, weighted) # equivalent to Bernoulli distribution
    fit_binom_samples(binom(15, 0.5).rvs(100), weighted=weighted)

原则上，如果设置weighted=True，将获得最佳拟合。然而，问题要求将最小平方误差和（SSE）作为度量;然后，您可以设置weighted=False。
结果表明，除非你有大量的数据，否则很难拟合二项分布。下面是对不同样本数的n，p组合（10，0.2）、（10，0.3）、（10，0.8）和（1，0.3）的真实（随机生成）数据进行的检验。图中还显示了加权SSE如何随n变化。

通常，对于500个样本，您会得到一个肉眼看起来不错的拟合，但它无法正确恢复实际的n和p值，尽管乘积n*p相当准确。在这些情况下，SSE曲线具有较宽的最小值，这表明存在多个合理的拟合。
上述代码可适用于不同的离散分布。在这种情况下，您需要计算出拟合参数的合理初始估计值。例如：Poisson：均值是唯一的参数（使用简化的chi 2或SSE来判断是否是良好的拟合）。
如果要将m输入列的组合拟合为（m+1）维多项式，可以对每个输入列执行二项式拟合，并将拟合结果存储在数组nn和pp（每个数组的形状为(m,)）中。

n_est = int(nn.mean()+0.5)
pp_est = pp*nn/n_est
pp_est = np.append(pp_est, 1-pp_est.sum())

如果nn数组中的单个值变化很大，或者pp_est的最后一个元素为负，则它可能不是多项式。
您希望比较多个模型的残差;注意，具有更多拟合参数的模型将倾向于产生更低的残差，但这不一定意味着该模型更好。
注：这一答复经过了大幅度修改。

赞(0）回复(0）举报 2023-01-26

9q78igpj2#

distfit library可以帮助您确定最佳拟合分布。如果您将方法设置为离散，则遵循Han-Kwang Nienhuys所述的类似方法。

赞(0）回复(0）举报 2023-01-26

我来回答

scipy 通过SSE确定最佳拟合分布- Python 3.8

2条答案

相关问题

热门标签

最新问答