scipy 通过SSE确定最佳拟合分布- Python 3.8

vptzau2j  于 2023-01-26  发布在  Python
关注(0)|答案(2)|浏览(348)

我正试图提出一种方法来确定以下分布之间的"最佳拟合":Gaussian, Multinomial, Bernoulli.
我有一个很大的pandas df,其中每一列都可以看作是一个数字的分布,我试图做的是,对于每一列,确定上面列表的分布为best fit
我注意到this question问了一些熟悉的问题,但这些看起来都像离散分布测试,而不是连续分布测试。我知道scipy has metrics for a lot of these,但我不能确定如何正确放置输入。我的想法是:
1.对于每列,将数据保存在临时np array
1.生成Gaussian, Multinomial, Bernoulli分布,执行SSE测试以确定提供"最佳拟合"的分布,然后继续下一列。
示例数据集(任意,我的数据集是29888 x 73231)可以是:

| could | couldnt | coupl | cours | death | develop | dialogu | differ | direct | director | done |
|:-----:|:-------:|:-----:|:-----:|:-----:|:-------:|:-------:|:------:|:------:|:--------:|:----:|
|   0   |    0    |   0   |   1   |   0   |    1    |    1    |    0   |    0   |     0    |   0  |
|   0   |    2    |   1   |   0   |   0   |    1    |    0    |    2   |    0   |     0    |   1  |
|   0   |    0    |   0   |   0   |   0   |    0    |    0    |    0   |    1   |     1    |   2  |
|   1   |    0    |   0   |   0   |   0   |    1    |    0    |    1   |    0   |     0    |   0  |
|   0   |    0    |   0   |   0   |   0   |    1    |    1    |    1   |    1   |     0    |   0  |
|   0   |    0    |   0   |   1   |   0   |    0    |    0    |    0   |    0   |     0    |   1  |
|   0   |    0    |   0   |   0   |   2   |    1    |    0    |    1   |    0   |     0    |   2  |
|   0   |    0    |   0   |   0   |   0   |    1    |    0    |    0   |    2   |     0    |   1  |
|   0   |    0    |   0   |   0   |   0   |    2    |    0    |    0   |    0   |     0    |   0  |
|   0   |    0    |   0   |   1   |   0   |    0    |    5    |    0   |    0   |     0    |   3  |
|   1   |    1    |   0   |   0   |   1   |    2    |    0    |    0   |    1   |     0    |   0  |
|   1   |    1    |   0   |   0   |   0   |    4    |    0    |    0   |    1   |     0    |   1  |
|   0   |    0    |   0   |   0   |   1   |    0    |    0    |    0   |    0   |     0    |   0  |
|   0   |    0    |   0   |   0   |   0   |    0    |    1    |    0   |    0   |     0    |   0  |
|   0   |    0    |   0   |   0   |   0   |    1    |    0    |    3   |    0   |     0    |   1  |
|   2   |    0    |   0   |   0   |   0   |    0    |    0    |    0   |    1   |     0    |   2  |
|   0   |    0    |   1   |   0   |   0   |    0    |    0    |    0   |    0   |     0    |   2  |
|   1   |    1    |   0   |   0   |   1   |    0    |    0    |    1   |    1   |     0    |   2  |
|   0   |    0    |   0   |   0   |   0   |    1    |    0    |    0   |    0   |     0    |   1  |
|   0   |    1    |   0   |   3   |   0   |    0    |    0    |    1   |    1   |     0    |   0  |

我现在有一些基本的代码which was edited from this question,它尝试这样做:

import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels as sm
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')

# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
    """Model data by finding best fit distribution to data"""
    # Get histogram of original data
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0

    # Distributions to check
    DISTRIBUTIONS = [        
        st.norm, st.multinomial, st.bernoulli
    ]

    # Best holders
    best_distribution = st.norm
    best_params = (0.0, 1.0)
    best_sse = np.inf

    # Estimate distribution parameters from data
    for distribution in DISTRIBUTIONS:

        # Try to fit the distribution
        try:
            # Ignore warnings from data that can't be fit
            with warnings.catch_warnings():
                warnings.filterwarnings('ignore')

                # fit dist to data
                params = distribution.fit(data)

                # Separate parts of parameters
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]

                # Calculate fitted PDF and error with fit in distribution
                pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                sse = np.sum(np.power(y - pdf, 2.0))

                # if axis pass in add to plot
                try:
                    if ax:
                        pd.Series(pdf, x).plot(ax=ax)
                    end
                except Exception:
                    pass

                # identify if this distribution is better
                if best_sse > sse > 0:
                    best_distribution = distribution
                    best_params = params
                    best_sse = sse

        except Exception:
            print("Error on: {}".format(distribution))
            pass

        #print("Distribution: {} | SSE: {}".format(distribution, sse))

    return best_distribution.name, best_sse

for col in df.columns:
    nm, pm = best_fit_distribution(df[col])
    print(nm)
    print(pm)

然而,我得到:

Error on: <scipy.stats._multivariate.multinomial_gen object at 0x000002E3CCFA9F40>
Error on: <scipy.stats._discrete_distns.bernoulli_gen object at 0x000002E3CCEF4040>
norm
(4.4, 7.002856560004639)

对于每列,我的预期输出如下所示:Gaussian SSE: <val> | Multinomial SSE: <val> | Bernoulli SSE: <val>

    • UPDATE**捕获错误会产生:
Error on: <scipy.stats._multivariate.multinomial_gen object at 0x000002E3CCFA9F40>
'multinomial_gen' object has no attribute 'fit'
Error on: <scipy.stats._discrete_distns.bernoulli_gen object at 0x000002E3CCEF4040>
'bernoulli_gen' object has no attribute 'fit'

为什么会出现错误?我认为是因为multinomialbernoulli没有fit方法。如何创建拟合方法,并将其集成以获得SSE?? The target output of this function or program would be, for a高斯、多项式、伯努利分布,df中每列的平均SSE是多少,对于每种分布类型(尝试并确定列的最佳拟合)。

    • 2015年6月更新**:我加了赏金。
    • 2016年6月更新**:由于这是一个更大应用程序的一部分,因此更大的目的是在一个非常大的 Dataframe 过程中辨别tfidf值的最常见分布。然后,基于此,应用sklearn中与最常见分布匹配的朴素贝叶斯分类器。www.example.com包含不同分类器的详细信息。因此,我需要了解的是,scikit-learn.org/stable/modules/naive_bayes.html有一个未显示的列称为class,它是positivenegative分类。我不是在寻找输入,我只是按照我的领导给我的指示。 is which distribution is the best fit across my entire dataframe, which I assumed to mean, which was the most common amongst the distribution of tfidf values in my words. From there, I will know which type of classifier to apply to my dataframe. In the example above, there is a column not shown called class which is a positive or negative classification. I am not looking for input to this, I am simply following the instructions I have been given by my lead.
von4xj4u

von4xj4u1#

我把这个问题总结为:给定一列非负整数,我们能拟合一个概率分布,特别是高斯分布、多项式分布和伯努利分布,并比较拟合的质量吗?
对于离散量,正确项为probability mass function:P(k)是所选取的数字恰好等于整数值k的概率。伯努利分布可由p参数参数化:Be(k,p),其中0 ≤ p ≤ 1,k只能取0或1,它是binomial distribution B(k,p,n)的一个特例,参数0 ≤ p ≤ 1,整数n〉= 1。(参见维基百科相关文章,了解p和n的含义)它与伯努利分布相关,如Be(k,p)= B(k,p,n=1)。三项分布T(k1,k2,p1,p2,n)由p1,p2,n参数化,并描述对(k1,k2)的概率。例如,集合{(0,0),(0,1),(1,0),(0,1),(0,0)}可以从三项分布中提取。二项式和三项分布是multinomial distributions的特殊情况;如果数据以五元组的形式出现,例如(1,5,5,2,7),则可以从多项式(六项?)分布M6(k1,...,k5,p1,...,p5,n)中提取这些数据。该问题特别要求了解单列数字的概率分布,因此此处唯一适合的多项式分布是二项式分布。除非你指定序列[0,1,5,2,3,1]应该被解释为[(0,1),(5,2),(3,1)]或[(0,1,5),(2,3,1)]。但该问题没有指定数字可以成对或三元组累加。
因此,就离散分布而言,一个整数列表的PMF的形式为P(k),并且只能在n和p值合适的情况下拟合为二项式分布,如果在n=1时获得最佳拟合,则它是伯努利分布。
高斯分布是连续分布G(x,mu,sigma),其中mu(平均值)和sigma(标准差)是参数。它告诉你找到x 0-a/2〈x〈x 0 +a/2的概率等于G(x 0,μ,σ)*a,对于a〈〈σ严格地说,高斯分布不适用于离散变量,因为高斯分布对于非整数x值具有非零概率,而从整数分布中提取非整数的概率为零。通常,您将使用高斯分布作为二项式分布的近似值,这里你设a=1,P(k)= G(x=k,mu,sigma)*a。
对于足够大的n,根据下式,二项式分布和高斯分布看起来相似

B(k, p, n) =  G(x=k, mu=p*n, sigma=sqrt(p*(1-p)*n)).

如果要拟合高斯分布,可以使用标准scipy函数scipy.stats.norm.fit。此类拟合函数不适用于离散分布(如二项分布)。可以使用函数scipy.optimize.curve_fit拟合非整数参数(如二项分布的p参数)。为了找到最佳整数n值,您需要改变n,为每个n调整p,并选择最适合的n, p组合。
在下面的实现中,我从上面的均值和sigma值的关系中估计np,并围绕该值进行搜索。搜索可以做得更聪明,但对于我使用的小测试数据集来说,它已经足够快了。此外,它有助于说明一个观点;我已经提供了一个函数fit_binom,它获取一个带有实际计数的直方图,以及一个函数fit_samples,它可以从 Dataframe 中获取一列数字。

"""Binomial fit routines.

Author: Han-Kwang Nienhuys (2020)
Copying: CC-BY-SA, CC-BY, BSD, GPL, LGPL.
https://stackoverflow.com/a/62365555/6228891 
"""

import numpy as np
from scipy.stats import binom, poisson
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

class BinomPMF:
    """Wrapper so that integer parameters don't occur as function arguments."""
    def __init__(self, n):
        self.n = n
    def __call__(self, ks, p):
        return binom(self.n, p).pmf(ks)

def fit_binom(hist, plot=True, weighted=True, f=1.5, verbose=False):
    """Fit histogram to binomial distribution.
    
    Parameters:

    - hist: histogram as int array with counts, array index as bin.
    - plot: whether to plot
    - weighted: whether to fit assuming Poisson statistics in each bin.
      (Recommended: True).
    - f: try to fit n in range n0/f to n0*f where n0 is the initial estimate.
      Must be >= 1.
    - verbose: whether to print messages.
    
    Return: 
        
    - histf: fitted histogram as int array, same length as hist.
    - n: binomial n value (int)
    - p: binomial p value (float)
    - rchi2: reduced chi-squared. This number should be around 1.
      Large values indicate a bad fit; small values indicate
      "too good to be true" data.
    """ 
   
    hist = np.array(hist, dtype=int).ravel() # force 1D int array
    pmf = hist/hist.sum() # probability mass function
    nk = len(hist)
    if weighted:
        sigmas = np.sqrt(hist+0.25)/hist.sum()
    else:
        sigmas = np.full(nk, 1/np.sqrt(nk*hist.sum()))
    ks = np.arange(nk)
    mean = (pmf*ks).sum()
    variance = ((ks-mean)**2 * pmf).sum()
    
    # initial estimate for p and search range for n
    nest = max(1, int(mean**2 /(mean-variance) + 0.5))
    nmin = max(1, int(np.floor(nest/f)))
    nmax = max(nmin, int(np.ceil(nest*f)))
    nvals = np.arange(nmin, nmax+1)
    num_n = nmax-nmin+1
    verbose and print(f'Initial estimate: n={nest}, p={mean/nest:.3g}')

    # store fit results for each n
    pvals, sses = np.zeros(num_n), np.zeros(num_n)
    for n in nvals:
        # fit and plot
        p_guess = max(0, min(1, mean/n))
        fitparams, _ = curve_fit(
            BinomPMF(n), ks, pmf, p0=p_guess, bounds=[0., 1.],
            sigma=sigmas, absolute_sigma=True)
        p = fitparams[0]
        sse = (((pmf - BinomPMF(n)(ks, p))/sigmas)**2).sum()
        verbose and print(f'  Trying n={n} -> p={p:.3g} (initial: {p_guess:.3g}),'
                          f' sse={sse:.3g}')
        pvals[n-nmin] = p
        sses[n-nmin] = sse
    n_fit = np.argmin(sses) + nmin
    p_fit = pvals[n_fit-nmin]
    sse = sses[n_fit-nmin]    
    chi2r = sse/(nk-2) if nk > 2 else np.nan
    if verbose:
        print(f'  Found n={n_fit}, p={p_fit:.6g} sse={sse:.3g},'
              f' reduced chi^2={chi2r:.3g}')
    histf = BinomPMF(n_fit)(ks, p_fit) * hist.sum()

    if plot:    
        fig, ax = plt.subplots(2, 1, figsize=(4,4))
        ax[0].plot(ks, hist, 'ro', label='input data')
        ax[0].step(ks, histf, 'b', where='mid', label=f'fit: n={n_fit}, p={p_fit:.3f}')
        ax[0].set_xlabel('k')
        ax[0].axhline(0, color='k')
        ax[0].set_ylabel('Counts')
        ax[0].legend()
        
        ax[1].set_xlabel('n')
        ax[1].set_ylabel('sse')
        plotfunc = ax[1].semilogy if sses.max()>20*sses.min()>0 else ax[1].plot
        plotfunc(nvals, sses, 'k-', label='SSE over n scan')
        ax[1].legend()
        fig.show()
        
    return histf, n_fit, p_fit, chi2r

def fit_binom_samples(samples, f=1.5, weighted=True, verbose=False):
    """Convert array of samples (nonnegative ints) to histogram and fit.
    
    See fit_binom() for more explanation.
    """
    
    samples = np.array(samples, dtype=int)
    kmax = samples.max()
    hist, _ = np.histogram(samples, np.arange(kmax+2)-0.5)
    return fit_binom(hist, f=f, weighted=weighted, verbose=verbose) 

def test_case(n, p, nsamp, weighted=True, f=1.5):
    """Run test with n, p values; nsamp=number of samples."""
    
    print(f'TEST CASE: n={n}, p={p}, nsamp={nsamp}')
    ks = np.arange(n+1) # bins
    pmf = BinomPMF(n)(ks, p)
    hist = poisson.rvs(pmf*nsamp)
    fit_binom(hist, weighted=weighted, f=f, verbose=True)

if __name__ == '__main__':
    plt.close('all')
    np.random.seed(1)
    weighted = True
    test_case(10, 0.2, 500, f=2.5, weighted=weighted)
    test_case(10, 0.3, 500, weighted=weighted)
    test_case(10, 0.8, 10000, weighted)
    test_case(1, 0.3, 100, weighted) # equivalent to Bernoulli distribution
    fit_binom_samples(binom(15, 0.5).rvs(100), weighted=weighted)

原则上,如果设置weighted=True,将获得最佳拟合。然而,问题要求将最小平方误差和(SSE)作为度量;然后,您可以设置weighted=False
结果表明,除非你有大量的数据,否则很难拟合二项分布。下面是对不同样本数的n,p组合(10,0.2)、(10,0.3)、(10,0.8)和(1,0.3)的真实(随机生成)数据进行的检验。图中还显示了加权SSE如何随n变化。

通常,对于500个样本,您会得到一个肉眼看起来不错的拟合,但它无法正确恢复实际的np值,尽管乘积n*p相当准确。在这些情况下,SSE曲线具有较宽的最小值,这表明存在多个合理的拟合。
上述代码可适用于不同的离散分布。在这种情况下,您需要计算出拟合参数的合理初始估计值。例如:Poisson:均值是唯一的参数(使用简化的chi 2或SSE来判断是否是良好的拟合)。
如果要将m输入列的组合拟合为(m+1)维多项式,可以对每个输入列执行二项式拟合,并将拟合结果存储在数组nnpp(每个数组的形状为(m,))中。

n_est = int(nn.mean()+0.5)
pp_est = pp*nn/n_est
pp_est = np.append(pp_est, 1-pp_est.sum())

如果nn数组中的单个值变化很大,或者pp_est的最后一个元素为负,则它可能不是多项式。
您希望比较多个模型的残差;注意,具有更多拟合参数的模型将倾向于产生更低的残差,但这不一定意味着该模型更好。
注:这一答复经过了大幅度修改。

9q78igpj

9q78igpj2#

distfit library可以帮助您确定最佳拟合分布。如果您将方法设置为离散,则遵循Han-Kwang Nienhuys所述的类似方法。

相关问题