scipy: Calculate the cumulative distribution function (CDF) in Python

Posted by watbbzwu on 2022-11-09 in Python


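How can I calculate the cumulative distribution function (CDF) in Python?
I want to calculate it from an array of points I have (a discrete distribution), not from the continuous distributions that scipy provides, for example.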

scipy

Source: https://stackoverflow.com/questions/24788200/calculate-the-cumulative-distribution-function-cdf-in-python

6 Answers


bvhaajcl1#

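(It is possible that my interpretation of the question is wrong. If the question is how to get from a discrete PDF to a discrete CDF, then np.cumsum divided by a suitable constant will do if the samples are equispaced. If the array is not equispaced, then np.cumsum of the array multiplied by the distances between the points will do.)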
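Both cases in the parenthetical above can be sketched as follows. This is a minimal illustration, not part of the original answer: the standard normal density and the grids `x_even` and `x_uneven` are assumptions chosen only to have a PDF to integrate.

```python
import numpy as np

def normal_pdf(x):
    # Illustrative density (standard normal), used only as example input.
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# Equispaced samples: cumsum divided by a constant (here the final sum),
# so that the CDF ends at exactly 1.
x_even = np.linspace(-5.0, 5.0, 1001)
cdf_even = np.cumsum(normal_pdf(x_even))
cdf_even /= cdf_even[-1]

# Non-equispaced samples: weight each density value by the distance to the
# previous point (the first point gets width 0), then take the cumsum.
rng = np.random.default_rng(0)
x_uneven = np.sort(rng.uniform(-5.0, 5.0, 2000))
widths = np.diff(x_uneven, prepend=x_uneven[0])
cdf_uneven = np.cumsum(normal_pdf(x_uneven) * widths)
```

In the non-equispaced case the result is a Riemann-sum approximation, so the last value is close to 1 rather than exactly 1.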
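If you have a discrete array of samples and you would like to know the CDF of the sample, you can simply sort the array. In the sorted result, the smallest value represents 0% and the largest value represents 100%. If you want the value at 50% of the distribution, look at the element in the middle of the sorted array.
Let us take a closer look at this with a simple example: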

import matplotlib.pyplot as plt
import numpy as np

# create some randomly distributed data:

data = np.random.randn(10000)

# sort the data:

data_sorted = np.sort(data)

# calculate the proportional values of samples

p = 1. * np.arange(len(data)) / (len(data) - 1)

# plot the sorted data:

fig = plt.figure()
ax1 = fig.add_subplot(121)
ax1.plot(p, data_sorted)
ax1.set_xlabel('$p$')
ax1.set_ylabel('$x$')

ax2 = fig.add_subplot(122)
ax2.plot(data_sorted, p)
ax2.set_xlabel('$x$')
ax2.set_ylabel('$p$')

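This produces two plots. The right-hand plot is the traditional cumulative distribution function. It should reflect the CDF of the process behind the points, but naturally it only approximates it as long as the number of points is finite.

This function is easy to invert, and which form you need depends on your application.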
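As a rough sketch of the two directions mentioned above, linear interpolation over the sorted data gives both the CDF at a value and the value at a probability. The helper names `cdf_at` and `quantile_at` are hypothetical and not from the original answer; np.interp is one possible choice among several.

```python
import numpy as np

rng = np.random.default_rng(0)
data_sorted = np.sort(rng.standard_normal(10000))
p = np.arange(len(data_sorted)) / (len(data_sorted) - 1)

def cdf_at(x):
    # Forward direction: approximate P(X <= x) by interpolating p over the data.
    return np.interp(x, data_sorted, p)

def quantile_at(q):
    # Inverse direction: approximate the value below which a fraction q lies.
    return np.interp(q, p, data_sorted)

median_estimate = quantile_at(0.5)
prob_below_zero = cdf_at(0.0)
```

For normally distributed samples like these, both `median_estimate` and `prob_below_zero - 0.5` should be close to zero, up to sampling noise.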


a7qyws3x2#

Assuming you know how your data is distributed (i.e. you know the pdf of your data), scipy does support discrete data when calculating CDFs:

import numpy as np
import scipy.stats  # import the submodule explicitly so that scipy.stats.norm resolves
import matplotlib.pyplot as plt
import seaborn as sns

x = np.random.randn(10000) # generate samples from normal distribution (discrete data)
norm_cdf = scipy.stats.norm.cdf(x) # calculate the cdf - also discrete

# plot the cdf

sns.lineplot(x=x, y=norm_cdf)
plt.show()

We can even print the first few values of the cdf to show that they are discrete.

print(norm_cdf[:10])
>>> array([0.39216484, 0.09554546, 0.71268696, 0.5007396 , 0.76484329,
       0.37920836, 0.86010018, 0.9191937 , 0.46374527, 0.4576634 ])

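The same call also works on multi-dimensional arrays, where it is applied to each element separately. We use 2D data below to illustrate: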

mu = np.zeros(2) # mean vector
cov = np.array([[1,0.6],[0.6,1]]) # covariance matrix

# generate 2d normally distributed samples using 0 mean and the covariance matrix above

x = np.random.multivariate_normal(mean=mu, cov=cov, size=1000) # 1000 samples
norm_cdf = scipy.stats.norm.cdf(x)
print(norm_cdf.shape)
>>> (1000, 2)
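Because scipy.stats.norm.cdf works elementwise, the (1000, 2) result above holds the standard normal CDF of each coordinate on its own, not the joint probability. If the joint CDF is what is needed, one option is scipy.stats.multivariate_normal.cdf. The sketch below is not part of the original answer, and the evaluation point `point` is chosen only for illustration.

```python
import numpy as np
from scipy import stats

mu = np.zeros(2)                      # mean vector
cov = np.array([[1, 0.6], [0.6, 1]])  # covariance matrix
point = np.array([0.0, 0.0])          # illustrative evaluation point

# Elementwise: P(X1 <= 0) and P(X2 <= 0), each computed on its own.
marginals = stats.norm.cdf(point)

# Joint: P(X1 <= 0 and X2 <= 0) under the correlated 2D normal.
joint = stats.multivariate_normal.cdf(point, mean=mu, cov=cov)
```

With a positive correlation, the joint value at the origin lies between the product of the marginals (0.25) and either marginal (0.5).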

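In the examples above, I knew beforehand that my data was normally distributed, which is why I used scipy.stats.norm(). scipy supports many other distributions as well. But again, you need to know how your data is distributed beforehand to use such functions. If you do not know how your data is distributed and simply pick some distribution to calculate the cdf, you will most likely get incorrect results.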
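One way to see how much a wrong distributional assumption matters is to compare the parametric CDF against the empirical one. The following is only a sketch: the exponential test data and the fitting by sample mean and standard deviation are assumptions made for the illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.sort(rng.exponential(size=5000))  # data that is NOT normally distributed

# Empirical CDF at the sorted sample points.
ecdf = np.arange(1, len(x) + 1) / len(x)

# Largest gap between each candidate CDF and the empirical CDF.
gap_norm = np.max(np.abs(stats.norm.cdf(x, loc=x.mean(), scale=x.std()) - ecdf))
gap_expon = np.max(np.abs(stats.expon.cdf(x, scale=x.mean()) - ecdf))

print(f"normal assumption:      max gap = {gap_norm:.3f}")
print(f"exponential assumption: max gap = {gap_expon:.3f}")
```

The normal assumption should show a clearly larger gap here, since a fitted normal puts noticeable probability below zero while the data has none.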


o0lyfsai3#

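The empirical cumulative distribution function is a CDF that jumps exactly at the values in your data set. It is the CDF of a discrete distribution that places a mass at each of your values, where the mass is proportional to the frequency of the value. Since the masses must sum to 1, these constraints determine the location and height of each jump in the empirical CDF.

Given an array a of values, you compute the empirical CDF by first obtaining the frequencies of the values. The numpy function unique() is helpful here because it returns not only the frequencies but also the values in sorted order. To calculate the cumulative distribution, use the cumsum() function and divide by the total sum. The following function returns the values in sorted order together with the corresponding cumulative distribution: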

import numpy as np

def ecdf(a):
    x, counts = np.unique(a, return_counts=True)
    cusum = np.cumsum(counts)
    return x, cusum / cusum[-1]

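To plot the empirical CDF you can use matplotlib's plot() function. The option drawstyle='steps-post' ensures that the jumps occur at the right place. However, you need to force a jump at the smallest data value, so it is necessary to insert an additional element in front of x and y.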

import matplotlib.pyplot as plt

def plot_ecdf(a):
    x, y = ecdf(a)
    x = np.insert(x, 0, x[0])
    y = np.insert(y, 0, 0.)
    plt.plot(x, y, drawstyle='steps-post')
    plt.grid(True)
    plt.savefig('ecdf.png')

Example usages:

xvec = np.array([7,1,2,2,7,4,4,4,5.5,7])
plot_ecdf(xvec)

import pandas as pd

df = pd.DataFrame({'x':[7,1,2,2,7,4,4,4,5.5,7]})
plot_ecdf(df['x'])

Output: a step plot of the empirical CDF, which plot_ecdf also saves as ecdf.png.
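Besides plotting, the same (x, y) pairs can be used to evaluate the empirical CDF at arbitrary query points. The sketch below repeats the ecdf function from this answer so that it is self-contained; the helper name `ecdf_at` is hypothetical, and np.searchsorted is just one way to do the lookup.

```python
import numpy as np

def ecdf(a):
    x, counts = np.unique(a, return_counts=True)
    cusum = np.cumsum(counts)
    return x, cusum / cusum[-1]

def ecdf_at(a, q):
    # Fraction of values in `a` that are <= q (q may be a scalar or an array).
    x, y = ecdf(a)
    # Number of distinct values <= q; 0 means q lies below the smallest value.
    idx = np.searchsorted(x, q, side="right")
    # Prepend 0 so that idx == 0 maps to a CDF value of 0.
    return np.concatenate(([0.0], y))[idx]

xvec = np.array([7, 1, 2, 2, 7, 4, 4, 4, 5.5, 7])
values = ecdf_at(xvec, [0.5, 3, 4, 10])
```

Using side="right" makes the function right-continuous, matching the 'steps-post' drawing style used in the plot.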


f8rj6qna4#

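Here is an alternative pandas solution for calculating the empirical CDF. It first uses pd.cut to sort the data into evenly spaced bins and then uses cumsum to compute the distribution.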

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def empirical_cdf(s: pd.Series, n_bins: int = 100):
    # Sort the data into `n_bins` evenly spaced bins:
    discretized = pd.cut(s, n_bins)
    # Count the number of datapoints in each bin, keeping the bins in order:
    bin_counts = discretized.value_counts().sort_index()
    # Locate each bin at the midpoint of its start and end:
    bin_counts.index = pd.Index([interval.mid for interval in bin_counts.index], name="loc")
    # Compute the CDF with cumsum, normalized so that the last value is 1:
    return bin_counts.cumsum() / bin_counts.sum()

Here is an example that uses the function to discretize 10000 datapoints into 100 evenly spaced bins:

s = pd.Series(np.random.randn(10000))
cdf = empirical_cdf(s, n_bins=100)
fig, ax = plt.subplots()
ax.scatter(cdf.index, cdf.values)
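A more compact variant of the same binning idea is sketched below, assuming that Series.value_counts with its bins and normalize arguments is acceptable for the use case. It is not the answer's own code, and the result is indexed by bin intervals rather than by bin midpoints.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.standard_normal(10000))

# value_counts(bins=...) cuts the data into evenly spaced bins;
# normalize=True returns relative frequencies instead of raw counts.
rel_freq = s.value_counts(bins=100, normalize=True).sort_index()

# Cumulative sum of the relative frequencies gives the binned CDF.
binned_cdf = rel_freq.cumsum()
```

The trade-off is the same as in the function above: fewer bins give a smoother but coarser curve, more bins approach the exact empirical CDF.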


hpcdzsge5#

To calculate the CDF for an array of discrete numbers:

import numpy as np
pdf, bin_edges = np.histogram(
   data,        # array of data
   bins=500,    # specify the number of bins for distribution function
   density=True # True to return probability density function (pdf) instead of count
   )

cdf = np.cumsum(pdf*np.diff(bin_edges))

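Note that the returned array pdf has length equal to bins (here 500) and bin_edges has length bins+1 (here 501).
So, to calculate the CDF, which is simply the area under the PDF curve, we multiply pdf by the bin widths (np.diff(bin_edges)) and take the cumulative sum of the result with NumPy's cumsum function.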
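A runnable version of the snippet above, with sample data filled in, might look like this. The normally distributed `data` array is an assumption for the sake of the example, as is the name `right_edges`.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(10000)  # illustrative data, not from the answer

pdf, bin_edges = np.histogram(data, bins=500, density=True)

# Area of each bar (density * width), accumulated from left to right.
cdf = np.cumsum(pdf * np.diff(bin_edges))

# cdf[i] is the fraction of samples up to the right edge of bin i,
# so the matching x coordinates are the edges without the first one.
right_edges = bin_edges[1:]
```

Plotting `cdf` against `right_edges` gives the binned CDF, and the last entry of `cdf` should equal 1 up to floating-point rounding.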


ou6hu8tu6#

First:

cdf = get_discrete_cdf(rand_values)

x_p = list(zip(rand_values, cdf))
x_p.sort(key=lambda it: it[0])

x = [it[0] for it in x_p]
y = [it[1] for it in x_p]

_ = plt.plot(x, y)
_ = plt.xlabel("rand_values")
_ = plt.ylabel("prob")

