import numpy as np
import scipy.stats


def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n - 1)
    return m, m - h, m + h
# Variant: returns (mean, half-width) instead of (mean, lower, upper).
import logging

import numpy as np
import scipy.stats


def mean_confidence_interval(data, confidence: float = 0.95) -> tuple[float, np.ndarray]:
    """
    Returns the mean and the confidence-interval half-width for the given data.
    Data is any iterable that np.array can consume.
    ref:
        - https://stackoverflow.com/a/15034143/1601580
        - https://github.com/WangYueFt/rfs/blob/f8c837ba93c62dd0ac68a2f4019c619aa86b8421/eval/meta_eval.py#L19
    """
    a: np.ndarray = 1.0 * np.array(data)
    n: int = len(a)
    if n == 1:
        logging.warning('The first dimension of your data is 1; perhaps you meant to transpose your data '
                        'or remove the singleton dimension?')
    m, se = a.mean(), scipy.stats.sem(a)
    tp = scipy.stats.t.ppf((1 + confidence) / 2., n - 1)
    h = se * tp
    return m, h
def ci_test_float():
    # - one WRONG data set of size 1 by N
    data = np.random.randn(1, 30)  # gives nan CIs, because len() sets n=1, so not this shape!
    m, ci = mean_confidence_interval(data)
    print('-- you should get a mean and a list of nan CIs (since the data is in the wrong format, it thinks it is '
          '30 data sets of length 1).')
    print(m, ci)
    # right data as N by 1
    data = np.random.randn(30, 1)
    m, ci = mean_confidence_interval(data)
    print('-- gives a mean and a list of length 1 for a single CI (since it thinks you have a single data set)')
    print(m, ci)
    # multiple data sets (7) of size N (=30)
    data = np.random.randn(30, 7)
    print('-- gives 7 CIs for the 7 data sets of length 30. 30 is the number you want to be large if you were '
          'using z(p), due to the CLT.')
    m, ci = mean_confidence_interval(data)
    print(m, ci)


ci_test_float()
Output:
-- you should get a mean and a list of nan CIs (since the data is in the wrong format, it thinks it is 30 data sets of length 1).
0.1431623130952463 [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan]
-- gives a mean and a list of length 1 for a single CI (since it thinks you have a single data set)
0.04947206018132864 [0.40627264]
-- gives 7 CIs for the 7 data sets of length 30. 30 is the number you want to be large if you were using z(p), due to the CLT.
-0.03585104402718902 [0.31867309 0.35619134 0.34860011 0.3812853  0.44334033 0.35841138
 0.40739732]
I think Num_samples by Num_datasets is correct, but if it is not, let me know in the comments section.
"""
Review for confidence intervals. Confidence intervals say that the true mean is inside the estimated confidence interval
(the r.v. the user generates). In particular it says:
Pr[mu^* \in [mu_n +- t.val(p) * std_n / sqrt(n) ] ] >= p
e.g. p = 0.95
This does not say that for a specific CI you compute the true mean is in that interval with prob 0.95. Instead it means
that if you surveyed/sampled 100 data sets D_n = {x_i}^n_{i=1} of size n (where n is ideally >=30) then for 95 of those
you'd expect to have the truee mean inside the CI compute for that current data set. Note you can never check for which
ones mu^* is in the CI since mu^* is unknown. If you knew mu^* you wouldn't need to estimate it. This analysis assumes
that the the estimator/value your estimating is the true mean using the sample mean (estimator). Since it usually uses
the t.val or z.val (second for the standardozed r.v. of a normal) then it means the approximation that mu_n ~ gaussian
must hold. This is most likely true if n >= 0. Note this is similar to statistical learning theory where we use
the MLE/ERM estimator to choose a function with delta, gamma etc reasoning. Note that if you do algebra you can also
say that the sample mean is in that interval but wrt mu^* but that is borning, no one cares since you do not know mu^*
so it's not helpful.
An example use could be for computing the CI of the loss (e.g. 0-1, CE loss, etc). The mu^* you want is the expected
risk. So x_i = loss(f(x_i), y_i) and you are computing the CI for what is the true expected risk for that specific loss
function you choose. So mu_n = emperical mean of the loss and std_n = (unbiased) estimate of the std and then you can
simply plug in the values.
Assumptions for p-CI:
- we are making a statement that mu^* is in mu+-pCI = mu+-t_p * sig_n / sqrt n, sig_n ~ Var[x] is inside the CI
p% of the time.
- we are estimating mu^, a mean
- since the quantity of interest is mu^, then the z_p value (or p-value, depending which one is the unknown), is
computed using the normal distribution.
- p(mu) ~ N(mu; mu_n, sig_n/ sqrt n), vial CTL which holds for sample means. Ideally n >= 30.
- x ~ p^*(x) are iid.
Std_n vs t_p*std_n/ sqrt(n)
- std_n = var(x) is more pessimistic but holds always. Never shrinks as n->infity
- but if n is small then pCI might be too small and your "lying to yourself". So if you have very small data
perhaps doing std_n for the CI is better. That holds with prob 99.9%. Hopefuly std is not too large for your
experiments to be invalidated.
ref:
- https://stats.stackexchange.com/questions/554332/confidence-interval-given-the-population-mean-and-standard-deviation?noredirect=1&lq=1
- https://stackoverflow.com/questions/70356922/what-is-the-proper-way-to-compute-95-confidence-intervals-with-pytorch-for-clas
- https://www.youtube.com/watch?v=MzvRQFYUEFU&list=PLUl4u3cNGP60hI9ATjSFgLZpbNJ7myAg6&index=205
"""
from scipy.stats import norm

# Known-variance CI via the normal distribution.
# Assumes sample_mean, sigma (the known population std), and n (the sample size) are already defined.
alpha = 0.95
# Define our z
ci = alpha + (1 - alpha) / 2
# Lower and upper interval bounds
c_lb = sample_mean - norm.ppf(ci) * (sigma / (n ** 0.5))
c_ub = sample_mean + norm.ppf(ci) * (sigma / (n ** 0.5))
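For illustration, with hypothetical values sample_mean = 5.0, sigma = 2.0, and n = 30 defined beforehand, norm.ppf(0.975) is about 1.96, giving c_lb of about 4.28 and c_ub of about 5.72.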
6 Answers
Answer 1
You can calculate it like this:
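A sketch of the kind of call this answer describes, using the first mean_confidence_interval function from the top of the page (the one returning (mean, lower, upper); the data values are made up):

data = [2.1, 2.5, 1.9, 2.3, 2.7, 2.2]   # hypothetical measurements
m, lo, hi = mean_confidence_interval(data)
print(m, lo, hi)                         # mean, lower bound, upper bound of the 95% CI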
Answer 2
Here is a shortened version of shasan's code, calculating the 95% confidence interval of the mean of array a. But using StatsModels' tconfint_mean is arguably even nicer. The underlying assumption for both is that the sample (array a) was drawn independently from a normal distribution with unknown standard deviation (see MathWorld or Wikipedia). For large sample size n, the sample mean is normally distributed, and one can calculate its confidence interval using st.norm.interval() (as suggested in Jaime's comment). But the solutions above are also correct for small n, where st.norm.interval() gives confidence intervals that are too narrow (i.e., "fake confidence"). See my answer to a similar question for more details (and one of Russ's comments). Below is an example where the correct options give (essentially) the same confidence intervals:
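A sketch of the comparison this answer describes (the sample a is randomly generated here):

import numpy as np
import scipy.stats as st
import statsmodels.stats.api as sms

a = np.random.randn(20)   # made-up small sample

# Shortened scipy version: 95% CI of the mean of array a.
print(st.t.interval(0.95, len(a) - 1, loc=np.mean(a), scale=st.sem(a)))

# The StatsModels alternative.
print(sms.DescrStatsW(a).tconfint_mean())

# For comparison: the normal approximation, which is too narrow for small n.
print(st.norm.interval(0.95, loc=np.mean(a), scale=st.sem(a)))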
Finally, using st.norm.interval() gives an incorrect (too narrow) result for small n, as the comparison line in the sketch above shows.

Answer 3
Starting with Python 3.8, the standard library provides the NormalDist object as part of the statistics module. The approach:
- creates a NormalDist object from the data sample (NormalDist.from_samples(data)), which gives access to the sample's mean and standard deviation via NormalDist.mean and NormalDist.stdev;
- computes the Z-score for the given confidence, based on the standard normal distribution (represented by NormalDist()), using the inverse of the cumulative distribution function (inv_cdf).

This assumes the sample size is big enough (say, more than ~100 points) to use the standard normal distribution rather than the Student's t distribution to compute the z value.
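A sketch matching this description, using only the standard library (the function name confidence_interval is an assumption):

from statistics import NormalDist

def confidence_interval(data, confidence=0.95):
    dist = NormalDist.from_samples(data)            # sample mean and (sample) stdev
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # two-sided z-score for the given confidence
    h = dist.stdev * z / (len(data) ** 0.5)         # half-width of the interval
    return dist.mean - h, dist.mean + h

print(confidence_interval([2.1, 2.5, 1.9, 2.3, 2.7, 2.2]))  # made-up data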
Answer 4

First look up the z-value for your desired confidence interval in a look-up table. The confidence interval is then mean +/- z*sigma, where sigma is the estimated standard deviation of the sample mean, given by sigma = s / sqrt(n), where s is the standard deviation computed from the sample data and n is the sample size.
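A minimal sketch of this recipe (the data is randomly generated, and z = 1.96 is the table value for a 95% confidence interval):

import numpy as np

data = np.random.randn(100)                     # made-up sample
z = 1.96                                        # from a z-table, for 95% confidence
sigma = data.std(ddof=1) / np.sqrt(len(data))   # estimated std of the sample mean
print(data.mean() - z * sigma, data.mean() + z * sigma)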
Answer 5

Same as the original answer, but with some concrete examples added; the mean_confidence_interval variant, its test, and the output appear to be the ones reproduced at the top of this page.
Which type of data does it work for? I think it can be used for any data, for the following reason: the mean and standard deviation are computed for general numeric data, and the z_p/t_p value only depends on the confidence level and the data size, so it is independent of assumptions about the data's distribution. Therefore, I believe it can be used for regression and classification.
As a bonus, a torch implementation that almost only uses torch:
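A sketch of such an implementation (the helper name torch_confidence_interval, and using scipy only for the t-value, are assumptions):

import torch
from scipy import stats

def torch_confidence_interval(data: torch.Tensor, confidence: float = 0.95):
    # Mean and CI half-width of a 1D tensor; only the t-value comes from scipy.
    n = data.numel()
    mean = data.mean()
    se = data.std(unbiased=True) / (n ** 0.5)   # standard error of the mean
    t_p = float(stats.t.ppf((1 + confidence) / 2., n - 1))
    return mean, se * t_p

m, h = torch_confidence_interval(torch.randn(30))  # made-up data
print(m, h)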
Some notes on CIs, which appear in the long review docstring near the top of this page (or see https://stats.stackexchange.com/questions/554332/confidence-interval-given-the-population-mean-and-standard-deviation?noredirect=1&lq=1).
Answer 6
Regarding Ulrich's answer: it uses the t-value. We use the t-value when the true variance is unknown, i.e., when the only data you have is the sample data.

As for bogatron's answer, that involves z-tables. The z-tables are used when the variance is already known and provided; you still have sample data, but sigma is not the estimated standard deviation of the sample mean, it is already known.

Let's say the variance is known and we want 95% confidence; this appears to be what the norm.ppf snippet near the top of the page computes.

With only sample data and an unknown variance (meaning the variance must be computed solely from the sample data), Ulrich's answer works perfectly. However, you may want to designate the confidence level. If your data is a and you want a confidence interval of 0.95:
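A sketch of the statsmodels call this describes (alpha is the significance level, so alpha=0.05 yields a 0.95 confidence interval):

import numpy as np
import statsmodels.stats.api as sms

a = np.random.randn(50)  # made-up sample data
print(sms.DescrStatsW(a).tconfint_mean(alpha=0.05))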