How do I calculate the pooled variance of two or more groups, given the known group variances, means, and sample sizes?

Say there are $m+n$ elements split into two groups (of sizes $m$ and $n$). The variance of the first group is $\sigma_m^2$ and the variance of the second group is $\sigma^2_n$. The elements themselves are assumed to be unknown, but I know the means $\mu_m$ and $\mu_n$.

Is there a way to calculate the combined variance $\sigma^2_{(m+n)}$?

The variance doesn't have to be unbiased, so the denominator is $(m+n)$ and not $(m+n-1)$.

variance pooling user1809989

When you say you know the means and variances of these groups, are they parameters or sample values? If they are sample means/variances, you should not use $\mu$ and $\sigma$ ...

Jonathan Christensen

I just used the symbols as a representation. Otherwise it would have been hard to explain my problem.

user1809989

For sample values we usually use Latin letters (e.g. $m$ and $s$). Greek letters are usually reserved for parameters. Using the "correct" (expected) symbols will help you communicate more clearly.

Jonathan Christensen

Don't worry, I'll follow that from now on! Cheers

user1809989

@Jonathan Because this is not a question about samples or estimation, one can legitimately take the view that $\mu$ and $\sigma^2$ are the true mean and variance of the empirical distribution of a batch of data, which justifies the conventional use of Greek letters rather than Latin letters to refer to them.

whuber

Answers:

Use the definitions of the mean

$\mu_{1:n} = \frac{1}{n}\sum_{i=1}^n x_i$

and the sample variance

$\sigma_{1:n}^2 = \frac{1}{n}\sum_{i=1}^n \left(x_i - \mu_{1:n}\right)^2 = \frac{n-1}{n}\left(\frac{1}{n-1}\sum_{i=1}^n \left(x_i - \mu_{1:n}\right)^2\right)$

(the last term in parentheses is the unbiased variance estimator often computed by default in statistical software) to find the sum of squares of all the data $x_i$ . Let's order the indexes $i$ so that $i=1,\ldots,n$ designates elements of the first group and $i=n+1,\ldots,n+m$ designates elements of the second group. Break that sum of squares by group and re-express the two pieces in terms of the variances and means of the subsets of the data:

$\eqalign{ (m+n)(\sigma^2_{1:m+n} + \mu_{1:m+n}^2) &= \sum_{i=1}^{n+m} x_i^2 \\ &= \sum_{i=1}^n x_i^2 + \sum_{i=n+1}^{n+m} x_i^2 \\ &= n(\sigma^2_{1:n} + \mu_{1:n}^2) + m(\sigma^2_{1+n:m+n} + \mu_{1+n:m+n}^2). }$

Algebraically solving this for $\sigma^2_{1:m+n}$ in terms of the other (known) quantities yields

$\sigma^2_{1:m+n} = \frac{n(\sigma^2_{1:n} + \mu_{1:n}^2) + m(\sigma^2_{1+n:m+n} + \mu_{1+n:m+n}^2)}{m+n} - \mu^2_{1:m+n}.$

Of course, using the same approach, $\mu_{1:m+n} = (n\mu_{1:n} + m\mu_{1+n:m+n})/(m+n)$ can be expressed in terms of the group means, too.

An anonymous contributor points out that when the sample means are equal (so that $\mu_{1:n}=\mu_{1+n:m+n}=\mu_{1:m+n}$ ), the solution for $\sigma^2_{1:m+n}$ is a weighted mean of the group sample variances, with weights proportional to the group sizes.

whuber
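The two-group formula above can be sketched in R as follows. This is a minimal illustration, not part of the original answer: the function names `pool_two` and `biased_var` and the data vectors are hypothetical, and the variances passed in are assumed to be the biased (divide-by-size) versions used throughout that answer.

```r
# Hypothetical helper: combine two groups from their sizes, means and
# biased (divide-by-size) variances, following the formula above.
pool_two <- function(n, mean1, var1, m, mean2, var2) {
  # Combined mean is the size-weighted mean of the group means.
  mean_all <- (n * mean1 + m * mean2) / (n + m)
  # Total sum of squares divided by (n + m), minus the squared combined mean.
  var_all <- (n * (var1 + mean1^2) + m * (var2 + mean2^2)) / (n + m) - mean_all^2
  list(mean = mean_all, var = var_all)
}

# Biased variance of a raw vector, used here only to check the result.
biased_var <- function(x) mean((x - mean(x))^2)

# Made-up data for illustration.
g1 <- c(2, 4, 4, 5)
g2 <- c(1, 7, 9)

pooled <- pool_two(length(g1), mean(g1), biased_var(g1),
                   length(g2), mean(g2), biased_var(g2))
direct <- biased_var(c(g1, g2))

# The two numbers should agree up to floating-point error.
print(c(pooled = pooled$var, direct = direct))
```

If the group variances come from `var()` in R, they would first need to be multiplied by `(size - 1) / size` to match the divide-by-size convention assumed here.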

The "homework" tag doesn't mean the question is elementary or stupid: it's used for self-study questions that can even include research-level queries. It distinguishes routine, more or less context-free questions (of the sort that might ordinarily grace the math forum) from specific applied questions.

whuber

I cannot understand your first step: $n(\sigma^2+\mu^2) = \sum (x - \mu)^2 + n\mu^2 \stackrel{?}{=} \sum x^2$. In particular, I get $\sum [(x-\mu)^2+\mu^2] = \sum [x^2-2x\mu]$, which requires $\mu = 0$. Am I missing something? Could you please explain this?

DarioP

@Dario $\sum(x-\mu)^2+n\mu^2=(\sum x^2 - 2\mu\sum x + n \mu^2)+n\mu^2 = \sum x^2 - 2n\mu^2 + 2n\mu^2 = \sum x^2.$

whuber
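As a quick sanity check of this identity, here is a small worked example with a made-up two-point data set $x = (1, 3)$ (the numbers are purely illustrative and not from the thread):

```latex
\begin{aligned}
n &= 2, \qquad \mu = \tfrac{1}{2}(1 + 3) = 2, \qquad
\sigma^2 = \tfrac{1}{2}\left[(1-2)^2 + (3-2)^2\right] = 1, \\
n(\sigma^2 + \mu^2) &= 2\,(1 + 4) = 10, \qquad
\sum x^2 = 1^2 + 3^2 = 10.
\end{aligned}
```

Both sides equal 10, as long as $\sigma^2$ is computed with denominator $n$ rather than $n-1$.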

Oh yes, I made a stupid sign mistake in my derivation. Now it's clear, thanks!

DarioP

I guess this can be extended to an arbitrary number of samples as long as you have the mean and variance for each. Calculating the pooled (biased) standard deviation in R is simply sqrt(weighted.mean(u^2 + rho^2, n) - weighted.mean(u, n)^2), where n, u and rho are equal-length vectors of group sizes, group means and (biased) group standard deviations. E.g. n=c(10, 14, 9) for three samples.

Jonas Lindeløv
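The one-liner in the comment above could be wrapped and checked roughly as follows. This is only a sketch: the names `pool_sd` and `biased_sd` and the three data vectors are hypothetical, and the group standard deviations are assumed to use the divide-by-size convention (not R's default `sd()`).

```r
# Hypothetical helper for k groups; sizes, means and sds are equal-length vectors.
# sds are assumed to be the biased (divide-by-size) standard deviations.
pool_sd <- function(sizes, means, sds) {
  sqrt(weighted.mean(means^2 + sds^2, sizes) - weighted.mean(means, sizes)^2)
}

# Biased standard deviation of a raw vector, used only for the check below.
biased_sd <- function(x) sqrt(mean((x - mean(x))^2))

# Made-up data: three groups of different sizes.
groups <- list(c(2, 4, 4, 5), c(1, 7, 9), c(3, 3, 8, 10, 6))
sizes <- sapply(groups, length)
means <- sapply(groups, mean)
sds   <- sapply(groups, biased_sd)

pooled_sd <- pool_sd(sizes, means, sds)
direct_sd <- biased_sd(unlist(groups))

# The two values should agree up to floating-point error.
print(c(pooled = pooled_sd, direct = direct_sd))
```

Using the group sizes as weights is what makes this the k-group version of the two-group formula: each weighted mean is a sum over all data divided by the total count.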

I'm going to use standard notation for sample means and sample variances in this answer, rather than the notation used in the question. Using standard notation, another formula for the pooled sample variance of two groups can be found in O'Neill (2014) (Result 1):

$s_\text{pooled}^2 = \frac{1}{n_1+n_2-1} \left[ (n_1-1) s_1^2 + (n_2-1) s_2^2 + \frac{n_1 n_2}{n_1+n_2} (\bar{x}_1 - \bar{x}_2)^2 \right].$

This formula works directly with the underlying sample means and sample variances of the two subgroups, and does not require intermediate calculation of the pooled sample mean. (Proof of result in linked paper.)

Reinstate Monica
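A minimal R sketch of the formula quoted above, checked against `var()` on the combined data. The function name `pool_var_unbiased` and the data are hypothetical, chosen only for illustration; here the inputs are the usual unbiased (divide-by-`n - 1`) sample variances, which is what `var()` returns.

```r
# Hypothetical helper implementing the two-group formula quoted above.
# s1_sq and s2_sq are unbiased sample variances (denominator n - 1).
pool_var_unbiased <- function(n1, xbar1, s1_sq, n2, xbar2, s2_sq) {
  within  <- (n1 - 1) * s1_sq + (n2 - 1) * s2_sq          # within-group sums of squares
  between <- n1 * n2 / (n1 + n2) * (xbar1 - xbar2)^2      # between-group term
  (within + between) / (n1 + n2 - 1)
}

# Made-up data for illustration.
x1 <- c(2, 4, 4, 5)
x2 <- c(1, 7, 9)

pooled <- pool_var_unbiased(length(x1), mean(x1), var(x1),
                            length(x2), mean(x2), var(x2))
direct <- var(c(x1, x2))

# The two values should agree up to floating-point error.
print(c(pooled = pooled, direct = direct))
```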


Yes, given the mean, sample count, and variance or standard deviation of each of two or more groups of samples, you can exactly calculate the variance or standard deviation of the combined group.

This web page describes how to do it, and why it works; it also includes source code in Perl: http://www.burtonsys.com/climate/composite_standard_deviations.html

BTW, contrary to the answer given above,

$n(\sigma^2 + \mu^2) \ne \sum_{i=1}^n x_i^2$

See for yourself, e.g., in R:

> x = rnorm(10,5,2)
> x
 [1] 6.515139 8.273285 2.879483 3.624233 6.199610 3.683164 4.921028 8.084591
 [9] 2.974520 6.049962
> mean(x)
[1] 5.320502
> sd(x)
[1] 2.007519
> sum(x**2)
[1] 319.3486
> 10 * (mean(x)**2 + sd(x)**2)
[1] 323.3787

Dave Burton

It's because you forgot the (n-1)/n factor; e.g., try n*(mean(x)**2 + sd(x)**2/n*(n-1)).

user603

user603, what on earth are you talking about?

Dave Burton

Dave, mathematics is a more reliable teacher than software. In this case R's sd() is based on the unbiased variance estimator (denominator n-1) rather than computing the standard deviation of the set of numbers. For instance, sd(c(-1,1)) returns 1.414214 rather than 1. Your example needs to use sqrt(9/10)*sd(x) in place of sd(x). Interpreting "$\sigma$" as the SD of the data and "$\mu$" as the mean of the data, your BTW remark is wrong. A program demonstrating this is n <- 10; x <- rnorm(n,5,2); m <- mean(x); s <- sd(x) * sqrt((n-1)/n); m2 <- sum(x^2); c(lhs=n * (m^2 + s^2), rhs=m2)

whuber