When computing the sample covariance matrix, am I guaranteed to get a symmetric and positive-definite matrix?
Currently my problem involves a sample of 4600 observation vectors in 24 dimensions.
Tags: sampling, covariance
Asked by Morten
Answers:
For a sample of vectors $x_i = (x_{i1}, \dots, x_{ik})^\top$, with $i = 1, \dots, n$, the sample mean vector is
$$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i,$$
and the sample covariance matrix is
$$Q = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top.$$
For a nonzero vector $y \in \mathbb{R}^k$, we have
$$y^\top Q y = y^\top \left( \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top \right) y = \frac{1}{n} \sum_{i=1}^n y^\top (x_i - \bar{x})(x_i - \bar{x})^\top y = \frac{1}{n} \sum_{i=1}^n \left( (x_i - \bar{x})^\top y \right)^2 \ge 0. \quad (*)$$
Therefore, $Q$ is always positive semidefinite.
The additional condition for $Q$ to be positive definite was given in whuber's comment below. It goes as follows.
Define $z_i = x_i - \bar{x}$, for $i = 1, \dots, n$. For any nonzero $y \in \mathbb{R}^k$, $(*)$ is zero if and only if $z_i^\top y = 0$ for every $i = 1, \dots, n$. Suppose the set $\{z_1, \dots, z_n\}$ spans $\mathbb{R}^k$. Then there are real numbers $\alpha_1, \dots, \alpha_n$ such that $y = \alpha_1 z_1 + \dots + \alpha_n z_n$. But then $y^\top y = \alpha_1 z_1^\top y + \dots + \alpha_n z_n^\top y = 0$, which gives $y = 0$, a contradiction. Hence, if the $z_i$ span $\mathbb{R}^k$, then $Q$ is positive definite. This condition is equivalent to $\operatorname{rank}[z_1 \dots z_n] = k$.
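As a quick numerical check of both claims, here is a NumPy sketch on randomly generated data with the sizes mentioned in the question ($n = 4600$, $k = 24$; the data itself is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data with the sizes from the question:
# n = 4600 observation vectors in k = 24 dimensions.
n, k = 4600, 24
x = rng.standard_normal((n, k))

z = x - x.mean(axis=0)   # centered observations z_i = x_i - x_bar
Q = (z.T @ z) / n        # sample covariance matrix Q (1/n convention)

# Q is symmetric by construction ...
assert np.allclose(Q, Q.T)

# ... and positive semidefinite: no eigenvalue is negative
# (allowing for floating-point round-off).
eigvals = np.linalg.eigvalsh(Q)
assert eigvals.min() > -1e-10

# Here rank [z_1 ... z_n] = k, so Q is in fact positive definite.
assert np.linalg.matrix_rank(z) == k
assert eigvals.min() > 0
```

With continuous data and $n$ much larger than $k$, the rank condition holds almost surely, which is why a 4600-by-24 sample will in practice give a positive-definite $Q$.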
A correct covariance matrix is always symmetric and positive *semi*definite.
The covariance between two variables is defined as $\sigma(x, y) = E[(x - E(x))(y - E(y))]$.
This expression doesn't change if you switch the positions of $x$ and $y$. Hence the matrix has to be symmetric.
It also has to be positive *semi-*definite because:
You can always find a transformation of your variables such that the covariance matrix becomes diagonal. On the diagonal you find the variances of your transformed variables, which are either zero or positive, and it is easy to see that this makes the transformed matrix positive semidefinite. However, since definiteness is invariant under a change of coordinates, it follows that the covariance matrix is positive semidefinite in any chosen coordinate system.
When you estimate your covariance matrix (that is, when you calculate your sample covariance) with the formula stated above, it will obviously still be symmetric. It also has to be positive semidefinite (I think), because for each sample, the pdf that gives each sample point equal probability has the sample covariance as its covariance (somebody please verify this), so everything stated above still applies.
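The parenthetical claim can be verified numerically: the discrete distribution that puts probability $1/n$ on each sample point has exactly the ($1/n$-convention) sample covariance as its covariance. A small NumPy sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal((50, 4))   # made-up sample: 50 points, 4 variables
n = len(x)

# Discrete distribution with probability 1/n on each sample point.
p = np.full(n, 1.0 / n)
mean = p @ x                       # E[X] under that distribution = sample mean
z = x - mean

# Covariance of that distribution: sum_i p_i (x_i - mean)(x_i - mean)^T.
cov_discrete = (z * p[:, None]).T @ z

# It coincides with the sample covariance under the 1/n convention.
Q = (z.T @ z) / n
assert np.allclose(cov_discrete, Q)
```

(Note that `np.cov` uses the $1/(n-1)$ convention by default, so it would differ from this by a factor of $n/(n-1)$.)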
Variance-covariance matrices are always symmetric, as can be proven from the equation used to calculate each term of the matrix.
Also, variance-covariance matrices are always square matrices of size $n \times n$, where $n$ is the number of variables in your experiment.
Eigenvectors of symmetric matrices are always orthogonal.
With PCA, you determine the eigenvalues of the matrix to see if you could reduce the number of variables used in your experiment.
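A minimal NumPy sketch of these facts, with made-up data in which one variable is (almost) redundant:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: 200 observations of 3 variables, where the third is
# nearly a linear combination of the first two.
a = rng.standard_normal((200, 2))
third = a @ np.array([0.5, -0.3]) + 0.01 * rng.standard_normal(200)
data = np.column_stack([a, third])

cov = np.cov(data, rowvar=False)        # 3 x 3 variance-covariance matrix
assert cov.shape == (3, 3)              # square, size = number of variables
assert np.allclose(cov, cov.T)          # symmetric

# eigh is for symmetric matrices; eigenvalues come back in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)

# The eigenvectors are orthogonal (here: orthonormal columns).
assert np.allclose(eigvecs.T @ eigvecs, np.eye(3), atol=1e-10)

# The smallest eigenvalue is tiny, so PCA would suggest dropping one
# direction with almost no loss of variance.
assert eigvals[0] < 0.01 < eigvals[-1]
```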
I would add to Zen's nice argument the following, which explains why we often say that the covariance matrix is positive definite if $n - 1 \ge k$.
If $x_1, x_2, \dots, x_n$ are a random sample from a continuous probability distribution, then $x_1, x_2, \dots, x_n$ are almost surely (in the probability-theory sense) linearly independent.
Now, $z_1, z_2, \dots, z_n$ are not linearly independent, because $\sum_{i=1}^n z_i = 0$; but since $x_1, x_2, \dots, x_n$ are a.s. linearly independent, $z_1, z_2, \dots, z_n$ a.s. span a subspace of dimension $\min(n - 1, k)$. If $n - 1 \ge k$, they therefore span $\mathbb{R}^k$.
To conclude: if $x_1, x_2, \dots, x_n$ are a random sample from a continuous probability distribution and $n - 1 \ge k$, the covariance matrix is positive definite.
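A numerical illustration of the two regimes (keeping $k = 24$ as in the question; the sampling routine below is made up):

```python
import numpy as np

rng = np.random.default_rng(2)
k = 24  # number of dimensions, as in the question

def sample_cov(n):
    """Sample covariance (1/n convention) of n continuous random draws."""
    x = rng.standard_normal((n, k))
    z = x - x.mean(axis=0)
    return (z.T @ z) / n

# n - 1 < k: the z_i can span at most an (n - 1)-dimensional subspace,
# so the covariance matrix is singular (only positive semidefinite).
Q_small = sample_cov(10)
assert np.linalg.matrix_rank(Q_small) == 9   # rank n - 1 = 9 < k

# n - 1 >= k: the z_i almost surely span R^k, so Q is positive definite.
Q_big = sample_cov(100)
assert np.linalg.eigvalsh(Q_big).min() > 0
```

The questioner's 4600-by-24 sample is in the second regime by a wide margin.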
For those with a non-mathematical background like me, who don't quickly grasp abstract mathematical formulae, there is a worked-out Excel example of the most upvoted answer. The covariance matrix can also be derived in other ways.