Chapter 5 Principal Components Analysis (PCA)
2011-10-16 2
Presentation Outline
- What is PCA?
- Geometrical approach to PCA
- Analytical approach to PCA
- Properties of PCA
- How to determine the number of PCs?
- How to interpret the PCs?
- Use of PC scores
5.1 Reasons for using principal components analysis
- Too many variables
- Stone analyzed U.S. data for 1929-1938 on 17 variables describing
  income and expenditure. Using principal components analysis, he
  obtained three new variables: F1, total income; F2, the rate of
  change of total income; and F3, the economic trend (growth or
  decline). These new variables can be closely matched by three
  variables that can be measured directly: I, ΔI and t.
Correlations among F1, F2, F3, I, ΔI and t (lower triangle):

        F1       F2       F3       I        ΔI       t
F1      1
F2      0        1
F3      0        0        1
I       0.995   -0.041    0.057    1
ΔI     -0.056    0.948   -0.124   -0.102    1
t      -0.369   -0.282   -0.836   -0.414   -0.112    1
- Solutions
  - Eliminate some redundant variables.
    - May lose important information that was uniquely reflected in the
      eliminated variables.
  - Create composite scores from variables (sum or average).
    - Loses variability among the variables.
    - Multiple scale scores may still be collinear.
  - Create weighted linear combinations of variables while retaining
    most of the variability in the data.
    - Fewer variables; little or no lost variation.
    - No collinear scales.
- An easy choice
  To retain most of the information in the data while reducing the
  number of variables you must deal with, try principal
  components analysis.
  - Most of the variability in the original data can be retained,
    but...
  - Components may not be directly interpretable.
- What is PCA?
  - PCA is a technique for forming new variables that are
    linear composites of the original variables. The new
    variables are called principal components (PRINs).
  - The maximum number of PRINs that can be formed is
    equal to the number of original variables. Usually the
    first few PRINs represent most of the information in the
    original variables and can replace them, achieving data
    reduction, which is the main objective of PCA.
  - The PRINs are uncorrelated with one another and can
    be used as predictors in regression.
- Principal components analysis (PCA)
  - is a dimension reduction method that creates variables
    called principal components
  - creates as many components as there are input variables.
- Principal components
  - are weighted linear combinations of the input variables
  - are orthogonal to, and uncorrelated with, the other
    components
  - are generated so that the first component accounts for
    the most variation in the x's, followed by the second
    component, and so on.
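The properties listed above can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using randomly generated data (the data, the helper name `pca_cov` and the mixing matrix `A` are illustrative assumptions, not part of the slides). It forms the components from the eigenvectors of the sample covariance matrix and confirms that there are as many components as variables, that the scores are uncorrelated, and that their variances decrease.

```python
import numpy as np

def pca_cov(X):
    """Eigenvalues (descending) and weight vectors (as columns) of the
    sample covariance matrix of X (rows = observations)."""
    Xc = X - X.mean(axis=0)                 # mean-correct the data
    S = np.cov(Xc, rowvar=False)            # p x p sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # reorder to descending
    return eigvals[order], eigvecs[:, order]

# Hypothetical data: 200 observations of 3 correlated variables.
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.5, 0.3]])
X = rng.normal(size=(200, 3)) @ A.T

eigvals, W = pca_cov(X)
scores = (X - X.mean(axis=0)) @ W           # one column of scores per component
score_cov = np.cov(scores, rowvar=False)    # diagonal if scores are uncorrelated

print("component variances:", np.round(eigvals, 3))
print("covariance of scores:\n", np.round(score_cov, 3))
```

The off-diagonal entries of `score_cov` are zero up to rounding error, and its diagonal reproduces the eigenvalues in decreasing order.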
[Figures: translating and rotating the coordinate axes. Scatter of data points in the (x1, x2) plane, with the new axes F1 and F2.]
1. Data screening: locating possible outliers in the data.
2. Reducing the number of variables and using the new variables in
   clustering, discriminant analysis and regression.
3. PCA can help determine whether multicollinearity
   occurs among the predictor variables.
5.2 Objectives of PCA
1. Reduce the dimensionality of the data set
   while losing as little information as possible. The smaller
   number of variables can then be used in subsequent
   analyses.
2. Identify new, meaningful underlying
   variables.
The new variables are useful for a variety of
things, including data screening, assumption
checking and cluster verification.
5.3 PCA on the variance-covariance matrix
- Analytical approach to PCA
- Assume that there are p variables X_1, ..., X_p. We are interested
  in forming the following p principal components:

  $$PRIN_i = w_{i1}X_1 + w_{i2}X_2 + \cdots + w_{ip}X_p, \qquad i = 1, 2, \ldots, p$$

  where w_{ij} is the weight of the jth variable for the ith
  principal component.
- The weights w_{ij} are estimated such that:
  1. The first principal component, PRIN_1, accounts
     for the maximum variance in the data; the second
     principal component, PRIN_2, accounts for the
     maximum variance that has not been accounted for
     by the first principal component; and so on.
  2. $w_{i1}^2 + w_{i2}^2 + \cdots + w_{ip}^2 = 1, \quad i = 1, 2, \ldots, p$
  3. $\operatorname{Cov}(PRIN_i, PRIN_j) = 0, \quad i \neq j;\ i, j = 1, 2, \ldots, p$
  4. $\operatorname{Var}(PRIN_1) \geq \operatorname{Var}(PRIN_2) \geq \cdots \geq \operatorname{Var}(PRIN_p)$
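A short derivation may help show why these conditions lead to an eigenvalue problem. It is the standard Lagrange-multiplier argument and is not spelled out on the slides; the symbols Σ (covariance matrix of the X's), w_1 (column vector of weights for PRIN_1) and λ (Lagrange multiplier) are notation assumed here.

```latex
\begin{aligned}
&\max_{w_1}\ \operatorname{Var}(PRIN_1) = w_1^{\top}\Sigma\, w_1
   \quad \text{subject to } w_1^{\top}w_1 = 1,\\
&L(w_1,\lambda) = w_1^{\top}\Sigma\, w_1 - \lambda\,(w_1^{\top}w_1 - 1),\\
&\frac{\partial L}{\partial w_1} = 2\,\Sigma\, w_1 - 2\,\lambda\, w_1 = 0
   \;\Longrightarrow\; \Sigma\, w_1 = \lambda\, w_1,\\
&\operatorname{Var}(PRIN_1) = w_1^{\top}\Sigma\, w_1 = \lambda\, w_1^{\top}w_1 = \lambda .
\end{aligned}
```

So w_1 is an eigenvector of Σ, and the variance is largest when λ is the largest eigenvalue. Repeating the argument with the extra requirement of zero covariance with the earlier components gives the remaining eigenvectors in order of decreasing eigenvalue.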
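- Properties of principal components
  1. $\operatorname{Var}(PRIN_i) = \lambda_i$, the ith largest eigenvalue of the covariance (or correlation) matrix.
  2. $\sum_{i=1}^{p}\operatorname{Var}(PRIN_i) = \sum_{i=1}^{p}\lambda_i = \sum_{j=1}^{p}\operatorname{Var}(X_j)$,
     which is equal to p if the X's have been standardized.
  3. The proportion of the total variance explained by the first k
     PRINs is $(\lambda_1 + \cdots + \lambda_k)/(\lambda_1 + \cdots + \lambda_p)$.
     If this proportion is close to one, then the first k PRINs can
     replace the original p variables without much loss of information.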
  4. $\operatorname{Corr}(PRIN_i, X_j) = w_{ij}\sqrt{\lambda_i}$ for standardized data.
     It is called a loading and plays a big role in interpreting the
     meaning of PRIN_i. If the data are only mean-corrected,
     replace λ_i in this formula by λ_i/Var(X_j).
  5. The R-square of (PRIN_i, X_j) is $w_{ij}^2\lambda_i$. This can be
     interpreted in two ways:
     - as the proportion of variance of X_j explained by PRIN_i, or
     - as the contribution or importance of X_j to PRIN_i.
     If the data are only mean-corrected, replace λ_i in this formula
     by λ_i/Var(X_j).
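The properties of the components can be verified on a small example. This is a sketch under stated assumptions: the data are randomly generated, NumPy is assumed to be available, and the array layout `W[j, i]` (variable j, component i) is a choice made here. For standardized data it checks that the component variances equal the eigenvalues, that they sum to p, and that weight times the square root of the eigenvalue reproduces the correlation between each variable and each component.

```python
import numpy as np

# Hypothetical data: 300 observations of 3 correlated variables.
rng = np.random.default_rng(1)
mix = np.array([[1.0, 0.8, 0.3],
                [0.0, 1.0, 0.5],
                [0.0, 0.0, 1.0]])
X = rng.normal(size=(300, 3)) @ mix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized data

R = np.corrcoef(Z, rowvar=False)                   # correlation matrix
eigvals, W = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]                  # descending eigenvalues
eigvals, W = eigvals[order], W[:, order]           # W[j, i]: weight of X_j in PRIN_i

scores = Z @ W

# Loading of X_j on PRIN_i: weight times sqrt(eigenvalue).
loadings = W * np.sqrt(eigvals)

# The same quantity computed directly as a correlation, for comparison.
direct = np.array([[np.corrcoef(Z[:, j], scores[:, i])[0, 1]
                    for i in range(3)] for j in range(3)])

# Squared loadings: share of Var(X_j) explained by PRIN_i.
r_squared = loadings ** 2

print("eigenvalues:", np.round(eigvals, 3), "sum =", round(eigvals.sum(), 3))
print("loadings:\n", np.round(loadings, 3))
```

Each row of `r_squared` sums to one, because all p components together explain all of the variance of each standardized variable.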
5.4 PCA on the correlation matrix
5.5 Determining the number of principal components
- Percentage of variance criterion: use enough PRINs to explain
  75-80% of the total variance.
- Latent root criterion: specifies a threshold value for
  the eigenvalues of the derived PRINs. If the
  variables are standardized, only PRINs with an eigenvalue
  greater than 1 are considered significant and are retained.
- Scree test criterion: a scree plot is obtained by plotting the
  eigenvalue of each PRIN against its order of extraction.
  The point where the curve levels off (the elbow) indicates
  the appropriate number of PRINs.
- Horn's parallel procedure (a formalized scree plot).
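The first two criteria are simple arithmetic on the eigenvalues. The sketch below assumes NumPy and uses made-up eigenvalues for five standardized variables; the function names `n_by_variance` and `n_by_latent_root` are illustrative, not taken from any package.

```python
import numpy as np

def n_by_variance(eigvals, target=0.80):
    """Smallest k whose cumulative proportion of variance reaches `target`."""
    ev = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cumulative = np.cumsum(ev) / ev.sum()
    # index of the first cumulative proportion >= target, converted to a count
    return int(np.searchsorted(cumulative, target) + 1)

def n_by_latent_root(eigvals, threshold=1.0):
    """Number of eigenvalues above the threshold (1 for standardized data)."""
    return int(np.sum(np.asarray(eigvals, dtype=float) > threshold))

# Hypothetical eigenvalues of a 5 x 5 correlation matrix (they sum to 5).
eigvals = [2.9, 0.8, 0.5, 0.45, 0.35]

print("cumulative proportion:", np.round(np.cumsum(eigvals) / np.sum(eigvals), 2))
print("percentage of variance (80%):", n_by_variance(eigvals, 0.80))
print("latent root (> 1):", n_by_latent_root(eigvals))
```

With these numbers the two rules disagree (three components versus one), which is common in practice and is one reason the criteria are usually considered together.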
- Scree plot of eigenvalues
- Proportion of variance explained by each component
- Cumulative variance explained by the components
- Eigenvalue > 1
- Scree plot and parallel analysis
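Parallel analysis compares the observed eigenvalues with those obtained from random data of the same size. The following is a rough sketch, assuming NumPy; the simulated data set, the number of simulations and the use of the 95th percentile as the reference (Horn's original proposal used the mean) are all assumptions made for illustration.

```python
import numpy as np

def parallel_analysis(X, n_sims=200, quantile=0.95, seed=0):
    """Keep leading components whose eigenvalue exceeds the chosen quantile
    of eigenvalues from uncorrelated normal data of the same shape."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    simulated = np.empty((n_sims, p))
    for s in range(n_sims):
        noise = rng.normal(size=(n, p))
        simulated[s] = np.sort(
            np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]
    reference = np.quantile(simulated, quantile, axis=0)
    keep = observed > reference
    # count leading components up to the first one that fails the comparison
    n_keep = p if keep.all() else int(np.argmin(keep))
    return n_keep, observed, reference

# Hypothetical data: 6 variables driven by 2 underlying factors plus noise.
rng = np.random.default_rng(42)
factors = rng.normal(size=(300, 2))
pattern = np.array([[0.9, 0.0], [0.8, 0.0], [0.85, 0.0],
                    [0.0, 0.9], [0.0, 0.8], [0.0, 0.85]])
X = factors @ pattern.T + 0.4 * rng.normal(size=(300, 6))

n_keep, observed, reference = parallel_analysis(X)
print("observed eigenvalues: ", np.round(observed, 2))
print("reference eigenvalues:", np.round(reference, 2))
print("components retained:", n_keep)
```

Plotting `observed` and `reference` against the component number gives the scree plot with the parallel-analysis line overlaid; components are retained where the observed curve lies above the reference curve.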
How do we interpret the principal components?
- Use the loadings or coefficients to determine
  which variables are influential in the formation of
  each principal component, and then assign a meaning
  or label to the component.
- Recall that loading = coefficient × standard deviation
  of PRIN_i (for standardized variables).
- Note: SPSS presents loadings; the EV option of SAS PROC
  FACTOR gives the coefficients. The numbers
  under "Factor Pattern" in the SAS output are
  loadings.
- Researchers have often used a loading of 0.5 as
  the cutoff point.
- PCs do not necessarily have a meaning; they are
  mathematical results.
- Interpreting principal components
- Use the principal component scores for further
  analyses
  - Calculate the principal component scores by applying the
    coefficients to the mean-corrected or standardized data.
  - Replace the p variables by k PCs (k << p). The scores can be plotted
    to help interpret the results, discover structure in the data,
    and check for outliers and groups of observations.
  - They can also be used as input variables for further analysis
    of the data with other statistical techniques such as cluster
    analysis, regression and discriminant analysis.
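As an illustration of using scores in a later analysis, the sketch below computes the scores of the first k components and uses them as predictors in an ordinary least-squares regression. It assumes NumPy; the data, the response `y` and the helper `pc_scores` are hypothetical and chosen only to show the mechanics.

```python
import numpy as np

def pc_scores(X, k, standardize=True):
    """Scores of the first k principal components of X (rows = observations)."""
    Xc = X - X.mean(axis=0)                       # mean-corrected data
    if standardize:
        Xc = Xc / X.std(axis=0, ddof=1)           # standardized data
    S = np.cov(Xc, rowvar=False)
    eigvals, W = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1][:k]         # k largest eigenvalues
    return Xc @ W[:, order]                       # apply coefficients to data

# Hypothetical predictors: 5 nearly collinear variables built from 2 sources.
rng = np.random.default_rng(3)
base = rng.normal(size=(150, 2))
X = base @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(150, 5))
y = base[:, 0] - 2.0 * base[:, 1] + 0.1 * rng.normal(size=150)

scores = pc_scores(X, k=2)                        # replace p = 5 variables by k = 2

# Regress y on the two score columns (plus an intercept).
design = np.column_stack([np.ones(len(y)), scores])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

print("scores shape:", scores.shape)
print("regression coefficients:", np.round(coef, 3))
```

Because the score columns are uncorrelated, the regression does not suffer from the collinearity present among the original five predictors.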
- Example: stock price data
  The weekly rates of return for five stocks (Allied Chemical, du Pont, Union
  Carbide, Exxon and Texaco) listed on the New York Stock Exchange were
  determined for the period January 1975 through December 1976. The data
  are given in the following table. Perform a principal components analysis on
  the data and interpret the results.
- Interpretation
  - PRIN1 is approximately an equally weighted
    average of the five stocks. It might be called a market
    component.
  - PRIN2 is a contrast between AC, DP and UC (chemical
    stocks) and EX and TE (oil stocks). It might be
    called an industry component.
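The pattern described in this interpretation can be reproduced with a stylized correlation matrix. The sketch below does not use the actual stock data: the three correlation values are invented for illustration, and NumPy is assumed. Stocks within the same industry are given a higher correlation than stocks in different industries.

```python
import numpy as np

names = ["AC", "DP", "UC", "EX", "TE"]

# Hypothetical correlations (NOT the real data): chemical-chemical,
# chemical-oil and oil-oil.
a, b, c = 0.6, 0.4, 0.5
R = np.array([[1, a, a, b, b],
              [a, 1, a, b, b],
              [a, a, 1, b, b],
              [b, b, b, 1, c],
              [b, b, b, c, 1]], dtype=float)

eigvals, W = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, W = eigvals[order], W[:, order]

# The sign of an eigenvector is arbitrary; fix it so the AC weight is positive.
prin1 = W[:, 0] * np.sign(W[0, 0])
prin2 = W[:, 1] * np.sign(W[0, 1])

for name, w1, w2 in zip(names, prin1, prin2):
    print(f"{name}: PRIN1 weight {w1: .3f}, PRIN2 weight {w2: .3f}")
```

All PRIN1 weights share one sign and are of similar size (a market-type component), while the PRIN2 weights are positive for the three chemical stocks and negative for the two oil stocks (an industry-type contrast).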