Introduction to Scientific Computing
-- A Matrix Vector Approach Using Matlab
Written by Charles F. Van Loan
Chen Wenbin (陈文斌)
Wbchen@fudan.edu.cn
Fudan University (复旦大学)
Chapter 7 The QR and Cholesky Factorizations
- Least Squares Fitting
- The QR Factorization
- The Cholesky Factorization
- High-Performance Cholesky
Least Squares Fitting

Given $A \in \mathbb{R}^{m\times n}$ and $b \in \mathbb{R}^m$ with $m \ge n$, find $x \in \mathbb{R}^n$ to minimize $\|Ax-b\|_2$. For example,

\[
\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
\approx
\begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}
\]

The system is overdetermined: with more equations than unknowns, an exact solution of $Ax=b$ generally does not exist.
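As a concrete check, the 3-by-2 overdetermined system above can be solved with NumPy's least squares routine (a Python sketch for illustration; the book's codes are MATLAB):

```python
import numpy as np

# Overdetermined system: 3 equations, 2 unknowns.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
b = np.array([0.0, 1.0, 1.0])

# lstsq minimizes ||Ax - b||_2.
x, _, _, _ = np.linalg.lstsq(A, b, rcond=None)
r = b - A @ x          # residual; nonzero because no exact solution exists

print(x)               # least squares solution [1/3, -1/12]
print(r @ r)           # squared residual norm, 1/6
```

The normal equations $A^TAx=A^Tb$ give the same answer: $x_1 = 1/3$, $x_2 = -1/12$, with squared residual $1/6$.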
Setting Up Least Squares Problems

Fit $f(x)=\sqrt{x}$ on $[.25,1]$ with a straight line $\alpha+\beta x$ by minimizing

\[
\phi(\alpha,\beta)=\sum_{i=1}^{m}\left(\alpha+\beta x_i-\sqrt{x_i}\right)^2 .
\]

In matrix terms this is an m-by-2 least squares problem:

\[
\min_{\alpha,\beta}\left\|
\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix}
\begin{bmatrix} \alpha \\ \beta \end{bmatrix}
-
\begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_m) \end{bmatrix}
\right\|_2 .
\]

[Figure: least squares line fit to sqrt(x) on [.25,1]; m = 100, alpha = 0.369810, beta = 0.652299]
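The setup above can be reproduced directly; a NumPy sketch (the slide does not show the sample points, so equally spaced points are assumed here):

```python
import numpy as np

# Fit alpha + beta*x to f(x) = sqrt(x) on [.25, 1] in the least squares sense.
# Sample points assumed equally spaced (an assumption; the slide omits them).
m = 100
x = np.linspace(0.25, 1.0, m)
A = np.column_stack([np.ones(m), x])   # m-by-2 design matrix [1, x_i]
f = np.sqrt(x)

(alpha, beta), *_ = np.linalg.lstsq(A, f, rcond=None)
print(alpha, beta)   # close to the slide's alpha = 0.3698..., beta = 0.6523...
```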
Matlab’s Least Squares Tools

Matlab solves the least squares problem with backslash:

xLS = A\b;

[Figure: least squares line fit with only the two endpoint samples; m = 2, alpha = 0.333333, beta = 0.666667]
Letting $m\to\infty$, the (scaled) discrete problem tends to the continuous least squares problem

\[
\min_{\alpha,\beta}\int_{.25}^{1}\left(\alpha+\beta x-\sqrt{x}\right)^{2}dx .
\]

Setting the partial derivatives with respect to $\alpha$ and $\beta$ to zero yields a 2-by-2 linear system for the minimizer.
Minimizer

\[
\begin{bmatrix} 3/4 & 15/32 \\ 15/32 & 21/64 \end{bmatrix}
\begin{bmatrix} \alpha_* \\ \beta_* \end{bmatrix}
=
\begin{bmatrix} 7/12 \\ 31/80 \end{bmatrix}
\]

\[
\alpha_* = \frac{10}{27} \approx 0.370370,
\qquad
\beta_* = \frac{88}{135} \approx 0.651852 .
\]
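The 2-by-2 normal-equations system can be solved exactly by Cramer's rule; a small exact-arithmetic check (Python, for illustration):

```python
from fractions import Fraction as F

# Normal equations for min over (alpha, beta) of
#   integral_{1/4}^{1} (alpha + beta*x - sqrt(x))^2 dx:
#   [ 3/4    15/32 ] [alpha]   [ 7/12 ]
#   [ 15/32  21/64 ] [beta ] = [ 31/80]
a, b = F(3, 4),  F(15, 32)
c, d = F(15, 32), F(21, 64)
e, f = F(7, 12), F(31, 80)

det = a * d - b * c              # 27/1024
alpha = (e * d - b * f) / det    # 10/27
beta  = (a * f - c * e) / det    # 88/135
print(alpha, beta)               # 10/27 88/135
print(float(alpha), float(beta)) # ~0.370370, ~0.651852
```

The matrix entries are the integrals of $1$, $x$, $x^2$, $\sqrt{x}$, and $x\sqrt{x}$ over $[1/4,1]$, which explains the fractions.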
polyfit: if we fit m data points in the least squares sense with a polynomial of degree d, an m-by-(d+1) least squares problem arises; Matlab's polyfit solves it.
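The same idea can be sketched with NumPy's polyfit, which plays the role of Matlab's polyfit here (data chosen for illustration):

```python
import numpy as np

# Degree-d least squares polynomial fit: an m-by-(d+1) problem (here m=5, d=2).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x**2 + 3.0 * x + 1.0     # data that happens to be exactly quadratic

coeffs = np.polyfit(x, y, 2)       # highest power first, as in Matlab's polyfit
print(coeffs)                      # recovers [2, 3, 1]
```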
Matlab’s Least Squares Tools

If $Q$ is orthogonal ($Q^TQ=QQ^T=I$), then

\[
\min_{x\in\mathbb{R}^n}\|Ax-b\|_2
\]

is equivalent to the transformed problem

\[
\min_{x\in\mathbb{R}^n}\|(Q^TA)x-(Q^Tb)\|_2
\]

because $\|Q^Tr\|_2^2=r^TQQ^Tr=\|r\|_2^2$. For example,

\[
Q=\begin{bmatrix}\cos(\theta)&\sin(\theta)\\ -\sin(\theta)&\cos(\theta)\end{bmatrix} .
\]

The columns of an orthogonal matrix define an orthonormal basis.
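The norm-preservation property is easy to verify numerically; a NumPy check with an arbitrary rotation angle (chosen for illustration):

```python
import numpy as np

# A rotation is orthogonal, so it preserves 2-norms: ||Q^T r|| = ||r||.
theta = 0.7
c, s = np.cos(theta), np.sin(theta)
Q = np.array([[c, s],
              [-s, c]])

r = np.array([3.0, -4.0])
print(np.allclose(Q.T @ Q, np.eye(2)))                          # True
print(np.isclose(np.linalg.norm(Q.T @ r), np.linalg.norm(r)))   # True
```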
For $A\in\mathbb{R}^{4\times 2}$, suppose $Q^TA$ is upper triangular. Then

\[
\min_{x\in\mathbb{R}^2}\|Ax-b\|_2^2
=\min_{x\in\mathbb{R}^2}\|(Q^TA)x-(Q^Tb)\|_2^2
=\min_{x\in\mathbb{R}^2}\left\|
\begin{bmatrix} r_{11}&r_{12}\\ 0&r_{22}\\ 0&0\\ 0&0 \end{bmatrix}
\begin{bmatrix} x_1\\ x_2 \end{bmatrix}
-
\begin{bmatrix} c_1\\ c_2\\ c_3\\ c_4 \end{bmatrix}
\right\|_2^2 .
\]

The first two residual components are zeroed by solving the triangular system

\[
\begin{bmatrix} r_{11}&r_{12}\\ 0&r_{22} \end{bmatrix}
\begin{bmatrix} x_1\\ x_2 \end{bmatrix}
=
\begin{bmatrix} c_1\\ c_2 \end{bmatrix},
\]

and so

\[
\min_{x\in\mathbb{R}^2}\|Ax-b\|_2^2=\|Ax_{LS}-b\|_2^2=c_3^2+c_4^2 .
\]
$A=QR$: the QR factorization problem is to find an orthonormal basis for the subspace defined by the columns of A. Comparing columns in $A=QR$,

A(:,j) = Q(:,1)*R(1,j) + Q(:,2)*R(2,j) + ... + Q(:,j)*R(j,j)

so any column of A is in the span of {Q(:,1),...,Q(:,n)}.
For example,

\[
\begin{bmatrix} 1&8\\ 2&1\\ 2&-14 \end{bmatrix}
=
\begin{bmatrix} 1/3&2/3&2/3\\ 2/3&1/3&-2/3\\ 2/3&-2/3&1/3 \end{bmatrix}
\begin{bmatrix} 3&-6\\ 0&15\\ 0&0 \end{bmatrix} .
\]
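This worked example (one consistent sign choice; signs in Q and R are only determined up to a per-column choice) can be checked numerically:

```python
import numpy as np

A = np.array([[1.0, 8.0],
              [2.0, 1.0],
              [2.0, -14.0]])
Q = np.array([[1.0, 2.0, 2.0],
              [2.0, 1.0, -2.0],
              [2.0, -2.0, 1.0]]) / 3.0
R = np.array([[3.0, -6.0],
              [0.0, 15.0],
              [0.0, 0.0]])

print(np.allclose(Q @ Q.T, np.eye(3)))  # True: Q is orthogonal
print(np.allclose(Q @ R, A))            # True: A = QR
```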
Rotations

The Q in the QR factorization can be computed using a special family of orthogonal matrices called rotations:

\[
G=\begin{bmatrix} c & s\\ -s & c \end{bmatrix},
\qquad c=\cos(\theta),\ s=\sin(\theta).
\]

If

\[
c=\frac{x_1}{\sqrt{x_1^2+x_2^2}},
\qquad
s=\frac{x_2}{\sqrt{x_1^2+x_2^2}},
\]

then $y=Gx$ has $y_2=0$.
function [c,s] = Rotate(x1,x2)
if x2==0
c = 1; s = 0;
else
if abs(x2)>=abs(x1)
cotangent = x1/x2;
s = 1/sqrt(1+cotangent^2); c = s*cotangent;
else
tangent = x2/x1;
c = 1/sqrt(1+tangent^2); s = c*tangent;
end
end
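The Rotate function above translates directly; a Python sketch, checked on the point (3, 4):

```python
import math

def rotate(x1, x2):
    """Return (c, s) with c^2 + s^2 = 1 so that
    [c s; -s c] * [x1; x2] has a zero second component."""
    if x2 == 0:
        return 1.0, 0.0
    if abs(x2) >= abs(x1):
        cot = x1 / x2
        s = 1.0 / math.sqrt(1.0 + cot * cot)
        return s * cot, s
    t = x2 / x1
    c = 1.0 / math.sqrt(1.0 + t * t)
    return c, c * t

c, s = rotate(3.0, 4.0)
print(c, s)                   # 0.6 0.8
print(-s * 3.0 + c * 4.0)     # 0.0  (second component annihilated)
print(c * 3.0 + s * 4.0)      # 5.0  (the norm of (3, 4))
```

Dividing by the larger of |x1|, |x2| avoids overflow in the intermediate quotient; that is the point of the two branches.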
Rotations can be embedded in the identity. With $c=\cos(\theta)$, $s=\sin(\theta)$,

\[
G(1,3,\theta)=\begin{bmatrix} c&0&s&0\\ 0&1&0&0\\ -s&0&c&0\\ 0&0&0&1 \end{bmatrix},
\qquad
G(1,3,\theta)\begin{bmatrix} x_1\\ x_2\\ x_3\\ x_4 \end{bmatrix}
=
\begin{bmatrix} cx_1+sx_3\\ x_2\\ -sx_1+cx_3\\ x_4 \end{bmatrix} .
\]

To zero the third component, use [c,s] = Rotate(x(1),x(3)).
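The embedding can be checked with a small NumPy example (vector chosen for illustration):

```python
import numpy as np

def g13(c, s):
    """Rotation in the (1,3) plane embedded in the 4x4 identity."""
    G = np.eye(4)
    G[0, 0], G[0, 2] = c, s
    G[2, 0], G[2, 2] = -s, c
    return G

x = np.array([3.0, 7.0, 4.0, -2.0])
r = np.hypot(x[0], x[2])        # 5.0
c, s = x[0] / r, x[2] / r       # chosen to zero the third component
y = g13(c, s) @ x
print(y)                        # [5, 7, 0, -2]
```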
Reduction to Upper Triangular Form

A sequence of rotations reduces A to upper triangular form. Working a column at a time and from the bottom up, each rotation zeros one subdiagonal entry. For a 5-by-3 A:

    [x x x]    [x x x]    [x x x]           [x x x]
    [x x x]    [x x x]    [x x x]           [0 x x]
    [x x x] -> [x x x] -> [x x x] -> ... -> [0 0 x]
    [x x x]    [x x x]    [0 x x]           [0 0 0]
    [x x x]    [0 x x]    [0 x x]           [0 0 0]

If $G_t\cdots G_1A=R$ (upper triangular), then $Q^T=G_t\cdots G_1$ is orthogonal and $A=QR$.
function [Q,R] = QRRot(A)
[m,n] = size(A); Q = eye(m,m);
for j=1:n
for i=m:-1:j+1
[c,s] = Rotate(A(i-1,j),A(i,j));
A(i-1:i,j:n) = [c s; -s c]*A(i-1:i,j:n);
Q(:,i-1:i) = Q(:,i-1:i)*[c s; -s c]';
end;end
R = triu(A);
This costs about $(9m-4n)n^2$ flops; for $n=m$ this is roughly seven times as costly as an LU factorization.
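QRRot translates line for line; a NumPy sketch, tested on the 3-by-2 example from the QR discussion:

```python
import numpy as np

def qr_rot(A):
    """Givens-rotation QR: returns (Q, R) with A = Q @ R, R upper triangular.
    A transcription of the slide's QRRot (MATLAB) into NumPy."""
    A = A.astype(float).copy()
    m, n = A.shape
    Q = np.eye(m)
    for j in range(n):
        for i in range(m - 1, j, -1):          # bottom up in column j
            x1, x2 = A[i - 1, j], A[i, j]
            r = np.hypot(x1, x2)
            if r == 0.0:
                continue
            c, s = x1 / r, x2 / r
            G = np.array([[c, s], [-s, c]])
            A[i - 1:i + 1, j:] = G @ A[i - 1:i + 1, j:]
            Q[:, i - 1:i + 1] = Q[:, i - 1:i + 1] @ G.T   # accumulate Q
    return Q, np.triu(A)

A = np.array([[1.0, 8.0], [2.0, 1.0], [2.0, -14.0]])
Q, R = qr_rot(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(3)))  # True True
```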
The Least Squares Solution
function [xLS,res] = LSq(A,b)
[m,n] = size(A);
for j=1:n,for i=m:-1:j+1
[c,s] = Rotate(A(i-1,j),A(i,j));
A(i-1:i,j:n) = [c s; -s c]*A(i-1:i,j:n);
b(i-1:i) = [c s; -s c]*b(i-1:i);
end;end
xLS = UTriSol(A(1:n,1:n),b(1:n));
if m==n, res = 0;
else, res = norm(b(n+1:m));
end
This costs about $(3m-2n)n^2$ flops; Q is not explicitly formed. For $m=n$, this is about 50% more than what is required by Gaussian elimination.
Solution Sensitivity

Instead of the condition number itself, we use an equivalent formulation that makes sense for both square and rectangular matrices:

\[
\frac{1}{\kappa_2(A)}=\min_{\mathrm{rank}(A+E)<n}\frac{\|E\|_2}{\|A\|_2} .
\]

Roughly speaking, the inverse of the condition number is the relative distance to the nearest rank-deficient matrix. If the columns of A are nearly dependent, then its condition number is large.

If $\tilde{x}_{LS}$ minimizes $\|(A+\Delta A)x-(b+\Delta b)\|_2$ with $\|\Delta A\|_2\le \mathrm{eps}\,\|A\|_2$ and $\|\Delta b\|_2\le \mathrm{eps}\,\|b\|_2$, then

\[
\frac{\|\tilde{x}_{LS}-x_{LS}\|_2}{\|x_{LS}\|_2}
\approx \mathrm{eps}\left(\kappa_2(A)+\rho_{LS}\,\kappa_2(A)^2\right),
\qquad
\rho_{LS}=\|Ax_{LS}-b\|_2 .
\]

See ShowLSq.m.
The Cholesky Factorization

- If A is SPD, then its diagonal entries are positive and $|a_{ij}|\le (a_{ii}+a_{jj})/2$ for all i and j.
- If A is strictly diagonally dominant and also symmetric with positive diagonal entries, then it is SPD.
- If A is SPD, then it has a Cholesky factorization $A=GG^T$ with G lower triangular.
- We can solve the LS problem via the normal equations $A^TAx=A^Tb$.
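The normal-equations route can be sketched in NumPy, reusing the 3-by-2 example from the start of the chapter (a sketch; the book does this with Matlab's chol and triangular solvers):

```python
import numpy as np

# Solve the LS problem through the normal equations A'Ax = A'b,
# using the Cholesky factor of the SPD matrix A'A.
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([0.0, 1.0, 1.0])

M = A.T @ A                      # SPD when A has full column rank
G = np.linalg.cholesky(M)        # M = G @ G.T, G lower triangular
y = np.linalg.solve(G, A.T @ b)  # forward substitution: G y = A'b
x = np.linalg.solve(G.T, y)      # back substitution:   G' x = y
print(x)                         # [1/3, -1/12], matching lstsq
```

Note that forming $A^TA$ squares the condition number, which is why the QR route is preferred when A is ill conditioned.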
Positive Definiteness

SPD matrices arise naturally, e.g., when discretizing the two-point boundary value problem

\[
-D[p(x)Du(x)]+q(x)u(x)=r(x),
\]

which leads to a symmetric tridiagonal system

\[
T=\begin{bmatrix}
d_{1}&e_{1}& & &\\
e_{1}&d_{2}&e_{2}& &\\
 &\ddots&\ddots&\ddots&\\
 & &e_{n-2}&d_{n-1}&e_{n-1}\\
 & & &e_{n-1}&d_{n}
\end{bmatrix} .
\]

[Figure: Solution to -D^2 u + u = 2x sin(x) - 2cos(x), u(0) = u(pi) = 0; n = 100, norm(u - x sin(x),'inf') = 0.000112]
Five Implementations for Full Matrices

From $A=GG^T$,

\[
a_{ij}=\sum_{k=1}^{j}g_{ik}g_{jk}
=\sum_{k=1}^{j-1}g_{ik}g_{jk}+g_{ij}g_{jj}
\quad\Rightarrow\quad
s=a_{ij}-\sum_{k=1}^{j-1}g_{ik}g_{jk},
\]

\[
g_{ij}=\begin{cases} s/g_{jj} & i>j,\\ \sqrt{s} & i=j. \end{cases}
\]

Implementations: CholScalar, CholDot, CholSax, CholGax, CholRecur.
function G = CholScalar(A)
[n,n] = size(A); G = zeros(n,n);
for i=1:n,for j=1:i
s = A(j,i);
for k=1:j-1
s = s - G(j,k)*G(i,k);
end
if j < i
G(i,j) = s/G(j,j);
else
G(i,i) = sqrt(s);
end; end;end
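CholScalar carries over to NumPy almost verbatim; a sketch, checked on a small SPD matrix:

```python
import numpy as np

def chol_scalar(A):
    """Scalar-level Cholesky: returns lower triangular G with A = G @ G.T.
    A NumPy transcription of the slide's CholScalar."""
    n = A.shape[0]
    G = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1):
            s = A[j, i]
            for k in range(j):
                s -= G[j, k] * G[i, k]
            if j < i:
                G[i, j] = s / G[j, j]
            else:
                G[i, i] = np.sqrt(s)
    return G

A = np.array([[4.0, 2.0], [2.0, 3.0]])    # an SPD test matrix
G = chol_scalar(A)
print(G)                                  # [[2, 0], [1, sqrt(2)]]
print(np.allclose(G @ G.T, A))            # True
```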
function G = CholDot(A)
[n,n] = size(A); G = zeros(n,n);
for i=1:n,for j=1:i
if j==1
s = A(j,i);
else
s = A(j,i) - G(j,1:j-1)*G(i,1:j-1)'; end
if j < i
G(i,j) = s/G(j,j);
else
G(i,i) = sqrt(s);
end; end;end
An inner product is an example of a level-1 linear algebra operation. Level-1 operations involve O(n) work and O(n) data. Inner products, vector scaling, vector addition, and saxpys are level-1 operations.
Column-oriented version

From

\[
A(:,j)=\sum_{k=1}^{j}G(:,k)G(j,k)
\]

we get

\[
G(:,j)G(j,j)=A(:,j)-\sum_{k=1}^{j-1}G(:,k)G(j,k)=s .
\]

Working only with rows j through n,

\[
G(j{:}n,j)G(j,j)=A(j{:}n,j)-\sum_{k=1}^{j-1}G(j{:}n,k)G(j,k)=s(j{:}n),
\]

so $G(j,j)=\sqrt{s(j)}$ and $G(j{:}n,j)=s(j{:}n)/\sqrt{s(j)}$.
function G = CholSax(A)
[n,n] = size(A);
G = zeros(n,n);
s = zeros(n,1);
for j=1:n
s(j:n) = A(j:n,j);
for k=1:j-1
s(j:n) = s(j:n) - G(j:n,k)*G(j,k);
end
G(j:n,j) = s(j:n)/sqrt(s(j));
end
The saxpy loop can be replaced by a single gaxpy operation (vector ← vector + matrix × vector):

s(j:n) = s(j:n) - G(j:n,1:j-1)*G(j,1:j-1)'
function G = CholGax(A)
[n,n] = size(A);G = zeros(n,n);s = zeros(n,1);
for j=1:n
if j==1
s(j:n) = A(j:n,j);
else
s(j:n) = A(j:n,j) - G(j:n,1:j-1)*G(j,1:j-1)';
end
G(j:n,j) = s(j:n)/sqrt(s(j));end
Level-2 operations are characterized by quadratic work and quadratic data.
Recursive implementation

Partition

\[
A=\begin{bmatrix} B & v\\ v^T & \alpha \end{bmatrix},
\qquad
G=\begin{bmatrix} G_1 & 0\\ w^T & \beta \end{bmatrix},
\]

so that

\[
GG^T=\begin{bmatrix} G_1G_1^T & G_1w\\ w^TG_1^T & w^Tw+\beta^2 \end{bmatrix}
=\begin{bmatrix} B & v\\ v^T & \alpha \end{bmatrix} .
\]

Thus $B=G_1G_1^T$, $G_1w=v$, and $\beta=\sqrt{\alpha-w^Tw}$: compute the Cholesky factor $G_1$ of B recursively, solve the lower triangular system $G_1w=v$, and set $\beta=\sqrt{\alpha-w^Tw}$.
function G = CholRecur(A)
[n,n] = size(A);
G = zeros(n,n);
if n==1
G = sqrt(A);
else
G(1:n-1,1:n-1) = CholRecur(A(1:n-1,1:n-1));
G(n,1:n-1) = LTriSol(G(1:n-1,1:n-1),A(1:n-1,n))';
G(n,n) = sqrt(A(n,n) - G(n,1:n-1)*G(n,1:n-1)');
end
Efficiency, Stability, Accuracy
n = 100
Algorithm Relative Time
--------------------------
CholScalar 1.000
CholDot 0.187
CholSax 0.188
CholGax 0.034
CholRecur 0.250
Relative time = Time/CholScalar Time
Different implementations of
the same matrix algorithm
can have different levels of
performance.
In GE we had to worry about large multipliers. This is not an issue in the Cholesky factorization:

\[
a_{ii}=\sum_{j=1}^{i}g_{ij}^2
\quad\text{implies}\quad
|g_{ij}|\le\sqrt{a_{ii}} .
\]
Thus no entry in G can be larger than the square root of A's largest diagonal entry. This does not imply that all symmetric positive definite systems are well conditioned. If $\hat{x}$ is the computed vector produced by

G = CholScalar(A);
y = LTriSol(G,b);
x = UTriSol(G',y);

then

\[
(A+E)\hat{x}=b,\qquad \|E\|\approx \mathrm{eps}\,\|A\|,
\qquad
\frac{\|x-\hat{x}\|}{\|x\|}\approx \mathrm{eps}\,\kappa(A) .
\]
CholErr.m
clc
disp(' n cond(A) relerr ')
disp('----------------------------------')
for n = 2:12
A = hilb(n); b = randn(n,1);
G = CholScalar(A); y = LTriSol(G,b);
x = UTriSol(G',y); condA = cond(A);
xTrue = invhilb(n)*b;
relerr = norm(x-xTrue)/norm(xTrue);
disp(sprintf(' %2.0f %10.3e %10.3e ',n,condA,relerr))
end
n cond(A) relerr
2 1.928e+001 9.225e-017
3 5.241e+002 1.061e-015
4 1.551e+004 1.390e-013
5 4.766e+005 1.299e-012
6 1.495e+007 8.886e-011
7 4.754e+008 3.399e-009
8 1.526e+010 7.022e-009
9 4.932e+011 3.594e-006
10 1.603e+013 1.241e-004
11 5.224e+014 2.788e-003
12 1.795e+016 5.706e-002
Matlab CHOL function and LS Problem

chol(A) returns an upper triangular matrix R such that $A=R^TR$.

If A has full column rank and $A=QR$ is its QR factorization, then $\tilde{R}=R(1{:}n,1{:}n)$ defines the Matlab Cholesky factor of $A^TA$:

\[
A^TA=(QR)^T(QR)=R^TR=\tilde{R}^T\tilde{R} .
\]

Note that $A=Q(:,1{:}n)\,\tilde{R}$, and so the LS minimizer of $\|Ax-b\|_2$ is given by

\[
x_{LS}=\tilde{R}^{-1}Q(:,1{:}n)^Tb
=\tilde{R}^{-1}\tilde{R}^{-T}A^Tb
=(\tilde{R}^T\tilde{R})^{-1}A^Tb
=(A^TA)^{-1}A^Tb .
\]

Then $x_{LS}$ is the solution of the SPD system $A^TAx=A^Tb$ (the normal equations).
High-Performance Cholesky

We discuss two implementations of the Cholesky factorization that shed light on what it is like to design a linear equation solver that runs fast in an advanced computing environment.

The block Cholesky algorithm shows how to make a linear algebra computation rich in matrix-matrix multiplication. This has the effect of reducing the amount of data motion.

The shared-memory implementation shows how the work involved in a matrix factorization can be subdivided into roughly equal parts that can be computed simultaneously by separate processors.

Level-3 implementation

Level-3 operations involve a quadratic amount of data and a cubic amount of work, e.g., matrix multiplication. In the n-by-n C=AB case, there are 2n^2 input values and 2n^3 flops to perform. Unlike level-1 and level-2 operations, the ratio of work to data grows with problem size in the level-3 setting. This makes it possible to bury data-motion overheads in a computation that is rich in level-3 operations. Algorithms with this property usually run very fast on advanced architectures.
Partition A and G into m-by-m grids of p-by-p blocks (n = pm):

\[
A=\begin{bmatrix} A_{11} & \cdots & A_{1m}\\ \vdots & \ddots & \vdots\\ A_{m1} & \cdots & A_{mm} \end{bmatrix},
\qquad
G=\begin{bmatrix} G_{11} & & \\ \vdots & \ddots & \\ G_{m1} & \cdots & G_{mm} \end{bmatrix} .
\]
Block version of CholScalar

From $A=GG^T$,

\[
A_{ij}=\sum_{k=1}^{j}G_{ik}G_{jk}^T
\quad\Rightarrow\quad
G_{ij}G_{jj}^T=A_{ij}-\sum_{k=1}^{j-1}G_{ik}G_{jk}^T=S_{ij} .
\]

for i=1:m
  for j=1:i
    compute S_ij
    if j < i
      G_ij is the solution of X*G_jj' = S_ij
    else
      G_ii is the Cholesky factor of S_ii
    end
  end
end
function G = CholBlock(A,p)
[n,n] = size(A);m = n/p; A = MakeBlock(A,p);G = cell(m,m);
for i=1:m,for j=1:i
S = A{i,j};
for k=1:j-1
S = S - G{i,k}*G{j,k}'; end
if j < i
G{i,j} = (G{j,j}\S')';
else
G{i,i} = CholScalar(S); end; end;end
G = MakeScalar(G);
function G=CholBlock1(A,p)
[n,n]=size(A); m=n/p; G=zeros(n,n);
for i=1:m,ivec=((i-1)*p+1):(i*p);
for j=1:i,jvec=((j-1)*p+1):(j*p);S=A(ivec,jvec);
for k=1:j-1
kvec=((k-1)*p+1):(k*p); S=S-G(ivec,kvec)*G(jvec,kvec)';
end
if j<i
for q=1:p
G(q+(i-1)*p,jvec)=(LTriSol(G(jvec,jvec),S(q,:)'))'; end
else
G(ivec,jvec)=CholScalar(S); end; end;end
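The block algorithm can be sketched in NumPy as follows (a sketch assuming p divides n; numpy.linalg.cholesky stands in for the scalar-level factorization of the diagonal blocks):

```python
import numpy as np

def chol_block(A, p):
    """Block Cholesky: returns lower triangular G with A = G @ G.T.
    Requires the block size p to divide n."""
    n = A.shape[0]
    m = n // p
    G = np.zeros((n, n))
    blk = lambda i: slice(i * p, (i + 1) * p)
    for i in range(m):
        for j in range(i + 1):
            S = A[blk(i), blk(j)].copy()
            for k in range(j):
                S -= G[blk(i), blk(k)] @ G[blk(j), blk(k)].T
            if j < i:
                # Solve X @ Gjj' = S for X, i.e. Gjj @ X' = S'.
                G[blk(i), blk(j)] = np.linalg.solve(G[blk(j), blk(j)], S.T).T
            else:
                G[blk(i), blk(i)] = np.linalg.cholesky(S)
    return G

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
A = B @ B.T + 6 * np.eye(6)         # SPD by construction
G = chol_block(A, 2)
print(np.allclose(G @ G.T, A))      # True
```

The inner k-loop does all the work as p-by-p matrix-matrix products, which is what makes the algorithm level-3 rich.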
Block Size Time / Unblocked Time
1 1.000
2 0.296
3 0.122
4 0.079
6 0.055
8 0.050
12 0.046
16 0.044
24 0.046
32 0.048
48 0.057
64 0.070
96 0.106
n = 192