Distributed Linear Algebra

Peter L. Montgomery
Microsoft Research, Redmond, USA
RSA 2000
January 17, 2000


Role of matrices in factoring n

Sieving finds many x_j^2 ≡ ∏_i p_i^{e_ij} (mod n).
Raise the jth relation to the power s_j = 0 or 1, and multiply.
The left side is always a perfect square. The right side is a square if the exponents ∑_j e_ij s_j are even for all i.
Matrix equation: E s ≡ 0 (mod 2), with E known.
Knowing x^2 ≡ y^2 (mod n), test GCD(x − y, n).
Matrix rows represent the primes p_i. Entries are the exponents e_ij. Arithmetic is over GF(2).
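This step can be illustrated on a toy instance. The sketch below (mine, not from the talk) stores each relation's exponent vector mod 2 as an integer bitmask and runs Gaussian elimination over GF(2) to find a set of relations whose exponent vectors XOR to zero, i.e. a solution s of Es ≡ 0 (mod 2); the function name and encoding are invented for this example.

```python
def find_dependency(columns):
    """columns[j] is a bitmask of relation j's exponent vector mod 2
    (bit i set iff e_ij is odd). Returns a set of column indices whose
    exponent vectors XOR to zero, or None if no dependency exists."""
    pivots = {}  # leading bit -> (reduced column, set of original columns)
    for j, col in enumerate(columns):
        combo = {j}
        while col:
            b = col.bit_length() - 1          # leading set bit
            if b not in pivots:
                pivots[b] = (col, combo)      # new pivot; no dependency yet
                break
            pcol, pcombo = pivots[b]
            col ^= pcol                       # eliminate the leading bit
            combo ^= pcombo                   # track which relations were mixed in
        else:
            return combo                      # reduced to 0: E s = 0 (mod 2)
    return None
```

Setting s_j = 1 for the returned indices makes every exponent sum even, so the product of those relations yields a congruence of squares x^2 ≡ y^2 (mod n).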
Matrix growth on RSA Challenge

RSA-140 (Jan–Feb 1999):
  Matrix size: 4 671 181 × 4 704 451
  Weight: 151 141 999
  Omit primes < 40
  99 Cray C90 hours
  75% of 800 Mb for matrix storage

RSA-155 (August 1999):
  Matrix size: 6 699 191 × 6 711 336
  Weight: 417 132 631
  Omit primes < 40
  224 Cray C90 hours
  85% of 1960 Mb for matrix storage


Regular Lanczos

A is a positive definite (real, symmetric) n×n matrix.
Given b, we want to solve Ax = b for x.
Set w_0 = b.
w_{i+1} = A w_i − ∑_{0≤j≤i} c_ij w_j   if i ≥ 0, where
c_ij = (w_j^T A^2 w_i) / (w_j^T A w_j).
Stop when w_{i+1} = 0.
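The recurrence transcribes directly into code. This is a pure-Python sketch under my own naming, orthogonalizing against every previous w_j for clarity (the next slides show that only two back-terms are actually needed):

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lanczos_solve(A, b, tol=1e-12):
    """Solve A x = b for a symmetric positive definite matrix A."""
    ws = [list(b)]                            # w_0 = b
    for _ in range(len(b)):
        w = ws[-1]
        Aw = matvec(A, w)
        A2w = matvec(A, Aw)
        w_next = list(Aw)                     # w_{i+1} = A w_i - sum_j c_ij w_j
        for wj in ws:
            c = dot(wj, A2w) / dot(wj, matvec(A, wj))   # c_ij
            w_next = [x - c * y for x, y in zip(w_next, wj)]
        if dot(w_next, w_next) < tol:         # stop when w_{i+1} = 0
            break
        ws.append(w_next)
    # x = sum_j (w_j^T b / w_j^T A w_j) w_j, as in the Claims slide
    x = [0.0] * len(b)
    for wj in ws:
        coeff = dot(wj, b) / dot(wj, matvec(A, wj))
        x = [xi + coeff * wji for xi, wji in zip(x, wj)]
    return x
```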
Claims

w_j^T A w_j ≠ 0 if w_j ≠ 0 (A is positive definite).
w_j^T A w_i = 0 whenever i ≠ j (by the choice of c_ij and the symmetry of A).
Eventually some w_{i+1} = 0, say for i = m (otherwise there would be too many A-orthogonal vectors).
x = ∑_{0≤j≤m} (w_j^T b / w_j^T A w_j) w_j satisfies Ax = b: the error u = Ax − b lies in the space spanned by the w_j but is orthogonal to all w_j, so u^T u = 0 and u = 0.
Simplifying c_ij when i > j+1

w_j^T A w_j · c_ij = w_j^T A^2 w_i = (A w_j)^T (A w_i)
                   = (w_{j+1} + linear comb. of w_0 to w_j)^T (A w_i)
                   = 0   (A-orthogonality).
The recurrence simplifies to
w_{i+1} = A w_i − c_ii w_i − c_{i,i−1} w_{i−1}   when i ≥ 1.
Little history needs to be saved as i advances.
Major operations needed

Premultiply w_i by A.
Inner products such as w_j^T A w_j and w_j^T A^2 w_i = (A w_j)^T (A w_i).
Add a scalar multiple of one vector to another.
Adapting to Bx = 0 over GF(2)

B is n1×n2 with n1 < n2, and not symmetric. Solve Ax = 0 where A = B^T B. A is n2×n2. B^T has a small nullspace in practice.
The right side is zero, so Lanczos gives x = 0. Instead solve Ax = Ay where y is random.
u^T u and u^T A u can vanish over GF(2) even when u ≠ 0. This is solved by Block Lanczos (Eurocrypt 1995).
Block Lanczos summary

Let N be the machine word length (typically 32 or 64) or a small multiple thereof.
Vectors are n1×N or n2×N over GF(2).
Exclusive OR and other hardware bitwise instructions operate on N-bit data.
The recurrences are similar to regular Lanczos.
Approximately n1/(N − 0.76) iterations.
Up to N independent solutions of Bx = 0.


Block Lanczos major operations

Premultiply an n2×N vector by B.
Premultiply an n1×N vector by B^T.
N×N inner product of two n2×N vectors.
Postmultiply an n2×N vector by an N×N matrix.
Add two n2×N vectors.
How do we parallelize these?
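These operations map directly onto machine words. A toy sketch (mine; Python integers stand in for N-bit words) of two of them, vector addition and the N×N inner product:

```python
N = 64  # block width: one N-bit word per vector entry

def vec_add(u, v):
    """Add two n×N vectors over GF(2): one XOR per entry."""
    return [a ^ b for a, b in zip(u, v)]

def inner_product(u, v):
    """N×N inner product u^T v over GF(2). Row k of the result is the
    XOR of v[i] over all i whose word u[i] has bit k set."""
    rows = [0] * N
    for ui, vi in zip(u, v):
        w = ui
        while w:
            low = w & -w                      # lowest set bit of u[i]
            rows[low.bit_length() - 1] ^= vi  # bit k set: fold v[i] into row k
            w ^= low
    return rows
```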
Assumed processor topology

Assume a g1×g2 toroidal grid of processors.
A torus is a rectangle with its top connected to its bottom, and its left to its right (a doughnut).
We need fast communication to/from the immediate neighbors north, south, east, and west.
Processor names are p_rc, where r is taken modulo g1 and c modulo g2.
Set gridrow(p_rc) = r and gridcol(p_rc) = c.


A torus of processors

(Figure: example of a 3×3 torus system.)
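The wraparound naming can be made concrete with a small helper (hypothetical, for illustration):

```python
def neighbors(r, c, g1, g2):
    """North/south/east/west neighbors of processor p_rc on a g1×g2
    torus; row indices wrap modulo g1, column indices modulo g2."""
    return {
        "north": ((r - 1) % g1, c),
        "south": ((r + 1) % g1, c),
        "west":  (r, (c - 1) % g2),
        "east":  (r, (c + 1) % g2),
    }
```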
Matrix row and column guardians

For 0 ≤ i < n1, a processor rowguard(i) is responsible for entry i in all n1×N vectors.
For 0 ≤ j < n2, a processor colguard(j) is responsible for entry j in all n2×N vectors.
Processor-assignment algorithms aim for load balancing.


Three major operations

Vector addition is pointwise. When adding two n2×N vectors, processor colguard(j) does the jth entries. The data is local.
Likewise for postmultiplying an n2×N vector by an N×N matrix.
Processors form partial N×N inner products. A central processor sums them.
These operations need little communication.
Workloads are O(#columns assigned).
Allocating B among processors

Let B = (b_ij) for 0 ≤ i < n1 and 0 ≤ j < n2.
Processor p_rc is responsible for all b_ij where gridrow(rowguard(i)) = r and gridcol(colguard(j)) = c.
When premultiplying by B, the input data from colguard(j) will arrive along grid column c, and the output data for rowguard(i) will depart along grid row r.


Multiplying u = Bv, where u is n1×N and v is n2×N

Distribute each v[j] to all p_rc with gridcol(colguard(j)) = c. That is, broadcast each v[j] along one column of the grid.
Each p_rc processes all of its b_ij, building partial u[i] outputs.
Partial u[i] values are summed as they advance along a grid row to rowguard(i).
Individual workloads depend upon B.


Actions by p_rc during a multiply

Send/receive all v[j] with gridcol(colguard(j)) = c.
Zero all u[i] with rowguard(i) = p_{r,c+1}.
At time t, where 1 ≤ t ≤ g2, adjust all u[i] with rowguard(i) = p_{r,c+t} (t nodes east).
If t < g2, ship these u[i] west to p_{r,c−1} and receive other u[i] from p_{r,c+1} on the east.
We want balanced workloads at each t.
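This schedule can be simulated on one grid row. In the sketch below (my own reconstruction of the slide's schedule, with invented names), contrib[c][o] is processor c's contribution to the u-entries owned by processor o; at each step every processor adds its contribution for the owner t nodes to its east, then the running sums are shipped one node west, so each sum travels west around the ring and arrives home at t = g2:

```python
def sum_along_row(contrib):
    """contrib[c][o]: contribution of processor c (on one grid row of
    g2 processors) to the values owned by processor o. Returns the
    completed sums, indexed by owner."""
    g2 = len(contrib)
    buf = [0] * g2                        # partial sum currently held at each node
    for t in range(1, g2 + 1):
        for c in range(g2):
            buf[c] += contrib[c][(c + t) % g2]   # adjust for owner t nodes east
        if t < g2:
            buf = buf[1:] + buf[:1]       # ship west: node c receives from c+1
    return buf                            # at t = g2 each node holds its own sum
```

After g2 steps, buf[o] = Σ_c contrib[c][o], and each node sends only one message per step.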
Multiplication by B^T

Reverse the roles of matrix rows and columns.
Reverse the roles of grid rows and columns.
B^T and B can share storage, since the same processor handles (B)_ij during a multiply by B as handles (B^T)_ji during a multiply by B^T.
Major memory requirements

Matrix data is split amongst the processors.
With 65536×65536 cache-friendly blocks, each entry needs only two 16-bit offsets (its row and column within the block).
Each processor needs one vector of length max(n1/g1, n2/g2) and a few of length n2/(g1·g2), with N bits per entry.
The central processor needs one vector of length n2, plus the rowguard and colguard tables.


Major communications during a multiply by B

Broadcast each v[j] along an entire grid column: ship n2·N bits to each of g1 − 1 destinations.
Forward partial u[i] along a grid row, one node at a time: a total of (g2 − 1)·n1·N bits.
When n2 ≈ n1, the communication for B and B^T together is 2(g1 + g2 − 2)·n1·N bits per iteration,
i.e. 2(g1 + g2 − 2)·n1^2 bits after the n1/N iterations.
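A back-of-the-envelope check of these formulas (my own arithmetic, using the RSA-155 row count from the earlier slide and a hypothetical 4×4 grid with N = 64):

```python
n1 = 6_699_191                               # RSA-155 matrix rows
N = 64                                       # block width
g1 = g2 = 4                                  # hypothetical grid

bits_per_iter = 2 * (g1 + g2 - 2) * n1 * N   # B and B^T, per iteration
total_bits = 2 * (g1 + g2 - 2) * n1 ** 2     # after the n1/N iterations
# total_bits is about 5.4e14 bits, i.e. roughly 67 terabytes shipped in all
```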
Choosing grid size

Large enough that the matrix fits in memory.
Matrix storage is about 4w/(g1·g2) bytes per processor, where w is the total matrix weight.
Try to balance I/O and computation times.
The multiply cost is O(n1·w/(g1·g2)) per processor.
The communications cost is O((g1 + g2 − 2)·n1^2).
Prefer a square grid, to reduce g1 + g2.


Choice of N and matrix

Prefer a smaller but heavier matrix if it fits, to lessen communications.
A higher N yields more dependencies, letting you omit the heaviest rows from the matrix.
A larger N means fewer but longer messages.
The size of vector elements affects the cache.
When N is large, inner products and postmultiplies by N×N matrices are slower.
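For example (my own arithmetic, with the RSA-155 weight from the earlier slide and a hypothetical 4×4 grid), the per-processor storage estimate above works out to:

```python
w = 417_132_631                      # RSA-155 total matrix weight
g1 = g2 = 4                          # hypothetical grid
bytes_per_proc = 4 * w / (g1 * g2)   # ~ 4w/(g1*g2) bytes
# about 104 MB of matrix data per processor
```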
Cambridge cluster configuration

Microsoft Research, Cambridge, UK.
16 dual-CPU 300 MHz Pentium IIs.
Each node:
  384 MB RAM
  4 GB local disk
Networks:
  Dedicated fast Ethernet (100 Mb/sec)
  Myrinet, M2M-OCT-SW8 (1.28 Gb/sec)


Message Passing Interface (MPI)

An industry standard.
MPI implementations:
  exist for the majority of parallel systems and interconnects;
  are available in the public domain (e.g. mpich) or commercially (e.g. MPI PRO).
MPI supports many communications primitives, including virtual topologies (e.g. a torus).


Performance data from MSR Cambridge cluster

(Figure: embedded performance chart.)