Abstract
We discuss performance issues of the tiled Cholesky factorization on non-uniform memory access (NUMA) shared-memory machines, and show how to optimize thread and data placement to improve performance. The resulting implementation is 50% faster than PLASMA and 75% faster than MKL.
