Testing pypastix and no speed up observed on multiple cores

By: Maciek Sykulski on 2016-04-07 20:20
[forum:150440]


Hello,
I'm testing my pypastix installation and I'm not sure everything is OK: running ./example/src/pypastix/pastix.py on 1 core is faster than running it on 10 cores. Why don't I observe any speedup when solving a matrix on many cores?

time mpiexec -np 1 python ./example/src/pypastix/pastix.py --size 200000 --type double --sym
real 0m6.564s
user 0m6.330s
sys 0m0.210s

vs

time mpiexec -np 10 python ./example/src/pypastix/pastix.py --size 200000 --type double --sym
real 0m7.119s
user 1m3.980s
sys 0m1.560s
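
To make the two `time` results easier to compare, here is a minimal sketch that converts the figures above to seconds and divides the user CPU time by the number of MPI processes. The helper name `to_seconds` and the `runs` dictionary are illustrative only and are not part of pypastix.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a shell `time` value such as '1m3.980s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unexpected time format: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Values copied from the two runs above, keyed by the -np argument.
runs = {
    1: {"real": "0m6.564s", "user": "0m6.330s"},
    10: {"real": "0m7.119s", "user": "1m3.980s"},
}

for nprocs, times in runs.items():
    real = to_seconds(times["real"])
    user = to_seconds(times["user"])
    # CPU seconds burned per MPI process, and average number of busy cores.
    print(f"np={nprocs:2d}  real={real:.3f}s  user={user:.2f}s  "
          f"user/proc={user / nprocs:.2f}s  busy cores={user / real:.1f}")
```

With 10 processes, each one uses about as much CPU time as the single-process run does in total. That would be consistent with every rank repeating roughly the same serial work, although these figures alone do not prove it.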

Below are full outputs from these two runs.

$ time mpiexec -np 1 python ./example/src/pypastix/pastix.py --size 200000 --type double --sym
1e-12
Check : Numbering OK
Check : Sort CSC OK
Check : Duplicates OK
pouet 1e-12
AUTOSPLIT_COMM : global rank : 0, inter node rank 0, intra node rank 0, threads 1
+--------------------------------------------------------------------+
+ PaStiX : Parallel Sparse matriX package +
+--------------------------------------------------------------------+
Matrix size 200000 x 200000
Number of nonzeros in A 399999
+--------------------------------------------------------------------+
+ Options +
+--------------------------------------------------------------------+
Version : 5.2.2.22
SMP_SOPALIN : Defined
VERSION MPI : Defined
PASTIX_DYNSCHED : Not defined
STATS_SOPALIN : Not defined
NAPA_SOPALIN : Defined
TEST_IRECV : Not defined
TEST_ISEND : Defined
TAG : Exact Thread
FORCE_CONSO : Not defined
RECV_FANIN_OR_BLOCK : Not defined
OUT_OF_CORE : Not defined
DISTRIBUTED : Defined
METIS : Not defined
WITH_SCOTCH : Defined
INTEGER TYPE : int
PASTIX_FLOAT TYPE : double
+--------------------------------------------------------------------+
Check : Numbering OK
Check : Sort CSC OK
Check : Duplicates OK
Ordering :
> Symmetrizing graph
> Removing diag
> Initiating ordering
Scotch direct strategy
Time to compute ordering 0.554 s
Symbolic Factorization :
Analyse :
Number of cluster 1
Number of processor per cluster 1
Number of thread number per MPI process 1
Building elimination graph
Building cost matrix
Building elimination tree
Total cost of the elimination tree 0.13597
Spliting initial partition
Using proportionnal mapping
Total cost of the elimination tree 0.13597
** New Partition: cblknbr= 16365 bloknbr= 49024 ratio=2.995661 **
Factorization of the new symbol matrix by Crout blok algo takes : 2.23203e+07
Re-Building elimination graph
Building task graph
Number of tasks 16365
Distributing partition
0 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 64 (1 x 64)
COEFMAX 1088 CPFTMAX 0 BPFTMAX 0 NBFTMAX 0 ARFTMAX 0
** End of Partition & Distribution phase **
Time to analyze 0.0246 s
Number of nonzeros in factorized matrix 1732552
Fill-in 4.33139
Number of operations (LLt) 2.2871e+07
Prediction Time to factorize (AMD 6180 MKL) 0.136 s
0 : SolverMatrix size (without coefficients) 5.22 Mo
0 : Number of nonzeros (local block structure) 3265317
Maximum coeftab size (cefficients) 24.9 Mo
Numerical Factorization (LDLt) :
Time to fill internal csc 0.0161 s
--- Sopalin : Allocation de la structure globale ---
--- Fin Sopalin Init ---
--- Initialisation des tableaux globaux ---
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
Launching 1 threads (1 commputation, 0 communication, 0 out-of-core)
--- Sopalin : Local structure allocation ---
--- Sopalin : Threads are binded ---
--- Sopalin Begin ---
--- Sopalin End ---
[0][0] Factorization communication time : 0 s
--- Fin Sopalin Init ---
0:0 up_down_smp
--- Sopalin : Local structure allocation ---
[0][0] Solve initialization time : 0.000991344 s
--- Down Step ---
--- Diag Step ---
--- Up Step ---
[0][0] Solve communication time : 0 s
- iteration 1 :
time to solve 0 s
total iteration time 0.0422 s
error 4.0898e-14
Static pivoting 0
Inertia 200000
Time to factorize 0.0469 s
FLOPS during factorization 464.96 MFLOPS
Time to solve 0.0328 s
Refinement 1 iterations, norm=4.09e-14
Time for refinement 0.0502 s
||b-Ax||/||b|| 4.09e-14
max_i(|b-Ax|_i/(|b| + |A||x|)_i 1.67e-16
Number of iterations : 1
Relative error : 4.08981e-14
Scaled residual : 1.66533e-16
||Ax-b||/||b|| = 6.66133814775e-16
max_i(|Ax-b|_i/(|A||x| + |b|)_i) 1.66533459315e-16

real 0m6.564s
user 0m6.330s
sys 0m0.210s

-------------------------------------------------------------------------------------------------------

$ time mpiexec -np 10 python ./example/src/pypastix/pastix.py --size 200000 --type double --sym
1e-12
1e-12
1e-12
1e-12
1e-12
1e-12
1e-12
1e-12
1e-12
Check : Numbering OK
Check : Sort CSC OK
Check : Duplicates OK
1e-12
pouet 1e-12
pouet 1e-12
pouet 1e-12
pouet 1e-12
pouet 1e-12
pouet 1e-12
pouet 1e-12
pouet 1e-12
AUTOSPLIT_COMM : global rank : 0, inter node rank 0, intra node rank 0, threads 1
+--------------------------------------------------------------------+
+ PaStiX : Parallel Sparse matriX package +
+--------------------------------------------------------------------+
Matrix size 200000 x 200000
Number of nonzeros in A 399999
+--------------------------------------------------------------------+
+ Options +
+--------------------------------------------------------------------+
Version : 5.2.2.22
SMP_SOPALIN : Defined
AUTOSPLIT_COMM : global rank : 1, inter node rank 1, intra node rank 0, threads 1
AUTOSPLIT_COMM : global rank : 9, inter node rank 9, intra node rank 0, threads 1
AUTOSPLIT_COMM : global rank : 5, inter node rank 5, intra node rank 0, threads 1
VERSION MPI : Defined
PASTIX_DYNSCHED : Not defined
STATS_SOPALIN : Not defined
NAPA_SOPALIN : Defined
TEST_IRECV : Not defined
TEST_ISEND : Defined
TAG : Exact Thread
FORCE_CONSO : Not defined
RECV_FANIN_OR_BLOCK : Not defined
OUT_OF_CORE : Not defined
DISTRIBUTED : Defined
METIS : Not defined
WITH_SCOTCH : Defined
INTEGER TYPE : int
PASTIX_FLOAT TYPE : double
+--------------------------------------------------------------------+
Check : Numbering OK
AUTOSPLIT_COMM : global rank : 8, inter node rank 8, intra node rank 0, threads 1
AUTOSPLIT_COMM : global rank : 4, inter node rank 4, intra node rank 0, threads 1
Check : Sort CSC OK
Check : Duplicates OK
AUTOSPLIT_COMM : global rank : 3, inter node rank 3, intra node rank 0, threads 1
AUTOSPLIT_COMM : global rank : 2, inter node rank 2, intra node rank 0, threads 1
pouet 1e-12
AUTOSPLIT_COMM : global rank : 6, inter node rank 6, intra node rank 0, threads 1
pouet 1e-12
Ordering :
> Symmetrizing graph
AUTOSPLIT_COMM : global rank : 7, inter node rank 7, intra node rank 0, threads 1
> Removing diag
> Initiating ordering
Scotch direct strategy
Time to compute ordering 0.576 s
Symbolic Factorization :
Analyse :
Number of cluster 10
Number of processor per cluster 1
Number of thread number per MPI process 1
Total cost of the elimination tree 0.13597
Total cost of the elimination tree 0.13597
Building elimination graph
Total cost of the elimination tree 0.13597
Building cost matrix
Total cost of the elimination tree 0.13597
Total cost of the elimination tree 0.13597
Distributing partition
7 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
3 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
8 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
9 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
1 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
5 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
4 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
6 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
2 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 64 (1 x 64)
0 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 64 (1 x 64)
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 128 (2 x 64)
COEFMAX 1088 CPFTMAX 4 BPFTMAX 0 NBFTMAX 4 ARFTMAX 255
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 64 (1 x 64)
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 64 (1 x 64)
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 64 (1 x 64)
COEFMAX 1088 CPFTMAX 4 BPFTMAX 0 NBFTMAX 4 ARFTMAX 255
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 64 (1 x 64)
COEFMAX 1088 CPFTMAX 4 BPFTMAX 0 NBFTMAX 4 ARFTMAX 255
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 64 (1 x 64)
7 : SolverMatrix size (without coefficients) 594 Ko
7 : Number of nonzeros (local block structure) 325530
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 64 (1 x 64)
COEFMAX 1088 CPFTMAX 4 BPFTMAX 0 NBFTMAX 4 ARFTMAX 255
3 : SolverMatrix size (without coefficients) 593 Ko
3 : Number of nonzeros (local block structure) 325888
COEFMAX 1088 CPFTMAX 4 BPFTMAX 0 NBFTMAX 4 ARFTMAX 255
COEFMAX 1088 CPFTMAX 4 BPFTMAX 0 NBFTMAX 4 ARFTMAX 255
COEFMAX 1088 CPFTMAX 4 BPFTMAX 0 NBFTMAX 4 ARFTMAX 255
8 : SolverMatrix size (without coefficients) 593 Ko
8 : Number of nonzeros (local block structure) 328055
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 64 (1 x 64)
COEFMAX 1088 CPFTMAX 4 BPFTMAX 0 NBFTMAX 4 ARFTMAX 255
1 : SolverMatrix size (without coefficients) 592 Ko
1 : Number of nonzeros (local block structure) 326502
9 : SolverMatrix size (without coefficients) 592 Ko
9 : Number of nonzeros (local block structure) 325139
COEFMAX 1088 CPFTMAX 4 BPFTMAX 0 NBFTMAX 4 ARFTMAX 255
5 : SolverMatrix size (without coefficients) 591 Ko
5 : Number of nonzeros (local block structure) 327208
4 : SolverMatrix size (without coefficients) 595 Ko
4 : Number of nonzeros (local block structure) 326298
COEFMAX 1088 CPFTMAX 4 BPFTMAX 0 NBFTMAX 4 ARFTMAX 255
6 : SolverMatrix size (without coefficients) 593 Ko
6 : Number of nonzeros (local block structure) 325749
2 : SolverMatrix size (without coefficients) 593 Ko
2 : Number of nonzeros (local block structure) 327128
** End of Partition & Distribution phase **
Time to analyze 0.0371 s
Number of nonzeros in factorized matrix 1732552
Fill-in 4.33139
Number of operations (LLt) 2.2871e+07
Prediction Time to factorize (AMD 6180 MKL) 0.0136 s
0 : SolverMatrix size (without coefficients) 591 Ko
0 : Number of nonzeros (local block structure) 327820
Maximum coeftab size (cefficients) 2.5 Mo
Numerical Factorization (LDLt) :
Time to fill internal csc 0.00655 s
--- Sopalin : Allocation de la structure globale ---
--- Fin Sopalin Init ---
--- Initialisation des tableaux globaux ---
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
Launching 1 threads (1 commputation, 0 communication, 0 out-of-core)
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
--- Sopalin : Local structure allocation ---
[5][0] Factorization communication time : 0 s
[8][0] Factorization communication time : 0 s
[4][0] Factorization communication time : 0 s
[7][0] Factorization communication time : 0 s
[1][0] Factorization communication time : 0 s
[3][0] Factorization communication time : 0 s
[6][0] Factorization communication time : 0 s
[9][0] Factorization communication time : 0 s
[2][0] Factorization communication time : 0 s
--- Fin Sopalin Init ---
4:0 up_down_smp
7:0 up_down_smp
1:0 up_down_smp
3:0 up_down_smp
6:0 up_down_smp
0:0 up_down_smp
--- Sopalin : Local structure allocation ---
8:0 up_down_smp
9:0 up_down_smp
2:0 up_down_smp
[2][0] Solve initialization time : 0.00108504 s
5:0 up_down_smp
[5][0] Solve initialization time : 0.00100255 s
[9][0] Solve initialization time : 0.000620127 s
[0][0] Solve initialization time : 0.00103211 s
--- Down Step ---
[8][0] Solve initialization time : 0.00063777 s
[6][0] Solve initialization time : 0.000626802 s
[4][0] Solve initialization time : 0.000999451 s
[7][0] Solve initialization time : 0.000626564 s
[1][0] Solve initialization time : 0.00107503 s
[3][0] Solve initialization time : 0.00100756 s
--- Diag Step ---
--- Up Step ---
[9][0] Solve communication time : 5.05447e-05 s
[7][0] Solve communication time : 1.90735e-06 s
[8][0] Solve communication time : 1.90735e-06 s
[6][0] Solve communication time : 0.000179529 s
[4][0] Solve communication time : 9.53674e-07 s
[1][0] Solve communication time : 1.88351e-05 s
[5][0] Solve communication time : 3.8147e-06 s
[2][0] Solve communication time : 1.43051e-06 s
[0][0] Solve communication time : 8.82149e-06 s
[3][0] Solve communication time : 1.19209e-06 s
- iteration 1 :
time to solve 0 s
total iteration time 0.0149 s
error 4.0798e-14
Static pivoting 0
Inertia 200000
Time to factorize 0.00579 s
FLOPS during factorization 3.6803 GFLOPS
Time to solve 0.00561 s
Refinement 1 iterations, norm=4.08e-14
Time for refinement 0.0239 s
||b-Ax||/||b|| 4.08e-14
max_i(|b-Ax|_i/(|b| + |A||x|)_i 1.67e-16
Number of iterations : 1
Relative error : 4.07979e-14
Scaled residual : 1.66533e-16
||Ax-b||/||b|| = 7.77156117238e-16
max_i(|Ax-b|_i/(|A||x| + |b|)_i) 1.94289038414e-16

real 0m7.119s
user 1m3.980s
sys 0m1.560s

Thread View

Thread | Author | Date
Testing pypastix and no speed up observed on multiple cores | Maciek Sykulski | 2016-04-07 20:20
      RE: Testing pypastix and no speed up observed on multiple cores | Mathieu Faverge | 2016-04-07 20:43
            RE: Testing pypastix and no speed up observed on multiple cores | Maciek Sykulski | 2016-04-12 17:25
                  RE: Testing pypastix and no speed up observed on multiple cores | Pierre Ramet | 2016-04-13 15:01
                        RE: Testing pypastix and no speed up observed on multiple cores | Maciek Sykulski | 2016-04-16 13:34
                        RE: Testing pypastix and no speed up observed on multiple cores | Maciek Sykulski | 2016-04-19 14:48
