
Forum: help

RE: Testing pypastix and no speed up observed on multiple cores
By: Maciek Sykulski on 2016-04-19 14:48
[forum:150467]
Hello Pierre,

When running
mpiexec -np 1 python ./example/src/pypastix/pastix.py --size 200000 --type double --sym --iparm IPARM_THREAD_NBR 24

or when setting pastix.setIparm(pastix.API.IPARM_THREAD_NBR, 24), the computing process still uses only one core, so there is no speedup.

Best regards,
Maciek
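
(A quick hedged check for the situation above: the setIparm call comes from this thread, but the getIparm read-back is an assumption; a symmetric read accessor may or may not exist in the pypastix wrapper. The PaStiX banner itself also reports the effective value, as "Number of thread number per MPI process" and "NUMBER of THREAD" in the logs further down.)

# Hedged sketch: verify that the thread count actually reached the solver.
# setIparm / IPARM_THREAD_NBR appear in this thread; getIparm is an
# ASSUMPTION (a read accessor symmetric to setIparm) and may not exist.
pastix.setIparm(pastix.API.IPARM_THREAD_NBR, 24)
print("IPARM_THREAD_NBR =", pastix.getIparm(pastix.API.IPARM_THREAD_NBR))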

RE: Testing pypastix and no speed up observed on multiple cores
By: Maciek Sykulski on 2016-04-16 13:34
[forum:150466]
Hello Pierre,

When I run
mpiexec -np 1 python ./example/src/pypastix/pastix.py --size 200000 --type double --sym --iparm IPARM_THREAD_NBR 24
or when I set pastix.setIparm(pastix.API.IPARM_THREAD_NBR, 24) I observe only one core doing all the computation.

Best regards,
Maciek

RE: Testing pypastix and no speed up observed on multiple cores
By: Pierre Ramet on 2016-04-13 15:01
[forum:150460]
Hello Maciek,
Before using bindtab (maybe not yet available in the Python API of PaStiX), could you first try to run pypastix within a single MPI process and increase the number of threads (from 1 to 24 cores)? For instance:
mpiexec -np 1 python ./example/src/pypastix/pastix.py --size 200000 --type double --sym --iparm IPARM_THREAD_NBR 1
mpiexec -np 1 python ./example/src/pypastix/pastix.py --size 200000 --type double --sym --iparm IPARM_THREAD_NBR 24

You can also add the following line in the driver pastix.py:
pastix.setIparm(pastix.API.IPARM_THREAD_NBR, 24)

Best,
Pierre.
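
For reference, the suggested scaling test can be scripted; this is a minimal sketch that assumes only the driver path and the --iparm flag shown in the commands above (Python 3.5+ for subprocess.run):

# Minimal sketch of the thread-scaling test suggested above: one MPI
# process, increasing IPARM_THREAD_NBR each run.
import subprocess
import time

for nthreads in (1, 2, 4, 8, 16, 24):
    cmd = ["mpiexec", "-np", "1", "python",
           "./example/src/pypastix/pastix.py",
           "--size", "200000", "--type", "double", "--sym",
           "--iparm", "IPARM_THREAD_NBR", str(nthreads)]
    t0 = time.time()
    subprocess.run(cmd, check=True)  # PaStiX log prints to stdout as usual
    print("%d thread(s): %.2f s wall time" % (nthreads, time.time() - t0))

If the factorization time does not drop as nthreads grows, the thread setting is not taking effect, which is what the follow-up posts above report.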

RE: Testing pypastix and no speed up observed on multiple cores
By: Maciek Sykulski on 2016-04-12 17:25
[forum:150459]
Hello Mathieu,

I am on a single node with 24 cores. When I inspect htop during the computation, many cores are in use at the same time (>1000% processor usage).

If this is indeed a problem with thread binding, can you give me some more clues on how to change it? Where and how do I use the bindtab array?

Best regards,
Maciek

RE: Testing pypastix and no speed up observed on multiple cores
By: Mathieu Faverge on 2016-04-07 20:43
[forum:150441]
Hello,

Are you on one node or multiple nodes? If you are using only one node, it is due to the thread binding: they are all bound to the same core. You can change that by using the bindtab array to specify where to bind the threads of each of your processes.

Best
Mathieu
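
For what it's worth, a heavily hedged sketch of what bindtab usage might look like from Python. Only pastix.setIparm and pastix.API.IPARM_THREAD_NBR appear elsewhere in this thread; the binding-mode iparm, the API_BIND_TAB constant, and the setBind call are assumptions modeled on the PaStiX 5 C interface and may be named differently, or be missing, in the pypastix wrapper:

# Hedged sketch only: pin one computation thread per core on a 24-core node.
# setIparm / IPARM_THREAD_NBR come from this thread; IPARM_BINDTHRD,
# API_BIND_TAB and setBind are ASSUMPTIONS based on the C API and must be
# checked against the pypastix sources.
nthreads = 24
pastix.setIparm(pastix.API.IPARM_THREAD_NBR, nthreads)
pastix.setIparm(pastix.API.IPARM_BINDTHRD, pastix.API.API_BIND_TAB)  # assumed names
bindtab = list(range(nthreads))  # thread i -> core i
pastix.setBind(bindtab)          # assumed wrapper around the C bindtab argument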

Testing pypastix and no speed up observed on multiple cores
By: Maciek Sykulski on 2016-04-07 20:20
[forum:150440]
Hello,
I'm testing my pypastix installation, and I'm not sure everything is OK: running ./example/src/pypastix/pastix.py on 1 core is faster than running it on 10 cores. Why do I not observe any speedup when solving a matrix on many cores?

time mpiexec -np 1 python ./example/src/pypastix/pastix.py --size 200000 --type double --sym
real 0m6.564s
user 0m6.330s
sys 0m0.210s

vs

time mpiexec -np 10 python ./example/src/pypastix/pastix.py --size 200000 --type double --sym
real 0m7.119s
user 1m3.980s
sys 0m1.560s

Below are full outputs from these two runs.

$ time mpiexec -np 1 python ./example/src/pypastix/pastix.py --size 200000 --type double --sym
1e-12
Check : Numbering OK
Check : Sort CSC OK
Check : Duplicates OK
pouet 1e-12
AUTOSPLIT_COMM : global rank : 0, inter node rank 0, intra node rank 0, threads 1
+--------------------------------------------------------------------+
+ PaStiX : Parallel Sparse matriX package +
+--------------------------------------------------------------------+
Matrix size 200000 x 200000
Number of nonzeros in A 399999
+--------------------------------------------------------------------+
+ Options +
+--------------------------------------------------------------------+
Version : 5.2.2.22
SMP_SOPALIN : Defined
VERSION MPI : Defined
PASTIX_DYNSCHED : Not defined
STATS_SOPALIN : Not defined
NAPA_SOPALIN : Defined
TEST_IRECV : Not defined
TEST_ISEND : Defined
TAG : Exact Thread
FORCE_CONSO : Not defined
RECV_FANIN_OR_BLOCK : Not defined
OUT_OF_CORE : Not defined
DISTRIBUTED : Defined
METIS : Not defined
WITH_SCOTCH : Defined
INTEGER TYPE : int
PASTIX_FLOAT TYPE : double
+--------------------------------------------------------------------+
Check : Numbering OK
Check : Sort CSC OK
Check : Duplicates OK
Ordering :
> Symmetrizing graph
> Removing diag
> Initiating ordering
Scotch direct strategy
Time to compute ordering 0.554 s
Symbolic Factorization :
Analyse :
Number of cluster 1
Number of processor per cluster 1
Number of thread number per MPI process 1
Building elimination graph
Building cost matrix
Building elimination tree
Total cost of the elimination tree 0.13597
Spliting initial partition
Using proportionnal mapping
Total cost of the elimination tree 0.13597
** New Partition: cblknbr= 16365 bloknbr= 49024 ratio=2.995661 **
Factorization of the new symbol matrix by Crout blok algo takes : 2.23203e+07
Re-Building elimination graph
Building task graph
Number of tasks 16365
Distributing partition
0 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 64 (1 x 64)
COEFMAX 1088 CPFTMAX 0 BPFTMAX 0 NBFTMAX 0 ARFTMAX 0
** End of Partition & Distribution phase **
Time to analyze 0.0246 s
Number of nonzeros in factorized matrix 1732552
Fill-in 4.33139
Number of operations (LLt) 2.2871e+07
Prediction Time to factorize (AMD 6180 MKL) 0.136 s
0 : SolverMatrix size (without coefficients) 5.22 Mo
0 : Number of nonzeros (local block structure) 3265317
Maximum coeftab size (cefficients) 24.9 Mo
Numerical Factorization (LDLt) :
Time to fill internal csc 0.0161 s
--- Sopalin : Allocation de la structure globale ---
--- Fin Sopalin Init ---
--- Initialisation des tableaux globaux ---
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
Launching 1 threads (1 commputation, 0 communication, 0 out-of-core)
--- Sopalin : Local structure allocation ---
--- Sopalin : Threads are binded ---
--- Sopalin Begin ---
--- Sopalin End ---
[0][0] Factorization communication time : 0 s
--- Fin Sopalin Init ---
0:0 up_down_smp
--- Sopalin : Local structure allocation ---
[0][0] Solve initialization time : 0.000991344 s
--- Down Step ---
--- Diag Step ---
--- Up Step ---
[0][0] Solve communication time : 0 s
- iteration 1 :
time to solve 0 s
total iteration time 0.0422 s
error 4.0898e-14
Static pivoting 0
Inertia 200000
Time to factorize 0.0469 s
FLOPS during factorization 464.96 MFLOPS
Time to solve 0.0328 s
Refinement 1 iterations, norm=4.09e-14
Time for refinement 0.0502 s
||b-Ax||/||b|| 4.09e-14
max_i(|b-Ax|_i/(|b| + |A||x|)_i 1.67e-16
Number of iterations : 1
Relative error : 4.08981e-14
Scaled residual : 1.66533e-16
||Ax-b||/||b|| = 6.66133814775e-16
max_i(|Ax-b|_i/(|A||x| + |b|)_i) 1.66533459315e-16

real 0m6.564s
user 0m6.330s
sys 0m0.210s

-------------------------------------------------------------------------------------------------------

$ time mpiexec -np 10 python ./example/src/pypastix/pastix.py --size 200000 --type double --sym
1e-12
1e-12
1e-12
1e-12
1e-12
1e-12
1e-12
1e-12
1e-12
Check : Numbering OK
Check : Sort CSC OK
Check : Duplicates OK
1e-12
pouet 1e-12
pouet 1e-12
pouet 1e-12
pouet 1e-12
pouet 1e-12
pouet 1e-12
pouet 1e-12
pouet 1e-12
AUTOSPLIT_COMM : global rank : 0, inter node rank 0, intra node rank 0, threads 1
+--------------------------------------------------------------------+
+ PaStiX : Parallel Sparse matriX package +
+--------------------------------------------------------------------+
Matrix size 200000 x 200000
Number of nonzeros in A 399999
+--------------------------------------------------------------------+
+ Options +
+--------------------------------------------------------------------+
Version : 5.2.2.22
SMP_SOPALIN : Defined
AUTOSPLIT_COMM : global rank : 1, inter node rank 1, intra node rank 0, threads 1
AUTOSPLIT_COMM : global rank : 9, inter node rank 9, intra node rank 0, threads 1
AUTOSPLIT_COMM : global rank : 5, inter node rank 5, intra node rank 0, threads 1
VERSION MPI : Defined
PASTIX_DYNSCHED : Not defined
STATS_SOPALIN : Not defined
NAPA_SOPALIN : Defined
TEST_IRECV : Not defined
TEST_ISEND : Defined
TAG : Exact Thread
FORCE_CONSO : Not defined
RECV_FANIN_OR_BLOCK : Not defined
OUT_OF_CORE : Not defined
DISTRIBUTED : Defined
METIS : Not defined
WITH_SCOTCH : Defined
INTEGER TYPE : int
PASTIX_FLOAT TYPE : double
+--------------------------------------------------------------------+
Check : Numbering OK
AUTOSPLIT_COMM : global rank : 8, inter node rank 8, intra node rank 0, threads 1
AUTOSPLIT_COMM : global rank : 4, inter node rank 4, intra node rank 0, threads 1
Check : Sort CSC OK
Check : Duplicates OK
AUTOSPLIT_COMM : global rank : 3, inter node rank 3, intra node rank 0, threads 1
AUTOSPLIT_COMM : global rank : 2, inter node rank 2, intra node rank 0, threads 1
pouet 1e-12
AUTOSPLIT_COMM : global rank : 6, inter node rank 6, intra node rank 0, threads 1
pouet 1e-12
Ordering :
> Symmetrizing graph
AUTOSPLIT_COMM : global rank : 7, inter node rank 7, intra node rank 0, threads 1
> Removing diag
> Initiating ordering
Scotch direct strategy
Time to compute ordering 0.576 s
Symbolic Factorization :
Analyse :
Number of cluster 10
Number of processor per cluster 1
Number of thread number per MPI process 1
Total cost of the elimination tree 0.13597
Total cost of the elimination tree 0.13597
Building elimination graph
Total cost of the elimination tree 0.13597
Building cost matrix
Total cost of the elimination tree 0.13597
Total cost of the elimination tree 0.13597
Distributing partition
7 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
3 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
8 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
9 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
1 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
5 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
4 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
6 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
2 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 64 (1 x 64)
0 : Genering final SolverMatrix
NUMBER of THREAD 1
NUMBER of BUBBLE 1
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 64 (1 x 64)
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 128 (2 x 64)
COEFMAX 1088 CPFTMAX 4 BPFTMAX 0 NBFTMAX 4 ARFTMAX 255
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 64 (1 x 64)
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 64 (1 x 64)
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 64 (1 x 64)
COEFMAX 1088 CPFTMAX 4 BPFTMAX 0 NBFTMAX 4 ARFTMAX 255
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 64 (1 x 64)
COEFMAX 1088 CPFTMAX 4 BPFTMAX 0 NBFTMAX 4 ARFTMAX 255
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 64 (1 x 64)
7 : SolverMatrix size (without coefficients) 594 Ko
7 : Number of nonzeros (local block structure) 325530
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 64 (1 x 64)
COEFMAX 1088 CPFTMAX 4 BPFTMAX 0 NBFTMAX 4 ARFTMAX 255
3 : SolverMatrix size (without coefficients) 593 Ko
3 : Number of nonzeros (local block structure) 325888
COEFMAX 1088 CPFTMAX 4 BPFTMAX 0 NBFTMAX 4 ARFTMAX 255
COEFMAX 1088 CPFTMAX 4 BPFTMAX 0 NBFTMAX 4 ARFTMAX 255
COEFMAX 1088 CPFTMAX 4 BPFTMAX 0 NBFTMAX 4 ARFTMAX 255
8 : SolverMatrix size (without coefficients) 593 Ko
8 : Number of nonzeros (local block structure) 328055
Actual coefmax = 255 (17 x 15)
New suggested coefmax = 1088 (17 x 64)
Max diagblock coefmax without shur = 960 (15 x 64)
Max diagblock on shur = 64 (1 x 64)
COEFMAX 1088 CPFTMAX 4 BPFTMAX 0 NBFTMAX 4 ARFTMAX 255
1 : SolverMatrix size (without coefficients) 592 Ko
1 : Number of nonzeros (local block structure) 326502
9 : SolverMatrix size (without coefficients) 592 Ko
9 : Number of nonzeros (local block structure) 325139
COEFMAX 1088 CPFTMAX 4 BPFTMAX 0 NBFTMAX 4 ARFTMAX 255
5 : SolverMatrix size (without coefficients) 591 Ko
5 : Number of nonzeros (local block structure) 327208
4 : SolverMatrix size (without coefficients) 595 Ko
4 : Number of nonzeros (local block structure) 326298
COEFMAX 1088 CPFTMAX 4 BPFTMAX 0 NBFTMAX 4 ARFTMAX 255
6 : SolverMatrix size (without coefficients) 593 Ko
6 : Number of nonzeros (local block structure) 325749
2 : SolverMatrix size (without coefficients) 593 Ko
2 : Number of nonzeros (local block structure) 327128
** End of Partition & Distribution phase **
Time to analyze 0.0371 s
Number of nonzeros in factorized matrix 1732552
Fill-in 4.33139
Number of operations (LLt) 2.2871e+07
Prediction Time to factorize (AMD 6180 MKL) 0.0136 s
0 : SolverMatrix size (without coefficients) 591 Ko
0 : Number of nonzeros (local block structure) 327820
Maximum coeftab size (cefficients) 2.5 Mo
Numerical Factorization (LDLt) :
Time to fill internal csc 0.00655 s
--- Sopalin : Allocation de la structure globale ---
--- Fin Sopalin Init ---
--- Initialisation des tableaux globaux ---
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
Launching 1 threads (1 commputation, 0 communication, 0 out-of-core)
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
Pivoting criterium (||A||*sqrt(epsilon)) = 1.26491e-15
--- Sopalin : Local structure allocation ---
[5][0] Factorization communication time : 0 s
[8][0] Factorization communication time : 0 s
[4][0] Factorization communication time : 0 s
[7][0] Factorization communication time : 0 s
[1][0] Factorization communication time : 0 s
[3][0] Factorization communication time : 0 s
[6][0] Factorization communication time : 0 s
[9][0] Factorization communication time : 0 s
[2][0] Factorization communication time : 0 s
--- Fin Sopalin Init ---
4:0 up_down_smp
7:0 up_down_smp
1:0 up_down_smp
3:0 up_down_smp
6:0 up_down_smp
0:0 up_down_smp
--- Sopalin : Local structure allocation ---
8:0 up_down_smp
9:0 up_down_smp
2:0 up_down_smp
[2][0] Solve initialization time : 0.00108504 s
5:0 up_down_smp
[5][0] Solve initialization time : 0.00100255 s
[9][0] Solve initialization time : 0.000620127 s
[0][0] Solve initialization time : 0.00103211 s
--- Down Step ---
[8][0] Solve initialization time : 0.00063777 s
[6][0] Solve initialization time : 0.000626802 s
[4][0] Solve initialization time : 0.000999451 s
[7][0] Solve initialization time : 0.000626564 s
[1][0] Solve initialization time : 0.00107503 s
[3][0] Solve initialization time : 0.00100756 s
--- Diag Step ---
--- Up Step ---
[9][0] Solve communication time : 5.05447e-05 s
[7][0] Solve communication time : 1.90735e-06 s
[8][0] Solve communication time : 1.90735e-06 s
[6][0] Solve communication time : 0.000179529 s
[4][0] Solve communication time : 9.53674e-07 s
[1][0] Solve communication time : 1.88351e-05 s
[5][0] Solve communication time : 3.8147e-06 s
[2][0] Solve communication time : 1.43051e-06 s
[0][0] Solve communication time : 8.82149e-06 s
[3][0] Solve communication time : 1.19209e-06 s
- iteration 1 :
time to solve 0 s
total iteration time 0.0149 s
error 4.0798e-14
Static pivoting 0
Inertia 200000
Time to factorize 0.00579 s
FLOPS during factorization 3.6803 GFLOPS
Time to solve 0.00561 s
Refinement 1 iterations, norm=4.08e-14
Time for refinement 0.0239 s
||b-Ax||/||b|| 4.08e-14
max_i(|b-Ax|_i/(|b| + |A||x|)_i 1.67e-16
Number of iterations : 1
Relative error : 4.07979e-14
Scaled residual : 1.66533e-16
||Ax-b||/||b|| = 7.77156117238e-16
max_i(|Ax-b|_i/(|A||x| + |b|)_i) 1.94289038414e-16

real 0m7.119s
user 1m3.980s
sys 0m1.560s