
Forum: help

RE: MPI deadlock with NOSMP version [ Reply ]
By: Xavier Lacoste on 2013-02-28 15:14
[forum:110637]
Nice,

I added the PGI.in file; it will be in the next release (along with the pgc++ build fix from the patch).

Have a nice day,

Regards,

XL.

RE: MPI deadlock with NOSMP version [ Reply ]
By: Sven Dunston on 2013-02-28 14:32
[forum:110636]

Of course, the error at the end turned out to be rather stupid. I was assuming that PaStiX distributes the matrix arrays to the other ranks. Of course it doesn't, and ia, ja, aval, ... were all 0 on every rank but the master rank.

Thanks for your help, and I hope it was also helpful on your side.
By the way, I'm now using PaStiX with mpich-3.0.2, and it also works with this version!
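For anyone hitting the same issue, here is a minimal sketch (hypothetical names; it assumes every rank has already allocated the arrays with the same sizes) of making the matrix visible on all ranks by broadcasting it from rank 0 before calling PaStiX:

subroutine broadcast_matrix(n, nnz, ia, ja, avals, rhs)
  ! Sketch: broadcast the CSC arrays from rank 0 so that every rank
  ! sees the same matrix before the PaStiX call.
  use mpi
  implicit none
  integer, intent(in)             :: n, nnz
  integer, intent(inout)          :: ia(n+1), ja(nnz)
  double precision, intent(inout) :: avals(nnz), rhs(n)
  integer :: ierr

  call MPI_Bcast(ia,    n+1, MPI_INTEGER,          0, MPI_COMM_WORLD, ierr)
  call MPI_Bcast(ja,    nnz, MPI_INTEGER,          0, MPI_COMM_WORLD, ierr)
  call MPI_Bcast(avals, nnz, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)
  call MPI_Bcast(rhs,   n,   MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)
end subroutine broadcast_matrix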

RE: MPI deadlock with NOSMP version [ Reply ]
By: Xavier Lacoste on 2013-02-28 07:55
[forum:110632]
Good idea, I'll add the PGI config.in to the repository.

MPI_Init should be sufficient, as you are not using threads in either Scotch or PaStiX.

Maybe you could run your application with valgrind to see whether there is any memory corruption.

When you say the matrix is reported as symmetric with -np 2, is it a PaStiX print that says so (the LLt/LDLt print)? Maybe the iparm IPARM_SYM is corrupted?
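For instance (a rough sketch, not PaStiX code: IPARM_SYM, API_SYM_YES and API_SYM_NO are the names from the PaStiX API headers, rank and iparm are your own variables), you could print the value on every rank just before the call:

! Sketch: verify that iparm(IPARM_SYM) is identical and sane on every rank.
write(*,*) 'rank ', rank, ' iparm(IPARM_SYM) = ', iparm(IPARM_SYM)
if (iparm(IPARM_SYM) /= API_SYM_YES .and. iparm(IPARM_SYM) /= API_SYM_NO) then
   write(*,*) 'rank ', rank, ': IPARM_SYM looks corrupted'
end if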

XL.

RE: MPI deadlock with NOSMP version [ Reply ]
By: Sven Dunston on 2013-02-27 16:19
[forum:110631]
For the example it works fine: I can run it with mpirun -np [1-4]. For my application, however, the problem still exists. After applying the patch and recompiling PaStiX, there were no more warnings with the PGI compiler, so that problem seems to be solved. I suggest you deliver the config.in file with the PaStiX software package in the future, so that it is straightforward to compile with the PGI compiler.

I will do some more investigation into why it works for the test case but not for my application. Maybe you have some additional suggestions. I'm using mpi_init instead of mpi_init_thread, but that did not turn out to be a problem with the test program either.
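For reference, the threaded initialization I could switch to (a minimal sketch using the standard MPI Fortran interface; required and provided are local integers) would be:

! Sketch: request threaded MPI initialization instead of plain MPI_Init.
integer :: required, provided, ierr
required = MPI_THREAD_MULTIPLE
call MPI_Init_thread(required, provided, ierr)
if (provided < required) then
   write(*,*) 'warning: MPI only provides thread support level ', provided
end if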
Thank you very much so far!

RE: MPI deadlock with NOSMP version [ Reply ]
By: Xavier Lacoste on 2013-02-27 13:32
[forum:110628]

Attachment: patch.tar.gz
Sorry, should be better now...

RE: MPI deadlock with NOSMP version [ Reply ]
By: Sven Dunston on 2013-02-27 13:26
[forum:110627]
no tarball attached...

RE: MPI deadlock with NOSMP version [ Reply ]
By: Xavier Lacoste on 2013-02-27 13:20
[forum:110626]
I used your source.F90 file to build a PaStiX example.

I attached a tarball containing test.F90 and test_pgi.patch.

Can you copy the test.F90 file into examples/src and apply the patch using
cd pastix_release_3999
patch -p0 < test_pgi.patch

(The patch only adds test.F90 to the files to build and fixes some build errors with pgc++ and complex.)

Then build the examples using:
make examples

You can then run the example in a directory containing your .csv files using

mpirun -np 4 ../example/bin/test -lap 15129

I did not encounter any problems with that example.

Can you tell me if your problem also appears with this example?

XL.

RE: MPI deadlock with NOSMP version [ Reply ]
By: Sven Dunston on 2013-02-27 12:40
[forum:110625]
Ok, I implemented the duplicates check in my program. With the corrected matrix, no refinement is required anymore and it runs without error on one CPU.
With 2 CPUs I get strange behavior: with one CPU the matrix is reported as unsymmetric, with 2 CPUs it is reported as symmetric. Moreover, I get the following warning: WARNING: 15129 numerical zeros found on the diagonal.

A bit later, it breaks with the same MPI problems that I described before.

Fatal error in MPI_Recv: Message truncated, error stack:
MPI_Recv(186): MPI_Recv(buf=0x35220f0, count=56, MPI_BYTE, src=MPI_ANY_SOURCE, tag=0, MPI_COMM_WORLD, status=0x7fff4f89ac90) failed
do_cts(509)..: Message truncated; 211640 bytes received but buffer size is 56

RE: MPI deadlock with NOSMP version [ Reply ]
By: Xavier Lacoste on 2013-02-27 10:45
[forum:110623]
One line is missing in my "check code" before the first call:

nnz = ia(n+1) - 1

This is why I got the match error in Scotch.

I still have other bugs to fix (a segfault in BLAS).

Can you try calling this before PaStiX:

nnz = iaCCS(n+1) - 1
call pastix_fortran_checkmatrix(check_data, pastix_comm, &
     verbose, flagsym, API_YES, n, iaCCS, jaCCS, avalsCCS, mun, un)

if (iaCCS(n+1) - 1 /= nnz) then
   deallocate(jaCCS)
   deallocate(avalsCCS)
   allocate(jaCCS(iaCCS(n+1)-1))
   allocate(avalsCCS(iaCCS(n+1)-1))
   call pastix_fortran_checkmatrix_end(check_data, &
        verbose, jaCCS, avalsCCS, 1)
endif

RE: MPI deadlock with NOSMP version [ Reply ]
By: Xavier Lacoste on 2013-02-27 10:33
[forum:110622]
Something I should have noticed from the start, sorry:

In your log you have:

Check : Duplicates KO

You need to remove duplicate entries from the matrix, or run pastix_fortran_checkmatrix() to do so.

For example, in your matrix I can see:
in iaCCS.csv:
15109,89246
15110,89255

in jaCCS.csv
89246,14987
89247,14988
89248,15110
89249,15110
89250,15110
89251,15110
89252,15110
89253,15110
89254,15111

You can see the row index 15110 repeated several times within the same column; such duplicate entries are not accepted by PaStiX (we should return a more explicit error).

This can be fixed with the following code:

call pastix_fortran_checkmatrix(check_data, pastix_comm, &
     verbose, flagsym, API_YES, n, ia, ja, avals, mun, un)

if (ia(n+1) - 1 /= nnz) then
   deallocate(ja)
   deallocate(avals)
   allocate(ja(ia(n+1)-1))
   allocate(avals(ia(n+1)-1))
   call pastix_fortran_checkmatrix_end(check_data, &
        verbose, ja, avals, 1)
endif

I did this with your matrix, but I still have a problem to fix with Scotch (this error occurs when the graph is not symmetric, but the symmetry has been checked... so I have to understand why: an error in our check, or an interface error?):

Check : Numbering OK
Check : Sort CSC OK
Check : Duplicates OK
2682 double terms merged
Check : Graph symmetry OK

WARNING: Number of non zeros has changed, don't forget to call pastix_fortran_checkmatrix_end

(0): ERROR: graphCheck: arc data do not match


XL.



RE: MPI deadlock with NOSMP version [ Reply ]
By: Sven Dunston on 2013-02-27 10:24
[forum:110621]

Attachment: mat.tar.gz
I'm sorry, my mistake... a problem with my debugger. I resubmitted the matrix.
It originates from a two-dimensional 123x123 grid, so nmat = 15129 and nnz = 89273. The right-hand side is important, since we deal with a time-implicit equation system (the coefficient matrix and the right-hand side change at every time step).
The nnz is derived from the logical topology; therefore it still contains zero values. However, all indexed nnz values could become nonzero.
I hope the data is usable this time...
Thank you

RE: MPI deadlock with NOSMP version [ Reply ]
By: Xavier Lacoste on 2013-02-27 09:54
[forum:110620]
Something must be wrong; in iaCCS.csv I have:

32763,-1144850943
32764,1794444287
32765,1072693291

So I understand that the matrix is stored in CCS format and I should not look at the data after entry N+1; the same goes for the rhs.
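For the record, here is a small sketch of the array extents PaStiX expects for a CCS (compressed sparse column) matrix, using the dimensions from this thread (n = 15129, nnz = 89273) purely as an illustration:

! Sketch: expected array sizes for a CCS matrix with one right-hand side.
integer, parameter :: n = 15129, nnz = 89273
integer            :: ia(n+1)     ! column pointers, ia(n+1) = nnz + 1
integer            :: ja(nnz)     ! row indices
double precision   :: avals(nnz)  ! numerical values
double precision   :: rhs(n)      ! one right-hand-side entry per unknown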

XL

RE: MPI deadlock with NOSMP version [ Reply ]
By: Xavier Lacoste on 2013-02-27 09:47
[forum:110619]
Thanks, it was coming from PT-Thread being activated in Scotch in my Makefile.inc.

I have a question about your matrix; maybe I misunderstood something.

I noticed that all the files were 89273 lines long, so I deduced that:
iaCCS.csv holds the column indices,
jaCCS.csv the row indices,
avalsCCS.csv the values.

So I gathered them into a MatrixMarket format file (ROW, COL, VALUE).

But I don't understand why there are also 89273 entries in rhs.csv.

Maybe I missed something.

XL.

RE: MPI deadlock with NOSMP version [ Reply ]
By: Sven Dunston on 2013-02-27 07:44
[forum:110618]

Attachment: Makefile.inc
I never ran into this problem. I could compile pt-scotch with the attached configuration using mvapich2-1.9 and pgi64-12.10 (on CentOS 6.3).



RE: MPI deadlock with NOSMP version [ Reply ]
By: Xavier Lacoste on 2013-02-26 16:41
[forum:110610]
Thanks,

I built mpich2 with pgcc on my machine but did not succeed in building pt-scotch.

Did you also get this type of error? How did you solve it?

/lustre/lacoste/scotch_6.0.0/src/libscotch/./graph_match_scan.c:353: undefined reference to `__sync_lock_test_and_set'
/lustre/lacoste/scotch_6.0.0/src/libscotch/./graph_match_scan.c:357: undefined reference to `__sync_lock_test_and_set'
/lustre/lacoste/scotch_6.0.0/src/libscotch/./graph_match_scan.c:358: undefined reference to `__sync_lock_release'

Still investigating.

XL.

RE: MPI deadlock with NOSMP version [ Reply ]
By: Sven Dunston on 2013-02-26 13:45
[forum:110594]

Attachment: mat.tar.gz
Ok, here's the archive containing all the vectors.

nmat=15129

RE: MPI deadlock with NOSMP version [ Reply ]
By: Xavier Lacoste on 2013-02-26 13:13
[forum:110590]
Hello,

I still have no mvapich2 version with PGI... still waiting for an answer from the support...
I may try to build my own MPICH2 on the machine...
Yes, it would be great if you attached (or emailed me) the matrix/vector; maybe with that I could reproduce the bug.

XL.

RE: MPI deadlock with NOSMP version [ Reply ]
By: Sven Dunston on 2013-02-26 12:19
[forum:110589]
Hello Xavier, did you have a chance to test the mpich2/PGI 12.10 setup on your machine? I'm sorry, but it's not possible for you to access our computer from outside.
I already tested it on our newly installed system using the latest mvapich2 version in combination with the 12.10 PGI compiler; the problem is still the same.
I could send you an example coefficient matrix with a matching rhs for testing.

RE: MPI deadlock with NOSMP version [ Reply ]
By: Xavier Lacoste on 2013-02-01 09:33
[forum:110418]
Is there a way to access your machine so I can debug directly in your environment?

XL.

RE: MPI deadlock with NOSMP version [ Reply ]
By: Xavier Lacoste on 2013-02-01 09:21
[forum:110417]
I did some tests with NOSMP activated; they still pass for me.
I asked the support for pgi 12.10/mvapich2 and hope I can reproduce the bug with that...

RE: MPI deadlock with NOSMP version [ Reply ]
By: Xavier Lacoste on 2013-02-01 09:13
[forum:110416]
I apologize for the gcc/FORCE_NO_MPI story about config.in...
I compared with an old config.in I downloaded from another user...
Your config.in looks fine to me.

I'll ask the cluster support team for the latest mpich/pgcc pair so I can perform some tests.

Thanks,

XL

RE: MPI deadlock with NOSMP version [ Reply ]
By: Sven Dunston on 2013-01-31 17:18
[forum:110415]
I'm sure that I'm producing the MPI version with SMP disabled:

libpastix_64bit_mpi_nosmp_int_double_real_scotch_i686_pc_linux.a

-DNO_MPI_TYPE did not help.


I used: mpicc -O3 -DFORCE_NOSMP -DNO_MPI_TYPE ...

mpirun -np 1 is fine

mpirun -np 2 gives me

Fatal error in MPI_Recv: Message truncated, error stack:
MPI_Recv(186): MPI_Recv(buf=0x3acdef0, count=56, MPI_BYTE,
src=MPI_ANY_SOURCE, tag=0, MPI_COMM_WORLD, status=0x7fffa138c6f0) failed
do_cts(509)..: Message truncated; 211640 bytes received but buffer size
is 56


that's the stack:

"depth","Processes","Function"
...

"3","2","pastix_fortran_ (pastix_fortran.c:231)"
"4","2", "pastix_fortran (pastix_fortran.c:187)"
"5","2", "pastix (pastix.c:4884)"
"6","2", "pastix_task_sopalin (pastix.c:3523)"
"7","2", "ge_sopalin_thread (sopalin3d.c:1385)"
"8","2", "sopalin_launch_thread (sopalin_thread.c:151)"
"9","1","ge_sopalin_smp (sopalin3d.c:1168)"
"10","1","PMPI_Barrier (barrier.c:410)"
"11","1", "MPIR_Barrier_impl (barrier.c:277)"
"12","1", "MPIR_Barrier_or_coll_fn (barrier.c:120)"
"13","1", "MPIR_Barrier_intra (barrier.c:73)"
"14","1", "MPIC_Sendrecv_ft (helper_fns.c:690)"
"15","1", "MPIC_Sendrecv (helper_fns.c:194)"
"16","1", "MPIC_Wait (helper_fns.c:537)"
"17","1", "MPIDI_CH3I_Progress (ch3_progress.c:396)"
"18","1", "MPID_nem_mpich2_blocking_recv (mpid_nem_inline.h:904)"
"19","1", "MPID_nem_network_poll (mpid_nem_network_poll.c:16)"
"20","1", "MPID_nem_tcp_connpoll (socksm.c:1777)"
"9","1","ge_sopalin_smp (sopalin3d.c:871)"
"10","1","ge_wait_contrib_comp_1d (contrib.c:67)"
"11","1", "ge_recv_waitone_fanin (sopalin_sendrecv.c:584)"
"12","1", "PMPI_Recv (recv.c:187)"
"13","1", "MPIR_Err_return_comm (errutil.c:234)"
"14","1", "handleFatalError (errutil.c:470)"
"15","1", "MPID_Abort (mpid_abort.c:39)"


mpirun -np 3 gives deadlock
mpirun -np 4 gives deadlock as well

"depth","Processes","Function"
...
"3","4","pastix_fortran_ (pastix_fortran.c:231)"
"4","4"," pastix_fortran (pastix_fortran.c:187)"
"5","4"," pastix (pastix.c:4884)"
"6","4"," pastix_task_sopalin (pastix.c:3523)"
"7","4"," ge_sopalin_thread (sopalin3d.c:1385)"
"8","4"," sopalin_launch_thread (sopalin_thread.c:151)"
"9","2","ge_sopalin_smp (sopalin3d.c:1168)"
"10","2", "PMPI_Barrier (barrier.c:410)"
"11","2", "MPIR_Barrier_impl (barrier.c:277)"
"12","2", "MPIR_Barrier_or_coll_fn (barrier.c:120)"
"13","2", "MPIR_Barrier_intra (barrier.c:73)"
"14","2", "MPIC_Sendrecv_ft (helper_fns.c:690)"
"15","2", "MPIC_Sendrecv (helper_fns.c:194)"
"16","2", "MPIC_Wait (helper_fns.c:537)"
"17","2", "MPIDI_CH3I_Progress (ch3_progress.c:396)"
"18","2", "MPID_nem_mpich2_blocking_recv (mpid_nem_inline.h:904)"
"19","2", "MPID_nem_network_poll (mpid_nem_network_poll.c:16)"
"20","1","MPID_nem_tcp_connpoll (socksm.c:1777)"
"20","1", "MPID_nem_tcp_connpoll (socksm.c:1800)"
"21","1"," __poll (poll.c:83)"
"9","2","ge_sopalin_smp (sopalin3d.c:871)"
"10","2","ge_wait_contrib_comp_1d (contrib.c:67)"
"11","2", "ge_recv_waitone_fanin (sopalin_sendrecv.c:584)"
"12","2", "PMPI_Recv (recv.c:156)"
"13","2", "MPIDI_CH3I_Progress (ch3_progress.c:396)"
"14","2", "MPID_nem_mpich2_blocking_recv (mpid_nem_inline.h:893)"
"15","1", "OPA_load_ptr (opa_gcc_intel_32_64_ops.h:47)"



It runs well with mpirun -np 1.


RE: MPI deadlock with NOSMP version [ Reply ]
By: Xavier Lacoste on 2013-01-31 14:51
[forum:110398]
The error was not coming from my run but from the cluster... It crashed today.

Activating -DNO_MPI_TYPE solved the issue; can you tell me if it is OK for you?

XL.

RE: MPI deadlock with NOSMP version [ Reply ]
By: Xavier Lacoste on 2013-01-31 14:21
[forum:110397]
I had a crash on 4 processors with PGI 12.4/openmpi 1.6.2/scotch 5.1.12b

It crashed on MPI_Type_commit, so I tried to activate -DFORCE_NO_MPI.

Now I get a deadlock...

And no real debug information...
I'll keep investigating.

XL.

RE: MPI deadlock with NOSMP version [ Reply ]
By: Xavier Lacoste on 2013-01-31 13:32
[forum:110396]
I also see that you have FORCE_NOMPI activated; unless you want to run N sequential PaStiX calls, you have to rebuild PaStiX without this flag.

But it should still run, just not scale...
There is still an issue.

XL.
