
Forum: open-discussion

RE: Solver error
By: Serban Georgescu on 2014-08-22 10:00
[forum:148828]
Hello XL,

Thank you!
Problem solved: I did not notice that the indexing is 1-based.
After adding 1 to the permutation from PETSc, everything works.
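
For reference, a minimal sketch of the fix, based on the reading loop from the earlier post (only the +1 shift from PETSc's 0-based indices to PaStiX's 1-based perm is new):

// Read the rest, shifting the 0-based PETSc ordering to the 1-based
// convention that PaStiX expects.
for (iLine = 0; iLine < ncol; iLine++)
{
    int ind, val;
    fscanf(f, "%d %d", &ind, &val);
    perm[iLine] = val + 1;   /* PETSc indices start at 0, PaStiX perm is 1-based */
}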

Regards,
Serban

RE: Solver error
By: Xavier Lacoste on 2014-08-22 08:06
[forum:148826]
Hello,

I did something similar to what you did and replaced your ordering-reading code with:

int iLine;
for (iLine = 0; iLine < ncol; iLine++)
    perm[iLine] = iLine + 1;   /* identity permutation, 1-based */

This worked for me (on the development branch, which is not far from yours).
Can you confirm that it also works for you?
Is the PETSc ordering 1-based?

Regards,

XL.

PS: when I make my permtab 0-based, I get this assertion failure:

Assertion failed: (perm[i] >= 0), function find_supernodes, file kass/src/find_supernodes.c, line 79.

So it may not be your problem...

RE: Solver error
By: Serban Georgescu on 2014-08-21 09:08
[forum:148824]
Hi XL,

I made a quick test of the distributed version with 1 MPI process.
In this case, we don't need to worry about local and global perms.
I added the code below inside simple_dist.

/*******************************************/
/* Read permutation */
/*******************************************/

perm = malloc(ncol*sizeof(pastix_int_t));
char lineBuffer[255];

if (!mpid)
{
    FILE *f = fopen("ordering.dat", "r");
    int iLine;
    if (f != NULL)
    {
        printf("Reading custom ordering from ordering.dat\n");

        // Skip first 4 rows
        for (iLine = 0; iLine < 4; iLine++)
            fgets(lineBuffer, 255, f);

        // Read the rest
        for (iLine = 0; iLine < ncol; iLine++)
        {
            int ind, val;
            fscanf(f, "%d %d", &ind, &val);
            perm[iLine] = val;
        }

        fclose(f);
    }
}

iparm[IPARM_ORDERING] = API_ORDER_PERSONAL;

/*******************************************/
/* Call pastix */
/*******************************************/
/* No need to allocate invp in dpastix */

PRINT_RHS("RHS", rhs, ncol, mpid, iparm[IPARM_VERBOSE]);

...

However, this crashes, so I guess I am doing something wrong.

WARNING: metis or personal ordering can't be used without kass, forced use of kass.

...
Time to find the supernode (direct) 17.6 s
Number of supernode for direct factorization 10556
*** glibc detected *** /home/serban/Programs/pastix_release_93185ac/src/example/bin/simple_dist: double free or corruption (out): 0x000000000716ff10 ***



Regards,
Serban

RE: Solver error
By: Xavier Lacoste on 2014-08-21 08:57
[forum:148820]
Hi,

Yes, this would disable Scotch and use your permutation.
In distributed mode, the permutation is also distributed and there is no need to set invp.
If you have a global permutation, you have to make sure that:
global_perm[loc2glob[iter]-1] = local_perm[iter]

In centralized mode, the permutation is centralized and you need to give the inverse permutation via invp (perm[invp[i]-1] = i+1).
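
For illustration, a minimal sketch of building such an inverse permutation; the helper name is an assumption, not a PaStiX routine, and both arrays are 1-based as described above:

/* Sketch only: build the 1-based inverse permutation invp from the 1-based
 * permutation perm of length n, so that perm[invp[i]-1] = i+1 holds. */
void build_invp(const pastix_int_t *perm, pastix_int_t *invp, pastix_int_t n)
{
    pastix_int_t i;
    for (i = 0; i < n; i++)
        invp[perm[i] - 1] = i + 1;   /* new position perm[i] maps back to original index i+1 */
}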

Regards,

XL.

RE: Solver error
By: Serban Georgescu on 2014-08-21 08:15
[forum:148819]
Hello guys,

Thanks for the support.

I have been running with different configurations and I think I have found a pattern. The crash seems to be related to the MPI library that I am using.

The MPI that I am using by default is mpi/intel/4.0.1.007.
With this I consistently get the following two types of errors:

(1): ERROR: hgraphOrderHf: internal error

or

- 3 : Envois 463 - Receptions 1766 -
- 0 : Envois 349 - Receptions 1649 -
Assertion failed in file ../../ch3_progress.c at line 1160: pkt->type >= 0 && pkt->type < MPIDI_NEM_PKT_END
internal ABORT - process 0

If I now switch to mpi/intel/5.0.0.028 or mpi/intel/4.1.3.045, everything runs fine. I can reproduce the same error with another matrix as well.

I am now trying to bypass Scotch by providing a permutation myself.
I have generated one from PETSc and saved it in a text file, which I would like to read from PaStiX.
Inside simple_dist.c I see a vector "perm" that is malloc'ed and later filled by Scotch. Is it enough to fill "perm" with the permutation that I read from the file and set
"iparm[IPARM_ORDERING] = API_ORDER_PERSONAL"?

Regards,
Serban


RE: Solver error
By: Nobody on 2014-08-21 06:39
[forum:148818]
As additional information.

SCOTCH_graphCheck() is called in PaStiX every time (we may disable it in release mode in the next release because it can be costly, particularly when doing Schur complement computation, but right now it is always called).

The pastix_checkMatrix() function is called in PaStiX if iparm[IPARM_MATRIX_VERIFICATION] is set to API_YES (which is the default).

So either it is a misconfiguration of the graph that is not caught by these two functions, or a misconfiguration created by PaStiX and not tested by SCOTCH_graphCheck(), or a bug in the HF method.

You should also be able to save the graph using:
iparm[IPARM_IO_STRATEGY] = API_IO_SAVE & API_IO_SAVE_GRAPH

Regards,

XL.

RE: Solver error
By: Xavier Lacoste on 2014-08-21 07:07
[forum:148817]
As additional information.

SCOTCH_dgraphCheck() is called in PaStiX every time (we may disable it in release mode in the next release because it can be costly, particularly when doing Schur complement computation, but right now it is always called).

The pastix_checkMatrix() function is called in PaStiX if iparm[IPARM_MATRIX_VERIFICATION] is set to API_YES (which is the default).

So either it is a misconfiguration of the graph that is not caught by these two functions, or a misconfiguration created by PaStiX and not tested by SCOTCH_dgraphCheck(), or a bug in the HF method.

Unfortunately, we didn't add an option to save the distributed graph, so one would have to add a SCOTCH_dgraphSave() call near the SCOTCH_dgraphCheck(dgraph) call in sopalin/src/pastix.c to be able to save it.
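
For illustration, a minimal sketch of what such an addition could look like; the dgraph and rank arguments and the file naming are assumptions, not the actual code in sopalin/src/pastix.c:

#include <stdio.h>
#include <mpi.h>
#include <ptscotch.h>

/* Sketch only (assumed names, not the actual PaStiX source): dump the
 * distributed graph to one file per MPI rank, to be called right after
 * the existing SCOTCH_dgraphCheck() call. SCOTCH_dgraphSave() is a
 * collective operation, so every rank must call it. */
static void dump_dgraph(SCOTCH_Dgraph *dgraph, int rank)
{
    char  filename[64];
    FILE *stream;

    snprintf(filename, sizeof(filename), "dgraph_%d.grf", rank);
    stream = fopen(filename, "w");
    if (stream != NULL)
    {
        if (SCOTCH_dgraphSave(dgraph, stream) != 0)
            fprintf(stderr, "dump_dgraph: SCOTCH_dgraphSave failed\n");
        fclose(stream);
    }
}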

Maybe we could test this for you if you can attach the matrix file.

Regards,

XL.

RE: Solver error
By: Francois PELLEGRINI on 2014-08-20 17:20
[forum:148814]
To XL: Scotch 6.0 and 5.1.12 do not differ much regarding the sparse matrix ordering routines. Most of the differences concern the static mapping routines. Anyway, it has to be tried. Because of a change in the pseudo-random number generator between the two versions, the configuration that triggers the bug in the HF routine of Scotch 6.0.0 may not show up with Scotch 5.1.12.

However, I would be very interested in discovering a case that wrecks Scotch and/or the HF routine, and so would Patrick Amestoy, I guess. Hence, Serban, please keep track of how to reproduce this bug. :-)

f.p.

RE: Solver error
By: Francois PELLEGRINI on 2014-08-20 17:08
[forum:148812]
Hello Serban,

You seem to have triggered an "internal error" bug within Scotch, more specifically in the routine that computes the ordering of the leaves of the incomplete nested dissection using the "approximate halo fill" method (hence the name of the routine: "hgraphOrderHf").
The problem I have is that this code is an f2c translation of code written by Patrick Amestoy, Tim Davis and Iain Duff. I am therefore not competent to debug it.

First, one has to check whether there is a problem in your matrix pattern as provided to PaStiX and then to Scotch. Then, we have to check whether there is a problem when passing the matrix from PaStiX to Scotch. Finally, we will try to capture the graph passed to the HF method and send it to P. Amestoy to see what he can tell about his code.

Alternatively, you may try to use MeTiS in the meantime to get an ordering and see whether it crashes, too.
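
For illustration, a minimal sketch of switching the ordering to MeTiS in the example driver, assuming a PaStiX build with METIS enabled (the options log in the original post below shows "METIS : Not defined", so the library would need to be rebuilt accordingly):

/* Sketch: ask PaStiX to compute the ordering with MeTiS instead of Scotch.
 * Requires a PaStiX build compiled with METIS support. */
iparm[IPARM_ORDERING] = API_ORDER_METIS;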

Sorry for the inconvenience,

f.p.

RE: Solver error
By: Xavier Lacoste on 2014-08-20 17:07
[forum:148811]
Hello Serban,

I don't know exactly what the error is, but I think hgraphOrderHf is a Scotch function.
I would advise you to try Scotch 5.1.12, which is more thoroughly tested.

Regards,

XL.

Solver error
By: Serban Georgescu on 2014-08-20 16:02
[forum:148809]
Hi XL,

I am trying to solve some matrices (this is actually a scalability study, flat MPI vs. MPI + threads) and I am getting errors in some of the runs.
I am running the latest version of PaStiX (5.2.2.12) compiled with SCOTCH 6.0.0.

For one of my test matrices, it works in the following configuration:
mpirun -np 24 /home/serban/Programs/pastix_release_93185ac/src/example/bin/simple_dist -petsc_u mat.petscbin -t 1

However, using 48 processes returns an error:

...
- 27 : Envois 2311 - Receptions 2163 -
- 24 : Envois 2405 - Receptions 1934 -
- 25 : Envois 1421 - Receptions 1848 -
- 33 : Envois 2020 - Receptions 2241 -
- 31 : Envois 2534 - Receptions 2726 -
- 32 : Envois 2900 - Receptions 2677 -
- 35 : Envois 1384 - Receptions 1419 -
[23] Abort: Unknown packet type -1711650587 in MPID_OFA_Process_send at line 1705 in file ../../ofa_send.c

If I run in hybrid mode, I get a different kind of error:
mpirun -ppn 1 -np 2 /home/serban/Programs/pastix_release_93185ac/src/example/bin/simple_dist -petsc_s mat.petscbin -t 12
...
+--------------------------------------------------------------------+
+ PaStiX : Parallel Sparse matriX package +
+--------------------------------------------------------------------+
Matrix size 894555 x 894555
Number of nonzeros in A 28282032
+--------------------------------------------------------------------+
+ Options +
+--------------------------------------------------------------------+
Before running dpastix
Version : exported
SMP_SOPALIN : Defined
VERSION MPI : Defined
PASTIX_DYNSCHED : Not defined
STATS_SOPALIN : Defined
NAPA_SOPALIN : Defined
TEST_IRECV : Not defined
TEST_ISEND : Defined
TAG : Exact Thread
FORCE_CONSO : Not defined
RECV_FANIN_OR_BLOCK : Not defined
OUT_OF_CORE : Not defined
DISTRIBUTED : Defined
METIS : Not defined
WITH_SCOTCH : Defined
INTEGER TYPE : int
PASTIX_FLOAT TYPE : double
+--------------------------------------------------------------------+
(0): ERROR: hgraphOrderHf: internal error
rank 0 in job 1 bx920-01-10_55964 caused collective abort of all ranks
exit status of rank 0: killed by signal 9

Any idea what could be causing this?

Thanks,
Serban