
Forum: help

RE: PASTIX and StarPU
By: Xavier Lacoste on 2013-04-25 11:54
[forum:113218]
The following patch should allow the LDLt kernel to run on the GPU.

I don't think the problem comes from the BLAS library, as it only occurs with the CUDA kernels.
The only libraries involved in those kernels are StarPU and CUDA... But it could be a side effect of another library; I really don't know. I hope reinstalling will help.

XL.

--- a/sopalin/src/starpu_submit_tasks.c
+++ b/sopalin/src/starpu_submit_tasks.c
@@ -334,8 +334,15 @@ struct starpu_codelet hetrfsp1d_cl =

struct starpu_codelet hetrfsp1d_gemm_cl =
{
- .where = STARPU_CPU,
+ .where = STARPU_CPU
+# ifdef STARPU_USE_CUDA_GEMM_FUNC
+ | STARPU_CUDA
+# endif
+,
.cpu_funcs[0] = hetrfsp1d_gemm_starpu_cpu,
+# ifdef STARPU_USE_CUDA_GEMM_FUNC
+ .cuda_funcs[0] = hetrfsp1d_gemm_starpu_cuda,
+# endif
.model = &GEMM_model,
.nbuffers = 3,
.modes = {STARPU_R,
@@ -345,8 +352,15 @@ struct starpu_codelet hetrfsp1d_gemm_cl =

struct starpu_codelet hetrfsp1d_sparse_gemm_cl =
{
- .where = STARPU_CPU,
+ .where = STARPU_CPU
+# ifdef STARPU_USE_CUDA_GEMM_FUNC
+ | STARPU_CUDA
+# endif
+,
.cpu_funcs[0] = hetrfsp1d_gemm_starpu_cpu,
+# ifdef STARPU_USE_CUDA_GEMM_FUNC
+ .cuda_funcs[0] = hetrfsp1d_gemm_starpu_cuda,
+# endif
.model = &GEMM_model,
# if (defined STARPU_BLOCKTAB_SELFCOPY)
.nbuffers = 3,
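To see why the preprocessor guards in this patch matter, here is a stripped-down, self-contained mock of the same initializer pattern. It is a sketch, not StarPU code: `MOCK_CPU`, `MOCK_CUDA`, `struct mock_codelet`, `USE_CUDA_GEMM_FUNC` and `can_run_on_cuda` are hypothetical stand-ins for `STARPU_CPU`, `STARPU_CUDA`, `struct starpu_codelet` and `STARPU_USE_CUDA_GEMM_FUNC`, and the bit values are arbitrary. When the macro is undefined, the `where` mask holds only the CPU bit and no CUDA function is registered, so the task cannot be placed on a GPU. That is consistent with Xavier's 2013-04-09 remark that the CUDA worker did no work while the LDLt GPU kernel was disabled.

```c
#include <stddef.h>

/* Hypothetical stand-ins for StarPU's worker-type bits; the real constants
   live in <starpu.h> and their values are not reproduced here. */
#define MOCK_CPU  0x1u
#define MOCK_CUDA 0x2u

/* Mimics building with the CUDA GEMM kernel enabled. Comment it out to get
   a CPU-only codelet, like the unpatched hetrfsp1d_gemm_cl. */
#define USE_CUDA_GEMM_FUNC

typedef void (*kernel_fn)(void *buffers[], void *cl_arg);

struct mock_codelet {
    unsigned  where;      /* bitmask of worker types allowed to run the task */
    kernel_fn cpu_func;
    kernel_fn cuda_func;  /* stays NULL unless the macro is defined */
};

static void gemm_cpu(void *buffers[], void *cl_arg) { (void)buffers; (void)cl_arg; }
#ifdef USE_CUDA_GEMM_FUNC
static void gemm_cuda(void *buffers[], void *cl_arg) { (void)buffers; (void)cl_arg; }
#endif

/* Same layout trick as the patch: the "| MOCK_CUDA" term and the CUDA
   function pointer only exist when the macro is defined. */
static const struct mock_codelet gemm_cl = {
    .where = MOCK_CPU
#ifdef USE_CUDA_GEMM_FUNC
           | MOCK_CUDA
#endif
    ,
    .cpu_func = gemm_cpu,
#ifdef USE_CUDA_GEMM_FUNC
    .cuda_func = gemm_cuda,
#endif
};

/* Nonzero if the codelet may be scheduled on a CUDA worker. */
static int can_run_on_cuda(const struct mock_codelet *cl)
{
    return (cl->where & MOCK_CUDA) != 0 && cl->cuda_func != NULL;
}
```

Commenting out the `#define` leaves `where == MOCK_CPU` and `cuda_func == NULL`, so `can_run_on_cuda(&gemm_cl)` returns 0.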

RE: PASTIX and StarPU
By: Serban Georgescu on 2013-04-15 13:46
[forum:113217]
Oh, I guess my GLIBC version is quite new.
I have GCC 4.5.3 under Ubuntu 12.04.

So running that executable on your machine doesn't seem possible.
I will run a few more tests; maybe I can find what is wrong (or different) with my system.

Cheers,
Serban

RE: PASTIX and StarPU
By: Serban Georgescu on 2013-04-25 10:13
[forum:113214]
Thanks! After applying the patch, the LWORK error is gone.

However, I've tried the updated version and the GPU problem is still there.
With StarPU but no GPU, everything works correctly. But as soon as I enable the GPU, the behavior becomes unpredictable: sometimes I get many GMRES iterations, sometimes fewer, depending on the run.

Maybe there is something wrong with my current installation. Next week I will try to recompile everything from scratch, starting with the BLAS. Maybe that will solve the problem.

RE: PASTIX and StarPU
By: Xavier Lacoste on 2013-04-25 09:21
[forum:113210]
The following patch should solve the problem. I still have to re-enable the GPU for LDLt, which is currently disabled.

XL.


--- a/blend/src/solverMatrixGen.c
+++ b/blend/src/solverMatrixGen.c
@@ -1124,12 +1124,12 @@ PASTIX_INT *solverMatrixGen(const PASTIX_INT clustnum,
max_n = n;
}
}
- /* kernel_trsm require COLNBR * stride - COLNBR in LDLt */
+ /* kernel_trsm require COLNBR * (stride - COLNBR+1) in LDLt */
/* horizontal dimension */
n = solvmtx->cblktab[itercblk].lcolnum -
solvmtx->cblktab[itercblk].fcolnum + 1;
/* vertical dimension */
- m = stride - n;
+ m = stride - n + 1; /* + 1 for diagonal storage */
delta = m * n;
if(delta > solvmtx->coefmax) {
solvmtx->coefmax = delta;
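A worked check of what this patch changes, as a small C sketch. It assumes, as the code above suggests, that `stride` is the number of rows stored for a column block and `n = lcolnum - fcolnum + 1` its number of columns; `ws_old`, `ws_new` and `coefmax` are hypothetical names, not PaStiX functions. The old formula reserves `(stride - n) * n` entries and the new one `(stride - n + 1) * n`, i.e. exactly `n` more per block (one extra row for the diagonal, per the patch comment). For a block with no off-diagonal rows (`stride == n`) the old size is 0. A too-small `coefmax` would presumably trip the LWORK check in CORE_gemdm, and Serban confirms in this thread that the error disappears with the patch.

```c
#include <stddef.h>

/* Workspace needed by one column block, before and after the patch.
   Requires stride >= n (a block always stores at least its diagonal rows),
   otherwise the unsigned subtraction would wrap around. */
static size_t ws_old(size_t stride, size_t n)
{
    return (stride - n) * n;        /* off-diagonal rows only */
}

static size_t ws_new(size_t stride, size_t n)
{
    return (stride - n + 1) * n;    /* + 1 row for the diagonal (LDLt) */
}

/* Maximum workspace over all column blocks, mirroring the
   "if (delta > coefmax) coefmax = delta" update in solverMatrixGen.c;
   ws selects the old or the new formula. */
static size_t coefmax(const size_t *stride, const size_t *n, size_t nblocks,
                      size_t (*ws)(size_t, size_t))
{
    size_t max = 0;
    for (size_t k = 0; k < nblocks; k++) {
        size_t delta = ws(stride[k], n[k]);
        if (delta > max)
            max = delta;
    }
    return max;
}
```

For example, a block with `stride = 10` and `n = 4` needs 24 entries under the old formula and 28 under the new one.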

RE: PASTIX and StarPU
By: Xavier Lacoste on 2013-04-25 06:45
[forum:113208]
Hello,

I'll look into it and try to fix it.
I had only tested with -iparm IPARM_FACTORIZATION API_FACT_LLT.
Is it better with that?

XL.

RE: PASTIX and StarPU
By: Serban Georgescu on 2013-04-24 15:15
[forum:113196]

Attachment: pastix_lap_1000.txt
Hi Xavier,

Thanks for the update.
I have compiled it and it works well without StarPU.
However, if I enable StarPU, even before I enable the GPU, I get errors like the following:

CORE_gemdm: Illegal value of LWORK
CORE_gemdm: Illegal value of LWORK
CORE_gemdm: Illegal value of LWORK
CORE_gemdm: Illegal value of LWORK
CORE_gemdm: Illegal value of LWORK

After that, I get plenty of GMRES iterations.
Is it possible that my StarPU (v1.0.5) needs to be updated as well?

The command I am using to run PASTIX is:

./simple -lap 1000 -t 4 -iparm IPARM_STARPU API_YES

I've attached the entire output, maybe it helps.

Cheers,
Serban

RE: PASTIX and StarPU
By: Xavier Lacoste on 2013-04-24 12:40
[forum:113192]

Attachment: pastix_release_4209.tar.bz2
Hello,

Can you try this PaStiX tarball, which contains a modified kernel?
I have already applied the patch to simple.c that disables StarPU during the solve step.

XL.

RE: PASTIX and StarPU
By: Xavier Lacoste on 2013-04-15 11:57
[forum:112826]
I don't know what I did before, but I retried and got another error: my GLIBC is too old...

./simple: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by ./simple)

RE: PASTIX and StarPU
By: Serban Georgescu on 2013-04-15 10:15
[forum:112822]
Hi Xavier,

This is quite strange.
I do not have those libs either; I tried "locate" and "find" and got nothing. Moreover, "ldd simple" returns:

linux-vdso.so.1 => (0x00007fff4a9ff000)
libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00007fe243d75000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fe243a78000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fe243870000)
libstarpu-1.0.so.1 => /home/serban/Programs/starpu-1.0.5/build/install/lib/libstarpu-1.0.so.1 (0x00007fe2435a4000)
libcudart.so.5.0 => /usr/local/cuda-5.0/lib64/libcudart.so.5.0 (0x00007fe243349000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fe24312c000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe242d6d000)
libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007fe242b36000)
libhwloc.so.5 => /usr/local/lib/libhwloc.so.5 (0x00007fe24290c000)
libOpenCL.so.1 => /usr/lib/libOpenCL.so.1 (0x00007fe242707000)
libcublas.so.5.0 => /usr/local/cuda-5.0/lib64/libcublas.so.5.0 (0x00007fe23ed12000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fe23eb0e000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fe23e80e000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fe23e5f7000)
/lib64/ld-linux-x86-64.so.2 (0x00007fe2440a0000)

So I cannot see it linking against those libraries.
Really strange...

RE: PASTIX and StarPU
By: Nobody on 2013-04-12 15:09
[forum:112496]
I can't run the simple executable:

libimf.so => not found
libsvml.so => not found
libintlc.so.5 => not found

and locate libimf.so returns nothing :(

XL.

RE: PASTIX and StarPU
By: Serban Georgescu on 2013-04-12 10:43
[forum:112493]

Attachment: simple.tar.bz2
Hi Xavier,

I have attached the compiled "simple".
I am curious if you can reproduce the problem.
If not, then maybe the problem lies in the shared libraries, which in that case would be replaced by yours.

The new GPU kernel sounds good.
It would be great if I could try it when it is ready!

Thanks again for all the help!

RE: PASTIX and StarPU
By: Xavier Lacoste on 2013-04-12 05:14
[forum:112440]
Hello,

I can try it.

We have a master's student working on a new GPU kernel that does not use textures. I'll send you a patch when it is ready, which shouldn't take long. Maybe this kernel will not be affected by the bug you are seeing.

XL


RE: PASTIX and StarPU
By: Serban Georgescu on 2013-04-11 13:09
[forum:112304]
My GPU is a C2050, so I guess it is pretty close to yours.
I have run ../install/test/test_sparse_gemm_d 1000 100 100 several times and everything checks out OK.

Do you think that sending you the compiled "simple" executable might help?
The BLAS is statically linked, so as long as you have CUDA 5.0 and the StarPU shared library in your path, it should start.

RE: PASTIX and StarPU
By: Xavier Lacoste on 2013-04-11 12:07
[forum:112294]
I'm also using CUDA 5.0 but with intel compilers (11.1).
Maybe it's a bug that occurs on certain GPU cards but not on the one I have access to...
If you run ../install/test_sparse_gemm_d 1000 100 100 several times, do any runs give wrong results (i.e. a bigger norm, and wrong values flagged with ( XX != YY ))?

XL.

RE: PASTIX and StarPU
By: Serban Georgescu on 2013-04-11 10:27
[forum:112277]
This is what I thought as well, so I ran with -v 3.
In both cases (1 GMRES iteration and 3 GMRES iterations) I could see that some work was done on the GPU (i.e., a non-zero exec time for the CUDA0 worker).

I am using -DCUDA_SM_VERSION=20.
I am using CUDA 5.0 and I am compiling with gcc / gfortran 4.5.3.

RE: PASTIX and StarPU
By: Xavier Lacoste on 2013-04-11 10:09
[forum:112274]
The test_sparse_gemm output is fine (even for double complex, where the bug I mentioned was).

Maybe the number of GMRES iterations depends on whether StarPU decides to use the GPU or not.

I ran the code with -lap 10000 and -lap 100000 without any problem.
My GPU card is: Worker CUDA 0 (Tesla M2070 5.2 GiB 06:00.0)

Are you using -DCUDA_SM_VERSION=20 in config.in (I think it's the default value)? That way we can be sure we have the same CUDA kernel.

XL.

RE: PASTIX and StarPU
By: Serban Georgescu on 2013-04-11 08:22
[forum:112264]

Attachment: pastix_test_sparse_gemm.txt
Hi,

So, I ran exactly what you did and the behavior is rather strange.
For the same command, sometimes I get 1 GMRES iteration, in which case the dumped solutions match, and sometimes I get 3 GMRES iterations, in which case they do not.

For -lap 10000 I always get many GMRES iterations with IPARM_CUDA_NBR 1 and just one with IPARM_CUDA_NBR 0. In these cases the solutions do not match.
Can you reproduce the problem for the -lap 10000 case?

I am attaching the output of test_sparse_gemm.

Cheers,
Serban

RE: PASTIX and StarPU
By: Xavier Lacoste on 2013-04-10 15:28
[forum:112171]
I ran 5.2.1 on our GPU machine and had no problem with GMRES when running

./example/bin/simple -lap 1000 -iparm IPARM_STARPU API_YES -iparm IPARM_CUDA_NBR 1 -v 4 -iparm IPARM_FACTORIZATION API_FACT_LLT

and
./example/bin/simple -lap 1000 -iparm IPARM_STARPU API_YES -iparm IPARM_CUDA_NBR 0 -v 4 -iparm IPARM_FACTORIZATION API_FACT_LLT

GMRES converged and the dumps were equal.

When you did the comparison, did you run everything with -iparm IPARM_FACTORIZATION API_FACT_LLT?

Can you run make test_sparse_gemm and send me the output?

XL.

RE: PASTIX and StarPU
By: Serban Georgescu on 2013-04-10 12:11
[forum:112153]
Thank you!

RE: PASTIX and StarPU
By: Xavier Lacoste on 2013-04-10 11:12
[forum:112147]
OK, I'll have a look at it; thanks for the report.
I'll send you a repository snapshot once I have found the issue.

Can you give me the output of make test_sparse_gemm?

Thanks,

XL.

RE: PASTIX and StarPU
By: Serban Georgescu on 2013-04-10 10:50
[forum:112146]

Attachment: sol_lap10.xls
Hello,

Well, I guess the fact that the CPU is not that fast (Xeon E6520 @ 2.4GHz) makes the GPU look good :)

Regarding the results, I recompiled with "-DPASTIX_DUMP_FACTO" and compared the results. I did not use the previous test matrix because it was rather large; however, the issue can be reproduced on something as simple as Laplace 10.

As you can see in the attached XLS, the results for PASTIX default and PASTIX with StarPU are identical, but there is a significant difference when the GPU is enabled. I guess for large matrices this blows up into NaNs.

Btw, I am using double precision, no complex involved.

Thanks,
Serban



RE: PASTIX and StarPU
By: Xavier Lacoste on 2013-04-10 09:41
[forum:112138]
Just to be sure: I recently found a bug in the double complex kernel. Are you using double complex?

XL.

RE: PASTIX and StarPU
By: Xavier Lacoste on 2013-04-10 09:33
[forum:112137]
Hello,

You achieved pretty good performance :) I didn't manage to do so with our cards, nice!

We clearly have an issue with the numerical results, though.
If your matrix is not too big, you can dump it using the -DPASTIX_DUMP_FACTO compilation flag and compare the CPU results with the CPU+GPU ones to make sure the factorization was performed correctly.
Only the final solv1.0 files are useful here; you can ignore the csc0.0, solv0.0 and smb0.0 files, which correspond to the state before factorization.
If you compare solv1.0 from the CPU and CPU+GPU runs, you should get the same results (you can use a tool such as numdiff to check this: http://www.nongnu.org/numdiff/).
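numdiff is the simplest route. For an in-process comparison instead, here is a minimal C sketch of the same tolerance-based idea. It assumes both solv1.0 dumps have already been read into `double` arrays of equal length and ordering (the dump format is not described in this thread, so no parser is shown); `count_mismatches` and the tolerance values are hypothetical. An entry counts as different when `|x - y| > abs_tol + rel_tol * max(|x|, |y|)`, and a NaN on either side always counts, which matters here since the GPU runs were producing NaNs.

```c
#include <stddef.h>

/* Absolute value without libm; returns NaN unchanged. */
static double absd(double v) { return v < 0.0 ? -v : v; }

/* Number of entries where x[i] and y[i] differ by more than
   abs_tol + rel_tol * max(|x[i]|, |y[i]|). The test is written as
   !(diff <= limit) so that a NaN on either side counts as a mismatch. */
static size_t count_mismatches(const double *x, const double *y, size_t n,
                               double abs_tol, double rel_tol)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        double ax = absd(x[i]), ay = absd(y[i]);
        double scale = ax > ay ? ax : ay;
        double diff = absd(x[i] - y[i]);
        if (!(diff <= abs_tol + rel_tol * scale))
            bad++;
    }
    return bad;
}
```

Writing the condition as a negated `<=` rather than a plain `>` is the design choice that matters: `NaN > limit` is false, so the straightforward form would silently accept NaN entries.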

In any case, I'll check whether I get the same numerical behavior, and why it occurs.

Thanks,

XL.

RE: PASTIX and StarPU
By: Serban Georgescu on 2013-04-10 09:08
[forum:112135]
Hello Xavier,

Thanks again for the help.
I used the iparm and code you provided, and the results are now much better. I used one of my test matrices for the test, and these are the results:

1. PASTIX default (4 cores, no GPU)
simple -mm test.mm -t 4
Time to factorize 66.3 s

2. PASTIX with StarPU (4 cores, no GPU)
simple -mm test.mm -t 4 -iparm IPARM_STARPU API_YES
Time to factorize 85.7 s

3. PASTIX with StarPU and GPU (4 cores, 1xC2050)
simple -mm test.mm -t 4 -iparm IPARM_STARPU API_YES -iparm IPARM_CUDA_NBR 1 -iparm IPARM_FACTORIZATION API_FACT_LLT
Time to factorize 46.7 s
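For reference, the ratios implied by these three timings (plain arithmetic on the reported numbers). Note that run 3 also switches to API_FACT_LLT while runs 1 and 2 use the default factorization, so the comparison is not strictly like-for-like.

```latex
\[
\frac{T_{\text{default}}}{T_{\text{StarPU+GPU}}} = \frac{66.3}{46.7} \approx 1.42,
\qquad
\frac{T_{\text{StarPU}}}{T_{\text{StarPU+GPU}}} = \frac{85.7}{46.7} \approx 1.84,
\qquad
\frac{T_{\text{StarPU}}}{T_{\text{default}}} = \frac{85.7}{66.3} \approx 1.29
\]
```

So on this matrix, StarPU without a GPU is about 29% slower than the default scheduler, and the GPU run is about 1.42 times faster than the default.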

So the GPU has a clear effect on the results. I can also see this in the verbose output:

Worker CUDA 0 (Tesla C2050 2.6 GiB 02:00.0):
Avg. delay on XXMM : 31464867.24 us, 30159 tasks
Avg. length on XXMM : 72.20 us
total time : 46385.55 ms
exec time : 2178.08 ms (4.70 %)
blocked time : 0.00 ms (0.00 %)
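These worker statistics are internally consistent, and they show how lightly the GPU was loaded (arithmetic on the reported numbers only):

```latex
\[
30159 \times 72.20\,\mu\text{s} \approx 2177.5\ \text{ms} \approx 2178.08\ \text{ms (exec time)},
\qquad
\frac{2178.08\ \text{ms}}{46385.55\ \text{ms}} \approx 4.70\,\%
\]
```

In other words, the CUDA worker spent about 2.2 s of the 46.4 s run executing XXMM tasks; for the rest of the run it was not executing kernels.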

Moreover, the memory usage is the same as for the default scheduler. So this part seems to be working, thank you!

I have one more question. If I use the GPU, the GMRES that follows the solve starts having problems:

GMRES :
--- Sopalin : Local structure allocation ---
--- Sopalin : Threads are binded ---
--- Down Step ---
--- Up Step ---
- iteration 1 :
time to solve 0.648 s
total iteration time 0.736 s
error nan
||r|| nan
||b|| 9.2336e+12
||r||/||b|| nan
--- Sopalin : Local structure allocation ---
--- Down Step ---
--- Up Step ---
- iteration 2 :
time to solve 0.649 s
total iteration time 0.69 s
error nan
||r|| nan
||b|| 9.2336e+12
||r||/||b|| nan
--- Sopalin : Local structure allocation ---
--- Down Step ---
--- Up Step ---
- iteration 3 :
time to solve 0.648 s
total iteration time 0.691 s
error nan
||r|| nan
||b|| 9.2336e+12
||r||/||b|| nan
.....

Could it be that some data is not copied back from the GPU?
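One detail explains why the log keeps printing nan without stopping: in IEEE arithmetic every comparison involving NaN is false, so a convergence test of the form `ratio <= tol` can never succeed once the residual is NaN, and the loop presumably runs until its iteration limit. Below is a minimal C sketch of a check that makes that case explicit. It is not PaStiX's GMRES code; `residual_status`, `sumsq` and the tolerance are hypothetical. It compares squared norms, so no square root is needed.

```c
#include <math.h>
#include <stddef.h>

/* Sum of squares of a vector of length n. */
static double sumsq(const double *v, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += v[i] * v[i];
    return s;
}

/* Classifies a residual against the stopping criterion ||r||/||b|| <= tol:
     1  converged,
     0  not yet converged,
    -1  ratio is NaN or infinite (e.g. NaNs in r, or b == 0);
        further iterations cannot make such a residual converge.
   Squared norms are compared, so no sqrt() is needed. */
static int residual_status(const double *r, const double *b, size_t n, double tol)
{
    double ratio2 = sumsq(r, n) / sumsq(b, n);
    if (!isfinite(ratio2))
        return -1;
    return ratio2 <= tol * tol ? 1 : 0;
}
```

A solver loop that stops on -1 would report the NaN after the first iteration instead of repeating the same `error nan` block until the iteration limit.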

Cheers,
Serban

RE: PASTIX and StarPU
By: Xavier Lacoste on 2013-04-09 07:18
[forum:111995]
Hello,

You are right, the CUDA worker did not perform any work.
I had some problems with the LDLt kernel and disabled it; can you please try with
-iparm IPARM_FACTORIZATION API_FACT_LLT
Moreover, the solve step with StarPU works but is very slow, and I still have to work on it. Can you edit example/src/simple.c and replace the second call to pastix:

pastix(&pastix_data, MPI_COMM_WORLD,
ncol, colptr, rows, values,
perm, invp, rhs, nbrhs, iparm, dparm);

by :

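/* First call: run the steps up to and including numerical factorization, still with StarPU. */
iparm[IPARM_END_TASK] = API_TASK_NUMFACT;
pastix(&pastix_data, MPI_COMM_WORLD,
ncol, colptr, rows, values,
perm, invp, rhs, nbrhs, iparm, dparm);
/* Second call: disable StarPU and run the remaining steps (solve, refinement, clean) with the classical engine. */
iparm[IPARM_STARPU] = API_NO;
iparm[IPARM_END_TASK] = API_TASK_CLEAN;
pastix(&pastix_data, MPI_COMM_WORLD,
ncol, colptr, rows, values,
perm, invp, rhs, nbrhs, iparm, dparm);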

This way the solve step will use the classical engine.

StarPU 1.0.5 should be recent enough to run PaStiX. I think the memory leak might come from my own StarPU-related developments rather than from StarPU itself, maybe (I hope) from the solve step, which hasn't been tested much...

XL.
