
Forum: open-discussion

RE: MPI + threads [ Reply ]
By: Serban Georgescu on 2014-08-27 15:03
[forum:148844]
Hello XL,

Understood.
I have compiled a new version of PaStiX with a single-threaded Scotch 5.1.12b.
I will use this for further testing.
The results are similar to what I got with 6.0, though.

Regarding the matrix data, there are still some formalities but they should be done in the next few days.

Best regards,
Serban

RE: MPI + threads [ Reply ]
By: Xavier Lacoste on 2014-08-25 16:50
[forum:148841]
Dear Serban,

Indeed, you are using the distributed version of Scotch.
If you are using Scotch 5.1.12b, we would advise you to deactivate thread support.
And we would advise waiting for the next release before trying the 6.X version of Scotch.

Regards,

XL.

RE: MPI + threads [ Reply ]
By: Serban Georgescu on 2014-08-25 16:31
[forum:148840]
Dear Pierre,

I ran the scotch_dist example, so I guess I am running in distributed mode.

I compiled Scotch with threading support. The Scotch documentation says that the number of threads Scotch should use has to be specified at compile time (it seems you cannot set it at run time). However, if I compile with "-DSCOTCH_PTHREAD_NUMBER=12" the performance is VERY bad, so I left the default options.
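
For reference, a minimal sketch of where these flags live when building Scotch 5.1.12 from source (the directory layout and platform template name are assumptions; the flag names come from the Scotch installation notes):

cd scotch_5.1.12b/src
cp Make.inc/Makefile.inc.x86-64_pc_linux2 Makefile.inc
# The thread count is fixed at compile time through CFLAGS in Makefile.inc:
#   -DSCOTCH_PTHREAD              enables the threaded algorithms
#   -DSCOTCH_PTHREAD_NUMBER=<n>   sets how many threads Scotch may use
# Dropping both flags yields a single-threaded Scotch build.
make clean
make ptscotch    # builds libscotch/libptscotch for the DISTRIBUTED PaStiX build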

Regarding the matrices, I think I can send them but I need to get formal permission first from the owners. This usually takes a few days. I'll send an email now.

Regards,
Serban

RE: MPI + threads [ Reply ]
By: Pierre Ramet on 2014-08-25 09:38
[forum:148835]
Dear Serban,
thank you for these results. I suppose you used the sequential version of Scotch (centralized mode in PaStiX).
In terms of performance, we have reached the same conclusion: using 1 thread per core inside one socket is the best configuration. In terms of memory, you should always use one thread per core available inside one node.
But we would expect better scalability when you increase the number of resources, for both threads and MPI processes.
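As a concrete illustration of that placement on a 2-socket x 6-core node, a minimal launch sketch (the simple example driver, its -t option and the -lap argument are the ones already used in this thread; the mpirun syntax is a placeholder for your launcher):

mpirun -np 2 ./simple -t 6 -lap=262144     # 1 MPI process per socket, 1 thread per core of that socket
mpirun -np 1 ./simple -t 12 -lap=262144    # 1 MPI process per node: 1 thread per core, lowest memory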
Is it possible to get access to your matrices (Matrix Market format, for instance), so we can run benchmarks on our machines?
Regards,
Pierre.

RE: MPI + threads [ Reply ]
By: Serban Georgescu on 2014-08-22 16:21
[forum:148834]

Attachment: pcb_benchmark.xlsx (20 downloads)
Hi XL,

I've run benchmarks for 3 matrices of increasing sizes.
They come from FEM - the thermal stress simulation of a printed circuit board (so they have a shell-like structure, which is challenging for solvers).

For each problem I use an increasing (total) number of cores, and for each core count I try various combinations of MPI processes and threads.

You can see the results in the attached spreadsheet.
Basically, if I remove the cost of Scotch from the total time, I can at best double the performance by increasing the number of cores I use. It seems the best combination is 6 threads per MPI process, which is consistent with my 2-socket x 6-core Westmere configuration.
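
For concreteness, the sweep described above can be scripted roughly as follows (a sketch only: the mpirun syntax and MATRIX_ARGS are placeholders; -t is the thread option of the PaStiX example drivers):

# Hypothetical matrix arguments; replace with the options used to load the PCB matrices.
MATRIX_ARGS="-lap=262144"
for t in 1 2 3 6 12; do
    np=$((12 / t))                # keep np * t equal to the 12 cores in use
    mpirun -np "$np" ./simple -t "$t" $MATRIX_ARGS > "log_np${np}_t${t}.log"
done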

Are these results in line with what you would expect?
Are there any extra knobs that can be turned for extra performance?

Cheers,
Serban

RE: MPI + threads [ Reply ]
By: Serban Georgescu on 2014-08-22 10:02
[forum:148829]
Thank you,

I am following these guidelines already.
I am now running benchmarks and will post some results soon.

Regards,
Serban

MPI + threads [ Reply ]
By: Nobody on 2014-08-21 15:22
[forum:148825]
On 2014-08-21 17:09, Xavier Lacoste wrote:
> Dear Ricardo,
>
> We have several remarks to make after reading your mail.
>
> - To perform a scaling study you shouldn't use -lap, because it
> generates a matrix badly suited to a parallel implementation of a
> direct solver.
> You could use a larger matrix from the Tim Davis collection
> (http://www.cise.ufl.edu/research/sparse/matrices/list_by_id.html
> [10]),
> for example:
> http://www.cise.ufl.edu/research/sparse/matrices/GHS_psdef/audikw_1.html
> [11]
>
> - To use threads inside PaStiX you cannot use OpenMP; you have to tell
> the library explicitly: iparm[IPARM_THREAD_NBR] = X, or via
> ./simple -t X
>
> - You need to be sure to use a sequential BLAS implementation to avoid
> conflicts with the PaStiX threads.
>
> - We would advise you to enable hwloc in your config.in. I don't know
> how it performs on the Cray, but on other platforms it is aware of the
> scheduler bindings and we obtain a correct binding.
>
> Regards,
>
> Pierre and Xavier.
>
> PS: If you want you can continue the discussion on the forum.
> https://gforge.inria.fr/forum/forum.php?forum_id=598&group_id=186 [12]
>
> On 21 August 2014 at 16:41, Riccardo Rossi <rrossi@cimne.upc.edu> wrote:
>
>> Dear Pierre, dear Xavier,
>> we are trying to use the PaStiX solver within a preconditioner we are
>> writing; unfortunately, we are facing a performance problem when
>> mixing MPI and threads on a Cray XC30.
>>
>> The thing is that performance is around a factor of 6 slower when I
>> try to combine OpenMP and MPI on the test system.
>>
>> For example, running the PaStiX test "cppsimple" with -lap=262144
>> gives the following output (the system has 2 IvyBridge processors
>> with 12 cores each):
>>
>> *********12 MPI - OMP_NUM_THREADS=1
>>
>> hlogin2:/gfs1/work/nimssk01/amgcl_pastix/pastix/pastix_release_93185ac/src/example/bin
>> $ more output_mpi12_omp1.log
>> MPI_Init_thread level = MPI_THREAD_SERIALIZED
>> driver Laplacian
>> Check : Numbering OK
>> Check : Sort CSC OK
>> Check : Duplicates OK
>>
>> +--------------------------------------------------------------------+
>> + PaStiX : Parallel Sparse matriX package +
>> +--------------------------------------------------------------------+
>> Matrix size 262144 x 262144
>> Number of nonzeros in A 524287
>>
>> +--------------------------------------------------------------------+
>> + Options +
>> +--------------------------------------------------------------------+
>> Version : exported
>> SMP_SOPALIN : Defined
>> VERSION MPI : Defined
>> PASTIX_DYNSCHED : Not defined
>> STATS_SOPALIN : Not defined
>> NAPA_SOPALIN : Defined
>> TEST_IRECV : Not defined
>> TEST_ISEND : Defined
>> TAG : Exact Thread
>> FORCE_CONSO : Not defined
>> RECV_FANIN_OR_BLOCK : Not defined
>> OUT_OF_CORE : Not defined
>> DISTRIBUTED : Defined
>> METIS : Not defined
>> WITH_SCOTCH : Defined
>> INTEGER TYPE : int32_t
>> PASTIX_FLOAT TYPE : double
>>
>> +--------------------------------------------------------------------+
>> Time to compute ordering 0.942 s
>> Time to analyze 0.0599 s
>> Number of nonzeros in factorized matrix 2261087
>> Fill-in 4.31269
>> Number of operations (LLt) 2.99509e+07
>> Prediction Time to factorize (AMD 6180 MKL) 0.0166 s
>> --- Sopalin : Threads are binded ---
>> - iteration 1 :
>> time to solve 0 s
>> total iteration time 0.0123 s
>> error 4.7574e-14
>> Static pivoting 0
>> Inertia 262144
>> Time to factorize 0.0187 S
>> FLOPS during factorization 1.4931 GFLOPS
>> Time to solve 0.00457 S
>> Refinement 1 iterations, norm=4.76e-14
>> Time for refinement 0.0265 s
>> Precision : ||ax -b||/||b|| = 5.51436e-14
>> Application 657399 resources: utime ~13s, stime ~2s, Rss ~41624,
>> inblocks ~1331, outblocks ~49
>>
>> *********12 MPI - OMP_NUM_THREADS=2
>>
>> hlogin2:/gfs1/work/nimssk01/amgcl_pastix/pastix/pastix_release_93185ac/src/example/bin
>> $ more output_mpi12_omp2.log
>> MPI_Init_thread level = MPI_THREAD_SERIALIZED
>> driver Laplacian
>> Check : Numbering OK
>> Check : Sort CSC OK
>> Check : Duplicates OK
>>
>> +--------------------------------------------------------------------+
>> + PaStiX : Parallel Sparse matriX package +
>> +--------------------------------------------------------------------+
>> Matrix size 262144 x 262144
>> Number of nonzeros in A 524287
>>
>> +--------------------------------------------------------------------+
>> + Options +
>> +--------------------------------------------------------------------+
>> Version : exported
>> SMP_SOPALIN : Defined
>> VERSION MPI : Defined
>> PASTIX_DYNSCHED : Not defined
>> STATS_SOPALIN : Not defined
>> NAPA_SOPALIN : Defined
>> TEST_IRECV : Not defined
>> TEST_ISEND : Defined
>> TAG : Exact Thread
>> FORCE_CONSO : Not defined
>> RECV_FANIN_OR_BLOCK : Not defined
>> OUT_OF_CORE : Not defined
>> DISTRIBUTED : Defined
>> METIS : Not defined
>> WITH_SCOTCH : Defined
>> INTEGER TYPE : int32_t
>> PASTIX_FLOAT TYPE : double
>>
>> +--------------------------------------------------------------------+
>> Time to compute ordering 0.928 s
>> Time to analyze 0.0484 s
>> Number of nonzeros in factorized matrix 2261087
>> Fill-in 4.31269
>> Number of operations (LLt) 2.99509e+07
>> Prediction Time to factorize (AMD 6180 MKL) 0.0166 s
>> --- Sopalin : Threads are binded ---
>> - iteration 1 :
>> time to solve 0 s
>> total iteration time 0.112 s
>> error 4.7659e-14
>> Static pivoting 0
>> Inertia 262144
>> Time to factorize 0.839 S
>> FLOPS during factorization 34.056 MFLOPS
>> Time to solve 0.105 S
>> Refinement 1 iterations, norm=4.77e-14
>> Time for refinement 0.127 s
>> Precision : ||ax -b||/||b|| = 5.52255e-14
>> Application 657401 resources: utime ~20s, stime ~6s, Rss ~41752,
>> inblocks ~1331, outblocks ~49
>>
>> The config.in [1] used in the compilation follows. As you will see,
>> hwloc is deactivated, as activating it led to a large performance
>> drop...
>> My guess is that the queuing system on the Cray already takes care of
>> affinity and probably conflicts with hwloc; nevertheless, I don't
>> have a clue how to improve the situation.
>>
>> libsci (which contains the Cray vendor BLAS) is used.
>>
>> hlogin2:/gfs1/work/nimssk01/amgcl_pastix/pastix/pastix_release_93185ac/src
>> $ more config.in [1]
>> HOSTARCH = i686_pc_linux
>> VERSIONBIT = _64bit
>> EXEEXT =
>> OBJEXT = .o
>> LIBEXT = .a
>> CCPROG = cc -Wall
>> CFPROG = ftn
>> CF90PROG = ftn -ffree-form
>> MCFPROG = ftn
>> CF90CCPOPT = -ffree-form -x f95-cpp-input
>> # Compilation options for optimization (make expor)
>> CCFOPT = -O3
>> # Compilation options for debug (make | make debug)
>> CCFDEB = -g3
>> CXXOPT = -O3
>> NVCCOPT = -O3
>>
>> LKFOPT =
>> MKPROG = make
>> MPCCPROG = cc -Wall
>> MPCXXPROG = CC -Wall
>> CPP = cpp
>> ARFLAGS = ruv
>> ARPROG = ar
>> EXTRALIB = -lgfortran -lm -lrt
>> CTAGSPROG = ctags
>>
>> VERSIONMPI = _mpi
>> VERSIONSMP = _smp
>> VERSIONSCH = _static
>> VERSIONINT = _int
>> VERSIONPRC = _simple
>> VERSIONFLT = _real
>> VERSIONORD = _scotch
>>
>> ###################################################################
>> # SETTING INSTALL DIRECTORIES #
>> ###################################################################
>> # ROOT = /path/to/install/directory
>> # INCLUDEDIR = ${ROOT}/include
>> # LIBDIR = ${ROOT}/lib
>> # BINDIR = ${ROOT}/bin
>> # PYTHON_PREFIX = ${ROOT}
>>
>> ###################################################################
>> # SHARED LIBRARY GENERATION #
>> ###################################################################
>> SHARED=1
>> SOEXT=.so
>> SHARED_FLAGS = -shared -Wl,-soname,__SO_NAME__
>> CCFDEB := ${CCFDEB} -fPIC
>> CCFOPT := ${CCFOPT} -fPIC
>> CFPROG := ${CFPROG} -fPIC
>>
>> ###################################################################
>> # INTEGER TYPE #
>> ###################################################################
>> # Uncomment the following lines for integer type support (Only 1)
>>
>> #VERSIONINT = _long
>> #CCTYPES = -DFORCE_LONG -DINTSIZELONG
>> #---------------------------
>> VERSIONINT = _int32
>> CCTYPES = -DINTSIZE32
>> #---------------------------
>> ##VERSIONINT = _int64
>> ##CCTYPES = -DINTSSIZE64
>>
>> ###################################################################
>> # FLOAT TYPE #
>> ###################################################################
>> CCTYPESFLT =
>> # Uncomment the following lines for double precision support
>> VERSIONPRC = _double
>> CCTYPESFLT := $(CCTYPESFLT) -DPREC_DOUBLE
>>
>> # Uncomment the following lines for float=complex support
>> #VERSIONFLT = _complex
>> #CCTYPESFLT := $(CCTYPESFLT) -DTYPE_COMPLEX
>>
>> ###################################################################
>> # MPI/THREADS #
>> ###################################################################
>>
>> # Uncomment the following lines for sequential (NOMPI) version
>> #VERSIONMPI = _nompi
>> #CCTYPES := $(CCTYPES) -DFORCE_NOMPI
>> #MPCCPROG = $(CCPROG)
>> #MCFPROG = $(CFPROG)
>>
>> # Uncomment the following lines for non-threaded (NOSMP) version
>> #VERSIONSMP = _nosmp
>> #CCTYPES := $(CCTYPES) -DFORCE_NOSMP
>>
>> # Uncomment the following line to enable a progression thread,
>> # then use IPARM_THREAD_COMM_MODE
>> #CCPASTIX := $(CCPASTIX) -DPASTIX_THREAD_COMM
>>
>> # Uncomment the following line if your MPI doesn't support MPI_THREAD_MULTIPLE level,
>> # then use IPARM_THREAD_COMM_MODE
>> ##CCPASTIX := $(CCPASTIX) -DPASTIX_FUNNELED
>>
>> # Uncomment the following line if your MPI doesn't support MPI_Datatype correctly
>> #CCPASTIX := $(CCPASTIX) -DNO_MPI_TYPE
>>
>> # Uncomment the following line if you want to use semaphore barrier
>> # instead of MPI barrier (with IPARM_AUTOSPLIT_COMM)
>> #CCPASTIX := $(CCPASTIX) -DWITH_SEM_BARRIER
>>
>> # Uncomment the following lines to enable StarPU.
>> #CCPASTIX := $(CCPASTIX) `pkg-config libstarpu --cflags` -DWITH_STARPU
>> #EXTRALIB := $(EXTRALIB) `pkg-config libstarpu --libs`
>> # Uncomment the correct 2 lines
>> #CCPASTIX := $(CCPASTIX) -DCUDA_SM_VERSION=11
>> #NVCCOPT := $(NVCCOPT) -maxrregcount 32 -arch sm_11
>> #CCPASTIX := $(CCPASTIX) -DCUDA_SM_VERSION=13
>> #NVCCOPT := $(NVCCOPT) -maxrregcount 32 -arch sm_13
>> CCPASTIX := $(CCPASTIX) -DCUDA_SM_VERSION=20
>> NVCCOPT := $(NVCCOPT) -arch sm_20
>>
>> # Uncomment the following line to enable StarPU profiling
>> # ( IPARM_VERBOSE > API_VERBOSE_NO ).
>> #CCPASTIX := $(CCPASTIX) -DSTARPU_PROFILING
>>
>> # Uncomment the following line to disable CUDA (StarPU)
>> CCPASTIX := $(CCPASTIX) -DFORCE_NO_CUDA
>>
>> ###################################################################
>> # Options #
>> ###################################################################
>>
>> # Show memory usage statistics
>> #CCPASTIX := $(CCPASTIX) -DMEMORY_USAGE
>>
>> # Show memory usage statistics in solver
>> #CCPASTIX := $(CCPASTIX) -DSTATS_SOPALIN
>>
>> # Uncomment following line for dynamic thread scheduling support
>> #CCPASTIX := $(CCPASTIX) -DPASTIX_DYNSCHED
>>
>> # Uncomment the following lines for Out-of-core
>> #CCPASTIX := $(CCPASTIX) -DOOC -DOOC_NOCOEFINIT -DOOC_DETECT_DEADLOCKS
>>
>> ###################################################################
>> # GRAPH PARTITIONING #
>> ###################################################################
>>
>> # Uncomment the following lines for using metis ordering
>> #VERSIONORD = _metis
>> #METIS_HOME = ${HOME}/metis-4.0
>> #CCPASTIX := $(CCPASTIX) -DMETIS -I$(METIS_HOME)/Lib
>> #EXTRALIB := $(EXTRALIB) -L$(METIS_HOME) -lmetis
>>
>> # Scotch always needed to compile
>> SCOTCH_HOME ?= /opt/cray/tpsl/default/GNU/48/ivybridge/
>> SCOTCH_INC ?= $(SCOTCH_HOME)/include
>> SCOTCH_LIB ?= $(SCOTCH_HOME)/lib
>> # Uncomment on of this blocks
>> #scotch
>> #CCPASTIX := $(CCPASTIX) -I$(SCOTCH_INC) -DWITH_SCOTCH
>> #EXTRALIB := $(EXTRALIB) -L$(SCOTCH_LIB) -lscotch -lscotcherrexit
>> #ptscotch
>> CCPASTIX := $(CCPASTIX) -I$(SCOTCH_INC) -DDISTRIBUTED -DWITH_SCOTCH
>> #if scotch >= 6.0
>> EXTRALIB := $(EXTRALIB) -L$(SCOTCH_LIB) -lptscotch -lscotch -lptscotcherrexit
>> #else
>> #EXTRALIB := $(EXTRALIB) -L$(SCOTCH_LIB) -lptscotch -lptscotcherrexit
>>
>> ###################################################################
>> # Portable Hardware Locality #
>> ###################################################################
>> # By default PaStiX uses hwloc to bind threads,
>> # comment this lines if you don't want it (not recommended)
>> #HWLOC_HOME ?= /sw/tools/hwloc/1.8.0/generic/
>> #HWLOC_INC ?= $(HWLOC_HOME)/include
>> #HWLOC_LIB ?= $(HWLOC_HOME)/lib
>> #CCPASTIX := $(CCPASTIX) -I$(HWLOC_INC) -DWITH_HWLOC
>> #EXTRALIB := $(EXTRALIB) -L$(HWLOC_LIB) -lhwloc
>>
>> ###################################################################
>> # MARCEL #
>> ###################################################################
>>
>> # Uncomment following lines for marcel thread support
>> #VERSIONSMP := $(VERSIONSMP)_marcel
>> #CCPASTIX := $(CCPASTIX) `pm2-config --cflags` -I${PM2_ROOT}/marcel/include/pthread
>> #EXTRALIB := $(EXTRALIB) `pm2-config --libs`
>> # ---- Thread Posix ------
>> EXTRALIB := $(EXTRALIB) -lpthread
>>
>> # Uncomment following line for bubblesched framework support (need marcel support)
>> #VERSIONSCH = _dyn
>> #CCPASTIX := $(CCPASTIX) -DPASTIX_BUBBLESCHED
>>
>> ###################################################################
>> # BLAS #
>> ###################################################################
>>
>> # Choose Blas library (Only 1)
>> # Do not forget to set BLAS_HOME if it is not in your environnement
>> BLAS_HOME=/opt/cray/libsci/default/GNU/48/ivybridge/lib
>> #---- Blas ----
>> BLASLIB = -L${BLAS_HOME} -lsci_gnu
>> #---- Gotoblas ----
>> #BLASLIB = -L${BLAS_HOME} -lgoto
>> #---- MKL ----
>> #Uncomment the correct line
>> #BLASLIB = -L$(BLAS_HOME) -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
>> #BLASLIB = -L$(BLAS_HOME) -lmkl_intel -lmkl_sequential -lmkl_core
>> #---- Acml ----
>> #BLASLIB = -L$(BLAS_HOME) -lacml
>>
>> ###################################################################
>> # MURGE #
>> ###################################################################
>> # Uncomment if you need MURGE interface to be thread safe
>> # CCPASTIX := $(CCPASTIX) -DMURGE_THREADSAFE
>> # Uncomment this to have more timings inside MURGE
>> # CCPASTIX := $(CCPASTIX) -DMURGE_TIME
>>
>> ###################################################################
>> # DO NOT TOUCH #
>> ###################################################################
>>
>> FOPT := $(CCFOPT)
>> FDEB := $(CCFDEB)
>> CCHEAD := $(CCPROG) $(CCTYPES) $(CCFOPT)
>> CCFOPT := $(CCFOPT) $(CCTYPES) $(CCPASTIX)
>> CCFDEB := $(CCFDEB) $(CCTYPES) $(CCPASTIX)
>> NVCCOPT := $(NVCCOPT) $(CCTYPES) $(CCPASTIX)
>>
>> ###################################################################
>> # MURGE COMPATIBILITY #
>> ###################################################################
>>
>> MAKE = $(MKPROG)
>> CC = $(MPCCPROG)
>> CFLAGS = $(CCFOPT) $(CCTYPESFLT)
>> FC = $(MCFPROG)
>> FFLAGS = $(CCFOPT)
>> LDFLAGS = $(EXTRALIB) $(BLASLIB)
>> CTAGS = $(CTAGSPROG)
>>
>> For now I will deactivate SMP; nevertheless, if you can hint at any
>> solution I would be very grateful.
>>
>> thanks in advance
>> Riccardo
>>
>> P.S.: Should I post this message somewhere else?
>>
>> --
>>
>> Riccardo Rossi
>>
>> PhD, Civil Engineer
>>
>> member of the Kratos Team: www.cimne.com/kratos [2]
>>
>> lecturer at Universitat Politècnica de Catalunya, BarcelonaTech
>> (UPC)
>>
>> Research fellow at International Center for Numerical Methods in
>> Engineering (CIMNE)
>>
>> C/ Gran Capità, s/n, Campus Nord UPC, Ed. C1, Despatx C9
>> 08034 – Barcelona – Spain – www.cimne.com [3] -
>> T.(+34) 93 401 56 96 skype: ROUGERED4
>>
>
>
>
> Links:
> ------
> [1] http://config.in/
> [2] http://www.cimne.com/kratos
> [3] http://www.cimne.com/
> [4] https://www.facebook.com/cimne
> [5] http://blog.cimne.com/
> [6] http://vimeo.com/cimne
> [7] http://www.youtube.com/user/CIMNEvideos
> [8] http://www.linkedin.com/company/cimne
> [9] https://twitter.com/cimne
> [10] http://www.cise.ufl.edu/research/sparse/matrices/list_by_id.html
> [11] http://www.cise.ufl.edu/research/sparse/matrices/GHS_psdef/audikw_1.html
> [12] https://gforge.inria.fr/forum/forum.php?forum_id=598&group_id=186