PM2: Parallel Multithreaded Machine

[#11793] Segfault on lt_dlclose()

Date:
2011-01-18 17:13
Priority:
3
State:
Open
Submitted by:
Samuel Thibault (thibault)
Assigned to:
Nobody (None)
Category:
none
Group:
none
Resolution:
none
Summary:
Segfault on lt_dlclose()

Detailed description
When openmpi calls lt_dlclose() to shut down one of its components, the process crashes:

(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x00007ffff7df2245 in _dl_close_worker (map=<value optimized out>) at dl-close.c:497
#2 0x00007ffff7df230e in _dl_close (_map=<value optimized out>) at dl-close.c:742
#3 0x00007ffff7dec546 in _dl_catch_error (objname=<value optimized out>, errstring=<value optimized out>, mallocedp=<value optimized out>, operate=<value optimized out>, args=<value optimized out>) at dl-error.c:178
#4 0x00007ffff72c92ec in _dlerror_run (operate=0x7ffff72c8fe0 <dlclose_doit>, args=0x7ffff03a5bd0) at dlerror.c:164
#5 0x00007ffff72c900f in __dlclose (handle=<value optimized out>) at dlclose.c:48
#6 0x00007ffff766735c in vm_close (loader_data=0x0, module=0x7ffff03a5bd0) at loaders/dlopen.c:212
#7 0x00007ffff764be7c in lt_dlclose (handle=0x7ffff03bb7a0) at ltdl.c:1939
#8 0x00007ffff7657882 in ri_destructor (obj=0x7ffff03a63f0) at mca_base_component_repository.c:384
#9 0x00007ffff7656b1f in opal_obj_run_destructors (object=0x7ffff03a63f0) at ../../../opal/class/opal_object.h:448
#10 0x00007ffff7657411 in mca_base_component_repository_release (component=0x7ffff5016100) at mca_base_component_repository.c:244
#11 0x00007ffff7623170 in mca_base_components_close (output_id=-1, components_available=0x7ffff792e0a0, skip=0x7ffff5218040) at mca_base_components_close.c:64
#12 0x00007ffff76283ba in mca_base_select (type_name=0x7ffff76c3e48 "carto", output_id=-1, components_available=0x7ffff792e0a0, best_module=0x7fffffffdaa8, best_component=0x7fffffffdab0) at mca_base_components_select.c:128
#13 0x00007ffff763ac1f in opal_carto_base_select () at base/carto_base_select.c:47
#14 0x00007ffff767916b in opal_init (pargc=0x0, pargv=0x0) at runtime/opal_init.c:368
#15 0x00007ffff75ad96e in orte_init (pargc=0x0, pargv=0x0, flags=32) at runtime/orte_init.c:78
#16 0x00007ffff752d11a in ompi_mpi_init (argc=0, argv=0x0, requested=3, provided=0x7fffffffdd6c) at runtime/ompi_mpi_init.c:350
#17 0x00007ffff7557100 in PMPI_Init_thread (argc=0x0, argv=0x0, required=3, provided=0x7fffffffdd6c) at pinit_thread.c:87
#18 0x0000000000400871 in main () at parallel.c:11


Frame #1 (dl-close.c:497) is this code, from Debian's libc6 2.11.2-7 package:

  if (!RTLD_SINGLE_THREAD_P
      && (unload_global
          || scope_mem_left
          || (GL(dl_scope_free_list) != NULL
              && GL(dl_scope_free_list)->count)))
    {
      THREAD_GSCOPE_WAIT ();



where THREAD_GSCOPE_WAIT() is defined as

#define THREAD_GSCOPE_WAIT() \
  GL(dl_wait_lookup_done) ()

which expands to a call through _rtld_global._dl_wait_lookup_done (strictly speaking the local alias, but never mind, it is the same object). This member of _rtld_global is normally set by __pthread_initialize_minimal_internal() in nptl/nptl-init.c:

GL(dl_wait_lookup_done) = &__wait_lookup_done;

but Marcel does not do that, so the pointer is still NULL when _dl_close_worker calls through it, hence the jump to address 0 in frame #0. In fact, Marcel cannot know for sure where to store the address of a Marcel equivalent of __wait_lookup_done, since the offset of that member inside struct rtld_global (sysdeps/generic/ldsodefs.h) depends on the compilation options, e.g. see

#if HP_TIMING_AVAIL || HP_SMALL_TIMING_AVAIL || HP_TIMING_PAD
  /* Start time on CPU clock.  */
  EXTERN hp_timing_t _dl_cpuclock_offset;
#endif

...
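
To illustrate why the offset cannot be guessed from outside, here is a minimal sketch with two hypothetical layouts of the same structure, one with an optional member compiled in and one without. None of these names come from glibc: demo_hp_timing_t and both structs are stand-ins, and the real struct rtld_global has many more members.

```c
#include <stddef.h>

/* Hypothetical stand-in for hp_timing_t; the real type is glibc-internal. */
typedef unsigned long long demo_hp_timing_t;

/* Layout as if the optional timing member were compiled in. */
struct demo_global_with_timing {
    void *earlier_member;
    demo_hp_timing_t cpuclock_offset;   /* the conditional member */
    void (*wait_lookup_done)(void);
};

/* Layout as if the optional timing member were compiled out. */
struct demo_global_without_timing {
    void *earlier_member;
    void (*wait_lookup_done)(void);
};

/* Offset of the hook pointer in each layout. */
size_t demo_offset_with_timing(void)
{
    return offsetof(struct demo_global_with_timing, wait_lookup_done);
}

size_t demo_offset_without_timing(void)
{
    return offsetof(struct demo_global_without_timing, wait_lookup_done);
}
```

The two functions return different values, so code that writes the hook at a hard-coded offset is only correct for one particular build of the library.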

For the time being, we've just commented out the lt_dlclose() call in openmpi...
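
As a minimal sketch of the failure mode described above (a hook pointer that the thread library never installs), the following uses a hypothetical miniature of the loader-global structure. All names are invented for illustration. The NULL check is only there so the sketch can run safely; the glibc excerpt above calls through the pointer without one.

```c
#include <stddef.h>

/* Hypothetical miniature of a loader-global structure holding a hook. */
struct demo_loader_global {
    void (*wait_lookup_done)(void);
};

/* Static storage is zero-initialised, so the hook starts out NULL. */
static struct demo_loader_global demo_global;
static int demo_wait_calls;

/* Stand-in for the thread library's wait function. */
static void demo_wait_lookup_done(void)
{
    demo_wait_calls++;
}

/* What a thread library's early initialisation would have to do. */
void demo_thread_library_init(void)
{
    demo_global.wait_lookup_done = demo_wait_lookup_done;
}

/* Guarded call: returns 1 if the hook ran, 0 if it was never installed.
   Calling demo_global.wait_lookup_done() unguarded while it is NULL
   would be a jump to address 0. */
int demo_gscope_wait(void)
{
    if (demo_global.wait_lookup_done == NULL)
        return 0;
    demo_global.wait_lookup_done();
    return 1;
}

/* Number of times the hook has actually been invoked. */
int demo_wait_call_count(void)
{
    return demo_wait_calls;
}
```

Before demo_thread_library_init() is called, demo_gscope_wait() reports that no hook is installed; afterwards it invokes the hook.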
