
[#14987] [las] segmentation fault (overflow in maxfull)

Date: 2012-10-19 12:16
Priority: 3
State: Open
Submitted by: Shi Bai (bai)
Assigned to: Emmanuel Thomé (thome)
Hardware: All
Product: none
Operating System: All
Component: none
Version: v1.1
Severity: minor
Resolution: Won't Fix
URL:
Summary: [las] segmentation fault (overflow in maxfull)

Detailed description
Hi,

I'm trying to run the 90-digit example integer from the README, with the
following polynomial found for it:

n: 353493749731236273014678071260920590602836471854359705356610427214806564110716801866803409
Y1: 1451236175017
Y0: -4529034879773559342494
c4: 840
c3: 481649557
c2: 101632255154039
c1: -125965386310441153947
c0: 57412174484438522480795031
m: 246222470534888404838015162626269270684517633463418924882134657504372162865457220375461458
skew: 511360.000
# lognorm: 32.62, alpha: -4.83 (proj: -1.49), E: 27.79, nr: 0
# MurphyE(Bf=10000000,Bg=5000000,area=1.00e+16)=3.93e-08

Then las segfaults with the following parameters:

las -I 11 -poly c90.poly -fb c90.roots -q0 440000 -q1 450000 -mt 4 -out c90.rels.440000-450000.gz

maxfull=1.002102
intend to free [0] max_full=0.968760 0.980730
intend to free [1] max_full=0.976171 0.973267
intend to free [2] max_full=0.970510 1.002102
intend to free [3] max_full=0.976455 0.980230
Segmentation fault (core dumped)

I remember having seen something similar previously, perhaps a
double-free issue. Is it a bug?

Best regards,
Shi
Messages
Date: 2017-09-02 07:33
Sender: Emmanuel Thomé

The best way is most probably to over-allocate only mildly, and allocate a full extra bucket after the last one (256-th or so).

After the fill-in-buckets pass we can see which bucket pointers walked over the start of the next bucket. We then simply discard the hopefully few bucket updates that were walked over.
Before applying the buckets, we can do the following:
- for all i: n = bucket_write[i-1] - bucket_start[i]; if (n > 0) walked_over += n;
- fprintf(stderr, "I'm terribly sorry, the jabberwookee ate %d of your bucket updates\n", walked_over);
- for all i: apply buckets from MAX(bucket_start[i], bucket_write[i-1]) to bucket_write[i]

We can adjust the reallocation ratio so that we do not lose more than 1/1000-th of the bucket updates, for instance.
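
A minimal C++ sketch of that recovery pass, following the pseudocode above; the names (bucket_start, bucket_write, apply_one_update, NBUCKETS) mirror the pseudocode and are illustrative, not the actual las interface:

    #include <algorithm>
    #include <cstdio>

    /* Hypothetical declarations. bucket_start has one extra entry: the
     * spare bucket that absorbs an overflow of the last real bucket. */
    struct update_t { unsigned int x; };
    constexpr int NBUCKETS = 256;
    extern update_t *bucket_start[NBUCKETS + 1];
    extern update_t *bucket_write[NBUCKETS];
    void apply_one_update(update_t const &);

    void apply_buckets_skipping_overflows()
    {
        long walked_over = 0;
        for (int i = 1; i < NBUCKETS; i++) {
            long n = bucket_write[i - 1] - bucket_start[i];
            if (n > 0) walked_over += n;
        }
        if (walked_over)
            fprintf(stderr, "I'm terribly sorry, the jabberwookee ate %ld"
                            " of your bucket updates\n", walked_over);
        for (int i = 0; i < NBUCKETS; i++) {
            /* skip the head of bucket i if bucket i-1 overflowed into it */
            update_t *p = i ? std::max(bucket_start[i], bucket_write[i - 1])
                            : bucket_start[0];
            for ( ; p < bucket_write[i]; p++)
                apply_one_update(*p);
        }
    }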

Date: 2017-06-29 14:40
Sender: Emmanuel Thomé

I've implemented what I had in mind with the part I call "operating-system intimate".

It works, but it's not completely satisfactory.

In short, I'm replacing *(p[i])++ = a with an inline assembly statement, which has stricter constraints. Even though no extra code is inserted, gcc can no longer handle it in exactly the same way, and the generated code is not as good.
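
For illustration, a sketch of the kind of replacement meant here, assuming x86-64, AT&T syntax, and a plain integral update type; the actual las code and constraints differ:

    #include <cstdint>

    typedef uint32_t update_t;   /* assumption: a plain integral type */

    /* Plain store: gcc may schedule and combine it freely. */
    static inline void push_update_plain(update_t *&p, update_t a)
    {
        *p++ = a;
    }

    /* Same store expressed as inline assembly. The explicit "=m" output
     * pins the store to this exact spot, which is what the SEGV-recovery
     * machinery needs, but it also keeps gcc from optimizing around it
     * as well as before. */
    static inline void push_update_asm(update_t *&p, update_t a)
    {
        asm("movl %1, %0" : "=m"(*p) : "r"(a));
        p++;
    }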

Date: 2017-06-20 09:19
Sender: Emmanuel Thomé

Now that C++ has percolated to most of las, my third option is almost realistic:

> - use ugly means to make recovery possible. Add an mmapped no-write red zone after the bucket area (à la electric fence), and trap the SEGV occurring in that case. Change bucket_limit_multiplier or something, and longjmp() to that special-q again. No chance of being portable beyond the unix world.

The attached patch does about 50% of the work. The other 50% is the part which will be operating-system-intimate: we need to deal with the SEGV we get when writing past the end of the reserved bucket array.

Opinions? For the moment the change_bucket_multiplier() has a permanent effect, but it would perhaps be worthwhile to make that slightly more subtle.

E.g. with 1 minute per special-q, if we set bkmult too low, we're bound to start over, and that is 1 minute of cpu time lost. On the other hand, keeping bkmult some percentage higher to accommodate the extreme cases could also mean more time (and more memory) permanently. If it's only to accommodate one special-q in every 1000, that's not a good trade.
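
A self-contained sketch of that operating-system-intimate half, under the obvious unix-only assumptions; the names (retry_point, the bkmult handling, the sizes) are made up for the example:

    #include <csetjmp>
    #include <csignal>
    #include <cstdint>
    #include <cstdio>
    #include <sys/mman.h>
    #include <unistd.h>

    static sigjmp_buf retry_point;
    static uintptr_t redzone_lo, redzone_hi;

    /* Trap only faults that hit the red zone; anything else is a real crash. */
    static void segv_handler(int, siginfo_t *si, void *)
    {
        uintptr_t a = (uintptr_t) si->si_addr;
        if (a >= redzone_lo && a < redzone_hi)
            siglongjmp(retry_point, 1);
        signal(SIGSEGV, SIG_DFL);   /* genuine segfault: let it kill us */
        raise(SIGSEGV);
    }

    int main()
    {
        size_t page = sysconf(_SC_PAGESIZE);
        size_t bucket_bytes = 1 << 20;      /* made-up bucket area size */

        /* bucket area followed by one no-access page, à la electric fence */
        char *area = (char *) mmap(NULL, bucket_bytes + page,
                                   PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        mprotect(area + bucket_bytes, page, PROT_NONE);
        redzone_lo = (uintptr_t) area + bucket_bytes;
        redzone_hi = redzone_lo + page;

        struct sigaction sa = {};
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        volatile double bkmult = 1.0;
        if (sigsetjmp(retry_point, 1)) {
            /* we ran into the red zone: bump the multiplier and redo */
            bkmult *= 1.1;
            fprintf(stderr, "bucket overflow, retrying with bkmult=%.2f\n",
                    (double) bkmult);
            /* ... reallocate the bucket area accordingly ... */
        }
        /* ... sieve the special-q, writing bucket updates into 'area' ... */
        return 0;
    }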

Date: 2013-02-08 10:16
Sender: Emmanuel Thomé

I'm inclined to close the bug now, as I see no better way to proceed.

Date: 2013-02-08 10:14
Sender: Emmanuel Thomé

Commit f0f17d604576e1fd101429e9986a8e5f6db50154 now aborts if maxfull exceeds 1 (lacking the energy to do something else).

Date: 2013-01-30 10:32
Sender: Emmanuel Thomé

I wonder what is best to do.

I have lost all faith in the possibility of gracefully recovering from this error situation by ``clean'' means. We can do several things:
- arrange for this to appear less often, e.g. by overallocating yet 5% more for the buckets, which will be useless most of the time. Note that we already allocate 5% beyond the expectation; another option could be to make the overallocation not linear but proportional to sqrt(), following a probability argument (see the sketch after this list).
- exit(1). Because anything is better than a segv.
- use ugly means to make recovery possible. Add an mmapped no-write red zone after the bucket area (à la electric fence), and trap the SEGV occurring in that case. Change bucket_limit_multiplier or something, and longjmp() to that special-q again. No chance of being portable beyond the unix world.
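
For the first option, a hypothetical sizing helper illustrating the sqrt() idea: if bucket fill is roughly Poisson, the standard deviation is sqrt(expected), so a margin of C standard deviations makes the overflow probability drop quickly as C grows. Name and constant are assumptions for the example:

    #include <cmath>
    #include <cstddef>

    /* Instead of a flat percentage, add a margin proportional to
     * sqrt(expected): C plays the role of a number of standard
     * deviations of the (roughly Poisson) bucket fill. */
    static size_t bucket_alloc_size(size_t expected, double C = 6.0)
    {
        return expected + (size_t) (C * std::sqrt((double) expected));
    }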

I think we should do one of the three. Anything is better than a segv. Note though that we can probably hardly afford printing anything at that point, since we have already trashed memory...

E.

Date: 2012-10-19 12:29
Sender: Paul Zimmermann

see #12099

Paul

Date: 2012-10-19 12:22
Sender: Shi Bai

This is a further follow-up:

Does this happen more frequently for small numbers? For the c59 in the README, a segfault happens as well. The following data can be used to reproduce it:

"c59.poly"

n: 90377629292003121684002147101760858109247336549001090677693
Y1: 23838679
Y0: -69925238862761
c4: 3780
c3: 850506
c2: -106951474
c1: 66159302917579
c0: -15711067621222368
m: 42118036103109662072145841845665811462598781784120807141415
skew: 1738.500
# lognorm: 22.79, alpha: -2.36 (proj: -2.14), E: 20.43, nr: 2
# MurphyE(Bf=10000000,Bg=5000000,area=1.00e+16)=1.54e-06

Then
/home/bai/cado-nfs/build/pi.anu.edu.au/sieve/las -I 11 -poly /tmp/cado.UUbMc3Ovyw/c59.poly -fb /tmp/cado.UUbMc3Ovyw/c59.roots -q0 234000 -q1 236000 -mt 4 -out /tmp/cado.UUbMc3Ovyw/c59.rels.234000-236000.gz
maxfull=1.003791
intend to free [0] max_full=0.992660 0.965335
intend to free [1] max_full=0.972606 0.972404
intend to free [2] max_full=1.003791 0.971374
intend to free [3] max_full=0.971416 0.976617
*** glibc detected *** /home/bai/cado-nfs/build/pi.anu.edu.au/sieve/las: free(): invalid next size (normal): 0x0000000001795110 ***

Regards,
Shi

Date: 2012-10-19 12:18
Sender: Shi Bai

Forwarded Emmanuel's comments:

BUCKET_LIMIT_ADD was a quick fix which missed the main issue. See this comment in the commit:

+ * A significant part of the inaccuracy in predicting the bucket sizes
+ * stems from the use of the Mertens estimate in lieu of proper sums. Now
+ * that we _do_ compute this sum, we gain some precision.

The win from this move was noticeable, to the point that the need for
BUCKET_LIMIT_ADD stopped being significant. At the moment this value
is indeed ignored.
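
For context, a hedged sketch of the two prediction styles, the Mertens-style closed form versus summing over the actual factor-base primes; the function names are illustrative only:

    #include <cmath>
    #include <vector>

    /* The expected number of bucket updates per sieve location is roughly
     * the sum of 1/p over the bucket-sieved primes p (each p hits about
     * one location in p). Two ways to evaluate it: */

    /* Mertens-style estimate: sum_{p <= x} 1/p ~ log log x + M, so the
     * sum over pmin < p <= pmax telescopes (the constant M cancels). */
    double expected_hits_mertens(double pmin, double pmax)
    {
        return std::log(std::log(pmax)) - std::log(std::log(pmin));
    }

    /* Exact sum over the primes actually present in the factor base. */
    double expected_hits_exact(std::vector<unsigned long> const &fb_primes)
    {
        double s = 0;
        for (unsigned long p : fb_primes)
            s += 1.0 / p;
        return s;
    }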

I agree that there seem to be corner cases left. The problem is that
we don't know how to handle this gracefully, even if it occurs with
probability 10^-4. I would not mind adding BUCKET_LIMIT_ADD back
(hence in thread_buckets_alloc), but that would still not cure the
issue for real.

Attachments:
Size    Name                       Date              By
19 KiB  dynamic_buckets_patch.txt  2017-06-20 09:19  thome
20 KiB  example.txt                2017-06-20 09:19  thome
9 KiB   segv-safe-hooks.tar.gz     2017-06-29 14:40  thome
Change history:
Field             Old Value                                     Date              By
File Added        6059: segv-safe-hooks.tar.gz                  2017-06-29 14:40  thome
status_id         Closed                                        2017-06-20 09:19  thome
close_date        2013-02-08 10:16                              2017-06-20 09:19  thome
File Added        6054: dynamic_buckets_patch.txt               2017-06-20 09:19  thome
File Added        6055: example.txt                             2017-06-20 09:19  thome
status_id         Open                                          2013-02-08 10:16  thome
close_date        None                                          2013-02-08 10:16  thome
Resolution        Accepted As Bug                               2013-02-08 10:16  thome
summary           las segmentation fault (overflow in maxfull)  2013-02-01 08:37  zimmerma
Hardware          None                                          2013-01-30 10:32  thome
Operating System  None                                          2013-01-30 10:32  thome
Version           None                                          2013-01-30 10:32  thome
Severity          None                                          2013-01-30 10:32  thome
Resolution        None                                          2013-01-30 10:32  thome
summary           las segmentation fault                        2013-01-18 12:31  zimmerma
assigned_to       none                                          2012-10-19 12:18  zimmerma