
[#18739] database hang, http server refusing connections for several minutes

Date: 2015-03-09 15:47
Priority: 3
State: Open
Submitted by: Nadia Heninger (nadiah)
Assigned to: Emmanuel Thomé (thome)
Hardware: none
Product: none
Operating System: none
Component: none
Version: none
Severity: none
Resolution: none
URL:
Summary: database hang, http server refusing connections for several minutes

Detailed description
While testing the other bug, sieving hung and all the slave client logs filled with "urlopen error [Errno 104] Connection reset by peer". The HTTP server refused manual connections as well. After 20 minutes it suddenly started responding again, and this exception showed up:

----------------------------------------
Exception happened during processing of request from ('[my ip address]', 55389)
Traceback (most recent call last):
  File "/usr/lib/python3.2/socketserver.py", line 295, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python3.2/socketserver.py", line 321, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python3.2/socketserver.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/home/nadiah/cado-nfs/scripts/cadofactor/wuserver.py", line 256, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/lib/python3.2/socketserver.py", line 648, in __init__
    self.handle()
  File "/usr/lib/python3.2/http/server.py", line 396, in handle
    self.handle_one_request()
  File "/usr/lib/python3.2/http/server.py", line 384, in handle_one_request
    method()
  File "/home/nadiah/cado-nfs/scripts/cadofactor/wuserver.py", line 343, in do_POST
    self.do_upload()
  File "/home/nadiah/cado-nfs/scripts/cadofactor/wuserver.py", line 379, in do_upload
    inputfp=self.rfile, output=self.wfile, environ=env)
  File "/home/nadiah/cado-nfs/scripts/cadofactor/upload.py", line 182, in do_upload
    failedcommand)
  File "/home/nadiah/cado-nfs/scripts/cadofactor/wudb.py", line 1007, in result
    cursor.begin(EXCLUSIVE)
  File "/home/nadiah/cado-nfs/scripts/cadofactor/wudb.py", line 230, in begin
    self._exec("BEGIN EXCLUSIVE")
  File "/home/nadiah/cado-nfs/scripts/cadofactor/wudb.py", line 195, in _exec
    self.execute(command)
sqlite3.OperationalError: database is locked
----------------------------------------

After this, sieving continued just fine and the computation finished.
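
For context (not from the ticket itself): sqlite3.OperationalError: database is locked is what Python's sqlite3 module raises when a connection cannot obtain the lock within its busy timeout. A minimal standalone sketch of the failure mode, with a hypothetical file name wu.db:

import sqlite3

# Two connections to the same database file, as two concurrent
# request handlers of the workunit server would hold.
holder = sqlite3.connect("wu.db")
waiter = sqlite3.connect("wu.db", timeout=1.0)  # give up after 1 second

holder.execute("BEGIN EXCLUSIVE")  # first handler takes the write lock

try:
    # The second handler issues the same statement; once its busy
    # timeout expires, sqlite3 raises "database is locked".
    waiter.execute("BEGIN EXCLUSIVE")
except sqlite3.OperationalError as e:
    print(e)  # -> database is locked
finally:
    holder.rollback()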
Messages
Date: 2015-11-10 07:52
Sender: Paul Zimmermann

Update from Nadia: in some sense, we fixed the bug ourselves by rewriting cadofactor to use slurm.

Cf. https://github.com/eniac/faas

We keep the ticket open for now.

Date: 2015-04-01 11:13
Sender: Paul Zimmermann

Reassigned to Emmanuel. Any progress?

Date: 2015-03-20 01:47
Sender: Nadia Heninger

I'm not sure if this is related, but here are some logs from some long stalls I'm getting. There are no database errors, just an hour and a half of very little activity.

Here's the work log from the slave process, which is running on the master node. It gets connection timeouts and "connection reset by peer" on the local host.

Download failed, URL error: <urlopen error [Errno 110] Connection timed out>
Waiting 10.0 seconds before retrying (I have been waiting since 3210.0 seconds)
Download failed, URL error: HTTP Error 404: No work available
Waiting 10.0 seconds before retrying (I have been waiting since 3220.0 seconds)
Download failed, Connection error: [Errno 104] Connection reset by peer
Waiting 10.0 seconds before retrying (I have been waiting since 3230.0 seconds)
Opened URL http://master:8001/cgi-bin/getwu?clientid=master.6ce5a754 after 3240.0 seconds wait
Downloading http://master:8001/australia.polyselect2.raw_672 to download/australia.polyselect2.raw_672 (cafile = None)
download/polyselect_ropt already exists, not downloading
Result file master.6ce5a754.work/australia.polyselect2.opt_672 does not exist
Running 'download/polyselect_ropt' -t 2 -inputpolys 'download/australia.polyselect2.raw_672' -area 4286997984182272.0 -Bf 31940624.0 -Bg 15246811.0 > 'master.6ce5a754.work/australia.polyselect2.opt_672'
Attaching file master.6ce5a754.work/australia.polyselect2.opt_672 to upload
Sending result for workunit australia_polyselect2_672#3 to http://master:8001/cgi-bin/upload.py
Upload failed, Connection error: [Errno 104] Connection reset by peer
Waiting 10.0 seconds before retrying
Upload failed, Connection error: [Errno 104] Connection reset by peer
Waiting 10.0 seconds before retrying (I have been waiting since 10.0 seconds)
Upload failed, Connection error: [Errno 104] Connection reset by peer
Waiting 10.0 seconds before retrying (I have been waiting since 20.0 seconds)
Opened URL <urllib2.Request instance at 0x28a4128> after 30.0 seconds wait
Cleaning up for workunit australia_polyselect2_672#3
Removing result file master.6ce5a754.work/australia.polyselect2.opt_672
Removing workunit file download/WU.master.6ce5a754
Downloading http://master:8001/cgi-bin/getwu?clientid=master.6ce5a754 to download/WU.master.6ce5a754 (cafile = None)
Download failed, URL error: HTTP Error 404: No work available


And here's an excerpt from the main log file, with timings:
PID64369 2015-03-20 01:14:53,497 Info:Polynomial Selection (root optimized): Resubmitting workunit australia_polyselect2_672#2 as australia_polyselect2_672#3
PID64369 2015-03-20 01:15:37,190 Info:HTTP server: 10.0.0.188 Sending workunit australia_polyselect2_672#3 to client master.6ce5a754
PID64369 2015-03-20 01:15:37,190 Debug:HTTP server: 10.0.0.188 "GET /cgi-bin/getwu?clientid=master.6ce5a754 HTTP/1.1" 200 -
PID64369 2015-03-20 01:15:37,233 Debug:HTTP server: 10.0.0.188 Translated path /australia.polyselect2.raw_672 to /tmp/australia.polyselect2.raw_672
PID64369 2015-03-20 01:15:37,233 Debug:HTTP server: 10.0.0.188 "GET /australia.polyselect2.raw_672 HTTP/1.1" 200 -
PID64369 2015-03-20 01:18:11,892 Debug:HTTP server: 10.0.0.71 "POST /cgi-bin/upload.py HTTP/1.1" 200 -
PID64369 2015-03-20 01:18:11,893 Debug:HTTP server: 10.0.0.71 Translated path cgi-bin/upload.py to /home/cado-nfs/scripts/cadofactor/upload.py
PID64369 2015-03-20 01:18:11,901 Warning:Database: WuAccess._checkstatus(): Workunit australia_polyselect2_672 has status 6 (CANCELLED), expected 1 (ASSIGNED), presumably timed out
PID64369 2015-03-20 01:18:33,351 Debug:HTTP server: 10.0.0.179 "POST /cgi-bin/upload.py HTTP/1.1" 200 -
PID64369 2015-03-20 01:18:33,352 Debug:HTTP server: 10.0.0.179 Translated path cgi-bin/upload.py to /home/cado-nfs/scripts/cadofactor/upload.py
PID64369 2015-03-20 01:18:33,359 Warning:Database: WuAccess._checkstatus(): Workunit australia_polyselect2_672#2 has status 6 (CANCELLED), expected 1 (ASSIGNED), presumably timed out
PID64369 2015-03-20 01:25:58,530 Debug:HTTP server: 10.0.0.188 "POST /cgi-bin/upload.py HTTP/1.1" 200 -
PID64369 2015-03-20 01:25:58,530 Debug:HTTP server: 10.0.0.188 Translated path cgi-bin/upload.py to /home/cado-nfs/scripts/cadofactor/upload.py
PID64369 2015-03-20 01:25:59,336 Info:Polynomial Selection (root optimized): Best polynomial from file /tmp/australia.upload/australia.polyselect2.e3saqx.opt_672 with E=7.43e-11 is no better than current best with E=7.92e-11
PID64369 2015-03-20 01:25:59,338 Debug:Polynomial Selection (root optimized): Newly arrived stats: {'stats_total_time': '130.63', 'stats_rootsieve_time': '130.53'}
PID64369 2015-03-20 01:25:59,339 Debug:Polynomial Selection (root optimized): Combined stats: {'stats_total_time': '18506.30000000001', 'stats_rootsieve_time': '18489.720000000005'}
PID64369 2015-03-20 01:25:59,349 Info:Polynomial Selection (root optimized): Marking workunit australia_polyselect2_672#3 as ok

I don't seem to have a server.threaded in the parameters_snapshot file, so I guess I must have whatever the default is?

Date: 2015-03-10 11:49
Sender: Emmanuel Thomé

Which value do you have for server.threaded? (Look this up in cx5+3.parameters_snapshot.*)

I am seeing exactly this "database is locked" error when enabling the threaded HTTP server under significant load (several connections per second). It goes away with server.threaded=false.
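
For illustration, a generic sketch (not the actual wuserver.py code): toggling server.threaded amounts to choosing between a serial HTTPServer and one with socketserver.ThreadingMixIn mixed in. In the threaded variant every upload handler runs in its own thread, so several handlers can issue BEGIN EXCLUSIVE against the same SQLite file at once:

from http.server import HTTPServer
from socketserver import ThreadingMixIn

class ThreadedHTTPServer(ThreadingMixIn, HTTPServer):
    # Each request is handled in its own thread; concurrent handlers
    # then contend for the single SQLite write lock.
    pass

# server.threaded=true would correspond to ThreadedHTTPServer,
# server.threaded=false to the plain (serial) HTTPServer.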

Date: 2015-03-09 21:37
Sender: Nadia Heninger

'wutimeout': 10800

Date: 2015-03-09 21:18
Sender: Emmanuel Thomé

What happens is that you have a polyselect_ropt binary that lingers from a previous build, so wuclient complains because it does not match what it has been requested to download.

However, the course of action taken seems correct; the messages you get are just misleading. I think I can infer this from the Python code:

if os.path.isfile(filename):
    logging.info("%s already exists, not downloading", filename)
    if checksum is None:
        return True
    filesum = self.do_checksum(filename)
    if filesum.lower() == checksum.lower():
        return True
    logging.error("Existing file %s has wrong checksum %s, "
                  "workunit specified %s. Deleting file.",
                  filename, filesum, checksum)
    os.remove(filename)

So I think the issue is unrelated to what you have in cx5+3.log.

What's the wutimeout in your setting? I'd like to have an idea about what causes the (fatal) resumption of the cadofactor script...

Date: 2015-03-09 21:11
Sender: Nadia Heninger

cx5+3.log ends with:

Downloading https://cx0:60455/cgi-bin/getwu?clientid=cx5+3 to /home/nadiah/workdir/download/WU.cx5+3 (cafile = /home/nadiah/workdir/download/server.77c09da6.pem)
Downloading https://cx0:60455/c100.polyselect2.raw_48 to /home/nadiah/workdir/download/c100.polyselect2.raw_48 (cafile = /home/nadiah/workdir/download/server.77c09da6.pem)
/home/nadiah/workdir/download/polyselect_ropt already exists, not downloading
Existing file /home/nadiah/workdir/download/polyselect_ropt has wrong checksum c6f799b555dcfb409b600e0797c9216d246da458, workunit specified cb283d4b193ebc1c1003dfa5f440b0665b4e8b3f. Deleting file.

Date: 2015-03-09 21:03
Sender: Nadia Heninger

nadiah@cx0:~/cado-nfs/scripts/cadofactor$ ./wudb.py -dbfile ~/workdir/c100.db -dump -assigned
Assigned workunits:
1
Workunit c100_polyselect2_48:
status: 1
errorcode: None
timeresult: None
wurowid: 34
timecreated: '2015-03-09 20:25:20.834966'
timeverified: None
failedcommand: None
priority: None
wu: "WORKUNIT c100_polyselect2_48\nFILE c100.polyselect2.raw_48\nCHECKSUM 6ec02c12dfb562ab0489fece1edf3876b6c08937\nEXECFILE polyselect_ropt\nCHECKSUM cb283d4b193ebc1c1003dfa5f440b0665b4e8b3f\nRESULT c100.polyselect2.opt_48\nCOMMAND '${EXECFILE1}' -v -t 2 -inputpolys '${FILE1}' -area 2169902202880.0 -Bf 1034690.0 -Bg 567740.0 > '${RESULT1}'\n"
assignedclient: 'cx5+3'
retryof: None
timeassigned: '2015-03-09 20:25:41.203946'
submitter: None
resultclient: None
Associated files:
None

Date: 2015-03-09 19:36
Sender: Paul Zimmermann

Note that you can directly query the database (see scripts/cadofactor/README):

Printing the database
=====================

The wudb.py script can be used to print info on the workunits. For example,

./wudb.py -dbname /tmp/work/testrun.db -dump -assigned

prints all currently assigned work units.

It would be useful to do that if/when such a hang occurs again.

Date: 2015-03-09 18:58
Sender: Emmanuel Thomé

This one is not going to be an easy one, I fear.

I witnessed vaguely similar behaviour, except that I did not get an error -- but stalls I did have.

Alex, how many threads are alive at a given point in time in cadofactor? Which ways do we have to profile the SQL queries that take long?

The stalls I saw were with cadofactor busy timing out WUs, and not answering the very many clients trying to connect, for minutes and minutes. Could it be that looking up WUs within the global WU database takes long? (While most could be "archived" -- the VERIFIED_OK ones, and the... well, I'm not sure about the semantics; CANCELLED?)
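
One way to test that hypothesis would be a timing probe against a copy of the database (hypothetical sketch; the table name "workunits" and the status value 1 = ASSIGNED are assumptions based on the wudb.py dump earlier in this thread):

import sqlite3
import time

# Hypothetical probe: time the status lookup that the server performs
# when handing out or timing out workunits.
conn = sqlite3.connect("c100.db")
t0 = time.time()
count, = conn.execute(
    "SELECT COUNT(*) FROM workunits WHERE status = 1").fetchone()
print("%d assigned WUs, query took %.3f s" % (count, time.time() - t0))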

Field        Old Value  Date              By
assigned_to  kruppa     2015-04-01 11:13  zimmerma
assigned_to  none       2015-03-09 16:47  zimmerma