
[#16714] add a resubmit option to wudb

Date:
2013-12-03 09:20
Priority:
3
State:
Open
Submitted by:
Emmanuel Thomé (thome)
Assigned to:
Nobody (None)
Hardware:
none
Product:
none
Operating System:
none
Component:
none
Version:
none
Severity:
none
Resolution:
none
URL:
Summary:
add a resubmit option to wudb

Detailed description
Use case: client cluter-32.somewhere.com goes offline. It will never finish the workunit which has been assigned to it.

User wants the workunits to be assigned again. There is no documented procedure to do that, except to wait for the timeout.

A way to resubmit a workunit is to forcibly set its timeassigned field to a date long ago in the wudb. For example, the following resubmits all currently assigned workunits (status=1):

echo "update workunits set timeassigned='2000-01-01 00:00:00.000000' where status=1;" | sqlite3 c135.db

I think that such tinkering by hand is not very desirable. IMO the wudb.py script would need a --resubmit option.
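
As an illustration of what such an option might do internally, here is a minimal sketch using Python's sqlite3 module. It assumes only what this ticket shows: a workunits table with wuid, status and timeassigned columns, and status 1 meaning ASSIGNED. The function name force_timeout is hypothetical and is not part of wudb.py.

```python
import sqlite3

ASSIGNED = 1                             # status code for ASSIGNED, as used in this ticket
LONG_AGO = "2000-01-01 00:00:00.000000"  # same backdated timestamp as the one-liner above


def force_timeout(conn, wuids=None):
    """Backdate timeassigned for ASSIGNED workunits (hypothetical helper).

    With wuids=None every ASSIGNED workunit is backdated; otherwise only
    the listed ones. Returns the number of rows updated.
    """
    sql = "UPDATE workunits SET timeassigned = ? WHERE status = ?"
    params = [LONG_AGO, ASSIGNED]
    if wuids is not None:
        wuids = list(wuids)
        if not wuids:
            return 0
        # One placeholder per wuid keeps the values out of the SQL text.
        sql += " AND wuid IN (%s)" % ",".join("?" * len(wuids))
        params.extend(wuids)
    with conn:  # commit on success, roll back on error
        return conn.execute(sql, params).rowcount
```

Restricting the update to status 1 means finished or cancelled workunits are never touched, which is the main risk of typing the SQL by hand.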

E.
Message  ↓
Date: 2015-04-29 13:04
Sender: Emmanuel Thomé

add a new RESUBMIT_REQUESTED status

expand this line in cadotask.py/resubmit_timed_out_wus():

results = self.wuar.query(eq={"status": wudb.WuStatus.ASSIGNED},
                          lt={"timeassigned": cutoff})

to include WUs with status RESUBMIT_REQUESTED; this means an OR in the SQL condition, which is probably better done by running two queries and merging the results.

add a command-line option to wudb.py to do this status change.
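
The three steps above could look roughly like the following sketch. It talks to sqlite3 directly rather than going through the wuar.query() interface, and it assumes the workunits columns shown elsewhere in this ticket. The numeric value chosen for RESUBMIT_REQUESTED and both function names are hypothetical.

```python
import sqlite3

ASSIGNED = 1
RESUBMIT_REQUESTED = 100  # hypothetical value; must not clash with an existing status code


def request_resubmit(conn, wuid):
    """Flag one ASSIGNED workunit for resubmission. Returns True if it was flagged."""
    with conn:
        cur = conn.execute(
            "UPDATE workunits SET status = ? WHERE wuid = ? AND status = ?",
            (RESUBMIT_REQUESTED, wuid, ASSIGNED),
        )
    return cur.rowcount == 1


def wus_to_resubmit(conn, cutoff):
    """Return wuids that timed out or were flagged, using two queries and a merge."""
    timed_out = conn.execute(
        "SELECT wuid FROM workunits WHERE status = ? AND timeassigned < ?",
        (ASSIGNED, cutoff),
    ).fetchall()
    requested = conn.execute(
        "SELECT wuid FROM workunits WHERE status = ?",
        (RESUBMIT_REQUESTED,),
    ).fetchall()
    merged = {}  # dict keeps insertion order and drops duplicates
    for (wuid,) in timed_out + requested:
        merged[wuid] = None
    return list(merged)
```

The timestamp comparison relies on timeassigned being stored as a 'YYYY-MM-DD HH:MM:SS.ffffff' string, which sorts correctly as text.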

Date: 2015-04-01 11:13
Sender: Paul Zimmermann

Reassigned to "Nobody". Feel free to take this one...

Date: 2015-03-10 11:55
Sender: Emmanuel Thomé

This thread has been idle for a long time, and its contents do not really reflect what is actually desired.

The use case is when the user knows that some node is gone. He might want to force resubmission, *now*, of the WUs which have been assigned to it.

It seems to me that forcing a timeout on these WUs is fine. So the user might be happy if there were a command like:

wudb.py --client-match "(griffon-22|graphene-10[2345]).nancy.grid5000.fr\+\d+" --expire-older-than 120

which will internally set timeassigned for these WUs to 0, and then the server will reschedule them as xyz#2.

I think it would be useful.
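
A rough sketch of the logic behind such a command follows. Neither option exists in wudb.py yet, so everything here is an assumption: the unit of --expire-older-than is taken to be seconds, the pattern must match the whole assignedclient value, and the backdated timestamp from the original report stands in for "0". The function name expire_matching is hypothetical.

```python
import re
import sqlite3
from datetime import datetime, timedelta

ASSIGNED = 1
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"     # format of timeassigned seen in the -dump output
LONG_AGO = "2000-01-01 00:00:00.000000"  # stand-in for "timeassigned = 0"


def expire_matching(conn, client_pattern, older_than_seconds, now):
    """Force a timeout on ASSIGNED WUs held by matching clients (hypothetical helper).

    Only WUs assigned more than older_than_seconds before `now` are touched.
    Returns the list of wuids that were backdated.
    """
    pattern = re.compile(client_pattern)
    cutoff = (now - timedelta(seconds=older_than_seconds)).strftime(TIME_FORMAT)
    rows = conn.execute(
        "SELECT wuid, assignedclient FROM workunits"
        " WHERE status = ? AND timeassigned < ?",
        (ASSIGNED, cutoff),
    ).fetchall()
    # SQLite has no built-in REGEXP function, so the match is done in Python.
    expired = [w for w, client in rows
               if client is not None and pattern.fullmatch(client)]
    with conn:  # single transaction for all updates
        conn.executemany(
            "UPDATE workunits SET timeassigned = ? WHERE wuid = ?",
            [(LONG_AGO, w) for w in expired],
        )
    return expired
```

Because only timeassigned changes and the status stays ASSIGNED, the server's normal timeout path does the actual resubmission, so the Task's own bookkeeping is not bypassed.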

Date: 2014-05-30 09:34
Sender: Alexander Kruppa

> ok, I understand my mistake better. Seems that the CANCELLED state for timed out units is not an error condition. Why do I get errors and not warnings for these ?

I concur that this is probably a bug. CANCELED (as it would properly be spelled... another bug) is a status that can occur in normal operation, e.g., when a WU times out. A client submitting a result for a canceled WU is likewise a situation that can occur and that is not necessarily a sign of a real problem.

I changed the message now if the WU status is CANCELLED and a client tries to submit a result; in this case, it is printed as a Warning, not as an Error, and with a "presumably timed out" remark at the end.

The correct course of action for you to take when such an error (now: warning) occurs is: nothing. The Task has already re-submitted the WU at that point, with a modified name (e.g., "blabla#2") to avoid having a duplicated WU name in the DB, and this re-submitted WU will be issued to clients.

Manually hacking the DB will also lead to the problem that the Task loses track of how many WUs it has issued, canceled, and received. The Tasks currently keep separate counters for these, duplicating the info with the WORKUNITS table (a design bug which is meant to be fixed with the newly added SUBMITTER DB column, which will let Tasks efficiently query the status of their own WUs). If you manually change the status of WUs, you may end in a state where a Task forever waits for some (as far as it knows) not-yet-finished WUs, even though there are no outstanding WUs in the WORKUNITS table any more.

Date: 2014-05-27 21:30
Sender: Emmanuel Thomé

ok, I understand my mistake better. Seems that the CANCELLED state for timed out units is not an error condition. Why do I get errors and not warnings for these ?

Would it make sense to define a separate TIMEDOUT status? (I think not.)

I think I "fixed" my db as follows, by re-cancelling the WUs which had already been resubmitted as x#2:
echo "select wuid from workunits;" | sqlite3 /localdisk/thome/rsa1024/workdir/rsa1024.db > /tmp/all
echo "select wuid from workunits where status=1;" | sqlite3 /localdisk/thome/rsa1024/workdir/rsa1024.db > /tmp/work
for x in `cat /tmp/work` ; do if grep -q "$x#2" /tmp/all ; then echo $x ; fi ; done > /tmp/cancel
for x in `cat /tmp/cancel` ; do echo "update workunits set status=6 where wuid='$x';" | sqlite3 /localdisk/thome/rsa1024/workdir/rsa1024.db ; done
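
For reference, the same fix-up can be expressed as one SQL statement run in a single transaction, which avoids the temporary files and matches the "#2" name exactly instead of through grep. This is only a sketch with a hypothetical function name; it assumes status 1 is ASSIGNED and 6 is CANCELLED as in the messages above, and like any direct edit of the table it bypasses the Task's own counters.

```python
import sqlite3

ASSIGNED = 1
CANCELLED = 6


def cancel_superseded(conn):
    """Cancel ASSIGNED WUs whose '#2' resubmission already exists (hypothetical helper).

    Returns the number of workunits whose status was changed.
    """
    with conn:  # one transaction: either all rows change or none
        cur = conn.execute(
            "UPDATE workunits SET status = ?"
            " WHERE status = ?"
            " AND EXISTS (SELECT 1 FROM workunits AS r"
            "             WHERE r.wuid = workunits.wuid || '#2')",
            (CANCELLED, ASSIGNED),
        )
    return cur.rowcount
```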


Now I no longer have crashes, but the server seems to take ages rereading its db. Every once in a while it says:
Info:Polynomial Selection (size optimized): Parsed 33 polynomials, added 1 to priority queue (has 2000)
Info:Polynomial Selection (size optimized): Worst polynomial in queue now has lognorm 92.550000
Info:Polynomial Selection (size optimized): Marking workunit rsa1024_polyselect1_4045000-4050000#2 as ok

Presumably this is linked to clients reporting back with results. However:
- This never completes:

merguez ~ $ curl -L https://localhost:8888/cgi-bin/getwu?clientid=xxxxxx
curl: (35) Unknown SSL protocol error in connection to localhost:8888

- The socket does not seem to have a busy queue.

merguez ~ $ netstat -tlp | grep 8888
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 *:8888 *:* LISTEN 24831/python3

So the clients are starving, because the server doesn't provide any new WUs.

[this is now becoming off-topic for the resubmit thing].

Date: 2014-05-27 20:36
Sender: Emmanuel Thomé

I forgot a detail:

[...] it marked them as *cancelled*. [X]. This didn't please the server, [...]

At [X], insert: I interrupted and restarted the server with tasks.maxtimedout=10000

E.

Date: 2014-05-27 20:31
Sender: Emmanuel Thomé

I still don't get it. IMO, resilience to the dumb-user-which-is-as-dumb-as-me could be improved.

I am running a test computation (some polyselecting for rsa1024). I had tasks.maxtimedout=100, which is too low, but I hadn't noticed. The server went offline for an hour today. Some jobs got killed also, so that they could not upload their results.

By the time the timeout expired, the server decided to resubmit the WUs, except that too many had timed out. So I think that it marked them as *cancelled*. This didn't please the server, as I had many messages like:

Error:Database: WuAccess._checkstatus(): Workunit rsa1024_polyselect1_4930000-4935000 has status 6 (CANCELLED), expected 1 (ASSIGNED)

Indeed:

merguez ~ $ ~/NFS/cado/scripts/cadofactor/wudb.py -dbfile /localdisk/thome/rsa1024/workdir/rsa1024.db -wuid rsa1024_polyselect1_5000-10000 -dump
Workunit rsa1024_polyselect1_5000-10000:
1
Workunit rsa1024_polyselect1_5000-10000:
timeverified: None
resultclient: None
wu: "WORKUNIT rsa1024_polyselect1_5000-10000\nEXECFILE polyselect2l\nCHECKSUM 95931c09bfe684c4565550568569e52405ed6722\nRESULT rsa1024.polyselect1.5000-10000\nCOMMAND '${EXECFILE1}' -P 10000000 -N 135066410865995223349603216278805969938881475605667027524485143851526510604859533833940287150571909441798207282164471551373680419703964191743046496589274256239341020864383202110372958725762358509643110564073501508187510676594629205563685529475213500852879416377328533906109750544334999811150056977236890927563 -degree 6 -r -t 2 -admin 5000 -admax 10000 -incr 60 -nq 1296 -area 1.073741824e+18 -Bf 500000000.0 -Bg 250000000.0 > '${RESULT1}'\n"
timecreated: '2014-05-27 14:16:44.767625'
errorcode: None
retryof: None
timeassigned: '2014-05-27 14:21:02.507474'
timeresult: None
failedcommand: None
assignedclient: 'catrel-47.loria.fr.1a834807'
submitter: None
status: 6
priority: None
wurowid: 2
Associated files:
None

Now, since I don't like error messages, I tried to do something about it. Alas, wudb does not offer me a consistent set of "safe" r/w operations on the db, and I'm left with sqlite tinkering as a swiss army knife. Which doesn't match "safe"...

I tried this:

echo "update workunits set status=1 where status=6;" | sqlite3 /localdisk/thome/rsa1024/workdir/rsa1024.db

And now quite probably I shouldn't have, since I seem to have corrupted the db. The server now crashes saying:

Info:Polynomial Selection (size optimized): Resubmitting workunit rsa1024_polyselect1_0-5000 as rsa1024_polyselect1_0-5000#2
Traceback (most recent call last):
File "./scripts/cadofactor/cadofactor.py", line 78, in <module>
factors = factorjob.run()
File "/users/caramel/thome/NFS/cado/scripts/cadofactor/cadotask.py", line 4720, in run
last_status, last_task = self.run_next_task()
File "/users/caramel/thome/NFS/cado/scripts/cadofactor/cadotask.py", line 4788, in run_next_task
return [task.run(), task.title]
File "/users/caramel/thome/NFS/cado/scripts/cadofactor/cadotask.py", line 1731, in run
self.wait()
File "/users/caramel/thome/NFS/cado/scripts/cadofactor/cadotask.py", line 1142, in wait
self.resubmit_timed_out_wus()
File "/users/caramel/thome/NFS/cado/scripts/cadofactor/cadotask.py", line 1193, in resubmit_timed_out_wus
self.resubmit_one_wu(Workunit(entry["wu"]), commit=True)
File "/users/caramel/thome/NFS/cado/scripts/cadofactor/cadotask.py", line 1163, in resubmit_one_wu
self.submit_wu(wu, commit=commit)
File "/users/caramel/thome/NFS/cado/scripts/cadofactor/cadotask.py", line 1073, in submit_wu
self.wuar.create(str(wu), commit=commit)
File "/users/caramel/thome/NFS/cado/scripts/cadofactor/wudb.py", line 951, in create
self._create1(cursor, wus, priority)
File "/users/caramel/thome/NFS/cado/scripts/cadofactor/wudb.py", line 942, in _create1
self.mapper.table.insert(cursor, d)
File "/users/caramel/thome/NFS/cado/scripts/cadofactor/wudb.py", line 406, in insert
values[self.primarykey] = cursor.insert(self.tablename, d)
File "/users/caramel/thome/NFS/cado/scripts/cadofactor/wudb.py", line 272, in insert
self._exec(command, values)
File "/users/caramel/thome/NFS/cado/scripts/cadofactor/wudb.py", line 197, in _exec
self.execute(command, values)
sqlite3.IntegrityError: UNIQUE constraint failed: workunits.wuid
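
The likely mechanism can be reproduced in isolation. The sketch below assumes a UNIQUE constraint on wuid (which the error message indicates) and a simplified "#2" naming rule; the table is a toy two-column version, not the real schema, and the workunit name is made up.

```python
import sqlite3

ASSIGNED = 1
CANCELLED = 6


def demo_unique_failure():
    """Replay the manual status flip on a toy table; return the error text, or None."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE workunits (wuid TEXT UNIQUE, status INTEGER)")
    # The original WU was cancelled after its timeout and already resubmitted as '#2'.
    conn.executemany(
        "INSERT INTO workunits VALUES (?, ?)",
        [("demo_0-5000", CANCELLED), ("demo_0-5000#2", ASSIGNED)],
    )
    # Manual edit: every CANCELLED WU becomes ASSIGNED again.
    conn.execute("UPDATE workunits SET status = ? WHERE status = ?",
                 (ASSIGNED, CANCELLED))
    # The server now sees the original as a timed-out ASSIGNED WU and tries
    # to resubmit it under a name that is already present in the table.
    try:
        conn.execute("INSERT INTO workunits VALUES (?, ?)",
                     ("demo_0-5000" + "#2", ASSIGNED))
    except sqlite3.IntegrityError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(demo_unique_failure())
```

In other words, flipping status 6 back to 1 made the server try to create a second "#2" copy of workunits that had already been resubmitted.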


Now:
- should I care about the cancelled/assigned warning or not ?
- what the hell should I have done ?
- is there any way to fix my database ?
- don't you think that equipping wudb with some "safe" operations to fix WUs could be useful ?

E.

Date: 2013-12-05 11:39
Sender: Paul Zimmermann

shouldn't it be a feature request instead of a bug?

Paul

Field        Old Value  Date              By
assigned_to  kruppa     2015-04-01 11:13  zimmerma