SQS - Simple Queueing System
----------------------------

Version 2.8 June, 2008
----------------------

This README is a simplified version for those who want to install and
use SQS on a single machine. If you want to use it on a cluster or look
at queues on truly remote machines, see the README file. You also need
to see the README file if you want to use SQS for web script
submissions.

There are many queueing and scheduling systems around, but they were
all rather overkill for what we needed. This was originally a single
queue on a single machine allowing one or two jobs to run at once. This
version allows several queues, primarily because it is useful to have a
main working queue and a queue for runs that just test the input files.
It has two uses:-

1. For the general running of computational chemistry codes, which use
   a lot of scratch disk space, a lot of time and often a lot of memory.

2. For memory- and disk-intensive jobs submitted from web form pages.

The system runs under Linux and is written in Perl. It should run on
any system with Perl. It has been used with many versions of Perl, the
latest being 5.8.6. It has also been tested on Tru64 Compaq Alpha and,
to a limited extent, under Cygwin.

It consists of 12 scripts, including 'install.pl' for making the
installation and deinstallation more automatic. The original files are
templates named 'file.t', which are used by 'install.pl' to create the
following scripts:-

qsub
----

qsub [-d] [-h] [-q queue-name] [-p priority] 'jobname'

Submits the job 'jobname' to the queue named 'queue-name'. If
'-q queue-name' is absent, it uses the default queue, which is the
first local one defined in sqs.conf. If '-p priority' is present, it
sets that priority for the job; otherwise it sets the priority to 1.
The priority is an integer in the range 1 - 9. We usually call qsub
from inside another script, which takes the local name of the job, puts
together a file to run the specific code and then passes that file by
name to qsub.
If '-h' is present the job will be held in the queue and not start
until released with 'qrls'. 'jobname' must be a complete path, so the
script described above uses $PWD to get the full path. The correct
direct use of qsub to run a file called 'file.job' is:-

    qsub [-d] [-q queue-name] [-p priority] $PWD/file.job

where it is assumed that file.job directs output to a file. If '-d' is
present as an argument, the file 'file.job' will be deleted. Note that
'-d', '-h', '-q queue-name' and '-p priority' can appear in any order,
but 'jobname' must be the last argument.

qsub will not let you submit a job that already exists in one of the
queues as this causes serious conflicts. If the queue is empty, or only
contains held jobs, or only contains running jobs to a total less than
the maximum allowed, the job is started. If jobs are added to the queue
when the maximum number of jobs allowed is running, the jobs are kept
in the queue.

qsub starts qseek if it is not running. If you are submitting to a
cluster queue using ssh and password authentication, you will be asked
to type in your password on the cluster machine. qseek is not started
if the job is submitted with '-h' to hold the job.

The script returns the 'job no' to standard output, largely for our web
scripts, so this can be directed to /dev/null in normal scripts.

qseek
-----

This file essentially runs as a daemon, looking for jobs to run. Each
user starts his/her own qseek. qseek is started by a call to
'qinit start'. If it is not already running for that user, qseek is
started by qsub, for obvious reasons; by qrls, since only held jobs may
be in the queue through a shutdown/reboot; and by qmove if you are
moving a non-held job to a queue on another machine.

qrun
----

The script that is started by qseek to run a particular job in the
background. The user should not be concerned with qrun and qseek, but
if things go wrong qseek may need to be killed or restarted. They are
not called directly by the user.
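
The qsub section above describes calling qsub from a wrapper script
that supplies the full path to the job file. A minimal sketch of such a
wrapper step, written in Python purely for illustration (SQS itself is
Perl), might look like the following. The function name
build_qsub_command and its parameter names are hypothetical; only the
flags '-d', '-h', '-q' and '-p', the 1 - 9 priority range, and the rule
that the job name comes last are taken from the description above.

```python
import os


def build_qsub_command(jobfile, queue=None, priority=None,
                       delete=False, hold=False):
    """Return an argument list for qsub (hypothetical helper).

    The options may appear in any order, but the job name must be the
    last argument and must be a complete path.
    """
    args = ["qsub"]
    if delete:
        args.append("-d")          # delete the job file
    if hold:
        args.append("-h")          # hold the job until released by qrls
    if queue is not None:
        args += ["-q", queue]      # otherwise qsub uses the default queue
    if priority is not None:
        if not (isinstance(priority, int) and 1 <= priority <= 9):
            raise ValueError("priority must be an integer from 1 to 9")
        args += ["-p", str(priority)]
    # Equivalent of $PWD/file.job for a file in the current directory.
    args.append(os.path.abspath(jobfile))
    return args


if __name__ == "__main__":
    print(build_qsub_command("file.job", queue="test", priority=3))
```

The resulting list could be passed to subprocess.run, with standard
output sent to subprocess.DEVNULL if the returned 'job no' is not
needed.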

qinit
-----

qinit start [queue]

This starts qseek. It is called by several other scripts, but can be
called directly. It does not start a second qseek if qseek is already
running for the user on the machine that is linked to the queue.
Starting qseek sets the priorities back from negative to their original
values (see qinit stop).

qinit stop [queue]

This kills the running qseek. You may need to do this, for example,
after you make changes to sqs.conf. Note that it does not kill the
running jobs started by that qseek. Stopping qseek sets the priority of
each of your jobs on the machine to the negative of its original value.
If qseek stops in some other way, the script qclear can be used to set
the priorities to the correct negative value.

'queue' is the name of a queue that is linked to the machine on which
you want to start or stop the qseek daemon. This allows qinit to get
the host details from the queue attributes in sqs.conf. If several
queues are linked to a particular machine, any one of them can be used.
Users will be more familiar with the queue names than with the precise
hostname details. If [queue] is omitted, the local machine is assumed.

qinit show

This simply displays the queues, indicating which job is, hopefully,
running, which are queued and which are held.

The first column gives the 'job no', needed by qdel, qhold, qrls, qmove
and qprior.

The second column gives the queue name.

The third column gives the user name.

The fourth column gives the 'job', with the last 20 characters of the
full path to the job.

The fifth column gives the priority. If the priority shows as a
negative number, it means that qseek is not running for that user. This
is to ensure that one user can run jobs when another user has jobs of
higher priority but qseek is not running.

The sixth column gives the time and date. This information is also
stored in the log file, although there the full job path is given.

The seventh column gives the status - "Running", "Queueing", "Holding"
or "No daemon". "No daemon" merely emphasises that the priority is
negative.

The eighth column gives the PGID (process group ID) for running jobs.

Note that 'job no' is unique across queues. Note also that if
'qinit show' is run just after submitting a job, the job may be listed
as "Queued", as it takes a little time for qseek to find the queued job
and start it. We generally alias 'qinit show' to 'qu'.

qinit status

This checks the status of all qseek daemons belonging to the user. It
also gives details of the important parameters for each queue. Note
that if you are accessing the cluster machines by ssh with password
authentication, you have to type in the password once for every machine
in the cluster that has queues defined. For this reason, we have split
the functionality of 'status' and 'show', which previously were
together as 'status'. We generally alias 'qinit status' to 'qstat'.

qinit status-all

This checks the status of all qseek daemons belonging to the user and
to any other user listed in $sqsdir/sqs.users. It does not check
whether entries in that file are valid users. It also gives details of
the important parameters for each queue.

Both 'qinit status' and 'qinit status-all' add the user to
$sqsdir/sqs.users if it is not already there. That file can be edited
by hand, but there is no reason to do so.

qdel [-nd] 'job no' [list of other job nos.]
--------------------------------------------

This script merely deletes the job(s) specified by 'job no', or
optionally by a list of 'job nos'. It can delete a running job. qdel
cannot delete other users' jobs, although root can delete any job. Note
that qdel does not start qseek, but qseek should be running. The
argument '-nd' stops the job file from being deleted. Note that this is
the opposite behaviour from qsub, where the job file is kept unless the
'-d' argument is present.

qhold 'job no' [list of other job nos.]
---------------------------------------

This puts on hold the job(s) indicated by 'job no', or optionally by a
list of 'job nos'. You cannot hold a running job. You cannot put on
hold jobs for other users. root, however, can use this command for jobs
belonging to any user, for the reason given in the 'BUGS' section
below.

qrls 'job no' [list of other job nos.]
--------------------------------------

This releases the job(s), previously held in the queue, indicated by
'job no', or optionally by a list of 'job nos'. You cannot release jobs
for other users. root, however, can use this command for jobs belonging
to any user.

qrls starts qseek if it is not running and if the queue where it found
the job to release previously contained only held jobs. These could
have been held over a reboot where qseek was stopped. Note that this
does not mean qseek is not running, as there may be running jobs on
that machine in other queues, or qseek may not have been stopped. It
does, however, limit the number of checks on qseek, particularly on
cluster machines. If the job is in a queue on a cluster machine and you
are using ssh with password authentication, you will be asked for your
password on the cluster machine if qrls tries to start qseek.

qmove 'job no' 'queue-name'
---------------------------

qmove moves the job specified by 'job no' from its current queue to the
queue specified by 'queue-name'. The job can be queued or holding, but
it cannot be running. You have to own the job. The job in the new queue
keeps its original status - "Holding" or "Queued" - and its priority,
but if the priority is negative it is made positive, as it is assumed
that you are moving the job to a queue where qseek is running. If this
is not the case, run 'qclear user queue' to set the situation right.

qprior 'job no' 'new priority'
------------------------------

qprior changes the priority of 'job no' to 'new priority'.
The job can be "Queued" or "Holding", but not "Running", as it is
pointless to try to change the priority of a running job. If the
priority is negative because qseek is not running, you can still change
the priority: give a valid positive number and it will be made
negative.

qclear [-r host]
----------------

Carries out various tasks to clean up various problems.

qclear 'user' ['queue']

If another user has non-held jobs in the queue with priority larger
than yours and the other user for some reason does not have qseek
running, your jobs will not start. Running qclear, with 'user' the name
of the offending other user and 'queue' the name of the queue, will set
the priority of the other user's jobs to negative. This is the correct
setting if qseek has been stopped with 'qinit stop'. Your jobs will now
start. If 'queue' is omitted the default (first) queue is assumed.

qclear empty ['queue']

If nothing is running, but there are jobs in the queue, qclear sets the
queue to empty. All jobs will have to be resubmitted. If 'queue' is
omitted the default (first) queue is assumed.

qclear ID-no ['queue']

If the machine crashes when a job is running and the machine is then
rebooted, 'qinit show' will still show the job as running. 'qinit show'
does, however, try to check whether the job is running by looking at
whether qrun is running with the PID for the job, and it gives a long
warning message. If the job really is not running, this use of qclear
just removes the job with ID equal to 'no' (in the call always type
'ID-' followed by a number) in the given queue, or in the default queue
if 'queue' is missing. Take care using this command.

qclear zero

If sqs.id is found, it is kept and the running job ID is retained. If
it is not found, it is created with the running job ID set back to
zero, just as install.pl does.

Use of 'qclear empty' and 'qclear zero' is not generally recommended,
but they may be useful for the SQS administrator if things get badly
wrong.
With improvements in the code they are now less needed than they were.
Their use will affect users other than yourself. In general it is
recommended that you delete all queued jobs and wait for the running
job to finish. This may indeed solve the problem anyway.

qexample
--------

'qexample' is a general Perl script that calls 'qsub' to run a file.
This version prompts the user about the use of all the arguments for
qsub. qexample uses the Term::ReadKey Perl module.

    qexample file.job

is equivalent to:-

    qsub [arguments] $PWD/file.job

with the arguments given by the user in response to prompts.

The earlier version of qexample is kept as 'qexample.short.t'. Note
that you can install this with 'install.pl qexample.short', but that it
is not installed by a full install. If you want to use it as the main
qexample, save qexample.t and then copy qexample.short.t to qexample.t.
You will probably want to add the flags you normally use to the exec
line for qsub in this simple qexample.t.

If you do not have the Term::ReadKey Perl module, you can try to use
qexample.getc.t, which uses stuff I do not understand to get getc to
work properly. Again you can install this individually, but it is not
installed with a full install. It should be equivalent to qexample
itself, but is much longer and more difficult to understand.

install.pl
----------

This script installs the other scripts and files, but needs some
editing as indicated in the INSTALL file. This script also installs the
man pages, which are in the man sub-directory. install.pl also copies
README into $sqsdir.

A call of install.pl with one parameter, being the name of one of the
scripts above, will just install that script. Similarly, a call with
'sqs.conf' as argument will simply install that item and a call with
'man' as argument will install all the man pages. A call with 'perl' as
argument will install only the Perl scripts and a call with 'README' as
the argument will just install this file.

If you call install.pl with the first parameter as '-c', it uses perlcc
to compile the Perl scripts. In this case the single script or the
'perl' argument is the second parameter.

If you run './install queue {queue-name}' it will create the queue file
for queue {queue-name}. This is useful when you have altered sqs.conf,
adding a new queue, and not done a full install.

install.pl removes everything installed if called as 'install.pl -u'.

There are also two include files:-

sqs.conf
--------

This file is essentially a configuration file that contains
declarations and assignments that are common to all scripts, and is
included into the scripts at run-time. It needs editing to fit a
particular local system (see INSTALL file). See the section below about
altering sqs.conf and the comments in sqs.conf itself.

sqs.inc
-------

This file contains declarations that are common to all scripts and is
inserted into the scripts by install.pl. You should not need to alter
this.

Four associated files are used:-

sqs.id
------

This holds the job serial number, which is incremented every time a job
is submitted. Initially, installation makes it contain just 0 (zero).

sqs.queue.queue-name
--------------------

These contain the jobs in the queue 'queue-name', one line per job:
'$going', 'job no', 'queue-name', 'flag' (d if '-d' was set on qsub, n
otherwise), 'myname', 'job', 'priority' and the time/date information,
with the 8 items separated by commas. Initially each should be an empty
file. $going is the process group ID for qrun if this is a running job;
otherwise $going = 0.

sqs.log
-------

A log file for progress information and errors.

sqs.pid.$USER.hostname
----------------------

Stores the PID of qseek for use by qstop, but not, note, by qdel on
running jobs. This file is not created by install.pl, but by each user
when qseek is started. There is such a file for every user on every
machine where the user is running jobs. It is removed by 'qinit stop'.
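
To make the queue-file layout described under sqs.queue.queue-name
concrete, here is a small illustrative parser in Python (not part of
SQS, which is written in Perl). It assumes exactly the 8
comma-separated items listed above and that neither the user name nor
the job path contains a comma; the record type QueueEntry, the function
names and the sample line are all hypothetical, and the time/date
format shown is a guess.

```python
from typing import NamedTuple


class QueueEntry(NamedTuple):
    going: int        # process group ID of qrun, or 0 if not running
    job_no: int       # job serial number
    queue_name: str
    flag: str         # 'd' if '-d' was given to qsub, 'n' otherwise
    user: str         # 'myname' in the description above
    job: str          # full path to the job file
    priority: int     # negative when qseek is not running for the user
    submitted: str    # time/date information, kept as raw text


def parse_queue_line(line: str) -> QueueEntry:
    """Split one queue-file line into its 8 items."""
    # maxsplit=7 keeps any commas inside the final time/date item.
    parts = line.rstrip("\n").split(",", 7)
    if len(parts) != 8:
        raise ValueError("expected 8 comma-separated items")
    going, job_no, queue_name, flag, user, job, priority, submitted = parts
    return QueueEntry(int(going), int(job_no), queue_name.strip(),
                      flag.strip(), user.strip(), job.strip(),
                      int(priority), submitted.strip())


def is_running(entry: QueueEntry) -> bool:
    """A job is running when $going holds a non-zero process group ID."""
    return entry.going != 0


def daemon_missing(entry: QueueEntry) -> bool:
    """A negative priority signals that qseek is not running."""
    return entry.priority < 0


if __name__ == "__main__":
    # Hypothetical sample line; real files may format the date differently.
    sample = "0,17,main,n,alice,/home/alice/runs/water.job,3,Mon Jun 2 10:15:00 2008"
    entry = parse_queue_line(sample)
    print(entry.job_no, entry.job, is_running(entry), daemon_missing(entry))
```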

The system also uses temporary files called
'sqs.queue.tmp.{queue-name}', but these should be deleted by the
various scripts that use them.

The version number can be obtained by typing "qsub -v" or "qinit -v",
these being the two scripts most frequently used at the prompt. The
'-v' flag is not documented below. In fact this works with all the
scripts except qrun, but this use, except for qsub and qinit, is not
documented anywhere.

PROBLEMS
--------

If no jobs are running, first use 'qinit status' to check whether qseek
is running. If it is not, start it with 'qinit start'. Second, look to
see whether other users have jobs that are not running and not held and
that have higher priorities than yours. It could be that their qseek
has died. If qseek has been stopped properly by that user, the
priorities should be negative. If they are positive, check that qseek
is not running using 'qinit status' and then run 'qclear user queue',
where 'user' is the other user's username and 'queue' is the queue
where the problem is. This makes the priorities negative and your jobs
should start. Do not worry about affecting the other user. When they
start qseek, the queue will be set right for them and their jobs will
start.

BUGS
----

Not so much a bug, but possibly a piece of non-transferable code, is
the use of /bin/kill with the full path to delete process groups in
qdel. I believe some errors arise on cluster machines if the shell
version of kill is selected. This has been partially corrected in that
/bin/kill has been replaced by $kill, where this variable is what
`which kill` returns. This of course is restricted to Unix.

ALTERING sqs.conf
-----------------

To alter sqs.conf, first allow all jobs to complete or delete them from
the queues. Then run 'qinit stop' to stop qseek. Then edit sqs.conf and
run './install.pl sqs.conf' to install it correctly. For a temporary
change, you can just edit sqs.conf in $sqsdir. You will probably only
be altering $maxqu or $maxperuser.
Of course, if there is a root install, you will have to be root or see
your systems administrator. If you installed the scripts as compiled
executables with './install.pl -c', you will have to recompile them. If
you add new queues, you will have to create the queue files with
'./install queue {queue-name}'.

ACKNOWLEDGEMENTS.
-----------------

Nicolas Ferre' for testing and suggestions.

-----------------------------------------------------------------------------

Brian Salter-Duke
b_duke@bigpond.net.au
June, 2008.