Running Condor (CSAIL cluster)

by raulzito234

Running Code:

Create 2 files:

  1. Description file.
  2. Script.

Create these two files under the directory:

/data/scratch/

1. The description file, which in this case is called echo.submit. Its contents are:

###standard condor headers for CSAIL###
# preserve your environment variables
GetEnv = True
# use the plain nothing special universe
Universe = vanilla
# only send email if there's an error
Notification = Error
# Allows you to run on different "filesystem domains"
#by copying the files around if needed
should_transfer_files = IF_NEEDED
WhenToTransferOutput = ON_EXIT
###END HEADER###
###job specific bits###
Executable = echo.sh
#Arguments =
# queue log (doesn't like to be on NFS due to locking needs)
Log = /tmp/echo.$ENV(USER).log
#What to do with stdin,stdout,stderr
# $(PROCESS) is replaced by the sequential
# run number (zero based) of this submission
# see "queue" below
#Input = input.$(PROCESS)
Error = err.$(PROCESS)
Output = out.$(PROCESS)
# how many copies of this job to queue
queue 1
####END job  specific bits###
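
To make the $(PROCESS) substitution concrete, here is a small shell sketch that prints the file names you would expect for a given "queue N" with the Output and Error settings above. The function name expected_files is made up for this illustration, and it does not talk to Condor at all:

```shell
#!/bin/sh
# Hypothetical helper (not part of Condor): list the out.N / err.N files
# that "queue N" would produce with the Output/Error settings above.
expected_files() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        # $(PROCESS) is zero based, so the first run writes out.0 and err.0
        echo "out.$i err.$i"
        i=$((i + 1))
    done
}

# With "queue 3" you would expect three pairs of files.
expected_files 3
```

With "queue 1", as in echo.submit, only out.0 and err.0 are expected.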

2. The script contains the code that will run on the machine, in this case echo.sh:

#!/bin/sh
echo $PATH

Don’t forget to make the file executable by you (the user):

>> chmod u+x echo.sh

Then run the code:

 >> condor_submit <description_file>

In my case I ran the command:

>> condor_submit echo.submit

Errors:

If you get the following error:

ERROR: Failed to parse command file (line 3).

This usually means you passed condor_submit the wrong file, or your description file does not contain valid text.

I got this error because I passed the script file (echo.sh) instead of the description file (echo.submit).
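
One way to catch this mistake before submitting is a quick sanity check on the file. The sketch below is only an illustration: check_submit_file is a hypothetical helper, and the checks (no shebang on line 1, an Executable line and a queue statement present) are my own guess at what separates a description file from a script, not something Condor provides:

```shell
#!/bin/sh
# Hypothetical pre-flight check for a submit description file.
# Returns 0 if the file looks like a description file, 1 otherwise.
check_submit_file() {
    f=$1
    [ -f "$f" ] || { echo "$f: no such file" >&2; return 1; }
    # A shebang on line 1 means this is probably the script itself.
    if head -n 1 "$f" | grep -q '^#!'; then
        echo "$f looks like a script, not a description file" >&2
        return 1
    fi
    # Assumption: a usable description file has an Executable line and a
    # queue statement; match both regardless of letter case.
    grep -qi '^[[:space:]]*executable[[:space:]]*=' "$f" ||
        { echo "$f: no Executable line" >&2; return 1; }
    grep -qi '^[[:space:]]*queue' "$f" ||
        { echo "$f: no queue statement" >&2; return 1; }
    return 0
}
```

It could be used as: check_submit_file echo.submit && condor_submit echo.submit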

Managing Jobs:

The following command shows the status of submitted jobs, grouped by submitter:

>> condor_status -submitters

In its output I could see that the job I submitted was being held. So if you want to check whether your job is running, use the command above.

The commands below may not always show which jobs are queued or held (see the output further down).

To manage submitted jobs, the manual says to use the following command to see the jobs submitted by all users:

>> condor_q

Or the following command (to find jobs submitted by my user):

>> condor_q $USER

When I ran the above command the first few times, I got the following response:

-- Submitter: borg-login-1.csail.mit.edu : : borg-login-1.csail.mit.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD

0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

That is, the job I submitted to the cluster did not appear to have been submitted.

A few days later the command seemed to work normally (weird).
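
If you want to pull a single number out of that summary line (for example, how many jobs are held), a small filter is enough. This is only a sketch: held_count is a hypothetical helper, and it assumes the summary line has the same layout as the output shown above, which may differ between Condor versions:

```shell
#!/bin/sh
# Hypothetical helper: extract the "held" count from a condor_q summary
# line of the form "... 0 running, 0 held, 0 suspended".
held_count() {
    # The [ ,] before the digits makes sure the whole number is captured.
    sed -n 's/.*[ ,]\([0-9][0-9]*\) held.*/\1/p'
}

# Example using the summary line from the output above.
echo "0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended" | held_count
```

On the cluster it could be used as: condor_q $USER | held_count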

Releasing Jobs (Held Jobs):

Jobs can be held for many reasons. For example, if you are running MATLAB, there might not be enough licenses available to run the program. Once you have solved such issues, you can release your held jobs with the following command:

>> condor_release -all

This command releases all held jobs. For more information, see the condor_release manual page.

Removing Jobs:

The following command removes all jobs submitted by your user from the queue:

>> condor_rm $USER
