Using Your Clustermatic Cluster
This exercise should be done while logged in as a normal user,
not as root. You can create a normal user account with the command
"adduser username" and then set the password with
"passwd username".
Part 1: Run NAMD
NAMD is a parallel molecular dynamics application developed in our
group. It is the main application run on our clusters.
- Copy the files NAMD_2.6b1_Linux-i686-Clustermatic4.tar.gz (NAMD binary)
and apoa1.tar.gz (sample NAMD simulation)
from the workshop CD and untar them in your home directory with:
tar xzf apoa1.tar.gz
tar xzf NAMD_2.6b1_Linux-i686-Clustermatic4.tar.gz
Yes, the file says Clustermatic 4 but you're running 5. It's OK.
Clustermatic 5 is actually backwards compatible for a change.
- cd NAMD_2.6b1_Linux-i686-Clustermatic4
- Start NAMD on all four machines with:
./charmrun +p4 ./namd2 ~/apoa1/apoa1.namd
If you have problems, or want to see what's going on in the
launch process, add ++verbose to the charmrun command
line. The charmrun program interacts with the bproc system to
find which nodes are up, including the master node. If not
enough nodes are available it will start re-using nodes
(useful for SMP nodes). Running charmrun without arguments
will list its other options, such as ++skipmaster. Running
with ++skipmaster will only work if the NAMD input files are
available on the slaves. NAMD does all of its I/O from the
master process, so we run it on the master node and access our
main NFS servers.
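For example, to watch the launch process in detail, run:
./charmrun ++verbose +p4 ./namd2 ~/apoa1/apoa1.namd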
- When NAMD reaches the line that says "TIMING 20 ..." kill it with
Control-C and jot down the wallclock s/step number.
- Run NAMD again on two processors (change +p4 above to +p2) for
20 steps and compare the performance between the two. Do four
processors run twice as fast as two? How close to twice?
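As a worked example with made-up numbers: if the two-processor run
reports 4.0 s/step and the four-processor run reports 2.2 s/step, the
speedup is 4.0 / 2.2 = 1.8, or about 90% of the ideal factor of two.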
Part 2: Compile and Run Tachyon
Tachyon is a parallel ray tracer developed by John Stone for his
master's thesis. It is an example of a typical MPI application.
- Copy the file tachyon-0.97.tar.gz (Tachyon source and examples)
from the workshop CD and untar it in your home directory with:
tar xzf tachyon-0.97.tar.gz
- cd tachyon/unix
- Use a text editor to open the file Make-arch
- Search for the config options for "linux-lam"
- Copy this set of options to a new entry.
- Change (in the new entry) linux-lam to linux-mpich
- Change "CC = hcc" to "CC = gcc"
- Change -I$(LAMHOME)/h to -I/usr/mpich-p4/include
- Change -L$(LAMHOME)/lib to -L/usr/mpich-p4/lib
- Change -lmpi to -lmpich
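When you are done, the new entry should contain lines roughly like
these (shown without the surrounding make boilerplate, which you copy
from the linux-lam entry; the "..." parts stand for whatever other
flags your copy of Make-arch already had there):
"ARCH = linux-mpich"
"CC = gcc"
"CFLAGS = ... -I/usr/mpich-p4/include"
"LIBS = ... -L/usr/mpich-p4/lib ... -lmpich"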
- Save, quit the editor and run "make linux-mpich"
to build tachyon. If this doesn't work you probably missed
one of the edits above, or applied them in the wrong place.
The tachyon binary will end up in compile/linux-mpich/.
- cd (back to your home directory)
- Run Tachyon on the three slave machines with:
/usr/mpich-p4/bin/mpirun -d -p 3 \
tachyon/compile/linux-mpich/tachyon +V tachyon/scenes/balls.dat
The Clustermatic mpirun is broken and does not allow the master
node to be used for MPI jobs. This is fine for their 1000-processor
clusters where they want minimal load on the master, but bad for us.
Tachyon reads input on every node, so the NFS mounting of /home on
the slaves is necessary.
- Look at the timing output, which is broken into different
stages of the calculation. Run on two and one processors
(change -p 3) and calculate speedups for the different
stages as well as the total time.
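For example, the two- and one-processor runs are:
/usr/mpich-p4/bin/mpirun -d -p 2 \
tachyon/compile/linux-mpich/tachyon +V tachyon/scenes/balls.dat
/usr/mpich-p4/bin/mpirun -d -p 1 \
tachyon/compile/linux-mpich/tachyon +V tachyon/scenes/balls.dat
The speedup for a stage is its one-processor time divided by its
three-processor time.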
Part 3: Run Under Grid Engine
Sun Grid Engine (SGE) is a free, open source, general purpose,
cross-platform queueing system. In the genealogy of queueing systems,
it is a descendant of the free DQS package, which was commercialized
by a German company that was recently bought by Sun.
- Run "qstat -f" to see the queue that was automatically
created. There should be only one queue, for the master node.
The states column at far right is used for error flags.
- Run "qconf -sq all.q" to see the queue setup for the
cluster. Note that there are many options to restrict
user access, memory usage, runtime, etc. that are turned off
by default. The only unique things are the qname and hostlist.
This is a newer version of Grid Engine than on the Rocks CD, so
there will be a few differences if you compare them.
- Use a text editor to create the file tachyon.job containing:
#$ -cwd
#$ -j y
#$ -S /bin/bash
/usr/mpich-p4/bin/mpirun -d -p `bpstat -t allup` \
tachyon/compile/linux-mpich/tachyon +V tachyon/scenes/balls.dat
Notice the similarity to the command for running Tachyon
manually. Since SGE doesn't know about bproc or the slave nodes,
we use bpstat to find out how many slave nodes are up.
The options preceded by #$ are parsed by SGE as if they were
specified on the command line. -cwd causes the job to execute in
the current working directory. -j y merges standard error and output
into a single file. -S /bin/bash says to use the bash shell for this
script.
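You can see what the backquoted command will produce by running it
yourself at a shell prompt; with all three slave nodes up it should
print 3:
bpstat -t allup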
- Submit the job to run on the full cluster with the command
"qsub tachyon.job". Note that there is only one queue for
the job to go to, all.q.
- Use "qstat -f" to check on the job until it is scheduled,
then look for output files named tachyon.job.oX and
tachyon.job.poX, where X is the job number output by qsub. View
these files to see the output.
- Submit several jobs so that a backlog develops. You can use the
same tachyon.job file for all of them; just use the up arrow and
hit return to submit jobs quickly. Use qstat to monitor how the
jobs are executed (the default scheduling policy is to take the
earliest-submitted job that can be run, and the scheduler runs at
regular intervals).
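If your fingers get tired, a one-line shell loop submits a batch of
jobs just as well (the count of five is arbitrary):
for i in 1 2 3 4 5; do qsub tachyon.job; done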
- Use a text editor to create the file namd.job containing:
#$ -cwd
#$ -j y
#$ -S /bin/bash
dir=$HOME/NAMD_2.6b1_Linux-i686-Clustermatic4
$dir/charmrun +p$((`bpstat -t allup` + 1)) $dir/namd2 ~/apoa1/apoa1.namd
Since NAMD uses the head node, we use some shell magic
to add one to the number of available slave nodes returned by bpstat.
If these were dual-processor nodes, we would need to multiply by
two as well.
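For example, the hypothetical dual-processor version of that last line
would be:
$dir/charmrun +p$((2 * (`bpstat -t allup` + 1))) $dir/namd2 ~/apoa1/apoa1.namd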
- Submit the job with the command "qsub namd.job".
- Use qstat to monitor the job until it starts running, then use
"tail -f namd.job.oX" (X is the job number) to watch the
job output.
- When you get tired of this, Control-C out of tail and use
"qdel X" (X is the job number) to kill the job. Use qstat
to monitor the job until it is killed.
- We are going to add a queue, so become root with "su root"
- Dump the all.q configuration with "qconf -sq all.q > /tmp/q".
- Open /tmp/q with an editor and change qname to express,
subordinate_list to all.q, and h_rt to 60.
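After the edits the three changed lines in /tmp/q should read (leave
every other line exactly as qconf dumped it; the spacing between the
columns does not matter):
qname                 express
subordinate_list      all.q
h_rt                  60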
- Load the file as a new queue with "qconf -Aq /tmp/q".
This creates a queue named express that will suspend jobs in
the main queue for up to 60 seconds. In real life you would use
a longer time, like 1800 seconds (30 minutes).
- exit from the root shell to become a normal user.
- Submit a long-running NAMD job with "qsub -q all.q namd.job";
use qstat to see that it starts in the default queue all.q.
Now that we have multiple queues available it is important
to be specific about which queue we want. Otherwise, if
the all.q queue was busy this job would run in the express
queue and be killed after 60 seconds.
- Submit another NAMD job with "qsub -q express namd.job"; use
qstat to see that it starts in the express queue, and that the
old job has an S in the states column since it is suspended.
- View the live output of the old job with "tail -f
namd.job.oX" to see that it is stopped. If you wait for a
minute it should restart when the job in the express queue is
killed for exceeding its time limit. Having an express queue
is very useful for short test and setup runs. Only bproc-based
systems like Clustermatic can do this smoothly for parallel runs
(remember that SGE knows nothing about bproc). You must,
however, be sure that the cluster has enough memory for both
jobs.
Part 4: There Is No Part 4
Compiling a program and running it under a queueing system is likely
all you will ever do on your cluster. We've done a typical
application (Tachyon) and a not-so-typical one (NAMD). At this
point you might want to bpsh to a compute node to see what that
environment is like, or go see how the Rocks folks are doing. If
you're really ambitious, download your own code and see if it
compiles and runs.
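For example, assuming slave node 0 is up, these commands run on the
slave (with bproc the binaries migrate over from the master, so nothing
needs to be installed on the node):
bpsh 0 uname -a
bpsh 0 ls /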
See Also
Clustermatic web site (http://www.clustermatic.org/)
Grid Engine web site (http://gridengine.sunsource.net/)
NAMD web site (http://www.ks.uiuc.edu/Research/namd/)
Tachyon web site (http://jedi.ks.uiuc.edu/~johns/raytracer/)