TRIUMF Grid Software and Applications

Compute Element Remedies

  • Symptom: Slow/failing response to pbs commands, e.g. pbsnodes, server stop.
    • If one or more WNs are down but their jobs are still listed as running in the queue, Torque gets into this state. The server must be stopped and all traces of those jobs removed.
    • 1. Use "$ pbsnodes -a" to get the job IDs on the dead node.
      2. $ cd /var/spool/pbs/server_priv/jobs
      3. Prepare, but don't run, a command to remove the 2 entries per job in this dir.
      $ echo rm 872956* 873210*
      4. $ /etc/init.d/pbs_server stop
      5. $ rm 872956* 873210*
      6. $ /etc/init.d/pbs_server start
      7. Make sure the bad WN is still offline (although this will be automatic).
      8. $ pbsnodes -o wn003.triumf.lcg

      You may need to repeat the pbsnodes and stop commands and be patient with them, because the
      symptom is periodic failure of exactly these commands.
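
The steps above can be sketched as a pair of shell helpers. This is only a sketch under stated assumptions: the function names list_job_files and purge_jobs are hypothetical, not part of Torque. The jobs directory and init script paths are the ones used in the steps above. The sketch also assumes that the init script returns a non-zero status when the stop fails, which should be checked on the actual CE before relying on it.

```shell
#!/bin/sh
# Sketch only: list_job_files and purge_jobs are hypothetical helpers.
# JOBS_DIR defaults to the directory named in step 2.
JOBS_DIR="${JOBS_DIR:-/var/spool/pbs/server_priv/jobs}"

# Dry run (step 3): print every file in JOBS_DIR whose name starts
# with one of the given job IDs. Nothing is removed here.
list_job_files() {
    for id in "$@"; do
        for f in "$JOBS_DIR"/"$id"*; do
            # An unmatched glob stays literal, so skip names that do not exist.
            if [ -e "$f" ]; then
                printf '%s\n' "$f"
            fi
        done
    done
    return 0
}

# Steps 4 to 6: stop the server, remove the listed files, start the server.
# Assumption: a failed stop yields a non-zero exit status, in which case
# nothing is removed and the call can simply be repeated.
purge_jobs() {
    files=$(list_job_files "$@")
    if [ -z "$files" ]; then
        echo "no job files matched" >&2
        return 1
    fi
    /etc/init.d/pbs_server stop || return 1
    printf '%s\n' "$files" | while IFS= read -r f; do
        rm -- "$f"
    done
    /etc/init.d/pbs_server start
}
```

A typical session would first run "list_job_files 872956 873210" and check that two entries per job are printed, then run "purge_jobs 872956 873210", and finally mark the node offline by hand as in step 8. Keeping the listing separate from the removal mirrors the "prepare but don't run" advice in step 3.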