The new compute cluster is beginning to feel like a production system. I’m currently run off my feet installing software for the stream of new users. Mostly this is fine, but occasionally I run into software that makes me want to bang my head repeatedly on my desk until the pain goes away; or, more accurately, makes me want to bang the programmer’s head on the desk.
Just today we received a Linux port of a code that has been running on the Windows Condor pool for a while now. Everything seemed fine except for its stubborn refusal to run if it couldn’t find a windowing system. Bear in mind that it doesn’t actually produce any graphical output; it just dies if it can’t connect to X. After a bit of futzing around we discovered that the people who normally run this code do something like:
Xvfb :1 -screen 0 1024x1024x8 &
export DISPLAY=:1
./stupid_code_that_wants_X
Xvfb is the X virtual framebuffer. It provides a running X server without needing any real graphics hardware or display.
This works just great locally, but if you want to launch it as a script in the job scheduling system (we use PBSpro) then you need to be a bit more careful. What happens if two of these jobs try to launch on the same machine? Obviously one of them will fail because display 1 is already allocated. What I really needed was a script that tries to launch Xvfb and increments DISPLAY on failure until it finds a display that is free. For your edification here it is:
get_xvfb_pid () {
    # PID of the last Xvfb listed for this user
    XVFB_PID=`ps -efww | grep -v grep | grep Xvfb |\
        grep $USERNAME | tail -n 1 | awk '{print $2}'`
}

create_xvfb () {
    USERNAME=`whoami`
    DISPLAYNO=1
    while [ -z "$xvfb_success" ]
    do
        # Remember the last Xvfb that was running before we start our own
        get_xvfb_pid
        old_XVFB_PID=$XVFB_PID
        XVFB_PID=""
        Xvfb :${DISPLAYNO} -screen 0 1024x1024x8 >& /dev/null &
        sleep 1
        get_xvfb_pid
        if ! [ -z "$old_XVFB_PID" ]
        then
            # An Xvfb already existed: ours started only if the last PID changed
            if [ -n "$XVFB_PID" ] && ! [ "$XVFB_PID" == "$old_XVFB_PID" ]
            then
                echo "Started XVFB on display $DISPLAYNO process $XVFB_PID"
                xvfb_success=1
            else
                DISPLAYNO=$(($DISPLAYNO + 1))
                XVFB_PID=""
            fi
        else
            # No Xvfb existed before: any PID we find now must be ours
            if [ -n "$XVFB_PID" ]
            then
                echo "Started XVFB on display $DISPLAYNO process $XVFB_PID"
                xvfb_success=1
            else
                DISPLAYNO=$(($DISPLAYNO + 1))
                echo "FAIL!" $XVFB_PID
                XVFB_PID=""
            fi
        fi
    done
    export XVFB_PID
    export DISPLAY=:${DISPLAYNO}
}

kill_xvfb () {
    kill $XVFB_PID
}
You can source it and call the functions from a script, like this:
[arccacluster8]$ . ./xvfb_helper
[arccacluster8]$ create_xvfb
Started XVFB on display 1 process 9563
[arccacluster8 ~]$ echo $DISPLAY
:1
[arccacluster8 ~]$ echo $XVFB_PID
9563
[arccacluster8 ~]$ ps -efw | grep Xvfb
username 9563 9498 0 19:31 pts/8 00:00:00 Xvfb :1 -screen 0 1024x1024x8
[arccacluster8 ~]$ kill_xvfb
[arccacluster8 ~]$ ps -efw | grep Xvfb
[arccacluster8 ~]$
I submit that this is a disgraceful hack, but it might come in handy to someone else.
I wasn’t aware of Xvfb, thanks Huw.
With regard to get_xvfb_pid, may I respectfully refer you to the -U option to ps(1) and awk pattern matches?
The function could be written with less overhead as:
get_xvfb_pid () {
    XVFB_PID=`ps -efwwU $USERNAME | awk '/Xvfb/ {print $2}'`
}
The presence of tail in your pipeline looks wrong too; I think ps(1) sorts by controlling terminal then process ID, neither of which is likely to be useful!
ps on Linux seems to sort by PID which, given the way Linux behaves, seems to equate to latest process last. Hence the use of tail -n 1, which effectively gives us the PID of the last Xvfb to be spawned.
It didn’t occur to me to use the -U flag of ps. I shall use it next time I need to grep $USERNAME; it’s clearly the better way. The awk trick doesn’t work because the /Xvfb/ pattern matches the awk command itself and you end up with the PID of awk, not Xvfb.
Process IDs will wrap, which is my main concern. The awk pattern could be refined to filter out toothpicks, perhaps.
More importantly, I fed you a duff command line – -e overrides -U in most implementations, so you’d probably need to drop it (I’m pleased to comply with the rules that state when being a smartarse, it’s compulsory to make at least one mistake).
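One way to sidestep both problems raised above (the pattern matching the awk process itself, and -e overriding the user selection) is to compare the command name column exactly instead of pattern-matching the whole line. What follows is only a sketch, assuming a procps-style ps that accepts -u together with -o pid= -o comm=; the helper name last_pid_by_comm is made up for illustration. It still just takes the last PID that ps lists, so the concern about PIDs wrapping applies to it as well.

```shell
# Hypothetical helper: read "PID COMMAND" lines on stdin and print the PID
# of the last line whose command name is exactly $1. Prints nothing if no
# line matches.
last_pid_by_comm () {
    awk -v name="$1" '$2 == name { pid = $1 } END { if (pid != "") print pid }'
}

get_xvfb_pid () {
    # -u selects this user's processes, so -e is dropped.
    # "pid=" and "comm=" print those two columns with no header line.
    XVFB_PID=$(ps -u "$(id -un)" -o pid= -o comm= | last_pid_by_comm Xvfb)
}
```

Because the comparison is on the command name only, neither the awk process nor a username that happens to contain "Xvfb" can match.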
This script has a race condition: you are testing for the ID of the last running Xvfb to change. Somebody else can start another Xvfb after you test but before you start your own, thereby fooling your script.
Proposed fix:
create_xvfb () {
    DISPLAYNO=1
    while [ -z "$xvfb_success" ]
    do
        Xvfb :${DISPLAYNO} -screen 0 1024x1024x8 >& /dev/null &
        # $! is the PID of the Xvfb we just launched, so no grepping is needed
        XVFB_PID=$!
        sleep 1
        # If our Xvfb is still alive after a second, assume it got the display
        if ps --pid "$XVFB_PID" > /dev/null
        then
            echo "Started XVFB on display $DISPLAYNO process $XVFB_PID"
            xvfb_success=1
        else
            echo "Failed to run Xvfb on display $DISPLAYNO"
            DISPLAYNO=$(($DISPLAYNO + 1))
        fi
    done
    export XVFB_PID
    export DISPLAY=:${DISPLAYNO}
}
kill_xvfb () {
    kill $XVFB_PID
}
It simplifies your code, you are not firing grep upon grep upon grep and I think it is more robust too. E.g. what if I had a user named “Xvfb” that would show up in ps?
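A variation on the same idea, offered as a sketch rather than a tested replacement: it keeps the "launch, remember $!, check it is still alive" approach but adds an upper bound on the display number, so a job cannot loop forever if Xvfb is missing or every display is taken. The names start_on_free_display, xvfb_1024 and XVFB_WAIT are invented for this example, and the one second default wait is an assumption carried over from the scripts above.

```shell
# start_on_free_display CMD FIRST LAST
#   CMD   - command or shell function that takes ":N" as its first argument
#   FIRST - first display number to try
#   LAST  - last display number to try
# On success, exports DISPLAY and XVFB_PID and returns 0; otherwise returns 1.
start_on_free_display () {
    cmd=$1
    n=$2
    last=$3
    while [ "$n" -le "$last" ]
    do
        "$cmd" ":$n" > /dev/null 2>&1 &
        pid=$!
        # Give the server time to exit if the display is already taken
        sleep "${XVFB_WAIT:-1}"
        # kill -0 sends no signal; it only checks that the process still exists
        if kill -0 "$pid" 2> /dev/null
        then
            XVFB_PID=$pid
            DISPLAY=:$n
            export XVFB_PID DISPLAY
            return 0
        fi
        n=$((n + 1))
    done
    return 1
}

# Wrapper with the screen geometry used in the post
xvfb_1024 () {
    Xvfb "$1" -screen 0 1024x1024x8
}
```

In a job script this might be called as "start_on_free_display xvfb_1024 1 20 || exit 1", followed by the program that needs X and then kill_xvfb.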
Roman,
I like your solution better. I’ll do some testing this weekend and update the post appropriately.
Thanks,
Huw