Last week I was debugging a problem with this site’s nginx server: from time to time it hung and I had to restart the process. Some time ago I had written a little script that checked whether it was responding and restarted it otherwise, but that was a workaround, not a real solution.
So I spent a few days really looking into it, asking for support on the nginx mailing list and reporting my findings there. One useful tip I got was to enable the “debug” level on the error log, which produces full traces of what each process (identified by its PID) is doing as it handles a request: the rewrites, the upstreams, etc.
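For reference, debug-level logging is controlled by the error_log directive (nginx has to be built with --with-debug for the debug level to be available); the log path here is just an example:

# in nginx.conf: the "debug" level traces every step of request processing
error_log /var/log/nginx/error.log debug;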
With this extended log and the PID of the malfunctioning process, it’s quite easy to find out what that process was doing right before it hung. To get the PIDs of the hung processes, I extended my check-and-restart script to log some generic system metrics right before restarting nginx: netstat -nap (which shows the PID), ps, vmstat, etc.
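For example, once the script has captured the netstat output, correlating a stuck socket with its trace in the debug log looks roughly like this (the PID 12345 is hypothetical):

# sockets stuck in CLOSE_WAIT, with the owning PID/program name
netstat -nap | grep CLOSE_WAIT

# nginx prefixes each debug log line with the worker PID,
# so the stuck worker's trace can be pulled out directly
grep ' 12345#' /var/log/nginx/error.log | tail -n 50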
#!/bin/sh

TIMEOUT=20
CHECK=http://localhost/wp-admin/
LOG=/var/log/checkWeb/checkWeb-$(date +%Y%m%d).log
LOGR=/var/log/checkWeb/restart-$(date +%Y%m%d).log
TMP=/tmp/checkWeb-$RANDOM

# Fetch the check URL once; on failure, dump system state and restart nginx
if ! wget -t 1 -o /dev/null -O /dev/null -T $TIMEOUT $CHECK
then
    echo "ERROR, restarting nginx"
    echo "** RESTARTING **" >> $TMP
    date >> $TMP
    echo "- CLOSE_WAIT:" >> $TMP
    netstat -nap | grep -c CLOSE_WAIT >> $TMP
    echo "- vmstat" >> $TMP
    vmstat 1 5 >> $TMP
    echo "- free" >> $TMP
    free >> $TMP
    echo "- ps" >> $TMP
    ps aux >> $TMP
    echo "- netstat" >> $TMP
    netstat -nap >> $TMP
    echo "" >> $TMP
    echo "" >> $TMP
    #pkill -9 -f php-cgi
    pkill -9 -f nginx
    sleep 1s
    /etc/init.d/nginx start
    cat $TMP
    cat $TMP >> $LOG
    date >> $LOGR
fi

rm -f $TMP
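A script like this just needs to be run periodically; a minimal crontab entry would do (the path is hypothetical):

# check the site every minute
* * * * * /usr/local/bin/checkWeb.sh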
This way, each time localhost/wp-admin was unresponsive (I was debugging a WordPress site), besides restarting nginx I was collecting a lot of system info. Over time I realized that the nginx processes were not actually hanging: some of their sockets got stuck in the CLOSE_WAIT state forever, until the process was restarted. Looking up the PIDs that netstat reported in the debug error log, the last request those processes handled before ending up in CLOSE_WAIT was always the same. On my blog I have some examples of how to run servers with daemontools, and daemontools uses named pipes (FIFOs), which become a kind of black hole if there’s no process feeding them; when nginx hit one of these FIFOs, it hung.
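The black-hole behaviour is easy to reproduce: opening a FIFO for reading blocks until some other process opens it for writing, so a web server trying to serve one as a static file just sits there. A minimal demonstration:

mkfifo /tmp/demo-fifo
cat /tmp/demo-fifo    # blocks indefinitely: no writer on the other end
# (from another terminal, "echo hi > /tmp/demo-fifo" would unblock it)
rm /tmp/demo-fifo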
The funny thing is that I never had this problem with either Apache or lighttpd. But anyway, the problem is not nginx but those FIFOs, which shouldn’t really have been there. I removed them and have had no hung processes in five days, whereas before nginx was being restarted 3-4 times a day.
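If you suspect the same problem, stray FIFOs under the document root are easy to find and remove (the docroot path is just an example):

# list named pipes under the docroot, then delete them
find /var/www -type p
find /var/www -type p -delete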