Debugging problems with nginx

Last week I was debugging a problem with this site’s nginx server: from time to time it hung and I had to restart the process. Some time ago I wrote a little script that checked whether it was running OK and restarted it otherwise, but that wasn’t a real solution.

So I spent some days really looking into it, asking for support and reporting my findings on the nginx mailing list. One useful tip I got there was to enable “debug” mode on the error log, which shows full traces of the processes (including their PID) as they handle each request: the rewrites, upstreams, etc.

error_log /var/log/nginx/$host-error.log debug;

With this extended log and the PID of the malfunctioning process, it’s quite easy to find out what that process was doing right before hanging. To find the PIDs of the hung processes, I extended my check-and-restart script to log some generic system metrics right before restarting nginx: netstat -nap (which shows the PID), ps, vmstat, etc.

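#!/bin/sh
# Check that the site answers; if it doesn't, log the system state and restart nginx.

TIMEOUT=20
CHECK=http://localhost/wp-admin/
LOG=/var/log/checkWeb/checkWeb-$(date +%Y%m%d).log
LOGR=/var/log/checkWeb/restart-$(date +%Y%m%d).log
# mktemp instead of $RANDOM, which not every /bin/sh provides
TMP=$(mktemp /tmp/checkWeb-XXXXXX) || exit 1

# One attempt (-t 1), give up after $TIMEOUT seconds, discard the page and wget's own log
if ! wget -t 1 -o /dev/null -O /dev/null -T "$TIMEOUT" "$CHECK"
then
    echo "ERROR, restarting nginx"
    echo "** RESTARTING **" >> "$TMP"
    date >> "$TMP"
    echo "- CLOSE_WAIT:" >> "$TMP"
    netstat -nap | grep -c CLOSE_WAIT >> "$TMP"
    echo "- vmstat" >> "$TMP"
    vmstat 1 5 >> "$TMP"
    echo "- free" >> "$TMP"
    free >> "$TMP"
    echo "- ps" >> "$TMP"
    ps aux >> "$TMP"
    echo "- netstat" >> "$TMP"
    netstat -nap >> "$TMP"
    echo "" >> "$TMP"
    echo "" >> "$TMP"

    # pkill -9 -f php-cgi
    pkill -9 -f nginx
    sleep 1
    /etc/init.d/nginx start

    # Show the report, append it to the daily log, and record the restart time
    cat "$TMP"
    cat "$TMP" >> "$LOG"
    date >> "$LOGR"
fi

rm -f "$TMP"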

This way, each time localhost/wp-admin was unresponsive (I was debugging a WP site), besides restarting nginx I got a lot of system info. Over time I realized that the nginx processes were not actually hanging: some of their sockets got stuck in the CLOSE_WAIT state until the process was restarted. I took the PIDs of those processes from the netstat output and looked them up in the error log, and the last request they were processing before getting into CLOSE_WAIT was always the same. On my blog I have some examples of how to run servers with daemontools. daemontools uses named pipes (FIFOs), which can become a kind of black hole if there’s no process feeding them, and when nginx hit one of these FIFOs, it hung.
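
As a rough illustration of that lookup, here is a minimal sketch that counts CLOSE_WAIT sockets per PID from netstat -nap output. The function name close_wait_by_pid is made up for this example, and it assumes the usual Linux net-tools column layout (state in the sixth column, PID/program in the seventh).

```shell
#!/bin/sh
# close_wait_by_pid (hypothetical helper): reads `netstat -nap` output on
# stdin and prints "PID COUNT" for every process owning CLOSE_WAIT sockets,
# busiest process first.
# Assumes the net-tools layout: column 6 = state, column 7 = PID/program.
close_wait_by_pid() {
    awk '$6 == "CLOSE_WAIT" {
             split($7, owner, "/")   # "1234/nginx:" -> owner[1] = "1234"
             count[owner[1]]++
         }
         END { for (pid in count) print pid, count[pid] }' |
    sort -k2,2nr -k1,1n
}

# Typical use (run as root to see the PIDs of other users' processes):
#   netstat -nap | close_wait_by_pid
```

The PIDs it prints are the ones to search for in the debug error log.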

The funny thing is that I never had this problem with either Apache or lighttpd. But anyway, the problem is not nginx but those FIFOs, which shouldn’t really be there. I removed them and have had no hung processes in five days, whereas before nginx was being restarted 3-4 times a day.
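
To check whether a served directory contains stray named pipes, something along these lines should do. It is only a sketch: list_fifos is a made-up name and /var/www is a placeholder for your own document root.

```shell
#!/bin/sh
# list_fifos (hypothetical helper): print every named pipe found under the
# given directory. `-type p` is find's standard test for FIFOs.
list_fifos() {
    find "$1" -type p
}

# Example (placeholder path):
#   list_fifos /var/www
```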

exec

exec is a built-in shell command that makes the currently running shell process execute a binary itself, instead of forking and running the binary in a child process.

When you run a command in a shell script, the shell forks a child process and runs the command there. At the syscall level this is the classic:

if( (pid=fork()) == 0) { exec(command); exit(); } wait();

And this is usually what we want, because we will keep running commands after that one. Nevertheless, sometimes this is a problem, like when:

  • we have a program that monitors a given process and doesn’t work properly if there’s an intermediate shell process, but we need to launch that process via a shell script for whatever reason (to initialize some variables, run the program with nice, whatever)
  • on macOS we’re running a program via a shell script and get two icons on the Dock, one for the shell and another one for the program

Running a command with exec tells the shell not to fork, but to run the command directly in the shell process. An important thing to note is that the shell script ends there: no further commands of the script will be executed, as the shell process is replaced by the command’s process, so to speak.

#!/bin/sh
# initialize variables, parse command-line parameters, etc.
export IP=$1
exec nice command "$@"
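
A quick way to see both effects for yourself is the sketch below. The two function names are made up for the example. The first shows that the PID is unchanged across exec, the second that nothing after exec runs.

```shell
#!/bin/sh
# same_pid_demo (hypothetical name): the outer shell prints its PID, then
# exec replaces it with a second shell, which prints its own PID.
# Both lines show the same number because no new process was created.
same_pid_demo() {
    sh -c 'echo $$; exec sh -c "echo \$\$"'
}

# after_exec_demo (hypothetical name): the echo is never reached, because
# the shell running that line has already been replaced by `true`.
after_exec_demo() {
    sh -c 'exec true; echo "never printed"'
}

same_pid_demo
after_exec_demo
```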

My new server

Ladies and gentlemen, please let me introduce you to my new server, the one I’ve been blogging about lately:

[Photos: dscf0042.JPG, dscf0044.JPG]

What? You don’t see it? Yes! The small grey box on top of the Iomega disk, slightly bigger than the Fonera.

In case you don’t know it yet, it’s a Linksys NSLU2, a small device that costs around $100 and comes with two USB 2.0 ports and an ethernet connection. Plug an external USB hard drive into it and it becomes available over the network as a NAS share. And the best part is: you can flash its firmware and install Debian!! :-D

It’s not that powerful: it has an XScale (ARM) processor at 266 MHz and only 32 MB of RAM. There are pages explaining how to upgrade it to up to 256 MB. Nevertheless, it works, it’s small, it doesn’t make noise, and it draws little power.

So far I’m running the following on it and it works quite well:

[Screenshot: top-nslu2.png]

# cat /proc/cpuinfo
Processor       : XScale-IXP42x Family rev 2 (v5l)
BogoMIPS        : 266.24
Features        : swp half fastmult edsp
CPU implementer : 0x69
CPU architecture: 5TE
CPU variant     : 0x0
CPU part        : 0x41f
CPU revision    : 2
Cache type      : undefined 5
Cache clean     : undefined 5
Cache lockdown  : undefined 5
Cache format    : Harvard
I size          : 32768
I assoc         : 32
I line length   : 32
I sets          : 32
D size          : 32768
D assoc         : 32
D line length   : 32
D sets          : 32
Hardware        : Linksys NSLU2
Revision        : 0000
Serial          : 0000000000000000

# free
             total       used       free     shared    buffers     cached
Mem:         29988      28988       1000          0        404       4808
-/+ buffers/cache:      23776       6212
Swap:       979924      41164     938760

# uname -a
Linux eliza 2.6.18-6-ixp4xx #1 Tue Feb 12 00:57:53 UTC 2008 armv5tel GNU/Linux

# pstree
init-+-afpd---afpd
     |-atalkd
     |-atd
     |-avahi-daemon---avahi-daemon
     |-cnid_metad
     |-cron
     |-dbus-daemon
     |-events/0
     |-getty
     |-khelper
     |-klogd
     |-ksoftirqd/0
     |-kthread-+-aio/0
     |         |-kblockd/0
     |         |-khubd
     |         |-3*[kjournald]
     |         |-kmirrord
     |         |-kpsmoused
     |         |-kseriod
     |         |-kswapd0
     |         |-2*[pdflush]
     |         |-scsi_eh_0
     |         `-usb-storage
     |-mtdblockd
     |-nmbd
     |-papd
     |-portmap
     |-rpc.statd
     |-slpd
     |-smbd---smbd
     |-sshd---sshd---sshd---bash---su---bash---pstree
     |-svscanboot-+-readproctitle
     |            `-svscan-+-supervise---dnscache
     |                     |-3*[supervise---multilog]
     |                     |-supervise---tinydns
     |                     `-supervise---mlnet---mlnet---mlnet
     |-syslogd
     `-udevd

daemontools scripts for lighttpd and PHP

I’ve prepared a set of daemontools scripts to launch and monitor lighttpd and its PHP processes spawned with spawn-fcgi. Here is the README, a tar file with the scripts, and here you can browse the directories with all the scripts.

PS: yes, I like daemontools. It helps me achieve high availability for many services, keeping them up even when a server misbehaves and some process dies. This avoids a lot of late-night calls. It’s a great invention. :)

Asterisk clusters with the foneBRIDGE2

At work we have an Asterisk cluster made up of two Proliant servers and a Redfone’s foneBRIDGE2 that handles the ISDN lines. The heartbeat daemon is installed on both servers and monitors them. In the event of a system failure on the master, it switches the service to the backup server, migrating the main IP and activating all the needed daemons. I’ll briefly explain the whole setup here as a reference.

Overview

As I’ve said, we have two Asterisk servers, named asterisk00 and asterisk01.example.com, the former being the master. Each of them has its own IP address (say, 10.10.10.1 and .2) and there’s an additional “virtual” address (.3) that will “jump” from one server to the other if the primary crashes.

Our foneBRIDGE2 is a quad model, but we only use two ISDN lines: one to our telco and the other to a legacy PBX. Besides the ISDN interfaces, the foneBRIDGE has two ethernet sockets to connect it to the servers, but only one of them (the first one) accepts configuration commands to set up the FB, switch servers, etc. You’d usually put a switch on that interface so that every server has access to it and can configure the FB, but my boss saw this switch as a single point of failure and refused to use one, an opinion I don’t share, as doing without it also has its drawbacks, as we’ll see. So our setup is a little bit funny in that asterisk01 is connected to the primary FB interface and asterisk00 to the secondary one. The logic here is: asterisk00 is going to be running 99% of the time, and if it crashes, asterisk01 has to re-configure the FB, so asterisk01 needs access to the config port. Of course, now asterisk01 is a SPOF: if our backup server goes down for any reason, we risk losing control of the FB, rendering our cluster unusable!

We use the FreePBX web GUI, which in turn uses a MySQL DB to store all the settings. If you don’t use it, you can skip all instructions referring to MySQL and Apache.

MySQL synchronization

MySQL’s native NDB clustering is quite useful here. Set it up, keep the service running at all times on both nodes, and the DB system automatically handles the synchronization across the cluster.

Setting up a MySQL cluster is outside the scope of this document; check the official docs here or look for a howto on Google. :)

Filesystem synchronization

All of Asterisk’s config files, libraries, modules, the users’ voicemail dirs… need to be synchronized over the cluster’s nodes. There are several alternatives here:

  • A SAN. Expensive but convenient. We don’t have one so it’s out of the question. :)
  • DRBD. If you don’t know it, think of it as a partition-level RAID1 system over the network. It works great and we use it on several other clusters, but not here. DRBD’s only drawback is that the synchronized partition can’t be mounted on both servers at once, so you can access the files only on the active node. We wanted to have everything accessible on both servers so that we could use the backup one as a testing ground for new configurations, software upgrades, etc., so DRBD wasn’t an option.
  • csync2. It’s like rsync on steroids. Similar to unison, but can synchronize files over more than two nodes. We’re using it for our Asterisk cluster.

Our csync2.conf file looks like this:

group asterisk
{
    host asterisk00.example.com asterisk01.example.com;

    key /etc/csync2.key_asterisk;

    backup-directory /var/backups/csync2;
    backup-generations 10;

    auto none;

    exclude *~ .* ok lock control;
    include /etc/csync2.cfg;

    include /etc/hosts;
    include /etc/ha.d/ha.cf;
    include /etc/ha.d/haresources;

    include /etc/asterisk;
    include /etc/redfone*;

    include /var/www;
    include /var/lib/asterisk;
    include /var/spool/asterisk;
    include /usr/lib/asterisk;
    include /etc/amportal.conf;
    include /var/log/asterisk;
}

We run the synchronization every five minutes. There’s no need to sync more frequently, as there won’t be that many changes in the configuration (it’s a stable system, maybe a new phone added every X weeks) and we seldom use the voicemail. The synchronization is launched from /etc/cron.d/FB-csync2:

*/5 * * * * root [ -f /tmp/.FB-master ] && /usr/sbin/csync2 -xv

This /tmp/.FB-master file is just a “flag” that marks the master server, so that the synchronization only runs there. In the section about heartbeat we’ll see how and when this file is created.
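
The same flag-file guard can be written as a small reusable function. This is only a sketch: run_if_master is a made-up name, and unlike the cron one-liner it returns success when the flag is missing.

```shell
#!/bin/sh
# run_if_master (hypothetical helper): run the given command only if the
# flag file exists, i.e. only on the node currently acting as master.
# Usage: run_if_master FLAGFILE COMMAND [ARGS...]
run_if_master() {
    flag=$1
    shift
    [ -f "$flag" ] || return 0   # not the master: silently do nothing
    "$@"
}

# Roughly what the cron entry does:
#   run_if_master /tmp/.FB-master /usr/sbin/csync2 -xv
```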

fonulator

fonulator is Redfone’s utility to configure the foneBRIDGE. As I’ve explained before, only asterisk01 (the backup system) can configure the FB in our setup, and each server is connected to a different ethernet port on the FB. So in the event of a crash, we need to change the destination server AND the interface used to send it the TDMoE frames.

To this end, we have two different redfone.conf files (redfone_asterisk00.conf and redfone_asterisk01.conf). They look the same except for the “serverX” and “fbX” directives on the spans:

[globals]
fb1=00:50:C2:65:D0:68
fb2=00:50:C2:65:D0:69

# asterisk00.example.com
server1=00:80:5A:61:E7:FF
# asterisk01.example.com
server2=00:04:76:11:A3:EC

card=eth1,fb1

# Telco
[span1]
span=1,0,0,ccs,hdb3,crc4
server1
fb2
pri

# Legacy PBX
[span2]
span=2,0,0,ccs,hdb3,crc4
server1
fb2
pri

That was redfone_asterisk00.conf. It instructs the FB to send the ISDN traffic to asterisk00 (server1 here) over the second ethernet interface (fb2). The redfone_asterisk01.conf file uses server2 and fb1.
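
Since the two files differ only in those span directives, one can be derived from the other. The following is a minimal sketch, not part of the original setup: to_backup_conf is a made-up name, and the example writes to a scratch file so the real config is not overwritten.

```shell
#!/bin/sh
# to_backup_conf (hypothetical helper): reads redfone_asterisk00.conf on
# stdin and writes a variant on stdout whose spans use server2 over fb1.
# Only lines consisting solely of "server1" or "fb2" are touched, so the
# fb2=... MAC definition in [globals] is left alone.
to_backup_conf() {
    sed -e 's/^server1$/server2/' -e 's/^fb2$/fb1/'
}

# Example (scratch output path, compare it with the real file by hand):
#   to_backup_conf < /etc/redfone_asterisk00.conf > /tmp/redfone_asterisk01.conf
```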

heartbeat

And now, the final piece that ties the rest together: heartbeat. Our haresources file looks like this:

asterisk00.example.com MailTo::asterisk@example.com::Asterisk 10.10.10.3 FB_fonulator FB_master FB_asterisk apache2

Meaning that:

  • asterisk00.example.com is the master server
  • in the event of a service takeover, send a mail to asterisk@example.com
  • the service’s virtual IP is 10.10.10.3
  • start (stop) the FB_fonulator, FB_master, FB_asterisk and apache2 services (remember to unlink the apache2 link from /etc/rc2.d, we don’t want it to be started at system bootup as heartbeat will handle it)

Now, the scripts. FB_fonulator runs fonulator in order to configure the FB and send the TDMoE traffic to the appropriate server. One important thing here is that, although this script will be run on both servers, it will only have an effect when run from asterisk01 as this is the server on the FB’s config interface:

#!/bin/sh

# Check who I am and who the other host is
THISHOST="`hostname|cut -d. -f1`"
# Plain "=" here: "==" inside [ ] is a bashism and fails on a strict /bin/sh
if [ "$THISHOST" = "asterisk00" ]
then
    OTHERHOST="asterisk01"
else
    OTHERHOST="asterisk00"
fi

# Bail out if there is no config file
F="/etc/redfone_$THISHOST.conf"
[ ! -f "$F" ] && exit 0
# Guess the appropriate interface card
export ETH=`grep -E "^card=" "$F" | cut -d= -f2 | cut -d, -f1`

case "$1" in
    start)
        # Point the FB at this host
        echo "Fonulating…"
        /usr/local/bin/fonulator -s -t 1 "/etc/redfone_$THISHOST.conf"
        ;;
    stop)
        # Hand the FB over to the other host
        /usr/local/bin/fonulator -s -t 1 "/etc/redfone_$OTHERHOST.conf"
        ;;
    restart|status)
        echo "Fonulator $1"
        exit 0
        ;;
esac
exit 0

FB_master creates the /tmp/.FB-master “flag” file we talked about before, and forces a sync both on the start (to make sure both servers have the same data) and on the stop (to sync back to the primary server any changes after a takeover-and-back):

#!/bin/sh

F=/tmp/.FB-master

case "$1" in
    start)
        touch "$F"
        # Activate log rotation
        ln -sf /etc/asterisk/asterisk.logrotate /etc/logrotate.d/asterisk
        # Force sync of these dirs
        csync2 -fr /var/
        csync2 -fr /etc/asterisk/
        csync2 -xv
        ;;
    stop)
        if [ -f "$F" ]
        then
            # De-activate log rotation
            rm -f /etc/logrotate.d/asterisk
            # Force a last minute sync to the new master
            csync2 -fr /var/
            csync2 -fr /etc/asterisk/
            csync2 -xv
            rm -f "$F"
        fi
        ;;
esac
exit 0

Finally, FB_asterisk starts the Asterisk service. We run Asterisk via daemontools using my scripts available here, so basically what this FB_asterisk script has to do is “svc -u/-d /service/asterisk”:

#!/bin/sh

case "$1" in
    start)
        echo "Starting Asterisk…"
        # Check if Asterisk is already running
        if /usr/sbin/asterisk -r -x "quit"
        then
            echo "Already running"
            exit 0
        fi
        # Just in case…
        rm -f /service/*
        # Link services and start them up
        ln -sf /etc/asterisk/services/asterisk/ /service/asterisk
        ln -sf /etc/asterisk/services/fopserver/ /service/fopserver
        svc -u /service/*
        ;;
    stop)
        echo "Stopping Asterisk …"
        svc -d /service/*
        rm -f /service/*
        ;;
    restart)
        echo "Restarting Asterisk …"
        svc -t /service/*
        ;;
    reload)
        echo "Reloading Asterisk …"
        /usr/sbin/asterisk -r -x "reload"
        ;;
    status)
        echo "Checking Asterisk’s status …"
        /usr/sbin/asterisk -r -x "quit" && exit 0 || exit 1
        ;;
esac
exit 0

Download

All the aforementioned scripts and config files are available here. Think of them as a base to make your own Asterisk/foneBRIDGE setup. And feel free to mail me back any improvements, errors you may find, etc.

Asterisk and daemontools

I’ve just released my daemontools “run” scripts for Asterisk. They are here: asterisk-daemontools [README]

The scripts let you configure, via variables in the “env” dir, the path to the Asterisk executable, the user and group to launch it with, and the startup options you want to pass it. As for running Asterisk as a given user, I’ve had problems with Asterisk 1.2 and the -U and -G options, so the scripts only use those options if you’re running Asterisk 1.4 and fall back to “su” otherwise.
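
To illustrate that version switch, here is a rough sketch of the decision. It is not the released script: asterisk_cmd and its arguments are made up for the example, and it only prints the command line that a “run” script could then exec.

```shell
#!/bin/sh
# asterisk_cmd (hypothetical helper, not the released script): print the
# command line a "run" script could exec, given the Asterisk version, the
# binary path, and the user and group to run as.
#   1.4.x  -> let Asterisk drop privileges itself with -U/-G
#   others -> fall back to su (which only switches the user)
# -f tells Asterisk not to fork, so it stays under supervise.
asterisk_cmd() {
    version=$1 bin=$2 user=$3 group=$4
    case "$version" in
        1.4*) echo "$bin -f -U $user -G $group" ;;
        *)    echo "su $user -c '$bin -f'" ;;
    esac
}

asterisk_cmd 1.4.21 /usr/sbin/asterisk asterisk asterisk
asterisk_cmd 1.2.30 /usr/sbin/asterisk asterisk asterisk
```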

There’s also a script for Flash Operator Panel’s “fopserver”.

I’m using these scripts on several Asterisk 1.2 and 1.4 servers with FreePBX.

Read the full article to see the “run” scripts.