Depurando problemas con nginx

  • english
  • spanish

Last week I’ve been debugging a problem I had with this site’s nginx server: from time to time it hanged and I had to restart the process. Some time ago I wrote a little script that checked if it was running OK and restarted it otherwise, but anyway that wasn’t a real solution.

So I spent some days really looking into it and asking for support and reporting my findings to the nginx mailing list. One useful tip I got there was enabling the “debug” mode on the error log, which shows full traces of the processes (including their PID) as they’re processing the request, the rewrites, upstreams, etc.

error_log /var/log/nginx/$host-error.log debug;

With this extended log and the PID of the process malfunctioning, it’s quite easy finding out what that process was doing right before hanging. In order to find out the PID of the hanged processes, I extended my check-reboot script to log some generic system metrics right before restarting nginx: netstat -nap (which shows the PID), ps, vmstat, etc.

#!/bin/sh

TIMEOUT=20
CHECK=http://localhost/wp-admin/
LOG=/var/log/checkWeb/checkWeb-$(date +%Y%m%d).log
LOGR=/var/log/checkWeb/restart-$(date +%Y%m%d).log
TMP=/tmp/checkWeb-$RANDOM

if ! wget -t 1 -o /dev/null -O /dev/nul -T $TIMEOUT $CHECK
then
echo "ERROR, restarting nginx"
echo "** RESTARTING **" >> $TMP
date >> $TMP
echo "- CLOSE_WAIT:" >> $TMP
netstat -nap | grep -c CLOSE_WAIT >> $TMP
echo "- vmstat" >> $TMP
vmstat 1 5 >> $TMP
echo "- free" >> $TMP
free >> $TMP
echo "- ps" >> $TMP
ps aux >> $TMP
echo "- netstat" >> $TMP
netstat -nap >> $TMP
echo "" >> $TMP
echo "" >> $TMP

#       pkill -9 -f php-cgi
pkill -9 -f nginx
sleep 1s
/etc/init.d/nginx start

cat $TMP
cat $TMP >> $LOG
date >> $LOGR
fi

rm -rf $TMP

This way, each time localhost/wp-admin was unresponsive (I was debugging a WP site), besides restarting nginx I was getting a lot of system info. With time I got to realize that nginx processes were not actually hanging, but some of their sockets got on the CLOSE_WAIT state forever until the process was restarted. Looking for the PID of those processes according to netstat on the error log, the last request they were processing before getting to the CLOSE_WAIT state was always the same: on my blog I have some examples of how running servers with daemontools; daemontools uses named pipes (FIFOs), which can become kind of black holes if there’s no process feeding them; when nginx hit one of these FIFOs, it hanged.

Funny thing is that I never had this problem with either Apache nor lighttpd. But anyway the problem is not nginx but those FIFOs which shouldn’t really be there. I removed them and have had no hanged processes in five days, while before this nginx was restarting 3-4 times a day.

Probando nginx

Estoy cacharreando con nginx. El nuevo servidor virtual donde tengo el dominio está un poco justo, con la migración lo dejé con Apache en vez de volver a lighttpd y ahora estoy probando alternativas.

Como aparte de configurar el servidor y el PHP con fast-cgi hay que convertir todos los rewrites del Apache (wpmu y super-cache) al formato de nginx, es posible que el blog haga cosas raras, no se actualice debidamente, algunas páginas no vayan bien, pete, no acepte comentarios, o cobre consciencia de sí mismo y decida matar a todos los humanos. YMMV.

cut | sort | uniq: Apache logs

  • english
  • spanish

Many times the reason because a web server is slow and unresponsive is that it’s under “attack”, on purpose or not, by a bot. I’ve seen cases where Google Bot, bots from research engines from universities or some other kind of indexer were responsible for more than half the traffic of a site. These cases are not real DoS attacks, this traffic can be considered legitimate, but the result is that it brings the service down. You can instruct some of these bots not to visit your site so often, like Google Bot using the Google Webmaster Tools and the sitemaps and/or robots.txt files, but usually you can’t and have to consider filtering all this traffic at the firewall. But in any case, the first step is realizing that a single IP (or a couple of them) is responsible for most of your traffic, identifying this IP and using whois learn who it belongs to.

You can run something like this to list the top five IP addresses on your Apache’s access.log:

cut -d" " -f 1 access.log | sort | uniq -c | sort -nr | head -n 5

Scripts daemontools para lighttpd y PHP

  • english
  • spanish

I’ve prepared a set of daemontools scripts to launch and monitor lighttpd and its PHP processes spawned with spawn-fcgi. Here is the README, a tar file with the scripts, and here you can browse the directories with all the scripts.

PS: yes, I like daemontools. It helps me achieving high availability with many services, keeping them up even when a server misbehaves and some process dies. This avoids a lot of late night calls. It’s a great invention. :)

PHP con lighttpd 1.5.0

  • english
  • spanish

PHP integration and configuration in lighttpd 1.5.0 has changed: mod_fastcgi isn’t used any more, you need mod_proxy_backend_fastcgi instead; and lighty won’t launch the PHP processes, you’ll have to start them using the spawn-fcgi program.

In order to setup mod_proxycore for use with PHP, this is the bare minimum configuration (put it in lighttpd.conf, or conf-enabled/php.conf):

server.modules += ( "mod_proxy_core", "mod_proxy_backend_fastcgi" )

$PHYSICAL["existing-path"] =~ ".php$" {
proxy-core.allow-x-sendfile = "enable"
proxy-core.protocol = "fastcgi"
proxy-core.backends = ( "unix:/tmp/php-fastcgi.sock" )
proxy-core.max-pool-size = 16
}

And for the PHP fast-cgi processes, just run or prepare an init.d script that runs the following command:

/usr/bin/spawn-fcgi -s /tmp/php-fastcgi.sock -f /usr/bin/php-cgi -u www-data -g www-data -C 5 -P /var/run/spawn-fcgi.pid

lighttpd 1.5.0-SVN r1992 para Debian Sarge

  • english
  • spanish

I’ve built .deb packages of lighttpd 1.5.0-SVN r1992 for Debian Sarge. They’re based off the latest packages in testing, upgraded to 1.5.0. The only thing missing is mod_mysql_vhost, as I don’t have mySQL 5.0 installed at the moment. This server already runs lighty 1.5.0, so the fact that you’re reading this page is the best proof that it works. ;)

The packages are available for download here.

(PS: I know, I know, what I should do is upgrade to Etch altogether…)

UPDATE (20070921): new release with linux-aio-sendfile support. You’ll need the libaio port too.

CakePHP y lighttpd

  • english
  • spanish

I’ve been busy the last couple of days with a web project I’m developing using CakePHP, and I wanted to use lighttpd as the web server. CakePHP comes with the typical .htaccess with Apache’s mod_rewrite rules, that need to be converted to lighty’s format. The solution can be found here, 3rd comment:

Lighttpd and CakePHP setup in subdirectories

url.rewrite-once = (
"/(.*)\.(.*)" => "$0",
"/(css|files|img|js)/" => "$0",
"^/([^.]+)$" => "/index.php?url=$1"
)

Arquitecturas escalables en Google

  • english
  • spanish

A coworker has sent me three interesting articles from High Scalability, a site I still didn’t knew but which I’ve already added to my Google Reader list. :) The articles talk about the design and computer/network architecture decisions taken at YouTube, Google and GTalk in order to handle the big load their services face. They also comment the current architecture in each site and their evolution over time:

Some lessons to learn from these articles:

  • Don’t try to fix everything with one single architecture or tool. Divide the problem, see if each sub-problem is CPU-, bandwidth- or IO-bound, and optimize it. Specialize server for each task and coordinate their work.
  • Cache content whenever possible. Pre-generate content whenever possible. Make good use of HTTP’s cache-control directives. Use squid as a reverse proxy to leverage your application servers’ load.
  • Think about externalizing some things, like hosting images or videos off-site. These elements may need more bandwidth that you currently have, and moving them off-site can be a good idea, even if it’s just a temporary measure while you manage to get more bandwidth. The service must run at all times.
  • Simplicity. Will let you make changes and evolve your architecture without screwing up.
  • Commodity-PC based clusters. They maximize the power/price ratio. Have a redundancy system in place so that when one node goes down or needs maintainence, the system keeps working without it. Have a system to easily install/change a node, also without affecting the service. And start planning the power and cooling problems ahead. ;)
  • Programming today is much about libraries and frameworks. Don’t reinvent the wheel. Use a common framework in all your developments, homegrown or not. This way novel programmers will be able to start writting code faster, will be able to switch projects easily, won’t have to code the same things over and over again, and a system upgrade will benefit all your applications.
  • Think about the architecture you’ll need from the start. I’m sadly used to developers not caring about what their code runs on, or if their code will lead to CPU, IO or bandwidth problems. Google seems to face every new development looking at the architecture they’ll need to handle the service, and then develop the code arount that architecture. This is what settles Google appart from the rest.

Cabalgando los gusanos

“Debes cabalgar por la arena a la luz del día,
para que Shai-hulud vea y sepa que no tienes miedo.”

Dune, de Frank Herbert

“Si no está en Google, no existe”. Esta frase tan categórica es cierta tanto para comercios on-line o webs corporativas, como para nuestro blog personal. Cuando necesitamos localizar información en Internet, vamos a Google. Y quien dice Google, dice Yahoo, MSN, o cualquier otro buscador. Tenemos que estar ahí.

Éstos buscadores usan “bots” o “spiders” para indexar el contenido de nuestras páginas, programas que periódicamente recorren todos los sitios que ya conocen en busca de actualizaciones y nuevos enlaces a través de los cuales descubrir, procesar e indexar más y más páginas

A nadie se le escapa que el trabajo de éstos programas es beneficioso, pero normalmente no tenemos en cuenta que generan tráfico extra a nuestra web. Aunque parezca mentira, conozco sitios en los que el tráfico de GoogleBot y compañía (ojo, hablo del propio bot, no de visitas dirigidas desde el buscador) consumía hasta un tercio del ancho de banda total de los accesos. Estamos hablando de GIGAS de tráfico al día.

Además los buscadores penalizan la información repetida: si tenemos varias páginas con contenido igual o muy similar, o aún peor, si podemos cargar una misma página con varias URLs distintas, podemos llevarnos sorpresas desagradables como páginas que no aparecen en los resultados de una búsqueda en favor de un feed o un resumen (índice de sección, categoría, etc.) con contenido similar, o páginas con un pagerank bajo porque éste se “diluye” entre varias URLs.

Por ello es importante aprender cómo funcionan éstos bots para saber cómo optimizar su paso por nuestro sitio web, cómo “llevarlos de la mano” hasta la información que queremos priorizar para así mejorar nuestro posicionamiento en los resultados, minimizando a su vez cuando sea posible la cantidad de información transmitida para no saturar nuestra conexión y servidores.
Continue reading