Scalable mail cluster with free software

At my previous job I was responsible for the MTA of a group of companies, handling around 3,000 e-mail accounts spread over 20 domains. This MTA received around 150,000 mails daily, and over 95% of them were discarded or tagged because they were identified as SPAM or viruses (as of last year; I don’t know how this has evolved since I left). We used a homegrown cluster of seven servers, which let us scale as needed. And it was based on free software.

This is not a step-by-step installation guide with technical details and configuration files, but rather the story of how the service evolved: the problems we faced, how we solved them, and the design decisions we made in each case.

Migration

The first incarnation of the server dates from 2001, when we had to migrate the old server, which was starting to give lots of trouble, to more current software and hardware. I seem to remember it was a mail server from Netscape (!?) that stored the account information in an LDAP directory, but I can’t recall the exact name or version of the product. The server we chose for the migration was qmail-ldap, mainly because of the good reviews we had read about its stability, reliability and security, its ease of setup (personally I still think qmail is much simpler than, e.g., sendmail), and because it also used an LDAP directory. The latter may seem a silly reason, but in the end the migration had to be done in extremis, at a time when the original server wouldn’t even boot most of the time, and we got away with it using a simple ldapsearch and a little script that “translated” the LDAP schema of one server into that of the other. Over time the choice of qmail-ldap proved to be the right one, because its modular design allowed us to move progressively from a single-server deployment to the cluster I referred to in the introduction.
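The original script is long gone, but the idea of “translating” an ldapsearch dump can be sketched roughly as follows. This is only an illustration: every attribute and object class name below is a made-up placeholder, not the real Netscape or qmail-ldap schema, and the parser handles only plain LDIF (no folded lines, no base64 values).

```python
# Placeholder schema mapping: all names here are hypothetical examples.
RENAME = {"oldmailalias": "newMailAlias", "oldmailboxpath": "newMailboxPath"}
DROP = {"oldserveronlyflag"}
OBJECTCLASS_MAP = {"oldMailUser": "newMailUser"}


def parse_ldif(text):
    """Split plain LDIF text into entries, each a list of (attr, value)."""
    entries, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current entry.
            if current:
                entries.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # LDIF comment
        attr, _, value = line.partition(":")
        current.append((attr.strip(), value.strip()))
    if current:
        entries.append(current)
    return entries


def translate_entry(entry):
    """Rename, drop or rewrite attributes according to the mapping tables."""
    out = []
    for attr, value in entry:
        key = attr.lower()
        if key in DROP:
            continue
        if key == "objectclass":
            value = OBJECTCLASS_MAP.get(value, value)
        out.append((RENAME.get(key, attr), value))
    return out


def to_ldif(entries):
    """Serialize entries back to LDIF, one blank line between entries."""
    return "\n\n".join(
        "\n".join(f"{attr}: {value}" for attr, value in entry)
        for entry in entries
    ) + "\n"
```

The output of such a script would then be loaded into the new directory with the usual LDAP tools.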

This first server was a rack-mounted one, with redundant power supplies and hardware RAID5, so that all the data was safe (or so we thought back then). We also added qmail-scanner and the Kaspersky anti-virus (there was no ClamAV yet; we moved to it some years later). The same server held the SMTP, POP, IMAP and WebMail (SquirrelMail) services.

Active/Passive backup

We had to do the first architectural upgrade a couple of months after the migration: a RAID5 hiccup led to a corrupted filesystem which was quite difficult to fix. It became clear that the RAID disks and the redundant power supplies were not enough to ensure data integrity and service availability, so we installed another server exactly like the first one, and synchronized the configuration and mailboxes using rsync and cron jobs. Switching from the primary to the backup server was manual back then, using NAT at the router.

Over time the server was upgraded to new models several times, but we kept the active/passive backup structure. The synchronization between both servers was also improved, with DRBD for the mailboxes and csync2 for the configuration, AV databases, and so on. Master-backup monitoring and service switching were automated with heartbeat.

The SPAM flood, specialization by resources

Sometime around 2002-2003 viruses ceased being e-mail’s biggest problem: the increasing number of SPAM messages received every day was far worse. So we threw SpamAssassin into the mix. Over time this led to ever-increasing CPU and memory consumption, slowing the server to a crawl. At first it seemed that the only options were to migrate every year to a new, more powerful server (and what would we do with the old one then?), or to have multiple servers and split the domains among them in an attempt to distribute the load.

Finally we realized that we had two different kinds of resource needs, with different growth patterns:

  • HD space for the mailboxes: the number of mailboxes in our system was fairly stable and the vast majority of our users downloaded their e-mails using POP, so HD scalability wasn’t really that big of a problem for us. We could easily afford to upgrade the disks every few years, moving the service to the backup server while we upgraded the master.
  • CPU for the filtering: SPAM was growing at an exponential rate, so we basically needed to double the CPU power each year.

So, why not specialize our servers into storage servers and a filtering farm? We moved the SMTP service from the main servers to a front line of SMTP servers with the following characteristics:

  • they were off-the-shelf PCs and their configuration was practically identical (no variations apart from hostnames and IP addresses). We prepared a system image we could dump onto a new PC in a matter of minutes, in case one of the servers went down or we needed more raw CPU power because of an increase in SPAM.
  • we had a router load-balancing port 25 among all these servers.
  • all these SMTP servers were independent from the central ones, except for the final step of delivering the already analyzed mail to its destination mailbox: each server had a local copy of the LDAP directory (synchronized with slurpd), a copy of all the configuration files, the AV databases and the SpamAssassin Bayesian database (synchronized with csync2), and a DNS resolver/cache (dnscache).
  • they kept local logs, but also sent them to a centralized syslog server for easier analysis.
  • they didn’t store the mails locally for later delivery; in other words, they had no delivery queue. E-mails were analyzed on the fly during the SMTP session, and if one of them met certain anti-SPAM/AV criteria (blacklisted IP, a number of RBL hits, certain keywords, etc.) it was immediately rejected with an SMTP error and the connection was closed. If, on the other hand, the mail was let through (it was either legitimate, or marked as possible SPAM), it was sent to the central server on the spot, and the filtering server never gave the OK to the origin MTA until the mailbox server acknowledged the delivery. This is done quite simply with qmail by replacing the qmail-queue binary with the qmail-qmqpc one. This way we were able to guarantee that no mail would be lost if a filtering server crashed, as the origin MTA wouldn’t receive the OK from us and would retry the delivery after a couple of minutes.
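The “no local queue” behaviour described in the last point can be sketched in a few lines. This is not how qmail implements it internally; it is only a model of the decision flow, and the thresholds, the blacklist entry and the relay callable are assumptions made up for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    REJECT = "reject"
    TAG = "tag"      # let through, but marked as possible SPAM
    CLEAN = "clean"


@dataclass
class Message:
    client_ip: str
    rbl_hits: int
    spam_score: float


# Illustrative values only; the real criteria were richer than this.
BLACKLIST = {"192.0.2.66"}
MAX_RBL_HITS = 2
TAG_SCORE = 5.0
REJECT_SCORE = 15.0


def classify(msg):
    """Decide during the SMTP session what to do with a message."""
    if (msg.client_ip in BLACKLIST
            or msg.rbl_hits >= MAX_RBL_HITS
            or msg.spam_score >= REJECT_SCORE):
        return Verdict.REJECT
    if msg.spam_score >= TAG_SCORE:
        return Verdict.TAG
    return Verdict.CLEAN


def handle_message(msg, relay):
    """Return the SMTP reply to send to the origin MTA.

    `relay` is a callable that hands the message to the mailbox server and
    returns True only once that server has acknowledged the delivery.
    """
    verdict = classify(msg)
    if verdict is Verdict.REJECT:
        return "554 rejected by policy"
    try:
        delivered = relay(msg, tagged=(verdict is Verdict.TAG))
    except OSError:
        delivered = False
    if delivered:
        return "250 ok"
    # Nothing is stored locally: the sender keeps the mail and retries.
    return "451 temporary failure, try again later"
```

The key property is that "250 ok" is only ever returned after the mailbox server has the message, so a crash of the filtering server at any earlier point leaves the mail safely in the sender’s queue.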

Mailboxes, the POP and IMAP services, the LDAP master, webmail, and the remote queue remained on the central server. Most of them could have been moved to independent servers, but we never needed to.

Specialization by type of client

The next problem we faced came about 2-3 years ago, when image- and PDF-based SPAM became popular: we added a SpamAssassin plugin which re-composed animated GIF images and ran OCR on all image attachments. This extra analysis greatly increased our CPU needs (we had to go from 2 or 3 filtering servers to 5 in a couple of days), and even so there were times when a server was overloaded for some 5-10 minutes and an e-mail could take no less than 2 minutes to be processed, delivered and SMTP-OK’d. When this happened and the sending party was another MTA it was no big issue, as in the event of a timeout or disconnection the remote server would retry the delivery several times. However, if the sender was an end user with his MUA, a longer-than-usual delivery time or (God forbid) an error message from Outlook because of an occasional dropped connection led to a phone call to the IT team because “the mail wouldn’t work.” :-)

The solution was to split the SMTP and analysis farm in two: one for external mail and another for internal mail, for our users. The first farm was the one the DNS MX records pointed to, and had all the SPAM filtering options activated. The second one kept the domain name that end users had configured as the SMTP server in their MUAs, had all the heavyweight filters disabled, and required SMTP authentication (it wouldn’t accept non-authenticated sessions even for local domains). This way all external e-mail coming from remote MTAs went through all the filters, while our users went to the privileged servers, with somewhat lesser filtering capabilities (but enough for internal mail) and great response times.
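As a rough model of the split, the two farms can be thought of as two policies applied to the same software. The farm names, filter names and the rule that the external farm only accepts mail for local domains are assumptions for the sake of the example, not our actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FarmPolicy:
    name: str
    require_auth: bool    # refuse any session without SMTP authentication
    heavy_filters: bool   # OCR and other CPU-hungry analysis


# Hypothetical names for the two farms.
EXTERNAL = FarmPolicy("external-mx", require_auth=False, heavy_filters=True)
INTERNAL = FarmPolicy("internal-smtp", require_auth=True, heavy_filters=False)


def accept_session(policy, authenticated, recipient_is_local):
    """Decide whether a farm should accept a message at all."""
    if policy.require_auth:
        # Internal farm: unauthenticated sessions are refused,
        # even when the recipient is in a local domain.
        return authenticated
    # Assumption: the external farm only takes mail for local domains.
    return recipient_is_local


def filters_for(policy):
    """List the (illustrative) filter stages a farm runs."""
    filters = ["antivirus", "basic-spam-checks"]
    if policy.heavy_filters:
        filters += ["image-ocr", "pdf-analysis"]
    return filters
```

Keeping both farms on the same image and varying only a small policy like this is what made it cheap to add or move servers between them.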

The big picture

The lubricant of the future

Sometimes I wonder whether it’s worth breaking our backs fighting SPAM: it’s a long-distance race to see who can outlast whom, and every now and then, as happens with “legitimate” advertising, the creatives put in real effort and come up with an ad that gets a smile out of you (a big tear in this case).

I swear by Snoopy that this isn’t some stunt of mine; it just arrived in my GMail inbox. Click on the Terminator to see what I mean. Not suitable for minors, sensitive souls, or anyone who doesn’t want to see a film legend shaken.

Anti-SPAM signatures for ClamAV

A couple of days ago I stumbled upon the SaneSecurity set of ClamAV signatures, which detect a lot of SPAM (mainly the latest batch of GIF and PDF SPAM) and phishing mails. They’re similar to the MSRBL signatures, only better, judging by the results we’re getting. Or to put it another way, one is the perfect complement to the other. :)

By using these two ClamAV signature sets together with some other techniques (SpamAssassin, DNS, RBL…), at work we’re stopping around 80% of all the mail we get, 100,000-120,000 messages daily, with a very low false-positive rate: 2-3 per week at most. And these figures include all the internal mail too, which is not supposed to be SPAM, so I’m sure the real SPAM blocking ratio (external mail only) in our system is way above 90%. One of these days I’ll do the math.
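Until I get real numbers, the math itself is simple enough to sketch. Assuming that no internal mail is ever blocked, the external blocking ratio is the overall ratio divided by the external share of the traffic. The internal shares tried below are guesses, not measurements.

```python
def external_block_ratio(overall_ratio, internal_share):
    """Blocking ratio over external mail only.

    Assumes internal mail is never blocked, so every blocked message
    is external: blocked / external = overall / (1 - internal_share).
    """
    if not 0 <= internal_share < 1:
        raise ValueError("internal_share must be in [0, 1)")
    return overall_ratio / (1 - internal_share)


def internal_share_needed(overall_ratio, target_ratio):
    """Internal share of traffic needed to reach a given external ratio."""
    return 1 - overall_ratio / target_ratio


if __name__ == "__main__":
    # Hypothetical internal shares, with the 80% overall figure from above.
    for share in (0.05, 0.10, 0.15):
        ratio = external_block_ratio(0.80, share)
        print(f"internal share {share:.0%} -> external ratio {ratio:.1%}")
```

With an overall ratio of 80%, the external ratio only exceeds 90% if internal mail makes up more than about 11% of the total, so the claim stands or falls on that share.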

I’ve been thinking for some time about writing an article about the different anti-SPAM techniques we use here. I’ll see if I can get some free time to do it…

Riding the worms

“You must ride the sand in the light of day,
that Shai-hulud shall see and know you have no fear.”

Dune, by Frank Herbert

“If it’s not on Google, it doesn’t exist.” This categorical statement holds for online shops and corporate websites as much as for our personal blog. When we need to find information on the Internet, we go to Google. And by Google I also mean Yahoo, MSN, or any other search engine. We have to be there.

These search engines use “bots” or “spiders” to index the content of our pages: programs that periodically crawl all the sites they already know, looking for updates and new links through which to discover, process and index more and more pages.

Nobody doubts that the work of these programs is beneficial, but we usually overlook the extra traffic they generate on our website. Believe it or not, I know sites where the traffic from GoogleBot and company (mind you, I mean the bot itself, not visits referred by the search engine) consumed up to a third of the total bandwidth. We are talking about GIGABYTES of traffic per day.

Search engines also penalize duplicated information: if we have several pages with identical or very similar content, or even worse, if the same page can be loaded through several different URLs, we can get unpleasant surprises, such as pages that do not appear in search results in favor of a feed or a summary (section index, category, etc.) with similar content, or pages with a low PageRank because it gets “diluted” among several URLs.

That is why it is important to learn how these bots work, so that we know how to optimize their pass through our website and how to “lead them by the hand” to the information we want to prioritize, improving our ranking in the results while minimizing, where possible, the amount of data transferred so as not to saturate our connection and servers.
Continue reading

http:BL from Project Honey Pot

The other day I came across the page of the http:BL service from Project Honey Pot. The idea of Honey Nets and Honey Pots is not new: create artificial services, pages, addresses or entire networks to “fool” the bots that crawl the net looking for addresses or forums to send SPAM to, and so be able to identify and block them. The nice thing about this new project is that it is distributed and collaborative: anyone can set up a page that sends the data of the bots it detects to the project’s database, and in turn anyone can query that database to check whether a given IP visiting our page is suspicious, in the style of the “blacklists” or RBLs used by e-mail servers.
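Like an RBL, http:BL is queried over DNS. The sketch below follows the query and answer format as I understand it from the project’s documentation (access key plus reversed IP under dnsbl.httpbl.org, answer of the form 127.days.threat.type); treat those details as assumptions and check them against the official API description. The access key in the example is a placeholder.

```python
import ipaddress
import socket

HTTPBL_ZONE = "dnsbl.httpbl.org"
# Visitor type is a bitmask; 0 is reserved for known search engines.
VISITOR_TYPES = {1: "suspicious", 2: "harvester", 4: "comment spammer"}


def build_query(access_key, visitor_ip):
    """Build the DNS name to look up for an IPv4 visitor."""
    ip = ipaddress.IPv4Address(visitor_ip)  # raises ValueError if invalid
    reversed_octets = ".".join(reversed(str(ip).split(".")))
    return f"{access_key}.{reversed_octets}.{HTTPBL_ZONE}"


def parse_answer(answer):
    """Decode an answer such as '127.3.5.1' into its fields."""
    first, days, threat, type_bits = (int(p) for p in answer.split("."))
    if first != 127:
        raise ValueError(f"unexpected answer: {answer}")
    if type_bits == 0:
        # For search engines the other octets carry a different meaning.
        kinds = ["search engine"]
    else:
        kinds = [name for bit, name in VISITOR_TYPES.items()
                 if type_bits & bit]
    return {"days_since_last_activity": days,
            "threat_score": threat,
            "kinds": kinds}


def lookup(access_key, visitor_ip):
    """Return the decoded answer, or None if the IP is not listed."""
    try:
        answer = socket.gethostbyname(build_query(access_key, visitor_ip))
    except socket.gaierror:
        return None  # no record: the IP is unknown to the service
    return parse_answer(answer)
```

A page could call lookup("yourkeyhere", visitor_ip) before rendering a comment form and refuse or challenge visitors whose threat score is above some threshold of our choosing.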

Continue reading