Thursday, December 07, 2006

Agile Development -or- How to name a religion

Upon reading the normal set of useless arguments about Agile development (I'm not implying a side here actually, just saying the arguments just seem to go on and on), I've noticed how actually brilliant the naming of "Agile Development" really is.

It seems like it just makes sense, right? I mean, "Our development cycle is nimble, quick, and Agile". But the real brilliance is *not* in the positive. It's really more brilliant in the negative.

What I mean is simply this. If someone asks you if you're doing Agile, and you aren't - you are now "not agile". I mean - your 4-person dev team could turn on a dime but follow none of the agile development principles, and YOU are the one termed "not agile" over some large team that has a standup meeting once a day.

This is really smart. If you're not with us, you're "not us". This can be extrapolated if you were deciding what to name a new religion. For example, a good candidate would be to name a new religion "Church of the Good Persons" where its members are indeed "good persons".

Then of course, if you are not in that religion you are implicitly "not a good person".

"Fred, I'd like you to meet Bob. Bob isn't a good person, but I'm trying to convince him to become a good person. Isn't that right Bob?"
"I, um, er...."

Fariggin awesome.

You're either with us or you're against us. And your development team is indeed either agile or, ya know, it's not.

Wednesday, December 06, 2006

The Architecture of Mailinator

Almost 3.5 years ago I started the Mailinator(tm) service. I got the bulk of the idea from my drunk roommate at the time and the first incarnation took me all of about 3 days to code up. In some senses it was a crazy idea. I imagine many people came up with a similar idea before - make a web-based email service that allowed any incoming email to create an inbox. No sign-up. No personal information. Send email first, check email later.



This became ridiculously handy for things like signing up for websites that send you one confirmation email, then save or sell or spam your email address forever. And of course, it *is* very handy for users. But think about it from Mailinator's side. It's basically signing up to receive spam for that address forever. That's a tall order and one that seems to have the possibility of a terrible demise. Someday, enough email could come in that will simply smush Mailinator. But, as of this writing, that day isn't today.

I have in that 3.5 years received hundreds of "thank you" emails, a pile of "it doesn't work" emails, a radio interview, articles in the Washington Post, New York Times, and Delta Skymiles magazine, 1 call each from Scotland Yard and the LAPD, and a total of 4 subpoenas (1 of those being a Federal Grand Jury subpoena issued by the FBI).

At this point, Mailinator averages approximately 2.5 million emails per day. I have seen hourly spikes that would result in about 5 million in a day. In addition, the system also services several thousand web users and several thousand RSS users per day.


In the world of email services, this probably isn't all that much. The most interesting part to me is that the complete set of hardware that Mailinator uses is one little server. Just one. A very modest machine with an AMD 2GHz Athlon processor, 1G of RAM (although it really doesn't need that much), and a boring (IDE, low-performance) 80G hard drive. And honestly, it's really not very busy at all. I've read the blogs of some copycat services of Mailinator where their owners were upgrading their servers to some big iron. This was really the impetus for me writing down this document - to share a different point of view.

Mailinator easily handling a few million emails a day wasn't always the case. The initial Mailinator system was quite busy and, in fact, got overwhelmed about a year ago when email traffic started topping 800,000 a day (that's my recollection anyway). In an effort to squeeze life out of the server and as an exercise in putting together some principles I always championed about server development, I rewrote the system from scratch. I have no idea what the current limit of the existing system is, but at 2.5 million a day, it's not even breaking a sweat.

If you don't know what Mailinator is, take a small tour through the (rather funny) FAQ.

Lossy lossie lossee

There is a very important point to note about the Mailinator service. And that is, that it is indeed - free. Although it might not seem like it, it has an immense impact on the design (as you'll see). This allowed me to favor performance across the board of the design. This fact influenced decisions from how I dealt with detecting spam all the way down to how I synchronized some code blocks. No kidding.

The basic tenet is that I do not have to provide perfect service. In order to do that, my hardware requirements would be much higher. Now that would all be fine and dandy if people were paying for the service. I could then provide support and guarantees. But given it's free I instead went for, in order, these two design decisions:

1) Design a system that values survival above all else, even users (as of course, if it's down, users aren't really getting much out of it)
2) Provide 99.99% uptime and accuracy for users.

If you wonder what I mean about "survival" in the first line, it basically means that Mailinator is attacked on literally a daily basis. I wanted to make a system that could survive the large majority of those attacks. Note - I'm not interested in it surviving all of them. Because again, if some zombie network decided to denial-of-service me - I really have no chance of thwarting it without some serious hardware. The good news is that if someone goes to all the trouble of smashing Mailinator (again referencing the fact we're lossy), I really don't lose much sleep over it. It sucks for my users - but there really isn't anything I can do anyway. I'm not trying to be cavalier about this - I went to great lengths to handle attacks, I'm just saying it's a cold reality that I simply cannot stop them all. Thus I accept them as part of the game.

The platform

The original Mailinator used a relatively standard unix stack of applications including a Java-based web application running in Tomcat. Mailinator is and was, of course, always just a hobby. I had a day job (or 2) so months of development just was never an option. I chose Java for no other reason than I knew Java better than anything else. For email, it used sendmail with a special rule that directed any incoming email to mailinator.com to one single mailbox.

Sendmail --> disk --> Mailinator <-- Tomcat Servlet Engine

The Java based mailinator app then grabbed the emails using IMAP and/or POP (it changed over time) and deleted them. I should have used an mbox interface but I never got around to implementing that. The system then loaded all emails into memory and let them sit there. Mailinator only allowed about 20000 emails to reside in memory at once. So when a new one came, the oldest one got pushed out.
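The "oldest one gets pushed out" behavior can be sketched in a few lines of Java. This is only an illustration under my own assumptions, not the original Mailinator code: the class name EmailPool, the use of plain strings for emails, and the method names are all made up for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: a fixed-capacity, in-memory email pool.
// When the pool is full, adding a new email evicts the oldest one.
final class EmailPool {
    private final int capacity;
    private final Deque<String> emails = new ArrayDeque<>();

    EmailPool(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    // Stores the email and returns the evicted oldest email, or null if nothing was evicted.
    synchronized String add(String email) {
        String evicted = null;
        if (emails.size() == capacity) {
            evicted = emails.pollFirst(); // the oldest email sits at the head of the queue
        }
        emails.addLast(email);
        return evicted;
    }

    synchronized int size() {
        return emails.size();
    }
}
```

With a capacity of 20000, how long any one email survives depends purely on how fast new ones arrive.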

The FAQ advertises that emails stick around for "a couple of hours." And that was true, but exactly how long depended on the rate of incoming emails. You'll also note an interesting side effect: since all emails lived in memory, if the server came down - all emails were lost! Talk about exploiting the fact that my service was free, huh? This may seem dubious but the code was really quite stable and ran for weeks and months without downtime.

I thought about saving emails into a database of course but honestly, all this bought me was emails that stuck around longer. And that, in and of itself, sort of went against my intent for Mailinator. The idea was: sign up for something, go to Mailinator, click the link, and forget about it. If you want a mailbox where emails last a few days, that's fine, but there are many other alternatives out there - that's not what Mailinator is about. I forgot the database idea and now shoot for mails that last somewhere around 3-4 hours.

This all worked fabulously for a while. It pretty much filled up all 1G of RAM of the server. Finally when the incoming email rate started surpassing 800,000 a day, the system started to break down. I believe it was primarily the disk contention between unix mail apps and the Java app locking mailboxes. Regardless, there were many issues with that system that bugged me for a long time. The root of most of those problems really boiled down to one thing - the disk. The disk activity of sendmail, procmail, logging and whatever else was a silly bottleneck. And it needed to go.

More than a year ago now I did a full rewrite. Much of the anti-spam code that I'll describe later was already in this code-base but was improved and extended for the new system.

Synchronous vs. Asynchronous I/O

I've read a fair number of articles on the wonders of asynchronous I/O (Java's NIO library). I don't doubt them but I decided against using it. Primarily, again, because I had done a great deal of work in multithreaded environments and knew that area well. I figured if I had performance issues later, I could always switch over to NIO as a learning experience.

The biggest thing I knew I needed to do with Mailinator was to remove the unix application components. Mailinator needed to stop outsourcing its email receipt and do it itself. This basically meant I needed to write my own SMTP server. Or at least, a subset of one. Firstly, Mailinator has never had the ability to send email so I didn't need to code that part up. Second, I had really different needs for receiving email. I wanted to get it as fast as possible -or- refuse it as fast as possible.

SMTP has a rich dialog for errors but I chose to only support one error message. And that error is, appropriately enough - "User Unknown". That's a touch ironic since Mailinator accepts any user at all. Simply said, if you do anything that the Mailinator server doesn't like - you'll get a user unknown error. Even if you haven't sent it the username yet.

I looked at Apache James as a base, which is a pure Java SMTP server, but it was way too comprehensive for my needs. I really just found some code examples and the SMTP specs and wrote things basically from scratch. From there, I was able to get an email, parse it, and put it right into memory. This bypassed the old system's step of writing it to disk altogether. From wire to user, Mailinator mail never touches the disk. In fact, the Mailinator server's disk is pretty darn idle all things considered.

Now to address persistence concerns right away - Mailinator doesn't run diskless, but it does run very asynchronously with regards to the disk. Emails are not written to disk EVER unless the system is coming down and is instructed to write them first (so it can reload them upon reboot). This little fact has been very handy when I've been subpoenaed. I simply do not have access to any emails that were sent to Mailinator in the past. If it is possible that I can get an email - so can you just by checking that inbox. If you can't get it then that means it's long deleted from memory and nothing is going to get it back.

Mailinator also used to do logging (again, shut off because of pesky subpoenas). But it did it very "batchy". It wrote several thousand log lines to memory before doing one disk write. In effect we never want to have contention based on the incredibly slow disk.
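A rough Java sketch of that kind of batching follows. It is an illustration only, not the real logging code: the class name BatchedLog, the batchSize parameter, and writing through a generic Writer are my assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: buffer log lines in memory and write them out in one go.
final class BatchedLog {
    private final Writer out;
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();

    BatchedLog(Writer out, int batchSize) {
        this.out = out;
        this.batchSize = batchSize;
    }

    // Queues a line; only every batchSize-th call actually touches the writer.
    synchronized void log(String line) {
        buffer.add(line);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Writes all buffered lines as a single write, then clears the buffer.
    synchronized void flush() {
        StringBuilder batch = new StringBuilder();
        for (String line : buffer) {
            batch.append(line).append('\n');
        }
        try {
            out.write(batch.toString());
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        buffer.clear();
    }
}
```

The trade-off is the same lossy one as everywhere else here: if the process dies, whatever is still sitting in the buffer is gone.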

Now if this all sounds a bit shaky, as in we might just lose an email now and then - you're right. But remember, our goal is 99.99% accuracy. Not 100%. That's an important distinction. The latest incarnation of Mailinator literally runs for months unattended. We do lose emails once in a while - but it's rare and usually involves a server crash. We accept the loss and by far most users never encounter it.

Emails

The system now is one unit. The web application, the email server, and all email storage run in one JVM.


The system uses under 300 threads. I can increase that number but haven't seen a need as of yet. When an email arrives (or attempts to arrive) it must pass a strong set of filters that are described below. If it gets past those filters it is then stored in memory - however, it is first compressed to save in-memory space. Over 99% of emails that arrive are never looked at, so we only ever decompress an email if someone actually "looks" at them.
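The compress-on-arrival, decompress-on-read idea might look roughly like this in Java. This is a sketch under assumptions: the post doesn't say which compression scheme is used, so GZIP from java.util.zip is my pick, and the class name CompressedEmail is made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: keep only the compressed bytes in memory,
// and pay the decompression cost only if someone actually reads the email.
final class CompressedEmail {
    private final byte[] compressed;

    CompressedEmail(String raw) {
        this.compressed = gzip(raw.getBytes(StandardCharsets.UTF_8));
    }

    private static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Decompresses on demand; the result is not cached.
    String read() {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream plain = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                plain.write(chunk, 0, n);
            }
            return new String(plain.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int storedBytes() {
        return compressed.length;
    }
}
```

Since the vast majority of emails are never opened, almost all of them only ever pay the (cheap, one-time) compression cost.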

Because of this, I am able to store many more emails than the original system's 20000. The current Mailinator stores about 80000 emails and uses under 300M of RAM. I probably should increase this number as plenty of RAM is just sitting around. The average email lifespan is about 3-4 hours with this pool. The amount of incoming email has gone way up, so even by increasing this pool, we're largely staying steady as far as email lifespan. I could probably kick that up to 200,000 or so and increase the lifespan accordingly but I haven't seen a great need yet.

Another inherent limit that the system imposes is on mailboxes themselves. Popular mailboxes such as joe@mailinator.com and bob@mailinator.com get much more email than average. Every inbox is limited to only 10 emails. Thus, popular boxes inherently limit themselves on the amount of email they can occupy in the pool. Use of popular inboxes is discouraged anyway, and they generally become the creme de la cesspool of spam.

Two more memory-conserving measures are that no incoming email can be over 100k and all attachments are immediately discarded. That latter feature was in years ago but obviously really ruins this whole new wave of image spam (if you see a few seemingly "empty" emails in some popular boxes, they might have been image spam that got their images thrown away).

Spam and Survival

I'd like to emphasize here that Mailinator's mission is NOT to filter spam. If you want penis enlargement or sheep-of-the-month club emails, that's pretty much what Mailinator is good for. We are clear in the FAQ. Mailinator provides pretty good anonymity - but we do NOT guarantee it. We also do NOT guarantee ANY privacy. It's really easier that way for us. Still, it does a pretty damn good job even so. We might log you (used to and it might get turned on again someday, never know) and we DO respond to subpoenas (that whole "jail" thing is a strong motivator).

So, in essence I have no real interest in filtering out spam. I do however, have a great deal of interest in keeping Mailinator alive. And spammers have this nasty habit of sending Mailinator so much crap that this can be an issue. So - Mailinator has a simple rule. If you do anything (spammer or not) that starts affecting the system - your emails will be refused and you may be locked out.

In the new system I created a data structure I call an AgingHashmap. It is, as the name indicates, a hashmap (String->int) whose elements "age".

The first type of spammer I encountered was one machine blasting me with thousands of emails. So, now, every time an email arrives, its sender's IP is put into an AgingHashmap with a counter of 1. If that IP does not send us any more email for (let's say) a minute, then that entry automatically leaves the AgingHashmap. But, let's say that IP address sends us another email 2 seconds later. We then find the first entry in the AgingHashmap and increase that counter to 2. If we see another email from that IP, it goes to 3 and so on. Eventually, when that counter reaches some threshold we ban all emails from that IP for some amount of time.

We can put this in words as so (values are examples):
If any IP address sends us 20 emails in 2 minutes, we will ban all email from that IP address for 5 minutes. Or more precisely, we will ban all email from that IP until it stops trying to send to us for at least 5 minutes.

This is really what the AgingHashmap is good for. We can set up some parameters and detect frequency of some input, then cause a ban on that input. If some IP address sends us email every second for 100 days straight, we'll ban (or throw away) every last email after the first 20.
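Here is one way such a structure could be sketched in Java. To be clear, this is my own simplified reading of the description above and not the real AgingHashmap: the class name AgingCounter, the injectable clock, and the use of a single "quiet period" for both aging out an entry and lifting a ban are all assumptions (the example numbers above use separate 2-minute and 5-minute windows).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical sketch of an "aging" counter map; not Mailinator's actual AgingHashmap.
final class AgingCounter {
    private static final class Entry {
        int count;
        long lastSeenMillis;
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final long quietMillis;   // an entry ages out after this much silence
    private final int threshold;      // hits allowed before the key counts as banned
    private final LongSupplier clock; // injectable time source, e.g. System::currentTimeMillis

    AgingCounter(long quietMillis, int threshold, LongSupplier clock) {
        this.quietMillis = quietMillis;
        this.threshold = threshold;
        this.clock = clock;
    }

    // Records one hit for the key and returns true if the key is over the threshold.
    synchronized boolean hit(String key) {
        long now = clock.getAsLong();
        Entry entry = entries.get(key);
        if (entry == null || now - entry.lastSeenMillis > quietMillis) {
            entry = new Entry();        // first sighting, or the old entry aged out
            entries.put(key, entry);
        }
        entry.count++;
        entry.lastSeenMillis = now;     // every attempt, banned or not, resets the aging clock
        return entry.count > threshold;
    }

    // Drops entries that have been quiet long enough; call periodically to bound memory.
    synchronized void purge() {
        long now = clock.getAsLong();
        entries.values().removeIf(e -> now - e.lastSeenMillis > quietMillis);
    }

    synchronized int size() {
        return entries.size();
    }
}
```

Because every attempt refreshes the timestamp, a sender that keeps hammering away stays banned until it actually goes quiet, which matches the "until it stops trying" wording. The key is just a string, so the same structure works for IP addresses or anything else you want to rate-limit.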

Here's a graph of an average 24 hours of banned IP address emails. Notice at 10am and 11am some joker (i.e., some single IP address) sent us over 19000 emails per hour.



I do have some code that has Java talk back to unix's iptables system to do very hard blocking of IP addresses but it's not on right now. Partially because there's no need (yet) and partially because I like to see the stats.

The funny part of this is the error Mailinator gives. Remember the "User Unknown"? Once an IP address is banned and then it tries to open a new connection it will send the SMTP greeting of "HELO". Mailinator will then reply "User Unknown" and close the connection. Of course, it didn't even get the username yet.

Zombies

The next problem came from zombie networks. Now we were getting spam from thousands of different IPs all sending the same message. We could no longer key in on IP address. As a layer of defense past IP we created an AgingHashmap based on subjects. That is, if we get (again, example numbers) something like 20 emails with the same subject within 2 minutes, all emails with that subject are then banned for 1 hour.

Here's a similar graph. Keep in mind these emails got past the IP filter - so basically they are "same subject" emails from many disparate sources.



You could argue we should ban them forever, but then we'd have to keep track of them and the Mailinator system is inherently transient. Forgetting is core to what it does. This blocking is more expensive than IPs as comparing subjects can be costly. And of course, we have to have enough of a conversation with the sending server to actually get the subject.

Pottymouth!

Finally, we ran into some issues on emails that just weren't cool. As I said, I'm far more interested in keeping Mailinator alive than blocking out your favorite porn newsletter. But, some unhappy people used Mailinator for some really not happy purposes. Simply put, as a last layer, subjects are searched for words that indicate hate or crimes or just downright nastiness.

Boing

Another major influx that happened early on was a plethora of bounce messages. Now that's sort of odd, isn't it? I mean Mailinator doesn't send email. In fact, it CAN'T send email so how could it get bounce messages? Well, some spammy type folks thought it'd be neat to send out spam from their servers using forged Mailinator addresses as a return address. Thus when those emails bounced, the bounce came here.

What's worse is I still get email from people who think Mailinator sent them spam. It's very frustrating to defend myself against people who are ignorant of how email works and ready to crucify me for sending them spam (especially ironic given that I run a free, anti-spam website). As I've said in my FAQ - please feel free to add mailinator.com to the tippy tippy tippy top of your spam blacklists. If you EVER get an email from mailinator.com, it's a forged spam.

The good news is that bounces are very easy to detect, and are really the first line of our defense. Bouncing SMTP servers aren't particularly evil, they're just doing their job so when I say "user unknown" they believe me and go away.

On an abstract level, here is what happens to an email as it enters the system.



(and to be fair, there might just be another layer or two that's not on that diagram!)

Anti-Spam revolt

There are 2 more, somewhat conflicting features of the Mailinator server that should be noted. For one, it's a clear fact that when we're busy, we're busy. An easy DoS against us would be to open a socket to our server and leave it open. This is an inherent vulnerability in any server (maybe especially multithreaded servers). So, as a basic idea Mailinator closes all connections if they are silent for more than a second or two. Actually, the amount of time is variable (read below). Clearly, we are DoS'able by sending us many many connections, but this blocks at least one trivial way of bringing us down.
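In a blocking, thread-per-connection Java server, dropping silent clients can be done with a socket read timeout. The sketch below is an assumption about how one might do it, not the actual server code: the class name SilentClientGuard, its methods, and the 200ms timeout in the demo are invented for the example. Socket.setSoTimeout and SocketTimeoutException are standard java.net.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: give up on a client that stays silent for too long.
final class SilentClientGuard {

    // Reads one line from the client. If nothing arrives within timeoutMillis,
    // the socket is closed and null is returned. (null also means the client hung up.)
    // A real server would keep one reader per connection instead of creating one per call.
    static String readLineOrDrop(Socket client, int timeoutMillis) {
        try {
            client.setSoTimeout(timeoutMillis); // blocking reads now fail after the timeout
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(client.getInputStream(), StandardCharsets.US_ASCII));
            return in.readLine();
        } catch (SocketTimeoutException silent) {
            closeQuietly(client);
            return null;
        } catch (IOException e) {
            closeQuietly(client);
            throw new UncheckedIOException(e);
        }
    }

    private static void closeQuietly(Socket socket) {
        try {
            socket.close();
        } catch (IOException ignored) {
            // nothing useful to do if close itself fails
        }
    }

    // Small loopback demo: a client connects and never says anything.
    static boolean silentClientIsDropped() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket silent = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            return readLineOrDrop(accepted, 200) == null && accepted.isClosed();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The thread handling that connection is tied up for at most the timeout, instead of forever.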

Secondly, although we demand that servers talking to us be very speedy, we reserve the right to be very NOT speedy. Here's the logic. When Mailinator is not terribly busy, we still demand responses quickly, but we give responses slowly. In fact, the less busy we are, the slower we give responses. It is possible that sending an email into the Mailinator SMTP server could take a very long time (like 10 or 20 or 30 seconds) even for a very small amount of data.

Why? Well.. think about it. Let's say you're spamming. You want to send out a zillion emails as fast as possible. You want every receiving SMTP server to get your email, deliver it to the poor sod who wants (or doesn't want) weener enlargement and then close the connection so you can go on to the next. If you encounter some darn SMTP server that takes 20 seconds to receive your email, the speed at which you can send out your emails diminishes. You might just even think about avoiding such SMTP servers.

It might be a pipe dream to think this is slowing down any spammers, but this does tend to keep my quieter times lasting longer. And it doesn't really hurt me - or my users. And if we eventually get terribly busy, those delays are scaled down to make sure we don't lose any emails.
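The "slower when idle, faster when busy" rule could be as simple as the following Java sketch. The linear scaling, the class name Tarpit, and the parameter names are my assumptions for illustration; the post doesn't say how the delay is actually computed.

```java
// Hypothetical sketch: pick a reply delay that shrinks as the server gets busier.
final class Tarpit {

    // Returns maxDelayMillis when idle, 0 once activeConnections reaches busyConnections,
    // and a linearly scaled value in between.
    static long replyDelayMillis(int activeConnections, int busyConnections, long maxDelayMillis) {
        if (busyConnections <= 0) {
            throw new IllegalArgumentException("busyConnections must be positive");
        }
        if (activeConnections >= busyConnections) {
            return 0L; // fully busy: answer immediately so no mail is lost
        }
        int active = Math.max(0, activeConnections);
        return maxDelayMillis * (busyConnections - active) / busyConnections;
    }
}
```

For example, with a 20-second maximum and 300 connections counted as "busy", an idle server would stall each reply by the full 20 seconds, a half-loaded one by 10, and a fully loaded one not at all. A handler thread would simply sleep for that long before writing its SMTP response.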

Sites will ban it

Every time I read some comment about Mailinator, someone always points out something like "Yeah, well sites will start banning any email addressed to Mailinator and then it will be worthless". Guys. It's been 3 years. A handful of sites have indeed blocked Mailinator, but my user base and the number of read emails has only gone up. Clearly, people are finding Mailinator more useful than ever.

I have at times added additional domains (like sogetthis.com and fakeinformation.com) that point to Mailinator. Often if a site bans mailinator.com proper, you can use one of those to the same effect.

Overall

Many copycat sites have appeared over the years which is pretty reasonable. This idea itself is obvious. The only real hurdle was that it seemed impossible to do given the amount of useless email you'd get. But the copycats had the advantage of seeing that Mailinator actually does work, so they knew what to shoot for. Only a few post their daily email numbers but I've yet to see any that come close to mailinator's incoming email (not that this is necessarily a good thing). I also see that many are using an architecture similar to Mailinator's original which is just fine so long as they either don't get any massive increases in email or are happy to keep buying bigger hardware.

Overall, Mailinator has been a great experience. It was a terribly fun exercise in optimization, security, and generally making things work. Thousands of people use it every day and it's amazing how many people know about it when it comes up in conversation. I've thought many times about how to make a business around it, and there is always an angle, but I've just been too busy with other things.

My hope is that it's useful for you and that you tell your friends.


Note: Eternal thanks to Jack Lawrence of Syracuse, NY who, in a drunken stupor gave me the core idea (story here), Nicci Gabriel of www.sideofsauce.com for the seriously cool web design, and to Brian Pipa of www.candyaddict.com who, as a big fan of Mailinator, added the very cool "Spam Map" and the RSS feeds.

Monday, December 04, 2006

Review: 2 Books on Anti-Religion

I finished Dawkins' The God Delusion and Harris' Letter to a Christian Nation in the past few weeks.

It's really interesting to finally see a contingent of humanity stand up against religion. That's interesting because it's basically fighting for nothing. These atheists (although Harris is quick to point out that "atheist" is a silly word, read below) aren't trying to convert you per se, they're trying to tell you that your religion is pretty silly.

Both agree that we are ALL atheists when it comes to some religions. For example, most of us are atheists to Zeus nowadays (sorry dude). In fact, everyone is probably an atheist to hundreds of different religions. Pure atheists just go one religion further and actually believe in none. They go on with the idea that the word 'atheist' is rather silly in that we don't often categorize un-members of something (although I can think of a counter-example in the word "unemployed").

I absolutely loved Dawkins' earlier books The Blind Watchmaker and The Selfish Gene. The biggest fault of The God Delusion for me was that Dawkins tends to get angry at times. I understand that in his videos of talking to indignant religious people (YouTube on "root of all evil"), but there doesn't seem to be much purpose for it in a book. It was an unwelcome distraction (reminded me of Aronson's The Social Animal where he kept being overly politically correct with facts like "men often have more muscle mass than women - Not to say women can't be strong!!! it's just that..").

Letter to a Christian Nation is all of 80 pages and can be read in one bus ride (QED). It's clean and concise and worth it given the modest investment.

Both books bring up plenty of Bible contradictions (they both focused on only the Bible it seems) and meta-views of the universe (e.g. if one believes the universe is so complex as to have only been possibly created by a creator, then how was the yet-even-far-more-complex creator created?).

I don't expect either of these books to change many minds given that I doubt many people who are religious will read them. Of decent value is Dawkins' discussion on the misconceptions of stem cell research and evolution. He says that many religious people tell him that the universe (or the eye, or the cardio-vascular system, or any other sufficiently complex system) could not have developed by accident. He calls this the biggest misunderstanding of evolution.

Evolution is never by accident. It's by massively parallel trial and error. (Note that genetic algorithms in artificial intelligence work this way: a million computers (or simulated computers) try a million slightly different attacks at a problem, and the best approach is kept and forms the basis for the next million approaches. All losers are unmercifully deleted.) A "part" of an eye is truly an advantage over your neighbor that has no eye at all. And if your other neighbor has an even better part, his vision gives him an advantage over you. In other words, the eye did not evolve overnight. It began as something that could barely detect light and dark, and slowly got better and better as it evolved to provide more advantage to its owner.

Anyway, these books are just some more fun examples of world religions. Dawkins tells one story of describing an aboriginal religion that involved witches that fly in the night and shoot poison darts at bad people (i.e., "sinners"). To which a priest at his table laughed at what nonsense that religion was. To which Dawkins basically replied that the priest's religion didn't make much more sense itself (plenty of tense moments therein).

The overtones are of course that more people die and more wars are fought for religion than anything else. That God sure wants us to kill each other it seems. Both authors would be happier if religion simply didn't exist. Overall, there isn't any terribly unobvious stuff in there, but both are good reads.