Wednesday, March 05, 2008

Writing Java Multithreaded Servers - what's old is new (PDF Slides)

I'm giving another talk tomorrow at the SD West conference:

Here are the slides
Thousands of Threads and Blocking I/O: The Old Way to Write Java Servers Is New Again (and Way Better)

I've encountered some very strongly held misconceptions out there:

1) Java asynchronous NIO has higher throughput than Java IO (false)
It doesn't. It loses by 20-30%, even single thread against single thread. Once multiple threads (and multiple cores) enter the equation, which blocking I/O is of course intent on using, the results skew even further in blocking I/O's favor.

2) Thread context switching is expensive (false)
New threading libraries across the board make this negligible. I knew Linux NPTL was fast, but I was quite surprised how well Windows XP did (graphs inside the notes).
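A rough way to get a feel for this yourself (an illustrative sketch, not the benchmark from my slides; the class and method names are mine): two threads hand a token back and forth, so each round trip forces at least one context switch per direction.

```java
import java.util.concurrent.SynchronousQueue;

// Illustrative context-switch microbenchmark sketch: each put/take pair on a
// SynchronousQueue is a direct handoff between the two threads.
public class PingPong {
    public static long handoffNanos(final int rounds) throws InterruptedException {
        final SynchronousQueue<Integer> ping = new SynchronousQueue<Integer>();
        final SynchronousQueue<Integer> pong = new SynchronousQueue<Integer>();
        Thread echo = new Thread(new Runnable() {
            public void run() {
                try {
                    for (int i = 0; i < rounds; i++) {
                        pong.put(ping.take());  // take token, hand it back
                    }
                } catch (InterruptedException ignored) {
                }
            }
        });
        echo.start();
        long t0 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            ping.put(i);   // blocks until the echo thread takes it
            pong.take();   // blocks until the echo thread hands it back
        }
        long elapsed = System.nanoTime() - t0;
        echo.join();
        return elapsed / rounds;  // approximate nanoseconds per round trip
    }
}
```

The per-round-trip number you get back bundles the queue overhead with the switches themselves, so treat it as an upper bound, not a pure context-switch cost.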

3) Synchronization is expensive (false, usually)
Synchronization can often be fully optimized away. In the cases where it couldn't, it did have a cost. But given that we have multicore systems now, it's uncommon to write a fully single-threaded server (synchronous or asynchronous); in other words, every server design pays this cost. Non-blocking data structures ameliorate it significantly (again, graphs inside show this).
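To make the two styles concrete (an illustrative sketch, not a rigorous benchmark; the class name is mine): a lock-based counter next to a non-blocking one built on compare-and-set.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch: a synchronized counter versus a non-blocking counter.
// Uncontended, the synchronized path is often elided or biased by the JVM;
// the AtomicLong never blocks at all.
public class Counters {
    private long locked;
    private final AtomicLong lockFree = new AtomicLong();

    public synchronized void incLocked() { locked++; }        // lock-based
    public void incLockFree() { lockFree.incrementAndGet(); } // CAS-based, non-blocking

    public synchronized long locked() { return locked; }
    public long lockFree() { return lockFree.get(); }
}
```

Under real contention the difference shows up in how threads wait: the lock parks losers, while the CAS loop just retries.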

4) Thread-per-connection servers cannot scale (false)
That's incorrect, at least up to an artificial limit set by JDK 1.6. 15k (or 30k, depending on the JVM) threads is really no problem (note that Linux 2.6 with NPTL using C++ is perfectly happy with a few hundred thousand threads running; Java sadly imposes an arbitrary limit). If you need more connections than that (and aren't maxing out your CPU or bandwidth), you can still use blocking I/O but must depart from thread-per-connection, or fall back to NIO.
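For reference, the thread-per-connection shape is dead simple. Here's a minimal sketch (illustrative only, not my benchmarked server; the class and method names are mine):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Minimal thread-per-connection echo server sketch: plain blocking java.io,
// one thread per client.
public class BlockingEchoServer {

    // Binds the given port (0 picks a free one) and accepts in a daemon thread.
    public static ServerSocket start(int port) throws IOException {
        final ServerSocket server = new ServerSocket(port);
        Thread acceptor = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        final Socket client = server.accept();  // blocks for a connection
                        new Thread(new Runnable() {
                            public void run() { echo(client); }
                        }).start();                             // one thread per connection
                    }
                } catch (IOException closed) {
                    // server socket closed; stop accepting
                }
            }
        });
        acceptor.setDaemon(true);
        acceptor.start();
        return server;
    }

    static void echo(Socket client) {
        try {
            InputStream in = client.getInputStream();
            OutputStream out = client.getOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {  // blocking read parks this thread
                out.write(buf, 0, n);
            }
        } catch (IOException ignored) {
        } finally {
            try { client.close(); } catch (IOException ignored) {}
        }
    }
}
```

The per-client logic reads top to bottom like single-threaded code; the kernel scheduler does the multiplexing that a selector loop would otherwise do by hand.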

I'll try to spruce up the benchmarks I used and post them. I'd like to point out that writing Java benchmarks is very hard. I spent a great deal of time making sure I warmed up the VM and ensured there were no positional biases or other overzealous or skewing optimizations.
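For the curious, the warm-up part boils down to a pattern like this (a simplified sketch, not my actual harness; names are mine):

```java
// Warm-up sketch: run the workload untimed until the JIT has compiled the
// hot paths, then time it.
public class Warmup {
    public static long avgNanos(Runnable work, int warmupIters, int measuredIters) {
        for (int i = 0; i < warmupIters; i++) {
            work.run();  // untimed: triggers JIT compilation of the hot path
        }
        long t0 = System.nanoTime();
        for (int i = 0; i < measuredIters; i++) {
            work.run();  // timed: measured against compiled code
        }
        return (System.nanoTime() - t0) / measuredIters;
    }
}
```

That's only the beginning, of course; it does nothing about positional bias, dead-code elimination, or GC pauses, which all need their own countermeasures.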

I was then *extremely* lucky to get help from Cliff Click of Azul Systems (if you want to write a benchmark, a VM engineer is the right kind of person to get help from). He spent half a Saturday tweaking my benchmark in ways I never thought of, then ran it for me on his 768-core Azul box (graph inside)! Thanks, Cliff!


Anonymous said...

Do you think the same analysis can be applied to applications requiring longer connections (e.g., a chat server with sessions that can last a few hours)?

Paul Tyma said...

That's an excellent question, because chat servers are an interesting use case: many open connections, rare data flow, and tons of receives compared to writes.

Using NIO would be very viable there, as you could possibly exhaust threads or RAM or some other resource with a thread-per-connection model.

Alternatively, you can create a multithreaded I/O server without thread-per-connection, and that would work too. But then, like NIO, you are forced to save client state manually to some degree.

In other words, the coding styles of the two servers converge in that instance. IO still has better throughput, but the very compelling case for using it (in my opinion) is somewhat diluted there. (I probably still would.)

itkovian said...

And I just heard Cliff say he would not help out people who had issues unless they bought a shitload of boxes :-)

Anonymous said...

On synchronization costs: fully thread-local synchronization can be optimized away, and uncontended synchronization is extremely cheap (on Java 6+ VMs it's basically the same cost as a compareAndSet instruction). It is only when synchronization is contended that there is any significant cost.

Anonymous said...

Great work! Thank you very much.

Anonymous said...

Could you provide the code you used for the benchmarks?

Anonymous said...

At slide #47 there is

Executors.newCachedThreadPool = evil, die die die

What are the issues with newCachedThreadPool?



Pierre Phaneuf said...

I think the main problem is that it will create new threads all the time, even when the system is completely overwhelmed and every new thread is only making things worse...
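For what it's worth, one common way around that (a sketch; the pool and queue sizes are purely illustrative) is a bounded ThreadPoolExecutor that pushes back on submitters when it saturates:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch of a bounded alternative to Executors.newCachedThreadPool(): cap
// both the threads and the queue, and throttle submitters when both fill up.
public class BoundedPool {
    public static ThreadPoolExecutor create() {
        return new ThreadPoolExecutor(
                8, 64,                                       // core and max pool size
                60L, TimeUnit.SECONDS,                       // idle non-core threads die after 60s
                new ArrayBlockingQueue<Runnable>(1000),      // bounded work queue
                new ThreadPoolExecutor.CallerRunsPolicy());  // when saturated, the caller runs the task
    }
}
```

CallerRunsPolicy gives you natural backpressure: when the pool is overwhelmed, the submitting thread does the work itself instead of spawning yet another thread.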

Barry Kelly said...

You're posing a false dichotomy, however, if you're suggesting (as you appear to be) that async means explicitly saving client state and keeping track of position in a transaction.

If your async API is modeled around continuations (e.g. BeginRead / EndRead in the .NET API), you can just pass closures as the continuations and you get client-state saving for free.

Moreover, since the CPS transform from straight-line, apparently thread-per-client code is mechanical, you can get async behaviour without any cost in programming difficulty.
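To make that concrete in Java terms (a toy sketch; the ReadHandler interface and beginRead method are hypothetical stand-ins for a real async API): the anonymous class acts as the closure, so per-client state is captured automatically rather than parked in an explicit session table.

```java
// Toy sketch of continuation-passing style in Java. ReadHandler and
// beginRead are hypothetical stand-ins for a real async read API.
public class Continuations {
    interface ReadHandler { void completed(String data); }

    static void beginRead(String simulatedInput, ReadHandler continuation) {
        // A real async API would hand this to the kernel and return; here we
        // invoke the continuation directly just to show the control flow.
        continuation.completed(simulatedInput);
    }

    public static String handleClient(final String clientName) {
        final StringBuilder result = new StringBuilder();
        beginRead("hello", new ReadHandler() {
            public void completed(String data) {
                // clientName is captured state: no manual bookkeeping needed
                result.append(clientName).append(" got: ").append(data);
            }
        });
        return result.toString();
    }
}
```

The point is that the captured variables play the role the explicit per-connection state object would otherwise play.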

Christoffer Lernö said...

Regarding anonymous' question:

Perhaps for something as simple as a chat server, multiple threads could still work, but anything more complex than that would appear to be rather painful.

abch said...

is the benchmark code available? or exist? ;-)

Tarun Elankath said...

Well, I guess not just chat servers but also P2P file clients/servers like BitTorrent would benefit more from async I/O too...

Anonymous said...

In my experience, chat servers are not such simple cases, and writes dominate reads by several orders of magnitude. Imagine rooms with 200-400 people: if someone enters or leaves, everyone should be notified without too much delay. I've seen incoming data rarely exceed 10 KB/sec while outgoing stayed constant at 3 MB/sec.

Purohit D said...

I appreciate the information. Java is one of the most consistent players in the development industry, and it keeps giving developers wide scope to come up with different solutions.

anonymous said...

On Linux each thread requires 4k of stack, right? So to serve clients, wouldn't x threads require quite a bit of resources? Also, what about memory fragmentation/allocation issues over time?

Chris said...

This is great stuff Paul. I love it when someone comes along and challenges previously held assumptions. I've heard the async > sync dogma a lot in my graduate classes and from peers.

Lateef Jackson said...

There are a number of things that I think would be helpful to figure out:
* Source code, so we could run the benchmarks and measure memory, CPU, etc.
* Is this a Java issue, an OS issue, or an architecture issue? Not all OSes are equal with regard to threading and processes, and the VM version matters too.
* I would think a cluster is a more common deployment than a single machine with 768 cores.

My benchmarks of kqueue and epoll vs. threading in Python showed that the kernel event mechanisms were significantly more scalable; however, my tests were not extensive, nor did I put much effort into the threaded code. I would be happy to reimplement them based on your algorithms if you share the code.

Paul Bates said...

My head was a mess of conflicting information until I read your presentation. Now I can see a clear path ahead. Many thanks.