Scale Stack vs node.js vs Twisted vs Eventlet

We’ve been discussing switching from Tornado to either Twisted or Eventlet for Nova (the compute project for OpenStack), so I decided to set up a test to see if there are performance differences to take into consideration. While I was at it I decided to include node.js, since that’s all the rage these days, as well as Scale Stack, a C++ project I started earlier this year.

The Test

I wanted to check two main factors: handling of large numbers of concurrent connections and the overhead of transferring large amounts of data. To do this I wrote a simple echo server in each framework and then used the Scale Stack echo flood tool to test each one. The tool allows you to specify the number of concurrent connections and how much data to send and verify in 32k chunks (a rough sketch of a single client session is shown after the server listings below). You can find the echo server and flood tool for Scale Stack in the project source code. For each of the others, here is the echo server source:

node.js
var net = require('net');
net.createServer(function (socket) {
  socket.on("data", function (data) {
    socket.write(data);
  });
  socket.on("end", function () {
    socket.end();
  });
}).listen(12345, "localhost", 32768);  // backlog is passed to listen()
Twisted
from twisted.internet.protocol import Protocol, Factory
from twisted.internet import epollreactor
epollreactor.install()  # install the epoll reactor before importing reactor
from twisted.internet import reactor

class Echo(Protocol):
    def dataReceived(self, data):
        self.transport.write(data)

factory = Factory()
factory.protocol = Echo
reactor.listenTCP(12345, factory, backlog=32768)
reactor.run()
Eventlet
import eventlet

def handle(fd):
    # one green thread per connection
    while True:
        c = fd.recv(16384)
        if not c: break
        fd.sendall(c)

server = eventlet.listen(('0.0.0.0', 12345), backlog=32768)
pool = eventlet.GreenPool(size=32768)
while True:
    new_sock, address = server.accept()
    pool.spawn_n(handle, new_sock)
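
For reference, here is a rough Python sketch of what a single flood-client session does. This is my own illustration, not the actual tool (which is the C++ echo flood client in the Scale Stack source): send the target amount of data in 32k chunks and verify that every byte comes back.

import socket

CHUNK = 32768  # the flood tool sends and verifies data in 32k chunks

def echo_session(host, port, total_bytes):
    # Illustrative only: the real tool multiplexes thousands of these
    # sessions concurrently from a single process.
    payload = b"x" * CHUNK
    sock = socket.create_connection((host, port))
    try:
        sent = 0
        while sent < total_bytes:
            sock.sendall(payload)
            received = 0
            while received < CHUNK:  # read back and verify the full chunk
                data = sock.recv(CHUNK - received)
                if not data:
                    raise RuntimeError("server closed early")
                if data != payload[received:received + len(data)]:
                    raise RuntimeError("echo mismatch")
                received += len(data)
            sent += CHUNK
    finally:
        sock.close()

echo_session("localhost", 12345, 512 * 1024)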
Setup

Since none of the frameworks run multi-core for this test (although Scale Stack could), I decided to use my laptop, which is a 2.4GHz Core 2 Duo with 4GB of memory running Ubuntu 10.04. There will be one core for the server and one for the client. Doing the test on a single machine also lets us cut network bottlenecks out of the picture since it all runs through the local interface. In order to test at the high connection counts, I needed to tweak some system limits. I allow for 64k file descriptors per process in /etc/security/limits.conf:
root soft nofile 65535
root hard nofile 65535
* soft nofile 65535
* hard nofile 65535
You’ll notice really high listen backlog settings in the echo server code above. The kernel limits need to match as well, so we set the corresponding values in /proc. I also increased the ephemeral port range so we can get up to 32k active client connections, and reduced the kernel socket buffer sizes so I don’t run out of memory. These can be set with:
echo 32768 > /proc/sys/net/core/netdev_max_backlog
echo 32768 > /proc/sys/net/core/somaxconn
echo "21000 61000" > /proc/sys/net/ipv4/ip_local_port_range
echo 8192 > /proc/sys/net/core/rmem_default
echo 8192 > /proc/sys/net/core/wmem_default
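
Before each run, a quick sanity check confirms the limits took effect. This Python snippet is my own addition, not part of the original test scripts:

import resource

# Per-process file descriptor limit; expect 65535/65535 after the
# limits.conf change above.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("nofile soft/hard:", soft, hard)

# Kernel settings written to /proc above.
for name in ("net/core/somaxconn",
             "net/core/netdev_max_backlog",
             "net/ipv4/ip_local_port_range"):
    with open("/proc/sys/" + name) as f:
        print(name, "=", f.read().strip())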
With the system limits set, I started running the flood tool with connection counts from 1 through 32k. For each connection count, I ran the test twice: once with each connection echoing 32k of data and once with 512k. I ran each test three times for each server and took the lowest time (times were very consistent across the board, so any sample would have done).

Results

[Graph: Scale Stack vs node.js vs Twisted vs Eventlet; the results are listed in the table below.]

[Table: test times in seconds for each framework across connection counts, for the 32k and 512k transfers.]

After the above tests, I also started each server up one at a time and ran a 32k connection client that sent data indefinitely to saturate the process. Here are the vmstat numbers of my system during these tests:

[Table: vmstat output during saturation, including context switches, user/system/idle CPU %, and the Client Delay column discussed below.]

In all cases the server process was consuming an entire core. The idle times were on the core running the client tool, since the server could not always keep up with the client load. The last column labeled “Client Delay” was another time test I ran while the server was saturated to measure response time. For this test, a client would connect, send 32k of data, wait for the echo response, and then disconnect. Results are in seconds for this test.
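
A minimal Python sketch of that response-time probe (my reconstruction of the test shape, not the exact client used):

import socket
import time

def client_delay(host, port):
    # Connect, send 32k, wait for the full echo, then disconnect;
    # the elapsed wall-clock time is the response time under load.
    payload = b"x" * 32768
    start = time.time()
    sock = socket.create_connection((host, port))
    sock.sendall(payload)
    received = 0
    while received < len(payload):
        data = sock.recv(len(payload) - received)
        if not data:
            break
        received += len(data)
    sock.close()
    return time.time() - start

print("delay: %.3f seconds" % client_delay("localhost", 12345))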

Conclusions

I was very impressed with how node.js and the Python frameworks held up. I’ve been writing event-driven servers in C/C++ for the past decade or so and didn’t think the higher-level languages could handle this kind of load as well as they did. My only concern with node.js or Python is not being able to use all the cores on your system. Some services are well suited to running multiple server processes on a single machine or farming work out to worker process pools to utilize all your cores (as sketched below), so for those this will be less of an issue. Other services are best implemented with all connections in a single process, using thread pools instead; for those you’ll still need to rely on a C or C++ based server (Scale Stack is meant to be a framework like the others to help in these cases). Servers written in Erlang or Java would probably perform decently across multiple cores as well.
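
Here is a minimal pre-fork sketch of the worker-process approach. This is my illustration of the classic pattern, not code from any of the tested frameworks: the parent opens the listening socket and forks one blocking echo worker per core.

import os
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 12345))
server.listen(32768)

for _ in range(os.cpu_count() or 2):  # assumption: one worker per core
    if os.fork() == 0:
        # Each child accepts on the shared socket and handles one
        # connection at a time with blocking I/O.
        while True:
            conn, _addr = server.accept()
            data = conn.recv(16384)
            while data:
                conn.sendall(data)
                data = conn.recv(16384)
            conn.close()

os.waitpid(-1, 0)  # parent just waits on the workers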

For short-lived connections transferring less than 32k of data, all frameworks scaled very well. When a larger amount of data was being sent, we started to see some differentiation. This could be due to buffering techniques or simply the overhead of calling into the language handlers more often. The increase in user % in the processor utilization from the vmstat output for node.js and Python supports this. Scale Stack only buffers once on read and has less runtime overhead since it is not running in an interpreter. It may be possible to optimize the node.js and Python servers to avoid double buffering if that is indeed happening; please let me know if that is the case.
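
To illustrate the single-buffer idea, here is my own sketch (not code from any of the tested frameworks) of a raw-socket echo loop that reads into one pre-allocated buffer and writes back only the bytes received, so each chunk is copied into user space once:

import socket

def echo_loop(conn):
    buf = bytearray(32768)       # one reusable receive buffer
    view = memoryview(buf)
    while True:
        n = conn.recv_into(buf)  # fill the buffer in place, no new objects
        if n == 0:
            break                # client closed the connection
        conn.sendall(view[:n])   # send back exactly what was read
    conn.close()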

As far as the original question of Twisted vs Eventlet, I don’t think performance will be much of a deciding factor. Eventlet has a slight boost in performance and claims to be easier to write services in, but other folks still swear by Twisted. It is probably safe to say that available framework features and personal preference will be the deciding factors.

Update – August 6, 2017

I decided to run a few more versions for just the 32k connection, 512k data test. Below are the repeated times, in seconds, for the original four, plus Erlang, regular Python threads, and two versions of Go.

Scale Stack 50.25
node.js 117.04
Twisted 89.46
Eventlet 77.80
Erlang 61.65
Python threads 111.04 (lots of memory even with minimal stack size; see the sketch below)
Go v1 62.95
Go v2 59.73
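
The “Python threads” entry is a plain thread-per-connection server. A minimal sketch of that shape (my reconstruction, not the exact code used), with the thread stack size pushed down to limit memory:

import socket
import threading

# Assumption: 64k is near the smallest portable thread stack size; even
# at the minimum, 32k threads still used a lot of memory in this test.
threading.stack_size(64 * 1024)

def handle(conn):
    while True:
        data = conn.recv(16384)
        if not data:
            break
        conn.sendall(data)
    conn.close()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 12345))
server.listen(32768)
while True:
    conn, _addr = server.accept()
    threading.Thread(target=handle, args=(conn,)).start()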

The Go versions are very impressive, almost as fast as the C++ version. Of course, with these last four you get SMP without any extra work, which is a bonus. It turns out the default socket buffer sizes in Erlang are only 1500 bytes (MTU size), so be sure to push these up (in this test I set them to 16k). Memory consumption with the Erlang server was also fairly low (peak around 400M, usually around 150M).