We’ve been looking for a Drizzle regression for some time now, and today I decided I would take a step back and make another attempt to find it. The first step in doing this was to reproduce this consistently and find a baseline. We’ve noticed it most dramatically with a 16 concurrent connection test from sysbench in read-only mode. I used two 16-core Intel machines running Linux we have for development. We’ve noticed the regression on certain machines but not all, and these two machines provided one of each. I also setup a MySQL 5.1.35 server to use as a baseline to give some comparisons outside of Drizzle. So first, a few more details on the machines:
Machine 1: 16 core, 16GB RAM, cache sizes from dmesg:
[ 0.010000] CPU: L1 I cache: 32K, L1 D cache: 32K
[ 0.010000] CPU: L2 cache: 4096K
From /proc/cpuinfo:
cache_alignment : 64
Machine 2: 16 core, 40GB RAM, cache sizes from dmesg:
[ 0.010000] CPU: Trace cache: 12K uops, L1 D cache: 16K
[ 0.010000] CPU: L2 cache: 1024K
[ 0.010000] CPU: L3 cache: 16384K
From /proc/cpuinfo
cache_alignment : 128
For Drizzle I used the latest trunk in Launchpad (r1058), and for MySQL I downloaded mysql-5.1.35-linux-x86_64-glibc23.tar.gz from mysql.com.
For sysbench, I grabbed the Drizzle branch of it at lp:~drizzle-developers/sysbench/trunk since this has the libdrizzle driver. The libdrizzle driver also supports the MySQL so I use it to test against both. The sysbench commands I used were:
Drizzle: sysbench –test=oltp –oltp-read-only=on –max-time=15 –max-requests=0 –oltp-table-size=1000000 –num-threads=16 –db-ps-mode=disable –db-driver=drizzle –drizzle-host=127.0.0.1 –drizzle-port=4427 –drizzle-db=test –drizzle-user=root –drizzle-table-engine=innodb run
MySQL: sysbench –test=oltp –oltp-read-only=on –max-time=15 –max-requests=0 –oltp-table-size=1000000 –num-threads=16 –db-ps-mode=disable –db-driver=drizzle –drizzle-host=127.0.0.1 –drizzle-port=3306 –drizzle-db=test –drizzle-user=root –drizzle-table-engine=innodb –drizzle-mysql=on run
I started Drizzle and MySQL with the following options. These are not meant to be finely tuned options, but just enough to get the servers running with some sane comparable defaults and able to reproduce the regression.
bin/mysqld –no-defaults –server-id=1 –port=3306 –socket=/home/eday/other/mysql.data/sock.master –basedir=/home/eday/other/mysql –datadir=/home/eday/other/mysql.data/db.master –log-error=/home/eday/other/mysql.data/db.master/error –innodb-buffer-pool-size=128M –innodb-log-file-size=64M –innodb-log-buffer-size=8M –innodb-thread-concurrency=0 –innodb-additional-mem-pool-size=16M –character-set-server=utf8 –table-open-cache=4096 –open-files-limit=4096 –pid-file=/home/eday/other/mysql.data/db.master/pid
drizzled –datadir=/home/eday/other/drizzle.data –innodb-buffer-pool-size=128M –innodb-log-file-size=64M –innodb-log-buffer-size=8M –innodb-thread-concurrency=0 –innodb-additional-mem-pool-size=16M –table-open-cache=4096 –table-definition-cache=4096
Now with everything up and running, I gathered some data. Headings are: -
| |
1-drizzle |
1-mysql |
2-drizzle |
2-mysql |
| TPS |
1335 |
2434 |
1559 |
1239 |
| vmstat |
| in |
6k |
110k |
60k |
50k |
| cs |
100k |
210k |
120k |
100k |
| us |
22 |
75 |
72 |
78 |
| sy |
6 |
20 |
25 |
18 |
| id |
72 |
5 |
3 |
4 |
| wa |
0 |
0 |
0 |
0 |
| valgrind with cachegrind tool |
| TPS |
5.21 |
3.15 |
3.55 |
1.96 |
| iref |
858M |
1011M |
668M |
789M |
As you can see, we hit the major regression in column one. Our interrupts and context switches are way out of line, and the CPU is mostly idle. Note though that when run under cachegrind (valgrind –tool=callgrind ), we see the normal pattern and don’t notice the regression. This means to reproduce we can’t have any intrusive debugging tools. I also tried counting system calls as a sanity check and found (using strace -fc ):
1-drizzle: 402 TPS
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
55.47 106.807940 2867 37252 11127 futex
13.97 26.907878 220556 122 select
5.67 10.913999 10913999 1 rt_sigtimedwait
5.29 10.181982 565666 18 poll
2.16 3.1154314 11 362463 read
1.63 2.1146381 1479 2128 pread
1.55 2.981563 16 181220 write
0.67 0.1297864 122 10598 sched_yield
0.67 0.1288412 1394 924 nanosleep
1-mysql: 245 TPS
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
60.03 108.1509367 4594 23836 5835 futex
15.42 27.1119721 100788 279 1 select
5.25 8.1579991 1064443 9 rt_sigtimedwait
1.65 2.1005637 637 4716 pread
1.30 1.1367183 21 110986 write
1.07 1.950471 9 221889 221889 sched_setscheduler
1.06 1.929137 9 223097 1044 read
2-drizzle: 276 TPS
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
65.93 115.556211 3704 31198 8101 futex
15.30 26.810000 203106 132 select
5.89 10.330000 10330000 1 rt_sigtimedwait
5.80 10.170000 565000 18 poll
2.65 4.647123 37 125060 write
2.35 4.116470 16 250158 read
1.72 3.021068 1483 2037 pread
0.27 0.477662 20768 23 fsync
0.04 0.072760 4 18218 madvise
0.02 0.043264 18 2357 sched_yield
2-mysql: 168 TPS
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
68.25 100.851565 6704 15044 3328 futex
21.22 31.350000 116111 270 1 select
5.17 7.642591 347391 22 rt_sigtimedwait
1.93 2.853778 619 4612 pread
1.15 1.694888 11 153040 346 read
0.90 1.327743 9 152529 152529 sched_setscheduler
0.78 1.145964 15 76305 write
0.32 0.468542 37 12705 madvise
Again, when tracing the process to count system calls, we see the regression disappear, which leaves us with a smaller set of tools to use.
So what next? One theory we were tossing around is a cache alignment issues. This seems like a pretty dramatic drop in performance to be caused by this, but I ran a test to see what the behavior of a process is when you are being throttled by a shared cache line. The results showed idle CPU, but the interrupts and context switches did not drop off. This does not follow the same pattern we saw in Drizzle (interrupts and context switches did drop off). Our cache line size is also smaller on the machine showing the regression, so that did not help support this theory.
While stabbing at a few other ideas, I ran ldd to see which libraries were being use in the two drizzled binaries on each machine. Suppressing some common libs:
1-drizzle: ldd drizzled/drizzled
...
libpcre.so.3 => /lib/libpcre.so.3 (0x00007f9a88532000)
libtbb.so.2 => /usr/lib/libtbb.so.2 (0x00007f9a88320000)
libtcmalloc.so.0 => /usr/lib/libtcmalloc.so.0 (0x00007fcced5bc000)
...
2-drizzle: ldd drizzled/drizzled
...
libpcre.so.3 => /lib/libpcre.so.3 (0x00007fc0af803000)
libtbb.so.2 => /usr/lib/libtbb.so.2 (0x00007fc0af5e9000)
...
The machine showing the regression is linking with tcmalloc. Looking at the drizzle configure.ac, we use libtcmalloc by default if it is found (machine 2 does not have tcmalloc installed). I relinked drizzled without tcmalloc and received these results:
| |
1-drizzle |
1-mysql |
2-drizzle |
2-mysql |
| TPS |
2751 |
2434 |
1559 |
1239 |
There it is! For some reason tcmalloc was giving us a 51% performance drop. Perhaps this is due to the tcmalloc version or settings we need to tweak for performance (something to look into later), but for now disabling this by default is the solution. We’re verifying the fix now and should be in the Drizzle trunk shortly.