libdrizzle and PHP Extension Released

July 2nd, 2009

Version 0.4 of libdrizzle has been released. This was mostly a maintenance release with build system changes and small bug fixes. This is the client and protocol library for Drizzle and MySQL that provides both client and server interfaces.

Version 0.4.1 of the Drizzle PHP Extension has also been released. James Luedke has moved development and releases of the extension into PECL, and has also fixed a number of bugs, extended the interface, and worked with the PHP/PECL developers to get the extension up to the proper PHP coding standards. Thanks James!

Gearman Releases

July 2nd, 2009

Version 0.8 of the Gearman C Server and Library has been released. This includes basic HTTP protocol support, build system improvements, and bug fixes.

Version 0.4.0 of the Gearman PHP Extension has also been released.

If you want to learn more about Gearman, be sure to check out the upcoming Boston MySQL Meetup, MySQL Webinar, or one of the events at OSCON (tutorial, session, and BoF).

Drizzle and Gearman in Boston Next Week

June 30th, 2009

I’ll be heading back to my home state (Maine) this week for a visit, and while I’m back there Patrick Galbraith and I will be talking at the Boston MySQL Meetup Group on Monday night about Drizzle, Gearman, and how to combine the two with projects like Narada. If you are in the Boston area, be sure to check it out!

Why You Won’t See a Drizzle Proxy

June 29th, 2009

I’ve been following the excellent work that Jan, Kay, and others have been doing with MySQL Proxy; it has really matured into a great piece of software. I talked to Jan at the MySQL UC and toyed with the idea of integrating libdrizzle into MySQL Proxy. I’ve also been asked by a number of folks when a Drizzle Proxy project will be started and if it will be as feature rich as MySQL Proxy. For a while I just said “Someday, I just don’t have the time.” Lately though I am hoping we never have a Drizzle Proxy project.

Let me explain.

One of the fundamental ideas in software engineering is code reuse through libraries or modules. Rather than create a Drizzle Proxy project, why not add a proxy module into the Drizzle server? This way, at any point during the query execution path, you could toss the query to the proxy module to deal with, and the main execution engine would be done. You could of course run the Drizzle server in a “proxy only” mode where new queries may only be parsed and then a post-parsing module determines where and how that query is proxied. Post proxy hooks will be needed as well for result processing. Functionally, it’s the same thing as the proxy, but without having to reinvent the components needed in the proxy. (Just as a side note, I understand this may not have been an option for the MySQL proxy folks).

So, to be clear, I still want to have proxy functionality, just not as an independent project.

Even with a proxy module inside of the server, I’d like to address some of the reasons proxies are created and used. These are not necessarily specific to database proxies; many of these reasons apply to other server types as well. In the case of a database proxy, especially with Drizzle, I would like to address the list of reasons below in a different way. Why? In most architectures, I see a proxy server as a fix for a shortcoming with another component, possibly in the client, server, or maybe even in the application data model. It also introduces latency and another failure point that may not be necessary. The less code and the fewer machines your application has to run through, the better. Don’t get me wrong, there are reasons to use proxies, but sometimes they are used as a hack.

  • Query processing and rewriting - In Drizzle we plan to add query rewrite plugin hooks, both pre-parser and post-parser. At some point we want to add pluggable parser support and clean up the abstract syntax tree. These plugins would enable rewriting of queries at a few different levels, both with the raw strings or with rearranging the syntax tree before the optimizer takes over.
  • Query multi-cast, data partitioning, result merging - In my opinion, this could probably be done at the client library layer or through another system such as Gearman. If pushing that logic into the client is not an option, you could still accomplish this through the proxy module I mentioned above, possibly running the server in a mixed-mode (some queries answered locally, some proxied).
  • Connection Pooling/Concentration - People often confuse these two terms. Pooling is the re-use of connections on the client side. This should be pushed to client APIs whenever possible. When this is not possible, you need to use a generic TCP proxy or database proxy, but these should only be run locally (not on a separate machine). A concentrator is a piece of software that acts as a connection multiplexer. It takes multiple client side connections and allows them to map onto a single connection to the server. This is usually needed because the server does not have an efficient threading or file descriptor handling model to withstand thousands of connections. It’s not always an option to re-architect a server to handle this, but it should be preferred over creating another layer to do the concentration for you. In Drizzle, this is one thing I have a particular interest in. It involves improving or re-writing the pool-of-threads scheduler and making the execution engine more stateful so it can yield a thread when it knows it will block.
  • Sharding, HA/failover - Again, something I think belongs in the client library, and is part of the new Drizzle protocol. I’ll be adding support into libdrizzle to manage sharding and connection failover shortly.
  • Debugging layer - At some point we should be adding probes into the server where output can be piped to a module of your choice. For example, you can register for a set of events and have a module send those out into a Gearman network for processing and debugging. This will give you the flexibility to process probe output however you want and does not introduce another layer just for debugging.
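To make the pooling half of the pooling/concentration distinction above concrete, here is a minimal Python sketch of client-side connection re-use. It is not taken from libdrizzle or any Drizzle client; the `ConnectionPool` class and the `connect` factory argument are hypothetical names used only for illustration.

```python
import queue


class ConnectionPool:
    """Client-side pool: hand out idle connections instead of opening new ones."""

    def __init__(self, connect, max_size=4):
        # `connect` is a hypothetical factory, e.g. a database driver's connect().
        self._connect = connect
        self._idle = queue.LifoQueue(maxsize=max_size)
        self.opened = 0  # number of real connections created so far

    def acquire(self):
        try:
            # Re-use an idle connection if one is available.
            return self._idle.get_nowait()
        except queue.Empty:
            # Otherwise open a new one.
            self.opened += 1
            return self._connect()

    def release(self, conn):
        try:
            # Keep the connection around for the next caller.
            self._idle.put_nowait(conn)
        except queue.Full:
            # Pool is full; a real pool would close the connection here.
            pass
```

Nothing here multiplexes several clients onto one server connection; that is what a concentrator does, and it is the part that needs a separate process or server-side support.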

These are things I plan to work on at some point or would like to help someone else work on inside Drizzle. Also, these are my own thoughts and may not be shared by fellow Drizzle developers. Treat this as an invitation for discussion. :)

Gearman Pluggable Protocol

June 12th, 2009

I just finished adding pluggable protocol support to the Gearman job server. This will enable even more methods of submitting jobs into Gearman. If all the various Gearman APIs, MySQL UDFs, and Drizzle UDFs are not enough, it’s now fairly easy to write a module that takes over the socket I/O and parsing hooks to map any protocol into the job server. As an example module, I added basic HTTP protocol support:

> gearmand -r http &
[1] 29911
> ./examples/reverse_worker > /dev/null &
[2] 29928
> nc localhost 8080
POST /reverse HTTP/1.1
Content-Length: 12

Hello World!

HTTP/1.0 200 OK
X-Gearman-Job-Handle: H:lap:1
Content-Length: 12
Server: Gearman/0.8

!dlroW olleH

I’ve added a few headers for setting things like background, priority, and unique key. For example, if you want to run the above job in the background:

POST /reverse HTTP/1.1
Content-Length: 12
X-Gearman-Background: true

Hello World!

HTTP/1.0 200 OK
X-Gearman-Job-Handle: H:lap:2
Content-Length: 0
Server: Gearman/0.8
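The same requests can be built from a script instead of typed into nc. The Python sketch below only assembles the request shown above (the path names the function, `Content-Length` covers the payload, and `X-Gearman-Background` is optional). The helper names `build_job_request` and `submit` are made up for this example, and `submit` assumes a job server listening on localhost:8080 as in the transcript.

```python
import socket


def build_job_request(function, payload, background=False):
    """Return raw HTTP request bytes that submit one job to the HTTP module."""
    headers = [
        f"POST /{function} HTTP/1.1",
        f"Content-Length: {len(payload)}",
    ]
    if background:
        # Header taken from the example above; asks for a background job.
        headers.append("X-Gearman-Background: true")
    head = "\r\n".join(headers) + "\r\n\r\n"
    return head.encode("ascii") + payload


def submit(request, host="localhost", port=8080):
    """Send the request and return the first chunk of the reply.

    This is not a full HTTP client: it does a single recv() and does
    not parse the response.
    """
    with socket.create_connection((host, port)) as sock:
        sock.sendall(request)
        return sock.recv(65536)
```

With the job server and reverse worker from above running, `submit(build_job_request("reverse", b"Hello World!"))` should return a reply along the lines of the first transcript.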

So what protocols are we looking at? HTTP and memcached were on the top of the list, but I’m guessing other folks may have better ideas or perhaps could use it for custom integration with their existing infrastructure. This is now tested in my development branch and will be pushed to trunk in the next couple days. If anyone is interested in working on the HTTP module, please hack away, patches are welcome! :) It may be interesting to map a worker interface in as well depending on headers, along with better support for client requests and HTTP error codes.

Here is another quick example that shows how this can be useful. With the job server we started above still running, use the gearman command line client/worker to start up a worker that can do the function ‘proto’ and responds by dumping the file PROTOCOL (use any other file you have around):

> gearman -w -f proto cat PROTOCOL

If you’ve not used this command line tool before, -w makes the process act like a worker, -f function specifies which function the worker should register as, and everything after is executed every time a job is run (it fork()s, remaps stdin/out to pass the payload/read result, and then exec()s).
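The per-job behavior described above (payload on stdin, result read from stdout) can be sketched in Python with the standard `subprocess` module. This is only an illustration of the idea, not how the gearman tool is implemented; `run_job` and `REVERSE` are names invented for the example.

```python
import subprocess
import sys


def run_job(command, payload):
    """Run `command` once for a job: payload goes to stdin, result comes from stdout."""
    completed = subprocess.run(
        command,
        input=payload,           # job payload is fed to the child's stdin
        stdout=subprocess.PIPE,  # whatever the child prints becomes the result
        check=True,              # raise if the command exits with an error
    )
    return completed.stdout


# Stand-in for a worker command: reverse whatever arrives on stdin.
REVERSE = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read()[::-1])",
]
```

In this picture, `gearman -w -f proto cat PROTOCOL` corresponds to calling `run_job(["cat", "PROTOCOL"], payload)` for every job that arrives for the function proto.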

Now point your browser to http://localhost:8080/proto and you will see the contents of the file (assuming you are running all this on your local machine). This may not seem too useful, but now imagine more complex workers running on a distributed cluster. We now have a simple web server with distributed CGI scripts! :)

Drizzle Regression Hunting

June 9th, 2009

We’ve been looking for a Drizzle regression for some time now, and today I decided I would take a step back and make another attempt to find it. The first step in doing this was to reproduce it consistently and find a baseline. We’ve noticed it most dramatically with a 16 concurrent connection test from sysbench in read-only mode. I used two 16-core Intel machines running Linux that we have for development. We’ve noticed the regression on certain machines but not all, and these two machines provided one of each. I also set up a MySQL 5.1.35 server to use as a baseline to give some comparisons outside of Drizzle. So first, a few more details on the machines:

Machine 1: 16 core, 16GB RAM, cache sizes from dmesg:
[    0.010000] CPU: L1 I cache: 32K, L1 D cache: 32K
[    0.010000] CPU: L2 cache: 4096K
From /proc/cpuinfo:
cache_alignment : 64

Machine 2: 16 core, 40GB RAM, cache sizes from dmesg:
[    0.010000] CPU: Trace cache: 12K uops, L1 D cache: 16K
[    0.010000] CPU: L2 cache: 1024K
[    0.010000] CPU: L3 cache: 16384K
From /proc/cpuinfo
cache_alignment : 128

For Drizzle I used the latest trunk in Launchpad (r1058), and for MySQL I downloaded mysql-5.1.35-linux-x86_64-glibc23.tar.gz from mysql.com.

For sysbench, I grabbed the Drizzle branch of it at lp:~drizzle-developers/sysbench/trunk since this has the libdrizzle driver. The libdrizzle driver also supports the MySQL protocol, so I used it to test against both. The sysbench commands I used were:

Drizzle: sysbench --test=oltp --oltp-read-only=on --max-time=15 --max-requests=0 --oltp-table-size=1000000 --num-threads=16 --db-ps-mode=disable --db-driver=drizzle --drizzle-host=127.0.0.1 --drizzle-port=4427 --drizzle-db=test --drizzle-user=root --drizzle-table-engine=innodb run

MySQL: sysbench --test=oltp --oltp-read-only=on --max-time=15 --max-requests=0 --oltp-table-size=1000000 --num-threads=16 --db-ps-mode=disable --db-driver=drizzle --drizzle-host=127.0.0.1 --drizzle-port=3306 --drizzle-db=test --drizzle-user=root --drizzle-table-engine=innodb --drizzle-mysql=on run

I started Drizzle and MySQL with the following options. These are not meant to be finely tuned options, but just enough to get the servers running with some sane comparable defaults and able to reproduce the regression.

bin/mysqld --no-defaults --server-id=1 --port=3306 --socket=/home/eday/other/mysql.data/sock.master --basedir=/home/eday/other/mysql --datadir=/home/eday/other/mysql.data/db.master --log-error=/home/eday/other/mysql.data/db.master/error --innodb-buffer-pool-size=128M --innodb-log-file-size=64M --innodb-log-buffer-size=8M --innodb-thread-concurrency=0 --innodb-additional-mem-pool-size=16M --character-set-server=utf8 --table-open-cache=4096 --open-files-limit=4096 --pid-file=/home/eday/other/mysql.data/db.master/pid

drizzled --datadir=/home/eday/other/drizzle.data --innodb-buffer-pool-size=128M --innodb-log-file-size=64M --innodb-log-buffer-size=8M --innodb-thread-concurrency=0 --innodb-additional-mem-pool-size=16M --table-open-cache=4096 --table-definition-cache=4096

Now with everything up and running, I gathered some data. Headings are machine number and server (for example, 1-drizzle is Drizzle on machine 1):

                 1-drizzle  1-mysql  2-drizzle  2-mysql
TPS                   1335     2434       1559     1239
vmstat
  in                    6k     110k        60k      50k
  cs                  100k     210k       120k     100k
  us                    22       75         72       78
  sy                     6       20         25       18
  id                    72        5          3        4
  wa                     0        0          0        0
valgrind with cachegrind tool
  TPS                 5.21     3.15       3.55     1.96
  iref                858M    1011M       668M     789M

As you can see, we hit the major regression in column one. Our interrupts and context switches are way out of line, and the CPU is mostly idle. Note though that when run under cachegrind (valgrind --tool=callgrind), we see the normal pattern and don’t notice the regression. This means to reproduce we can’t have any intrusive debugging tools. I also tried counting system calls as a sanity check and found (using strace -fc):

1-drizzle: 402 TPS
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 55.47  106.807940        2867     37252     11127 futex
 13.97   26.907878      220556       122           select
  5.67   10.913999    10913999         1           rt_sigtimedwait
  5.29   10.181982      565666        18           poll
  2.16    3.1154314          11    362463           read
  1.63    2.1146381        1479      2128           pread
  1.55    2.981563          16    181220           write
  0.67    0.1297864         122     10598           sched_yield
  0.67    0.1288412        1394       924           nanosleep

1-mysql: 245 TPS
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 60.03  108.1509367        4594     23836      5835 futex
 15.42   27.1119721      100788       279         1 select
  5.25    8.1579991     1064443         9           rt_sigtimedwait
  1.65    2.1005637         637      4716           pread
  1.30    1.1367183          21    110986           write
  1.07    1.950471           9    221889    221889 sched_setscheduler
  1.06    1.929137           9    223097      1044 read

2-drizzle: 276 TPS
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 65.93  115.556211        3704     31198      8101 futex
 15.30   26.810000      203106       132           select
  5.89   10.330000    10330000         1           rt_sigtimedwait
  5.80   10.170000      565000        18           poll
  2.65    4.647123          37    125060           write
  2.35    4.116470          16    250158           read
  1.72    3.021068        1483      2037           pread
  0.27    0.477662       20768        23           fsync
  0.04    0.072760           4     18218           madvise
  0.02    0.043264          18      2357           sched_yield

2-mysql: 168 TPS
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 68.25  100.851565        6704     15044      3328 futex
 21.22   31.350000      116111       270         1 select
  5.17    7.642591      347391        22           rt_sigtimedwait
  1.93    2.853778         619      4612           pread
  1.15    1.694888          11    153040       346 read
  0.90    1.327743           9    152529    152529 sched_setscheduler
  0.78    1.145964          15     76305           write
  0.32    0.468542          37     12705           madvise

Again, when tracing the process to count system calls, we see the regression disappear, which leaves us with a smaller set of tools to use.

So what next? One theory we were tossing around is a cache alignment issue. This seems like a pretty dramatic drop in performance to be caused by this, but I ran a test to see what the behavior of a process is when you are being throttled by a shared cache line. The results showed idle CPU, but the interrupts and context switches did not drop off. This does not follow the same pattern we saw in Drizzle (interrupts and context switches did drop off). Our cache line size is also smaller on the machine showing the regression, so that did not help support this theory.

While stabbing at a few other ideas, I ran ldd to see which libraries were being used in the two drizzled binaries on each machine. Suppressing some common libs:

1-drizzle: ldd drizzled/drizzled
        ...
        libpcre.so.3 => /lib/libpcre.so.3 (0x00007f9a88532000)
        libtbb.so.2 => /usr/lib/libtbb.so.2 (0x00007f9a88320000)
        libtcmalloc.so.0 => /usr/lib/libtcmalloc.so.0 (0x00007fcced5bc000)
        ...

2-drizzle: ldd drizzled/drizzled
        ...
        libpcre.so.3 => /lib/libpcre.so.3 (0x00007fc0af803000)
        libtbb.so.2 => /usr/lib/libtbb.so.2 (0x00007fc0af5e9000)
        ...

The machine showing the regression is linking with tcmalloc. Looking at the drizzle configure.ac, we use libtcmalloc by default if it is found (machine 2 does not have tcmalloc installed). I relinked drizzled without tcmalloc and received these results:

     1-drizzle  1-mysql  2-drizzle  2-mysql
TPS       2751     2434       1559     1239

There it is! For some reason tcmalloc was giving us a 51% performance drop. Perhaps this is due to the tcmalloc version or settings we need to tweak for performance (something to look into later), but for now disabling this by default is the solution. We’re verifying the fix now and it should be in the Drizzle trunk shortly.
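For reference, the 51% figure follows from the two 1-drizzle TPS numbers above (1335 with tcmalloc, 2751 without). A small Python sketch of the arithmetic, with `percent_drop` being just an illustrative helper name:

```python
def percent_drop(slow_tps, fast_tps):
    """Percentage of throughput lost going from fast_tps down to slow_tps."""
    return 100.0 * (fast_tps - slow_tps) / fast_tps


# TPS for 1-drizzle: 1335 linked with tcmalloc, 2751 relinked without it.
drop = percent_drop(1335, 2751)
print(f"{drop:.1f}% drop")
```

Measured the other way around, relinking without tcmalloc roughly doubled throughput on machine 1.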

libdrizzle 0.3 Released

May 29th, 2009

I’m pleased to announce a new version of libdrizzle! This is mostly a bug fixing and maintenance release before I start in on more significant development. One of the new features I added was a hook to be able to use your own I/O event mechanism rather than the default poll(). This will allow you to use libraries like libevent, which can be useful when dealing with a large number of file descriptors, or to mix with other file descriptors in your application (for example, you could listen on other fd’s alongside the non-blocking Drizzle/MySQL socket connections). There are not many examples or much documentation for this feature yet, but for now you can email or find me in #drizzle on irc.freenode.net if you would like to know more.

One of the next steps with libdrizzle is a better protocol abstraction, since the Drizzle protocol is diverging quite a bit from how the MySQL protocol works. With these abstractions, it will also be possible to easily add other database-like protocols (where column/row/fields make sense). I’m also going to start looking into more memory optimizations and performance tuning.

Narada - A Scalable Open Source Search Engine

May 27th, 2009

I’ve been working with Patrick Galbraith for the past couple weeks on a new project that started as an example in his upcoming book. It is a search engine built using Gearman, Sphinx, Drizzle or MySQL, and memcached. Patrick wrote the first implementation in Perl to tie all these pieces together, but there is also a Java version underway being written by Trond Norbye and Eric Lambert that will be shown at the CommunityOne and JavaOne conferences next week. I’ve been helping get the system set up on a new cluster and with the port to Drizzle.

Narada provides interfaces that allow you to submit URLs to be indexed and crawled, and then to search those indexes and get a result set back. This allows you to index and search your own set of URLs, possibly for a single website or just for your own personal archive. The crawler in the back-end will be able to stop after some recursion limit from the original URL and also be able to apply URL filters (for example, only index pages under the domain “oddments.org”). Other filters and extensions should be easy to add. Narada is interesting because it is:

  • Open Source - You can modify it to fit your own needs, hopefully in a modular way so that changes can be contributed back to the project.
  • Easy to Scale - The system is built on a number of asynchronous queues, and the processes to perform that work can run on any number of machines. Increasing your capacity is now trivial, simply start up more machines with new workers.
  • Language Agnostic - While the first versions are in Perl and Java, it is easy to mix in other languages. For example, if a certain component was slow, we could rewrite it in C for better performance. The APIs to index and search can also be wrapped for any language since it will mostly just involve wrapping the Gearman client API. I’m thinking of hacking up a PHP API.

So, how does Narada work under the hood?


[Figure: Narada architecture diagram]

The blue boxes represent your front-end applications that use Narada, using the Gearman client API. The yellow boxes represent Gearman workers that perform one of the tasks in the chain. The orange boxes represent the storage mechanisms such as Drizzle, MySQL, Sphinx index, or memcached.

When a URL is submitted, it will first be queued in a Drizzle table for later processing. A Gearman job is started during the table INSERT to notify a Fetch Worker that a new URL is ready. Once a free Fetch Worker is available, it downloads the page and looks for more URLs to index. This is where recursion limits and filters are implemented. Next, it takes the resulting document and pushes it into memcached and notifies the Document Worker a new document is ready to be stored and indexed. The Document Worker then stores this inside of another Drizzle table and will start the Sphinx indexer if it hasn’t been run in a while. We don’t want to index on every URL since this would be wasteful and expensive. At this point the document is stored, indexed, and memcached is primed with the content.
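As a rough illustration of the recursion limit and URL filter step in the Fetch Worker, here is a small Python sketch. It is not Narada’s code (the first implementation is in Perl); the function name `should_fetch` and its parameters are hypothetical, and the default domain is the “oddments.org” example from earlier.

```python
from urllib.parse import urlsplit


def should_fetch(url, depth, max_depth=2, allowed_domain="oddments.org"):
    """Decide whether a discovered URL should be queued for fetching.

    depth is how many links away the URL is from the originally
    submitted URL (0 for the submitted URL itself).
    """
    # Recursion limit: stop once we are too far from the original URL.
    if depth > max_depth:
        return False
    # Domain filter: accept the domain itself and any of its subdomains.
    host = urlsplit(url).hostname or ""
    return host == allowed_domain or host.endswith("." + allowed_domain)
```

Other filters (file types, path prefixes, and so on) would slot in as additional checks of the same shape.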

When a search request comes in, the client will dispatch a search job to the Search Worker. This worker is responsible for performing the Sphinx search and gathering the necessary information from memcached or Drizzle so the client can return some meaningful results. In the future we will most likely be sharding the data and indexes, so the Search Worker will also be responsible for aggregating multiple shard searches into one set for the caller.

The code is still rough around the edges, but we’ve set it up on a couple clusters so far and it is working quite well. We’ll be actively working on it and refining the install process so it is easier to get it up and running.

Cache Line Sizes and Concurrency

May 27th, 2009

We’ve been looking at high concurrency level issues with Drizzle and MySQL. Jay pointed me to this article on the concurrency issues due to shared cache lines, and I decided to run some of my own tests. The results were dramatic, and anyone who is writing multi-threaded code needs to be aware of current CPU cache line sizes and how to optimize around them.

I ran my tests on two 16-core Intel machines, one with a 64 byte cache line, and one with 128 byte cache line. First off, how did I find these values?

one:~$ cat /proc/cpuinfo | grep cache_alignment
cache_alignment : 64
...

two:~$ cat /proc/cpuinfo | grep cache_alignment
cache_alignment : 128
...

You will see one line for each CPU. If you are not familiar with /proc/cpuinfo, take a closer look at the full output. It’s a nice quick reference of other things like L2 cache sizes and CPU speed. As you can see, machine one has a 64 byte cache line size, and machine two has a 128 byte cache line size. Next, I wrote the following C program to test concurrency:

cache_line.c

This program creates a global array of counter variables and runs a variable number of threads, where each thread increments its own 4-byte counter in the array. It does so at a number of array spacing levels to see the performance when counters fall on the same cache lines. With a spacing of 1 the memory is directly adjacent, and for each spacing level it skips that many counter variables in the global array. For example, if spacing is 4, the threads would use counter[0], counter[4], counter[8], and so on, which uses a chunk of memory every 16 bytes. The cache_line.c program outputs a CSV formatted table that you can use to generate some graphs. The second CSV output is the same set of tests without using the global array counters, and instead a local counter on the stack. This is meant to provide a baseline (since those will always be on their own cache line). The results were:

64 Byte Cache Line

128 Byte Cache Line

So what does this tell us? When spacing is one and all counter memory (16 threads * 4 bytes == 64 bytes) is entirely on one cache line, concurrency is poor. As we add more space between each counter variable, we start to see performance improve (faster runtime). This is because all thread counters are no longer on one cache line. On the 64 byte cache line machine, we see things really level off when spacing is 16. This is because each counter is now on its own cache line. On the 128 byte cache line machine, you can see it takes one more iteration of spacing because the cache line is twice as big.
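The arithmetic behind those spacing levels can be checked with a few lines of Python. This sketch only counts how many distinct cache lines the counters land on; it assumes the array starts on a cache line boundary and that the spacing doubles at each step, and `cache_lines_used` is a name made up for the example.

```python
def cache_lines_used(threads, spacing, line_bytes, counter_bytes=4):
    """Count the distinct cache lines touched by the per-thread counters.

    Thread i uses counter[i * spacing], which sits at byte offset
    i * spacing * counter_bytes from the (assumed aligned) array start.
    """
    lines = {
        (i * spacing * counter_bytes) // line_bytes
        for i in range(threads)
    }
    return len(lines)


for line_bytes in (64, 128):
    for spacing in (1, 2, 4, 8, 16, 32):
        used = cache_lines_used(16, spacing, line_bytes)
        print(f"{line_bytes}-byte line, spacing {spacing}: {used} cache line(s)")
```

With 16 threads, the count first reaches 16 (one line per counter) at spacing 16 for 64 byte lines and at spacing 32 for 128 byte lines, which matches where the runtimes level off.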

So what can we take from this? If you have any arrays or data structures that are accessed and updated independently from different threads, make sure they are on different cache lines. This may mean wasting a little space, but as you can see, the concurrency performance is well worth it.

Gearman and Drizzle at OSCON

May 26th, 2009

If you missed Gearman or Drizzle at the MySQL Conference, have no fear, a number of folks will be at OSCON too! There will be many opportunities to learn more or get involved with the two projects:

  • 3-Hour Gearman Tutorial - Learn about the latest developments and participate in hands-on demos to help build your own Gearman-powered applications.
  • 45-Minute Gearman Session - Get a more concise glimpse about what Gearman is and how to use it.
  • 45-Minute Drizzle Panel - Get an update and ask questions about the project and community.
  • Gearman and Drizzle Booths - We’ll have a booth for each project in the expo hall with various developers helping out. Come visit to either help out or learn some new things! The expo hall is free.
  • Gearman and Drizzle BoFs - Not official yet, but there will be at least one BoF for each project throughout the week. This is free as well! Right now there are no specific topics, just general discussion and hacking.

If you are looking for deals on OSCON passes, there are some discount codes on their twitter feed. Also, it’s still early registration until June 2nd, so now would be a good time to register. :)

Hope to see you there!