First, Elias has shored up the database by giving it more drives. 25, in fact. The biggest issue with database performance was that the small number of drives in the machine was being overworked as you were reporting your burning calories as cache logs. The drives were, in fact, running at 500%. In lieu of purchasing a SAN, which can cost hundreds of thousands of dollars, we went with a more pragmatic and affordable approach by emulating a SAN. This spreads the work across many more drives and reduces the load on each individual drive. Sounds expensive? Yep. But nowhere close to the SAN ballpark figure (which crept ever higher as we looked for bids). We will eventually need a SAN but we're not quite there... yet.
This happened a couple of weeks ago but there was no real indication of progress because the web servers were having an even bigger problem. We found that once a server passed 800 connections, the machine would go to 100% CPU and create a miserable experience for users. Although we added more machines, some of them would still hit that peak while others sat well below the threshold.
Our plan was to look at a couple of ways to address this. First, bring in a system that would help us balance the number of visitors to each machine (load balancing), but like a SAN this can be expensive. Second, add a bunch more web servers - around 8 total - also expensive. We couldn't do one without the other since our current load balancing, round robin DNS, can't really balance 8 machines like it can 2 or 3.
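To show why plain rotation falls short, here is a toy Python sketch. It is not our production setup and not a model of real DNS behavior: the function names, the four-server count, and the made-up workload (every fourth visitor opens ten connections, everyone else opens one) are all assumptions for illustration. It compares handing out servers in strict rotation, which is roughly what round robin DNS does, with sending each visitor to whichever server is least busy.

```python
from itertools import cycle


def round_robin(weights, n_servers):
    """Assign each visitor to the next server in rotation, ignoring load."""
    load = [0] * n_servers
    for server, weight in zip(cycle(range(n_servers)), weights):
        load[server] += weight
    return load


def least_loaded(weights, n_servers):
    """Assign each visitor to the server with the lowest load so far."""
    load = [0] * n_servers
    for weight in weights:
        target = load.index(min(load))  # ties go to the lowest-numbered server
        load[target] += weight
    return load


# Hypothetical workload: every fourth visitor opens 10 connections, the rest 1.
visitors = [10 if i % 4 == 0 else 1 for i in range(40)]

if __name__ == "__main__":
    print("round robin :", round_robin(visitors, 4))
    print("least loaded:", least_loaded(visitors, 4))
```

In this contrived case the rotation keeps landing the heavy visitors on the same machine, so one server carries most of the connections while the others idle. That is the same shape of problem we saw: some machines peaking while others were well below the threshold.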
The biggest reason for the CPU issue, however, was something that happened during our upgrade from 1.1 to 2.0. The problem was we really didn't know what had changed between the two versions to cause this issue. The upgrade should have been smooth. It had our developers puzzled, outside consultants entering therapy, and our original Microsoft representative being compelled to switch to Firefox (ok, not true, but I'm sure he was tempted). So whatever we did to improve performance we still needed to fix this code issue.
Fortunately our latest release on Tuesday fixed the Most Painful Bug in the History of Time. As it turns out, the nifty little code we use to transform UBB code into HTML was using Regular Expressions, and the libraries for Regular Expressions had changed, I would say drastically, between 1.1 and 2.0. Since we use them all over the site we were taking an additional hit every time they were called.
Joe, our newest developer and the lead developer on the current version of the site, was finally able to determine this after some extensive debugging work with Microsoft. The Tuesday release implemented these changes, and each web machine is now running at under 20% CPU with the same number of connections that put it at 100% on Monday.
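For anyone curious what a UBB-to-HTML transform built on Regular Expressions looks like, here is a minimal Python sketch. It is not the site's code and not the actual fix: it uses Python's `re` module rather than the .NET libraries, the function name `ubb_to_html` and the three tags it handles are assumptions chosen for illustration, and compiling the patterns once and reusing them is just one common way to avoid paying a setup cost on every call.

```python
import html
import re

# Patterns are compiled once at import time and reused on every call.
# The tag set here is illustrative, not the full UBB syntax.
_FLAGS = re.IGNORECASE | re.DOTALL
_RULES = [
    (re.compile(r"\[b\](.*?)\[/b\]", _FLAGS), r"<b>\1</b>"),
    (re.compile(r"\[i\](.*?)\[/i\]", _FLAGS), r"<i>\1</i>"),
    (
        re.compile(r"\[url=(https?://[^\]\s]+)\](.*?)\[/url\]", _FLAGS),
        r'<a href="\1">\2</a>',
    ),
]


def ubb_to_html(text):
    """Escape raw HTML first, then rewrite the supported UBB tags."""
    out = html.escape(text)
    for pattern, replacement in _RULES:
        out = pattern.sub(replacement, out)
    return out


if __name__ == "__main__":
    print(ubb_to_html("[b]Found it![/b] See [url=http://www.geocaching.com]the site[/url]."))
```

Escaping before substituting means any HTML typed into a log is shown as text rather than rendered, while the UBB tags still convert because square brackets are left untouched by the escaping step.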
It has been holding strong since the update so I think we have clear skies for at least a while. This will allow us to focus on fixing bugs and improving your experience on Geocaching.com even more so you can get outside and away from your desktop sooner.
This post has been edited by Jeremy: 07 December 2007 - 12:15 PM