I just read an outstandingly interesting blog post, about how Google and others optimize the performance of their servers – and on the importance of finding subtle bugs (some of which are so subtle they don’t even qualify for the term “bug”, but nevertheless are so influential as to define the performance of an entire complex system). It’s recommended reading for anyone interested in complex systems – and for those who just want an insight into how Google works. One of those (sadly rare) things you read that give a whole new perspective.
The post itself was written by Dan Luu, now of Microsoft: it heavily references a video talk by Dick Sites of Google (I share Dan’s reservation about looking at videos, and usually prefer to read papers – but this is an exception and highly recommended).
What struck me was how many of those problems UltraSoC’s debug IP would address, and how much better a job it would do. Our non-intrusive debug hardware would eliminate many of the issues they discuss, with no performance impact, no latency, vastly lower cost and better ability to detect problems. In short, a far better solution.
Server performance: a problem for today
In his blog Dan discusses a range of techniques for improving performance of servers – and by extension of complex systems in general. This touches, in interesting ways, on things like how Google search works, the differences between Gmail and Outlook in performance, video issues with Microsoft Vista and many others – and just how staggeringly, insanely hard these are to solve.
Take, for example, his illustration of how a single Google search query works: it typically fans out across perhaps 40 racks of 60 servers each, across multiple systems to perhaps 2000 leaf machines, each of which spends 1-2ms actually processing before passing its answer up. Then comes the detective work of finding out what was going wrong with different parts of that process.
Now as a company that makes debugging tools, remarks like: “But debugging tools haven’t kept up” ring a very loud bell. It won’t surprise you to know that UltraSoC does have capabilities that could make this situation very much better and that our IP would make the problems Dan outlines dramatically easier to solve, with far better capabilities at (literally) a fraction of the cost.
Our IP is applicable to a wide variety of systems, from low end to high end. But Dan’s post and Dick’s talk are specifically about data centers: moving lots of data, with thousands of processors and a very distinct environment with some significant, unique issues.
Why current performance analysis techniques won’t do
Dan discusses the tools and techniques that people use to understand and improve the performance of servers, complex web systems and the like. From a software perspective the primary one is a sampling profiler: once every N ticks the main software stops and a diagnostic routine runs, polling the performance counters and gathering statistics, before outputting them and returning to the main task.
This is very versatile and has many attractions, especially for a software author. But it does have a few problems.
The first is that the “stop, run new process, return” cycle has an overhead. That impacts performance or, equivalently, cost. That might be 1%, and maybe that is acceptable – but some tools or some analytics are far higher: Dick discusses tcpdump needing 7% of CPU resource (“if I asked to have that running and a 7% overhead it is a real short conversation: ‘No!’”). Of course, that performance hit equates to a cost increase: even at 1% overhead you will need to add 1% more servers to compensate, a not insignificant amount of money given how much Google and others spend on CapEx.
Now, the performance impact of UltraSoC IP is precisely zero percent because it is non-intrusive hardware. However, there is an impact on die area: less than one percent. Since the die area is only part of CPU cost, and CPU cost is a small fraction of the cost of a data center, which is dominated by power, the actual cost impact is proportionately less – perhaps one twentieth of the total cost of the software alternative.
(Here’s a back-of-the-envelope calculation:
Adding UltraSoC increases the die area by less than 1%, but keeping round numbers say 1%.
Die area is roughly half the cost of a processor – package, test, IP, other yield loss etc are the rest.
Let’s say CPU is 50% of CapEx for a server (probably less but round number)
CapEx is roughly 20% of the total cost (TCO) of a data centre, with OpEx – mostly power – being the rest.
So UltraSoC increases cost by 1% * 0.5 * 0.5 * 20% = 0.05% of the total cost, compared to the 1% software overhead. The cost advantage would be even greater for tools like tcpdump.)
So, using a software technique could increase cost by 1% best case – and a hardware technique costs a small fraction of that – about twenty times less. And if you think of tcpdump then UltraSoC could give that data for less than one hundredth of the cost.
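The arithmetic above can be checked with a few lines of Python. This is only a sketch of the same back-of-the-envelope estimate: the variable names are mine, the inputs are the round numbers quoted above, and (as in the text) a 1% software overhead is treated as a 1% increase in total cost.

```python
# Back-of-the-envelope cost comparison, using the round numbers from the text.
die_area_increase = 0.01    # UltraSoC adds less than 1% die area; 1% as a round number
die_share_of_cpu = 0.5      # die area is roughly half the cost of a processor
cpu_share_of_capex = 0.5    # CPU assumed to be 50% of server CapEx
capex_share_of_tco = 0.2    # CapEx is roughly 20% of data centre TCO

hardware_cost = (die_area_increase * die_share_of_cpu
                 * cpu_share_of_capex * capex_share_of_tco)

profiler_overhead = 0.01    # 1% sampling-profiler overhead (best case)
tcpdump_overhead = 0.07     # 7% overhead quoted for tcpdump

print(f"hardware cost increase: {hardware_cost:.2%}")
print(f"1% profiler costs {profiler_overhead / hardware_cost:.0f}x more")
print(f"7% tcpdump costs {tcpdump_overhead / hardware_cost:.0f}x more")
```

With these inputs the hardware approach comes out at 0.05% of total cost, a factor of about 20 below the 1% profiler and about 140 below tcpdump.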
Latency and long-tail bug profiles
But there is another aspect. That 1% overhead not only impacts performance, it also adds latency and changes timing. Adding a delay changes performance, often in subtle ways which can lead to the infamous ‘Heisenbugs’ – bugs that appear (or disappear) with the presence of debugging code.
Finally, there is another advantage to hardware performance measurement and debugging, and that is granularity and the ability to detect subtle or rare problems.
One of the most important topics is the “long tail” performance profile. It is relatively easy to analyze and optimize average / typical performance; but in implementations such as warehouse scale computing, the statistics say that many (if not most) routes through the system will be affected somewhere along the way by an “outlier” event. In systems like Google search where many things happen in parallel, it is guaranteed that the longest response time will gate the entire response.
So even though only one process in a hundred is impacted, actually most tasks end up with a 99%-ile worst-case performance.
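To see why, here is a minimal sketch of the statistics. It assumes each leaf request is independently slow 1% of the time, which is a simplification of mine and not something the talk states; the function name is also just for illustration.

```python
def fraction_hitting_tail(fan_out, p_slow=0.01):
    """Chance that at least one of `fan_out` parallel requests is slow,
    assuming each is independently slow with probability `p_slow`."""
    return 1.0 - (1.0 - p_slow) ** fan_out

# Fan-outs chosen for illustration; 2000 matches the leaf count quoted earlier.
for n in (1, 10, 100, 2000):
    print(f"fan-out {n:>4}: {fraction_hitting_tail(n):.1%} of queries wait on a slow leaf")
```

Under that assumption, a fan-out of just 100 already means roughly 63% of queries are gated by a 99th-percentile response, and at a fan-out of 2000 it is effectively all of them.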
Long tail performance translates to dollars
Controlling long tail issues is really important and really worth a lot of money.
As Dan says: “Latency translates directly into money. It’s now well established that adding user latency reduces ad clicks, reduces the odds that a user will complete a transaction and buy something, reduces the odds that a user will come back later and become a repeat customer, etc. Over the past ten to fifteen years, the understanding that tail latency is an important factor in determining user latency, and that user latency translates directly to money, has trickled out from large companies like Google into the general consciousness. But debugging tools haven’t kept up”.
I love that last sentence.
Dick’s video presentation gives an example with disk access. This bug meant one disk read in one hundred was very slow: the cumulative effect was that 25% of Google’s entire disk fleet ran significantly slower than it ought for three years – but no-one knew. Fixing that one problem, which took performance analysis, a lot of head scratching and a lot of very subtle thinking for several weeks, dramatically improved the performance of that 25% – literally millions of servers. That paid Dick’s salary for more than ten years: not a bad return on investment.
So, long tail latency is worth a lot of money. Finding those rare bugs is important. But existing tools do not do a good enough job.
Sampling profilers: server bugs get lost in the noise
Hardware engineers know this: we tend to worry about “slow path” and run tests like “five corners”. Embedded systems worry about tail latency and histograms of it. Indeed, at UltraSoC we have a case study of a communications system where our IP identifies just this. In those applications, those “once in a while” incidents are painful, and usually cause a crash.
But in the server context, as Dick describes, they tend to “get lost in the noise”. And that is because the tools used are almost designed to lose that information.
The common debugging tools are sampling profilers, like perf or SHIM. Sampling profilers typically run at something like 1kHz (1ms between samples), which gives a great overview of what is happening on average – but is guaranteed to lose information in the sampling gaps. They aggregate events into averages – but long tail problems are, by definition, not averages. Some sampling profilers (eg SHIM) can go as fast as 10MHz, but then the performance impact is probably unacceptable – and many events will still slip through the gaps.
To quote Dan again: “there are large classes of performance problems sampling profilers can’t debug effectively, and those problems are becoming more important”.
In other words, we need a tool that has time-granularity that’s much shorter than the duration of the thing we’re debugging. Otherwise we will never see it.
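A rough way to quantify this: if a sampler fires at a fixed period and an event starts at a random point in that period, the chance that any sample lands inside the event is the event duration divided by the sample period (capped at 1). The sketch below uses a hypothetical 10 microsecond event purely for illustration; the sampling rates are the ones mentioned above.

```python
def chance_of_catching(event_duration_s, sample_period_s):
    """Probability that at least one periodic sample falls inside an event
    whose start time is uniformly random relative to the sample clock."""
    return min(1.0, event_duration_s / sample_period_s)

event = 10e-6  # hypothetical 10 microsecond stall, chosen for illustration

for label, rate_hz in (("1 kHz profiler", 1e3),
                       ("10 MHz profiler", 10e6),
                       ("3 GHz clock-rate monitor", 3e9)):
    p = chance_of_catching(event, 1.0 / rate_hz)
    print(f"{label}: {p:.0%} chance of seeing the event")
```

At 1kHz the sampler sees such an event only about 1% of the time, and even then it is folded into an average. Only when the sample period is shorter than the event does detection become certain.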
Fortunately, there are ways to address this problem. By embedding “smart hardware” into a processor UltraSoC provides data that is far more useful, with a million times faster sampling rate and with powerful embedded intelligence, in a non-intrusive way, at one twentieth of the cost of traditional techniques.
In a processor with embedded UltraSoC IP, the sampling rate is the system clock rate: 3GHz or whatever. That could mean an incredibly high data rate; but our embedded analytics hardware means that only “interesting” things need be reported, so rather than “losing signal in the noise” we only export the signals of interest.
Over the years, it has become commonplace to use hardware to accelerate functions that had previously been realized in software: floating point math, graphics processing and increasingly nowadays graph processing and neural networks.
Maybe those building processors for data centers and servers should look at that trend, and migrate the current software-based sampling profilers, with all their inadequacies and high overhead, to using non-intrusive, wire-speed embedded hardware to support this performance analysis.
You can read Dan Luu’s blog here
… and view the Dick Sites video here