Proving the Great Beast concept
<p>Wendell Wilson over at TekSyndicate had a good idea &#8211; run the NetBSD repo conversion on a machine roughly comparable to the Great Beast design. The objective was (a) to find out if it freakin&#8217; <em>worked</em>, and (b) to get a handle on expected conversion time and maximum working set for a <em>really large</em> conversion.</p>
<p>The news is pretty happy on all fronts.</p>
<p><span id="more-6466"></span></p>
<p>The test was run on a dual Xeon 2660v3 box &#8211; 10 cores per socket, 256GB of memory, and a 3.0GHz clock &#8211; with conventional HDs. For benchmarking purposes this is less unlike the Great Beast design than it might appear, because the central algorithms in cvs-fast-export can&#8217;t use all those extra cores very effectively.</p>
<p>The conversion took 6 hours and 18 minutes of wall time. The maximum working set was about 19.2GB. To put this in perspective, the repo is about 11GB in size and contains upwards of 286,000 CVS commits. The resulting git import stream dump is 37GB long.</p>
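<p>For readers who haven&#8217;t watched one of these runs, the sketch below shows the general shape of the pipeline that produces that import stream: cvs-fast-export reads a list of CVS <em>,v</em> master files on standard input and writes a git fast-import stream on standard output. The paths here are hypothetical and the exact options used for the NetBSD run aren&#8217;t recorded in this post &#8211; treat it as an illustration, not the benchmark invocation.</p>
<pre>
#!/usr/bin/env python3
# Illustrative sketch of a CVS-to-git conversion pipeline.
# Paths are hypothetical; conversion options are omitted.
import subprocess

CVS_TREE = "/work/netbsd-cvs"   # assumed copy of the CVS repository
GIT_REPO = "/work/netbsd-git"   # destination git repository

# Create the destination repository.
subprocess.run(["git", "init", GIT_REPO], check=True)

# find lists the RCS master files (*,v) that cvs-fast-export reads on stdin.
find = subprocess.Popen(
    ["find", CVS_TREE, "-name", "*,v", "-print"],
    stdout=subprocess.PIPE,
)

# cvs-fast-export analyzes the CVS history and emits a git fast-import
# stream on stdout; that stream is the 37GB "dump" mentioned above.
export = subprocess.Popen(
    ["cvs-fast-export"],
    stdin=find.stdout, stdout=subprocess.PIPE,
)
find.stdout.close()

# git fast-import consumes the stream and builds the converted repository.
load = subprocess.Popen(
    ["git", "fast-import"],
    stdin=export.stdout, cwd=GIT_REPO,
)
export.stdout.close()

load.wait()
export.wait()
find.wait()
</pre>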
<p>The longest single phase was the branch merge, which took about 4.4 hours. I&#8217;ve never seen computation dominate I/O time (just shy of two hours) before &#8211; and don&#8217;t think I will again except on <em>exceptionally huge</em> repositories.</p>
<p>A major reason I needed to know this is to get a handle on whether I should push micro-optimizations any further. According to profiling, the best gains I can get that way are about 1.5% &#8211; which would have been about 5 minutes.</p>
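<p>The arithmetic behind that estimate, for concreteness (the 1.5% is applied to the full wall time here, which is the generous assumption; applied only to the compute phases it would be a bit less):</p>
<pre>
# Back-of-envelope ceiling on micro-optimization gains.
wall_minutes = 6 * 60 + 18        # 378 minutes of measured wall time
best_gain = 0.015                 # ~1.5% ceiling reported by profiling
print(wall_minutes * best_gain)   # about 5.7 minutes saved, at best
</pre>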
<p>This means software speed tuning is <em>done</em>, unless I find an algorithmic improvement with much higher percentage gains.</p>
<p>There&#8217;s also no longer much point in trying to reduce the maximum working set. Knowing that conversions of this size will fit comfortably into 32GB of RAM is good enough!</p>
<p>We have just learned two important things. One: for repos this size, a machine in the Great Beast class is sufficient. For various reasons, including the difficulty of finding processors with dramatically better single-thread performance than a 3GHz 2660v3, these numbers tell us that five hours is about as fast as we can expect a conversion this size to get. The I/O time could be cut to nearly nothing by doing all the work on an SSD, but that 4.4 hours will be much more difficult to reduce.</p>
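<p>The time budget behind that estimate, sketched in the same back-of-envelope style (the I/O figure is an approximation of &#8220;just shy of two hours&#8221;):</p>
<pre>
# Rough floor on conversion time with I/O driven to near zero by an SSD.
wall_minutes = 6 * 60 + 18    # 378 minutes of measured wall time
io_minutes = 115              # approximation of "just shy of two hours"
compute_floor = wall_minutes - io_minutes
print(compute_floor / 60)     # about 4.4 hours, i.e. essentially the branch merge
</pre>
<p>Round that up a bit for overhead that won&#8217;t disappear and you land at the five-hour figure.</p>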
<p>The other thing is that a machine in the Great Beast class is <em>necessary</em>. The 19.2GB maximum working set is large enough to leave me in no doubt that memory access time dominated computation time &#8211; which means yes, you actually do want a design seriously optimized for RAM speed and cache width.</p>
<p>This seems to prove the Great Beast&#8217;s design concept about as thoroughly as one could ask for. Which means (a) all that money people dropped on me is going to get used in a way well matched to the problem, neither underinvesting nor overinvesting, and (b) I just won big at whole-systems engineering. This makes me a happy Eric.</p>
<p>UPDATE: More test results. Wendell reports that the same conversion run on a slightly different Beast-class box, with a 400MHz slower clock but 5MB more CPU cache, turned in very similar times &#8211; including, most notably, in the compute-intensive merge phase. This confirms that more cache compensates for less clock on this load, as expected.</p>