Carlos E. Perez

Building a 270* Teraflops Deep Learning Box for Under $10,000

Look what I got for Christmas!! If you don’t recognize it, it’s two Titan V cards from Nvidia. A single Titan V has a systolic array unit, dubbed a Tensor Core, that is capable of 110 teraflops peak performance. In addition, it includes a conventional GPU that is capable of 25 teraflops at half precision. So we are talking roughly 135 teraflops (half precision) per card, for a grand total of 270 teraflops in a box with these two cards inserted. We don’t even have to count the relatively minuscule extra flops that a multi-core CPU provides. (* Editor’s note: I will have to update these theoretical numbers when I dig up more details.)

Each Titan V costs $3,000 plus taxes. That leaves around $3,700 of wiggle room to come up with a decent box to host these cards. I recently built a 50 teraflops box for under $3,000, which comes out to 16.6 gigaflops per dollar. This new box should give you a mind-boggling 27 gigaflops per dollar. Just for comparison, a late-model Intel i7 8700K cranks out 217 gigaflops. An i7 8700K costs $400, so the math comes out to 0.54 gigaflops per dollar. Granted, the numbers here are theoretical rather than empirical, but it is still a massive difference!! (BTW, I could have placed these Titans in the same box as my 50 teraflops build, which would cost around $7,300. That equates to roughly 36 gigaflops per dollar.)
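The back-of-the-envelope math above can be sketched in a few lines (a quick sanity check only; all figures are the theoretical peak numbers quoted in this post):

```python
# Flops-per-dollar arithmetic using the peak numbers quoted above.
TENSOR_CORE_TFLOPS = 110    # Titan V Tensor Core peak (half precision)
CUDA_CORE_TFLOPS = 25       # conventional GPU half-precision peak
CARDS = 2
BOX_BUDGET = 10_000         # total build cost in dollars

per_card_tflops = TENSOR_CORE_TFLOPS + CUDA_CORE_TFLOPS  # 135
total_tflops = per_card_tflops * CARDS                   # 270
gflops_per_dollar = total_tflops * 1_000 / BOX_BUDGET    # 27.0

# Compare with a late-model CPU: an i7 8700K at 217 gigaflops for $400.
cpu_gflops_per_dollar = 217 / 400                        # ~0.54
advantage = gflops_per_dollar / cpu_gflops_per_dollar    # ~50x

print(f"{total_tflops} TF, {gflops_per_dollar} GF/$, {advantage:.0f}x over CPU")
```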

In June of 2007, IBM’s Blue Gene/P was installed at Argonne National Laboratory, capable of 445 teraflops (double precision). Two years earlier, Blue Gene/L was the fastest supercomputer in the world at 280 teraflops. Back in 2007, IBM was charging $1.3 million per rack. Twenty of these racks get you to around 280 teraflops (and set you back $26 million). You might be saying: “well hold on now, you are comparing double precision with half precision, which isn’t fair”. Honestly, I don’t care, because Deep Learning workloads really don’t care much for higher precision.
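For the record, the rack arithmetic works out as follows (a rough sketch; the per-rack teraflops figure is simply the quoted total divided by the rack count):

```python
# Blue Gene rack math from the figures above (all circa-2007 numbers).
RACK_COST = 1_300_000     # dollars per rack, as quoted
RACKS = 20
TOTAL_TFLOPS = 280        # roughly what 20 racks buy you (double precision)

total_cost = RACK_COST * RACKS          # $26,000,000
tflops_per_rack = TOTAL_TFLOPS / RACKS  # 14 teraflops per rack

# The desktop build in this post: 270 TF (half precision) for $10,000.
cost_ratio = total_cost // 10_000       # 2,600x cheaper
print(total_cost, tflops_per_rack, cost_ratio)
```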

The Blue Gene/P monster of a machine looked like this back in 2007 (Just 10 years ago):

Now imagine all this computational horsepower sitting quietly (water cooled) underneath your desk. All in a single box and costing 2,600 times less (i.e. $26,000,000 versus $10,000). That doesn’t even factor in the cost of power. There’s no need for a battalion of folks to install and maintain this monstrosity. There’s no need to wear a shirt, tie, slacks and shoes to work on it! Think about how potentially world-dominating this can be ;-).

Two architectural developments got you this massive increase in flops in a very short time: (1) the use of fp16, which requires less silicon than comparable fp32 or fp64 multiply-add accumulators, and (2) systolic arrays, which get you 110 teraflops from roughly the same amount of silicon that previously got you around 10 teraflops. In 2016 you could get less than 10 teraflops per GPU chip; fast-forward to 2017 and it’s a quantum leap to 135 teraflops with a V100 GPU. I don’t expect 2018 to yield this kind of leap in capability. The low-hanging fruit has already been picked and now only needs to be exploited by software.
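A minimal NumPy sketch of point (1): fp16 values take half the storage of fp32, which is part of why half precision buys so many more flops per unit of silicon and memory bandwidth; the price is precision, which deep learning tolerates well. (This is an illustrative sketch, not how Tensor Cores are actually programmed.)

```python
import numpy as np

# fp16 uses 2 bytes per value versus 4 for fp32: half the storage and
# memory traffic for the same tensor.
a32 = np.random.rand(1024, 1024).astype(np.float32)
a16 = a32.astype(np.float16)
print(a32.nbytes // a16.nbytes)  # 2

# The trade-off: fp16 carries roughly 3 decimal digits of precision
# versus roughly 7 for fp32, so small increments simply vanish.
lost = np.float16(1.0) + np.float16(1e-4)
print(lost == np.float16(1.0))  # True: the tiny addend is rounded away

# Tensor Cores mitigate this by multiplying fp16 inputs but accumulating
# in fp32; a framework-level mixed-precision setup mirrors that idea:
product = a16 @ a16                                      # pure fp16
accumulated = a16.astype(np.float32) @ a16.astype(np.float32)
print(product.dtype, accumulated.dtype)  # float16 float32
```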

The next big leap may well be the kind of architecture GraphCore is touting. Here are some intriguing benchmarks from GraphCore. If I were to gaze into my crystal ball, Google is going to stun the world again with a new kind of architecture in silicon. Better Deep Learning algorithms are feeding back into more capable silicon. This is what Elon Musk has called “double exponential growth”. Deep Learning progress is moving at break-neck speed!

This kind of comparison in terms of size and cost gives a visceral feel for the kind of changes that are coming. How many businesses are still running their operations the same way they were 10 years ago? This kind of exponential change in compute capability has got to mean a massive change in how we run our everyday operations. 99.99% of the people out there likely don’t realize what’s happening! When I mention “Deep Learning” to people, most folks’ eyes glaze over. Don’t even mention the term “Intuition Machine”; it sounds like an oxymoron.

I’m waiting for a water-cooled AMD Threadripper box custom built by a reputable vendor of workstation-class desktops. This will give me the opportunity to kick the tires on this kind of intuition machine!

Here’s the Threadripper box that’s been put together prior to shipment:

I will report progress as more parts and software come in… stay tuned!

You might be wondering, can I mine cryptocurrency with this? Yes, of course, that’s when Intuition Fabric comes out!