Machine learning benchmarks: Hardware providers (part 1)

Shiva Manne | Machine Learning, Open Source, Student Incubator

The rise of machine learning as a discipline brings new demands for number crunching and computing power. With easily accessible and cheap hardware resources, one has to pick the right platform to run the experiments and model training on.

Should you use Amazon’s AWS EC2 instances? Or go with IBM’s Softlayer, Google’s Compute Engine, Microsoft’s Azure? How about a real dedicated server, such as the ones offered by Hetzner?

With no clear benchmark addressing the practical aspects of choosing a hardware and software platform for ML, individuals and small companies tend to go with the popular choice. This is often suboptimal.

This is part one of a blog post series in which our Incubator student Shiva Manne compares machine learning frameworks and hardware platforms on practical aspects such as ease of use, cost, stability, scalability and performance. We gratefully acknowledge support from AI Grant and the IBM Global Entrepreneurship program, which made these (sometimes costly) experiments possible.

graph of cloud hardware benchmark in USD

Figure 1: The cost of training a model on different hardware providers. The same benchmark task is run from a Docker container on 3 kinds of instances on each platform, which we call budget, medium and high-end, with the total cost in US dollars recorded.

Benchmark setup

Hardware platforms: the contestants

Relevant metrics are collected from benchmark runs on Hetzner’s dedicated servers, Amazon’s AWS, Google’s Compute Engine and IBM’s Softlayer. These platforms were chosen on the basis of their popularity. We consider 3 kinds of machine instances: budget, medium & high-end (see table 1). This categorisation is based on each machine’s cost ($/hr) which, for this CPU-intensive benchmark, depends primarily on the total number of CPU cores the machine contains.

For benchmarking Spark MLlib, we run the model on Google’s Cloud Dataproc. It is a fast, easy-to-use and fully managed cloud service for running Apache Spark and Apache Hadoop clusters.

The benchmark task: why Word2Vec?

The choice of a benchmarking algorithm is hard, bound to create controversy and nitpicking no matter what. We decided to go with Word2Vec, because it combines several pleasant properties:

  1. It is a practical algorithm, with countless exciting applications and a wide range of usage in both industry and academia. No more “counting words”.
  2. Word2vec is a computation-heavy algorithm: most time is spent in number crunching, leading to a hardcore high performance computing profile. This is great for benchmarking.
  3. Word2vec is also easy to parallelize, which means we can test the scaling properties of both hardware and software.
  4. Word2vec has a stellar reference implementation by the original authors (you cannot really diss something signed off by Jeff Dean!). This makes sanity and accuracy checking much easier – there’s no argument about what it should do.
  5. It’s been around for more than 4 years and has become a cornerstone in ML practice. Machine learning frameworks have no excuse for not shipping a sane implementation for training word embeddings like Word2Vec.

In short, we use Word2Vec as a gentle proxy for measuring the quality of ML frameworks, to gain insights about their performance, robustness and scalability. At the same time, we also use it to compare the bang-for-buck ROI of different hardware providers.

Evaluation of the cloud platforms


| (hours / GB RAM) | GCE, 4 cores | GCE, 16 cores | GCE, 32 cores | AWS, 4 cores | AWS, 16 cores | AWS, 36 cores | Softlayer, 4 cores | Softlayer, 16 cores | Softlayer, 40 cores | Hetzner, 8 hyperthreads |
|---|---|---|---|---|---|---|---|---|---|---|
| (cloned repo in Aug 2017) | 3.8 / 1.0 | 1.0 / 1.0 | 0.5 / 1.1 | 3.9 / 1.0 | 0.8 / 1.0 | 0.6 / 1.1 | 4.0 / 1.0 | 1.2 / 1.0 | 0.6 / 1.1 | 1.5 / 1.0 |
| (v 2.1.0) | 4.6 / 1.1 | 1.7 / 1.2 | 1.4 / 1.2 | 4.8 / 1.1 | 1.5 / 1.2 | 1.7 / 1.2 | 4.5 / 1.1 | 1.8 / 1.1 | 1.6 / 1.2 | 2.0 / 1.1 |
| (v 0.7.1) | 13.9 / 7.2 | 5.7 / 7.0 | 4.8 / 10.2 | 16.9 / 4.7 | 5.0 / 6.8 | 9.0 / 12.2 | 15.9 / 7.7 | 10.6 / 6.7 | 10.0 / 18.4 | 5.4 / 11.2 |
| (v 1.3.0) | 36.2 / 14.2 | 35.3 / 14.2 | 38.1 / 15.9 | 34.7 / 14.1 | 30.9 / 14.2 | 52.9 / 14.2 | 30.1 / 14.2 | 42.0 / 14.2 | 29.9 / 14.2 | 14.4 / 14.2 |

Table 1: Summary of the benchmark results. Each cell indicates the training time (in hours) and peak memory (in GB). More on the software frameworks, including results for the latest DL4J version, in the next post.
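For readers who want to collect the same two numbers (wall-clock hours and peak memory in GB) for their own runs, here is a minimal sketch. It is not the measurement code used for this benchmark: the `measure` helper is a hypothetical name, and it assumes a Unix-like host and a training routine that runs inside the current Python process.

```python
import resource  # Unix-only standard library module
import sys
import time


def measure(task, *args, **kwargs):
    """Run task(*args, **kwargs) and report wall-clock hours and peak RSS in GB.

    Assumption: the work happens in this process. Memory used by child
    processes is not included in RUSAGE_SELF.
    """
    start = time.perf_counter()
    result = task(*args, **kwargs)
    elapsed_hours = (time.perf_counter() - start) / 3600.0

    max_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    bytes_per_unit = 1 if sys.platform == "darwin" else 1024
    peak_gb = max_rss * bytes_per_unit / 1024 ** 3
    return result, elapsed_hours, peak_gb


if __name__ == "__main__":
    # Stand-in workload; a real run would call the training function here.
    _, hours, peak_gb = measure(sum, range(1_000_000))
    print(f"{hours:.6f} h / {peak_gb:.3f} GB")
```

Note that the peak reported this way covers the whole life of the process so far, so each benchmark run is best measured in a fresh process.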

A note on repeatability

Repeatability is clearly critical for any benchmark and we like to keep things above board. For this reason, we’re releasing all datasets, code and details, so you can review, repeat or even extend these experiments:

  • Dataset: The word vectors are trained on the English Wikipedia corpus dump dated 01-05-2017 (14GB of xml.bz2; download here). This raw dump format is pre-processed using this code to yield a plain text file containing one article per line, with punctuation removed and all words converted to lowercase. The first 2 million lines of this pre-processed wiki corpus (download here; 2GB of gzipped plain text, 8GB unzipped) are used to train Word2Vec on each framework and platform.
  • Docker: The benchmarks are run inside a Docker container to ensure a standardized environment while training the models. To profit from SSE/AVX/FMA optimizations, this Docker container is built by compiling Tensorflow from source. Each benchmark instance is trained for 4 epochs using the skipgram model with negative sampling and the exact same parameter values. I have created a dedicated repository for this benchmarking project, containing all the resources required to run/repeat the benchmark.
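The pre-processing described above (one article per line, punctuation removed, everything lowercased, first 2 million lines kept) can be sketched roughly as follows. This is not the linked pre-processing code, only an illustration: `normalize_article` and `take_first_lines` are hypothetical names, and the regular expression is just one possible definition of "punctuation".

```python
import itertools
import re

# Assumption: "punctuation" means anything that is neither a word character
# nor whitespace. Underscores count as word characters and are kept.
_PUNCTUATION = re.compile(r"[^\w\s]", re.UNICODE)


def normalize_article(text):
    """Lowercase an article, drop punctuation and squeeze it onto one line."""
    without_punctuation = _PUNCTUATION.sub(" ", text.lower())
    return " ".join(without_punctuation.split())


def take_first_lines(lines, limit):
    """Return at most `limit` items from any iterable of lines."""
    return list(itertools.islice(lines, limit))


if __name__ == "__main__":
    articles = ["Hello, World!\nSecond line.", "Word2Vec: a gentle proxy."]
    corpus = (normalize_article(article) for article in articles)
    for line in take_first_lines(corpus, 2_000_000):
        print(line)
```

Because `take_first_lines` works on any iterable, the same function can be pointed at an open file handle to truncate a large corpus without reading it fully into memory first.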

Ease of setup & use, stability

We found IBM’s Softlayer to be highly “new-user” friendly, which greatly expedited the process of configuring and ordering an instance. The experience was similar while ordering a dedicated server from Hetzner’s portal.

This wasn’t as easy with Google’s product – with GCE, I had to spend around 15 minutes pondering over the numerous mandatory fields (like private keys, firewall rules, subnets etc.). An unsatisfactory customer support experience only compounds their problems.

AWS fits somewhere in between the two cloud platforms in terms of ease of setting up an instance.

With regard to stability, I had no issues with AWS, Softlayer or Hetzner. However, when running a long task (the benchmark with Tensorflow-GPU) on GCE, the job abruptly stopped after a few days without showing any errors, and I couldn’t figure out why. A repeated attempt resulted in the same failure.

Another difference between these platforms is the time required to provision an instance after placing an order. Softlayer lags in this regard, with provisioning sometimes taking up to 20 minutes. AWS and GCE take at most a couple of minutes. Like most dedicated server providers, Hetzner takes anywhere from an hour to a day to set up and get the machine up and running.

Hardware cost

graph of cost of framework-cloud hardware in USD

Figure 2: Cost of training different Word2Vec implementations across the cloud platforms with different kinds of instances (budget, medium, high-end).

Dedicated servers come out far ahead of virtual machines when comparing prorated costs (even though they are charged on a monthly basis and have no hourly option). Among the cloud platforms offering virtual instances, GCE is one of the least expensive. Compared to Amazon’s complex pricing, GCE’s pricing is lucid and straightforward, with no hidden costs. GCE also keeps expenses in check better than the others, with no upfront costs, per-second billing and automatic sustained use discounts.
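To make "prorated" concrete, here is a small sketch of the arithmetic. The prices are hypothetical placeholders, not the rates we paid, and the 30-day month is an assumption. The function names are illustrative only.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day billing month


def prorated_hourly_rate(monthly_price_usd):
    """Spread a flat monthly fee evenly over every hour of the month."""
    return monthly_price_usd / HOURS_PER_MONTH


def run_cost(training_hours, hourly_rate_usd):
    """Cost of one training run at a given hourly rate."""
    return training_hours * hourly_rate_usd


# Hypothetical prices, for illustration only (not taken from the benchmark).
dedicated_monthly_usd = 72.0
cloud_hourly_usd = 0.40

dedicated_hourly_usd = prorated_hourly_rate(dedicated_monthly_usd)
print(f"prorated dedicated rate: ${dedicated_hourly_usd:.2f}/hr")
print(f"1.5 h run, prorated:     ${run_cost(1.5, dedicated_hourly_usd):.2f}")
print(f"1.5 h run, cloud:        ${run_cost(1.5, cloud_hourly_usd):.2f}")
# With monthly billing, a single one-off run still costs the full month.
print(f"1.5 h run, one-off:      ${dedicated_monthly_usd:.2f}")
```

The last line is the catch: the prorated figure only reflects what you really pay if the machine is kept busy for most of the month.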

Amazon has a fairly low approval rating concerning its pricing/billing, a sentiment evinced in a large number of benchmarks and blogs. Looking at figure 1, however, this doesn’t seem entirely accurate: AWS comes across as reasonably cheap in the budget-medium range and, unexpectedly, even beats GCE on the budget instances. Comparing the high-end instances, though, reveals a huge cost difference between AWS and the other platforms in this segment. The price gap between Softlayer and AWS is not enormous, but the cost of training on AWS eclipses all the other platforms, telling us that the high prices charged by AWS for premium instances are not backed by the expected performance speedups, at least for this kind of intense number-crunching task.

Softlayer outperforms AWS in this regard, where its higher end (pricier) instances correspond to performance improvements as expected. Softlayer is also the only platform that allows the provisioning of bare metal servers amongst the 3 cloud service providers, which might be useful under security or compliance constraints.

Softlayer & GCE allow you to custom configure instances, which aids in keeping the costs low while AWS customers are required to choose from a predefined set of instances and then over-provision resources (CPUs & RAM) according to their needs. Another winning/distinguishing point for GCE is the granularity of time that it uses to bill customers, in per-second increments. Starting this October, AWS has also adopted per-second billing while Softlayer continues to charge on an hourly basis. Additionally, we had a serious issue with Softlayer where the price would not update properly for weeks, showing an outdated monthly cost that caused us to overshoot our monthly budget. Very unpleasant, not good.
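Billing granularity matters most for short jobs. The sketch below rounds a run up to the next billing increment; the hourly rate is a hypothetical placeholder and the helper ignores any minimum charge a provider may apply.

```python
import math


def billed_cost(runtime_seconds, hourly_rate_usd, increment_seconds):
    """Cost of a run when usage is rounded up to whole billing increments.

    Assumption: no minimum charge beyond a single increment.
    """
    billed_increments = math.ceil(runtime_seconds / increment_seconds)
    billed_seconds = billed_increments * increment_seconds
    return billed_seconds * hourly_rate_usd / 3600.0


# A 0.6 hour run (2160 seconds) at a hypothetical $1.00/hr.
runtime = 2160
print(f"per-second billing: ${billed_cost(runtime, 1.00, 1):.2f}")
print(f"hourly billing:     ${billed_cost(runtime, 1.00, 3600):.2f}")
```

For runs that last many hours the rounding difference shrinks to at most one hour's worth of cost, so this mostly affects quick experiments and short-lived instances.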

It becomes apparent from the above plot that, barring original C, all the software frameworks cost more to train on hardware instances with better configs: the expected decrease in training time (owing to more CPU cores) does not justify the price of these extra resources. A similar conclusion can be drawn by plotting the cost of training a Word2Vec model using Spark MLlib with different cluster sizes:

graph of Spark MLlib cost with workers in USD

Figure 3: The MLlib implementation of Word2Vec in Spark doesn’t scale very well; the optimal ROI point comes at a mere 3 workers.
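The scaling argument can be checked directly against the training times in Table 1. The sketch below takes the GCE 4-core and 32-core times for each row of the table (rows are keyed by the version labels shown there) and computes the speedup and the parallel efficiency. The helper names are illustrative, not part of the benchmark code.

```python
def speedup(hours_small, hours_large):
    """How many times faster the larger instance finished."""
    return hours_small / hours_large


def parallel_efficiency(hours_small, cores_small, hours_large, cores_large):
    """Speedup divided by the increase in core count (1.0 = perfect scaling)."""
    return speedup(hours_small, hours_large) / (cores_large / cores_small)


# Training hours from Table 1, GCE columns: (4 cores, 32 cores).
gce_hours = {
    "(cloned repo in Aug 2017)": (3.8, 0.5),
    "(v 2.1.0)": (4.6, 1.4),
    "(v 0.7.1)": (13.9, 4.8),
    "(v 1.3.0)": (36.2, 38.1),
}

for label, (hours_4, hours_32) in gce_hours.items():
    s = speedup(hours_4, hours_32)
    e = parallel_efficiency(hours_4, 4, hours_32, 32)
    print(f"{label}: speedup {s:.2f}x, efficiency {e:.2f}")
```

Since cost is training hours times hourly price, the larger instance only lowers the total bill when its hourly price is less than the speedup times the smaller instance's price. A speedup below 1, as in the last row, means the bigger machine is both slower and more expensive.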

TL;DR Summary

Cloud Platform

Google Compute Engine

  Pros:
  • Cheapest cloud
  • Quick provisioning
  • Allows custom configuration

  Cons:
  • Complicated to set up & order an instance
  • Stability issues

Amazon Web Services

  Pros:
  • Cheap for low-end budget instances
  • Quick provisioning time

  Cons:
  • Relatively costly for high-end machines and charges for every extra service (customer support, monitoring metrics…)
  • Complex pricing

IBM Softlayer

  Pros:
  • Allows custom configuration
  • Offers bare metal instances
  • Stable platform

  Cons:
  • Expensive for medium-budget instances
  • Relatively slow provisioning

Hetzner

  Pros:
  • Cheapest option for long-term use
  • Great performance & stability
  • Lots of free services and features

  Cons:
  • Slow provisioning
  • Significant setup costs
  • Monthly billing, no hourly plans – expensive for short one-time use

A few more general observations:

  • The performance of most frameworks does not scale well with additional CPU cores beyond a certain point (8 cores), even on machines with many more cores.
  • Fewer cores with high clock speeds perform much better than a high number of cores with low clock speeds. Stick to the former to cut overall costs.
  • Dedicated servers like Hetzner can be an order of magnitude cheaper than cloud, for the high performance and free bandwidth they provide. You should definitely take this into consideration while renting machines for extended (month+) periods of use.

In some of the images above, you could already see we were running a battery of tests not only against the hardware providers, but also against different software platforms. How do the machine learning frameworks compare?

In the next post of this series, we’ll dig deeper and investigate the ease of use, documentation and support, scalability and performance of several popular ML frameworks such as Tensorflow, DL4J or Spark MLlib. Stay tuned or subscribe to our newsletter below. EDIT: the next part benchmarking GPU platforms is here!


| HW Provider - Instance | virtual CPU cores | CPU model |
|---|---|---|
|  |  | Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz |
|  |  | Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz |
| AWS-High End |  | Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz |
|  |  | Intel(R) Xeon(R) CPU @ 2.20GHz |
|  |  | Intel(R) Xeon(R) CPU @ 2.20GHz |
| GCE-High End |  | Intel(R) Xeon(R) CPU @ 2.30GHz |
|  |  | Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz |
|  |  | Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz |
| Softlayer-High End |  | Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz |
|  |  | Intel(R) Quad Core(TM) i7-7700 CPU @ 3.60GHz |



Comments 13

  1. Douglas

    Hey, I REAllY liked your article! That was awesome and very informative for me. I used to read only big news sites, but now I realized that there are plenty good sites to follow

    1. Jack Douglas

      We used AWS spot instance during a competition and got kicked off with two minutes notice during heavy usage! Never again!

  2. Martin Haeberli

    Thanks – nice article.

    But agree with question re spot instances?

    Also, I would be interested in an analogous benchmark that was GPU-focused.


  3. paurullan

    As Tina has said, you may have missed a piece of the puzzle: AWS spot instances and GCE preemptible machines.

    Right now (2017-12-07) on GCE, a 96vCPU + 624GB is almost $3000 per month but around $900 if configured as preemptible. If your problem is truly number crunching you can run this beast for less than $1.5/hour.

    In AWS you can have the m5.24xlarge (96 vCPUs + 384.0 GiB) for $4 on demand and $1.3 as a spot instance.

  4. Henryk Zygalski

    I would have liked to see an on-premise system thrown into the mix. Also the same comparison with GPU hardware would be great. Any serious ML that takes more than an hour should be done on GPUs.

  5. thehftguy

    Hello, I see you quoted me in your article.

    2-4 core systems are not considered “budget”. Google is cheaper than AWS but not significantly.

    For budget, you should compare the shared CPU instances, g1-small vs t2.small. Google guarantees 50% of a CPU whereas AWS is only 20% of a CPU, google should do the computation a lot faster and save a lot of money.

  6. Sean

    Any idea what the sustained GFLOPS were on each platform? Was Tensorflow so poor because it didn’t vectorize?
    It would be very useful to know the GFLOPS numbers, as a proxy for other floating point intensive applications that are not pure matrix multiply
