There has long been a strong desire for a series of industry-standard machine learning benchmarks, akin to the SPEC benchmarks for CPUs, in order to compare competing solutions. Over the past two years MLCommons, an open engineering consortium, has been developing and disclosing its MLPerf benchmarks for training and inference, with key consortium members releasing benchmark numbers as the series of tests gets refined. Today marks the full launch of MLPerf Inference v1.0, along with ~2,000 results entered into the database. Alongside this launch, a new MLPerf power measurement technique, providing additional metadata on these test results, is also being disclosed.

The results today all focus on inference – the ability of a trained network to process incoming unseen data. The tests are built around a number of machine learning areas and models intended to represent the wider ML market, in the same way that SPEC2017 tries to capture common CPU workloads. For MLPerf Inference, this includes:

  • Image Classification on ResNet50-v1.5
  • Object Detection with SSD-ResNet34
  • Medical Image Segmentation with 3D UNET
  • Speech-to-text with RNNT
  • Language Processing with BERT
  • Recommendation Engines with DLRM

Results can be submitted into a number of categories, such as Datacenter, Edge, Mobile, or Tiny. For Datacenter or Edge, they can also be submitted into the ‘closed’ division (apples-to-apples, with the same reference frameworks) or the ‘open’ division (anything goes, peak optimization). The metrics reported depend on the scenario: single-stream, multi-stream, server, or offline. For those tracking MLPerf’s progress, the benchmark set is the same as in v0.7, except that all DRAM must now be ECC and steady state must be measured with a minimum 10-minute run. Submissions must declare which datatypes are used (INT8, FP16, BF16, FP32). The benchmarks are designed to run on CPUs, GPUs, FPGAs, or dedicated AI silicon.
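The scenarios differ mainly in how queries arrive and which statistic is reported: single-stream and server track latency, while offline measures raw throughput with all samples available up front. As a toy illustration only (the real harness is MLCommons' LoadGen, whose API is not shown here, and `run_inference` is a hypothetical stand-in for the model under test), an offline-style throughput loop that respects a minimum steady-state duration might look like:

```python
import time

def offline_throughput(run_inference, samples, min_duration_s=600.0):
    """Toy offline-scenario loop: process every sample as fast as possible,
    repeating the set until the minimum steady-state duration has elapsed,
    then report samples per second."""
    processed = 0
    start = time.perf_counter()
    while True:
        for sample in samples:
            run_inference(sample)
            processed += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_duration_s:   # v1.0 requires a 10-minute minimum
            return processed / elapsed
```

Single-stream and server scenarios instead hold per-query latency against a target percentile, which this sketch does not attempt to model.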


The companies submitting results to MLPerf so far are a mix of vendors, OEM partners, and MLCommons members, such as Alibaba, Dell, Gigabyte, HPE, Inspur, Intel, Lenovo, NVIDIA, Qualcomm, Supermicro, and Xilinx. Most of these players submitted big multi-socket systems or multi-GPU designs, depending on which market they are targeting with the results. For example, Qualcomm submitted a result in the datacenter category using two EPYC processors and five of its Cloud AI 100 cards, but it also submitted to the edge category with an AI development kit featuring a Snapdragon 865 and a version of its Cloud AI hardware.

Qualcomm's Cloud AI 100

The biggest submitter for this launch, Krai, developed an automated test suite for MLPerf Inference v1.0 and ran the benchmarks across a number of low-cost edge devices, such as the Raspberry Pi, NVIDIA’s Jetson, and Rockchip hardware, both with and without GPU acceleration. As a result, Krai provided over half of all the results (1,000+) in today’s tranche of data. Compare that to Centaur, which provided a handful of data points for its upcoming CHA AI coprocessor.

Because not every system has to run every test, there is no single combined benchmark number. But taking one of the data points, we can see the scale of the results submitted so far.

On ResNet50, at the 99% accuracy target, running the offline scenario:

  • Alibaba’s Cloud Sinian Platform (two Xeon 8269CY + 8x A100) scored 1,077,800 samples per second in INT8
  • Krai’s Raspberry Pi 4 (1x Cortex-A72) scored 1.99 samples per second in INT8

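Just to underline the spread between those two entries, a back-of-the-envelope calculation using only the figures quoted above:

```python
# Reported offline ResNet50 throughput, samples per second (INT8).
alibaba_8x_a100 = 1_077_800   # two Xeon 8269CY + 8x A100
krai_rpi4 = 1.99              # Raspberry Pi 4

ratio = alibaba_8x_a100 / krai_rpi4
print(f"~{ratio:,.0f}x")  # roughly half a million times the throughput
```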
Obviously certain hardware will do better in language processing or object detection, and all the data points can be seen on MLCommons’ results pages.

MLPerf Inference Power

A new angle for v1.0 is power measurement metadata. In partnership with SPEC, MLPerf has adopted the industry-standard SPEC PTDaemon power measurement interface as an optional data add-on for any submission. These are system-level metrics rather than simply chip-level, which means that extra controllers, storage, memory, power delivery, and the efficiencies therein all count towards the submitted measurement.

MLPerf provides the example of a Gigabyte server with five Qualcomm Cloud AI 100 cards averaging 598 W during an offline test at 1777.9 queries per second. Submitters are allowed to provide additional power data, such as processor power, in the submission details; however, only the system-level power is part of the official submission.
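Since the power figure is an average over the run, deriving efficiency from that example is straightforward arithmetic:

```python
# Figures from the Gigabyte + 5x Cloud AI 100 example above.
avg_power_w = 598.0        # average system power during the offline run
throughput_qps = 1777.9    # queries per second

queries_per_joule = throughput_qps / avg_power_w               # ~2.97 queries/J
millijoules_per_query = 1000.0 * avg_power_w / throughput_qps  # ~336 mJ/query
print(f"{queries_per_joule:.2f} queries/J, {millijoules_per_query:.0f} mJ/query")
```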

Around 800 of the submitted data points in today’s list come with power data; again, most of them are from Krai.

Full results can be found at the MLCommons website.

Comments

  • Raqia - Wednesday, April 21, 2021 - link

    Do they have any metrics or benchmarks more oriented to the training side of the pipeline, and will you be incorporating any of these inference tests into AnandTech's benchmarking suite? (Interesting inference results when factoring in power: seems like Qualcomm's solution stands out for perf/W.)
  • Yojimbo - Wednesday, April 21, 2021 - link

    They do, but those are released at a different time. The latest round of training benchmarks are the version 0.7 HPC Training benchmarks released November 17, 2020. To my eye there's not much there: Lawrence Berkeley showing how bad the Xeon Phi is at training compared to the V100, Fujitsu showing how much better the V100 is at training than the A64FX ARM processor in Fugaku, and the Swiss Supercomputing Center showing how outdated the P100 is for training.

    The latest normal training benchmarks are the version 0.7 benchmarks released July 29, 2020.
    I read somewhere in passing that the 1.0 training results are due out in 3 months.

    As far as Qualcomm's inference results, it only submitted results for two models. The published results only show its solution standing out in perf/W in Resnet-50, which is a relatively small CNN. My guess is that it doesn't stand out much in larger and non-CNN models.
  • Ian Cutress - Thursday, April 22, 2021 - link

    Of course there are arguments there for both Phi/A64FX. Phi wasn't really built with ML in mind, and Knights Mill was a bit of a hack in the end. A64FX was built for HPC, not ML. NV has been arming its arch with ML in mind for several generations now. Speaking with a lot of the custom AI chip companies, it seems that their customers aren't too interested in MLPerf numbers, as their own models seem to be different enough that it's not worth the AI chip companies even bothering to run MLPerf and submit results.
  • Yojimbo - Thursday, April 22, 2021 - link

    The Fugaku supercomputer is definitely phenomenal (expensive, too). But the MLPerf results show that it's not very efficient for AI training. As for Xeon Phi, it was a complete misstep.

    It is quite literally Fujitsu benchmarking a V100 system and also benchmarking a slice of Fugaku, and LBNL benchmarking a V100 system and also benchmarking a Xeon Phi system.

    Regarding the AI startups, they can't possibly engage personally with every enterprise. They would benefit immensely from MLPerf results; they just can't easily run the benchmarks yet, at least not with results that show their value. And if they can't easily run the benchmarks, they can't easily adapt to whatever models people are actually running. That's why the uptake of these chips has so far mostly been by institutions whose purposes for buying the machines include researching and evaluating the technology inside them. As for the dearth of MLPerf results from the AI chip startups, I refuse to believe that any AI startup running the table with MLPerf wouldn't create a buzz that would have potential customers knocking at its door and potential investors clamoring to give it more money. And it's not like their bookings are full as it is.

    Most of the well-funded chips are training-focused. Hopefully we'll see some results from Graphcore, SambaNova, Cerebras, and Habana in the upcoming training results. These companies seem to have no trouble using these various standard models in their marketing slides, so it's not like they ignore them. The whole point of MLPerf was to progress from the point where everyone cherry-picks favorable results with carefully controlled parameter choices to compare themselves with their competitors.
  • mode_13h - Friday, April 23, 2021 - link

    > Speaking with a lot of the custom AI chip companies, it seems that their customers aren't too interested in MLPerf numbers

    Speaking with a lot of foxes, it seems that hens aren't really interested in more secure chicken coops...

    LOL. Maybe deep-pocketed, cutting-edge AI researchers aren't too interested in MLPerf numbers, but the bulk of the market isn't using such cutting edge stuff. Maybe some older networks can be dropped from these metrics, but I think the primary motive in AI chip makers downplaying the importance of benchmarks is mostly because they fear they wouldn't top the leader board (or not for long, even if they're currently in front).

    Honestly, without benchmarks, how are people supposed to make informed decisions? We can't all trial all solutions and test them on our current stuff. Even then, everyone wants room to grow and to have some sense that when we get to the point of deploying the next gen networks we won't be stuck with HW that won't play ball. This is probably a large part of the enduring popularity of GPUs -- because they're among the most flexible solutions.
  • RSAUser - Friday, April 30, 2021 - link

    The biggest problem is that the AI landscape is still progressing quite quickly, so most of these benchmarks are outdated in too short a time frame and can be misleading.

    At least those looking at ML benchmarks are hopefully better informed about what they mean, so they can use them correctly. It will probably turn into a marketing stunt instead, though, with ill-informed CTOs wanting a particular product because of its higher rank even though it doesn't fit the use case. I've seen it happen so often.
  • mode_13h - Friday, April 30, 2021 - link

    Good points, but I'm still reluctant to agree that having no standard benchmarks is a better situation than having a few outdated ones. If the main problem is staying current, then refreshing them every couple of years could at least help.
  • Yojimbo - Friday, April 30, 2021 - link

    They are refreshing them more often than that. That's why the AI startups can't keep up. Again, if they can't keep up with the standard benchmarks they can't keep up with the myriad of models. With inference, the trained models need to be tuned to run well on the hardware. And even in a mature industry, benchmarks don't tell you how a particular product will work in your workload. An overview of several benchmarks gives one a clue of where one can look. That's why it's important to submit results to more than one or two benchmarks. At this point, ResNet-50 isn't very useful, not by itself anyway.
