25 Comments
extide - Wednesday, October 18, 2017 - link
If NV gets 100TF on the Tensor cores, and they are claiming 10x... I wonder if we could see 1PF on this thing!
Qwertilot - Wednesday, October 18, 2017 - link
I've a memory that that is 10x Maxwell, so rather less mind boggling :)
Yojimbo - Saturday, October 21, 2017 - link
In August of 2016, Naveen Rao, then CEO of Nervana Systems and now, after Intel's purchase of Nervana, VP and GM of the Artificial Intelligence Products Group at Intel, claimed 55 TOPS with the Nervana chip on 28 nm. NVIDIA claims 120 TOPS with the Tesla V100. So in peak theoretical throughput for deep learning operations the V100 seems to have a big advantage.
Yojimbo - Saturday, October 21, 2017 - link
Here's the quote, from nextplatform:
Think about the performance of the Nervana chip against the Pascal GPU (Nvidia’s top end deep learning and HPC chip, featured in its DGX-1 appliance) in terms of teraops per second, to be exact. A Maxwell TitanX card (which is what they currently use in their cloud for model training) is about six teraops per second, Rao says. Pascal is about double that number, with between 10-11 teraops per second. Nvidia also has a half-precision mode built in as well, which can yield approximately 20 teraops per second, at least based on the numbers he says are gleaned from users that are actively putting Pascal through the paces for deep learning workloads. “On our chip, which is a TSMC 28 nanometer chip, we’re seeing around 55 teraops per second.”
So Qwertilot remembered correctly. They were claiming about 10x Maxwell TitanX.
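To put the quoted figures side by side, here is a quick back-of-the-envelope sketch in Python. The numbers are only the peak claims cited in this thread, not measured benchmarks. The 10.5 for Pascal is an assumed midpoint of the quoted "10-11" range, and the names `PEAK_TOPS` and `ratio` are just illustrative.

```python
# Peak throughput claims quoted in this thread, in teraops per second.
# These are vendor/CEO claims, not measured benchmarks.
PEAK_TOPS = {
    "Maxwell TitanX": 6.0,
    "Pascal": 10.5,                 # assumption: midpoint of the quoted 10-11 range
    "Pascal (half precision)": 20.0,
    "Nervana (28 nm)": 55.0,
    "Tesla V100": 120.0,
}


def ratio(a: str, b: str) -> float:
    """Return how many times higher the claimed peak of chip `a` is than chip `b`."""
    return PEAK_TOPS[a] / PEAK_TOPS[b]


if __name__ == "__main__":
    print(f"Nervana vs Maxwell TitanX: {ratio('Nervana (28 nm)', 'Maxwell TitanX'):.1f}x")
    print(f"Tesla V100 vs Nervana:     {ratio('Tesla V100', 'Nervana (28 nm)'):.1f}x")
```

On these figures the Nervana claim works out to a bit over 9x the Maxwell TitanX, and the V100 claim to a bit over 2x the Nervana figure.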
mode_13h - Wednesday, October 25, 2017 - link
He wants to talk about "teraops per second", but he neglects the smaller Pascals' 8-bit MAC performance? That's like 44 TOPS for GP102.
Yojimbo - Wednesday, October 25, 2017 - link
He was talking about training. 8-bit integers don't seem to cut it for training.
Drumsticks - Tuesday, October 24, 2017 - link
Volta definitely seems to have a big advantage over Lake Crest. I wonder what they can accomplish moving off of TSMC 28nm though; that's a pretty significant disadvantage by itself. If Intel can throw some money at their fabs and get process parity with Nvidia, that could cut away a big portion of that performance lead. (Maxwell to Pascal was 70%-ish without any major architectural changes.)
Drumsticks - Tuesday, October 24, 2017 - link
And that was with reducing the die size from 398 mm² to 314 mm². Keeping the die size fixed could probably enable even bigger gains.
shabby - Wednesday, October 18, 2017 - link
Does it need a new motherboard?
Dr. Swag - Wednesday, October 18, 2017 - link
Why are they manufacturing it at TSMC and not at their own fabs?
aeolist - Wednesday, October 18, 2017 - link
Probably because the company making the ASIC had a lot of the design work done before the Intel buyout and moving it to a different fab entirely would not have been worthwhile.
I'd assume 1-2 generations down the line they'll be fabbed by Intel directly.
woggs - Wednesday, October 18, 2017 - link
Intel factories are MONSTERS that churn out one thing in mega-volume at high margin. That cannot be interrupted by low-margin, low volume, experimental toys. If it takes off, this can change, but requires starting the architecture and design with the target being an intel FAB from day 1, which would likely be a hand-me-down from CPU or chipset first and later a purpose built factory. Long way off, if ever.
Yojimbo - Saturday, October 21, 2017 - link
Intel knows what's at stake with AI. They didn't buy Nervana just to play with toys. NVIDIA is trading at a trailing 12 month stock price to earnings ratio (P/E) of over 50 when the industry average is under half that, largely because of the expected future opportunity of AI. Intel's P/E is 15.4, BTW. Intel's market capitalization (share price times number of shares outstanding) is $190 billion while NVIDIA's is $118 billion, even though Intel has much higher revenues. Intel's growth has been slow since the late 90s (their revenue has doubled in the 18 years since 1999, I think I read recently), and they desperately want to tap into the big growth opportunities in the market to stay relevant and hopefully to grow.
What aeolist said was most likely correct. They will move to Intel's 14 nm process with the next generation of this chip. They would have had to delay the release of the chip if they tried to move it over to Intel's fabs after Intel's August 2016 purchase of Nervana Systems.
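For anyone who wants to check the arithmetic behind those valuation figures, here is a rough sketch. It only rearranges the numbers given above (earnings = market cap / P/E, and the annual growth rate implied by revenue doubling over 18 years). The function names are made up for illustration, and since NVIDIA's P/E is stated as "over 50", its implied earnings figure is an upper bound.

```python
def implied_earnings(market_cap_bn: float, pe: float) -> float:
    """Trailing earnings (in $ billions) implied by a market cap and a P/E ratio."""
    return market_cap_bn / pe


def cagr(growth_multiple: float, years: float) -> float:
    """Compound annual growth rate that yields `growth_multiple` over `years`."""
    return growth_multiple ** (1.0 / years) - 1.0


if __name__ == "__main__":
    intel = implied_earnings(190.0, 15.4)
    # P/E is quoted as "over 50", so this is an upper bound on NVIDIA's earnings.
    nvidia = implied_earnings(118.0, 50.0)
    print(f"Intel implied trailing earnings:  ~${intel:.1f}B")
    print(f"NVIDIA implied trailing earnings: <${nvidia:.2f}B")
    print(f"Revenue doubling in 18 years:     ~{cagr(2.0, 18) * 100:.1f}% per year")
```

So on the figures quoted, the market is valuing NVIDIA at well over half of Intel's capitalization on roughly a fifth of the earnings, and Intel's revenue growth comes out to about 4% a year.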
mode_13h - Wednesday, October 25, 2017 - link
You're forgetting that Intel actually runs a foundry business for 3rd party IP. So, they must be set up to do smaller runs than their x86 parts.
FunBunny2 - Wednesday, October 18, 2017 - link
Big Brother's, or Dear Leader's, next Big Hammer.
andrewaggb - Wednesday, October 18, 2017 - link
Well I'm definitely curious how fast it'll be and how much it'll cost. Having used nvidia graphics cards for training it's much faster than a cpu but still not what I'd call fast. And there's the CUDA problem... hopefully intel can integrate this with tensorflow, caffe, torch, opencv, etc.
Yojimbo - Saturday, October 21, 2017 - link
In order to get the most speed, there need to be libraries well-optimized for a particular architecture, which takes a lot of work and benefits from intimate knowledge of the architecture, especially if you want speed boosts available to the public quickly.
Intel will have to make specialized libraries to plug into the deep learning frameworks in order to take advantage of this chip, much like NVIDIA provides CUDA libraries optimized for their GPUs. With either this Nervana chip or with NVIDIA's GPUs using CUDA, I would think that if you want to switch hardware you are going to have to change libraries. Maybe Intel can write libraries that work well with both NNP and AVX-512, but are you really going to switch from NNP to AVX-512 for your training? Correct me if I'm wrong, but I'd guess no. Maybe you'd want to use an AMD GPU. Intel is not going to be targeting AMD GPUs with their libraries.
Besides, doesn't it seem likely that OpenCL libraries that are decent on both NVIDIA and AMD GPUs are more likely to exist than something that is decent on this NNP ASIC and something else?
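The idea of vendor libraries "plugging into" a framework can be sketched roughly like this. It is a toy illustration only, not how TensorFlow, neon, or any real framework is implemented: the `KERNELS` registry, the device names, and both kernels are hypothetical, and the "tuned" kernel is merely a stand-in for a vendor-optimized routine.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]


def naive_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Portable reference kernel: correct everywhere, tuned for nothing."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def transposed_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Stand-in for a vendor-tuned kernel: transposes b so the inner loop walks rows."""
    b_t = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_t] for row in a]


# Hypothetical registry mapping a device name to the library routine for it.
KERNELS: Dict[str, Callable[[Matrix, Matrix], Matrix]] = {
    "generic": naive_matmul,
    "vendor_x": transposed_matmul,
}


def matmul(a: Matrix, b: Matrix, device: str = "generic") -> Matrix:
    """Framework-level op: same call for the user, different kernel per device."""
    kernel = KERNELS.get(device, naive_matmul)  # fall back to the portable kernel
    return kernel(a, b)


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(matmul(a, b, device="generic"))
    print(matmul(a, b, device="vendor_x"))
```

The user-facing call stays the same, but the speed depends entirely on which kernel sits behind it, which is why moving to different hardware means someone has to write and tune a new set of kernels for it.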
Nate Oh - Monday, October 23, 2017 - link
Intel has Nervana's neon framework, which supports CPU, MKL, and GPU backends and now also utilizes Intel's Math Kernel Library (MKL). neon v2.1.0 actually includes AVX-512 support for SKL and KNM in DL workloads,[1] and their latest docs state support for Pascal, Maxwell, and Kepler[2]. neon's been kicking around for a bit, too, and I believe it used to have a reputation of being faster but less versatile/powerful than the other more popular frameworks.
In the NNP announcements, there was no mention of neon but it seems logical that neon will support it, and that NNP may be custom-designed for it.
[1] https://www.intelnervana.com/neon-2-1/
[2] http://neon.nervanasys.com/docs/latest/overview.ht...
Yojimbo - Thursday, October 26, 2017 - link
Yes, Nervana has the expertise to make such a framework. But my point was the following. The original poster said "And there's the CUDA problem..." I don't understand how switching from a general purpose architecture to an ASIC solves "the CUDA problem", which I assumed to mean "the amount that one is locked into using the current architecture of choice".
Using an NNP or a GPU, one will still be tied into the libraries one links to the same amount, and be able to take advantage of the reusable part from the back end framework the same amount. So unless by "the CUDA problem" he means just the existence of CUDA itself, and has in mind a "final solution", what exactly is the problem that the use of Intel's NNP solves?
mode_13h - Wednesday, October 25, 2017 - link
"Intel will have to make specialized libraries to plug into the deep learning frameworks in order to take advantage of this chip"
The big players (Intel, AMD, and Nvidia) already have optimized backends for popular frameworks like Caffe, Caffe2, TensorFlow, etc. The smaller players (Movidius, for one) have proprietary frameworks that let you import models from the standard frameworks.
Yojimbo - Wednesday, October 25, 2017 - link
"The big players (Intel, AMD, and Nvidia) already have optimized backends for popular frameworks like Caffe, Caffe2, TensorFlow, etc. The smaller players (Movidius, for one) have proprietary frameworks that let you import models from the standard frameworks."
I'm talking about development libraries for training, not about inferencing. TensorFlow has to call libraries to perform the math operations involved with the training algorithms. The implementation of these libraries can make or break the performance of TensorFlow on a particular architecture. It takes expertise, hard work, and time to make math libraries that optimally take advantage of a particular architecture in a wide range of use cases. You need someone who has the technical knowledge of this type of programming, intimately knows the architecture, and intimately knows the mathematics.
XiroMisho - Saturday, October 21, 2017 - link
What are its HPS capabilities, and when this comes out will miners finally start buying these and leave our damn video cards alone?! XD
marvdmartian - Monday, October 23, 2017 - link
Neural Net Processors? Isn't that what they put in the Terminators??
So Intel is actually Cyberdyne Systems???
Time to stock up on ammo!
mode_13h - Wednesday, October 25, 2017 - link
No, it's not time to stock up on ammo. Generalized AI is still a ways off.
We're unlikely to see it in less than 10 years, but I'd give it a bit longer.
Yojimbo - Thursday, October 26, 2017 - link
I think generalized AI will take a new technology. I don't think we will achieve abstract reasoning or complex contextual understanding with current methods. Current methods are good at pattern matching. So it could be 10 years later or it could be 100 years later.