CPU Performance: Meet Kryo

To dive right into the heart of the matter: after getting our standard benchmarks out of the way, we had enough time left to run some of our more advanced analysis tools on the 820 MDP/S. While Qualcomm has been somewhat forthcoming about the Kryo CPU architecture, they have never been as open as, say, ARM (which is in the business of licensing the IP), so there are still some unanswered questions about what Kryo is like under the hood.

Qualcomm CPU Core Comparison
| | Snapdragon 800 | Snapdragon 810 | Snapdragon 820 |
| CPU Codename | Krait | ARM Cortex-A57 | Kryo |
| ARM ISA | ARMv7-A (32-bit) | ARMv8-A (32/64-bit) | ARMv8-A (32/64-bit) |
| Integer Add | 1 | 2 | 1 |
| Integer Mul | 1 | 1 | 1 |
| Shifter ALUs | 1 | 2 | 1 |
| Addition (FP32) Latency | 3 cycles | 5 cycles | 3 cycles |
| Multiplication (FP32) Latency | 6 cycles | 5 cycles | 5 cycles |
| Addition (INT) Latency | 1.5 cycles | 1 cycle | 1 cycle |
| Multiplication (INT) Latency | 4 cycles | 3 cycles | 4 cycles |
| L1 Cache | 16KB I$ + 16KB D$ | 48KB I$ + 32KB D$ | 32KB I$ + 32KB D$? |
| L3 Cache | N/A | N/A | N/A |

One thing that immediately jumps out is how similar some of our results are to Krait. According to our initial tests, the number of integer and FP ALUs appears to be unchanged, and the latency of many operations is close as well. This isn’t wholly surprising, as Krait was a solid architecture for Qualcomm, and there is a good chance they agreed and decided to use it as their starting point. At the same time, I do want to note that these are initial results gathered rather quickly on what’s essentially a beta device; further poking later on may reveal more differences than what we’ve seen so far.

With that said, there’s a big difference between how many execution units a CPU design has and how well it can fill them, which is why even similar designs can have wildly different IPC. We’ll investigate this a bit more in a moment, but it’s worth noting that this is exactly the approach ARM has taken with Cortex-A72, so it is neither unprecedented nor even unexpected.

Looking at the memory hierarchy and latency, our results point to a 32KB L1 data cache. For the moment I’m assuming the instruction cache is the same size, as is the case on most designs, but this test is purely a data test. Meanwhile the L2 cache size is harder to determine; we know that the different CPU clusters on the 820 will be using different L2 cache sizes. Ultimately it's pretty much impossible to pin down the exact L2 cache size from this test alone, especially since we can't see the amount of L2 attached to the lower clocked Kryo cluster.
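Latency tests of this kind are typically built around pointer chasing: a buffer is filled with a random single-cycle chain of indices, and each load depends on the result of the previous one, so the time per step approximates load-to-use latency for a working set of that size. The article does not say which tool produced its results, so what follows is only a minimal Python sketch of the general technique, with `build_chain` and `chase` as hypothetical names. Interpreter overhead in Python hides most cache effects, so a real measurement would use a compiled language, but the structure is the same.

```python
import random
import time


def build_chain(n, seed=0):
    """Return a list nxt where following i -> nxt[i] visits all n slots in one cycle."""
    rng = random.Random(seed)
    nxt = list(range(n))
    # Sattolo's algorithm: shuffles into a permutation made of a single cycle,
    # so the chase cannot get stuck in a short loop that stays cache-resident.
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps):
    """Follow the chain for `steps` dependent loads; return (final index, ns per step)."""
    idx = 0
    start = time.perf_counter()
    for _ in range(steps):
        idx = nxt[idx]  # each load depends on the previous one
    elapsed = time.perf_counter() - start
    return idx, elapsed * 1e9 / steps


if __name__ == "__main__":
    # Sweep a few working-set sizes (illustrative values, not the article's).
    for n in (1 << 10, 1 << 14, 1 << 18):
        _, ns = chase(build_chain(n), 200_000)
        print(f"{n:>8} slots: {ns:.1f} ns/step")
```

In a compiled version of this loop, plotting time per step against working-set size produces a staircase, and the points where latency steps up mark the capacity of each cache level.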

According to our colleague Matt Humrick over at Tom's Hardware, who has been investigating the matter, it seems Qualcomm has disclosed a 1MB L2 for the performance cluster and a 512KB L2 for the power cluster. We're still working to confirm this independently with Qualcomm.

However, what you won’t find, much to our surprise, is an L3 cache. Our test results indicate (and Qualcomm confirms) that Snapdragon 820 does not have the L3 cache we initially expected, leaving the L2 cache as the highest cache level on the chip. We initially reported an L3 because we found references to such a cache block in Qualcomm's resources, but it seems the latest revision of the SoC doesn't implement it in actual silicon, as the latency graph demonstrates. This means there isn’t any kind of cache back-stopping interactions between the two CPU clusters, or between the CPU and GPU: there is only simple coherency, and beyond that, main memory.

Geekbench 3 Memory Bandwidth Comparison (1 thread)
| | Stream Copy | Stream Scale | Stream Add | Stream Triad |
| SD 801 (2458MHz) | 7.6 GB/s | 4.6 GB/s | 4.6 GB/s | 5.2 GB/s |
| SD 810 (1958MHz) | 7.5 GB/s | 7.4 GB/s | 6.4 GB/s | 6.6 GB/s |
| SD 820 (2150MHz) | 17.4 GB/s | 11.5 GB/s | 13.1 GB/s | 12.8 GB/s |
| SD 820 > 810 Advantage | 131% | 55% | 103% | 94% |

Meanwhile, looking at Geekbench 3 memory performance, one can see that memory bandwidth is greatly improved over both Snapdragon 800/801 and 810. Stream copy in particular is through the roof, increasing by 131% (over double the 810’s performance). The other tests, though less dramatic, still improve by between 55% and 103%. The Snapdragon 820 also shows improved latency to main memory compared to the Snapdragon 810, so it seems that Qualcomm has made definite improvements to the memory controller and general memory architecture of the chipset, allowing the CPUs to get nearer to the theoretical total memory bandwidth offered by the memory controllers.
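For readers who want to reproduce the advantage row, here is a small Python sketch of the arithmetic, using the single-thread figures from the table above. The function name `advantage_pct` is our own. Because the inputs are already rounded to one decimal place, the results can land a point or two away from the published percentages, which were presumably computed from unrounded scores.

```python
# Single-thread bandwidth in GB/s, copied from the Geekbench 3 table above.
SD810 = {"Copy": 7.5, "Scale": 7.4, "Add": 6.4, "Triad": 6.6}
SD820 = {"Copy": 17.4, "Scale": 11.5, "Add": 13.1, "Triad": 12.8}


def advantage_pct(old, new):
    """Percentage by which `new` exceeds `old` (negative means a regression)."""
    return (new / old - 1.0) * 100.0


if __name__ == "__main__":
    for kernel in SD810:
        pct = advantage_pct(SD810[kernel], SD820[kernel])
        print(f"Stream {kernel}: {pct:+.0f}%")
```

An advantage above 100% means the newer chip delivers more than twice the bandwidth, which is the case for Stream Copy.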

Moving on, let’s shift to some benchmarks that take a more comprehensive look at performance, starting with SPECint2000. Developed by the Standard Performance Evaluation Corporation, SPECint2000 is the integer component of their larger SPEC CPU2000 benchmark. Designed around the turn of the century, SPEC CPU2000 has officially been retired for PC processors, but with mobile processors roughly a decade behind their PC counterparts in performance, it is currently a very good fit for the capabilities of contemporary SoCs.
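As background on how to read SPEC-style scores: in SPEC CPU2000, each subtest's score is a ratio of a fixed reference run time to the measured run time, scaled so that the reference machine scores 100, and an overall score is the geometric mean of the subtest ratios. The sketch below illustrates that arithmetic in Python. The function names and the timings used in the example are placeholders of ours, not values from SPEC or from this article's runs.

```python
import math


def spec_ratio(reference_seconds, measured_seconds):
    """Per-subtest score: 100 means 'as fast as the reference machine'."""
    return 100.0 * reference_seconds / measured_seconds


def overall_score(ratios):
    """Geometric mean of the per-subtest ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


if __name__ == "__main__":
    # Placeholder timings purely for illustration (not real SPEC data).
    hypothetical_runs = [(1000.0, 100.0), (2000.0, 125.0)]
    ratios = [spec_ratio(ref, run) for ref, run in hypothetical_runs]
    print([round(r) for r in ratios], round(overall_score(ratios)))
```

The geometric mean keeps one unusually fast or slow subtest from dominating the overall figure the way an arithmetic mean would.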

SPECint2000 - Estimated Scores
| | Snapdragon 810 | Snapdragon 820 | % Advantage |
| 164.gzip | 823 | 1176 | 43% |
| 175.vpr | 2456 | 1707 | -30% |
| 176.gcc | 1341 | 1641 | 22% |
| 181.mcf | 789 | 593 | -25% |
| 186.crafty | 1492 | 1449 | -3% |
| 197.parser | 753 | 962 | 28% |
| 252.eon | 2321 | 3333 | 44% |
| 253.perlbmk | 1090 | 1384 | 27% |
| 254.gap | 1325 | 1447 | 9% |
| 255.vortex | 1043 | 1583 | 52% |
| 256.bzip2 | 867 | 1041 | 20% |
| 300.twolf | DNC | DNC | N/A |

Even though this early preview means we don’t have the luxury of building a binary with a Kryo-aware compiler, running our A57 binaries produces some preliminary results on the 820 MDP/S. Performance does regress in a few places (175.vpr, 181.mcf, and marginally 186.crafty), but elsewhere we see increases of up to 52%. The 820 does have a roughly 10% frequency advantage over the 810, so once the clock difference is taken into account the IPC improvements are somewhat lower. This also shows when comparing the Snapdragon 820 to the more similarly clocked Exynos 7420 (A57 @ 2100MHz): the maximum advantage drops to 33%, and, much as against a clock-normalized Snapdragon 810, the overall average advantage comes in at only 5-6%. Once we have more time with a Snapdragon 820 device, we'll be able to verify how much compiler settings affect scores on the Kryo architecture.
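To make the clock normalization concrete, here is a minimal Python sketch that divides out the frequency difference before computing the advantage. It assumes the two chips ran SPECint2000 at the clocks listed in the memory bandwidth table (1958MHz for the 810, 2150MHz for the 820), which the article does not state explicitly for this test. The function names are ours, and only three of the subtests are included for brevity.

```python
# Clocks as listed in the Geekbench 3 memory bandwidth table (assumed to
# also apply to the SPECint2000 runs).
CLOCK_810_MHZ = 1958
CLOCK_820_MHZ = 2150

# (Snapdragon 810 score, Snapdragon 820 score) from the SPECint2000 table.
SCORES = {
    "164.gzip": (823, 1176),
    "175.vpr": (2456, 1707),
    "255.vortex": (1043, 1583),
}


def raw_advantage_pct(score_810, score_820):
    """Advantage of the 820 without any clock correction."""
    return (score_820 / score_810 - 1.0) * 100.0


def per_clock_advantage_pct(score_810, score_820,
                            clock_810=CLOCK_810_MHZ, clock_820=CLOCK_820_MHZ):
    """Advantage of the 820 in score per MHz, a rough proxy for IPC."""
    per_clock_810 = score_810 / clock_810
    per_clock_820 = score_820 / clock_820
    return (per_clock_820 / per_clock_810 - 1.0) * 100.0


if __name__ == "__main__":
    for name, (s810, s820) in SCORES.items():
        raw = raw_advantage_pct(s810, s820)
        norm = per_clock_advantage_pct(s810, s820)
        print(f"{name}: raw {raw:+.1f}%, per-clock {norm:+.1f}%")
```

Score per MHz is only a rough stand-in for IPC, since memory-bound subtests do not scale linearly with core clock. How the individual subtests are averaged (arithmetic or geometric mean) also changes the overall figure, and the article does not say which was used.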

Our other set of comparison benchmarks comes from Geekbench 3. Unlike SPECint2000, Geekbench 3 is a mix of integer and floating point workloads, so it will give us a second set of eyes on the integer results along with a take on floating point improvements.

Geekbench 3 - Integer Performance
| | Snapdragon 810 | Snapdragon 820 | % Advantage |
| AES ST | 739.7 MB/s | 700.7 MB/s | -5% |
| AES MT | 3.05 GB/s | 1.99 GB/s | -35% |
| Twofish ST | 89.8 MB/s | 102.7 MB/s | 14% |
| Twofish MT | 448.5 MB/s | 345.5 MB/s | -23% |
| SHA1 ST | 628.9 MB/s | 983 MB/s | 56% |
| SHA1 MT | 3.02 GB/s | 2.84 GB/s | -6% |
| SHA2 ST | 83.5 MB/s | 134.9 MB/s | 61% |
| SHA2 MT | 393.4 MB/s | 374.6 MB/s | -5% |
| BZip2Comp ST | 5.01 MB/s | 7.29 MB/s | 45% |
| BZip2Comp MT | 20.5 MB/s | 20.5 MB/s | 0% |
| Bzip2Decomp ST | 7.99 MB/s | 9.76 MB/s | 24% |
| Bzip2Decomp MT | 30.8 MB/s | 24.9 MB/s | -19% |
| JPG Comp ST | 18.9 MP/s | 23.3 MP/s | 23% |
| JPG Comp MT | 88.9 MP/s | 76.7 MP/s | -14% |
| JPG Decomp ST | 41.5 MP/s | 62.2 MP/s | 49% |
| JPG Decomp MT | 182.7 MP/s | 176.6 MP/s | -3% |
| PNG Comp ST | 1.11 MP/s | 1.56 MP/s | 43% |
| PNG Comp MT | 4.78 MP/s | 4.61 MP/s | -4% |
| PNG Decomp ST | 17.9 MP/s | 24.2 MP/s | 35% |
| PNG Decomp MT | 94.1 MP/s | 64.3 MP/s | -32% |
| Sobel ST | 53.3 MP/s | 86.3 MP/s | 62% |
| Sobel MT | 248.4 MP/s | 244.8 MP/s | -1% |
| Lua ST | 1.30 MB/s | 1.59 MB/s | 22% |
| Lua MT | 5.93 MB/s | 4.5 MB/s | -24% |
| Dijkstra ST | 3.38 Mpairs/s | 5.52 Mpairs/s | 63% |
| Dijkstra MT | 13.7 Mpairs/s | 13.7 Mpairs/s | 0% |

The actual integer performance gains with Geekbench 3 are rather varied. Single-threaded results show gains in nearly every subtest: the lone exception is a minor 5% regression for AES, while improvements run as high as 63% for Dijkstra. Given the architecture shift involved here, this is a bit surprising (and in Qualcomm’s favor), since you wouldn’t necessarily expect Kryo to beat Cortex-A57 on nearly everything. On the other hand, MT results typically show a regression. Snapdragon 810's 4+4 big.LITTLE configuration meant that it had 4 Cortex-A53 cores contributing to the task alongside big cores all running at near-full clockspeed, while Kryo’s second cluster runs at a reduced clockrate. And though one could have a spirited argument about whether single-threaded or multi-threaded performance is more important, I’m firmly on the side of ST for most use cases.
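One way to see the core-configuration effect in the numbers is to divide each multi-threaded result by its single-threaded counterpart. The Python sketch below does this for three subtests whose ST and MT results share a unit in the table above; `mt_scaling` is a name of our own choosing, and the ratio is only a rough indicator, since it also reflects clock and thermal behavior under load.

```python
# (ST, MT) throughput pairs from the Geekbench 3 integer table above.
RESULTS = {
    "Twofish (MB/s)": {"810": (89.8, 448.5), "820": (102.7, 345.5)},
    "Sobel (MP/s)": {"810": (53.3, 248.4), "820": (86.3, 244.8)},
    "Dijkstra (Mpairs/s)": {"810": (3.38, 13.7), "820": (5.52, 13.7)},
}


def mt_scaling(st, mt):
    """How many times faster the multi-threaded run is than one thread."""
    return mt / st


if __name__ == "__main__":
    for test, chips in RESULTS.items():
        parts = []
        for chip, (st, mt) in chips.items():
            parts.append(f"SD {chip}: {mt_scaling(st, mt):.2f}x")
        print(f"{test}: " + ", ".join(parts))
```

On these subtests the 810's eight cores scale to roughly four to five times its single-thread result, while the 820's four cores land between about two and a half and three and a half times, which is how a chip with faster individual cores can still trail in MT totals.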

Geekbench 3 - Floating Point Performance
| | Snapdragon 810 | Snapdragon 820 | % Advantage |
| BlackScholes ST | 5.46 Mnodes/s | 12.3 Mnodes/s | 125% |
| BlackScholes MT | 25.5 Mnodes/s | 32.1 Mnodes/s | 26% |
| Mandelbrot ST | 1.2 GFLOPS | 2 GFLOPS | 67% |
| Mandelbrot MT | 6.41 GFLOPS | 6.23 GFLOPS | -3% |
| Sharpen Filter ST | 1.07 GFLOPS | 2.15 GFLOPS | 100% |
| Sharpen Filter MT | 5.02 GFLOPS | 6.11 GFLOPS | 22% |
| Blur Filter ST | 1.27 GFLOPS | 3.14 GFLOPS | 147% |
| Blur Filter MT | 6.14 GFLOPS | 8.84 GFLOPS | 44% |
| SGEMM ST | 2.29 GFLOPS | 4.09 GFLOPS | 79% |
| SGEMM MT | 6.12 GFLOPS | 9.19 GFLOPS | 50% |
| DGEMM ST | 1.05 GFLOPS | 1.95 GFLOPS | 85% |
| DGEMM MT | 2.81 GFLOPS | 4.53 GFLOPS | 61% |
| SFFT ST | 1.25 GFLOPS | 1.98 GFLOPS | 58% |
| SFFT MT | 4.11 GFLOPS | 5.65 GFLOPS | 37% |
| DFFT ST | 1.03 GFLOPS | 1.68 GFLOPS | 63% |
| DFFT MT | 2.97 GFLOPS | 4.76 GFLOPS | 60% |
| N-Body ST | 486.6 Kpairs/s | 841 Kpairs/s | 73% |
| N-Body MT | 1.72 Mpairs/s | 2.34 Mpairs/s | 36% |
| Ray Trace ST | 1.84 MP/s | 2.86 MP/s | 55% |
| Ray Trace MT | 8.16 MP/s | 8.46 MP/s | 4% |

Geekbench 3’s floating point results are even more positive for Snapdragon 820. There is only a single performance regression, a 3% drop in multi-threaded Mandelbrot. Otherwise, in both MT and ST workloads, performance is significantly up. This is a prime example of Kryo taking better advantage of its execution units than any high-end Qualcomm SoC before it: even where the unit counts hold steady (or show a slight deficit on paper), in practice it comes out significantly ahead.


146 Comments


  • BurntMyBacon - Monday, December 14, 2015 - link

    @V900: "Actually Samsung probably wouldn't save any money by using an Exynos SOC."

    They'd most likely save some. Just not enough to forgo a better chip if available.

    @V900: "I doubt Apple would let them manufacture their CPUs if they weren't separate divisions and had firewalls between them."

    The "firewall" would exist around the fabrication facilities only. R&D and architecture design have no bearing on Apple products. If they are sufficiently proficient at design and the cost of the ARM IP doesn't eat the savings, then they could save some here.

    @V900: "The two divisions are independent of each other, which means that Samsung the SOC vendor charges Samsung the device vendor the same prices they charge everyone else."

    Current fabrication facilities (TSMC, GloFlo, et al) don't charge the same price per customer. They will give discounts for volume, customer loyalty, just to keep the fabs busy, etc.. Samsung could charge themselves preferred pricing, but it certainly wouldn't be free. How much they could save here is dependent on what they charge vs their competitors (I.E. TSMC) and if there is any margin for preferred pricing. Sometimes they will give their competitors very low margin pricing just to keep the fab busy until they have their next push. Samsung has generally been short on supply, so this hasn't happened much, but given their new expansion, it may be a consideration in the future.
  • zeeBomb - Thursday, December 10, 2015 - link

    Damn it, late!
  • WorldWithoutMadness - Thursday, December 10, 2015 - link

    I suppose if they're gonna use Qualcomm one last time, it would be for S7 and Note6. Chances are pretty good to accommodate those who are 'stubborn' with Qualcomm's stuff.

    After that, they are going to use their M1 and its derivative for everything else, better margin in saturated market is their goal in the first place.

    Well this wouldn't be long until Google release their own processor design to standardize Android's madness
  • zeeBomb - Friday, December 11, 2015 - link

    So the summary is...the CPU of Kryo is getting some major competition to Apple's A9 but the GPU is great, beating the A9 in many of the tests.

    Also... The Kryo Snapdragon 820 attained a high 131648 and the Kirin 950 with 95280. Thoughts?

    http://www.gizmochina.com/2015/12/11/snapdragon-82...
  • gg555 - Sunday, December 20, 2015 - link

    It has already been heavily leaked that the S7 will use the 820 in some markets.
  • yeeeeman - Friday, March 13, 2020 - link

    I can tell you from the future that Samsung will use both Exynos and Snapdragon for GS7. The exynos chip with custom mongoose cores is better.
  • Krysto - Thursday, December 10, 2015 - link

    Performance improvements are nice and all, and I'm more excited about the extra features such as Zeroth, Sense ID, and Smart Protect, but Qualcomm must under no circumstance blow it again on the heating/power consumption front. Whatever compromises they need to make for that to not happen again, they must do them.

    The Snapdragon 810 overheating issue was very much real, even with the latest versions where they claimed to have "fixed" the issue. Play any game on an 810 device for 10 minutes, and you'll see what I mean. The device gets unnaturally hot. That's completely unacceptable and should never again be decided as a "compromise" in order to beat Apple in performance or whatever. Never again!

    Now, I hope Qualcomm will focus even more on hardware-enabled security features. It also makes no sense for them to support SHA1 anymore, but I guess that was a decision taken years ago. Next version should drop support for it. What I'd like to see is ChaCha20 acceleration as soon as possible, as it will be part of TLS 1.3 and will be included in OpenSSL 1.1.

    I also wish Qualcomm would open source more parts of its security-related firmware, and would also open source its baseband firmware (I know, a hard thing to ask but only way we can be sure there's no backdoor in there). Otherwise, at the very least they should try to completely isolate the baseband firmware from most OS functions, so even if the baseband is "owned" they can't take control of the device, other than perhaps listen to phone calls.

    Security is only going to become a more and more important feature in future chips, not just for smartphones and PCs, but also for IoT, which direly needs strong security by default, because we all know most IoT OEMs will never update those devices again after people buy them, or will only do it for a short while.
  • ganz - Thursday, December 10, 2015 - link

    I keep seeing people complaining about the heat of the 810. I've got an HTC One M9, and I've played games on it. I'd characterize the experience as, well, warm. Ish. Posts like yours indicate people are experiencing heat that's an order of magnitude greater than I am.

    Can you give me a sample workload that might allow me to experience this for myself? Barring that, can you give me an objective number in Celsius that's too high for you to bear?
  • tipoo - Thursday, December 10, 2015 - link


    iirc, the M9 had a patch for the overheating issue, but that just ended up throttling performance earlier to never get so hot.
  • jjj - Thursday, December 10, 2015 - link

    It's not about the phone heating up, it's about the chip heating up and having to slow down. The problem is not the heat to your hand, the problem is that the chip slows down hard and you lose perf.
    So if you want to see it overheating, track clocks and load.
