After Swift Comes Cyclone

I was fortunate enough to receive a tip last time that pointed me at some LLVM documentation calling out Apple’s Swift core by name. Scrubbing through those same docs, it seems that leak has since been plugged. Fortunately, I came across a unique string while watching the iPhone 5s boot:

I can’t find any other references to Oscar online, in LLVM documentation or anywhere else of value. I also didn’t see Oscar references on prior iPhones, only on the 5s. I’d heard that this new core wasn’t called Swift, a reference to just how different it is. Obviously Apple isn’t going to tell me what it’s called, so I’m going with Oscar unless someone tells me otherwise.

Update: Oscar is the CPU core inside the M7 motion coprocessor; Cyclone is the name of the Swift replacement.

Cyclone is likely a beefier version of the Swift core (or at least Swift-inspired) rather than a brand new design from the ground up. That means we’re likely talking about a 3-wide front end, and somewhere in the range of 5 to 7 execution ports. The design is likely also capable of out-of-order execution, given the performance levels we’ve been seeing.

Cyclone is a 64-bit ARMv8 core, not some Apple-designed custom ISA. With Cyclone, Apple manages not only to beat all other smartphone makers to ARMv8, but key ARM server partners as well. I’ll talk about the whole 64-bit aspect of this next, but needless to say, this is a big deal.

The move to ARMv8 comes with some of its own performance enhancements. More registers, a cleaner ISA, improved SIMD extensions/performance as well as cryptographic acceleration are all on the menu for the new core.
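
To give a rough idea of what that cryptographic acceleration looks like to software, here’s a minimal sketch (my own illustration, not Apple’s code) assuming an AArch64 compiler with the optional Crypto extension enabled (e.g. -march=armv8-a+crypto). ARMv8 collapses an entire AES round into a pair of instructions, exposed in C via arm_neon.h intrinsics:

    /* Sketch: one AES encryption round via ARMv8 Crypto intrinsics.
       Illustrative only; requires an AArch64 compiler with the
       crypto extension enabled. */
    #include <arm_neon.h>

    uint8x16_t aes_round(uint8x16_t state, uint8x16_t round_key)
    {
        state = vaeseq_u8(state, round_key);  /* AddRoundKey, ShiftRows, SubBytes */
        return vaesmcq_u8(state);             /* MixColumns */
    }

On ARMv7 the same round takes a long sequence of NEON or table-lookup code, which is why hardware crypto matters for things like filesystem encryption.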

Pipeline depth likely remains similar (maybe slightly longer), as frequencies haven’t gone up at all (still 1.3GHz). The A7 doesn’t feature support for any thermally driven CPU (or GPU) frequency boost.

The most visible change to Apple’s first ARMv8 core is a doubling of the L1 cache size: from 32KB/32KB (instruction/data) to 64KB/64KB. Along with this larger L1 cache comes an increase in access latency (from 2 clocks to 3 clocks from what I can tell), but the increase in hit rate likely makes up for the added latency. Such large L1 caches are quite common with AMD architectures, but unheard of in ultra mobile cores. A larger L1 cache will do a good job keeping the machine fed, implying a larger/more capable core.
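
For those wondering how you measure cache latency on a phone in the first place: the standard approach is a pointer chase over a randomly shuffled working set, using dependent loads so nothing can be overlapped or prefetched. A minimal sketch of the technique (an illustration, not the exact tool used for these measurements):

    /* Simplified pointer-chase latency sketch. A randomly shuffled
       cycle of pointers defeats the prefetchers, so every load pays
       the full access latency of whatever cache level the working
       set fits in. Illustrative only. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static volatile void *sink;  /* keeps the chase loop from being optimized away */

    static double ns_per_load(size_t bytes, long iters)
    {
        size_t n = bytes / sizeof(void *);
        void **cells = malloc(n * sizeof(void *));
        size_t *order = malloc(n * sizeof(size_t));
        for (size_t i = 0; i < n; i++) order[i] = i;
        for (size_t i = n - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
        for (size_t i = 0; i < n; i++)                /* link into one big cycle */
            cells[order[i]] = &cells[order[(i + 1) % n]];

        void **p = &cells[order[0]];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)              /* serialized dependent loads */
            p = (void **)*p;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        sink = p;

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        free(cells); free(order);
        return ns / (double)iters;
    }

    int main(void)
    {
        for (size_t kb = 16; kb <= 4096; kb *= 2)
            printf("%5zu KB working set: %.2f ns/load\n",
                   kb, ns_per_load(kb * 1024, 20 * 1000 * 1000));
        return 0;
    }

The steps in the ns/load curve as the working set crosses 64KB and then 1MB mark the L1 and L2 boundaries; past the L2 you’re measuring main memory latency.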

The L2 cache remains unchanged in size at 1MB, shared between both CPU cores. L2 access latency is improved tremendously with the new architecture; in some cases I measured L2 latency at half of what I saw with Swift.

The A7’s memory controller sees big improvements as well. I measured 20% lower main memory latency on the A7 compared to the A6. Branch prediction and memory prefetchers are both significantly better on the A7.

On top of all of this, I noticed large increases in peak memory bandwidth. I used a combination of custom tools and publicly available benchmarks to confirm the gains. A quick look at Geekbench 3 (prior to the ARMv8 patch) gives a conservative estimate of the memory bandwidth improvements:

Geekbench 3.0.0 Memory Bandwidth Comparison (1 thread)

                     Stream Copy   Stream Scale   Stream Add   Stream Triad
  Apple A7 1.3GHz    5.24 GB/s     5.21 GB/s      5.74 GB/s    5.71 GB/s
  Apple A6 1.3GHz    4.93 GB/s     3.77 GB/s      3.63 GB/s    3.62 GB/s
  A7 Advantage       6%            38%            58%          57%

We see anywhere from a 6% improvement in memory bandwidth to nearly 60% running the same Stream code. I’m not entirely sure how Geekbench implemented Stream and whether or not we’re actually testing other execution paths in addition to (or instead of) memory bandwidth. One custom piece of code I used to measure memory bandwidth showed nearly a 2x increase in peak bandwidth. That may be overstating things a bit, but needless to say this new architecture has a vastly improved cache and memory interface.
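
For reference, the Stream kernels Geekbench is reporting here are tiny loops; the triad is a[i] = b[i] + scalar * c[i]. Here’s a minimal sketch of that kind of test (my own simplified illustration, not Geekbench’s actual implementation):

    /* Simplified STREAM-style triad: a[i] = b[i] + scalar * c[i].
       Each iteration nominally moves three doubles across the memory
       bus (two reads + one write; write-allocate traffic can add a
       fourth stream). Arrays must dwarf the 1MB L2 so the loop
       streams from DRAM rather than cache. Illustrative only. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (4 * 1024 * 1024)   /* 32MB per array, 96MB total */

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("Triad: %.2f GB/s\n", 3.0 * N * sizeof(double) / sec / 1e9);

        free(a); free(b); free(c);
        return 0;
    }

Note that Copy and Scale touch two arrays per element while Add and Triad touch three, and Scale/Triad add a multiply to the mix; a core with limited FP or store throughput can therefore score very differently across the four kernels, which fits the suspicion that these tests exercise more than pure memory bandwidth.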

Looking at low level Geekbench 3 results (again, prior to the ARMv8 patch), we get a good feel for just how much the CPU cores have improved.

Geekbench 3.0.0 Compute Performance

                     Integer (ST)   Integer (MT)   FP (ST)   FP (MT)
  Apple A7 1.3GHz    1065           2095           983       1955
  Apple A6 1.3GHz    750            1472           588       1165
  A7 Advantage       42%            42%            67%       67%

Integer performance is up 42% on average, while floating point performance is up by 67%. Again, this is without 64-bit support or any of the other enhancements that come along with ARMv8. Memory bandwidth improves by 35% across all Geekbench tests. I confirmed with Apple that the A7 has a 64-bit wide memory interface, and we're likely talking about LPDDR3 memory this time around, so there's probably some frequency uplift there as well.

The result is something Apple refers to as desktop-class CPU performance. I’ll get to evaluating those claims in a moment, but first, let’s talk about the other big part of the A7 story: the move to a 64-bit ISA.

Comments

  • purerice - Wednesday, September 18, 2013

    So far there's been little mention of battery life. I wonder if LPDDR3 has a detrimental effect on battery life compared to LPDDR2.

    I usually like Anand's work but this quote got me: Although Apple could conceivably keep innovating to the point where an A-series chip ends up powering a Mac, I don't think that's in the cards today.

    Sorry Anand, the only way ARM can do that is to redesign its chips for the desktop, or to attempt some Larrabee-type chip based on its A-series. Eight Bay Trails working together could outrun a low-end Haswell on performance, wattage, and price, but if it were really that easy, Intel would already be doing it. Maybe if ARM bought AMD, but with ARM's current strategy it just doesn't seem feasible for them to overtake Intel or AMD.
  • Wilco1 - Wednesday, September 18, 2013

    A quad core version of the current A7 would already outperform the current Haswell in the latest MacBook Air.
  • stacey94 - Wednesday, September 18, 2013

    No it wouldn't. What are you even basing that off? The Geekbench 3 scores?

    Even assuming that's applicable across platforms, Haswell will have twice the single-threaded performance (again, based on Geekbench scores that probably mean nothing here). That matters more. By your logic, AMD's 8-core Bulldozer should have outperformed Sandy Bridge. It didn't.
  • Wilco1 - Thursday, September 19, 2013

    With double the cores and a small clock boost to 1.5GHz it would have higher throughput. Single threaded performance would still be half of Haswell of course, so for that they would need to increase the clock and/or IPC further. A 20nm A7 could run at 2GHz, so that would be ~75% of Haswell ST performance. I would argue that is more than enough to not notice the difference if you had 4 cores.
  • Laxaa - Wednesday, September 18, 2013

    Nice review, but I'm disappointed with the audio capture performance. 64kbps mono is not OK in 2013, and I see that most smartphone manufacturers skimp on this. Even my Lumia 920 disappoints in this department (96kbps mono), but at least it has HAAC mics that make it a decent companion at concerts (I think the 1020 does 128kbps stereo).

    Why isn't this an issue in an industry where everyone guns for better video and still image performance? It seems like such a small thing to ask for.
  • ddriver - Wednesday, September 18, 2013

    Same thing as with the lack of ac WiFi and LTE-A: with so many improvements in the 5s, Apple really needs to hold back on a few features so it can make the iPhone 6 an attractive product. You can pretty much bet money that the iPhone 6 will fill those gaps, deliberately left gaping.
  • steven75 - Friday, September 20, 2013

    If that's what you think, I can only imagine what you thought about the Nexus 4, which shipped with no LTE at a time when ALL phones came with LTE (not just flagships), and the Moto X with its middle-tier internals yet flagship pricing.
  • teiglin - Wednesday, September 18, 2013

    I'm a bit baffled by the battery life numbers, specifically the difference in performance relative to the 5/5c on WiFi vs. cellular. Given that they are all presumably using the same WiFi and cellular silicon, why is there such a dramatic relative increase in the battery life of the 5s compared to the 5/5c moving from WiFi to LTE? I don't see why the newer SoC should be improving its efficiency over LTE vs. WiFi; if anything, I'd expect a good WiFi connection to feed data to the platform faster than LTE, allowing the newer silicon to race to sleep more effectively.

    Were all the tests conducted on the same operator with comparable signal strength? Obviously you can't do much to normalize against network congestion--a factor almost certain to favor tests run in the past, though perhaps middle-of-the-night testing might help minimize it--but what other factors could account for this difference? Do you have any speculation as to what could cause such a huge shift?
  • DarkXale - Wednesday, September 18, 2013

    In wireless communications, power draw from the CPUs is considered negligible. It's transmitting the actual symbols (the bits) that costs massive amounts of power, so much in fact that compressing data first will normally yield battery savings. Similarly, Anand makes a mistake here on the second-to-last page: higher data rates are -not- more power efficient, they are less so.
  • teiglin - Wednesday, September 18, 2013

    It has been my experience that the SoC and display are more power-hungry than cellular data transfer in terms of peak power consumption. That's just anecdotal of course, based on comparing battery drain from an active download vs. screen-on-but-idle vs. screen-on-and-taxing-cpu and such. And if you're actually saying that SoC power draw in smartphones is negligible, then please just stop; I'm assuming you're just arguing that baseband/transceiver power is higher.

    Anand and Brian have always argued that newer, faster data transfer standards help battery life because those standards generally run at comparable power levels to the old ones but get tasks done faster, so for the same load (e.g. their battery life test) the radio is active for less time. I'm not an expert in wireless communications, but their numbers have always borne out such arguments. I look at it as analogous to generational CPU improvements: they get faster and can spend more power while completing tasks, but the total energy to do a given task can be reduced by having a more efficient architecture.

    All of which is at best peripheral to my actual question, since I was asking about differences within the same communications standards at (presumably) the same theoretical data rates, but I guess Anand and company have stopped reading comments. :(
