How to Enable NVMe Zoned Namespaces

Hardware changes for ZNS

At a high level, enabling ZNS on most drives already on the market requires only a firmware update. ZNS doesn't put any new requirements on SSD controllers or other hardware components, so the feature can be added to existing drives with firmware changes alone.

The more interesting hardware changes come into play when an SSD is designed to support only zoned namespaces. First and foremost, a ZNS-only SSD doesn't need anywhere near as much overprovisioning as a traditional enterprise SSD. ZNS SSDs are still responsible for performing wear leveling, but this no longer requires a large spare area for the garbage collection process. Used properly, ZNS allows the host software to avoid almost all of the circumstances that would lead to write amplification inside the SSD. Enterprise SSDs commonly use overprovisioning ratios of up to 28% (800GB usable per 1024GB of flash on typical 3 DWPD models), and ZNS SSDs can expose almost all of that capacity to the host system without compromising their ability to deliver high sustained write performance. ZNS SSDs still need some reserve capacity (for example, to cope with failures that crop up in flash memory as it wears out), but Western Digital says we can expect ZNS to allow roughly a factor of 10 reduction in overprovisioning ratios.
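
To make the arithmetic concrete, here is a minimal sketch of the overprovisioning math, using the 1024GB/800GB example above and treating Western Digital's roughly-10x figure as an assumption rather than a specification:

```python
# Back-of-the-envelope overprovisioning math for the example above.
# The ~10x reduction applied below is an assumption based on Western
# Digital's public statements, not a measured figure.

def op_ratio(raw_gb: float, usable_gb: float) -> float:
    """Overprovisioning ratio: spare capacity as a fraction of usable capacity."""
    return (raw_gb - usable_gb) / usable_gb

raw = 1024.0
conventional_usable = 800.0  # typical 3 DWPD enterprise drive from the example above
print(f"conventional OP: {op_ratio(raw, conventional_usable):.0%}")  # ~28%

zns_op = op_ratio(raw, conventional_usable) / 10  # hypothetical ~10x reduction
zns_usable = raw / (1 + zns_op)
print(f"ZNS OP: {zns_op:.1%} -> ~{zns_usable:.0f} GB usable from {raw:.0f} GB of flash")
```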

Better control over write amplification also means QLC NAND is a more viable option for use cases that would otherwise require TLC NAND. Enterprise storage workloads often lead to write amplification factors of 2-5x. With ZNS, the SSD itself causes virtually no write amplification, and well-designed host software can avoid adding much of its own, so the overall effect is a boost to drive lifespan that helps offset the lower write endurance of QLC (and whatever denser NAND comes after it) compared to TLC. Even in a ZNS SSD, QLC NAND is still fundamentally slower than TLC, but that same near-elimination of background data management within the SSD means a QLC-based ZNS SSD can plausibly compete with TLC-based traditional SSDs on QoS metrics even if its total throughput is lower.
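
A rough endurance comparison illustrates why that offset matters. The P/E cycle ratings and write amplification factors in this sketch are illustrative assumptions, not vendor specifications:

```python
# Illustrative endurance comparison. The P/E cycle counts (3000 for TLC,
# 1000 for QLC) and the WAF values are assumptions chosen for the sketch.

def lifetime_host_writes_tb(raw_tb: float, pe_cycles: int, waf: float) -> float:
    """Total host data the drive can absorb over its life: raw capacity * P/E cycles / WAF."""
    return raw_tb * pe_cycles / waf

raw_tb = 1.0
tlc_conventional = lifetime_host_writes_tb(raw_tb, pe_cycles=3000, waf=3.0)  # block SSD, WAF in the 2-5x range
qlc_zns          = lifetime_host_writes_tb(raw_tb, pe_cycles=1000, waf=1.1)  # ZNS, near-unity WAF

print(f"TLC + conventional FTL: ~{tlc_conventional:.0f} TB of host writes")
print(f"QLC + ZNS:              ~{qlc_zns:.0f} TB of host writes")
```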

The other major hardware change enabled by ZNS is a drastic reduction in DRAM requirements. The Flash Translation Layer (FTL) in traditional block-based SSDs requires about 1GB of DRAM for every 1TB of NAND flash. This is used to store the address mapping or indirection tables that record the physical NAND flash memory address that is currently storing each Logical Block Address (LBA). The 1GB per 1TB ratio is a consequence of the FTL managing the flash with a granularity of 4kB. Right off the bat, ZNS gets rid of that requirement by letting the SSD manage whole zones that are hundreds of MB each. Tracking which physical NAND erase blocks comprise each zone now requires so little memory that it could be done with on-controller SRAM even for SSDs with tens of TB of flash. ZNS doesn't completely eliminate the need for SSDs to include DRAM, because the metadata that the drive needs to store about each zone is larger than what a traditional FTL needs to store for each LBA, and drives are likely to also use some DRAM for caching writes - more on this later.
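
A back-of-the-envelope sizing shows why zone-granularity tracking is so much cheaper. The 4-byte mapping entry, 256MB zone size, and 64 bytes of per-zone metadata below are assumptions for illustration:

```python
# Rough sizing behind the "1GB of DRAM per 1TB of flash" rule of thumb,
# versus zone-granularity tracking. Entry sizes and the zone size are
# assumed values, not taken from any particular drive.

FLASH_BYTES     = 32 * 10**12      # a hypothetical 32TB drive
LBA_GRANULARITY = 4 * 1024         # 4kB mapping granularity of a traditional FTL
ENTRY_BYTES     = 4                # assumed size of one mapping-table entry

ZONE_SIZE       = 256 * 1024**2    # assumed zone size of 256MB
ZONE_META_BYTES = 64               # assumed per-zone bookkeeping

ftl_table  = FLASH_BYTES // LBA_GRANULARITY * ENTRY_BYTES
zone_table = FLASH_BYTES // ZONE_SIZE * ZONE_META_BYTES

print(f"4kB-granularity FTL mapping table: ~{ftl_table / 10**9:.0f} GB of DRAM")  # ~31 GB
print(f"zone-granularity tracking:         ~{zone_table / 10**6:.1f} MB")         # small enough for SRAM
```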

Comments

  • Carmen00 - Friday, August 7, 2020 - link

    Fantastic article, both in-depth and accessible, a great primer for what's coming up on the horizon. This is what excellence in tech journalism looks like!
  • Steven Wells - Saturday, August 8, 2020 - link

    Agree with @Carmen00. Super well written. Fingers crossed that one of these “Not a rotating rust emulator” architectures can get airborne. As long as the flash memory chip designers are unconstrained to do great things to reduce cost generation to generation, with the SSD maintaining the fixed abstraction, I’m all for this.
  • Javier Gonzalez - Friday, August 7, 2020 - link

    Great article Billy. A couple of pointers to other parts of the ecosystem that are being upstreamed at the moment are:

    - QEMU support for ZNS emulation (several patches posted in the mailing list)
    - Extensions to fio: Currently posted and waiting for stabilizing support for append in the kernel
    - nvme-cli: Several patches for ZNS management are already merged

    Also, a comment to xZTL is that it is intended to be used on several LSM-based databases. We ported RocksDB as a first step, but other DBs are being ported on top. xZTL gives the necessary abstractions for the DB backend to be pretty thin - you can see the RocksDB HDFS backend as an example.

    Again, great article!
  • Billy Tallis - Friday, August 7, 2020 - link

    Thanks for the feedback, and for your presentations that were a valuable source for this article!
  • Javier Gonzalez - Friday, August 7, 2020 - link

    Happy to hear that it helped.

    Feel free to reach out if you have questions on a follow-up article :)
  • jabber - Friday, August 7, 2020 - link

    And for all that, it will still slow to Kbps and take two hours when copying a 2GB folder full of KB-sized microfiles.

    We now need better, more efficient file systems, not hardware.
  • AntonErtl - Friday, August 7, 2020 - link

    Thank you for this very interesting article.

    It seems to me that ZNS strikes the right abstraction balance:

    It leaves wear leveling to the device, which probably does know more about wear and device characteristics, and the interface makes the job of wear leveling more straightforward than the classic block interface does.

    A key-value store would cover a significant part of what a file system does, and it seems to me that after all these years, there is still enough going on in this area that we do not want to bake it into drive firmware.
  • Spunjji - Friday, August 7, 2020 - link

    Everything up to the "Supporting Multiple Writers" section seemed pretty universally positive... and then it all got a bit hazy for me. Kinda seems like they introduced a whole new problem, there?

    I guess if this isn't meant to go much further than enterprise hardware then it likely won't be much of an issue, but still, that's a pretty keen limitation.
  • Spunjji - Friday, August 7, 2020 - link

    Great article, by the way. Realised I didn't mention that, but I really appreciate the perspective that's in-depth but not too-in-depth for the average tech-head 😁
  • AntonErtl - Saturday, August 8, 2020 - link

    As long as the zone is not divided between file systems, or direct-access databases, it is natural that writes are synchronized and sequenced. And talking to the SSD through one NVMe/PCIe interface means that all writes (even to multiple zones) are sent to the drive in sequence.

    OTOH, you have software and hardware with synchronous interfaces (which wait for some feedback before sending the next request), and in such a setting doing everything through one thread costs throughput.

    So you can either design everything to work with asynchronous interfaces (e.g., SCSI tagged command queuing), at least at all single-thread levels, or you design synchronous interfaces that work with multiple threads. The "write it somewhere, and then tell where you wrote" approach seems to be along the latter lines. What's the status of asynchronous interfaces for NVMe?
