Mass Hot Storage: Knox

For OpenRack v1 to work, a new server design was needed that implemented bus bar connectors at the back. An updated Freedom-based chassis minus the PSU would leave a fair bit of empty space, and simply filling that space with 3.5" HDDs would be wasteful, as most of Facebook's workloads are not that storage hungry. The solution proved to be very similar to the power shelf: group the additional node storage outside the server chassis on a purpose-built shelf. Knox was born.


OCP Knox with one disk sled out (Image Courtesy The Register)

Put simply, Knox is a regular JBOD disk enclosure built for OpenRack, attached to the host bus adapters of surrounding Winterfell compute nodes. It differs from standard 19" enclosures in two main ways: it fits 30 3.5" hard disks, and it makes maintenance easy. To replace a disk, simply slide out the disk sled, pop open the disk bay, swap the disk, close the bay, and slide the sled back into the rack. Done.

Object Storage

Seagate has contributed the specification of a "Storage device with Ethernet interface", better known in its productized form as Seagate Kinetic. These hard disks cut out the middle man by providing an object storage stack directly on the disk; in OCP terms, this means a Knox node would no longer need to be attached to a compute node but could be connected directly to the network. Seagate, together with Rausch Netzwerktechnik, has released the 'BigFoot Storage Object Open', a new chassis designed for these hard disks, with 12x 10GbE connectivity in a 2 OU form factor.
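
To make the access model concrete, here is a minimal sketch of what talking to an Ethernet-attached key/value drive looks like from the application's point of view. The KineticLikeClient class, its method names and the port number are illustrative assumptions, not the actual Seagate Kinetic client library or wire protocol; a real drive would answer these calls over the network rather than from an in-memory dict.

```python
# Illustrative sketch only: a toy stand-in for an Ethernet key/value drive.
# Class, methods and port are assumptions, not the real Kinetic API.

class KineticLikeClient:
    """Toy client that addresses an object-storage drive directly by IP."""

    def __init__(self, drive_ip: str, port: int = 8123):
        self.drive_ip = drive_ip
        self.port = port
        self._store = {}  # stands in for the drive's on-platter key/value store

    def put(self, key: bytes, value: bytes) -> None:
        # On real hardware this would be a network round trip to the drive;
        # there is no filesystem, HBA or head node in the data path.
        self._store[key] = value

    def get(self, key: bytes) -> bytes:
        return self._store[key]

    def delete(self, key: bytes) -> None:
        self._store.pop(key, None)


if __name__ == "__main__":
    drive = KineticLikeClient("10.0.0.42")            # hypothetical drive IP
    drive.put(b"photos/2015/cat.jpg", b"...bytes...")
    print(len(drive.get(b"photos/2015/cat.jpg")))
```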

The concept of the BigFoot system is not unknown to Facebook either, as they have released a system with a similar goal, called Honey Badger. Honey Badger is a modified Knox enclosure that pairs with a compute card -- Panther+ -- to provide (cold) object storage services for pictures and the like. Panther+ is fitted with an Intel Avoton SoC (from the C2350 for low-end up to the C2750 for high-end configurations), up to four enabled DDR3 SODIMM slots, and mSATA/M.2 SATA3 onboard storage interfaces. This plugs onto the Honey Badger mainboard, which in turn contains the SAS controller, SAS expander, AST1250 BMC, two miniSAS connectors and a receptacle for a 10GbE OCP mezzanine networking card. Facebook has validated two configurations for the Honey Badger SAS chipset: one based on the LSI SAS3008 controller and LSI SAS3x24R expander, the other consisting of the PMC PM8074 controller joined by the PMC PM8043 expander.

Doing this eliminates the need for a 'head node', usually a Winterfell system (Leopard will not be used by Facebook to serve up Knox storage), replacing it with the more efficient Avoton design on the Panther+ card. It is another good example of modular, lock-in-free hardware design, and another dollar saved.

Cold Storage

A slightly modified version of Knox is used for cold storage, with specific attention paid to running the fans slowly and spinning up a disk only when required.
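
The spin-down part can be illustrated with a minimal sketch, assuming Linux hosts with hdparm available; the device list and timeout value below are assumptions, and this is simply one common way to request standby on idle drives, not necessarily what Facebook's cold storage software actually does.

```python
# Minimal sketch, assuming Linux + hdparm: give idle cold-storage disks an
# aggressive standby timeout so they spin down when not in use.
import subprocess

COLD_DISKS = ["/dev/sdb", "/dev/sdc", "/dev/sdd"]  # hypothetical device list
STANDBY_TIMEOUT = "241"  # hdparm -S 241 == enter standby after 30 minutes idle


def set_spindown(device: str, timeout: str = STANDBY_TIMEOUT) -> None:
    """Ask the drive to spin down after the given idle period."""
    subprocess.run(["hdparm", "-S", timeout, device], check=True)


if __name__ == "__main__":
    for disk in COLD_DISKS:
        set_spindown(disk)
```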

Facebook has meanwhile built another cold storage solution, this time using an OpenRack filled with 24 magazines of 36 cartridge-like containers, each of which holds 12 Blu-ray discs. Apply some maths and you get a maximum capacity of 10,368 discs, and since a single BD-XL disc holds up to 128GB, you have a very dense data store of up to 1.26PB. Compared to hard disks, optical media touts greater reliability: Blu-ray discs have a life expectancy of 50 years, and some discs may even live on for a century.
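
The quoted figures follow directly from the layout; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the rack capacity quoted above.
magazines = 24
cartridges_per_magazine = 36
discs_per_cartridge = 12
disc_capacity_gb = 128  # BD-XL

discs = magazines * cartridges_per_magazine * discs_per_cartridge
capacity_gb = discs * disc_capacity_gb

print(discs)                    # 10368 discs
print(capacity_gb / 1024 ** 2)  # 1.265625 -> the ~1.26PB quoted above (binary units)
```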

The rack resembles a jukebox: whenever data is requested from a certain disc, a robot arm takes the cartridge to the top, where another system slides the right disc into one of the Blu-ray readers. This system serves a simple purpose: storing as much data as possible in a single rack, with access latency not being hugely important.

Comments

  • Kevin G - Tuesday, April 28, 2015 - link

    Excellent article.

    The efficiency gains are apparent even using a suboptimal PSU for benchmarking. (Though there are repeated concurrency values in the benchmarking tables. Is this intentional?)

    I'm looking forward to seeing more compute node hardware based around Xeon-D, ARM and potentially even POWER8 if we're lucky. Options are never a bad thing.

    Kind of odd to see the Knox mass storage units, I would have thought that OCP storage would have gone the BackBlaze route with vertically mounted disks for easier hot swap, density and cooling. All they'd need to develop would have been a proprietary backplane to handle the Kinetic disks from Seagate. Basic switching logic could also be put on the backplane so the only external networking would be high speed uplinks (40 Gbit QSFP+?).

    Speaking of the Kinetic disks, how is redundancy handled with a network facing drive? Does it get replicated by the host generating the data to multiple network disks for a virtual RAID1 redundancy? Is there an aggregator that handles data replication, scrubbing, drive restoration and distribution, sort of like a poor man's SAN controller? Also do the Kinetic drives have two Ethernet interfaces to emulate multi-pathing in the event of a switch failure (quick Googling didn't give me an answer either way)?

    The cold storage racks using Blu-ray discs in cartridges doesn't surprise me for archiving. The issue I'm puzzled with is how data gets moved to them. I've been under the impression that there was never enough write throughput to make migration meaningful. For a hypothetical example, by the time 20 TB of data has been written to the discs, over 20 TB has been generated that'd be added to the write queue. Essentially big data was too big to archive to disc or tape. Parallelism here would solve the throughput problem but that gets expensive and takes more space in the data center that could be used for hot storage and compute.

    Do the Knox storage and Wedge networking hardware use the same PDU connectivity as the compute units?

    Are the 600 mm wide racks compatible with US Telecom rack width equipment (23" wide)? A few large OEMs offer equipment in that form factor and it'd be nice for a smaller company to mix and match hardware with OCP to suit their needs.
  • nils_ - Wednesday, April 29, 2015 - link

    You can use something like Ceph or HDFS for data redundancy which is kind of like RAID over network.
  • davegraham - Tuesday, April 28, 2015 - link

    Also, Juniper Networks has an ONIE-compliant OCP switch called the OCX1100; they're the only Tier 1 switch manufacturer (vs. Cisco, Arista, Brocade) to provide such a device.
  • floobit - Tuesday, April 28, 2015 - link

    This is very nice work. One of the best articles I've seen here all year. I think this points at the future state of server computing, but I really wonder if the more traditional datacenter model (VMware on beefy blades with a proprietary FC-connected SAN) can be integrated with this massively-distributed webapp model. Load-balancing and failover are presumably done in the app layer, removing the need for hypervisors. As pretty as Oracle's recent marketing materials are, I'm pretty sure they don't have an HR app that can be load-balanced on the app layer alongside an expense app and an ERP app. Maybe in another 10 years. Then again, I have started to see business suites where they host the whole thing for you, and this could be a model for their underlying infrastructure.
  • ggathagan - Tuesday, April 28, 2015 - link

    In the original article on these servers, it was stated that the PSU's were run on 277v, as opposed to 208v.
    277v involves three phase power wiring, which is common in commercial buildings, but usually restricted to HVAC-related equipment and lighting.
    That article stated that Facebook saved "about 3-4% of energy use, a result of lower power losses in the transmission lines."
    If the OpenRack carries that design over, companies will have to add the cost of bringing 277v power to the rack in order to realize that gain in efficiency.
  • sor - Wednesday, April 29, 2015 - link

    208 is 3 phase as well, generally 3x120v phases, with 208 tapping between phases or 120 available to neutral. It's very common for DC equipment. 277 to the rack IS less common, but you seemed to get hung up on the 3 phase part.
  • Casper42 - Monday, May 4, 2015 - link

    3 phase restricted to HVAC?
    That's ridiculous, I see 3 Phase in DataCenters all the time.
    And Server vendors are now selling 277vAC PSUs for exactly this reason that FB mentions. Instead of converting the 480v main to 220 or 208, you just take a 277 feed right off the 3 phase and use it.
  • clehene - Tuesday, April 28, 2015 - link

    You mention a reported $2 Billion in savings, but the article you refer to states $1.2 Billion.
  • FlushedBubblyJock - Tuesday, April 28, 2015 - link

    One is the truth and the other is "NON Generally Accepted Accounting Procedures" aka it's lying equivalent.
  • wannes - Wednesday, April 29, 2015 - link

    Link corrected. Thanks!
