Case Study: SSDs in AnandTech's Server Environment

For the majority of the history of AnandTech we've hosted our own server infrastructure. A benefit of running our own infrastructure is that we're able to gain a lot of hands on experience with enterprise environments that we'd otherwise have to report on from a distance.

When I first started covering SSDs four years ago I became obsessed with the idea of migrating nearly every system over to something SSD based. The first to make the switch were our CPU testbeds. Moving away from mechanical drives ensured better benchmark consistency between runs as any variation in IO load was easily absorbed by the tremendous amount of headroom that an SSD offered. The holy grail of course was migrating all of the AnandTech servers over to SSDs. Over the years our servers seem to die in the following order: hard drives, power supplies, motherboards. We tend to stay on a hardware platform until the systems start showing the signs of their age (e.g. motherboards start dying), but that's usually long enough that we encounter an annoying number of hard drive failures. A well validated SSD should have a predictable failure rate, making it an ideal candidate for an enterprise environment where downtime is quite costly and in the case of a small business, very annoying.

Our most recent server move is a long story for a separate article but to summarize the move, we recently switched hosting providers and data centers. Our hardware was formerly on the east coast and the new datacenter is in the middle of the country. At our old host we were trying out a new cloud platform while our new home would be a mixture of a traditional back-end with a virtualized front-end. With a tight timetable for the move and no desire to deploy an easily portable solution at our old home before making the move we were faced with a difficult task: how do we physically move our servers half way across the country with minimal downtime?

Thankfully our new host had temporary hardware very similar in capabilities to our new infrastructure that they were willing to put the site on as we moved our hardware. The only exception was, as you might guess, a relative lack of SSDs. Our new hardware uses a combination of consumer and enterprise SSDs but our new host only had mechanical drives or consumer grade SSDs on tap (Intel SSD 320s). The fact that we had run the site's databases off of a mechanical drive array for years meant that even a small number of consumer drives should be more than capable of handling the load. Remembering back to some of our earliest lessons in the SSD space: a single solid state drive can offer an order of magnitude better random IO performance than even the fastest hard drives. Sequential performance is typically closer, but with modern 3Gbps SSDs you're still looking at roughly a 100MB/s advantage in sequential throughput over the fastest mechanical drives.

Not wanting to deal with any potential IO performance issues, we decided to deploy a bunch of consumer grade Intel SSDs on our temporary DB servers that our host had on hand. This isn't uncommon, in fact a huge percentage of enterprise workloads are served just fine by consumer SSDs. It's only the absolute heaviest of workloads that demand eMLC or SLC drives. Given that we were going to stay on this platform for a short period of time, there was no need for eMLC/SLC drives. As we know from years of dealing with SSDs, NAND cells have a finite lifespan. Consumer grade MLC NAND is good for about 5000 program/erase cycles per cell, while enterprise grade MLC (eMLC/MLC-HET) can get you over 6x that. While most client workloads won't ever hit 5000 p/e cycles, a heavy enough enterprise workload can definitely reach that point. It's not just the amount of writing you do, it's also how much each write is amplified. Remember that although NAND is programmed at the page level (4KB - 8KB), it can only be erased at the block level (512KB - 2048KB). This imbalance in write/erase granularity means that eventually you'll have to write more to NAND than you've sent to the host (e.g. go to write 8KB but have to read, modify, write an entire 2048KB block as there are no empty blocks to write to). The ratio of NAND to host writes is referred to as write amplification. The combination of workload and write amplification are what determine the longevity of any SSD, but in the enterprise world it's something you actually need to pay attention to.


Write amplification around 1 may be realistic for client (read heavy) workloads, but not in the enterprise

Our host had eight Intel SSD 320s (120GB) on hand that we could use for our temporary database servers. From a performance standpoint these drives should be more than enough to handle our workload, but would they be reliable?

The easiest way to combat write amplification is to increase the amount of spare area on an SSD. NAND that isn't user addressable can be used for background operations and will help ensure that there are empty blocks to be written to as often as possible. If writes happen on empty vs. full blocks, write amplification goes down considerably.

We deployed the Intel SSD 320s partitioned down to 100GB each. Curious as to how they've been holding up over the past ~9 days that they've been running as our primary environment, I installed and ran Intel's SSD Toolbox on our DB servers.

As I mentioned in our review of Intel's SSD 710 you can actually measure things like write amplification, host writes and estimated lifespan using SMART attributes on the latest Intel SSDs. This won't work on the Intel SSD 510, X25-E or first generation X25-M, but on all other Intel SSDs it will.

The attributes of importance are E1, E2, E3 and E4 (these are hex representations of the SMART attribute numbers 225 - 228). E2 - E4 are all timer dependent, while E1 gives you an indication of just how much you've written to the NAND over the life of the drive. I'll get into the timer based counters shortly, but when we originally setup the servers I didn't have the foresight to reset the counters to get an accurate estimate of write amplification on our workload. Instead I'll look at total bytes written and use some internal estimates for write amplification to gauge lifespan.

The eight drives are divided among two database servers running two different applications (one for the main site and one for the forums/ads). The latter more heavily loaded but both are pretty demanding. The four drives are configured in a RAID10 to increase capacity and offer some redundancy. A simple RAID1 of larger drives would be fine, but 120GB drives are all we had on hand at the time. The total number of writes across all of the drives for the past 9 days is documented in the table below:

AnandTech Database Server SSD Workload
SMART Attribute E1 MS SQL Server (Main Site DB) MySQL Server (Forums/Ads DB)
Drive 0 576.28GB 1.03TB
Drive 1 563.28GB 1.03TB
Drive 2 564.44GB 1.13TB
Drive 3 568.03GB 1.13TB

Each drive in the first DB server has seen around 570GB of writes in 9 days, or roughly 63GB/day. The drives in the second DB server have gone through 1.03TB of writes in the same period of time or 114GB/day. Note that both workloads are an order of magnitude greater than an average consumer workload of 5 - 10GB/day. That's not to say that we can't run these workloads on consumer SSDs, we just need to be careful.

With no write amplification we could run on these consumer drives indefinitely. With each MLC NAND cell good for 5000 program/erase cycles, we could write to the drives 5000 times over before we started to lose NAND. Based on the numbers above, we'd blow through a p/e cycle every ~2 days on the first DB server and every ~1 day on the second server.

While we like to assume write amplification is nice and low, in reality it isn't. Intel's own datasheets tell us the worst case write amplification for the 320:

If you divide the column on the right by the column on the left you'll come up with 125 program/erase cycles per cell (if you define 1TB as one trillion bytes). If we assume that each cell is good for 5000 p/e cycles (Intel's 25nm MLC NAND spec) then it means that we're actually writing 40x what we think we're writing. This 40x value gives us an upper bound for write amplification on Intel's SSD 320. That's far lower than the peak theoretical max write amplification of 256 (writing 2048KB for every 8KB write sent to the host), but it's safe to say that Intel's firmware won't let things get that bad.

Write amplification of 40x isn't very good but it's also not very realistic for the majority of workloads. Our database workloads are heavy but they are not perfectly random writes over all LBAs for the life of the drive. Those workloads do exist, but we're simply not an example of one. A more realistic, but still conservative estimate for write amplification in our case would be 10x (just based on some internal estimates for write amplification). The actual write amp is likely less than half that but again, I wanted to be conservative.

Calculating longevity based on this data is pretty simple. Multiply the total bytes written by estimated write amplification and that's how we can scale up/down:

Intel SSD 320 - 120GB - SSD Longevity
  MS SQL Server (Main Site DB) MySQL Server (Forums/Ads DB)
Total Bytes Written (9 days) 576GB 1030GB
Estimated Write Amplification 10x 10x
Drive Capacity 120GB 120GB
Cycles Used 48 85.8
Cycles Used per Day 5.33 9.53
Worst Case P/E Cycles Available 5000 5000
Estimated Lifespan (Days) 937.5 524.3
Estimated Life (Years) 2.57 years 1.44 years

With 10x write amplification we're looking at roughly 2.5 years for one set of drives and 1.4 years for the other set. From a cost standpoint, it's likely cheaper to go the consumer drive route and proactively replace drives compared to jumping to eMLC although that's not always desirable from an uptime perspective. If our write amplification estimates are off (they are) then you can expect something more along the lines of 5 years for the first set of drives and 3 for the second set.

We could combat write amplification by setting aside even more spare area for the drives. Partitioning them as 80GB or even 60GB drives would tangibly reduce write amplification and give us even more time on these consumer drives.

This isn't so much an issue for us as our stay on the 320s is temporary, but it did bring up an interesting question. As enterprise SSD endurance is heavily dependent on write amplification, how would Intel's SandForce based SSD 520 handle enterprise workloads given that its effective write amplification is often times less than 1x?

If you were able to average 0.5x write amplification with Intel's SSD 520, your 5000 p/e cycle MLC NAND would behave more like 10000 p/e cycle NAND. While that's still short of what you get from eMLC, it is perhaps enough to be a better balance of price/performance for many SMB enterprise customers.

It was time to do some investigating...

Intel's SSD 520 in the Enterprise
Comments Locked

55 Comments

View All Comments

  • jeremyshaw - Wednesday, February 8, 2012 - link

    woah... I've been waiting for an article like this for a long time.

    Thank you Anandtech!
  • ckryan - Wednesday, February 8, 2012 - link

    Is AnandTech ever planning on doing a longer period SSD test? A long term testing scenario would make for interesting reading.
  • Anand Lal Shimpi - Wednesday, February 8, 2012 - link

    Technically all of our SSD tests are long term. We're still testing Vertex 2 class drives and I actually still have six Intel X25-M G1s deployed in systems in my lab alone. You only hear about them when things go wrong. Most of the time I feed errors back to the vendors to get fixes put into firmware updates. The fact that you aren't seeing more of this sort of stuff means that things are working well :-P

    But the results of our long term tests directly impact our reviews/recommendations. It's one of the reasons I've been so positive on the Samsung SSD 830 lately. I've been using 830s 24/7 since our review published in September with very good results :)

    Take care,
    Anand
  • Samus - Thursday, February 9, 2012 - link

    I've had an X25-M G1 in my Macbook since 2009, used daily, never a problem. Lack of trim support doesn't really seem to matter unless you're the type the writes/deletes a lot of data.
  • jwilliams4200 - Wednesday, February 8, 2012 - link

    Since you found that the 520 does not really do any better than the 320 for endurance, does this also imply that the Sandforce controller was not able to achieve significant compression on the workload that you fed to it? In other words, Sandforce compression does not work very well on real data as opposed to artificial benchmark data.
  • ckryan - Wednesday, February 8, 2012 - link

    SF is really good at compressing fake data. I suppose some logs could really benefit, but one of my personal SF drives has 10% more raw writes than host writes. I suspect I'm not alone with this either.

    People doing repeated incompressible benches could have WA higher than 1 with SF, but once you install the OS and and programs, every day writes are less compressible than promised it would seem.
  • Anand Lal Shimpi - Wednesday, February 8, 2012 - link

    Keep in mind that only 10% more NAND writes than host writes is *really* good. It's not uncommon to get much, much higher than that with other controllers.

    We did an 8 month study on SF drives internally. The highest write amp we saw was 0.7x. On my personal drive I saw a write amp of around 0.6x.

    Take care,
    Anand
  • jwilliams4200 - Thursday, February 9, 2012 - link

    Baloney!

    You just saw a write amplification of near 1 on this very article. Why do you dodge my question?
  • erple2 - Thursday, February 9, 2012 - link

    I suspect that the workloads that they were testing for with the SF drives internally are not what is reflected in this article.

    That implies, then, that the SF drives have been doing other workloads like acting in desktops and/or laptop duties. For those kinds of things, I suspect that a 0.6-0.7x is more reasonable (assuming there isn't much reading/writing of incompressible data).

    Given that some of the workload may be for mobile applications, and given a strong focus on WDE for laptops, I wonder how that ultimately impacts the write amplification for drives with WDE on them.
  • jwilliams4200 - Thursday, February 9, 2012 - link

    The "8 month study" that he refers to is very hard to believe.

    Does he really expect us to believe that the people in Anand's test lab used these SSDs for 8 months and did not run any benchmarks on them?

    Most benchmarks write easily compressible data, and a lot of it.

    The real way to test the Sandforce compression is to write typical user data to the SSD and monitor the raw write and host write attributes. That experiment has already been done on xtremesystems.org, and the findings were that typical user data bare compresses at all -- at best raw writes were 90% of host writes, but for most data it was 100% or higher. The only thing that got some compression was the OS and application installs, and most people only do those once, so it should not be counted towards user data when estimating endurance.

Log in

Don't have an account? Sign up now