Deduplication and Compression: Bringing Flash to the Masses … of Applications

by | Jul 21, 2015 | Cloud, Data Center security, Professional Services

All-flash arrays (AFAs) are sweeping across the technology landscape like a tidal wave, for enterprises of all sizes. They promise super-fast I/O as well as ease of installation, but flash technology is still relatively expensive from a $/GB perspective. Yet, despite the cost, AFAs are in high demand. What is driving this success?
The short answer: space efficiency.
The long answer: AFAs are shrinking the latency between the CPU and storage infrastructure layers. But the only way to make flash technology cost effective for the wide variety of enterprise application workloads is to use less physical flash capacity to store more actual data. Space efficiency techniques such as deduplication and compression accomplish this quite effectively, essentially allowing you to do more with less.
Deduplication—from one, many
While AFA manufacturers implement deduplication differently, the net result is largely the same: chunks of data are stored once on disk and referenced multiple times across the storage array. This has a “compression effect,” but is not compression, per se (we will discuss compression next). For instance, in a virtual machine environment where many machines run the same operating system, a large percentage of the data will be identical across all of those VMs. Why store those data chunks multiple times, when you can store them once and refer to them multiple times?
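To make the "store once, refer many times" idea concrete, here is a minimal Python sketch of a deduplicating chunk store. It is illustrative only and not any vendor's implementation: the `DedupStore` class, the fixed 4 KiB chunk size, and the use of SHA-256 fingerprints are all assumptions made for the example, and real arrays differ in chunking, fingerprinting, and metadata handling.

```python
import hashlib


class DedupStore:
    """Toy content-addressed chunk store (hypothetical, for illustration)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size  # assumed fixed-size chunks
        self.chunks = {}              # fingerprint -> chunk bytes, stored once
        self.volumes = {}             # volume name -> ordered list of fingerprints

    def write(self, name, data):
        """Split data into chunks and keep only chunks not already stored."""
        refs = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fingerprint, chunk)  # no-op if already present
            refs.append(fingerprint)
        self.volumes[name] = refs

    def read(self, name):
        """Rebuild a volume by following its chunk references."""
        return b"".join(self.chunks[fp] for fp in self.volumes[name])

    def logical_bytes(self):
        """Bytes the hosts believe they have written."""
        return sum(len(self.chunks[fp])
                   for refs in self.volumes.values() for fp in refs)

    def physical_bytes(self):
        """Bytes actually occupied by unique chunks."""
        return sum(len(chunk) for chunk in self.chunks.values())


# Three VMs sharing the same 8-chunk "OS image", each with one unique data chunk.
os_image = b"".join(bytes([i]) * 4096 for i in range(8))
store = DedupStore()
for n in range(3):
    store.write(f"vm{n}", os_image + bytes([100 + n]) * 4096)

print(store.logical_bytes(), store.physical_bytes())
```

In this toy run the three VMs write 27 chunks in total but only 11 unique chunks are stored, a ratio of roughly 2.45:1. The shared operating system image accounts for all of the savings, which mirrors the VM example above.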
The “deduplication ratio” achieved will have a big impact on the ability to mitigate the higher flash costs. Not all datasets will be reduced equally, and some datasets don’t reduce well at all. I’ve seen ratios as good as 150:1, and I’ve seen them lower than 2:1. Be wary of array configurations and quotes that depend on specific deduplication ratios to achieve cost targets, unless a detailed analysis of your data has been performed to determine your expected deduplication ratio. If your dataset doesn’t contain enough duplicate data (or is encrypted or compressed before being written to the array), you’ll have to size your array based on the real, raw capacity of the flash, which will make it more expensive.
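The sizing risk described above is simple arithmetic, sketched here in Python. The function name `raw_capacity_needed` and the figures in the example (100 TB of logical data, a quoted 5:1 ratio versus an achieved 2:1) are hypothetical and chosen only to illustrate the calculation; the sketch also ignores RAID, metadata, and spare-capacity overheads.

```python
def raw_capacity_needed(logical_tb, reduction_ratio):
    """Raw flash (TB) needed to hold logical_tb at a given reduction ratio.

    reduction_ratio is expressed as N for "N:1". Ratios below 1:1 are
    rejected here as a simplifying assumption.
    """
    if reduction_ratio < 1.0:
        raise ValueError("reduction ratio must be at least 1:1")
    return logical_tb / reduction_ratio


# Hypothetical example: a quote assumes 5:1, but the data only reduces 2:1.
logical_tb = 100
quoted = raw_capacity_needed(logical_tb, 5.0)    # capacity the quote is built on
achieved = raw_capacity_needed(logical_tb, 2.0)  # capacity the data really needs
print(f"quoted: {quoted} TB raw, actually required: {achieved} TB raw")
```

With these assumed numbers the array would need 50 TB of raw flash rather than the 20 TB in the quote, which is why an analysis of your own data matters before accepting a ratio-dependent configuration.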
The fact that manufacturers perform deduplication in various ways provides a range of options. Some perform the function in-line (as the data is ingested and written), while others perform it post-process (at some point after the full data has been written to disk). And some perform a combination of both techniques. None of these approaches is right or wrong; the technique that works best for you will depend on your workloads, your datasets, and your architectural preferences.
The role of compression
AFAs also utilize good, old-fashioned compression to achieve space savings. Compression and deduplication are not mutually exclusive. In fact, they are often combined for greater space efficiency. Most AFAs on the market can auto-detect whether data is compressible, leaving incompressible data in its original format so that CPU cycles are not wasted on it.
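One simple way to picture the auto-detection described above is a trial compression: compress the block, and keep the compressed form only if it saves a meaningful amount of space. The Python sketch below assumes this approach; the `store_block` and `load_block` helpers, the use of zlib, and the 10% savings threshold are illustrative choices, not a description of how any particular array decides.

```python
import os
import zlib


def store_block(block, min_saving=0.10):
    """Return (is_compressed, payload) for a block of bytes.

    The block is kept compressed only if that saves at least min_saving
    (an assumed threshold) of its original size; otherwise it is stored raw.
    """
    compressed = zlib.compress(block, 1)  # level 1: favor speed over ratio
    if len(compressed) <= len(block) * (1 - min_saving):
        return True, compressed
    return False, block


def load_block(is_compressed, payload):
    """Reverse store_block."""
    return zlib.decompress(payload) if is_compressed else payload


text_like = b"2015-07-21 INFO request served\n" * 500  # highly repetitive
random_like = os.urandom(4096)                          # stands in for encrypted data

for label, block in (("text", text_like), ("random", random_like)):
    is_compressed, payload = store_block(block)
    print(label, is_compressed, len(block), len(payload))
```

Repetitive data such as log text is stored compressed, while random bytes (a stand-in for already encrypted or compressed data) fail the threshold and are stored as-is.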
In some cases, compression comes at a performance cost, as it uses CPU cycles that otherwise would be used for data throughput. However, even after this performance ‘hit,’ the performance levels experienced will still eclipse those of traditional HDD arrays by multiples (from a latency perspective).
Sometimes an array is still an array
Performance is only one aspect of application management; one still has to protect the data and get it off-box or possibly off-site. Most AFAs have advanced storage capabilities, such as point-in-time snapshots, cloning, and replication. The integration of these features with your applications remains the same as with traditional arrays, and must be considered. Leveraging cloning capabilities can also add significantly to space efficiency on an AFA, referencing entire datasets multiple times for use cases such as dev/test, QA, or recovery.
Another point to keep in mind is that backing up data from an AFA will not necessarily be faster than backing up from a traditional HDD array, as this type of data access is sequential. In this case, flash drives don’t pump out data any faster than HDDs. However, random data access during backups will be much faster, because with flash there is no competition for head seeks or spinning media. If you’ve ever experienced an application slowdown during a backup window, AFAs can help by removing the need to move disk heads all over the place.
So why are AFAs better than traditional arrays?
When used for specific workloads that combine performance and capacity requirements, AFAs beat out HDDs in several ways:

  • They can cost less with respect to initial cost of acquisition (assuming the data can be stored efficiently via deduplication and compression)
  • They provide better scalability and reliability over the lifecycle of the application, from an IOPS and latency perspective
  • They can significantly reduce the cost of administration and need for ongoing performance tuning
  • They provide better ability to handle unexpected peaks in random workloads or changes in application workload profiles

Deduplication and compression can make the choice of AFA vs HDD much easier, if your datasets allow you to realize enough space efficiencies to get the $/GB of the AFA close enough to that of the traditional array. Even if you don’t get all the way there, the tremendous performance benefits (and never having to point at the storage when there is a performance issue) might very well push you over the top.
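The $/GB comparison above can be worked through in a few lines of Python. The helper names and the prices used here (raw flash at $8/GB, HDD at $2/GB) are hypothetical placeholders picked for round numbers, not market figures; substitute the values from your own quotes.

```python
def effective_cost_per_gb(raw_cost_per_gb, reduction_ratio):
    """Cost per GB of stored data after deduplication and compression."""
    return raw_cost_per_gb / reduction_ratio


def break_even_ratio(afa_raw_cost_per_gb, hdd_cost_per_gb):
    """Reduction ratio at which AFA effective $/GB equals the HDD array's $/GB."""
    return afa_raw_cost_per_gb / hdd_cost_per_gb


# Hypothetical prices, for illustration only.
afa_raw, hdd = 8.0, 2.0
print("break-even ratio:", break_even_ratio(afa_raw, hdd))
for ratio in (2.0, 4.0, 5.0):
    print(f"{ratio}:1 ->", effective_cost_per_gb(afa_raw, ratio), "$/GB")
```

With these assumed prices the AFA matches the HDD array at 4:1; a dataset that only reduces 2:1 leaves the flash at twice the HDD cost per GB, and it is then the performance benefits that have to justify the difference.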
Getting started
If “sticker shock” is keeping all-flash arrays out of your technology strategy, don’t let the advertised “price per GB” of raw flash stop you. With the space efficiency technologies offered in today’s AFAs, and an eye towards balancing performance and capacity over the long haul, AFAs are a strong fit for the storage needs of an increasingly broad spectrum of applications in your data center. By evaluating your data and your existing infrastructure, you can determine what ROI you can achieve through the use of AFAs.
Have questions? Feel free to reach out to me, and we can talk about your specific situation.