DEDUPE ALL THE DATA

Fun fact: This post will be markety in nature.

I just got back from the Gartner Data Center Symposium, where we (SimpliVity) were a Platinum sponsor. This was my first time at a Gartner event, and I found the overall atmosphere of the show to be very well put together and effective. One of the really nice things about this show was that the sessions did not compete with the solutions exchange vendors. Participants don’t have to sacrifice a session in order to go talk with the vendors they are interested in.

One other aspect of the show I really liked was the ability to speak to both end users/decision makers and the Gartner analysts. I found myself having much longer one-on-one or small group discussions on our technology and how it would fit into the organization, instead of just running through a random pitch/demo. For me, the evangelization aspect of working the show floor is what I tend to enjoy. That said, I was doing a fair amount of demonstrations of our new software release on the OmniCube system.

This of course leads me into the thrust of this post and discussion.

Dedupe All The Data.

There are three primary means of reducing the storage footprint for virtualized workloads in the datacenter today: deduplication, compression, and thin provisioning. All three assist in reducing the physical storage required, all three have matured to the point of wide adoption over the years, and the technology behind them is well understood and accepted across the IT spectrum.

Deduplication: It’s Not Just For Space Reduction

For some storage platforms, dedupe takes place as a post-process action, coming in after the data has been written and reducing the total physical capacity after the fact. This is helpful for storage space reduction, but it has little to no benefit for reducing IO operations. In fact, I’d argue it generates more IO within the array, as the work that has to be done to hydrate/dehydrate the data requires significant overhead. This overhead impacts performance, and thus most storage systems today cannot do inline deduplication of data. As with tiering and data progression actions (where data is moved about the array during a scheduled period or in real time), post-process dedupe can lay some serious hammertime onto your underlying disk infrastructure.
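To make the IO argument concrete, here is a minimal Python sketch that counts backend disk operations for the two approaches. The cost model is an assumption for illustration only (one IO per block written or read, and a post-process pass that reads every block back once to fingerprint it). Real arrays differ, and the function names are hypothetical.

```python
import hashlib

def post_process_io(blocks):
    """Count backend IOs when every block lands on disk first and is deduped later."""
    writes = len(blocks)  # every incoming block is written as-is
    reads = len(blocks)   # the later dedupe pass reads each block back to fingerprint it
    return writes + reads

def inline_io(blocks):
    """Count backend IOs when duplicates are detected before they reach disk."""
    seen = set()
    writes = 0
    for block in blocks:
        digest = hashlib.sha256(block).digest()
        if digest not in seen:  # only never-seen content costs a disk write
            seen.add(digest)
            writes += 1
    return writes

# Ten 4 KiB blocks, but only two distinct contents.
blocks = [b"A" * 4096, b"B" * 4096] * 5
print(post_process_io(blocks), inline_io(blocks))  # prints "20 2"
```

Even in this generous model, where the post-process pass never rewrites anything, the inline path touches the disks far less often for the same final footprint.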

Now don’t get me wrong, there is a lot to be said for space reduction at the storage level. The All Flash array vendors require it in order to be efficient and bring the cost of their flash down in line with the costs of disk; if they can’t, it blows the cost model for flash as a general purpose storage platform. Then there is deduplication used in backup and replication technologies. The need to reduce RPO/RTO for backup windows given the data explosion is significant. That Weekend Full that takes 64 hours to back up 50TB of data isn’t cutting it for backup, and it most certainly isn’t going to work for a recovery point objective of 4 hours. So we get backup appliances that incorporate dedupe so the same data isn’t backed up redundantly. Going further, replication at the storage level can really benefit from deduplication as well: if you only have to send data across the WAN once, you’re much better off than having to send the same blocks over and over again. Bandwidth is expensive.
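The replication point can be sketched in a few lines of Python. This is an illustrative model, not how any particular product replicates: the remote site is a plain dictionary keyed by SHA-256 fingerprint, the helper names are hypothetical, and the small cost of exchanging fingerprints is ignored.

```python
import hashlib

def fingerprint(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

def replicate(source_blocks, remote_store):
    """Send only blocks the remote site lacks; return payload bytes sent over the WAN."""
    sent = 0
    for block in source_blocks:
        key = fingerprint(block)
        if key not in remote_store:  # the remote already holds everything else
            remote_store[key] = block
            sent += len(block)
    return sent

remote = {}
monday = [b"os" * 2048, b"app" * 1000, b"data-v1" * 500]
tuesday = [b"os" * 2048, b"app" * 1000, b"data-v2" * 500]

first = replicate(monday, remote)    # everything is new: 10596 bytes
second = replicate(tuesday, remote)  # only the changed block: 3500 bytes
print(first, second)
```

The second run ships only the block that changed, which is the whole argument for deduplicating before the data reaches the WAN.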

All This Brings Me to What If?

What if all of our data was deduplicated before it hit the disk structure? To do that you have to be able to do inline deduplication. Taking it further, you need to do it at a very fine-grained level (aka 4-8k blocks). Just so we are clear, this can’t be done post-process, and it’s not enough to do it on just a single tier of storage within the array. You need to be able to provide dedupe once and forever across all tiers of storage (DRAM, SSD, HDD), across all classifications of data (primary, backup, WAN, and cloud), and do it on a global scale (across all locations that contain your data). Oh, and for fun let’s throw in the requirement that there will be no penalty on performance at any level of that process. One last thing: throw compression into the mix as well, because that would be sweet. Sweet like a Choco Taco made with a bacon shell sweet. If you have this functionality, you can move beyond simple reductions in space on the storage tier and into the realm of Data Virtualization.
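As a rough illustration of what inline, fine-grained dedupe plus compression means on the write path, here is a minimal Python sketch. It is not the OmniCube implementation: the class name, the 4 KiB block size, SHA-256 fingerprints, and zlib compression are all assumptions chosen to keep the example small.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed 4 KiB granularity, the low end of the 4-8k range above

class InlineDedupeStore:
    def __init__(self):
        self.blocks = {}         # fingerprint -> compressed block (stands in for disk)
        self.logical_bytes = 0   # what the writers think they stored
        self.backend_writes = 0  # blocks that actually had to be written

    def write(self, data: bytes):
        """Split data into blocks, store only unseen blocks, return the list of fingerprints."""
        recipe = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            key = hashlib.sha256(block).hexdigest()
            if key not in self.blocks:  # dedupe decision happens before any disk write
                self.blocks[key] = zlib.compress(block)
                self.backend_writes += 1
            recipe.append(key)
            self.logical_bytes += len(block)
        return recipe

    def read(self, recipe):
        return b"".join(zlib.decompress(self.blocks[k]) for k in recipe)

    def physical_bytes(self):
        return sum(len(b) for b in self.blocks.values())

store = InlineDedupeStore()
payload = (b"x" * BLOCK_SIZE) * 8 + (b"y" * BLOCK_SIZE) * 8
recipe = store.write(payload)
print(store.backend_writes, store.logical_bytes, store.physical_bytes())
```

Sixteen logical blocks arrive, but only two backend writes happen, and those two are stored compressed. A real system would also have to handle fingerprint index sizing, crash consistency, and garbage collection, none of which is shown here.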

Data Virtualization Engine

When I first started working with our OmniCube platform, I didn’t quite get the true benefit of what we call our Data Virtualization Engine. I thought, hey, it’s cool that we dedupe and compress data, that will save tons of space. What I didn’t fully grasp was the drastic amount of IO we would essentially eradicate. With inline deduplication and compression that occurs before the data moves down to the traditional storage disk system, the ability to simply not do IO is presented. As I’m fond of pointing out, the quote from Gene Amdahl is: “The Best IO is the one you don’t have to do”, and I can think of no other platform that illustrates this functionality like the OmniCube.

So, what am I looking at in the image above? Essentially, the data structure layout on the OmniCube system. Data is broken out into logical groupings of VM Data, Local Backups, and Remote Backups (data replicated from one local OmniCube Federation to another). The deduplication and compression ratios are then broken out, and when multiplied they present an Efficiency quotient. In the example above, it’s 164:1. And while a total of nearly 72TB of data is being stored on a pair of OmniCubes that can physically hold 28.4TB, the amount of unique data actually written has only been 445GB (aka 1.5% of physical capacity). The result is called Savings, i.e. eradicated IO, or IO we have not had to write. It goes a step further as well: when you don’t have to write data more than once, new writes that are not truly unique remain unwritten to the backend disks, though they are still acknowledged to the VM. This increases performance far beyond what a traditional storage platform can achieve, even with hundreds of spindles. A write that you don’t have to do will always be faster than the one you must do, regardless of how fast your underlying storage is. This is why we call the equation Efficiency, and the result Savings.
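For anyone who wants to check the arithmetic, here is a short Python sketch using the figures quoted above. Two assumptions are mine: the units are binary (1TB = 1024GB), and the logical total is taken as a flat 72TB even though the real figure is only "nearly" that, so the computed ratio lands close to, not exactly on, the 164:1 shown. The ratios passed to `combined_efficiency` are made up purely to show the multiplication.

```python
GB_PER_TB = 1024  # assumption: binary units; the post does not say which it uses

logical_gb = 72 * GB_PER_TB              # "nearly 72TB" stored logically, rounded up
physical_capacity_gb = 28.4 * GB_PER_TB  # what the pair of systems can physically hold
unique_written_gb = 445                  # unique data actually written

efficiency = logical_gb / unique_written_gb
share_of_capacity = unique_written_gb / physical_capacity_gb
savings_gb = logical_gb - unique_written_gb

def combined_efficiency(dedupe_ratio, compression_ratio):
    """Efficiency is the product of the two ratios, as described above."""
    return dedupe_ratio * compression_ratio

print(f"efficiency: {efficiency:.1f}:1")
print(f"share of physical capacity used: {share_of_capacity:.2%}")
print(f"savings: {savings_gb} GB never written")
print(combined_efficiency(10, 2))  # made-up ratios: 10:1 dedupe x 2:1 compression = 20
```

With these inputs the efficiency comes out a little under 166:1 and the capacity share at roughly 1.5%, which is in line with the numbers in the post once the rounding of the logical total is taken into account.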

Yeah Right, Lab Queen.

Ok, so you’re probably asking yourself: what was the makeup of those machines? I bet it was just the same machine over and over again. Well, for the record, it’s not. It’s several VMs with a data creation script that generates change rate data over the day, at roughly a 2:1 ratio of Windows to Linux machines of varying size. The larger machines are roughly 470GB when compressed, with the smaller systems clocking in at 3GB, 16GB, and 40GB. So now for a bit of further explanation about the numbers. One thing helps with the seemingly unrealistic Efficiency ratios: we don’t have to write the same block twice in the system. I can also take a backup of a virtual machine 100 times, and we will count that as part of the logical calculation. Now before you cry foul, keep this one thing in mind. If you were going to back those machines up through traditional means, you would have had to read and write out that data structure. We don’t, and since we don’t, it’s our position that you should understand just how efficient the Data Virtualization is.
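Why 100 backups inflate the logical number without adding physical writes can be shown with a small Python sketch. It assumes a backup is nothing more than a copy of the list of block fingerprints; that is my simplification for illustration, and the names (`disk`, `catalog`, `write_vm`, `backup`) are hypothetical.

```python
import hashlib

BLOCK = 4096
disk = {}     # fingerprint -> block, each unique block written once
catalog = {}  # name -> list of fingerprints (VMs and backups alike)

def write_vm(name, data):
    keys = []
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        key = hashlib.sha256(block).hexdigest()
        disk.setdefault(key, block)  # write only if this content is new
        keys.append(key)
    catalog[name] = keys

def backup(vm, backup_name):
    catalog[backup_name] = list(catalog[vm])  # copy references, never the data

def logical_bytes():
    return sum(len(disk[k]) for keys in catalog.values() for k in keys)

def physical_bytes():
    return sum(len(b) for b in disk.values())

# One VM made of four distinct 4 KiB blocks, then 100 backups of it.
write_vm("vm1", b"".join(bytes([i]) * BLOCK for i in range(4)))
for n in range(100):
    backup("vm1", f"vm1-backup-{n}")

print(logical_bytes() // physical_bytes())  # prints 101
```

The physical footprint stays at four blocks while the logical total grows with every backup. A traditional backup would have read and rewritten those blocks each time, which is the comparison the ratio is meant to capture.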

This post is getting a little long-winded, and in my usual fashion, it’s getting a little divergent from the general thrust of what I was looking to write. But the one aspect of this that should stick with you, dear reader, is that you cannot do what I’ve shown you above with traditional storage platforms, especially if they claim dedupe but only do it across a single tier of storage or as a post-process action.

The sister post to this one will go further into the aspects of the backup and replication and why I tend to say that Backup is Broken. You will have to stay tuned for that one. As always more information about SimpliVity can be found here.

Edited to add per the point brought up in comments: We are not simply writing a single copy of this primary data; it would not be proper to do so. In the next post about backup, I’ll go into more depth about the underlying data protection aspects. To put it simply, we will have more than a single copy of primary data, and it will not always reside on the same system. Full storage HA comes with a secondary OmniCube unit, or multi-site implementations.

2 thoughts on “DEDUPE ALL THE DATA”

  1. December 15, 2013 at 9:59 am

    Great post, and the more I read about SimpliVity, the more I like it, but… I have to disagree on one point: backup.

    The goal of backing data up is to have multiple physical (as opposed to logical and deduped) copies of that data. If SimpliVity doesn’t keep at least a second physical copy of the data elsewhere, be it another node or site or cloud, then it isn’t a backup!

    I may not have fully understood backups with SimpliVity, though.
    Please enlighten me if so…

  2. December 15, 2013 at 10:48 am

    I will add something to the post about it. We do indeed keep secondary copies of data. I may not have articulated that properly.
