I was doing some basic research on creating a PowerShell script that would determine data change rates for a VM or a group of VMs when I came across this post by @Cocquyt on the NetApp forums. It got my brain into one of those "I need to write about this" loops.
How to determine which VM’s (NFS files) are contributing the most to snapshot deltas? Hi – we have a 3.5Tb NFS datastore running about 30 vmware virtual machines. We try to maintain 21 days of daily netapp snapshots, but lately the daily deltas are > 100gb/day and its becoming challenging to keep 21 days without growing the volume. (the data:snapshot space ratio is about 1:2 – the snapshots take twice as much space as the actual VM images – and this is with dedup ON) How can we best determine which set of the VMs is contributing the most to the daily snapshot delta (that 100Gb)? Armed with this information we can then make decisions about potentially storage vMotioning VMs to other datastores to meet the 21day retention SLA.
Wow, that sounds like a massive pain in the ass to deal with. Dig a little further into the post and you can see just how convoluted the solution could be, especially when you look at the recommendations made by the forum posters. In my past life as a simple admin/engineer running storage and virtualization, I could spend 20% of my day doing exactly this type of work: trying to solve a problem via Google, reading and posting in forums, asking friends, calling the vendor or VAR. Essentially, I was wasting productivity that should have been dedicated to enabling the business on problems created by the limitations and complexity of the tools I had at my disposal. To me, that lands squarely on the list of things I consider “Not Optimal” and, for the most part, overly complex.
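To put rough numbers on the forum poster's problem, here is a quick back-of-the-envelope sketch in Python. It is my own simplification, not anything from the forum thread: it assumes each day's delta is retained in full for the whole retention window and ignores dedup and blocks shared between snapshots, so treat the result as a rough upper bound. The function names are made up for illustration.

```python
def snapshot_space_gb(daily_delta_gb, retention_days):
    """Rough upper bound on snapshot space for a retention window.

    Assumption: every day's delta is kept in full and nothing is shared
    between snapshots (no dedup savings).
    """
    return daily_delta_gb * retention_days


def max_daily_delta_gb(snapshot_budget_gb, retention_days):
    """Largest average daily delta that fits a given snapshot space budget."""
    return snapshot_budget_gb / retention_days


if __name__ == "__main__":
    # Figures from the forum post: roughly 100 GB/day, 21 days of retention.
    needed = snapshot_space_gb(100, 21)
    print(f"Snapshot space for 21 days at 100 GB/day: about {needed} GB")
```

Even with this crude model, the retention target alone accounts for a couple of terabytes of snapshot space, which is why knowing which VMs drive the daily delta matters so much.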
Just so you know, I’m not picking on NetApp here. The issues faced in that post apply to a large swath of storage companies and their arrays. The post is from 2011, and I’m not a NetApp expert, so I don’t know how much better the process has gotten. But to me it illustrates a pain point for many organizations running legacy storage platforms whose foundational technologies were designed in the early to mid-90s.
When those systems were architected, designed, and deployed years ago they were probably state of the art, and at that time, the convoluted, complex, and cumbersome configuration was considered part of doing business.
Few if any systems fit my 3E’s of being Efficient, Effective, or Easy when it came to storage deployments at that time. Great consideration had to be given to how to lay out the underlying disk, the size and speed of the disks, the RAID sets, LUN sizes, zoning, HBA firmware and compatibility, switch settings, multipathing software, thin provisioning limits, page sizes, snapshots, clones, etc.
What is interesting to me today is that a large number of customers still rely on those legacy platforms for their production environments. I’m guessing that if they have a 5-year-old platform in use today, they are probably looking at the next-generation version of those products to replace it. Yet even though the model numbers may have changed and the bezels are newer, some of the limitations from the initial design of those systems will carry forward. If they are lucky, some of the complexities of the past have been abstracted away: perhaps the UX hides many of the manual operations, perhaps the algorithms behind specific functions are quicker, and most assuredly the processing speeds and the raw IOPS have gone up as well. But what has not changed are the flaws associated with architectures that were created long ago, and the bottom line is that the 3E’s of Efficient, Effective, and Easy are still not being met.
Stay tuned: a second post will follow on how these 3E’s are being addressed today.
Post edit: a hat tip to Christian Mohn (@h0bbel), who pointed out that there is another E that can come into play, and that would be Expensive. Though I tend to view the expense of a platform as a slightly different aspect of this overall discussion, I’ll weave that into the next post as well.