This week, along with several thousand other attendees, I spent four days at what is billed as the largest cloud expo on the East Coast. Despite the sessions being fairly vendor-heavy, there were a number of good takeaways for CxOs, application developers, system administrators and security gurus. I thought it would be helpful to provide a digest, so here we go!
For CxOs: Truth In Costing…
Finally, the pundits are admitting publicly that cloud is not always less expensive than premises-based hosting, something we have been saying for years. If you are running legacy applications that have very stable workloads and consume fairly constant memory, CPU and storage resources, then “right-sizing” your own in-house gear will almost always save you money over cloud hosting.
Indeed, one vendor here does a brisk business showing companies who spend upwards of $50K/month with Amazon and other cloud providers how bringing in-house all or a portion of their cloud hosting can save them 20% or more.
Several small and large vendors also now provide “Private Cloud in a Rack”: turnkey compute, storage, network and management kit in one or more pre-configured racks, so that your in-house IT department doesn’t have to spend months learning how to build cloud infrastructure properly. We’ve done this ourselves on a one-off basis for several clients who “got” that building a resilient, highly available private cloud requires a great deal of cross-discipline expertise to get right.
But again, this only saves money if you can fill up the kit near capacity. If your workloads are growing fast or bursty, then you’ll need to buy hardware big and powerful enough _in advance_ to handle those anticipated loads. If you underestimate the loads, then you are stuck. If you overestimate the loads, then you’ve in hindsight spent too much money on gear.
In those cases, deploying to the cloud (at least initially) gives you a lot of flexibility and can dramatically shorten your time to market. Shortening time-to-market for new initiatives is now reported to be the primary driver for cloud adoption.
Separately, CFOs (though perhaps not line managers) will be pleased that chargeback reporting is now much, much easier than it was just last year. In some companies, however, line managers resist having IT charged back to their departments because doing so hurts their margins. I did speak to two CIOs of major companies who run their own internal-to-IT-only chargeback systems. One mentioned that when an underperforming line manager complained to the COO that lack of IT support was the root cause of his troubles, the CIO simply showed the COO a report demonstrating that the under-performing department had in fact consistently consumed a disproportionate share of IT resources. The line manager was reassigned, by the way.
For Application Developers: It’s Not Just Democrats and Republicans Who Don’t Like Each Other
I was surprised at the visible disdain that devs who code their cloud apps to BASE standards have for devs who support and code traditional n-tier ACID- or near-ACID-compliant apps.
To catch up the non-developers: the “E” in “BASE” stands for “Eventual consistency,” meaning that the database nodes may temporarily hold different data, but that they eventually converge on a consistent state. Most mobile apps are coded to BASE standards, for example, and the two key benefits of BASE apps are, first, that there is never a maintenance window (the app is distributed across hundreds if not thousands of nodes, which can be updated on a rolling basis with no negative impact), and second, that you can lose a significant number of your nodes and the application will continue to function, albeit with reduced capacity.
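For the curious, eventual consistency can be sketched in a few lines of Python. This toy store is purely illustrative (not any vendor’s implementation): each write lands on one replica immediately and is queued for asynchronous replication to the rest, so a read that hits a lagging replica sees stale data until the queue drains.

```python
import random

class EventuallyConsistentStore:
    """Toy illustration of BASE-style eventual consistency."""

    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []  # queued replication work: (replica_index, key, value)

    def write(self, key, value):
        # The write is visible on one node right away...
        primary = random.randrange(len(self.replicas))
        self.replicas[primary][key] = value
        # ...and merely queued for the others.
        for i in range(len(self.replicas)):
            if i != primary:
                self.pending.append((i, key, value))

    def read(self, key):
        # A read may hit any replica, so it may return stale data (or None).
        return random.choice(self.replicas).get(key)

    def replicate(self):
        # Drain the queue: after this, all replicas agree -- the "eventual" part.
        for i, key, value in self.pending:
            self.replicas[i][key] = value
        self.pending = []
```

Make the window between `write()` and `replicate()` short enough and, as the BASE devs at the expo argued, most users never notice; leave it long and you get the hospital-pharmacy scenario described below.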
BASE apps also take full advantage of true cloud architectures. Tools as old-school as Citrix’s NetScaler can be used to spin up and spin down nodes in response to load, providing a consistent end-user experience while minimizing costs.
Now, on the other side of the fence, go tell a physician (or their malpractice insurance carrier) that the database nodes in their hospital EHR software don’t all have the same patient data in them, or that one database node shows that a patient was administered a large amount of oxycodone for pain management but the prescription for it hasn’t made it into that same database node just yet, and you’ll begin to understand why there remains a place in this world for ACID-compliant applications.
ACID devs here complained that the BASE devs just don’t “get” corporate America. The BASE devs counter that if you just make the “Eventual consistency” window short enough, you don’t need ACID compliance. The free beer served here in the late afternoons definitely increased the volume as well as the entertainment value of these exchanges!
Lastly, I walked away very impressed with Zend, who provide a commercial flavor of PHP tweaked for performance that enables the web front-end of apps like Drupal to be run across multiple nodes quite easily.
For System Administrators: Professional Driver On Closed Course; Do Not Attempt Yourself!
It was interesting to see that the admins who have to support both BASE and legacy apps, and who had tried to build their own private clouds, were the most vocal about outsourcing it all. Having earned one more Pavlovian scar from the incredible complexity that cloud architecture entails, these admins definitely “get” that job security comes from understanding their company’s business processes and enabling users to work better, not from how many racks of blinky lights live in the in-house data center.
One of the more political challenges several admins expressed was how to explain to non-technical management that (despite what VMware and some cloud providers say), just deploying the company’s legacy n-tier ACID-compliant apps to a cloud won’t improve system uptime. These legacy apps rely on a resilient architecture being underneath them at all times and are not designed for the node failures that happen all the time at cloud providers, failures which BASE applications handle gracefully, without a reboot.
I was at first surprised when I spoke to one cloud hosting provider who caters exclusively to hosting BASE applications, and who keeps costs low by removing most of the redundancy and resiliency from their infrastructure: no redundant switching, no multi-pathing, etc. But then it made total sense, since a hardware failure that would immediately crash a legacy n-tier ACID-compliant app merely reduces the capacity of a cloud-architected BASE app.
I was also surprised that more sysadmins don’t pay greater attention to IOPs and the bandwidth between the storage and compute nodes as these can be absolutely critical bottlenecks which could make your app run more slowly in a cloud than it would on premises. One cloud management software company engineer with whom I spoke said he had been working on a whitepaper to try to benchmark these statistics for various cloud providers. The whitepaper was never published because the results were “consistently inconsistent”. He attributed this to the shared environment that is cloud, and whether or not neighbors on his compute node were doing disk I/O or not at the time of the testing.
Although we are deploying SSD storage to our own private cloud, none of the public cloud providers with whom I spoke have SSD on the radar near-term. Further, only Dell knew right away what the bandwidth was between their compute nodes and their storage (10G). It took multiple visits over three days to complete my little “What’s your compute-to-storage bandwidth?” survey. HP was at the low end with 400M though most others provide 4G or 8G. One very high end provider uses Infiniband for compute-to-storage connectivity and provides 80G (yes, eighty gigs) of bandwidth. If you have to ask the price…
Even in premises-based deployments, as more and more applications become more and more database or data-intensive, we are finding that the IOPs of the storage backend becomes increasingly important for maximizing performance, shortening backup windows, etc.
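Absent a published benchmark, you can at least sanity-check a provider’s storage yourself. The snippet below is a deliberately crude probe of my own devising (a real benchmark would use a purpose-built tool like fio): it times synchronous small-block writes to estimate a floor on the write IOPs the backend will sustain.

```python
import os
import tempfile
import time

def rough_write_iops(n_ops=200, block_size=4096):
    """Time n_ops synchronous writes; return ops/second (a rough floor)."""
    data = os.urandom(block_size)
    fd, path = tempfile.mkstemp()
    try:
        start = time.time()
        for _ in range(n_ops):
            os.write(fd, data)
            os.fsync(fd)  # force each block through the cache to storage
        elapsed = time.time() - start
    finally:
        os.close(fd)
        os.remove(path)
    return n_ops / elapsed

if __name__ == "__main__":
    print(f"~{rough_write_iops():.0f} synchronous write IOPs")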
Two metrics sysadmins should look at when evaluating cloud providers come from running “top” in a Linux instance that is doing some consistent work, like running an rsync job. The “%wa” metric is the percentage of CPU time the cores spend waiting for (almost always) I/O to complete before they can do any more compute work. It’s hard to compare apples to apples because of the differing CPU clock speeds disparate cloud providers offer, but if %wa is consistently in the mid-to-high double digits, then storage IOPs and/or bandwidth is a problem.
The second useful cloud metric “top” provides is “%st”, or “steal time”: the percentage of CPU time during which your virtual machine could have been doing real work, except that the hypervisor was giving those cycles to another virtual machine. If this metric is constantly above even a few percentage points, your virtual machine is on an overloaded host and you are well within your rights to ask your provider to move it to a different host.
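If you want to collect %wa and %st programmatically rather than eyeballing top, both come straight from the aggregate “cpu” line in /proc/stat, which is where top gets them. A minimal sketch (field positions per the Linux proc(5) man page; the snapshot strings below are hypothetical examples):

```python
def cpu_pct(before, after):
    """Compute (%wa, %st) from two "cpu ..." snapshot lines out of /proc/stat.

    The fields after "cpu" are cumulative jiffies spent in:
    user nice system idle iowait irq softirq steal
    so percentages come from the deltas between two snapshots.
    """
    b = [int(x) for x in before.split()[1:9]]
    a = [int(x) for x in after.split()[1:9]]
    delta = [x - y for x, y in zip(a, b)]
    total = sum(delta)
    wa = 100.0 * delta[4] / total  # 5th field: iowait
    st = 100.0 * delta[7] / total  # 8th field: steal
    return wa, st

# Two made-up snapshots taken a few seconds apart:
wa, st = cpu_pct("cpu 1000 0 500 8000 200 0 0 100",
                 "cpu 1100 0 550 8500 400 0 0 200")
# roughly 21% iowait and 10.5% steal: a storage problem AND a crowded host
```

On a real instance, read the first line of /proc/stat twice with a few seconds of sleep in between and feed the two lines to `cpu_pct()`.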
For Security Gurus: One Plus One Equals Three
Hybrid clouds represent the fastest growing cloud segment currently, with many companies placing their bursty web front ends in the cloud while keeping their database engines and data behind the corporate firewall. Security testing now has to be expanded to cover not only the traditional premises-based kit but the cloud environment AND the connectivity between prem and cloud. Many vendors here at the Cloud Expo are deploying “security as a service”; virtual appliances that run in the cloud environment to keep a continuous eye on things. The good news is that new compute node deployments increasingly are being automated and scripted from templates which have already undergone rigorous security testing, forcing the bad guys to rely increasingly on social engineering to get in.
To streamline security reviews, vendors catering to customers with higher-end needs are getting SSAE16 credentials. The security gurus with whom I spoke were a little dismissive of a SOC 1 report and felt that a SOC 2 Type II report would give them a lot more comfort in letting their clients do business with that vendor.
That’s All Very Interesting, But Should I Cloud or Not?
To try to keep things simple, we have four screening criteria for helping clients evaluate whether cloud is right for them.
Your Data. The more sensitive/regulated your data is, the less likely cloud is right for you. Not because it can’t be done securely, but because (as reported by several companies here at the Expo) it’s hard to get sufficient access and transparency from private cloud providers to satisfy the company’s data breach insurance carrier. None of the healthcare companies here with whom I spoke put ePHI repositories outside of the corporate firewall, though some have placed web front-end systems with private cloud providers, and these front-end systems retrieve ePHI from the data repositories behind the corporate firewall.
Workload Variance. The more bursty your workloads, or if you are expecting difficult-to-predict growth, the more likely cloud is right for you. The ability to scale up/down on demand and “right-size” the infrastructure near continuously (and sometimes in a fully automated way) really helps manage costs and better serves your customers.
Size Matters. With fewer than 25-50 existing servers, building your own private cloud infrastructure will not be cost-effective; you’ll wind up compromising on resiliency, performance and capacity to keep costs down. You wouldn’t drive a VW Golf to Lowes to bring home five sheets of wallboard, so don’t use a single-controller NAS as the storage backend when you really need a proper SAN.
Time To Market and Developer Expertise. If time-to-market is critical and you are coding BASE applications, cloud is absolutely right for you — but long-term it may be more cost-effective to bring some of the continuous workload back in house.
Time to head in for the last few sessions, so I’m signing off for now.
If your company is curious about cloud, we’d be happy to arrange a consult to see whether cloud makes sense for you.