Over the last few years I have heard more and more confusion about what it means to be "in" or "doing" the whole cloud thing. I have heard explanations ranging from "That's just Dropbox, isn't it?" to "I am doing cloud, I have a <insert cloud server technology name here> instance at <insert cloud name here> hosting my WordPress blog". However, cloud isn't so much about the technologies you use as about how you use them and the design choices you make when architecting your app. It really is much more than simply "using someone else's computer".
So why isn't dumping your LAMP stack on an EC2 instance considered cloudy? Well, this is probably the worst way to use the cloud: by its design, having a single point of failure like this in the cloud is a super bad idea. The cloud is designed around cheap commodity hardware under the hood, without the features needed to offer proper HA for an individual machine. High availability is achieved in a different way in the cloud, one that relates to the architecture of your application. Simply put, if you put a single point of failure on the cloud you can expect it to fail, randomly, without warning, and without the kind of SLA typical of more traditional hosting to claim against. A good example of this is AWS EC2 instance-store based instances. These do not utilise highly available storage, and if the hypervisor you are placed on suffers an unrecoverable hardware issue you may see your instance terminated without notice, effectively blasting your LAMP server out of the water. If you are non-cloudy you'd probably consider this a fault, but actually this is by design.
You may have heard the pets vs cattle metaphor before; it is a great way of describing how to architect for the cloud. The host described above can be considered a pet: just like your dog, it is lovingly hand crafted and unique, and you'd be devastated if it died. In the cloud these sorts of hosts / dogs are banned, and you must twist your thinking to be much more like a farmer managing a herd of cattle. If one gets sick it's not too much of a problem. Once an animal becomes economically unviable you simply dispatch it and move on; after all, you're still going to get a good profit this year as all its buddies are still mooing in the field. And of course, if you anticipate more demand you can always pick up more calves at the local cattle market to bulk out the herd. Admittedly, this metaphor is probably flawed in the eyes of a farmer, as they'll be the first to tell you a good milk producing heifer is worth a few bob and well worth looking after.
By now you are probably scratching your head at how to achieve a no-pets application. Luckily, most clouds give you a decent set of tools to help you achieve a cattle-like setup. All the major clouds provide really comprehensive APIs which allow you to programmatically list, deploy, and delete resources on the fly. Furthermore, resources are not limited to server-like instances: typical basic features you can expect include software defined networking, load balancers, databases and object stores. Larger, more established clouds such as AWS, Google Cloud Platform and Azure offer even more features, stretching as far as machine learning, logging, analytics and so on. The good news is that using a cloud provided resource such as a database takes away a lot of the headaches of running it yourself. You just say "Hey, give me a database!" and within minutes you have a database up and running in a highly available and scalable fashion, usually tuned by the cloud provider to make the most of their platform.
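To make the cattle idea a little more concrete, here is a minimal sketch of what managing a herd through an API can look like. The `FakeCloud` class and its `list_instances`, `deploy_instance` and `delete_instance` methods are hypothetical in-memory stand-ins, not any real provider's API; a real cloud exposes equivalent list / deploy / delete calls under its own names.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class FakeCloud:
    """In-memory stand-in for a provider API (hypothetical, illustration only)."""
    instances: dict = field(default_factory=dict)  # instance id -> is it healthy?
    _ids: itertools.count = field(default_factory=lambda: itertools.count(1))

    def list_instances(self):
        return dict(self.instances)

    def deploy_instance(self):
        instance_id = f"i-{next(self._ids):04d}"
        self.instances[instance_id] = True
        return instance_id

    def delete_instance(self, instance_id):
        del self.instances[instance_id]


def reconcile(cloud, desired):
    """Cull unhealthy instances, then bring the herd back to `desired` size."""
    for instance_id, healthy in cloud.list_instances().items():
        if not healthy:
            cloud.delete_instance(instance_id)  # no nursing a sick host back to health
    while len(cloud.list_instances()) < desired:
        cloud.deploy_instance()
    # Scale in again if the herd is larger than needed.
    for instance_id in list(cloud.list_instances())[desired:]:
        cloud.delete_instance(instance_id)
    return sorted(cloud.list_instances())
```

Nothing in `reconcile` cares which particular instance it deletes or creates, which is the whole point: the hosts are interchangeable.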
Meet the App
For this article I'll be using a recent project of mine as a case study for architecting for the cloud. Free WP Health Check is a website which allows visitors to scan their WordPress blog for security vulnerabilities and out of date plugins. Behind the scenes it uses wpscan, a command line Ruby application which can output its results in JSON format.
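As a rough idea of how a Python worker might drive wpscan, here is a small sketch. It assumes wpscan is on the PATH and accepts `--url` and `--format json`; the function names and the URL check are illustrative, not taken from the app itself. The `runner` parameter exists only so the command can be swapped out when wpscan is not installed.

```python
import json
import subprocess
from urllib.parse import urlparse


def build_command(url):
    """Return the argv list for a JSON-formatted scan of `url`."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not a scannable URL: {url!r}")
    # Passing a list (not a shell string) avoids shell injection via the URL.
    return ["wpscan", "--url", url, "--format", "json"]


def run_scan(url, runner=subprocess.run):
    """Run the scanner and return its JSON report as a dict."""
    # check=True is deliberately not used: this sketch assumes a non-zero
    # exit status can still be accompanied by a usable report on stdout.
    completed = runner(build_command(url), capture_output=True, text=True)
    return json.loads(completed.stdout)
```

Rejecting anything that is not a plain http(s) URL matters here because the value comes straight from a visitor's form submission.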
Obviously, we could set up a single VPS, with a database to store results, configured to run both the Flask Python app for the front end and the scanner. However, this has several limitations...
- The VPS is a single point of failure: if the host dies the website is offline.
- If we get lots of visitors we are limited to the number of concurrent scans the one VPS can handle, which would result in a bad user experience if the host is at capacity.
- All of the services are hosted on one machine, so a bad scan process could impact the database or webserver.
- If the host dies we'd need to spend considerable effort restoring from a backup (if indeed we configured some form of backups).
- We need to actively manage the application server and database ourselves.
The cloud turns the whole sorry situation on its head. If we architect for the cloud we get the following:
- We no longer need to manage the database, as the Database as a Service product will provide a highly available, scalable database for us.
- We can scale backend scanners based on demand: if we get an influx of visitors we just spin up more in the cloud.
- We can autoscale a group of front end application servers running the Flask app: if one goes bad we just kill it and the others carry on, with more spawning as required.
Of course, to achieve our cloudy dreams we need to do a few extra things to make our app cloud ready. Here are a few things to consider...
Which technologies to use
You'll notice that many technologies are mentioned in this section, but it's best not to strive to use a particular one because it is considered cool or because it's what everyone else is using. Instead, it is important to choose based on your use case. In any case, you'll probably end up using a mixture of technologies to run your app unless it is very simple indeed...
The golden rule of architecting for cloud is that application servers, such as the front end Flask app and the scanner nodes, must be 100% stateless. This is because they may be scaled up or down, or get deleted at any time, and any data on them will be lost. Therefore any data such as user uploads, the user's session, or other application data must be stored in some sort of datastore. Here are some common datastores in the cloud and their typical use cases:
- Database as a Service: provides a relational database such as MySQL or MSSQL in a highly available, automatically managed configuration. This is useful for storing application data which would traditionally live in a relational database, however you may hit scalability issues with DBaaS.
- NoSQL as a Service: provides a NoSQL database such as MongoDB, CouchDB and so on. By their nature these databases provide high availability and scalability. However, they are not well suited to relational data, as they do not enforce relationships and are better suited to flattened data. Many are also eventually consistent rather than fully transactional, so you cannot be guaranteed that every replica has the most up to date data at read time.
- Key Value as a Service: With cloud you cannot store session data on the application nodes, as the load balancer may move a user from one application node to another midway through their visit. You must therefore store session data in a shared place so that every application node has the user's session state. KVaaS services based on software such as Redis or Riak are useful for this task: they are lightweight and fast whilst being highly available and scalable.
- Object Stores: These are useful for storing raw objects such as files, and are great for storing dynamic content for your site such as images or user uploads. Some popular object stores include Ceph, S3 and OpenStack Swift. Object stores are by their nature highly available and scalable services, typically offering at least 3 replicas of each object stored.
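The session point is easier to see in code. The sketch below uses a dict-backed `SharedStore` as a stand-in for a key value service (the class, the `AppNode` class and the `session:` key prefix are all hypothetical); the only thing that matters is that both application nodes talk to the same store rather than keeping sessions in their own memory.

```python
import json
import uuid


class SharedStore:
    """Dict-backed stand-in for a key value service (hypothetical)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class AppNode:
    """A stateless application server: all session state lives in the store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def login(self, username):
        session_id = uuid.uuid4().hex
        self.store.set(f"session:{session_id}", json.dumps({"user": username}))
        return session_id

    def whoami(self, session_id):
        raw = self.store.get(f"session:{session_id}")
        return json.loads(raw)["user"] if raw is not None else None


store = SharedStore()
node_a = AppNode("a", store)
node_b = AppNode("b", store)

session = node_a.login("alice")      # first request lands on node a
user_seen_by_b = node_b.whoami(session)  # next request lands on node b
```

Because neither node holds anything of its own, either can be deleted mid-visit without the user being logged out.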
For Free WP Health Check only one datastore is required, to store the results of the user's scan. The data is not relational, but there may be many rows, so a NoSQL datastore is a good choice. The eventual consistency is not an issue, as the only effect may be that the user has to wait slightly longer to retrieve their results, which would in any case be negligible compared to the total scan time. My preferred choice for this use case is DynamoDB from AWS due to its simplicity in day to day management, however you may wish to use a more complex NoSQL datastore if you require specific features.
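For illustration, storing and fetching a scan result could look something like the sketch below. It is written against the `put_item` and `get_item` methods of a Boto3 DynamoDB `Table` resource; the attribute names (`scan_id`, `status`, `report` and so on) and the idea of keying on `scan_id` are assumptions, not the app's real schema.

```python
import json
import time


def save_result(table, scan_id, url, report):
    """Write one finished scan as a single flat item."""
    table.put_item(
        Item={
            "scan_id": scan_id,            # assumed partition key
            "url": url,
            "status": "complete",
            "report": json.dumps(report),  # stored as a JSON string
            "finished_at": int(time.time()),
        }
    )


def fetch_result(table, scan_id):
    """Return the stored item, or None if it is not visible yet."""
    response = table.get_item(Key={"scan_id": scan_id})
    # With eventually consistent reads a fresh write may not show up
    # immediately; the caller simply polls again a little later.
    return response.get("Item")
```

With real credentials, `table` would be something like `boto3.resource("dynamodb").Table("scan-results")`, where the table name is again made up for the example.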
Next up is how all the components hang together. In the single VPS example it is pretty easy to see how the user's URL can be passed to the scanner. In the cloud, however, the scan servers are decoupled from the front end Flask servers, and hence some method is needed to reliably pass the URLs from the web UI to the scanners.
- Queue as a Service: most clouds provide queue as a service which allows you to queue and dequeue messages. Some queues can also offer advanced features such as topics, complex routing of messages and so on. Queues are usually highly available and scalable. Examples of queue technologies include Amazon SQS, RabbitMQ, and Rackspace Cloud Queues.
In the case of Free WP Health Check only a very simple queue is required and hence AWS's Simple Queue Service fits the bill just right.
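A minimal sketch of both ends of such a queue is shown below, written against the Boto3 SQS client calls `send_message`, `receive_message` and `delete_message`. The function names and the message layout (`scan_id` plus `url`) are assumptions for the example rather than the app's actual code.

```python
import json


def enqueue_scan(sqs, queue_url, scan_id, url):
    """Front end side: put one scan request on the queue."""
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"scan_id": scan_id, "url": url}),
    )


def process_next(sqs, queue_url, handler):
    """Scanner side: handle at most one message. Returns False if idle."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,  # long polling instead of hammering the API
    )
    messages = response.get("Messages", [])
    if not messages:
        return False
    message = messages[0]
    handler(json.loads(message["Body"]))
    # Delete only after the handler succeeded. If this scanner dies
    # mid-scan, the message becomes visible again for another scanner.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
    return True
```

Deleting the message last is what makes the scanners safe to treat as cattle: a worker that disappears halfway through a job loses nothing, because the job goes back to the herd.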
OK, so now we have a place to store the user's results and a way to communicate between the front end web servers and the scan servers, but how does all of this scale and achieve fault tolerance? Well, currently there are two commonly used methods for scaling these pools of application servers...
- Golden Images: The first and simplest way to autoscale a pool of application servers is to build the perfect application server and take an image of it. You can then ask the autoscale tools in the cloud you are using to scale clones of this server on demand, using whichever rules you specify. This is great, as building a new host like this is typically pretty quick, but remember it won't have all the most up to date packages as it is frozen in time, so you must rebuild the golden images regularly.
- Config Management Build Systems: With config management build systems your cloud provider's autoscaler builds an empty machine with a basic install of the OS you'd like to use. Once it's built, a tool such as Salt, Ansible or Puppet goes in, configures the host as the required application node and puts it into service. The good thing about building like this is that you always get up to date packages. However, it is harder to manage, as you may need to hold specific packages back, and it takes longer to spin up an app server this way.
For the Free WP Health Check app I decided to use the golden images approach with AWS Auto Scaling Groups, as the pool can be scaled much more quickly using golden images, reducing the user's time in the queue if there is a sudden influx and the scan servers get exhausted. The golden images themselves will be built periodically by a Jenkins job, which will build the latest Python Flask app / wpscan server from Git as well as install the latest OS packages. Prior to releasing the image to the Auto Scaling Group, the image will be tested for functionality by Jenkins in a staging environment.
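The "rules you specify" for a scanner pool can be as simple as sizing it from the queue depth. The function below is a sketch of one such rule; the parameter names and default numbers are invented for the example and would need tuning against real scan times.

```python
import math


def desired_scanners(queued_jobs, jobs_per_scanner=5, min_size=1, max_size=10):
    """Pick a pool size from the queue depth, clamped to a fixed range."""
    wanted = math.ceil(queued_jobs / jobs_per_scanner)
    # min_size keeps one scanner warm; max_size caps the bill.
    return max(min_size, min(max_size, wanted))
```

With these defaults, 12 queued jobs ask for 3 scanners, an empty queue keeps 1, and a flood of 500 jobs is capped at 10, with the remainder simply waiting in the queue. The result could be fed to an Auto Scaling Group, for example via the Boto3 autoscaling client's `set_desired_capacity` call, though in practice a provider-side scaling policy on a queue metric may be the simpler option.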
This is not an exhaustive list of the cloud technologies out there, and there are many more to explore, but it gives us the basics to get Free WP Health Check up and running in a cloudy way. Of course we could extend this further by serving the front end over a CDN, or by converting the front end into a Function as a Service and Web Services Gateway style setup, so we can scale the front end much faster and without the headache of building golden images.
Putting it all together
Now that the technologies have been chosen, the application's components can be written. The Free WP Health Check app consists of two components: the front end Flask app and a backend scanner worker. Both have been written in Python using the Boto3 library to interact with DynamoDB and SQS. Here is how it all hangs together...
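As a rough illustration of that flow (not the app's actual code), the sketch below replaces SQS with a `deque` and the DynamoDB table with a dict so it runs anywhere. The function names and the "queued" / "complete" status values are assumptions made for the example.

```python
import uuid
from collections import deque

queue = deque()  # stands in for the SQS queue
results = {}     # stands in for the DynamoDB table


def submit(url):
    """Front end: record the job as queued and hand it to the scanners."""
    scan_id = uuid.uuid4().hex
    results[scan_id] = {"url": url, "status": "queued"}
    queue.append(scan_id)
    return scan_id


def worker_step(scan):
    """Scanner: take one job, run `scan` on its URL, store the outcome."""
    if not queue:
        return None
    scan_id = queue.popleft()
    job = results[scan_id]
    job["report"] = scan(job["url"])
    job["status"] = "complete"
    return scan_id


def status(scan_id):
    """Front end: what the results page would poll."""
    return results[scan_id]["status"]
```

The front end and the scanner never call each other directly. They only share the queue and the datastore, so either side can be scaled, replaced or killed without the other noticing.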
Of course this is a super high level overview of how to make a very simple application cloud ready, but hopefully you have learnt something along the way. If you'd like to play with the app mentioned above, please bear in mind the auto scaling groups are limited to stop my AWS budget spiralling out of control, and hence if lots of people hit it at once you may end up with jobs in a "queued" status. There are more steps which can be taken with the app discussed above; some things we did not cover include spanning multiple availability zones and fronting the app with CloudFront. In the future I may modify the front end app to use S3 to store and serve the static content, like the HTML pages and CSS, and use JS to submit URLs and retrieve results via the API Gateway and Lambda services. In that case I could completely remove the front end autoscaling group, which would reduce costs, as there wouldn't be EC2 instances burning away when no users are on the site, and would increase the speed at which the front end scales to more users. Unfortunately the scanner pool will need to remain, as the WPScan Ruby script cannot be run via Lambda. Remember, when building your apps keep to the rules of the cloud: no pets, horizontally scaled stateless servers, use the provided as a Service offerings to reduce what you need to manage, and expect and architect for failure.