By Brandon Davis, Chief Software Architect at On Center Software
The cloud has been a buzzword for many years now. We are inundated with marketing material promoting the cloud for everything from thermostats, to file sharing, to accounting. Are they selling something more advanced? Or is it basically the same technology from the 1990s? In fact, there is a popular meme sticker that says “There is no cloud, it’s just someone else’s computer.”
So, what does define the cloud?
Let’s start with a brief history of modern application infrastructure. We’ll consider the modern era to begin when DNS and the TCP/IP protocols were implemented.
DNS stands for Domain Name System, and it maps a user-friendly name to the IP address of a computer. When your computer connects to a network, it is given the address of the DNS server it should use. When you type the address of a website into your browser, the computer queries that DNS server to find the IP address of the website you entered. If that DNS server does not know the answer, it asks the next DNS server above it in the hierarchy. There are many layers of DNS servers, and eventually they all lead to one of the 13 root name servers. It didn’t always work that way, though. In the early days, root name servers did not exist. Instead, operators shared the directory as a file that was copied to every server. This file distribution-based architecture is considered horrific today. In 1995 it was built into every system, and 20 years ago it was still quite common. By 2001, distribution-based systems had been almost completely replaced by client-server architectures. Services that were originally copy-based were re-architected as client-server systems, and DNS became hierarchical with 13 root name servers.
In client-server systems there is a single machine that coordinates all of the operations, and numerous client machines that communicate with that server. This architecture lets multiple clients request changes simultaneously while the server coordinates those changes. The scalability of the client-server architecture was further expanded with the advent of multi-tier architectures, which split the server-side workload among multiple machines. A website might have a web tier that handles serving data to clients, and a database tier that handles updates to that data. It was now possible to create asymmetric tiers: you could deploy 10 web servers connected to a single database. DNS could be leveraged to map all of the web servers to the same name and round-robin requests among them.
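As a rough illustration of round-robin distribution, here is a minimal Python sketch. The server addresses and the `make_round_robin` helper are hypothetical, and a real DNS server rotates the records it returns rather than running code like this; the sketch only shows the rotation idea.

```python
from itertools import cycle


def make_round_robin(servers):
    """Return a function that hands out the given servers in rotation."""
    rotation = cycle(servers)
    return lambda: next(rotation)


# Hypothetical addresses for ten web servers that share one name.
web_servers = [f"10.0.0.{i}" for i in range(1, 11)]
pick_server = make_round_robin(web_servers)

# Twelve requests: the first ten hit each server once, then the rotation wraps.
first_twelve = [pick_server() for _ in range(12)]
```

Each server receives roughly the same share of requests, which is why adding web servers increases the capacity of the web tier.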
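We gained much higher availability by having multiple copies deployed. Suppose a single copy of the system has 90% availability (one 9). If the copies fail independently, each additional copy adds another 9, because the system is only down when every copy is down at the same time. We could now calculate how many copies we need in order to achieve 99.999% availability (five 9s): five copies.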
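The arithmetic can be sketched in a few lines of Python. This assumes that copies fail independently, which real deployments only approximate; the function names are illustrative. Exact fractions are used so that the comparison against the target is not affected by floating-point rounding.

```python
from fractions import Fraction


def combined_availability(single, copies):
    """Availability of a group of copies that is up if any one copy is up."""
    return 1 - (1 - single) ** copies


def copies_needed(single, target):
    """Smallest number of independent copies that meets the target."""
    copies = 1
    while combined_availability(single, copies) < target:
        copies += 1
    return copies


one_nine = Fraction(9, 10)             # 90% availability for a single copy
five_nines = Fraction(99999, 100000)   # 99.999% target

needed = copies_needed(one_nine, five_nines)
```

With these inputs, four copies reach 99.99% and the fifth copy reaches the 99.999% target.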
Reliability increased, but databases were still difficult to scale. As demand on these systems grew, we were forced to either scale up (add more CPU, memory, and disks to one machine) or scale out (split the data into smaller chunks that could be stored and queried on different machines).
Whether you scale up or out, each system still has a single point of failure. Scaling out actually reduces the reliability of the system, since you now have multiple machines and the failure of any one of them may cause the whole system to fail.
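To see why scaling out can hurt reliability, consider a small Python sketch. It assumes every machine is required, that machines fail independently, and that each has the same availability; the 90% figure is only an example.

```python
def chain_availability(per_machine, machines):
    """Availability of a system that needs every one of its machines up."""
    return per_machine ** machines


# One machine at 90% versus the same data split across five such machines.
single = chain_availability(0.9, 1)
scaled_out = chain_availability(0.9, 5)
```

Under these assumptions, splitting the data across five machines drops availability from 90% to roughly 59%, because the whole system is only up when all five machines are up.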
Failure is not an option
The next approach was to keep a hot standby of everything. If your system consisted of five machines, you deployed 10 hot standbys (2 copies of each of the 5). Every update to a machine simultaneously updated its copies. When one machine had a problem, such as a disk crash or memory failure, a copy took over responding to requests. This sort of architecture vastly improved the reliability of systems that could handle large-scale workloads, but as the user base grew, cracks began to show in the networks.
There is not one giant network. Networks owned by AT&T, Sprint, Verizon, Google, CenturyLink, L3, and NTT are all connected. A problem on one network rarely causes problems for the others, but it can make it impossible to reach portions of the network from some locations. If all of the servers are hosted on one segment of Google’s network, and there is an issue routing traffic to that segment, then the system will appear to be working fine for customers that are also on that segment, but unreachable for anyone else.
The answer is to place servers on numerous geographically separated network segments. We keep copies of everything, and spread those copies across multiple network segments. We then use consensus algorithms such as Paxos, which help the copies agree on the state of the system and react to partitions or failures. This gives us the ability to keep our system functioning even when large segments of the network are out of service.
Now we can build systems that almost never go down from a client’s point of view. Servers fail, networks have issues, but a copy of the customer’s data is always available to them, and they are never aware of the failures.
The next step is to make systems able to handle any possible load. The most flexible approach is to make as many layers of the system as possible stateless. If one layer of the service is a web server that serves static files, that layer is stateless: it doesn’t need anything from another layer, and it doesn’t store any changes from the client. If a single server can handle 1,000 clients, then we can run 10 copies to handle 10,000 clients. Stateless services generally scale linearly; stateful services do not.
Modern web-scale platforms typically combine the distributed and multi-tier architectures. Most layers of the system are stateless. A small number of services are stateful and provide persistence. The stateless services deliver the ability to scale up or down to meet customer demand. The stateful services are still problematic. In some cases, it is possible to shard relational data into isolated chunks and then replicate those chunks to multiple locations. In other cases, clustered datastores are required in order to store large numbers of key-value pairs in a very elastic manner. Other times, file-based storage is enough, provided it is replicated for availability.
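The linear scaling of a stateless layer makes capacity planning simple arithmetic. A minimal Python sketch, assuming every server handles the same number of clients and ignoring headroom for spikes or failures (the function name is illustrative):

```python
def servers_needed(clients, clients_per_server):
    """Number of identical stateless servers needed for a given load."""
    # Integer ceiling division: a partly loaded server still counts as one.
    return (clients + clients_per_server - 1) // clients_per_server


ten_thousand = servers_needed(10_000, 1_000)
one_more_client = servers_needed(10_001, 1_000)
```

With the figures from the text, 10,000 clients need 10 servers, and a single extra client beyond that requires an 11th.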
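Sharding and replication can be sketched as follows. This is an illustrative Python example, not a description of any particular platform: the key format, shard count, location names, and both helper functions are assumptions made for the sketch.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard number using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def replicas_for(shard, locations, copies):
    """Pick `copies` distinct locations for a shard, starting at its index."""
    if copies > len(locations):
        raise ValueError("not enough locations for the requested copies")
    return [locations[(shard + i) % len(locations)] for i in range(copies)]


# Hypothetical data center names and a hypothetical customer key.
locations = ["us-east", "us-west", "europe"]
shard = shard_for("customer-42", shard_count=4)
placement = replicas_for(shard, locations, copies=2)
```

Because the hash is stable, the same key always lands on the same shard, and each shard is stored in two different locations so that losing one location does not lose the data.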
Systems that have this level of availability and resiliency are classified as cloud computing (specifically SaaS systems) by the National Institute of Standards and Technology (NIST): http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-145.pdf
Services that do not provide this level of redundancy and availability are not cloud computing.
For the operator of these fail-safe cloud systems, the cost can be quite high. They require a massive number of servers and a substantial amount of bandwidth between them. They also come with major engineering costs to design the system and keep it functioning.
Cloud computing has many benefits for clients of On Center Software’s Oasis platform. Construction teams have the ability to access their data any time, from any place, with a very low probability of failure.
Come back to this blog to find more of the challenges we faced, the technologies we used, how we operate them, and why you can trust our platform to support the needs of your company.