Everybody is getting high on cloud computing. Clouds are really all over the place. There is only one catch: it keeps everybody's mind hazy. For instance, Gordon Haff from c|net writes:
Software as a Service (SaaS), Hardware as a Service (HaaS), Data as a Service (DaaS), and Web 2.0 are all part of the cloud. Even hosting providers are a sort of specialized, narrow case.
Sure SaaS, HaaS (go Bears), DaaS... and I'm sure MyaaS is part of the cloud too.
Why not? And I'm sure Web 2.0 is about having boxes with rounded corners, easily drawn through an adequate abstraction layer.
Let's turn to Wikipedia, which does barely a better job:
The term derives from the fact that most technology architecture diagrams depict the Internet or IP availability by using a drawing of a cloud
By the same token, one has to wonder why everybody is not calling database technology Cylinder Computing... Wikipedia goes on:
The architecture behind cloud computing is a massive network of "cloud servers" [...]
Excellent. Did you know of Bozo computing? This is a massive network of "bozo servers"!
Ok, let's clean this up. For the most part the confusion stems from four sources:
- Amazon: they popularized the term, so everybody started to equate their CPU and storage utility computing business models with cloud computing. Good marketing job.
- Marketing people: they have made fashionable, and trivialized, words like Utility, Grid, Mesh or even SaaS to re-launch, over and over again, the notion of on-demand computing so dear to Gartner: when an analyst prediction falls through, you can always wait for the next hype cycle by inventing a new name.
- Virtualization: VMware discovered that manipulating runtime images (nicely reminiscent of LISP) allows running applications to be transferred dynamically from one server to another, an incredible asset for on-demand provisioning and incremental hyper-scalability. Hence the strong association with on-demand and utility computing.
- Supercomputing: not only because CPU performance is compared (silly, since the coupling between supercomputer nodes is most often of much finer grain than the loose coupling between the nodes of a cloud) but also because supercomputing nodes are often linked according to specific network topologies described with the same vocabulary: grid, cube, etc.
So what is cloud computing, really? Let's go back to the defining sources. Originally (aside from a few earlier obscure examples) they were:
- SETI@home (and its modern avatar BOINC), which defined popular grid computing. Remember when SETI was tracking aliens from your computer's screen saver?
- Google: it really defined the genre (without exposing it at first) and probably carved out the term through its deployment of Redundant Arrays of Inexpensive Servers.
- Amazon S3: Amazon CEO Jeff Bezos understood the UPS business so well that he decided to transpose it to Amazon's IT infrastructure. It's not about selling books or delivering packages; in both cases, it's about leveraging an extraordinary logistics infrastructure.
Defining technologies and concepts:
- P2P services, which pioneered key technologies such as highly scalable, redundant lookup services and massively distributed hash tables, to name a few. (1)
- Distributed AI, which pioneered key concepts like Lisp's map/reduce, actors, and agents.
To be fair, Google's and Amazon's contributions in terms of technology are also substantial.
Like any new paradigm, cloud computing represents a shift. In this case, it is best described by the addition of a new layer we could call a Cloud Operating System.
At its core, an operating system is really a task/process manager, a memory manager and an I/O manager. Similarly, a Cloud Operating System (COS) defines how tasks/applications are managed, how memory/storage is organized, and the mechanisms by which massive information flows are handled. A COS is a network operating system running atop a cloud, that is, a hyper-network of computers.
- Massive distribution: often more than 100,000 nodes, maybe a million or more at Google.
- Hyper reliability: tens of nodes can go up and down all the time without much disturbing the applications. (3)
Information flow management
- Semi-autonomy & near P2P coupling: since the overhead of a completely centralized architecture over so many nodes would prevent any kind of scalability or reliability, cloud computing relies heavily on neighboring algorithms, where issues like discovery, monitoring, redundancy and hot swapping of tasks are managed/decided locally within a dynamic cluster of topological neighbors.
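To make "decided locally among topological neighbors" concrete, here is a toy Python sketch (all names are made up; real clouds use gossip and heartbeat protocols far subtler than this): each node watches only a handful of ring neighbors, so a failure gets noticed without any central registry tracking all N nodes.

```python
class Node:
    """Toy node that monitors only a few topological neighbors."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.alive = True
        self.neighbors = []   # small local view, not the whole cloud
        self.suspected = set()

    def heartbeat_round(self):
        # Each node checks only its own neighbors: the decision is
        # local, so no coordinator has to poll every node in the cloud.
        for peer in self.neighbors:
            if not peer.alive:
                self.suspected.add(peer.node_id)

def build_ring(n, k=2):
    """Link each node to its k nearest ring neighbors on each side."""
    nodes = [Node(i) for i in range(n)]
    for i, node in enumerate(nodes):
        for d in range(1, k + 1):
            node.neighbors.append(nodes[(i + d) % n])
            node.neighbors.append(nodes[(i - d) % n])
    return nodes

nodes = build_ring(1000)
nodes[42].alive = False          # a node silently fails
for node in nodes:
    node.heartbeat_round()
detectors = [n.node_id for n in nodes if 42 in n.suspected]
print(sorted(detectors))         # → [40, 41, 43, 44]
```

Only node 42's ring neighbors even notice the failure; the other 995 nodes do no work at all, which is exactly why this style of monitoring scales.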
Memory and Storage management
- Distributed hash tables: in such an environment you cannot directly use a classical database system. Databases, even ones as "simple" as an efficient hash table (e.g. Berkeley DB), would not scale enough, since the cost of duplication/synchronization would quickly become higher than the cost of storage/retrieval (see Google BigTable, Amazon Dynamo).
Note that depending on the granularity of the system, a classical DB can still be associated with a node or with a small cluster of nodes.
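As an illustration of the idea behind such stores (not Google's or Amazon's actual code), here is a minimal consistent-hashing ring in Python, the standard trick DHT-style systems use so that adding or removing a node relocates only the keys that node owned, instead of reshuffling everything:

```python
import hashlib
from bisect import bisect_right

def h(key):
    """Map any string onto a fixed integer ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hashing ring (illustrative sketch only)."""
    def __init__(self, node_names):
        # Place each node on the ring at the hash of its name.
        self.ring = sorted((h(name), name) for name in node_names)

    def node_for(self, key):
        # The first node clockwise from the key's position owns it.
        hashes = [hv for hv, _ in self.ring]
        i = bisect_right(hashes, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing([f"node-{i}" for i in range(100)])
print(ring.node_for("photo-1234"))  # the same key always lands on the same node
```

When a node dies, only its keys migrate to the next node clockwise; every other key keeps its owner, so synchronization cost stays proportional to the failure, not to the size of the cloud.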
CPU and Load management
- Hardware transparency: there is no notion of hardware (or guarantee of its permanence) on the application side. Computing units are usually referred to as nodes.
- CPU distribution: an application-level-only distribution (often achieved through virtualization alone), like that of Amazon EC2 or Sun's network.com, is not equivalent to a Google-style cloud, where applications themselves can be explicitly parallelized through special primitives like map/reduce.
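To see what such primitives buy you, here is a toy single-machine sketch of the map/reduce pattern in Python, using the classic word-count example (the fan-out is elided: a real cloud runs the map calls on thousands of nodes, but the application's code looks just like the two lambdas below):

```python
from collections import defaultdict
from itertools import chain

def map_phase(documents, map_fn):
    """Run the user's map function on each input split.
    Serial here; a cloud fans these calls out across many nodes."""
    return list(chain.from_iterable(map_fn(doc) for doc in documents))

def reduce_phase(pairs, reduce_fn):
    """Group intermediate (key, value) pairs, then reduce each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

docs = ["the cloud is a network", "the network is the computer"]
pairs = map_phase(docs, lambda doc: [(w, 1) for w in doc.split()])
counts = reduce_phase(pairs, lambda word, ones: sum(ones))
print(counts["the"])  # → 3
```

The point is that the programmer writes only the map and reduce functions; deciding which nodes run them, and recovering when those nodes die, is the Cloud Operating System's job.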
Examples of Cloud Computing infrastructure
Examples of Grid/utility computing
- Amazon EC2
- BOINC (SETI-like)
- SUN network.com
- Enomalism (EC2-like)
- update (06/05/08): Eucalyptus, an open-source EC2-compatible platform, is released under a BSD license
A computing cloud is a massively distributed network operating system that lets you build applications on an abstraction layer implementing computing, storage and information flow management (key technology: Cloud Operating System). In a computing cloud, control is largely semi-autonomous and near-peer coupled.
A computing grid is a highly distributed network of computing resources providing applications with transparent duplication (key technology: virtualization). In a computing grid, control is largely centralized.
- On-demand business models like SaaS can be implemented on a cloud, a grid or just the classical way.
- Companies can implement both grid and partial cloud (e.g. Amazon)
- Cloud computing can be further enhanced by using virtualization as well.
This article is cited or quoted by: Virtual Strategies
(1) Note the relationship between P2P, where nodes and super-nodes are always going up and down as users power their computers on and off, and an array of hundreds of thousands of servers where, by virtue of their sheer numbers, machines are always breaking down or coming back up.
(2) One could also imagine an Amazon S3 distributed among clusters of end users. It would be cool to store those pictures of the dear little ones redundantly: one could sync pictures directly (from phones or desktops) into a general or private (your friends/family) cloud. No more fear of disaster: when a hard drive dies, it's often the equivalent of many drawers of photographs burning. The paying version could even overflow onto Amazon S3 itself. Maybe Fotonauts will bring us that. Would be cool.
(3) Imagine a server whose reliability is 99.98%. Pretty high... But if you want a hyper-network of cheaper servers, maybe you'll only get a still-optimistic 99.9% reliability. If you have 600,000 such servers, it means that at any given time you have 600 servers down (best case) to 6,000 servers down (worst case, with only 99.0% reliability). And remember: if you want to replace them, you first have to find them!
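The footnote's back-of-envelope arithmetic, in a couple of lines of Python (assuming independent failures, as the footnote implicitly does):

```python
def expected_down(total_servers, per_server_reliability):
    """Expected number of servers down at any given instant."""
    return total_servers * (1 - per_server_reliability)

for reliability in (0.999, 0.99):
    down = expected_down(600_000, reliability)
    print(f"{reliability:.1%} reliability -> ~{down:.0f} servers down")
```

Which prints ~600 and ~6,000 servers down: hundreds of permanently failing machines, at all times, is the normal operating condition of a cloud, not an incident.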