Sunday, March 15, 2009

Security and Validation of data in the Cloud

The promise of Cloud Computing as both a solution to on-demand scalability issues and an opportunity to outsource a lot of the infrastructure headaches of grid computing is proving very attractive to IT departments. However, the nature of these services also emphasizes issues which are usually lower down the list of priorities for software deployed in-house, namely:
  1. Responsibility for the privacy and security of the data is outsourced along with the data itself.
  2. The integrity of the data is no longer under the organisation's control.
  3. Personnel administering the data are not part of the in-house team.
  4. Access to servers holding data may be compromised without the Cloud provider or originating organisation being aware of it immediately.
  5. The Cloud provider itself may have incentives that diverge from the organisation's (e.g. attempting to cover up data loss incidents, inefficient failure reporting and server monitoring, etc.).
While there has been a flurry of suggestions around the use of encryption for data stored in the Cloud to address issue 1 (which is the most obvious problem), ranging from the roll-your-own variety to the Cloud provider automatically encrypting data, less prominence has been given to the other factors.

I suppose this is unsurprising, as the Cloud vendors are unlikely to actively draw everyone's attention to issues inherent in their business model. Even so, in areas like Finance these problems are proving a stumbling block to use of the Cloud for important applications.

Essentially, the problem can be broken down into some basic questions:
  1. How do you minimise the possibility that problems in the Cloud infrastructure result in data loss or corruption?
  2. How do you ensure no one has tampered with the data on the server?
  3. How do you ensure the Cloud provider is providing the level of data integrity and availability it says it is, and is not gaming the system to your disadvantage?
  4. How do you do all of this with as low an overhead as possible in a distributed environment?
The cryptography community has recently started investigating these questions, and the most promising area of research revolves around 'Proof of Retrievability' (POR) schemes. A POR scheme is essentially a challenge-response protocol that enables a Cloud provider to prove to a client that a file is retrievable without loss or corruption, and enables a client to verify that its files are fully available and have not been altered. The benefit of POR over simple retrieval of a file or set of files is efficiency: the response can be compact (in the region of tens of bytes) and the verifier only needs a fraction of the original file or files to complete the proof.
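
To get a feel for why a fraction of the file is enough, the standard spot-checking argument applies: if the provider has lost or corrupted even a small proportion of the blocks, a modest number of randomly chosen blocks will hit the damage with high probability. A rough back-of-the-envelope sketch in Python (the block counts are made up purely for illustration):

from math import prod

def detection_probability(total_blocks, corrupted_blocks, challenged_blocks):
    # Probability that at least one of the uniformly sampled (distinct)
    # challenge blocks lands inside the corrupted set.
    miss = prod((total_blocks - corrupted_blocks - i) / (total_blocks - i)
                for i in range(challenged_blocks))
    return 1.0 - miss

# A million-block file with 1% of its blocks damaged: challenging just 460
# random blocks catches the damage with roughly 99% probability.
print(detection_probability(1_000_000, 10_000, 460))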

Such a scheme consists of a couple of technical components that are used to manipulate the blocks of data stored in the Cloud. Its essence is simply to split the data up so that the original can still be recovered even if a set of blocks is lost, and to sign each block so that tampering can be detected. The mechanisms to achieve this are:

1) Erasure correction codes (such as Reed-Solomon)
Erasure correction codes are used to tolerate data failure in many areas. We can use them to disperse the data across n = (m+k) servers such that the original data can be recovered from any m of the n servers. By placing each of the m data vectors and k parity vectors on a different server, the original data file can survive the failure of any k of the m+k servers without any data loss, with a space overhead of k/m.
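
As a toy illustration of the idea (this is not Reed-Solomon itself, just the simplest possible erasure code: m data stripes plus a single XOR parity stripe, i.e. k = 1), the sketch below survives the loss of any one of the m+1 shares:

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, m: int) -> list:
    """Split data into m equal stripes and append one XOR parity stripe,
    giving n = m + 1 shares, one per server, with a space overhead of 1/m."""
    stripe_len = -(-len(data) // m)                     # ceiling division
    padded = data.ljust(stripe_len * m, b"\x00")
    stripes = [padded[i * stripe_len:(i + 1) * stripe_len] for i in range(m)]
    parity = stripes[0]
    for s in stripes[1:]:
        parity = xor_bytes(parity, s)
    return stripes + [parity]

def rebuild(shares: list, lost: int) -> bytes:
    """Recover the single lost share by XOR-ing the surviving ones."""
    survivors = [s for i, s in enumerate(shares) if i != lost]
    rebuilt = survivors[0]
    for s in survivors[1:]:
        rebuilt = xor_bytes(rebuilt, s)
    return rebuilt

shares = encode(b"data dispersed across the cloud", m=4)   # 5 shares for 5 servers
assert rebuild(shares, lost=2) == shares[2]                # survives one failed server

A full Reed-Solomon code generalises this to any k, at the k/m space overhead mentioned above.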

2) Universal hash tokens (usually homomorphic)
A homomorphic hash is used because it allows signature hashes to be aggregated over many data blocks, which in turn supports compact proofs over large samples of the file.
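
A minimal sketch of the sort of linearly homomorphic tag involved (the construction below, t_i = alpha*b_i + PRF(key, i) mod P, is a simplification chosen for illustration rather than the exact scheme from any particular paper): the sum of two tags verifies the sum of the two blocks, so one small aggregate can vouch for many blocks at once.

import hmac, hashlib, secrets

P = 2**127 - 1                                  # prime modulus, an illustrative choice

def prf(key: bytes, index: int) -> int:
    """Per-block pseudorandom value derived from an HMAC."""
    mac = hmac.new(key, index.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(mac, "big") % P

def tag(alpha: int, key: bytes, index: int, block: int) -> int:
    """Linearly homomorphic tag: t_i = alpha * b_i + PRF(key, i)  (mod P)."""
    return (alpha * block + prf(key, index)) % P

alpha, key = secrets.randbelow(P), secrets.token_bytes(32)
b0, b1 = 1234567, 7654321
t0, t1 = tag(alpha, key, 0, b0), tag(alpha, key, 1, b1)

# One aggregated check covers both blocks: the tags add up in step with the data.
assert (t0 + t1) % P == (alpha * (b0 + b1) + prf(key, 0) + prf(key, 1)) % P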

The combination of these approaches allows a client to issue regular challenge-response queries against the Cloud servers for a random set of blocks across the full data set. The servers then compute a signature over the selected blocks and return it to the client, which uses it to validate the correctness and retrievability of those blocks. This process is fairly lightweight and can be performed orthogonally to the data functions, freeing the developer to concentrate on the data itself and allowing the development of tools that monitor data in the Cloud in a reliable, independently verifiable manner.
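
Putting the pieces together, a toy version of that challenge-response loop might look like the sketch below (the function names and the tag construction are mine, chosen for illustration; a real scheme like HAIL layers the erasure coding and multi-server aggregation on top of this):

import hmac, hashlib, secrets

P = 2**127 - 1                                          # illustrative prime field

def prf(key: bytes, i: int) -> int:
    mac = hmac.new(key, i.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(mac, "big") % P

# Client-side setup: tag every block before handing the file to the provider.
def setup(blocks):
    alpha, key = secrets.randbelow(P), secrets.token_bytes(32)
    tags = [(alpha * b + prf(key, i)) % P for i, b in enumerate(blocks)]
    return alpha, key, tags                             # tags are stored alongside the data

# Client: challenge a random subset of block indices with random coefficients.
def challenge(n_blocks, sample_size):
    indices = secrets.SystemRandom().sample(range(n_blocks), sample_size)
    return {i: secrets.randbelow(P) for i in indices}

# Server: the proof is two field elements, regardless of the file's size.
def prove(chal, blocks, tags):
    mu = sum(coeff * blocks[i] for i, coeff in chal.items()) % P
    sigma = sum(coeff * tags[i] for i, coeff in chal.items()) % P
    return mu, sigma

# Client: verification needs only its secrets, never the file itself.
def verify(chal, mu, sigma, alpha, key):
    expected = (alpha * mu + sum(coeff * prf(key, i) for i, coeff in chal.items())) % P
    return expected == sigma

blocks = [secrets.randbelow(P) for _ in range(1000)]    # the file, as field elements
alpha, key, tags = setup(blocks)
chal = challenge(len(blocks), 50)
mu, sigma = prove(chal, blocks, tags)
assert verify(chal, mu, sigma, alpha, key)              # an honest server passes

tampered = next(iter(chal))                             # corrupt a block we know was challenged
blocks[tampered] ^= 1
print(verify(chal, *prove(chal, blocks, tags), alpha, key))   # False: corruption is detected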

One of the first papers to discuss this in detail originated at RSA Labs and describes a system called HAIL (http://eprint.iacr.org/2008/489.pdf), which provides integrity verification for static data in the Cloud.

Later papers have presented improvements on this, and a recent publication extends the idea to cope not only with dynamic data but also to identify which servers in a cluster are misbehaving or compromised.

This later approach seems ideally positioned as the basis of either an application-embedded or independent monitoring tool that provides protection against Byzantine failure, malicious data modification and server collusion attacks, for the price of a small computational overhead.

Such a system would also enable end users to reliably validate a Cloud provider's SLAs, check that data is held on the tier it is supposed to be on, and add resilience against infrastructure issues (even to the point of pre-emptive identification of damaged nodes in the cluster).

Now all I need to do is start building one.