Clouds, Big Data, Dark Data and the next big thing


In a recent blog post we wrote about Gartner’s definition of “dark data”. By that definition, dark data is generally unused, often kept for compliance reasons, and ultimately costs more to keep than it is worth. A typical example is the sprawl of PST files in corporate environments and on people’s laptops.

I can confirm from my own experience that even corporations that discouraged (or, I could say, prohibited) the use of PST files for email archival did not stop people from creating them. It is simply more convenient to have your old emails accessible at any time than to rely on your company’s email archive system, which is only accessible when you are online and connected to your company’s VPN.

This behaviour is not that different from the use of cloud providers like Dropbox for file sharing, which is also discouraged by most companies, but used by many employees. The more tech-savvy employees are, the bigger the risk that your company’s IT policies will be ignored, circumvented, or adapted.

Coming back to dark data in the form of PST files: these become an issue if people decide to store copies on your network shares, where they are automatically backed up. They take up space in your infrastructure, and you end up with unknown data that may be queried for compliance. Instead of being able to manage the scheduled deletion of old emails, you find that they are not only back, but also hidden away in a PST container. In summary, you end up with no value at added cost!

Ultimately I see this as the main risk of dark data: if it adds no value and is not required for compliance (as compliance should be handled as part of an internal policy), then it should be deleted to avoid unnecessary costs.

There are two possible approaches to the handling of dark data:

Treat it as part of a Big Data project that analyses the existing data. We could equally continue to call this Business Intelligence/Analytics. The advantage of this approach is that it should actually generate additional value; the downside is that it requires significant effort.

Perform a Data Profiling exercise, which will provide useful information about your data, such as its age (how many files have not been touched for years?), its ownership (how much data is owned by people no longer with the company?) and its content (what is the percentage of PST, zip, audio and video files?). The emphasis of this approach is on reducing the cost of maintaining unrequired, or “dark”, data.
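
To make the profiling idea a little more concrete, here is a minimal sketch in Python of what such an exercise could look like on a single file share. It is only an illustration under some assumptions of mine: the function names, the three-year threshold and the list of watched extensions are hypothetical, "not touched" is approximated by the last modification time, and ownership is reported as the numeric owner ID the operating system returns (which would still need to be matched against your HR or directory data, and is not meaningful on every platform).

```python
import os
import time
from collections import Counter
from pathlib import Path

# Hypothetical list of file types to watch; adjust to your environment.
WATCHED_EXTENSIONS = {".pst", ".zip", ".mp3", ".wav", ".mp4", ".avi"}
SECONDS_PER_YEAR = 365 * 24 * 3600


def profile_tree(root, stale_years=3, now=None):
    """Summarise age, ownership and file types of everything below root."""
    now = time.time() if now is None else now
    cutoff = now - stale_years * SECONDS_PER_YEAR
    report = {
        "files": 0,
        "bytes": 0,
        "stale_files": 0,                 # not modified since the cutoff
        "bytes_by_owner": Counter(),      # keyed by numeric owner ID
        "bytes_by_extension": Counter(),  # watched extensions only
    }
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                info = path.stat()
            except OSError:
                continue  # unreadable or vanished file; skip it
            report["files"] += 1
            report["bytes"] += info.st_size
            if info.st_mtime < cutoff:
                report["stale_files"] += 1
            report["bytes_by_owner"][info.st_uid] += info.st_size
            extension = path.suffix.lower()
            if extension in WATCHED_EXTENSIONS:
                report["bytes_by_extension"][extension] += info.st_size
    return report


def watched_share(report):
    """Percentage of all bytes that sit in the watched file types."""
    if report["bytes"] == 0:
        return 0.0
    watched = sum(report["bytes_by_extension"].values())
    return 100.0 * watched / report["bytes"]
```

Calling profile_tree on the root of a share and passing the result to watched_share would give a first impression of how much space goes to PST, zip, audio and video files, and how many files are older than the chosen threshold. A real exercise would more likely use a dedicated tool, but the questions it answers are the same.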

Ultimately, dark data is just a subset of big data, and I am not convinced we needed a label for it. However, the last few years have seen a number of new buzzwords in IT, some of which have actually taken off (“cloud”). No doubt there will be a new flavour of the year soon. Now, if only I could come up with something catchy for all things “software-defined”…