Big Data Introduction


Big Data has become one of the buzzwords of the last three years or so, with big banks, telcos, science, research and medical bodies, and advertising and marketing organisations seeming to lead the field in talking up its adoption. So what is Big Data? How is it being used? What’s available to make sense of it? This article gives some sense of what Big Data is, what it can do and what you might need. I will add another article in the future to look at more specifics around uses and implementation.

What is Big Data?
So, what is Big Data? This is a bit like asking what cloud is! Although we can make some general remarks, there are probably as many definitions of Big Data as there are of cloud. So let’s pull together a few and come up with some consensus, with my take influencing the outcome. Where did the first use of the term Big Data come from? It seems to be a paper written by NASA (who else) in 1997, talking about the amount of data they were collecting and the problems in storing it. It was not until 2008 that the term Big Data or “big data” started to be used regularly in the press and other articles. In 2013 the Oxford English Dictionary (OED) included a definition of “big data” in its June quarterly update. There is an excellent article on the history of “big data” on the Forbes site by Gil Press (

To me Big Data is a rather glib term for something that is changing our world. What is happening is the hugely increasing datafication of our world (personally and professionally) and our increasing ability to analyse that data. The OED describes big data as “data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.” This is pretty apt, as is Wikipedia’s definition: “Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.” For me, Big Data is defined by a set of characteristics:

  • Large datasets typically beyond the ability of traditional database software to process effectively
  • Often, but not always, based on semi-structured or unstructured data
  • Focus of use for analysis and prediction
  • Extraction of value from data (turning it into information)
  • In the enterprise, the change from processing internal data to mining external data

I note that the Internet of Things has replaced Big Data at the top of Gartner’s hype cycle and Big Data is sliding into the ‘trough of disillusionment’. Data science is now on its way to the top of the hype cycle, which represents a more mature approach.

How is Big Data being used?
Some industries were using Big Data long before the term existed. Epidemiology, a branch of medicine, has long looked at large amounts of data to assess trends and make medical decisions, generally in preventative medicine. You can thank epidemiology for the following, among others:

  • The incredible decline in child mortality over the last 150 years,
  • The widespread use of vaccines, especially for polio and measles, which governments provide based on the effectiveness seen through epidemiology,
  • Public health improvements through sanitation and clean water,
  • Malaria prevention and controls,
  • Tuberculosis control.

So those are examples of Big Data improving people’s lives, but where has Big Data been used by business to make decisions, improve the bottom line, raise customer satisfaction and so on? Let’s take some areas and have a quick look at them.

Marketing and Advertising
Google, Yahoo, Facebook and others process massive amounts of clickstream data to deliver targeted ads to consumers and decide, in real time, the best placement for their clients’ adverts. Since the decision uses both historic and real-time information, the amount of data to assess and respond to is enormous. This is beyond what a traditional relational database can process effectively, and Google, for example, developed its own database, Bigtable. This was subsequently commercialised as part of its cloud offering.

Amazon is another organisation that makes extensive use of Big Data to send you personalised offers. It holds all the purchases you have made, the pages you have visited, how long you spent on them, the time between purchases and so on, and it compares you with others who have a similar profile to make “people like you like…” suggestions.
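To make the “people like you like…” idea concrete, here is a minimal sketch of one common approach: score other customers by how much their purchases overlap with yours, then suggest what they bought that you have not. This is an illustration only, not a description of Amazon’s actual system; the customer names, the items and the `jaccard` and `recommend` helpers are all hypothetical.

```python
from collections import Counter

# Hypothetical purchase histories: customer -> set of items bought.
purchases = {
    "alice": {"kettle", "teapot", "mug"},
    "bob": {"kettle", "mug", "toaster"},
    "carol": {"teapot", "mug", "tea"},
    "dave": {"laptop", "mouse"},
}


def jaccard(a, b):
    """Similarity of two item sets: shared items divided by all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(customer, histories, top_n=3):
    """Suggest items bought by similar customers that this customer lacks."""
    own = histories[customer]
    scores = Counter()
    for other, items in histories.items():
        if other == customer:
            continue
        weight = jaccard(own, items)
        if weight == 0:
            continue  # nothing in common, so this customer tells us nothing
        for item in items - own:
            scores[item] += weight  # similar customers count for more
    return [item for item, _ in scores.most_common(top_n)]


suggestions = recommend("alice", purchases)
```

A production system would work from far more signals (page views, dwell time, time between purchases) and far more customers, which is where the Big Data tooling comes in.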

The major banks have made strides in marketing, making you personalised offers while you browse their website or internet banking site. These are based on the financial information they hold on you, your credit card purchases, where you have been on their sites and where you have come from.

Customer satisfaction and loyalty
Retailers, especially those with well-trafficked websites, are linking your website visits to your foot traffic in the store and making offers to you in real time, including to your mobile when you are in store. While this might not be quite here yet, it is likely coming.

Sensors and predictive analysis
A number of manufacturing companies like Otis, and others like Union Pacific railroad, are using sensors in their equipment that generate large amounts of data. This data is analysed using Big Data techniques. From this, failures are predicted and then corrected before they actually happen. Union Pacific reduced the number of derailments by 30% by using the information and proactively tackling issues.
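As a rough illustration of how sensor data can point to a problem before it becomes a failure, the sketch below flags any reading that strays far from the average of the readings just before it. It is a deliberately simple stand-in for the techniques such companies use, which the article does not describe; the `flag_anomalies` function, its `window` and `threshold` parameters and the temperature values are all assumptions made for the example.

```python
from statistics import mean, stdev


def flag_anomalies(readings, window=5, threshold=3.0):
    """Return the indexes of readings that sit more than `threshold`
    standard deviations from the mean of the preceding `window` readings."""
    flagged = []
    for i in range(window, len(readings)):
        recent = readings[i - window:i]
        mu, sigma = mean(recent), stdev(recent)
        if sigma > 0 and abs(readings[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged


# Hypothetical bearing temperatures from one sensor, one value per minute.
temps = [70.1, 70.3, 69.9, 70.2, 70.0, 70.1, 78.5, 70.2]
suspect = flag_anomalies(temps)
```

The Big Data part is doing this continuously across thousands of sensors and combining the results with maintenance history to decide what to fix first.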

Financial Markets
Analysing trends and making decisions on derived information can have a significant impact on trading profitability. Banks trading in financial markets are using market data along with external data (government statistics, weather, etc) in significant quantities to try to gain a small advantage, which can result in significant profit increases. Being first to a market opportunity can mean millions of dollars. Even intraday trading will be affected if trades can be made seconds or milliseconds before others see the same opportunity. The intersection of big data and algorithmic trading is likely to show promise, but it carries risks as well.

You may be less happy about this one, but governments are using the broad range of contact and data they have with you to check for things like welfare fraud and tax fraud. They combine large amounts of data across multiple agencies to profile citizens for tax, benefit entitlement and so on. This happens regularly so that benefits can be checked and adjusted as circumstances change. The amount of data is sizeable, it sits across multiple data stores, and significant data science is needed to extract the right information.

Non-traditional uses
It is understood that during an overseas crisis, a large cluster was set up to ingest all of Twitter. This was to identify the protagonists and others who were for and against the government of a certain state. The analysis ran across many languages, identifying keywords, sentiment, profile and other factors in near real time. This gave a clearer picture of what was happening, to inform decisions on support for particular groups.

Big Data tools
Now that we know what Big Data is and what it is being used for, we need a view of the tools and techniques of Big Data. One of the things about Big Data is that on its own it does not really do anything except store and organise data. Many of the tools focus on this aspect; however, storing and indexing the data is one thing, and actually making sense of it is another. This is where I believe much of the Big Data talk falls down. Analysis and visualisation of the data is where ‘the rubber hits the road’ and business value is generated. This area is often neglected in the hype around Big Data, but to me it is probably the most interesting, because you need to understand what could be possible and then how to visualise the desired result effectively.

First let’s look at the Big Data tools and what they are. The majority of these are open source. Generally the most often mentioned tool is Hadoop.

Hadoop grew out of Nutch, an open source web search project, and was inspired by papers Google published describing its distributed file system and MapReduce. Hadoop usually refers to two tools: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS is essentially a large-scale file system that allows data to be accessed wherever it is stored. It divides the data up into smaller pieces that can be stored and accessed more easily. MapReduce indexes or executes code on the data, but does so by taking the code to the distributed files and executing it on the nodes that hold them, rather than bringing all the data to a central location and processing it there. The MapReduce capability implements a programming model that can be used to create fault-tolerant, distributed analysis applications that take advantage of the Hadoop file system. The programming model takes care of the transport, execution and return of the results. Displaying the results requires other tools, which can run as Hadoop jobs. A number of companies provide commercial implementations of Hadoop, including Hortonworks, Cloudera and MapR.
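The MapReduce programming model is easier to see in code than in words. The sketch below is a single-machine imitation in plain Python, not the Hadoop API: it counts words by mapping each chunk of data to (word, 1) pairs, grouping the pairs by word, and reducing each group to a total. The function names and sample chunks are made up for illustration. On a real cluster the map step would run on the node holding each chunk, and the framework would handle the grouping and transport.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of data."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all the values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    mapped = []
    for chunk in chunks:  # on a cluster, each chunk is mapped where it is stored
        mapped.extend(map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Two hypothetical chunks standing in for blocks of a much larger file.
chunks = ["big data is big", "data about data"]
counts = word_count(chunks)
```

Because each map call sees only its own chunk and each reduce call sees only one key, the work can be spread over many nodes and rerun on another node if one fails.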

Apache Spark is similar to Hadoop MapReduce but performs operations in memory, enabling much higher performance. Spark needs a distributed storage system such as the Hadoop file system or Cassandra, and it can also run against Amazon S3. Spark supports SQL, with some limitations, which makes the transition easier for RDBMS users.

Splunk is used for analysing machine-generated Big Data. It excels at processing data in real time: capturing, indexing and correlating it, and finally reporting on and visualising the results. Splunk is often used to capture small changes in machine data that can expose a larger trend or event. Splunk supports its own ‘search processing’ language. It is often used for machine log analysis for functions such as security, traffic analysis and operations support.

Apache Storm, formerly Twitter Storm, is another real-time processing framework like Splunk, and it promotes itself as the real-time version of Hadoop. BackType originally developed the system; Twitter moved it to the open source world after acquiring BackType. Key users of Storm include Twitter, Spotify, Yahoo, Yelp and Groupon. ‘Locally’, Telstra-owned Ooyala (the technology behind the Presto TV service) uses the system for its analytics. Storm is written in Clojure and implements a directed graph computation. This takes input from a stream or queue, processes it and emits another stream. Other processes can read from that stream if they declare that they can, and this can be repeated. So this leads to an input queue, a set of processes acting on the input data in an internally declared order, and an eventual output. Key to Storm is that the processing never ends unless it is killed (which is what you want in a real-time system). Storm integrates with many queuing technologies, including Amazon Kinesis.
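The stream-in, stream-out shape described above can be imitated with Python generators. This is only a sketch of the idea and does not use Storm’s API: the `sentence_spout`, `split_bolt` and `count_bolt` names borrow Storm’s spout and bolt vocabulary, but the functions themselves are hypothetical, and the input here is a short list rather than an endless queue.

```python
def sentence_spout(lines):
    """Source of the stream. A real source would read from a queue and never finish."""
    for line in lines:
        yield line


def split_bolt(stream):
    """Read sentences from one stream and emit a new stream of words."""
    for sentence in stream:
        for word in sentence.split():
            yield word.lower()


def count_bolt(stream):
    """Read words and emit (word, running count) as each one arrives."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]


def run_topology(lines):
    """Wire the steps together in a declared order: source -> split -> count."""
    return list(count_bolt(split_bolt(sentence_spout(lines))))


events = run_topology(["storm storm", "Storm runs"])
```

Each step handles items one at a time as they arrive, so results are available continuously rather than at the end of a batch, which is the essential difference from a MapReduce job.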

Cassandra is a distributed database system with an emphasis on performance and high availability. It was originally developed by Facebook and used in their messaging platform before Facebook Messenger. Facebook open sourced the software and it is now a top-level project at the Apache Software Foundation. Cassandra supports MapReduce as well as a number of other frameworks and has its own Cassandra Query Language (CQL), which is similar to (but not the same as) SQL. It is a hybrid key-value and columnar (table) database. It has been a popular ‘NoSQL’ database with a large number of high-profile users including Instagram, Apple, Netflix, Facebook, Nutanix, Reddit and Twitter.
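One way to picture the “hybrid key-value and columnar” description is as a dictionary of small sorted tables: a key locates a partition, and within it rows are kept in order and may each carry different columns. The toy class below sketches that mental model in plain Python. It is not Cassandra’s implementation or client API, and the `WideColumnTable` class, its methods and the sample data are all hypothetical.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy model: partition key -> {clustering key -> {column: value}}."""

    def __init__(self):
        self._partitions = defaultdict(dict)

    def insert(self, partition_key, clustering_key, **columns):
        """Write columns to a row; writing to an existing row merges into it."""
        row = self._partitions[partition_key].setdefault(clustering_key, {})
        row.update(columns)

    def select(self, partition_key):
        """Key-value style lookup of a partition, rows sorted by clustering key."""
        rows = self._partitions.get(partition_key, {})
        return [(key, rows[key]) for key in sorted(rows)]


# Hypothetical inbox: one partition per user, rows ordered by message id.
inbox = WideColumnTable()
inbox.insert("alice", 2, sender="bob", body="hi")
inbox.insert("alice", 1, sender="carol", body="hello")
inbox.insert("alice", 2, read=True)  # adds a column to an existing row
alice_rows = inbox.select("alice")
```

In a real cluster the partition key also decides which nodes store the data, which is how reads and writes are spread across machines.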

Analysis tools – selected with a focus on open source

While Splunk has an analysis interface built in, there are large numbers of specific tools to analyse and present the information from Big Data. While some commercial analysis tools have moved to support Big Data sources, the majority of the software is open source and often comes from the scientific community.

R is an open source programming language for statistical computing and graphing. Its commercial counterparts would be SAS, SPSS and Matlab. R was created at the University of Auckland based on a language called S (the names are so original!). To use R on Big Data, one would first process the data with MapReduce or a stream processing framework and then act on the result set with an R program to deliver results in a meaningful way. R’s ability to visualise data through its graphics capabilities is outstanding. Generally the software is used by end users rather than IT, but both can work together. R has a number of IDEs, including RStudio, and these integrate with enterprise code management tools. Its open source nature has led a number of commercial applications (Matlab, SAS, Tableau, etc) to provide interfaces to R or the ability to include R resources in their products.

Pentaho is open core software, with both an open source edition and a paid enterprise version that adds extra features and support. The software provides a layer that interfaces to a very wide range of Big Data and traditional data sources. Modules above this layer provide job construction, including Hadoop (MapReduce) jobs, the pipelining of these jobs and the output capabilities.

The visualisation and analysis market is where I think the next set of battles will be fought. It is all very well to have the ability to process the data, but making sense of it and presenting the information in a digestible way is a skill that is currently in short supply.

The Last Word
As for any technology article, one should turn to Dilbert for a final word on the subject!