Real-Time Streaming on Hadoop

Real-time streaming is attracting a lot of attention. It is highly desirable for industries such as retail, banking, and logistics, but implementing it on big data platforms is challenging, and despite plenty of discussion on the subject it still has an air of mystery. Having worked on Hadoop for many years, I'd like to share my thoughts and experience on the different implementations of real-time streaming.

The Hadoop Distributed File System (HDFS) stores files by spreading their blocks relatively evenly across the Hadoop cluster. The key feature of Hadoop 2.x is YARN, which allocates and deallocates cluster resources such as memory and CPU, coordinating their use among the processes and jobs running on the cluster. In Hadoop 1.x, every job ran in batch mode: data-related operations were grouped into coarse-grained chunks, processed by MapReduce, and only available once the job finished. By decoupling resource management from MapReduce, YARN opened Hadoop up to "streaming" applications, which process incoming data continuously as it arrives.
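To make the HDFS side of that concrete, here is a minimal sketch of writing a file through Hadoop's Java FileSystem API. The NameNode URI and the file path are placeholders of my own; in a real cluster the address would come from core-site.xml rather than being set in code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally read from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"))) {
            // HDFS splits the file into blocks and replicates them across
            // DataNodes; the client never manages placement itself.
            out.writeUTF("hello hdfs");
        }
    }
}
```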

Hadoop 2.x became generally available in late 2013. Apache Flume had already offered a form of streaming ingestion, continuously collecting log files and writing them into HDFS where they could be analysed, but it was purpose-built for that one job. Storm, written by Nathan Marz at BackType (later acquired by Twitter), is different: it is a generic real-time streaming framework. A Storm application is a topology that plugs together 'spouts' (stream sources) and 'bolts' (processing steps): data can flow in from several spouts, pass through chains of bolts, and be written out to multiple destinations. Storm has proven itself at scale and is used in production by large companies, including Twitter.

Storm, however, is just a programming framework. It is written in Clojure, with a Java API added later. There is no visual interface or set of control buttons: to create a streaming application with Storm, you have to write Clojure or Java code. For a developer fluent in Java or Clojure that may be like riding a bike, but for everyone else it's like flying a plane.
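As a flavour of what that code looks like, here is a sketch of the classic word-count topology wired together in Java. It is only illustrative: SentenceSpout, SplitSentenceBolt and WordCountBolt are hypothetical classes standing in for your own implementations, and the backtype.storm package names match the pre-Apache Storm releases of that era:

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // Hypothetical spout/bolt classes: the spout emits sentences, one
        // bolt splits them into words, another keeps running counts.
        builder.setSpout("sentences", new SentenceSpout(), 1);
        builder.setBolt("split", new SplitSentenceBolt(), 2)
               .shuffleGrouping("sentences");
        builder.setBolt("count", new WordCountBolt(), 2)
               .fieldsGrouping("split", new Fields("word"));

        // Run in-process for testing; a real deployment would submit the
        // topology to a cluster with StormSubmitter instead.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", new Config(), builder.createTopology());
    }
}
```

Even this small example shows the point: the topology, the groupings and every processing step live entirely in code.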

In the meantime, LinkedIn has open-sourced Kafka, a distributed publish-subscribe messaging system that transports streams of events between systems, and more recently the developers of Spark have built their own "Spark Streaming" library on top of the Spark engine.
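To show where Kafka sits in such a pipeline, here is a hedged sketch of publishing an event with Kafka's Java producer client (a newer API than LinkedIn's original Scala client); the broker address and topic name are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address.
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each record is appended to a partitioned, replicated log that
            // downstream consumers (Storm, Spark, ...) read at their own pace.
            producer.send(new ProducerRecord<>("page-views", "user42", "clicked"));
        }
    }
}
```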

You will witness arguments over which of these three (leaving Apache Flume aside) is best, as each has its pros and cons. Storm is proven technology but not easy to use. Kafka is easier to get started with, but it solves a narrower problem: it moves and buffers streams of messages rather than processing them, and in practice it is often paired with Storm rather than replacing it. Spark Streaming is promising but is a comparatively new project, and so is considered less mature than Storm and Kafka.
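For comparison with the Storm sketch above, here is a minimal Spark Streaming word count in Java over text arriving on a socket, written against the Spark 2.x streaming API; the host and port are placeholders:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("wordcount");
        // Incoming data is chopped into one-second micro-batches.
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);
        JavaDStream<String> words =
                lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairDStream<String, Integer> counts =
                words.mapToPair(w -> new Tuple2<>(w, 1)).reduceByKey(Integer::sum);
        counts.print();

        ssc.start();
        ssc.awaitTermination();
    }
}
```

The micro-batch model is the main design difference from Storm's tuple-at-a-time processing, and it is one reason the two are often compared.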

One important drawback they all share, however, is the lack of a visual interface. Business users would undoubtedly welcome a dashboard that updates itself in real time as data streams in, shows a few key figures, and perhaps offers some attractive on/off buttons to play with. Developers, in turn, would appreciate a drag-and-drop interface where code can be written and slotted into a specific stage of a pipeline or directed acyclic graph (DAG) of tasks. Commercial vendors are aware of these wishes and are currently developing such interfaces.

The underlying technologies for streaming on Hadoop are clearly all there, but vendors are trying to make streaming applications easier, and more visual, to develop and manage. This is an exciting space to watch.
