Friday 27 May 2016

Introduction

Apache Zeppelin is a web-based notebook that enables interactive analysis of data. It is a powerful tool that combines the analytics capabilities of many components of the Apache Big Data ecosystem and presents them in a notebook form that allows for rich visualizations of the results produced by those components. It supports the typical data science workflow, which generally involves exploratory analysis of data, followed by model building, use of algorithms, report generation and even real-time collaboration. With Zeppelin, one can take data from online sources, perform analysis and generate a report, all in a single notebook.

Zeppelin Welcome Page
Zeppelin was conceived at NFLabs, a South Korean big data analytics company, as an internal project that was later open sourced and subsequently accepted into incubation at the Apache Incubator. Zeppelin has a well-established community of developers with numerous contributions. After almost one and a half years of incubation, Zeppelin has graduated from the incubator and has been accepted as a top-level project at the Apache Software Foundation. Zeppelin is available for download as a binary package, or it can be built from source by cloning the repository and building with Maven.

The main place to do analysis in Zeppelin is the notebook.
Zeppelin Notebook
Zeppelin can be thought of as a keeper of notebooks; one can create, search, filter, delete, import and export notebooks. A notebook is easy to create: one just has to provide a name for it. After a notebook has been created, one performs analysis in it in units called paragraphs. A notebook comprises many paragraphs, and each paragraph targets a single back-end processing system such as Apache Spark. Paragraphs are flexible: they can be created, removed, re-ordered and changed at will. To get output, we run the code in a paragraph. This forms the basis of the data science workflow, where the initial paragraphs may be used to load the data and perform exploratory analysis and the later paragraphs may be used to visualize the results.

The code in the paragraphs is interpreted by the back-end systems that are currently available as interpreters for Zeppelin. Each notebook has an interpreter binding for certain interpreters. The binding can be changed from the 'interpreter binding' option, displayed as a 'settings' icon in the top right corner of a notebook. New interpreter settings can also be created from the interpreter menu, which has a create button for setting up a new interpreter on top of the existing back-end systems. Entirely new interpreters can be written by extending the Interpreter class in the source code. Many interpreters are supported by default, such as Spark, Hive, Cassandra and others from the Apache Big Data ecosystem, along with interpreters for shell commands and for Markdown, which is used for writing formatted text in paragraphs. At the time of writing, the most recent release, Zeppelin 0.5.6, supports Spark up to version 1.6.0; future versions are expected to support newer releases of Spark.

When a notebook is created or loaded, three contexts, namely SparkContext (sc), SQLContext (sqlContext) and ZeppelinContext (z), are automatically injected into the system. In a paragraph, they may be used by the names indicated inside the parentheses; the user does not need to create them.

An 'Interpreter' option is displayed at the top of the page; clicking it opens the interpreter menu with the settings and configurations for the interpreters currently in use. A 'Configuration' option, also at the top, displays various other configurations of Zeppelin such as the local repository. Zeppelin also supports loading jar dependencies at runtime from the local file system or a Maven repository using the %dep interpreter, although this must happen before the Spark interpreters are used.

Zeppelin has the following display formats: text, HTML, table and the Angular display system. By default, all output from the language back-end in a paragraph is displayed as text. With the %html directive, Zeppelin treats the output as HTML, as in: print("%html <h1> Your text here </h1>"). The %table directive tells Zeppelin to display the output as a table, provided the output itself is in table format, that is, rows separated by new lines and columns separated by tabs. The Angular display system treats the output as a view template of AngularJS. The output statement should start with the %angular directive for the Angular display system to work. Note that the display system is back-end independent. One can also bind/unbind and watch/unwatch variables. Another feature is the ability to create dynamic forms. With Zeppelin, one can dynamically create forms from back-end systems such as Markdown, shell and Spark SQL. ZeppelinContext also provides a form creation API for creating forms programmatically.
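As a minimal sketch of the %html and %table output formats described above, the following Python helpers build the strings that a paragraph would print. The function names and the sample data are hypothetical and exist only for illustration; the sketch also assumes that the first row of a %table payload is treated as the header row.

```python
def to_zeppelin_table(headers, rows):
    """Build a %table payload: columns separated by tabs, rows by newlines.

    Assumption: the first line is interpreted as the header row.
    """
    lines = ["\t".join(str(h) for h in headers)]
    for row in rows:
        lines.append("\t".join(str(cell) for cell in row))
    return "%table " + "\n".join(lines)


def to_zeppelin_html(fragment):
    """Prefix an HTML fragment with the %html directive."""
    return "%html " + fragment


# Hypothetical sample data, not taken from a real notebook.
table_output = to_zeppelin_table(["name", "count"], [["spark", 3], ["hive", 1]])
html_output = to_zeppelin_html("<h1>Your text here</h1>")

# Inside a Zeppelin paragraph, printing these strings is what triggers
# the table and HTML rendering; elsewhere they are plain text.
print(table_output)
print(html_output)
```

Outside Zeppelin the directives have no effect, so the same code can be run in an ordinary Python session to check the tab and newline layout before pasting it into a paragraph.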

The default visualization capabilities of Zeppelin include bar, line and pie charts, scatter plots and a plain table view. Once results are obtained from a back-end system, they can be converted into table format and displayed in any of the available visualizations. The visualizations themselves are interactive: they allow grouping the data by keys, applying operations such as max, min, sum, avg and count, and, in scatter plots, sizing the data points by a value.
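To make the grouping and aggregation options concrete, here is a small Python sketch of the kind of per-key summary those chart controls compute. It is an illustration only, not Zeppelin's implementation; the function name `aggregate_by_key` and the sample rows are hypothetical.

```python
from collections import defaultdict


def aggregate_by_key(rows, key_index, value_index):
    """Group rows by one column and summarize another column per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[value_index])

    summary = {}
    for key, values in groups.items():
        summary[key] = {
            "sum": sum(values),
            "avg": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
            "count": len(values),
        }
    return summary


# Hypothetical sample rows: (interpreter name, number of paragraphs).
rows = [("spark", 4), ("hive", 2), ("spark", 6)]
summary = aggregate_by_key(rows, key_index=0, value_index=1)
print(summary["spark"])
```

In the notebook itself these operations are selected interactively on the chart, so no such code is needed; the sketch only shows what each option means for a given key.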

The default visualizations can be extended by another powerful feature of Zeppelin called Helium. Helium is a pluggable application framework that can be used to generate custom displays and to use flexible back-end code to drive output in the Zeppelin front-end. A Helium application comprises three components: the back-end application, the view and the resources available to the application. To write a Helium application in Java, the user extends the Application class in the Helium package. In the application, the user defines the output behavior, whether or not it depends on input from the front-end. This forms the application component. Then there must be a view, written in HTML, CSS or JavaScript, that decides what is displayed on the front-end. Finally, the currently running instance of Zeppelin holds resources in a resource pool, and the application runs on the front-end based on the resources available in that pool.

Several storage mechanisms are provided for notebooks, the default being the notebooks folder inside the Zeppelin installation or build directory. Others include versioned storage in a local Git repository and storage on the Amazon S3 service. Notebooks can also be viewed directly in the Zeppelin Hub viewer by specifying the public URL of the notebook.

Finally, after all the analysis has been done, the user can generate a report of the work by opening the option displayed as 'default' and selecting 'report'. Doing so presents the notebook in a report format that can be viewed by collaborators.
