ZeppelinNotes: May 2016

Introduction

The World is a Zeppelin notebook that performs analysis on the World Bank data sets.

The World Notebook

The World Bank makes the data that it collects from the countries and regions of the world available to developers, so that they can analyze it and find interesting patterns. The analysis in this notebook first downloads the data that will be used and then loads it into Zeppelin. Subsequent paragraphs perform the analysis and display the results and visualizations. The formatted text in the paragraphs is written in markdown. For a paragraph to render markdown, its first line should start with %md.

Choosing the Data Sets

The World Bank provides a large amount of data that can be used for analysis. Since this notebook investigates a few of the most important factors that affect the world, only a few of the data sets need to be considered, and those that are chosen must be relevant to the analysis. Data is available for numerous topics such as arable land percentage, employment rates and literacy rates. These topics are referred to as indicators. The data may be filtered by country, time span, income level, format (XML or JSON) and records per page.

Three main subjects, or domains, have been taken up for analysis: Growth, Population and Energy. To extract the indicators relevant to a particular subject from the raw list of indicators, we perform domain extraction, which means going over the list and choosing the indicators that may affect the subject. For example, population growth will be affected by birth rates, death rates and life expectancies. This is repeated until the entire list has been exhausted. In the end, the indicators for a particular subject may be grouped into two sets. The first set comprises the subject itself and gives a picture of how the subject has been performing over recent years (for example, GDP for growth). The second set contains the indicators that affect the subject and caused the changes visible in the first set. These sets are referred to as the component indicators and the dependent indicators respectively.

The initial paragraphs of each subject analysis fetch the data sets from the World Bank site using the shell interpreter, and the next paragraph loads the data as a DataFrame. To use the shell interpreter, the first line of the paragraph should be %sh. The SQLContext that is available by default in Zeppelin loads the JSON files directly (after the first lines have been removed) and infers the schema of the records, so that the fields of a record can be accessed through the DataFrame object of that data set. This simplifies loading the data sets and getting the needed data from them. After the analysis for a particular subject is done, the tables are dropped using the 'dropTempTable' function of sqlContext.
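
To illustrate why the first lines are removed before loading, here is a minimal sketch. It assumes the downloaded response is a two-element array, with paging metadata first and the list of records second, and that the loader expects one JSON record per line. The helper names `split_world_bank_payload` and `to_json_lines` and the sample values are hypothetical and are not taken from the notebook.

```python
import json


def split_world_bank_payload(raw_text):
    """Split a response into (paging metadata, list of records).

    Assumes the layout [metadata, [record, record, ...]].
    """
    payload = json.loads(raw_text)
    if not (isinstance(payload, list) and len(payload) == 2):
        raise ValueError("expected a two-element array: [metadata, records]")
    metadata, records = payload
    return metadata, records or []


def to_json_lines(records):
    """Serialize records one per line, skipping records with no value."""
    kept = [r for r in records if r.get("value") is not None]
    return "\n".join(json.dumps(r, sort_keys=True) for r in kept)


# Made-up sample that only mimics the assumed layout; not real data.
SAMPLE = json.dumps([
    {"page": 1, "pages": 1, "per_page": 50, "total": 3},
    [
        {"country": {"id": "XX", "value": "Region A"}, "date": "2014", "value": "1.5"},
        {"country": {"id": "XX", "value": "Region A"}, "date": "2013", "value": None},
        {"country": {"id": "YY", "value": "Region B"}, "date": "2014", "value": "2.5"},
    ],
])

metadata, records = split_world_bank_payload(SAMPLE)
json_lines = to_json_lines(records)
```

Writing `json_lines` to a file gives one record per line, so each line can be parsed on its own and the schema can be inferred from the records alone.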

Analysis using Apache Spark

1. Growth in the World

The first subject that is considered is the growth that has taken place in the world.

World Growth

Nothing indicates the growth of a region better than its Gross Domestic Product (GDP). Five kinds of GDP data are available on the World Bank site: Growth GDP, GDP per Unit of Energy Use, GDP per Capita, GDP per Person Employed and GDP at Market Prices. It is useful to first look at how the various GDPs have varied over recent years. To use the default visualization capabilities of Zeppelin, we register each DataFrame as a temporary table using the 'registerTempTable' function of the DataFrame object. This allows the data set to be accessed as a table in the next paragraph using SQL. The first line of that paragraph should be %sql for SQL queries to run in it. All of the GDPs present a positive picture of the world: all have been growing except for a few slumps, while the Growth GDP has remained fairly constant. The Euro region has the lowest Growth GDP overall because of its small population, so it is unable to generate as much GDP as the Asian and Pacific region, which has the largest population. On the other hand, the Euro region has the greatest GDP per Capita because the Growth GDP is divided by a small population. In contrast, the Asian and Pacific region has the smallest GDP per Capita because its population is so large.

Based on a research paper published online by the Symbiosis Institute of Management, Pune, which highlights the factors that may affect the GDP of a region, the following dependent indicators were chosen for analysis: Exports, Foreign Direct Investment (FDI), Employment to Population ratio and CPIA (Country Policy and Institutional Assessment) values for the economic management cluster average. Exports showed a slight positive correlation with the GDP values, while the FDI values showed a negative correlation, which suggests that the Growth GDP of a region is inversely related to its FDI values. The Employment to Population ratio showed a very high correlation with the GDP per Person Employed values, and the regions of the world tend to form their own clusters when GDP per Person Employed is plotted against the Employment to Population ratio. Finally, the CPIA values also showed a positive correlation with the Growth GDP values, indicating that these values are also important in determining the GDP of a region. The graphs were plotted using SQL in the paragraphs. In the scatter plots, the size of the points can be changed by dropping one of the variables into the size box available in the settings option, and groups can be generated interactively by dropping one of the variables available as keys into the groups box.

2. World Population

The next subject to be examined is the population growth in the world.


World Population

The component indicators chosen for population are the Total Population, the Population Growth and the Rural and Urban Populations. The total population is growing, except for a slight slump in the year 2003. The urban population is also growing, while the rural population is decreasing, partly because fewer resources are available and mostly because rural populations have been migrating to the cities in recent years. This presents a progressive picture of the world, since more urbanization is taking place. The urban and rural populations also show the population slump in the year 2003. The population growth graph, grouped by region, shows the Arab World growing at the fastest rate, while the population growth rate of all the other regions has been declining after having stayed fairly constant in the preceding years. The Euro region ranks last in the growth graph, indicating that its growth has been low. In fact, the population of Europe has been declining sharply, and had it not been for international immigration, the poor population growth rate would be a matter of concern.

For the analysis of the dependent indicators affecting this subject, we first find the correlation between the health expenditure values and the urban population values, to find out whether health expenditure positively affects the population in cities. As expected, the Pearson correlation coefficient computed using the MLlib package in Spark gives a high positive value of 0.8644. So the growth in the urban population can be explained by the health expenditures, as these expenses indicate the availability of life-saving health services. Surprisingly, the Arab World is not among the regions whose health expenses are above the mean for the recent years, yet it is still able to maintain a stable and healthy population growth rate. This may indicate that the region is not largely affected by diseases or other health-related problems. Europe, on the other hand, has health expenses above the mean for all of the recent years, indicating that the region relies on medical treatment to keep its population stable.

The death rates plotted in the subsequent paragraphs reveal that the Arab World has slightly declining death rates after remaining fairly constant in the initial years. The other regions are well above the Arab region in this graph, with Europe distinctly at the top, followed by the Asian and Pacific countries. The birth rates are in complete agreement with the analysis in the previous paragraphs and show the Arab World as having the highest birth rates, followed by the Asian and Pacific regions and lastly Europe. To investigate the cause of these differing birth rates, which play an important role in deciding the population growth rate of a region, we calculate the Pearson correlation coefficient between the birth rates and the adolescent fertility rate values of the regions. The resulting value of 0.7419 indicates that the birth rates depend to quite some extent on the fertility values, as is also shown in the graph that follows. The regions tend to form their own groups. Europe, which has the lowest fertility rate values, also has the lowest birth rates, and this may be one of the strongest causes of the slow population growth in the region. The Arab World, which has relatively high fertility rate values, also has high birth rate values, followed by the Asian and Pacific countries. The next graph presents the rate of increase, calculated as (birth rate - death rate). This graph coincides exactly with the population growth graph, confirming the results of the previous paragraphs.

Lastly, we examine the life expectancy values of the regions and try to correlate them with the availability of improved sanitation facilities. Here, Europe leads the other regions of the world, as it has the greatest values of both improved sanitation facilities and life expectancy, which is indicative of relatively healthier surroundings and consequently good life expectancy. In contrast, the Arab World has the smallest share of the sanitation facilities pie chart and also the lowest position in the life expectancy plot. The Asian and Pacific countries lie midway in both the pie chart and the scatter plot.
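
The notebook obtains its correlation coefficients from Spark's MLlib. The sketch below is only a plain Python restatement of the same statistic, the covariance of the two series divided by the product of their standard deviations, so that the values quoted above are easier to interpret. The function name `pearson` and the input numbers are placeholders, not World Bank data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder series: one rising together, one moving in opposite directions.
rising_together = pearson([1, 2, 3, 4], [2, 4, 6, 8])
moving_apart = pearson([1, 2, 3, 4], [8, 6, 4, 2])
```

A coefficient close to 1 means the two series rise together, a coefficient close to -1 means one falls as the other rises, and values such as 0.8644 or 0.7419 indicate a strong but not perfect linear relationship.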

3. Energy

The last subject to be examined in the notebook is the energy consumption in the world.

Energy Consumption

Here too we begin with the analysis of the component indicators and then move on to the dependent indicators. We first look at the net energy use patterns over the recent years and find that the energy used per capita peaked in the year 2006, slumped in the subsequent years and has remained fairly constant since then (although at a higher level than before). The other component indicators taken up for this subject are fossil fuel (non-renewable), renewable, and alternative and nuclear energy usage in recent years for the various regions of the world. The graphs drawn for these indicators show that the regions may be classified by their usage of the various forms of energy. The Arab World, with its vast reserves of oil and natural gas, leads the other regions in fossil fuel consumption, the Asian and Pacific regions have the greatest consumption of renewable resources, and Europe relies to a very great extent on alternative and nuclear energy. This is followed by a bar chart showing the disparities in the usage of the three forms of energy. The chart shows that fossil fuels are largely preferred over the other forms of energy, followed by renewables, with nuclear energy the least preferred.

The dependent indicator analysis for this subject begins with electric power consumption (in kWh per Capita), which is the main factor responsible for the energy demands of the various regions of the world. Europe consumes the most electricity, far more than any of the other regions. The Asian and Pacific countries initially had the lowest consumption values, but the region overtook the Arab World in the year 2009 to become the greater electricity consumer. One of the factors that may affect electricity consumption patterns is industrial employment: the more industries there are, the greater the demand for electricity. To measure the impact of industries, we find the regions that have above-average industrial employment. The Arab region, which appeared only once in the list of above-mean industrial employment, was also the smallest consumer of electricity, while Europe, with the highest (much above average) values of industrial employment, was the greatest consumer of electricity. The Asian and Pacific countries showed above-average industrial employment for the years 2003 - 2007 and, as is evident in the electric power consumption graph, the region also showed the fastest growth in power consumption for those years. The next factor to be examined is the Total Natural Resource Rent for the various regions. The Arab region has the greatest share of the pie charts for all of the years, with a growing share, while the Euro region has a chunk of the pie too small to be seen. This is well explained, because the Arab region has a surplus of resources due to lower industrial employment, while the other regions have more industrial activity.

Next, we measure the dependence of the energy use (per Capita) on the Gross National Income of a region and find a high correlation between the values (a correlation coefficient of 0.664), which is also shown in the plot of the two quantities. This indicates that the energy consumption of a region is facilitated by its income: the higher the income, the greater the capacity to spend on energy demands, resulting in increased consumption. Lastly, we try to find the impact of energy intensity on the consumption patterns. A scatter plot of the Total Natural Resource Rent against the energy intensity values shows that the quantities are slightly positively correlated, which can be explained as follows. Low energy intensity means that the processes that use the energy are inefficient, so with the actual demand remaining constant, the available energy is unable to meet the demand, resulting in an apparent increase in demand and a consequently low natural resource rent. If the energy intensity values are high, on the other hand, the energy utilization processes are efficient and the demand is easily met by the currently available energy, resulting in a high natural resource rent. Again, the Arab World is an exception, because the region has low industrial employment (less demand) and a huge surplus of natural resources (oil and natural gas), so even with a low energy intensity value it shows a high value of natural resource rent.
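
The "above mean" comparison used for industrial employment (and earlier for health expenses) can be sketched roughly as below. The sketch assumes that the mean is taken across regions within each year, which the text does not state explicitly. The function name `above_mean_regions`, the region labels and the numbers are all placeholders.

```python
def above_mean_regions(values_by_year):
    """Map each year to the sorted regions whose value exceeds that year's mean.

    Assumption: the mean is computed across regions, separately for each year.
    """
    result = {}
    for year, by_region in values_by_year.items():
        mean = sum(by_region.values()) / len(by_region)
        result[year] = sorted(r for r, v in by_region.items() if v > mean)
    return result


# Placeholder values, not World Bank data.
SAMPLE = {
    2003: {"Region A": 10.0, "Region B": 20.0, "Region C": 30.0},
    2004: {"Region A": 25.0, "Region B": 20.0, "Region C": 15.0},
}

flagged = above_mean_regions(SAMPLE)
```

Counting how often a region appears in `flagged` across the years gives the kind of tally described above, such as a region appearing only once in the above-mean list.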

Conclusion

This sums up the analysis on the various subjects of the world. We were able to see the interdependence of the various factors and how they affect the subject under consideration.

Apache Zeppelin

Apache Zeppelin saves much of the time involved in doing the analysis (generating graphs and plots), which would otherwise have to be done manually (perhaps using other libraries). The built-in visualizations quickly reveal many insights about the data. Further, the Helium package and HTML displays allow the visualization and exploration of results to be extended well beyond the defaults. Using Apache Spark across paragraphs lets us use the variables and results declared in one paragraph in all of the paragraphs that follow it. Spark and SQL can be used hand in hand to get graphical results as soon as an analysis is done and we want to visualize it. Finally, the results can be explained with formatted text using markdown.

Apache Zeppelin is a web-based notebook that enables interactive analysis of data. It is a very powerful tool that combines the analytics capabilities of many components of the Apache Big Data ecosystem and presents them in a notebook form that allows for beautiful visualizations of the results of the analysis performed with those components. It supports the typical data science workflow, which generally involves exploratory analysis of data, followed by model building, usage of algorithms, report generation and even real-time collaboration. With Zeppelin, one can take data from online sources, perform analysis and generate a report, all in a single notebook.


Zeppelin Welcome Page

Zeppelin was conceived at NFLabs, a South Korean big data analytics company, as an internal project, which was later open sourced and subsequently accepted into incubation at the Apache Incubator. Zeppelin has a well-formed community of developers with numerous contributions. After almost one and a half years of incubation, Zeppelin has graduated from the incubator and has been accepted as a top-level project at the Apache Software Foundation. To download Zeppelin, one can get its binary package at this link. To build it from source instead, clone the repository and build it using Maven.

The main platform of doing analysis in Zeppelin is a notebook.

Zeppelin Notebook

Zeppelin can be thought of as a keeper of notebooks: one can create, search, filter, delete, import and export notebooks. A notebook is easy to create, as one just has to provide a name for it. After a notebook has been created, one can perform analysis in it, in what is called a paragraph. A notebook comprises many paragraphs, where each paragraph may use a single back-end processing system such as Apache Spark. Paragraphs are flexible: they can be created, removed, re-ordered and changed at will. To get output, we run the code in a paragraph. This forms the basis of the data science workflow, where the initial paragraphs may be used to load the data and perform exploratory analysis and the later paragraphs may be used to visualize the results.

The code in the paragraphs is interpreted by the back-end systems that are currently available as interpreters for Zeppelin. Each notebook has an interpreter binding for certain interpreters. The binding can be changed from the 'interpreter binding' option, displayed as a 'settings' icon in the top right corner of a notebook. Interpreters can also be created easily from the interpreter menu, which has a create button for creating a new interpreter from the existing back-end systems. New interpreters can be written by extending the Interpreter class in the source code. Many interpreters are supported by default, such as Spark, Hive, Cassandra and others from the Apache Big Data ecosystem, along with interpreters for shell and for markdown, which is used for writing formatted text in the paragraphs. The most recent release of Zeppelin, version 0.5.6, supports Spark up to version 1.6.0; future versions will support newer releases of Spark. When a notebook is created or loaded, three contexts, namely SparkContext (sc), SQLContext (sqlContext) and ZeppelinContext (z), are automatically injected into the system. In a paragraph, they may be used by the names indicated in parentheses; the user does not need to create them. There is an interpreter option displayed at the top, and clicking on it displays the interpreter menu and the settings and configurations of the interpreters currently in use. A 'Configuration' option, also present at the top, displays various other configurations of Zeppelin, such as the local repository. Zeppelin also supports runtime jar dependency loading from the local file system or a Maven repository using the %dep interpreter, although this should happen before the Spark interpreters are used.

Zeppelin has the following display formats: text, HTML, table and the Angular display system. By default, all output from the language back-end in a paragraph is displayed as text. With the %html directive, Zeppelin treats the output as HTML, as in: print("%html <h1> Your text here </h1>"). The %table directive lets Zeppelin display the output as a table, provided the output itself is in table format, that is, rows separated by newlines and columns separated by tabs. The Angular display system treats the output as a view template of AngularJS. The output statement should start with the %angular directive for the Angular display system to work. Note that the display system is back-end independent. One can also bind/unbind and watch/unwatch variables. Another feature is the ability to create dynamic forms. With Zeppelin, one can dynamically create forms using back-end systems such as markdown, shell and Spark SQL. ZeppelinContext provides a form creation API for creating forms programmatically.
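
The %table format described above (columns separated by tabs, rows separated by newlines) can be sketched as a small formatting helper. This is a minimal illustration rather than part of Zeppelin's API. The function name `to_zeppelin_table` and the sample rows are hypothetical, and the sketch assumes that no cell contains a tab or newline character.

```python
def to_zeppelin_table(header, rows):
    """Build a %table payload: tab-separated columns, newline-separated rows.

    Assumes cells contain no tab or newline characters.
    """
    lines = ["\t".join(str(cell) for cell in header)]
    for row in rows:
        if len(row) != len(header):
            raise ValueError("row width does not match header")
        lines.append("\t".join(str(cell) for cell in row))
    return "%table " + "\n".join(lines)


# Placeholder rows, not World Bank data.
payload = to_zeppelin_table(
    ["region", "year"],
    [["Region A", 2014], ["Region B", 2015]],
)
```

Printing `payload` as the output of a paragraph should then let the table and chart views pick it up, since the display system only inspects the text that the back-end emits.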

The default visualization capabilities of Zeppelin include tables, bar charts, line charts, pie charts and scatter plots. Once results have been obtained from a back-end system, they may be converted into table format and displayed in any of the available visualizations. The visualizations themselves are interactive: they allow grouping the data by keys, applying operations such as max, min, sum, avg and count, and sizing the data points by a chosen variable.
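
As a rough picture of what grouping by a key and applying sum, avg, min, max or count amounts to, consider the sketch below. It is not Zeppelin's own implementation. The function name `aggregate`, the column names and the values are placeholders chosen for illustration.

```python
from collections import defaultdict


def aggregate(rows, key, value, op="sum"):
    """Group rows (dicts) by `key` and reduce the `value` column with `op`."""
    operations = {
        "sum": sum,
        "avg": lambda vals: sum(vals) / len(vals),
        "min": min,
        "max": max,
        "count": len,
    }
    if op not in operations:
        raise ValueError("unsupported operation: " + op)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    reduce_fn = operations[op]
    return {group: reduce_fn(vals) for group, vals in groups.items()}


# Placeholder rows, not World Bank data.
ROWS = [
    {"region": "Region A", "value": 1.0},
    {"region": "Region A", "value": 3.0},
    {"region": "Region B", "value": 5.0},
]

totals = aggregate(ROWS, "region", "value", "sum")
averages = aggregate(ROWS, "region", "value", "avg")
counts = aggregate(ROWS, "region", "value", "count")
```

Switching the operation on a chart corresponds to changing `op` here while keeping the same key and value columns.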

The default visualizations can be extended by another powerful collaborative feature of Zeppelin called Helium. Helium is a pluggable application tool that can be used to generate custom displays and to use flexible back-end code to drive output in the Zeppelin front-end. A Helium application comprises three components: the back-end application, the view part and the resources available to the application. To write a Helium application in Java, the user just needs to extend the Application class in the Helium package. In the application, the user can define the output behavior, whether or not it is based on input from the front-end. This forms the application component. Then there must be a view part, in the form of HTML, CSS or JavaScript, that decides what is displayed on the front-end. Finally, the running instance of Zeppelin has some resources, and the application is run on the front-end based on the resources available in the resource pool.

Various storage mechanisms are provided for the notebooks, the default being the notebooks folder inside the Zeppelin installation/build directory. Others include versioning the notebooks in a local Git repository and storing them in the Amazon S3 service. Notebooks can be viewed directly in the Zeppelin Hub viewer by just specifying the public URL of the notebook.

Finally, after all the analysis has been done, the user can generate a report of the work by opening the option displayed as 'default' and selecting 'report'. Doing so presents the notebook in a report format that can be viewed by collaborators.

Friday, 27 May 2016
