Friday 27 May 2016

The World

Introduction


The World is a Zeppelin notebook that performs analysis on the World Bank data sets. 

The World Bank publishes the data that it collects from countries and regions around the world so that developers can analyze it and look for interesting patterns. The analysis in this notebook first downloads the data that will be used and then loads it into Zeppelin. Subsequent paragraphs perform the analysis and display the results and visualizations. The formatted text in the paragraphs is written in markdown. For a paragraph to be rendered as markdown, its first line should start with %md.

 

Choosing the Data Sets


The World Bank provides a large amount of data that can be used for analysis. Since this notebook investigates only a few of the most important factors that affect the world, only a few of the data sets need to be considered, and the ones chosen must be relevant to the analysis. Data is available for numerous topics such as arable land percentage, employment rates, literacy rates, etc. These are referred to as indicators. The data may be filtered by country, time span, income level, format (XML or JSON) and records per page.

Three main subjects or domains have been taken up for analysis, namely Growth, Population and Energy. To extract the indicators relevant to a particular subject from the entire raw list of indicators, we perform domain extraction: we go over the list and choose the indicators that may affect the subject. For example, population growth is affected by birth rates, death rates and life expectancies. This is repeated until the entire list has been exhausted. In the end, the indicators for a particular subject fall into two sets. The first set comprises the subject itself and gives a picture of how the subject has been performing over the recent years (for example, GDP for growth). The second set contains the indicators that affect the subject and caused the changes visible in the first set. These sets are referred to as the component and the dependent indicators respectively.

In each subject analysis, the initial paragraph fetches the data set from the World Bank site using the shell interpreter, and the next paragraph loads the data as a DataFrame. To use the shell interpreter, the first line of the paragraph should be %sh.
The sqlContext that is available by default in Zeppelin loads the JSON files directly (after the first lines have been removed) and infers the schema of the records in the files, so that the fields of a record can be accessed through the DataFrame object of that dataset. This simplifies loading the datasets and extracting the needed data from them. After the analysis for a particular subject is done, the tables are dropped using the 'dropTempTable' function of sqlContext.
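The clean-up step before loading can be sketched in plain Python, outside Spark. This is only an illustration, not the notebook's code: it assumes the downloaded file is a two-element JSON array (paging metadata followed by the list of records), and the field names, the function name `flatten_records` and the sample values are all assumptions made for the example.

```python
import json

# Invented sample in the assumed two-element layout:
# [paging metadata, list of records]. All values are made up.
raw = """[
  {"page": 1, "pages": 1, "per_page": "50", "total": 2},
  [
    {"indicator": {"id": "SP.POP.TOTL", "value": "Population, total"},
     "country": {"id": "XX", "value": "Region A"},
     "value": "1000", "date": "2015"},
    {"indicator": {"id": "SP.POP.TOTL", "value": "Population, total"},
     "country": {"id": "XX", "value": "Region A"},
     "value": null, "date": "2014"}
  ]
]"""


def flatten_records(text):
    """Drop the leading metadata element and flatten each record into a row."""
    _metadata, records = json.loads(text)
    rows = []
    for rec in records:
        if rec["value"] is None:  # skip years without an observation
            continue
        rows.append({
            "region": rec["country"]["value"],
            "indicator": rec["indicator"]["id"],
            "year": int(rec["date"]),
            "value": float(rec["value"]),
        })
    return rows


rows = flatten_records(raw)
print(rows)
```

Flat rows of this kind are what a schema-inferring loader can turn into columns; in the notebook itself that work is done by sqlContext.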

 

Analysis using Apache Spark

 

1.  Growth in the World


The first subject that is considered is the growth that has taken place in the world. 

Nothing is more indicative of the growth of a region than its Gross Domestic Product (GDP). Five kinds of GDP data are available on the World Bank site: Growth GDP, GDP per Unit of Energy Use, GDP per Capita, GDP per Person Employed and GDP at Market Prices. It is useful to first look at how these GDP measures have varied over the recent years. To use the default visualization capabilities of Zeppelin, we register each DataFrame as a temporary table using the 'registerTempTable' function of the DataFrame object. This allows the data set to be accessed as a table in the next paragraph using SQL. The first line of that paragraph should be %sql for SQL queries to run in it.

All of the GDP measures present a positive picture of the world: all have been growing except for a few slumps, while the Growth GDP has remained fairly constant. The Euro region has the lowest Growth GDP overall because of its small population, so it is unable to generate as much GDP as the Asian and Pacific region, which has the largest population. On the other hand, the Euro region has the greatest GDP per Capita because the Growth GDP gets divided by the small population. In contrast, the Asian and Pacific regions have the smallest GDP per Capita because their population is so large.

Based on a research paper published on the Internet by the Symbiosis Institute of Management, Pune, which highlights the factors that may affect the GDP of a region, the following dependent indicators were chosen for analysis: Exports, Foreign Direct Investment (FDI), Employment to Population ratio and the CPIA (Country Policy and Institutional Assessment) economic management cluster average. Exports showed a slight positive correlation with the Growth GDP, while the FDI values showed a negative correlation, which means that the Growth GDP of a region is inversely related to its FDI values. The Employment to Population ratio showed a very high correlation with the GDP per Person Employed values, and the regions of the world tend to form their own clusters when GDP per Person Employed is plotted against the Employment to Population ratio. Finally, the CPIA values also showed a positive correlation with the Growth GDP values, indicating that these values are also important in determining the GDP of a region.

The graphs were plotted using SQL in the paragraphs. In the scatter plots, the size of the points can be changed by dropping one of the variables into the size box in the scatter plot settings, and groups can be generated interactively by dropping one of the variables available as keys into the groups box.

 

2. World Population

 

The next subject to be examined is the population growth in the world. 

The component indicators chosen for population are the Total Population, Population Growth and the Rural and Urban Populations. The total population is growing, except for a slight slump in the year 2003. The urban population is also growing, while the rural population is decreasing, partly because of lower availability of resources and mostly because rural populations have been migrating to the cities in recent years. This presents a progressive picture of the world, since more urbanization is taking place. The urban and rural populations also show the population slump in the year 2003.

The population growth graph, grouped by region, shows the Arab World growing at the fastest rate, while the population growth rate of all other regions has started to decline after staying constant for most of the recent years. The Euro region ranks last in the growth graph, indicating that growth has been low. In fact, the population of Europe has been declining sharply, and had it not been for international immigration, the poor population growth rate would be a matter of concern.

For the analysis of the dependent indicators affecting this subject, we first find the correlation between the health expenditure values and the urban population values, to find out whether health expenditure positively affects the population in cities. As expected, the Pearson correlation coefficient computed using the MLlib package in Spark gives a high positive value of 0.8644. So the growth in the urban population can be explained by the health expenditures, as these expenses indicate the availability of life-saving health services.

Surprisingly, the Arab World is not among the regions whose health expenses are above the mean for the recent years, and it is still able to maintain a stable and healthy population growth rate. This may indicate that the region is not heavily affected by diseases or other health problems. The Europe region, on the other hand, has health expenses above the mean for all of the recent years, indicating that the region relies on medical treatment to keep its population stable. The death rates plotted in the subsequent paragraphs also reveal that the Arab World has slightly declining death rates after remaining fairly constant in the initial years. The other regions are well above the Arab region in this graph, with Europe distinctly at the top, followed by the Asian and Pacific countries.

The birth rates for the regions agree with the analysis in the previous paragraphs and depict the Arab World as having the highest birth rates, followed by the Asian and Pacific regions and lastly Europe. To investigate the cause of these differing birth rates, which play an important role in deciding the population growth rate of a region, we calculate the Pearson correlation coefficient between the birth rates and the adolescent fertility rate values of the regions. The resulting value of 0.7419 indicates that the birth rates depend to a considerable extent on the fertility values, as is also shown in the graph that follows. The regions tend to form their own groups: Europe, which has the lowest fertility rate values, also has the lowest birth rates, and this may be one of the strongest causes of the slow population growth in the region. The Arab World, which has relatively high fertility rate values, also has high birth rate values, followed by the Asian and Pacific region countries.

The next graph presents the increase rate values, calculated as (birth rate - death rate). This graph coincides exactly with the population growth graph, confirming the results of the previous paragraphs. Lastly, we examine the life expectancy values of the regions and try to correlate them with the availability of improved sanitation facilities. Here, Europe leads the other regions of the world, with the greatest values for both improved sanitation facilities and life expectancy, which indicates relatively healthier surroundings and consequently good life expectancy. In contrast, the Arab World has the smallest share of the living surroundings pie chart and also the lowest position in the life expectancy plot. The Asian and Pacific countries lie midway in both the pie chart and the scatter plot.
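The two calculations used in this section, the Pearson correlation coefficient and the increase rate (birth rate - death rate), can be sketched in plain Python. This is a minimal illustration and not the notebook's code, which uses Spark's MLlib package; the function names and the sample numbers are invented for the example and do not reproduce the 0.8644 or 0.7419 values reported above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def natural_increase(birth_rates, death_rates):
    """Increase rate per year, computed as birth rate minus death rate."""
    return [b - d for b, d in zip(birth_rates, death_rates)]


# Made-up yearly values for one hypothetical region
birth = [30.0, 28.0, 26.0, 24.0]
fertility = [60.0, 56.0, 52.0, 48.0]
death = [10.0, 10.0, 9.0, 9.0]

r = pearson(birth, fertility)
increase = natural_increase(birth, death)
print(r, increase)
```

A coefficient near 1 means the two series rise and fall together, which is how values such as those reported above are read. The coefficient says nothing about which quantity drives the other.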


3. Energy

 

The last subject to be examined in the notebook is the energy consumption in the world.
Here also we begin with the component indicator analysis and then move on to the dependent indicators. We first look at the net energy use patterns for the recent years and find that the energy used per capita peaked in the year 2006, slumped in the subsequent years and has remained fairly constant since then (although at a higher level than before the peak). Other component indicators taken up for this subject are fossil fuel (non-renewable), renewable, and alternative and nuclear energy usage in the recent years for the various regions of the world. The graphs drawn for these indicators show that the regions may be classified on the basis of their usage of the various forms of energy. The Arab World, with its vast reserves of oil and natural gas, leads the other regions in fossil fuel consumption; the Asian and Pacific regions have the greatest renewable resource consumption; and Europe relies to a very great extent on alternative and nuclear energy. This is followed by a bar chart showing the disparities in the usage of the three forms of energy. The chart shows that fossil fuels are largely preferred over the other forms of energy, followed by renewables, with nuclear energy the least preferred.


The dependent indicator analysis for this subject begins with electric power consumption (in kWh per capita), which is the main factor responsible for the energy demands of the various regions of the world. Europe consumes the most electricity, far more than all of the other regions. The Asian and Pacific countries initially had the lowest consumption values, but the region overtook the Arab World in the year 2009 to become the greater electricity consumer.

One of the factors that may affect electricity consumption patterns is industrial employment: the more industries there are, the greater the demand for electricity. To measure the impact of industries, we find the regions that have above-average industrial employment. The Arab region, which appeared only once in the list of above-mean industrial employment, was also the smallest consumer of electricity, while the Europe region, with the highest (much above average) values of industrial employment, was the greatest consumer of electricity. The Asian and Pacific countries showed above-average industrial employment for the years 2003 - 2007 and, as is evident in the electric power consumption graph, the region also showed the fastest growth in power consumption for those years.

The next factor to be examined is the Total Natural Resource Rent for the various regions. The Arab region has the greatest share of the pie charts for all of the years, with a growing share, while the Euro region has a chunk of the pie too small to be seen. This is explained by the Arab region having a surplus of resources due to lower industrial employment, while the other regions have more industrial activity.

Next, we measure the dependence of the energy use (per capita) on the Gross National Income of a region and find a high correlation between the values (a correlation coefficient of 0.664), which is also depicted in the plot of the two quantities. This indicates that the energy consumption of a region is facilitated by its income: the greater the income, the greater the capacity to spend on energy demands, resulting in increased consumption.

Lastly, we try to find the impact of energy intensities on the consumption patterns. A scatter plot of the Total Natural Resource Rent against the energy intensity values shows that the quantities are slightly positively correlated, which can be explained as follows. Low energy intensity means that the processes that utilize the energy are inefficient, so with the actual demand remaining constant, the available energy is unable to meet the demand, resulting in an apparent increase in demand and a consequently low natural resource rent. If the energy intensity values are high, on the other hand, the energy utilization processes are efficient and the demand is easily fulfilled by the currently available energy, resulting in a high natural resource rent. Again, the Arab World is an exception, because the region has low industrial employment (less demand) and a huge surplus of natural resources (oil and natural gas), and so even with a low energy intensity value it shows a high value of natural resource rent.
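The list of regions with above-mean industrial employment, used earlier in this section, can be sketched in plain Python as follows. This is only an illustration of the idea, not the notebook's code: the function name, the region labels and the numbers are invented, and it assumes the mean is taken across regions within each year, which the text does not state.

```python
from collections import Counter
from statistics import mean


def above_mean_counts(table):
    """Count, per region, the years in which its value exceeds that year's mean.

    `table` maps a year to a dict of {region: value}.
    """
    counts = Counter()
    for _year, by_region in table.items():
        threshold = mean(by_region.values())  # mean across regions for this year
        for region, value in by_region.items():
            if value > threshold:
                counts[region] += 1
    return counts


# Made-up industrial employment shares (percent) for three hypothetical regions
sample = {
    2005: {"A": 20.0, "B": 25.0, "C": 30.0},  # mean 25.0 -> only C is above
    2006: {"A": 28.0, "B": 22.0, "C": 31.0},  # mean 27.0 -> A and C are above
}

counts = above_mean_counts(sample)
print(counts)
```

A region that rarely appears in such a count can then be compared against its position in the electric power consumption graph, as is done for the Arab region above.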


Conclusion


This sums up the analysis on the various subjects of the world. We were able to see the interdependence of the various factors and how they affect the subject under consideration.
Apache Zeppelin reduces a lot of the time involved in the analysis (generating graphs and plots), which would otherwise have to be done manually (perhaps by using other libraries). The custom visualizations quickly yield many insights about the data. Further, the Helium package and HTML displays allow the visualization and exploration of results to be extended almost without limit. Using Apache Spark across paragraphs lets us use the variables and results declared in one paragraph in all of the paragraphs that follow it. Spark and SQL can be used hand in hand to get graphical results right after an analysis has been done and we want to visualize it. Finally, the results can be explained with formatted text using markdown.
