Thursday, 18 August 2016

SnapBook

Introduction

 

The notebook 'SnapBook' performs analysis on the SNAP (Stanford Network Analysis Project) datasets. These datasets are made publicly available by Stanford University's SNAP website and can be freely used for analysis. They are mainly graph datasets and generally contain files for the edges and the nodes of the graphs. A few datasets also contain files describing the social network circles of the current user. The analysis in the notebook emphasizes the social network data, mainly because it offers more scope for analysis. As in the other notebooks, the first few paragraphs download the datasets and load them for analysis; thereafter, analysis is performed using Apache Spark and the visualization capabilities of Apache Zeppelin along with the D3 libraries.



Getting the Datasets

 

The SNAP website lists datasets that contain graph data for various websites such as Facebook, Google Plus and Twitter, as well as graph data for web pages between Berkeley and Stanford. All the datasets are of nearly the same type, although they belong to different domains of information. For this notebook, the social network datasets were specifically chosen. The datasets from the site contain the following files: the edges file, the features for each of the users, the names of the features, and the circles for the users. These files have been used in conjunction to extract information and perform analysis.


 

Analysis using Apache Spark

 

The Social Network

 

Social network datasets have gained popularity in recent years as the scope for research has grown with their magnitude. More articles and research papers are now being published as a consequence of the availability of the data. In the notebook, we focus primarily on analyzing the circles that users of the social network tend to form amongst themselves, as well as the communities that are present in the network. One of the papers referenced for the analysis in the notebook describes the terms used to interpret the datasets. The user currently being analyzed is referred to as the 'ego' user. The 'ego' user's network is represented by the edges file for that user, and the members of the user's circles are represented by the circles file. The feature names for the users are in a separate file. The features of the 'ego' user best describe that user's interests. Apart from that, the circles that have the largest number of friends for this 'ego' user also share some common features, which is why those circles are more prominent than others. So we also analyze the largest and the second largest circles: both have users whose features also appear in the feature list of the 'ego' user. Further, many users appear in multiple circles for this user. On analyzing the features of those users as well, we find that their feature labels still overlap with the features of the 'ego' user, which strongly indicates that this user is more likely to make friends with people who repeatedly exhibit those features. A sketch of such a circle analysis is given below.
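Below is a minimal sketch of how such a circle analysis could look in the Spark shell, following the SNAP ego-network file layout (<ego>.edges, <ego>.circles, <ego>.egofeat, <ego>.feat); the paths and parsing details are assumptions, not code taken from the notebook.

```scala
// Hypothetical paths following SNAP's ego-network layout; the notebook's
// actual file handling may differ.
val egoId = "0"

// circles file: one circle per line, e.g. "circle0<TAB>71<TAB>215 ..."
val circles = sc.textFile(s"facebook/$egoId.circles")
  .map(_.split("\\s+"))
  .map(parts => (parts.head, parts.tail.toSet))

// Ego user's own binary feature vector -> set of feature indices.
val egoFeats = sc.textFile(s"facebook/$egoId.egofeat").first()
  .split(" ").zipWithIndex.collect { case ("1", i) => i }.toSet

// Alter features: "nodeId f0 f1 ..." -> node -> set of feature indices.
val feats = sc.textFile(s"facebook/$egoId.feat")
  .map(_.split(" "))
  .map(p => (p.head, p.tail.zipWithIndex.collect { case ("1", i) => i }.toSet))
  .collectAsMap()

// The two largest circles, and their feature overlap with the ego user.
circles.sortBy(-_._2.size).take(2).foreach { case (name, members) =>
  val shared = members.flatMap(feats.get).flatten intersect egoFeats
  println(s"$name: ${members.size} members, ${shared.size} features shared with ego")
}
```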

In the succeeding sections of the notebook, we attempt to find communities using an algorithm that depends on the 'conductance' values in the graph, a parameter inherently exhibited by graphs. The paper mentioned here best explains the 'conductance' parameter: consider a partition of the original graph; this partition will have some cut edges and some internal edges, and the ratio of the cut edges to the internal edges is the conductance value of that partition. Our goal is to find partitions of the original graph such that the conductance value of each obtained partition is minimized. The algorithm, as inspired by the paper, was first written in rough, then run on the data in the notebook and tested visually; the outcomes clearly show the separation of communities in the graph. The dark blue circles represent the cores of the communities and the light blue circles represent their borders. In the algorithm itself they had been represented as black and gray nodes respectively: the algorithm traversed the graph breadth-first using a queue data structure and colored the nodes as it encountered them, with gray nodes marking the periphery of the communities and black nodes marking those placed in the core. The conductance measure is sketched below.
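For concreteness, here is a small self-contained sketch of the conductance measure as defined above (cut edges over internal edges); the algorithm in the notebook evaluates this per candidate partition, but its exact code is not reproduced here.

```scala
// Conductance as defined above, for an undirected edge list and a candidate
// partition given as a set of node ids. Names are illustrative.
def conductance(edges: Seq[(Long, Long)], partition: Set[Long]): Double = {
  // Cut edges: exactly one endpoint inside the partition.
  val cut = edges.count { case (u, v) => partition.contains(u) ^ partition.contains(v) }
  // Internal edges: both endpoints inside the partition.
  val internal = edges.count { case (u, v) => partition.contains(u) && partition.contains(v) }
  if (internal == 0) Double.PositiveInfinity
  else cut.toDouble / internal
}

// Example: a triangle {1,2,3} connected to node 4 by a single edge.
val edges = Seq((1L, 2L), (2L, 3L), (1L, 3L), (3L, 4L))
println(conductance(edges, Set(1L, 2L, 3L)))  // 1 cut edge / 3 internal edges ≈ 0.33
```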


Conclusion

 

The notebook demonstrated how conveniently Zeppelin can be used to integrate all aspects of analysis that can be performed on graph-type datasets, along with visualizations that follow the analysis as visual proof. The graph visualizations were done using the D3 libraries, and the data for the graph-producing code was generated in Zeppelin after the analysis was done. Reordering of paragraphs gives the flexibility to first perform analysis in rough, on a testing basis, before finalizing the results. After confirmed results have been obtained, they can be neatly described using 'markdown' and displayed to the user.

Monday, 8 August 2016

WorldWideWeb

Introduction

 

The WorldWideWeb notebook performs analysis on the web crawl data that is collected and managed by the CommonCrawl organization. Web crawl data has traditionally been collected and used only by commercial companies, but with increasing awareness about the importance of data, especially crawl data, someone needs to do the task of crawling and providing the data to those who can use it, and CommonCrawl does just that.
Moreover, the data is provided for free and in an organized form in three formats: WARC (raw web archive), WAT (extracted metadata) and WET (extracted plain text). Of these, the WARC and WET formats have been used for analysis in the notebook. With the change in the type and nature of the datasets, the layout of the notebook has changed from the previous ones: it consists of seven sections, each doing a particular type of analysis. For the ingestion and analysis of the WARC format, the warcbase library has been used extensively. Due to the scale of the datasets, the analysis could not be accommodated on a local machine, so it was all done on an m4.xlarge instance on Amazon EC2 running Spark inside Apache Zeppelin. The first few paragraphs download the datasets and get them into RDD form using the warcbase 'loadArchives()' function, which loads the web archives directly from disk (an EBS volume); a minimal loading sketch is shown below. The subsequent paragraphs perform the required analysis on the datasets.
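A minimal loading sketch, modeled on the warcbase project's published examples; the segment path is a placeholder for wherever the downloaded segments live.

```scala
import org.warcbase.spark.matchbox.RecordLoader
import org.warcbase.spark.rdd.RecordRDD._

// Load one downloaded CommonCrawl segment from local disk (EBS volume).
val pages = RecordLoader.loadArchives("/data/commoncrawl/segment-00000.warc.gz", sc)
  .keepValidPages()  // keep only HTML responses with a 200 status

println(pages.count())  // number of usable pages in this segment
```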


Getting the Data Sets

 

Web crawl data is huge and intimidating; considering only the data for May, the entire crawl content is divided into some 24500 segments, each around 1GB in size, which made it impossible to even peep into the data using any application program on my local machine. Therefore, it was decided to run the Zeppelin instance on an EC2 instance to perform the analysis. This required the installation of Spark and Zeppelin on the m4.xlarge machine, which had 4 virtual CPUs and 16GB of RAM. Initially, the datasets were loaded with no hassles, but then I hit some 'OutOfMemory' errors, which are described below. After the root cause of these issues was found, the rest of the notebook was developed keeping them in mind. Also, not all segments for the month of May could be accommodated, so only a few have been used.


Analysis of Data using Apache Spark

 

1. Domain Frequencies

 

The first section of the notebook examines the domain frequencies, that is, how frequently a domain occurs in the web crawl data. The dataset was divided into two parts, one for the beginning and one for the ending segments of May, and the domain frequencies were deduced accordingly. The beginning segments had 'fangraphs.com' as the top domain, followed by 'osnews.com' and 'google.com', while the ending segments had 'economist.com' as the top domain. Since advertising is the main source of revenue on the internet, the domains that occur more frequently have a better chance of being visited by web users than those with lower frequencies. Comparing domain frequency values also gives one a chance to compare the relative positions of domains on the web: two domains that present the same content may hold different positions in terms of their domain frequencies, and this may lead to a difference in the amount of web traffic drawn to a particular domain. The counting step is sketched below.
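The counting itself takes only a few lines, following the standard warcbase domain-frequency example; the input path is again a placeholder.

```scala
import org.warcbase.spark.matchbox.{ExtractDomain, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

val topDomains = RecordLoader.loadArchives("/data/commoncrawl/*.warc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))  // reduce each page to its domain
  .countItems()                        // warcbase helper: count, sorted descending
  .take(10)

topDomains.foreach(println)            // e.g. (fangraphs.com, <count>), ...
```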



2. Analysis of Site Link Structure

 

The next section deals with the analysis of the site link structure. We know from the previous section that certain websites have higher domain frequencies than the rest. Here, we use sankey diagrams to examine which websites link to these popular domains, so that they may benefit from the traffic that goes to the most popular domains. For this purpose, we first extract the links from the crawl data using the 'ExtractLinks' class of the warcbase library. This proved to be a tough task, as the operation repeatedly threw 'java.lang.OutOfMemoryError' until finally, after tuning the Spark engine and finding and eliminating the root cause, I was able to proceed. This led to the following Spark configuration: the driver using 16GB of memory, the G1 garbage collector, compressed Oops, and the fraction of memory used for persisting reduced to 0.1. Apart from this, the primary cause of the error was a burst of data: the original RDD loaded with the warcbase library contained exactly 53307 records, and when the 'flatMap()' operation was used to extract the links from that many records (web pages), the driver could not hold the amount of data. So a simple solution was to filter the domains right after the dataset was loaded, to include only the ones being considered; a sketch follows below. From the sankey diagrams, it was found that a few other sports sites had their links mentioned on 'fangraphs.com', and other technology sites had theirs on 'osnews.com'. Google had links only to its own domains, with the maximum being to 'scholar.google.com'.
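Here is a hedged sketch of the link extraction with the domain filter applied up front, based on warcbase's documented 'ExtractLinks' usage; the domain list and path are placeholders, and the launch-time tuning described above is shown only as comments.

```scala
// Launch-time tuning used to survive the flatMap burst (as described above), e.g.:
//   --driver-memory 16g
//   --conf spark.driver.extraJavaOptions="-XX:+UseG1GC -XX:+UseCompressedOops"
//   --conf spark.storage.memoryFraction=0.1
import org.warcbase.spark.matchbox.{ExtractDomain, ExtractLinks, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

val wanted = Set("fangraphs.com", "osnews.com", "google.com")

val linkPairs = RecordLoader.loadArchives("/data/commoncrawl/*.warc.gz", sc)
  .keepValidPages()
  // Filter early so the flatMap below does not flood the driver with links.
  .filter(r => wanted.contains(ExtractDomain(r.getUrl)))
  .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
  .map { case (src, dst, _) => (ExtractDomain(src), ExtractDomain(dst)) }
  .filter { case (s, d) => s != d }  // drop self-links for the sankey view

linkPairs.countItems().take(20).foreach(println)
```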




3. Analysis of Web Graph Structure


In this section, we take a look at the web graph structure, extracted with the 'ExtractGraph' class of the warcbase library, which in turn uses the GraphX library of Spark. The extraction of the data was followed by visualization using the D3 libraries. The visualization represents the web pages (domains) as nodes and the links between them as edges of the graph. Nodes with higher PageRank values are drawn larger than the rest; we get to see that PageRank values are an indicator of the relative importance of the web pages. The other pages that link to the larger domains tend to stick close to them, visibly forming communities on the web. Edges between the nodes indicate the strength of the links. One way to test this is to pin two domains, preferably larger ones, far apart from each other and then release them: the nodes will appear to be pulled towards each other by a force proportional to the strength of the links. A slide bar at the top of the visualization allows us to control the number of web pages being displayed. Authorities and hubs are also evident in the visualization.
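As a rough reconstruction of what the extraction boils down to, the sketch below builds a GraphX graph from the (source, target) domain pairs of the previous section and computes PageRank; it is illustrative, not the notebook's exact 'ExtractGraph' invocation.

```scala
import org.apache.spark.graphx.Graph

// Assign a numeric id to every domain seen as a source or target.
val ids = linkPairs.flatMap { case (s, d) => Seq(s, d) }.distinct().zipWithUniqueId()
val idMap = ids.collectAsMap()

val edges = linkPairs.map { case (s, d) => (idMap(s), idMap(d)) }
val graph = Graph.fromEdgeTuples(edges, defaultValue = 1)

// PageRank with a convergence tolerance; smaller values converge more precisely.
val ranks = graph.pageRank(0.001).vertices

// Top domains by rank, used to size the D3 nodes.
val byId = ids.map(_.swap)
byId.join(ranks).values.sortBy(-_._2).take(10).foreach(println)
```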




4. Impact Measurement of Google Analytics


The measurement of the impact of Google Analytics in this section was inspired by Stephen Merity's work on the topic. Google Analytics is the most preferred solution for measuring web traffic to a particular site, and most websites deploy it. Two fundamental questions need to be answered with respect to its usage: first, how many websites is it deployed on, and second, what percentage of the browsing history of a user is leaked to Google through this analytics feature. As measured by the operations on the web page RDD, nearly 54% of the pages have Google Analytics enabled, which indicates that more than half of the websites use the feature. Further, to measure the leaked browsing history, we calculate the total number of links in the web graph obtained in the previous section and the number of links that have analytics enabled on either of the pages forming the link. The ratio of these two quantities is the leaked browsing history. In the section presented in the notebook, this value turns out to be a hundred percent, which is due to the low scale of data being considered, the most the EC2 instance can afford. The output of the algorithm changes and increases in accuracy with an increasing amount of data. A sketch of both measurements follows.
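A hedged sketch of the two measurements, reusing the 'pages' and 'linkPairs' RDDs from the earlier sketches; detecting Google Analytics by searching the page content for the tracker snippet is an assumption about the method, and the notebook may match differently.

```scala
import org.warcbase.spark.matchbox.ExtractDomain

// Assumed detection heuristic: look for the classic tracker strings.
def hasAnalytics(content: String): Boolean =
  content.contains("google-analytics.com/ga.js") || content.contains("ga('create'")

// 1. Fraction of pages with Google Analytics deployed.
val flagged = pages.map(r => (ExtractDomain(r.getUrl), hasAnalytics(r.getContentString)))
val deployedShare = flagged.filter(_._2).count().toDouble / flagged.count()

// 2. Leaked browsing history: links where either endpoint runs Analytics.
val gaDomains = flagged.filter(_._2).map(_._1).distinct().collect().toSet
val leaked = linkPairs.filter { case (s, d) =>
  gaDomains.contains(s) || gaDomains.contains(d)
}.count()
val leakedShare = leaked.toDouble / linkPairs.count()

println(f"deployed: ${deployedShare * 100}%.1f%%, leaked: ${leakedShare * 100}%.1f%%")
```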



5. Analysis of the Context of Locations


The WET data for the May segments contains all the extracted text, which can be used to find the locations present in the data and then to analyze the context in which those locations were mentioned. To get the list of locations, we first manually form a sufficiently good list of locations in a file called places.txt and then compare the RDD of text data with that list using the intersect operator. This gives the list of locations. Then, to analyze the context in which those locations were mentioned, we extract the words mentioned closest to the locations and store them in a map for easy retrieval. Finally, this is presented in the form of a search interface wherein users can enter a location and get the context in which it appears; a sketch follows.
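A minimal sketch of the location and context extraction, assuming the WET text is available line by line and places.txt holds one location per line; the two-word context window is an illustrative choice, not necessarily the notebook's.

```scala
val places = sc.textFile("places.txt").map(_.trim.toLowerCase).collect().toSet

val contexts = sc.textFile("/data/commoncrawl/*.wet")
  .flatMap { line =>
    val words = line.toLowerCase.split("\\W+").filter(_.nonEmpty)
    words.zipWithIndex.collect {
      case (w, i) if places.contains(w) =>
        // Keep up to two words on either side as the "context" of the mention.
        val ctx = words.slice(math.max(0, i - 2), math.min(words.length, i + 3))
        (w, ctx.mkString(" "))
    }
  }
  .groupByKey()          // location -> all contexts in which it was mentioned
  .mapValues(_.take(5))  // cap per-location contexts for display

contexts.lookup("paris").foreach(println)  // backing call for the search interface
```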



6. Mapping Entities onto Wikipedia Concepts

 

This section was inspired by the work of Chris Hans, who has explained how to map entities to external ontologies like Wikipedia to derive their contexts. The first operation involved is the extraction of entities from the raw WARC data. For this purpose, the Stanford Natural Language Processing packages have been used, which extract the entities from the WARC data. To be able to map the entities to concepts, it is important that the entities be extracted only from Wikipedia URL strings, so that we have the Wikipedia address of each particular entity. After all the entities and their corresponding Wikipedia concepts have been mapped, we calculate the total number of links for an entity and the count of each individual link for that entity. The ratio of these quantities, for each link of each entity, is the probability of that concept representing that entity; this computation is sketched below. Finally, all the entities and their concepts are displayed in the form of an indented tree for ease of viewing.
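The probability computation alone can be sketched as follows (the NER step is omitted); the (entity, concept URL) pairs are assumed to come from the NLP extraction described above, and the literal rows here are placeholders.

```scala
import org.apache.spark.rdd.RDD

// Placeholder (entity, concept URL) pairs standing in for the NER output.
val entityLinks: RDD[(String, String)] = sc.parallelize(Seq(
  ("Paris", "https://en.wikipedia.org/wiki/Paris"),
  ("Paris", "https://en.wikipedia.org/wiki/Paris"),
  ("Paris", "https://en.wikipedia.org/wiki/Paris,_Texas")
))

// Count links per (entity, concept) and total links per entity.
val perConcept = entityLinks.map { case (e, url) => ((e, url), 1L) }.reduceByKey(_ + _)
val perEntity  = entityLinks.map { case (e, _) => (e, 1L) }.reduceByKey(_ + _)

// Probability of a concept representing an entity = its share of the entity's links.
val probabilities = perConcept
  .map { case ((e, url), n) => (e, (url, n)) }
  .join(perEntity)
  .map { case (e, ((url, n), total)) => (e, url, n.toDouble / total) }

probabilities.collect().foreach(println)  // ("Paris", ".../Paris", 0.666...), ...
```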




7. Search Engine

 

The last section of the notebook presents a search engine built using Apache Lucene. Using the warcbase library, we first bring the raw data into tuples of (URL, page content) pairs, both as strings. These are then used to build a search index on disk using the indexing classes of Apache Lucene. Once the index is generated, the 'IndexSearcher' class can be used to repeatedly search the index for terms entered into the system. This feature is finally presented as an interface in the last paragraph, where the user may enter search queries and get results back from the search index. Naturally, the results returned come only from the pages that form the on-disk index; an indexing and search sketch follows.
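A hedged sketch of the indexing and search flow with the Lucene 5.x-era API; the docs sequence stands in for the (URL, content) tuples produced via warcbase, and the index path is a placeholder.

```scala
import java.nio.file.Paths
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document.{Document, Field, StringField, TextField}
import org.apache.lucene.index.{DirectoryReader, IndexWriter, IndexWriterConfig}
import org.apache.lucene.queryparser.classic.QueryParser
import org.apache.lucene.search.IndexSearcher
import org.apache.lucene.store.FSDirectory

val dir = FSDirectory.open(Paths.get("/data/lucene-index"))
val writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))

// Placeholder rows standing in for the (URL, content) tuples from warcbase.
val docs: Seq[(String, String)] = Seq(("http://example.com", "example page text"))
docs.foreach { case (url, content) =>
  val doc = new Document()
  doc.add(new StringField("url", url, Field.Store.YES))       // stored, not tokenized
  doc.add(new TextField("content", content, Field.Store.NO))  // tokenized for search
  writer.addDocument(doc)
}
writer.close()

// Query side: parse a user query against the "content" field.
val searcher = new IndexSearcher(DirectoryReader.open(dir))
val query = new QueryParser("content", new StandardAnalyzer()).parse("example")
searcher.search(query, 10).scoreDocs.foreach { hit =>
  println(searcher.doc(hit.doc).get("url"))
}
```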


 

Conclusion

 

This notebook presented the possibilities of analysis with the CommonCrawl datasets, which are likely to become more popular with time and the increasing demand for web crawl data. More importantly, the analysis and visualization of the data in Zeppelin was remarkably easy and intuitive; without Zeppelin, many more packages might have been needed to analyze data of this type. Visualizations can follow right after the paragraph analyzing the data, which makes it easy to relate the context in which each visualization is presented. Explanations can be given alongside the diagrams so the viewer is able to understand them. Furthermore, multiple backend engines and libraries can be combined in a single notebook.

Friday, 1 July 2016

Transportation

Introduction


The notebook Transportation focuses on the transport system of Europe and performs analysis on data made available by the EuroStat organization, which provides a large number of datasets on every possible topic. This notebook uses Apache Flink for the analysis of the data and the visualization capabilities of Apache Zeppelin.
The analysis has been divided into three broad sections: transportation overview, road transport and rail transport. The overview section provides general information and performs informative analysis on the availability and usage of transport services in the region. Thereafter, the analysis focuses on the most used means of transport, road and rail, and considers various parameters provided by the datasets that might indicate the condition of each means of transport. In both sections, the main aim has been the analysis of traffic congestion and of the measures the countries have taken to handle the pressures of traffic. As in the previous notebooks, the first few paragraphs load the datasets and define some general functions used throughout the notebook. At the beginning of each section, a paragraph defines the case classes needed for loading the datasets of that particular section. Since a large number of datasets with different formats have been used, two case classes, 'CommonType1' and 'CommonType2', have been defined at the start for converting any dataset to a common type for visualization; a sketch of the idea is given below. Additional functions have been defined to get the data from the 'common types' into the table display format. The notebook makes use of the html display and a Helium Application for displaying custom visualizations and maps respectively.
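An illustrative sketch of the common-type idea; the field names and converter below are assumptions, since the notebook's actual 'CommonType1'/'CommonType2' definitions are not reproduced in this post.

```scala
// A section-specific record and the shared shape used by the display helpers.
case class RoadLength(country: String, year: Int, km: Double)
case class CommonType1(label: String, x: Double, y: Double)

// Convert any section-specific record into the common shape.
def toCommon(rows: Seq[RoadLength]): Seq[CommonType1] =
  rows.map(r => CommonType1(r.country, r.year.toDouble, r.km))

// Render as Zeppelin's %table display: a header row, then tab-separated rows.
def toTable(rows: Seq[CommonType1]): String =
  "%table label\tx\ty\n" + rows.map(r => s"${r.label}\t${r.x}\t${r.y}").mkString("\n")

println(toTable(toCommon(Seq(RoadLength("France", 2012, 1028000.0)))))
```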


Getting the data from EuroStat


EuroStat is a European organization that provides data on various indicators in Europe; all the data for this notebook was fetched from it. Navigating to the bulk download feature of the website provides complete information about the available datasets in a pdf file. From that file one may find the relevant datasets and download them using the bulk download facility. In the case of this notebook, all the datasets related to transportation were downloaded, except for a few that were only available at the individual country level. After downloading, inconsistencies were removed using the stream editor (sed). Since the number of countries is significantly large, only five representative nations have been chosen.


Analysis using Apache Flink

 

1. Transportation Overview


The first section in the notebook examines the transportation facilities available in the region, starting with the lengths of the various means of ground transport such as canals, rivers, roads and rail tracks, and moving on to the passenger usage of road and rail transport measured as a percentage of total traffic.
All the countries show the greatest usage of cars and low net usage of the other means of transport such as buses and trains. Spain, which is low on train usage, has significantly higher bus usage. The next dataset considered, the difficulty in accessing public transport, has multiple dimensions with every dimension spanning further sub-dimensions, so a radial Reingold-Tilford tree is the best possible visualization for such a dataset. The first paragraph for this indicator generates the JSON string of the data required by the visualization code, and the subsequent paragraph displays the visualization. Since the tree is reasonably big, only two of the nations, Germany and France, have been considered. Then we consider the number of trips grouped by means of transport: the United Kingdom has had the greatest number of air trips both within and outside the country, while Germany and France, which had the greatest length of roads, also have the greatest number of road trips. Lastly, we consider the consumer price indexes for road and rail transport and find that both have fallen drastically over recent years, which indicates easier access to transport services in terms of the amount paid for a particular service. A small Flink sketch of this kind of aggregation is given below.
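For flavor, here is a minimal Flink (Scala) sketch of the kind of aggregation behind these charts; the file name, schema and year are placeholders, not the notebook's actual datasets.

```scala
import org.apache.flink.api.scala._

// Placeholder schema: country,year,mode,share (share of total passenger traffic).
case class Usage(country: String, year: Int, mode: String, share: Double)

val env = ExecutionEnvironment.getExecutionEnvironment
val usage = env.readCsvFile[Usage]("transport_usage.csv", fieldDelimiter = ",")

// Total share per means of transport in one year, across the chosen nations.
val byMode = usage
  .filter(_.year == 2012)
  .groupBy("mode")
  .sum("share")

byMode.print()
```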


2. Road Transport


Road transport has historically been the most used means of transport and is taken up in the next section of the notebook. We begin with the first dataset for the section: the number of vehicles on the road, the vehicle stock.
For the period for which the vehicle numbers were available, the graph appeared exponential and highly zoomed in for the years considered (2009-2012). Then we consider the vehicle stock individually for the most prevalent means of commuting by road, the number of buses and the number of cars, and compare them against usage to get a picture of how usage is related to stock. The number of buses is the greatest for the United Kingdom while bus usage is the greatest for Spain; for cars, usage is the greatest for the UK but the number of cars is the greatest for Germany. Next, we compare the length of roads against the number of cars and buses to get an idea of the congestion on the roads. Since the number of cars is an independent factor, the road infrastructure should scale up accordingly. Only France shows this trend; the rest of the nations have increasing vehicle stock but no increase in road infrastructure. The motorization rate, which is the number of passenger vehicles per thousand inhabitants, is the greatest for Italy and Germany. The consumer price index for the purchase of vehicles has been decreasing constantly, which implies that the countries have made it easier for people to own personal vehicles. Lastly, we turn our attention to road accidents and find that their number has been constantly decreasing, with the maximum being for drivers and in rural areas. This section ends with a map showing some of the cities of Europe along with congestion expressed as a percentage, and the road network of Europe.



3. Rail Transport

 

The second most used means of land transport is the railways, and the last section of the notebook focuses on the rail transportation facilities of the region. The first dataset, visualized in the form of pie charts, shows the number of trains in the region, and we get to see that Germany has by far the greatest number of goods and passenger trains, followed by the UK and France. In the comparison of track length against the number of rail cars, we find that the lengths have not been increasing in proportion to the increase in rail cars; in fact, for some of the nations the track length has decreased in recent years.
The number of passenger rail vehicles has largely stayed constant, and for Germany and Italy it has decreased. Compared to the passenger rail vehicles, train usage has gone up; however, this may not be a problem, as the net usage of train services has been low overall, so that slight increases in train usage can be accommodated without any significant increase in train services or the number of rail cars. France and Germany, which have the greatest length of tracks, also have the longest distances of train travel, while the number of passengers has been the greatest for Germany and the UK, indicating that commuters may prefer these services over other means of transportation. The number of passenger rail vehicles has been increased in line with the increase in the number of passengers only for France and, to some extent, Italy; for the others there has been no increase. France and Germany also have the greatest number of people employed in the railways, as they have the largest networks. Lastly, we take a look at railway accidents and find that they have been decreasing over the years, with the greatest number of accidents occurring due to rolling stock in motion. A map showing all the railway lines of Europe ends the section.


Visualizations

 

The visualizations for this notebook were generated using the nvd3 and D3 libraries through the html display. The data for the Reingold-Tilford tree was generated using a custom function written right above the paragraph showing the tree. This function converts the data present as rows in a table into the JSON format needed by the visualization code. Other functions, defined at the top in a generic paragraph, get the datasets into the common type format and produce the corresponding table display. The congestion map contains markers that were generated using JavaScript run through the Helium Application. The last map, showing the rail network, was produced using the OpenLayers 3 library and the html display. See a sample demo of the notebook here.

Wednesday, 15 June 2016

Economics

Introduction


The notebook titled Economics focuses on two recent economic issues of the world: Sub-Saharan African growth and the economic condition of Latin America and Paraguay. The datasets for both topics were provided by the International Monetary Fund. The IMF website provides a large and formidable list of data indicators to choose from, and extracting the relevant datasets proved to be a tough task; the method adopted for choosing them is described below. Analysis and visualization for the notebook was done using Apache Spark, the default visualization capabilities of Zeppelin (as described in the introductory post), the D3 libraries through the html display, and a Helium Application. As in the previous notebook, the datasets were first downloaded using the shell interpreter and then loaded and converted into appropriate 'DataFrames' for analysis using a custom 'explode()' function that takes the input dataframe and returns a dataframe that can easily be queried and registered as a table; a sketch of such a function is given below. The first two paragraphs in each section perform these data fetching and loading tasks while the rest are used for analysis.
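What such an 'explode()' helper might look like, assuming the IMF extracts arrive in wide format with one column per year (Spark 1.x era); the names and shapes here are illustrative, since the notebook's actual implementation is not shown in this post.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

val sqlContext = new SQLContext(sc)  // provided automatically in Zeppelin
import sqlContext.implicits._

case class IndicatorValue(country: String, indicator: String, year: Int, value: Double)

// Turn a wide DataFrame (one column per year) into a long, queryable one.
def explode(wide: DataFrame, yearCols: Seq[String]): DataFrame =
  wide.flatMap { row =>
    val country   = row.getAs[String]("country")
    val indicator = row.getAs[String]("indicator")
    yearCols.flatMap { y =>
      // Skip missing cells, which are common in the raw extracts.
      Option(row.getAs[String](y)).filter(_.nonEmpty)
        .map(v => IndicatorValue(country, indicator, y.toInt, v.toDouble))
    }
  }.toDF()

// val long = explode(raw, (2000 to 2015).map(_.toString))
// long.registerTempTable("gdp")   // then queryable from %sql paragraphs
```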


Getting the Data Sets


Once the topics for analysis were chosen, the datasets relevant to them had to be figured out. For this purpose, certain blogs and publications, such as those on the IMF website itself, were consulted, and the list of data indicators was manually scanned to extract the needed datasets. For example, the IMF site has two good articles on the recent advances in economic growth in Africa and Sub-Saharan Africa, links to which are provided in the notebook. With the aid of these posts and publications, cause-and-effect diagrams were drawn to roughly establish the interdependence of the indicators and their impact on the economy. For example, in the case of Africa, the recent slowdown can be examined by first considering the variation in growth by country, revenues, commodity prices and government policies; similarly for the other topics and indicators.


Analysis using Apache Spark


1. The African Growth Story

 

Sub-Saharan Africa had been experiencing good economic growth for the last decade; only recently has that growth slowed down. In the first topic of the notebook, we examine the factors that affect the African economy and what the region might do to improve. Six representative nations have been chosen for the analysis: Nigeria, Central African Republic, Democratic Republic of Congo, Madagascar, Mauritius and South Africa.
We begin with the broad money and revenues for the region, both of which have stayed fairly constant or have fallen over time, especially the revenues for Nigeria. The clustering results indicate that even the current account holdings will be low for the years to come. On the other hand, Nigeria has been able to achieve the greatest national savings, unlike South Africa and Mauritius, which have low savings and high expenditures. The expenditures suggest that South Africa and Mauritius might have the highest debt, but the pie charts reveal that initially the Central African Republic and Congo had the highest debts even with low expenditures, although expenditures caused the debts of South Africa and Mauritius to increase in subsequent years. The capital formation values were again high for Mauritius, Madagascar and South Africa.

Among the factors that have an impact on the economy, the consumer price index is considered first, and it is observed that it has only been rising in recent years. Increasing consumer prices are an indication of inflation; even the gross debt increases linearly with the consumer prices. The next factor, access to financial services, has been measured by considering the total number of bank branches and bank usage as indicators of the level of financial activity in the region. The charts reveal that only South Africa has a large number of bank branches as well as above-mean broad money. Some nations, namely South Africa and Nigeria, have also been performing consistently well in terms of the value of exports. However, an important fact is revealed in the graph of current account values against the value of exports: the current account values do not increase with the value of exports, which may be attributed to net imports being either equal to or greater than the value of exports for the region. Finally, we get to see that the total expenditures of the nations and the region are almost entirely dependent on government expenditures, so to keep expenditures and debt under control, government policies are going to play a crucial role. This is confirmed by the IMF report, which states that a change of government and fiscal policies is all that is needed to put the region back on the track of progress.



2. Latin America and Paraguay

 

An IMF publication recently reported that growth in Paraguay would stay resilient at 3% even amid the regional slowdown. In the second topic of the notebook, we examine the economic condition of the region and analyze some of the factors that have been affecting it.
We begin with the balance of payments and the gross debt. For the balance of payments, we find that for some nations such as Bolivia it has fallen drastically over recent years, while for others it has either been constant or is on the rise, as in the case of Brazil and Paraguay. In terms of spending and gross debt, Brazil takes the lead, while Paraguay stands in good stead with debt projections below 30% of GDP. This is an important point, as illustrated in the graphs that follow, which have the GDP values on the left axis and the debt values on the right axis: Brazil, which has the greatest GDP, also has the highest debt at almost 90% of GDP. On the other hand, Paraguay, which has a low debt percentage and still-increasing GDP, would stay strong economically.


Next, we proceed to examine the factors that led to the present economic condition of the region, beginning with general government net lending/borrowing. In the correlation that we perform between net lending/borrowing and the balance of payments, we get a correlation value of 0.6012, as is also indicated by the graphs; this shows that lending and borrowing largely affect the balance of payments (a sketch of the computation follows below). The next factor examined is the net cash inflow from financing activities, which has been consistently high for Brazil and has been swinging up and down for the other nations. Only for Paraguay has it been rising over recent years, and chances are that it will come close to Brazil in the years to come. In the regression performed with the cash surplus/deficit values as the dependent variable and the cash expenditures and cash inflows as features, we get negative weights for both features. This can be explained by the larger negative weight of the expenditures, which overshadows the positive effect of the cash inflows, resulting in an apparent negative weight for the cash inflows. Two of the nations, Brazil and Argentina, distinctly stand out in terms of purchasing power parity, leading the other nations by a huge margin. Differences in purchasing parities also explain the net change in the stock of cash: nations with higher purchasing parity have less inflow or outflow of cash than nations with lower parities, assuming debt liabilities remain constant, and the same is observed in the net cash stock graph plotted alongside. Finally, we take a look at the changing consumer prices and see that they have fluctuated the most for Argentina, reaching their maximum and minimum in consecutive years. For others, such as Brazil and Paraguay, they have largely remained constant over time, indicating greater stability of the internal markets. In the pie charts that follow, we get to see how the most important sectors of the internal markets share the consumer price indexes amongst themselves: food, housing and miscellaneous items form the greatest share, with the housing share slightly increasing over time.
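The correlation step itself is a one-liner with the DataFrame stat functions available since Spark 1.4; the tiny DataFrames below are placeholders for the notebook's real indicator tables.

```scala
// sqlContext and its implicits are available in a Zeppelin Spark paragraph.
import sqlContext.implicits._

// Placeholder indicator tables; in the notebook these come from the IMF datasets.
val lendingDf = Seq(("Paraguay", 2014, -1.1), ("Brazil", 2014, -6.0), ("Bolivia", 2014, -3.4))
  .toDF("country", "year", "net_lending")
val bopDf = Seq(("Paraguay", 2014, -0.4), ("Brazil", 2014, -4.3), ("Bolivia", 2014, 0.0))
  .toDF("country", "year", "balance_of_payments")

val joined = lendingDf.join(bopDf, Seq("country", "year"))
val r = joined.stat.corr("net_lending", "balance_of_payments")  // Pearson by default
println(f"correlation: $r%.4f")  // the notebook reports 0.6012 on the real data
```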


Visualizations


The visualizations for this notebook were created using the default visualization capabilities of Zeppelin, along with custom graphs and charts built with the D3 visualization libraries. Default visualizations are simple: the dataframes containing the data need only be registered as tables, and then they can be visualized using SQL (see the sketch below). As for the custom visualizations, the D3 code can either be written inside Zeppelin's html display system, which requires the data in well-formed JSON, or the Helium pluggable application can be used. The html display only requires the data to be visualized, which can be provided using string interpolation. Running the paragraph creates the visualization, and its positioning and colors can be adjusted using CSS.
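A minimal sketch of that flow in a 2016-era Zeppelin notebook; the table and column names are illustrative, not the notebook's actual ones.

```scala
import sqlContext.implicits._

// gdpDf stands in for any of the notebook's DataFrames.
val gdpDf = Seq(("Paraguay", 2014, 30.9), ("Brazil", 2014, 2456.0))
  .toDF("country", "year", "gdp_billion_usd")
gdpDf.registerTempTable("gdp")

// A following paragraph can then chart the table with the built-in displays:
//   %sql
//   select country, year, gdp_billion_usd from gdp order by year
```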

A Helium application can also be used to create visualizations by first getting the data generated by the Spark REPL interpreter; the data can be either a REPL result or in table format.
A simple Helium application simply extends the Application class provided in the helium package of Zeppelin. Then, to get the result of the previous paragraph, we first need the interpreter context corresponding to the currently running instance of Zeppelin, which we can obtain through a static getter method of the 'InterpreterContext' class. The interpreter context provides the resource pool, from which we can get the result of the previous paragraph as a 'Resource' type; this can then be used to provide the data to the html or JavaScript file used to create the visualizations. Since the visualizations (the html and JavaScript) are displayed through the angular display system, we first need to add the data to the 'angular object registry' to be able to use it in the html file. See a sample demo here.