Thursday, 18 August 2016

SnapBook

Introduction

 

The notebook 'SnapBook' performs analysis on the SNAP(Stanford Network Analysis Project) datasets. These datasets are made publicly available by the Stanford University's
SnapBook
SNAP website and can be freely used for analysis. They are mainly graph datasets and generally contain the files for edges of graphs and the nodes in the graphs. Few other datasets contain the files for the social network circles for the current user. The analysis in the notebook emphasizes more on the social network data mainly due to the more scope of analysis with the data. As in the other notebooks, the first few paragraphs download the datasets and load them for analysis, thereafter analysis is performed using Apache Spark and the visualization capabilities of Apache Zeppelin along with D3 libraries.



Getting the Datasets

 

The Snap website lists datasets that contain graph data for various other websites such as Facebook, Google Plus, Twitter and graph data for web pages between Berkeley and Stanford. All the datasets are nearly of the same type although they belong to different domain of information. For this notebook, the social network datasets were specifically chosen. The datasets from the site contain the following files : the edges file, the features for each of the users, the names of the features and the circles for the users. These files and their data have been used in conjunction to extract information and perform analysis.


 

Analysis using Apache Spark

 

The Social Network

 

Social Network datasets have gained popularity in the recent years due to the scope of research that has opened up as these datasets have grown in magnitude. More articles and research papers are now being published as a consequence of the availability of the data. In
The Social Network
the notebook, we focus primarily on analyzing the circles that users of the social network tend to form amongst themselves as well as the communities that are present in the network. One of the papers that was referenced in doing the analysis in the notebook, described the terms that could be used to interpret the datasets. The user currently being analyzed is referred to as the 'ego' user. The 'ego' user has a network that is represented by the edges file for that 'ego' user and the members in the circles for the 'ego' user are represented by the circles file. The feature names for the users are in a separate file. The features for the 'ego' user would best describe his interests. Apart form that, the circles that have the largest number of friends for this 'ego' user would also have some common features because of which the circle is more prominent than others. So we also analyze the circles that have the largest number of users and the second largest circle - both of these have users with features that were also present in the feature list of the 'ego' user. Further, a lot of users appear in multiple circles for this user. On analyzing the features of even those users, we find that the feature labels still overlap with the features of the 'ego' user which very strongly indicates that this user is more likely to make friends with people who have those features that are repeatedly exhibited.

In the succeeding sections of the notebook, we attempt to find communities by an algorithm that depends on the 'conductance' values in the graph - a parameter inherently exhibited by
Communities in Social Networks
  graphs. The paper mentioned here best explains the 'conductance' parameter. Consider a graph partition of the original graph, then this partition will have some cut edges and some internal edges. For this partition, the ratio of the cut edges to the internal edges is the conductance value. Our goal is to find the partitions in the original graph such that the conductance values for each of the partitions that are obtained are minimized. The algorithm as inspired by the mentioned paper was first written in rough and finally run on the data in the notebook and tested visually, the outcomes clearly show the separation of communities in the graph. The dark blue circles represent the core of the communities and the light blue circles represent the borders of the communities. Actually, they had been represented in the algorithm as black and gray nodes respectively. The algorithm traversed breadth-wise using queue data structure and colored the nodes as it encountered them. Gray nodes represented the periphery of the communities and black nodes represented those that had been put in the core of the communities.


 Conclusion

 

The notebook demonstrated how conveniently Zeppelin can be used to integrate all aspects of analysis that can performed on graph type datasets along with visualizations that follow the analysis as a visual proof. The graph visualizations were done using D3 libraries and the data for the graph producing code was generated in Zeppelin after the analysis was done. Reordering of paragraphs gives the flexibility that enables one to first perform analysis in rough, on a  testing basis before finalizing on the results. After confirmed results have been obtained, they can be neatly described using 'markdown' and displayed to the user.

Monday, 8 August 2016

WorldWideWeb

Introduction

 

The WorldWideWeb notebook performs analysis on the web crawl data that is collected and managed by the CommonCrawl organization. We know that web crawl data has traditionally been collected and used only by commercial companies but with increasing
WorldWideWeb Notebook
awareness about the importance of data, especially crawl data, there needs to be someone who does the task of crawling and providing the data to those who can use it and CommonCrawl just does that.
Moreover, the data is provided for free and in an organized form in three formats : WARC(raw web archive), WAT(extracted meta data) and the WET format( extracted plain text). Of these formats, the WARC and WET formats have been used for analysis in the notebook. With the change in the type and nature of the datasets, the layout of the notebook has changed from the previous ones. The notebook consists of seven sections each doing a particular type of analysis. For the ingestion and analysis of the WARC format types, the Warcbase library has been extensively used. Due to the scale of the datasets, the notebook could not be accommodated on the local machine and so all the analysis was done on an m4.xlarge instance on Amazon EC2 running Spark inside Apache Zeppelin. The first few paragraphs download the datasets and get them in the RDD form using the warcbase 'loadArchives()' function which directly loads the web archives from the disk(EBS volume). The subsequent paragraphs perform the required analysis on the datasets. 


Getting the Data Sets

 

 Web Crawl data is huge and intimidating; considering only the data for May, the entire crawl content is divided into some 24500 segments with each segment being around 1GB in size,
which made it impossible to even peep into the data using any application program on my local machine. Therefore, it was decided to run the zeppelin instance on an EC2 instance to perform analysis. This required the installation of spark and zeppelin on the m4.xlarge machine which had 4 virtual CPU(s) and 16GB of RAM. Initially, the data sets were loaded with no hassles but then I hit some 'OutOfMemory' which have been described below. After the root cause of these issues was found, the rest of the notebook was developed keeping them in mind. Also, not all segments of the month of May could be accommodated so only few have been used.


Analysis of Data using Apache Spark

 

1. Domain Frequencies

 

The first section of the notebook examines the domain frequencies, that is, how frequently a domain occurs in the web crawl data. The dataset was divided into two parts - one for the beginning and other for the ending segments of May and the domain frequencies were
Domain Frequencies
deduced accordingly. The beginning segments had 'fangraphs.com' as the top domain, followed by 'osnews.com' and 'google.com' while the ending segments had 'economist.com' as the top domain. We know that advertising is the main source of revenue on the internet, therefore the domains that are more frequent have more chances of being visited by web users than others that have lesser frequencies. Comparison of the domain frequency values also gives one, a chance to compare the relative positions of a domain on the web. Two domains that present the same content, may have different positions in terms of their domain frequencies and this may lead to difference in the amount of web traffic that is being drawn to a particular domain.



2. Analysis of Site Link Structure

 

The next section deals with the analysis of the site link structure. We know from the previous section that certain websites have more domain frequencies than the rest. Here, we use the sankey diagrams to examine which websites link to these popular domains that they
Site Link Structure
may benefit from the traffic that goes to the most popular domains. For this purpose, we first extract the links from the crawl data using the 'ExtractLinks' class of warcbase library. This proved to be a tough task as this operation repeatedly gave 'java.lang.OutOfMemoryError' when finally after tuning the Spark engine and finding and eliminating the root cause, I was able to proceed. This led to the following configuration of Spark : driver using 16GB of memory, using G1 garbage collector, using compressed Oops, and reducing the memory used for persisting to 0.1 percent of the available memory. Apart from this, the primary cause of the error was a burst of data : the original RDD loaded with the warcbase library contained exactly 53307 records and when 'flatMap()' operation was used to extract the links from these many records(web pages), the driver could not hold the amount of data. So a simple solution to the problem was to filter the domains right after the dataset was loaded to include only the ones being considered. From the sankey diagrams, it was found that few other sports sites had their links mentioned on 'fangraphs.com' and other technology sites had their's on 'osnews.com'. Google had links only to its own domains with the maximum being to 'scholar.google.com'.




3. Analysis of Web Graph Structure


In this section, we take a look at the Web Graph structure from the web graph extracted by the 'ExtractGraph' class of the warcbase library which in turn uses the GraphX library of Spark. The extraction of the data was followed by visualization using the D3 libraries.
Web Graph Structure
The visualization represents the webpages(domains) as nodes and the links between the nodes as edges of the graph. The nodes that have more page rank values have bigger size than the rest. We get to see that page rank values are an indicator of the relative importance of the web pages. The other pages that are linked to the larger(bigger) domains tend to stick close to them visibly forming communities on the web. Edges between the nodes indicate the strength of the links. One way to test this, is to pin two domains preferably larger ones far apart from each other and then release them. The nodes will appear to be pulled towards each other by a force proportional to the strength of the links. A slide bar at the top of the visualization allows us to control the number of web pages being displayed in the visualization. Authorities and Hubs are also evident in the visualization.




4. Impact Measurement of Google Analytics


Measurement of the impact of google analytics in this section was inspired by the work of Stephen Merity on this topic. Google analytics is the most preferred solution for the measurement of web traffic to a particular site. Most of the websites deploy google analytics
Google Analytics
on their websites. Two fundamental questions need to be answered with respect to the usage of google Analytics : first - how many websites is it deployed on, and secondly what percentage of the browsing history of a user is leaked to Google through this analytics feature. As measured by the operations on the web page RDD, nearly 54% of the pages have google analytics on their websites which indicates that more than half of the websites have the feature enabled. Further, to measure the leaked browsing history we calculate the total links in the web graph obtained from the previous section and the number of links that have analytics enabled on either of the pages forming the links. The ratio of these two quantities is the leaked browsing history. In the section presented in the notebook, this value turns out to be a hundred percent which is due to the low scale of data being considered and that which can be afforded by the EC2 instance. The output of the algorithm changes and increases in accuracy with increasing amount of data.



5.  Analysis of the context of Locations


The WET data for the segments of May contains all the extracted text that can be used to find the locations present in the data and then to analyze the context in which those locations
Context of Locations 
were mentioned. To get the list of locations, we first manually form a sufficiently good list of locations in a file called places.txt and then compare the RDD of text data with the list of locations by using the intersect operator. This gives the list of locations. Now, to analyze the context in which
those locations were mentioned, we extract the words that were mentioned closest to the locations and store them in a map for easy retrieval of data. Finally, this is presented in the form of a search interface wherein users can enter the locations and get the context of those locations.



6. Mapping entities onto Wikipedia concepts

 

This section was inspired by the work of Chris Hans, who has explained how to map the entities to external ontologies like wikipedia for deriving their contexts. The first operation involved is the extraction of entities from the raw warc data. For this purpose, the Stanford
Wikipedia Concepts
Natural Language Processing packages have been used which extract the entities from the warc data. To be able to
map the entities to concepts, it is important that the entities be extracted from only wikipedia URL Strings so that we may have the wikipedia address of that particular entity. After all the entities and their corresponding wikipedia concepts have been mapped, we calculate the total number of links for an entity and the count of individual links for an entity. The ratio of these quantities for each link of each entity is the probability of that concept representing that entity. Finally, all the entities and their concepts have been displayed in the form of an indented tree for ease of viewing.




7. Search Engine

 

Search Engine
 The last section of the notebook presents a search engine using Apache Lucene. Using the warcbase library, we first bring the raw data to tuples of URL and page Content pairs both in the form of Strings. Then these are used to form the search index on disk using the indexing classes of Apache Lucene. Once the index is generated, the 'IndexSearcher' class can be used to repeatedly search the index for terms entered into the system. This feature is finally presented as an interface in the last paragraph where the user may enter the search queries and get the results back from the search indexer. As is obvious, the results that are returned are only from the pages that form the index on disk.


 

Conclusion

 

This notebook presented the possibilities of analysis with the Common Crawl datasets which are likely to become more popular with time and increasing demand for web crawl data. More importantly, the analysis and visualization of data in Zeppelin was remarkably easy and intuitive. More number of packages might have been involved for analysis of this type of data without Zeppelin. Visualizations can follow right after the paragraph analysing the data which makes it easy to relate the context in which the visualization is presented. Explanations can be given alongside the diagrams for the viewer to be able to understand them. Furthermore, multiple backend engines and libraries can be combined in a single notebook.