Thursday 18 August 2016

SnapBook

Introduction


The notebook 'SnapBook' performs analysis on the SNAP (Stanford Network Analysis Project) datasets. These datasets are made publicly available on Stanford University's SNAP website and can be freely used for analysis. They are mainly graph datasets and generally contain files for the edges and the nodes of the graphs. A few of the datasets also contain files for the social network circles of a given user. The analysis in the notebook emphasizes the social network data, mainly because it offers the most scope for analysis. As in the other notebooks, the first few paragraphs download the datasets and load them for analysis; thereafter, analysis is performed using Apache Spark and the visualization capabilities of Apache Zeppelin along with the D3 libraries.
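As a rough sketch of that loading step in a Spark paragraph (the file name follows the SNAP Facebook dataset, the local path is an assumption, and sc is the SparkContext that Zeppelin provides):

// Load the SNAP edge list into an RDD of (source, destination) node pairs.
// Each line of the file is two node ids separated by a space.
val edges = sc.textFile("/tmp/facebook_combined.txt")
  .map(_.split(" "))
  .map(parts => (parts(0).toLong, parts(1).toLong))
edges.take(5).foreach(println) // quick sanity check of the parsed pairs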



Getting the Datasets


The SNAP website lists datasets that contain graph data from various websites, such as Facebook, Google Plus and Twitter, as well as the web graph of pages from Berkeley and Stanford. All the datasets are of nearly the same type, although they belong to different domains of information. For this notebook, the social network datasets were specifically chosen. These datasets contain the following files: the edges file, the features for each of the users, the names of the features, and the circles for the users. These files and their data have been used in conjunction to extract information and perform analysis.
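For illustration, parsing one ego network's circles file could look like the sketch below; the ego id 0 and the path are assumptions, and the format (a circle name followed by tab-separated member ids per line) follows the SNAP documentation.

// Each line of a .circles file is: circleN<TAB>member1<TAB>member2...
val circles = sc.textFile("/tmp/facebook/0.circles")
  .map(_.split("\t"))
  .map(parts => (parts(0), parts.tail.map(_.toLong)))
circles.collect().foreach { case (name, members) =>
  println(s"$name contains ${members.length} members")
}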



Analysis using Apache Spark


The Social Network


Social network datasets have gained popularity in recent years as they have grown in magnitude and opened up new scope for research; more articles and research papers are now being published as a consequence of the availability of the data. In the notebook, we focus primarily on analyzing the circles that users of the social network tend to form amongst themselves, as well as the communities that are present in the network. One of the papers referenced in doing the analysis describes the terms used to interpret the datasets. The user currently being analyzed is referred to as the 'ego' user: the 'ego' user's network is represented by the edges file for that user, the members of the user's circles are listed in the circles file, and the feature names are kept in a separate file.

The features of the 'ego' user best describe his interests. Apart from that, the circles containing the largest number of this 'ego' user's friends also share some common features, which is what makes those circles more prominent than others. So we also analyze the largest and the second-largest circles: both contain users whose features were also present in the feature list of the 'ego' user. Further, many users appear in multiple circles for this user. On analyzing the features of those users as well, we find that the feature labels still overlap with the features of the 'ego' user, which strongly indicates that this user is more likely to make friends with people who exhibit those recurring features.
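A sketch of that comparison, reusing the circles RDD from the earlier sketch and under the same assumptions about paths and the SNAP file formats (.egofeat holds the ego's binary feature vector, .feat holds one row per network member):

// Indices of the features the ego user exhibits (value 1 in the vector).
val egoVector = sc.textFile("/tmp/facebook/0.egofeat").first().split(" ").map(_.toInt)
val egoFeatures = egoVector.zipWithIndex.collect { case (1, i) => i }.toSet

// Feature vectors of the network members: "nodeId f1 f2 ...".
val memberFeatures = sc.textFile("/tmp/facebook/0.feat")
  .map(_.split(" "))
  .map(parts => (parts(0).toLong, parts.tail.map(_.toInt)))
  .collectAsMap()

// Rank the circles by size and count, in the two largest, the members
// sharing at least one feature with the ego user.
val topTwo = circles.collect().sortBy { case (_, members) => -members.length }.take(2)
for ((name, members) <- topTwo) {
  val shared = members.count { m =>
    memberFeatures.get(m).exists(_.zipWithIndex.exists { case (v, i) => v == 1 && egoFeatures(i) })
  }
  println(s"$name: ${members.length} members, $shared share a feature with the ego user")
}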

In the succeeding sections of the notebook, we attempt to find communities using an algorithm that depends on 'conductance' values in the graph. The paper 'Communities in Social Networks' referenced in the notebook best explains the conductance parameter: consider a partition of the original graph; this partition will have some cut edges and some internal edges, and the ratio of cut edges to internal edges is its conductance value. Our goal is to find partitions of the original graph such that the conductance value of each resulting partition is minimized. The algorithm, inspired by the mentioned paper, was first written in rough, then run on the data in the notebook and tested visually; the outcomes clearly show the separation of communities in the graph. The dark blue circles represent the cores of the communities and the light blue circles represent their borders. In the algorithm itself they had been represented as black and gray nodes respectively: the algorithm traversed the graph breadth-first using a queue data structure and colored the nodes as it encountered them, with gray nodes marking the periphery of a community and black nodes marking those placed in its core.
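The notebook's exact code is not reproduced here, but a minimal local sketch of the idea might look like this: grow a community outward from a seed node breadth-first, and accept a gray frontier node into the black core only while the cut-to-internal-edge ratio does not worsen.

import scala.collection.mutable

// Greedily grow a low-conductance community from a seed node.
// adj maps every node of an undirected graph to its neighbours.
def growCommunity(adj: Map[Long, Set[Long]], seed: Long): Set[Long] = {
  val core = mutable.Set(seed)                    // "black" core nodes
  val queue = mutable.Queue(adj(seed).toSeq: _*)  // "gray" frontier nodes
  val seen = mutable.Set(seed) ++ adj(seed)

  // Conductance as defined above: cut edges over internal edges.
  def conductance(s: mutable.Set[Long]): Double = {
    val internal = s.toSeq.map(n => adj(n).count(s)).sum / 2 // each internal edge counted twice
    val cut = s.toSeq.map(n => adj(n).count(!s.contains(_))).sum
    if (internal == 0) Double.MaxValue else cut.toDouble / internal
  }

  var best = conductance(core)
  while (queue.nonEmpty) {
    val candidate = queue.dequeue()
    core += candidate
    val c = conductance(core)
    if (c <= best) {
      best = c                                    // accept: the node turns black
      for (m <- adj(candidate) if !seen(m)) { seen += m; queue.enqueue(m) }
    } else {
      core -= candidate                           // reject: the node stays on the periphery
    }
  }
  core.toSet
}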


Conclusion


The notebook demonstrated how conveniently Zeppelin can be used to integrate all aspects of analysis that can be performed on graph-type datasets, along with visualizations that follow the analysis as visual proof. The graph visualizations were done using the D3 libraries, and the data for the graph-producing code was generated in Zeppelin after the analysis was done. The ability to reorder paragraphs gives the flexibility to first perform analysis in rough, on a testing basis, before finalizing the results. After confirmed results have been obtained, they can be neatly described using 'markdown' and displayed to the user.
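The hand-off to D3 can be as simple as printing a paragraph whose output starts with %html, which Zeppelin renders as HTML. A minimal sketch of that step, reusing the edges RDD loaded earlier (the real notebook's markup and drawing code are more elaborate):

// Serialize a sample of the analyzed edges as the JSON links array that
// D3's force layout expects, and emit it inside an %html paragraph.
val linksJson = edges.take(100)
  .map { case (s, t) => s"""{"source":$s,"target":$t}""" }
  .mkString("[", ",", "]")
println(s"""%html
<div id="graph"></div>
<script src="https://d3js.org/d3.v3.min.js"></script>
<script>var links = $linksJson; /* force-layout drawing code goes here */</script>""")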
