Introduction
The WorldWideWeb notebook performs analysis on the web crawl data that is collected and managed by the CommonCrawl organization. We know that web crawl data has traditionally been collected and used only by commercial companies but with increasing
|
WorldWideWeb Notebook |
|
awareness about the importance of data, especially crawl data, there needs to be someone who does the task of crawling and providing the data to those who can use it and CommonCrawl just does that. Moreover,
the data is provided for free and in an organized form in three formats
: WARC(raw web archive), WAT(extracted meta data) and the WET format(
extracted plain text). Of these formats, the WARC and WET formats have
been used for analysis in the notebook. With the change in the type and
nature of the datasets, the layout of the notebook has changed from the
previous ones. The notebook consists of seven sections each doing a
particular type of analysis. For the ingestion and analysis of the WARC
format types, the Warcbase library has been extensively used. Due to the
scale of the datasets, the notebook could not be accommodated on the
local machine and so all the analysis was done on an m4.xlarge instance
on Amazon EC2 running Spark inside Apache Zeppelin. The first few
paragraphs download the datasets and get them in the RDD form using the
warcbase 'loadArchives()' function which directly loads the web archives
from the disk(EBS volume). The subsequent paragraphs perform the
required analysis on the datasets.
Getting the Data Sets
Web Crawl data is huge and intimidating; considering only the data for May, the entire crawl content is divided into some 24500 segments with each segment being around 1GB in size,
which made it impossible to even peep into the data using any application program on my local machine. Therefore, it was decided to run the zeppelin instance on an EC2 instance to perform analysis. This required the installation of spark and zeppelin on the m4.xlarge machine which had 4 virtual CPU(s) and 16GB of RAM. Initially, the data sets were loaded with no hassles but then I hit some 'OutOfMemory' which have been described below. After the root cause of these issues was found, the rest of the notebook was developed keeping them in mind. Also, not all segments of the month of May could be accommodated so only few have been used.
Analysis of Data using Apache Spark
1. Domain Frequencies
The first section of the notebook examines the domain frequencies, that is, how frequently a domain occurs in the web crawl data. The dataset was divided into two parts - one for the beginning and other for the ending segments of May and the domain frequencies were
|
Domain Frequencies |
|
deduced accordingly. The beginning segments had 'fangraphs.com' as the top domain, followed by 'osnews.com' and 'google.com' while the ending segments had 'economist.com' as the top domain. We know that advertising is the main source of revenue on the internet, therefore the domains that are more frequent have more chances of being visited by web users than others that have lesser frequencies. Comparison of the domain frequency values also gives one, a chance to compare the relative positions of a domain on the web. Two domains that present the same content, may have different positions in terms of their domain frequencies and this may lead to difference in the amount of web traffic that is being drawn to a particular domain.
2. Analysis of Site Link Structure
The next section deals with the analysis of the site link structure. We know from the previous section that certain websites have more domain frequencies than the rest. Here, we use the sankey diagrams to examine which websites link to these popular domains that they
|
Site Link Structure |
may benefit from the traffic that goes to the most popular domains. For this purpose, we first extract the links from the crawl data using the 'ExtractLinks' class of warcbase library. This proved to be a tough task as this operation repeatedly gave 'java.lang.OutOfMemoryError' when finally after tuning the Spark engine and finding and eliminating the root cause, I was able to proceed. This led to the following configuration of Spark : driver using 16GB of memory, using G1 garbage collector, using compressed Oops, and reducing the memory used for persisting to 0.1 percent of the available memory. Apart from this, the primary cause of the error was a burst of data : the original RDD loaded with the warcbase library contained exactly 53307 records and when 'flatMap()' operation was used to extract the links from these many records(web pages), the driver could not hold the amount of data. So a simple solution to the problem was to filter the domains right after the dataset was loaded to include only the ones being considered. From the sankey diagrams, it was found that few other sports sites had their links mentioned on 'fangraphs.com' and other technology sites had their's on 'osnews.com'. Google had links only to its own domains with the maximum being to 'scholar.google.com'.
3. Analysis of Web Graph Structure
In this section, we take a look at the Web Graph structure from the web graph extracted by the 'ExtractGraph' class of the warcbase library which in turn uses the GraphX library of Spark. The extraction of the data was followed by visualization using the D3 libraries.
|
Web Graph Structure |
|
The visualization represents the webpages(domains) as nodes and the links between the nodes as edges of the graph. The nodes that have more page rank values have bigger size than the rest. We get to see that page rank values are an indicator of the relative importance of the web pages. The other pages that are linked to the larger(bigger) domains tend to stick close to them visibly forming communities on the web. Edges between the nodes indicate the strength of the links. One way to test this, is to pin two domains preferably larger ones far apart from each other and then release them. The nodes will appear to be pulled towards each other by a force proportional to the strength of the links. A slide bar at the top of the visualization allows us to control the number of web pages being displayed in the visualization. Authorities and Hubs are also evident in the visualization.
4. Impact Measurement of Google Analytics
Measurement of the impact of google analytics in this section was inspired by the work of Stephen Merity on this topic. Google analytics is the most preferred solution for the measurement of web traffic to a particular site. Most of the websites deploy google analytics
|
Google Analytics |
|
on their websites. Two fundamental questions need to be answered with respect to the usage of google Analytics : first - how many websites is it deployed on, and secondly what percentage of the browsing history of a user is leaked to Google through this analytics feature. As measured by the operations on the web page RDD, nearly 54% of the pages have google analytics on their websites which indicates that more than half of the websites have the feature enabled. Further, to measure the leaked browsing history we calculate the total links in the web graph obtained from the previous section and the number of links that have analytics enabled on either of the pages forming the links. The ratio of these two quantities is the leaked browsing history. In the section presented in the notebook, this value turns out to be a hundred percent which is due to the low scale of data being considered and that which can be afforded by the EC2 instance. The output of the algorithm changes and increases in accuracy with increasing amount of data.
5. Analysis of the context of Locations
The WET data for the segments of May contains all the extracted text that can be used to find the locations present in the data and then to analyze the context in which those locations
|
Context of Locations | |
were mentioned. To get the list of locations, we first manually form a sufficiently good list of locations in a file called places.txt and then compare the RDD of text data with the list of locations by using the intersect operator. This gives the list of locations. Now, to analyze the context in which those locations were mentioned, we extract the words that were mentioned closest to the locations and store them in a map for easy retrieval of data. Finally, this is presented in the form of a search interface wherein users can enter the locations and get the context of those locations.
6. Mapping entities onto Wikipedia concepts
This section was inspired by the work of Chris Hans, who has explained how to map the entities to external ontologies like wikipedia for deriving their contexts. The first operation involved is the extraction of entities from the raw warc data. For this purpose, the Stanford
|
Wikipedia Concepts |
|
Natural Language Processing packages have been used which extract the entities from the warc data. To be able to map the entities to concepts, it is important that the entities be extracted from only wikipedia URL Strings so that we may have the wikipedia address of that particular entity. After all the entities and their corresponding wikipedia concepts have been mapped, we calculate the total number of links for an entity and the count of individual links for an entity. The ratio of these quantities for each link of each entity is the probability of that concept representing that entity. Finally, all the entities and their concepts have been displayed in the form of an indented tree for ease of viewing.
7. Search Engine
|
Search Engine |
|
The last section of the notebook presents a search engine using Apache Lucene. Using the warcbase library, we first bring the raw data to tuples of URL and page Content pairs both in the form of Strings. Then these are used to form the search index on disk using the indexing classes of Apache Lucene. Once the index is generated, the 'IndexSearcher' class can be used to repeatedly search the index for terms entered into the system. This feature is finally presented as an interface in the last paragraph where the user may enter the search queries and get the results back from the search indexer. As is obvious, the results that are returned are only from the pages that form the index on disk.
Conclusion
This notebook presented the possibilities of analysis with the Common Crawl datasets which are likely to become more popular with time and increasing demand for web crawl data. More importantly, the analysis and visualization of data in Zeppelin was remarkably easy and intuitive. More number of packages might have been involved for analysis of this type of data without Zeppelin. Visualizations can follow right after the paragraph analysing the data which makes it easy to relate the context in which the visualization is presented. Explanations can be given alongside the diagrams for the viewer to be able to understand them. Furthermore, multiple backend engines and libraries can be combined in a single notebook.
No comments:
Post a Comment