Relating Web pages to enable information-gathering tasks

Amitabha Bagchi and Garima Lahoti


Abstract:

In this paper we argue that relationships between Web pages are functions of the user’s intent. We identify a class of Web tasks - information-gathering - that can be facilitated by a search engine that provides links to pages which are related to the page the user is currently viewing. We define three kinds of intentional relationships that correspond to whether the user is a) seeking sources of information, b) reading pages which provide information, or c) surfing through pages as part of an extended information-gathering process. We show that these three relationships can be productively mined using a combination of textual and link information and provide three scoring mechanisms that correspond to them: SeekRel, FactRel and SurfRel. These scoring mechanisms incorporate both textual and link information. We build a set of capacitated subnetworks - each corresponding to a particular keyword - that mirror the interconnection structure of the World Wide Web. The scores are computed by computing flows on these subnetworks. The capacities of the links are derived from the hub and authority values of the nodes they connect, following the work of Kleinberg (1998) on assigning authority to pages in hyperlinked environments. We evaluated our scoring mechanism by running experiments on four data sets taken from the Web. We present user evaluations of the relevance of the top results returned by our scoring mechanisms and compare those to the top results returned by Google’s Similar Pages feature, and the Companion algorithm proposed by Dean and Henzinger (1999).

Experimental Setup:

We performed our experiments on four data sets taken from the Web. Creating these data sets was a multi-stage process that began by querying AltaVista with a search string and taking the top 100 results to form a core set. We then used the open-source Web crawler Nutch to retrieve the pages linked from the core set. Next, we found the top 1000 pages linking to these new pages using AltaVista’s advanced feature, which provides the inlinks of a queried page. Finally, we found the inlinks of the core pages using AltaVista and then returned to Nutch to find the outlinks of those pages. Following Dean and Henzinger, we took only 10 outlinks in the manner they specified: if we were looking at the outlinks of a page u that pointed to a core page v, we kept only the links on u that were “around” the link to v, i.e., the 5 links immediately preceding the link to v on the page and the 5 links immediately following it. Having obtained this data set, we preprocessed it by computing the hub and authority values of all the pages in it.
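The “around the link” selection described above can be sketched as a small helper. This is a minimal illustration, not our crawler code; the function name and the representation of a page as an ordered list of outgoing links are assumptions made for the example.

```python
def outlink_window(outlinks, target, k=5):
    """Dean-and-Henzinger-style nearby-link selection: given the outlinks
    of a page u in document order, keep only the k links immediately
    preceding the link to the core page `target` and the k links
    immediately following it (at most 2*k links in total)."""
    i = outlinks.index(target)          # position of the link to v on page u
    before = outlinks[max(0, i - k):i]  # up to k links preceding v
    after = outlinks[i + 1:i + 1 + k]   # up to k links following v
    return before + after

# Example: a page with 11 outlinks, where the 6th points to the core page.
links = [f"url{n}" for n in range(11)]
nearby = outlink_window(links, "url5")
```
Pages with fewer than five links on either side of the target simply contribute fewer links, since the slices are clipped at the page boundaries.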

Our four data sets were generated using the keyword strings “automobile” (54952 pages), “motor company” (14973 pages), “clothes shopping” (37724 pages) and “guess” (12101 pages). For repeatability, these data sets have been made available here. We conducted extensive experiments on these data sets by taking one page out of a data set as a query and then scoring all three relationships between that page and every other page in the data set. We compared our top 10 scoring pages for FactRel and SeekRel with the top 10 pages returned by Google’s Similar Pages feature. We also implemented the Companion algorithm and compared our results to its top 10 results. For SurfRel we simply took our top 10 results and evaluated them. Evaluation in all these cases was done by conducting user surveys.


Data Set Organization:
The data set directory for each keyword (available below as a .tgz file) contains the following files:
(1) C.txt: Contains the list of all URLs present in the core set.
(2) BnBF.txt: Contains the list of inlinks for each URL in C.txt, with each inlink followed by its outlinks.
(3) CnF.txt: Contains the list of all outlinks for each URL in C.txt.
(4) FB.txt: Contains the list of inlinks for each outlink URL in CnF.txt.
(5) CnB.txt (optional): Contains the list of inlinks for each URL in C.txt.
(6) urls.txt: Contains the list of unique URLs present in the final graph. It is created by processing the above files and is provided for reference.
Files (1) - (4) contain sufficient information to build the graph for the keyword.
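Once the edges from files (1)–(4) are assembled into a directed graph, the hub and authority values used in the preprocessing step can be computed with Kleinberg’s HITS iteration. The sketch below assumes the graph has already been parsed into a list of (source, target) URL pairs; the parsing itself depends on the exact file layout and is omitted.

```python
import math
from collections import defaultdict

def hits(edges, iterations=50):
    """Kleinberg's HITS on a directed edge list: a node's authority is the
    sum of the hub scores of pages linking to it, and its hub score is the
    sum of the authority scores of pages it links to, with L2 normalization
    after each update."""
    out_nbrs, in_nbrs, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        out_nbrs[u].append(v)
        in_nbrs[v].append(u)
        nodes.update((u, v))
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        auth = {n: sum(hub[u] for u in in_nbrs[n]) for n in nodes}
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        hub = {n: sum(auth[v] for v in out_nbrs[n]) for n in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth

# Toy graph: two pages both pointing at a third.
hub, auth = hits([("a.com", "c.com"), ("b.com", "c.com")])
```
In our setting these per-node hub and authority values are what determine the capacities of the links in the flow subnetworks.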


Data Set:
Data sets used to construct graphs for the following four keywords can be downloaded from here:
(1) automobile
(2) clothes shopping
(3) motor company
(4) guess


Results:
URLs used for the survey and the corresponding result files, each containing the top 10 results for each of the three algorithms (companion.txt, google.txt and our.txt), are as follows:

                    FactRel
S.No.  URL               Result files
1      www.aveda.com     Results
2      www.biblio.com    Results
3      www.bmw.com       Results
4      www.cars.com      Results
5      www.cartoday.com  Results
6      www.ctda.com      Results
7      www.guess.com     Results
8      www.harrods.com   Results
9      www.mysimon.com   Results
                    SeekRel
S.No.  URL                                                         Result files
1      www.cardust.com                                             Results
2      www.driversdrive.com/mags.htm                               Results
3      www.ersys.com/usa/06/0669084/mall.htm                       Results
4      www.thefabricofourlives.com/DenimRules/DenimShoppingGuide   Results
5      www.truck-supply.com                                        Results
6      www.tucsonisgreat.com/TucsonShopping.html                   Results
7      www.volition.com/automanu.html                              Results

                    SurfRel
S.No.  URL                                                         Result file
1      www.driversdrive.com/mags.htm                               Results
2      www.thefabricofourlives.com/DenimRules/DenimShoppingGuide   Results
3      www.guess.com                                               Results
4      www.mysimon.com                                             Results
5      www.skinnerdamulis.com                                      Results