Relating Web pages to enable
information-gathering tasks
Amitabha Bagchi and Garima Lahoti
Abstract:
In this paper we argue that
relationships between Web pages are functions of the user’s intent. We
identify a class of Web tasks - information-gathering - that can be
facilitated by a search engine that provides links to pages which are
related to the page the user is currently viewing. We define three kinds
of intentional relationships that correspond to whether the user is a)
seeking sources of information, b) reading pages which provide
information, or c) surfing through pages as part of an extended
information-gathering process. We show that these three relationships
can be productively mined using a combination of textual and link
information, and provide three scoring mechanisms that correspond to
them: SeekRel, FactRel and SurfRel. We build a set of capacitated
subnetworks - each corresponding to a particular keyword - that mirror
the interconnection structure of the World Wide Web. The scores are
computed as flows on these subnetworks. The
capacities of the links are derived from the hub and authority values
of the nodes they connect, following the work of Kleinberg (1998) on
assigning authority to pages in hyperlinked environments. We evaluated
our scoring mechanism by running experiments on four data sets taken
from the Web. We present user evaluations of the relevance of the top
results returned by our scoring mechanisms and compare those to the top
results returned by Google’s Similar Pages feature, and the Companion
algorithm proposed by Dean and Henzinger (1999).
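Since the link capacities above are derived from hub and authority values in the sense of Kleinberg (1998), the underlying computation can be illustrated with a minimal Python sketch of the HITS power iteration. The toy graph, iteration count, and normalization below are illustrative assumptions, not the paper's exact implementation.

```python
import math

def hits(edges, iters=50):
    """Minimal HITS power iteration (Kleinberg, 1998) on a directed
    graph given as a list of (u, v) edges.
    Returns (hub, authority) score dicts."""
    nodes = {n for e in edges for n in e}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # Authority update: sum of hub scores of in-neighbours.
        auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            auth[v] += hub[u]
        # Hub update: sum of authority scores of out-neighbours.
        hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            hub[u] += auth[v]
        # Normalize so the scores stay bounded across iterations.
        for d in (hub, auth):
            norm = math.sqrt(sum(x * x for x in d.values())) or 1.0
            for n in d:
                d[n] /= norm
    return hub, auth

# Toy example: a -> c and b -> c make c an authority, a and b hubs.
hub, auth = hits([("a", "c"), ("b", "c")])
```

On this toy graph, c ends up with the highest authority score and a, b with equal hub scores, which matches the intuition that capacities toward c should be large.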
Experimental Setup:
We performed our experiments on four
data sets taken from the Web. Creating these data sets was a
multi-stage process that began by querying AltaVista with a search
string and taking the top 100 results to form a core set. We then used
the open source Web crawler Nutch to retrieve the
pages linked from the core set. Then we found the top 1000 pages that
link to these new pages using AltaVista's advanced feature, which
provides the inlinks of a queried page. Finally, we found the inlinks of
the pages in the core using AltaVista, then went back to Nutch to find
the outlinks of these pages. Following Dean and Henzinger, we took only
10 outlinks in the manner they specified: if we were looking at the
outlinks of a page u which pointed to a core page v, we took only the
links on u which were “around” the link to v, i.e., the 5 links
immediately preceding the link to v on the page and the 5 links
immediately following it. Having obtained this data set, we preprocessed
it by computing the hub and authority values of all the pages in it.
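The “around” rule above can be sketched in a few lines. The page representation (an ordered list of outlink URLs) and the function name are hypothetical, but the 5-before/5-after window follows Dean and Henzinger's description as used here.

```python
def links_around(outlinks, v, window=5):
    """Given the ordered outlinks of a page u that points to a core
    page v, keep only the `window` links immediately preceding the
    link to v and the `window` links immediately following it
    (Dean & Henzinger's rule; window=5 gives at most 10 links)."""
    i = outlinks.index(v)  # position of the link to v on the page
    before = outlinks[max(0, i - window):i]
    after = outlinks[i + 1:i + 1 + window]
    return before + after

# Toy example: 12 outlinks on the page, the link to v ("u6") in the middle.
urls = [f"u{k}" for k in range(12)]
kept = links_around(urls, "u6")  # the 5 links before u6 and the 5 after
```

Links near the start or end of a page simply yield a smaller window, so at most 10 outlinks are retained per core link.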
Our four data sets were generated using the keyword strings
“automobile” (54952 pages), “motor company” (14973 pages), “clothes
shopping” (37724 pages) and “guess” (12101 pages). For repeatability,
these data sets have been made available here. We conducted
extensive experiments on these data sets by taking one page out of them
as a query, then scoring all three relationships for this page with all
the other pages in the data set. We compared our top 10 scoring pages
for FactRel and SeekRel with the top 10 pages returned by Google’s
Similar Pages feature. We also implemented the Companion algorithm and
compared our results to the top 10 results returned by it. For SurfRel
we simply took our top 10 results and evaluated them. The evaluation in
all these cases was done by conducting user surveys.
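The flow computations behind these scores can also be illustrated concretely. The sketch below is a plain Edmonds-Karp max-flow on a small capacitated graph, with capacities derived from hub and authority values via an assumed rule (the tail's hub value times the head's authority value); the actual SeekRel, FactRel and SurfRel definitions differ in detail.

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp max flow; cap is a dict {(u, v): capacity}."""
    # Residual capacities, including reverse edges for flow cancellation.
    res = dict(cap)
    for (u, v) in cap:
        res.setdefault((v, u), 0.0)
    adj = {}
    for (u, v) in res:
        adj.setdefault(u, []).append(v)
    total = 0.0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v in adj.get(u, []):
                if v not in parent and res[(u, v)] > 1e-12:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return total
        # Find the bottleneck along the path, then push flow.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[e] for e in path)
        for (u, v) in path:
            res[(u, v)] -= push
            res[(v, u)] += push
        total += push

# Capacities from hub/authority values (assumed rule: hub(u) * auth(v)).
hub = {"q": 0.8, "m": 0.6}
auth = {"m": 0.5, "p": 0.9}
edges = [("q", "m"), ("m", "p")]
cap = {(u, v): hub[u] * auth[v] for (u, v) in edges}
score = max_flow(cap, "q", "p")  # flow-based relatedness of q and p
```

In this toy graph the score is the bottleneck capacity on the single q-to-p path, so pages connected only through weak hubs or authorities receive low scores.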
Data Set Organization:
Data set directory (available below as .tgz file) for each keyword
contains the following files:
(1) C.txt: Contains the list of all URLs present in the core set.
(2) BnBF.txt: Contains the list of inlinks for each URL in C.txt, with
each inlink followed by its outlinks.
(3) CnF.txt: Contains the list of all outlinks for each URL in C.txt.
(4) FB.txt: Contains the list of inlinks for each outlink URL in CnF.txt.
(5) CnB.txt (optional): Contains the list of inlinks for each URL in C.txt.
(6) urls.txt: Contains the list of unique URLs present in the final
graph. It is created after processing the above files and is provided
for reference.
Files (1) - (4) contain sufficient information to build the graph for
the keyword.
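As an illustration of how files (1)-(4) can be combined into a directed graph, here is a hedged sketch. The concrete line layout (a source URL followed by its whitespace-separated neighbour URLs) is an assumption for illustration only; the actual files may be laid out differently.

```python
def parse_adjacency(lines):
    """Parse lines of the assumed form 'url nbr1 nbr2 ...' into a
    mapping {url: [nbr1, nbr2, ...]}. The layout is hypothetical."""
    adj = {}
    for line in lines:
        toks = line.split()
        if toks:
            adj[toks[0]] = toks[1:]
    return adj

def build_edges(cnf_lines, fb_lines):
    """Directed edges from CnF.txt (core page -> its outlinks) and
    FB.txt (inlink -> outlink-of-core), under the assumed layout."""
    edges = set()
    for u, outs in parse_adjacency(cnf_lines).items():
        edges.update((u, v) for v in outs)   # u links to each v
    for v, ins in parse_adjacency(fb_lines).items():
        edges.update((w, v) for w in ins)    # each w links to v
    return edges

# Toy example with the assumed layout: a links to b and c; d links to b.
edges = build_edges(["a b c"], ["b d"])
```

C.txt and BnBF.txt would be merged in the same way, yielding the node and edge sets of the final graph mirrored by urls.txt.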
Data Set:
Data sets used to construct graphs for the following four keywords can
be downloaded from here:
(1) automobile
(2) clothes shopping
(3) motor company
(4) guess
Results:
URLs used for the survey and the corresponding result files containing
the top 10 results for each of the three algorithms (companion.txt,
google.txt and our.txt) are as follows:
FactRel
SeekRel
SurfRel