- [1]
- G. Coghill, A. Srinivasan, and R.D. King.
Qualitative System Identification from Imperfect Data.
Journal of Artificial Intelligence Research, 2008.
(To appear).
- [2]
- S. Joshi, G. Ramakrishnan, and
A. Srinivasan.
Feature construction using theory-guided sampling and randomised search.
In Proceedings of the Eighteenth International Conference on Inductive
Logic Programming (ILP 2008), LNAI, Berlin, 2008. Springer.
(To appear).
- [3]
- N. Fonseca, A. Srinivasan, F. Silva, and
R. Camacho.
Parallel ILP for Distributed-Memory Architectures.
Machine Learning Journal, 2008.
(Conditionally accepted).
- [4]
- G. Ramakrishnan, S. Joshi, S. Balakrishnan,
and A. Srinivasan.
Using ILP to Assist Information Extraction from Semi-Structure Text.
In Proceedings of the Seventeenth International Conference on Inductive
Logic Programming, LNAI 4894, pages 211-224, Berlin, 2008.
Springer.
- [5]
- A. Srinivasan and R.D. King.
Incremental Identification of Qualitative Models of Biological Systems.
Journal of Machine Learning Research, 2008.
(To appear).
- [6]
- S. Tyagi, C. Vaz, V. Gupta, R. Bhatia,
S. Maheshwari, A. Srinivasan, and A. Bhattacharya.
CID-miRNA: A web server for prediction of novel miRNA precursors in the human
genome.
Journal of Biochemical and Biophysical Research Communications,
2008.
(To appear).
- [7]
- A. Paes, F. Zelezny, G. Gaverucha, C.D.
Page, and A. Srinivasan.
ILP through Propositionalization and Stochastic k-Term DNF Learning.
In Proceedings of the Sixteenth International Conference on Inductive
Logic Programming, LNAI 4455, pages 379-393, Berlin, 2007.
Springer.
- [8]
- L. Specia, A. Srinivasan, G. Ramakrishnan,
and M. Nunes.
Word Sense Disambiguation with ILP.
In Proceedings of the Sixteenth International Conference on Inductive
Logic Programming, LNAI 4455, pages 409-423, Berlin, 2007. Springer.
Winner of the Best Application Prize.
The identification of the correct sense of a word is necessary for
most tasks in automatic natural language processing like machine translation,
information retrieval, speech and text processing. Automatic Word Sense
Disambiguation (WSD) is difficult and accuracies with state-of-the art
methods are substantially lower than in other areas of text understanding
like part-of-speech tagging. One shortcoming of these methods is that they do
not utilise substantial sources of background knowledge like parsers,
semantic taxonomies, dictionaries and so on that are now available in an
electronic form (the methods largely use shallow syntactic features).
Empirical results from the use of Inductive Logic Programming (ILP) have
repeatedly shown the ability of ILP systems to use diverse sources of
background knowledge. In this paper, we investigate the use of ILP for WSD in
two different ways: (a) as a stand-alone constructor of models for WSD; and
(b) to build interesting features, which can then used by standard
model-builder such as a SVM. Our investigation is in the form of experiments
that examine a monolingual WSD task using the 32 English verbs contained in
the SENSEVAL-3 benchmark data; and a bilingual WSD task using 7 highly
ambiguous verbs that arise in machine translation from English to Portuguese.
Background knowledge available is from eight sources that provide a wide
range of syntactic and semantic information that may be relevant. For both
WSD tasks, experimental results show that ILP-constructed models and models
constructed using ILP-generated features have higher average (mean and
median) accuracies than that obtained using a state-of-the art feature-based
technique equipped with shallow syntactic features. This suggests that the
use of ILP with diverse sources of background knowledge can provide one way
for making substantial progress in the field of automatic
WSD.
- [9]
- A. Srinivasan, C.D. Page, R. Camacho, and
R.D. King.
Quantitative Pharmacophore Models with Inductive Logic Programming.
Machine Learning Journal, 64(1,2,3):65-90, 2006.
Three-dimensional models, or pharmacophores, describing Euclidean
constraints on the location on small molecules of functional groups (like
hydrophobic groups, hydrogen acceptors and donors, etc.), are often used in
drug design to describe the medicinal activity of potential drugs (or
`ligands'). This medicinal activity is produced by interaction of the
functional groups on the ligand with a binding site on a target protein. In
identifying structure-activity relations of this kind there are three
principal issues: (1) It is often difficult to ``align'' the ligands in order
to identify common structural properties that may be responsible for
activity; (2) Ligands in solution can adopt different shapes (or
`conformations') arising from torsional rotations about bonds. The 3-D
molecular substructure is typically sought on one or more low-energy
conformers; and (3) Pharmacophore models must, ideally, predict medicinal
activity on some quantitative scale. It has been shown that the logical
representation adopted by Inductive Logic Programming (ILP) naturally
resolves many of the difficulties associated with the alignment and
multi-conformation issues. However, the predictions of models constructed by
ILP have hitherto only been nominal, predicting medicinal activity to be
presentor absent. In this paper, we investigate the construction of two kinds
of quantitative pharmacophoric models with ILP: (a) Models that predict the
probability that a ligand is ``active''; and (b) Models that predict the
actual medicinal activity of a ligand. Quantitative predictions are obtained
by the utilising the following statistical procedures as background
knowledge: logistic regression and naive Bayes, for probability prediction;
linear and kernel regression, for activity prediction. The multi-conformation
issue and, more generally, the relational representation used by ILP results
in some special difficulties in the use of any statistical procedure. We
present the principal issues and some solutions. Specifically, using data on
the inhibition of the protease Thermolysin, we demonstrate that it is
possible for an ILP program to construct good quantitative structure-activity
models. We also comment on the relationship of this work to other recent
developments in statistical relational learning.
- [10]
- F. Zelezny, A. Srinivasan, and C.D. Page.
Randomised Restarted Search in ILP.
Machine Learning Journal, 64(1,2,3):183-208, 2006.
Recent statistical performance surveys of search algorithms in
difficult combinatorial problems have demonstrated the benefits of
randomising and restarting the search procedure. Specifically, it has been
found that if the search cost distribution of the non-restarted randomised
search exhibits a slower-than-exponential decay (that is, a ``heavy tail''),
restarts can reduce the search cost expectation. We report on an empirical
study of randomised restarted search in ILP. Our experiments conducted on a
high-performance distributed computing platform provide an extensive
statistical performance sample of five search algorithms operating on two
principally different classes of ILP problems, one represented by an
artificially generated graph problem and the other by three traditional
classification benchmarks (mutagenicity, carcinogenicity, finite element mesh
design). The sample allows us to (1) estimate the conditional expected value
of the search cost (measured by the total number of clauses explored) given
the minimum clause score required and a ``cutoff'' value (the number of
clauses examined before the search is restarted), (2) estimate the
conditional expected clause score given the cutoff value and the invested
search cost, and (3) compare the performance of randomised restarted search
strategies to a deterministic non-restarted search. Our findings indicate
striking similarities across the five search algorithms and the three
domains, in terms of the basic trends of both the statistics (1) and (2).
Also, we observe that the cutoff value is critical for the performance of the
search algorithm, and using its optimal value in a randomised restarted
search may yield a mean search cost lower by several orders of magnitude
lower than that obtained with a deterministic non-restarted
search.
- [11]
- H. Blockeel, C.D. Page, and
A. Srinivasan.
Multi-Instance Tree Learning.
In Proceedings of the Twenty Second International Conference on Machine
Learning (ICML 2005), 2005.
We introduce a novel algorithm for decision tree learning in the
multi-instance setting adn originally defined by Dietterich et al. It
differs from existing multi-instance tree learners in a few crucial,
well-motivated details. Experiments on synthetic and real-life datasets
confirm the beneficial effect of these differences and show that the
resulting system outperforms the existing multi-instance decision tree
learners.
- [12]
- G.M. Coghill, S.M. Garrett,
A. Srinivasan, and R.D. King.
Qualitative System Identification from Imperfect Data.
Technical Report TR0501, Dept. of Computer Science, University of Aberdeen,
Aberdeen, 2005.
Experience in the physical sciences suggests that the only
realistic means of understanding complex systems is through the use of
mathematical models. Typically, this has come to mean the identification of
quantitative models expressed as differential equations. Quantitative
modelling works best when the structure of the model i.e., the form of the
equations is known (theoretically) and the primary concern is one of
estimating the values of the constant parameters in the model. For complex
biological systems, the model-structure is rarely known and the modeler has
to deal with both model-identification and parameter-estimation. In this
paper, we are concerned with providing automated assistance to the first of
these problems. Specifically, we examine the identification by machine, of
the structural relationship between experimentally observed variables. The
relationship will be expressed in the form qualitative abstractions of a
quantitative model. Such qualitative models can not only provide clues to the
precise quantitative model, but also assist in understanding the essence of
that model. Our position in this paper is that background knowledge
incorporating system modelling principles can be used to constrain
effectively the set of good qualitative models. Utilising the
model-identification framework provided by Inductive Logic Programming (ILP)
we present empirical support for this position using a series of increasingly
complex artificial datasets. The results are obtained with qualitative and
quantitative data subject to varying amounts of noise and different degrees
of sparsity. The results also point to the presence of a set of qualitative
states, which we term kernel subsets, that may be necessary for a
qualitative model-learner to learn correct models. We also demonstrate
scalability to biological system modelling by identification of the core of
the glycolysis metabolic pathway from data.
- [13]
- A. Srinivasan and R. Kothari.
A study of applying dimensionality reduction to restrict the size of a
hypothesis space.
In Proceedings of the Fifteenth International Conference on Inductive
Logic Programming (ILP2005), LNAI 3625, pages 348-365, Berlin, 2005.
Springer.
Given sample data and background knowledge encoded in the form of
logic programs, a predictive Inductive Logic Programming (ILP) system
attempts to find a set of rules (or clauses) for predicting classification
labels in the data. Most present-day systems for this purpose rely on some
variant of a generate-and-test procedure that repeatedly examines a set of
potential candidates (termed here as the ``hypothesis space''). On each
iteration a search procedure is employed to find the best clause. The
worst-case time-complexity of such systems depends critically on: (1) the
size of the hypothesis spaces examined; and (2) the cost of estimating the
goodness of a clause. To date, attempts to improve the efficiency of such ILP
systems have concentrated either on examining fewer clauses within a given
hypothesis space, or on efficient means of estimating the goodness of
clauses. The principal means of restricting the size of the hypothesis space
itself has been through the use of language and search constraints. Given
such constraints, this paper is concerned with investigating the use of a
dimensionality reduction method to reduce further the size of the hypothesis
space. Specifically, for a particular kind of ILP system, clauses in the
search space are represented as points in a high-dimension space. Using a
sample of points from this geometric space, feature selection is used to
discard dimensions of little or no (statistical) relevance. The resulting
lower dimensional space translates directly, in the worst-case, to a smaller
hypothesis space. We evaluate this approach on one controlled domain (graphs)
and two real-life datasets concerning problems from biochemistry (mutagenesis
and carcinogenesis). In each case, we obtain unbiased estimates of the size
of the hypothesis space before and after feature selection; and compare the
the results obtained using a complete search of the two
spaces.
- [14]
- R. Camacho, R.D. King, and A. Srinivasan,
editors.
Inductive Logic Programming.
LNAI 3194. Springer, Berlin, 2004.
- [15]
- F. Zelezny, A. Srinivasan, and C.D. Page.
A Monte-Carlo Study of Randomised Restarted Search in ILP.
In Proceedings of the Fourteenth International Conference on Inductive
Logic Programming (ILP2004), LNAI 3194, pages 341-358, Berlin, 2004.
Springer.
Recent statistical performance surveys of search algorithms in
difficult combinatorial problems have demonstrated the benefits of
randomising and restarting the search procedure. Specifically, it has been
found that if the search cost distribution (SCD) of the non-restarted
randomised search exhibits a slower-than-exponential decay (that is, a
``heavy tail''), restarts can reduce the search cost expectation. Recently,
this heavy tail phenomenon was observed in the SCD's of benchmark ILP
problems. % [?]. Following on this work, we report on an empirical study
of randomised restarted search in ILP. Our experiments, conducted over a
cluster of a few hundred computers, provide an extensive statistical
performance sample of five search algorithms operating on two principally
different ILP problems (artificially generated graph data and the well-known
``mutagenesis'' problem). The sample allows us to (1) estimate the
conditional expected value of the search cost (measured by the total number
of clauses explored) given the minimum clause score required and a ``cutoff''
value (the number of clauses examined before the search is restarted); and
(2) compare the performance of randomised restarted search strategies to a
deterministic non-restarted search. Our findings indicate that the cutoff
value is significantly more important than the choice of (a) the specific
refinement strategy; (b) the starting element of the search; and (c) the
specific data domain. We find that the optimal value for the cutoff parameter
remains roughly stable across variations of these three factors and that the
mean search cost using this value in a randomised restarted search is up to
three orders of magnitude (i.e. 1000 times) lower than that obtained with a
deterministic non-restarted search.
- [16]
- A. Srinivasan, R.D. King, and M.E.
Bain.
An empirical study of the use of relevance
information in Inductive Logic Programming.
Machine Learning Research, 4(Jul):369-383, 2003.
Inductive Logic Programming (ILP) systems construct models for data
using domain-specific background information. When using these systems, it is
typically assumed that sufficient human expertise is at hand to rule out
irrelevant background information. Such irrelevant information can, and
typically does, hinder an ILP system's search for good models. Here, we
provide evidence that if additional expertise is available that can provide a
partial-ordering on sets of background predicates in terms of relevance to
the analysis task, then this can be used to good effect by an ILP system. In
particular, using data from biochemical domains, we investigate an
incremental strategy of including sets of predicates in decreasing order of
relevance. Results obtained suggest that: (a) the incremental approach
identifies, in less time, a model that is comparable in predictive accuracy
to that obtained with all background information in place; and (b) the
incremental approach using the relevance ordering performs better than one
that does not (that is, one that adds sets of predicates randomly). For a
practitioner concerned with use of ILP, the implication of these findings are
two-fold: (1) when not all background information can be used at once (either
due to limitations of the ILP system, or the nature of the domain) expert
assessment of the relevance of background predicates can assist substantially
in the construction of good models; and (2) good ``first-cut'' results can be
obtained quickly by a simple exclusion of information known to be less
relevant.
- [17]
- H. Toivonen, A. Srinivasan, R.D. King,
S. Kramer, and C. Helma.
Statistical Evaluation of The Predictive Toxicology Challenge
2000-2001.
Bioinformatics, 19:1183-1193, 2003.
The development of in silico models to predict chemical
carcinogenesis from molecular structure would help greatly to prevent
environmentally caused cancers. The Predictive Toxicology Challenge (PTC)
competition was organized to test the state-of-the-art in applying machine
learning to form such predictive models. Fourteen machine learning groups
generated 111 models. The use of Receiver Operating Characteristic (ROC)
space allowed the models to be uniformly compared regardless of the error
cost function. We developed a statistical method to test if a model performs
significantly better than random in ROC space. Using this test as criteria
five models performed better than random guessing at a signi ficance level p
of 0.05. Statistically the best predictor was the Viniti model for female
mice, with p value below 0.002. The toxicologically most interesting models
were Leuven2 for male mice, and Kwansei for female rats. These models
performed well in the statistical analysis and they are in the middle of ROC
space, i.e., distant from extreme cost assumptions. These predictive models
were also independently judged by domain experts to be among the three most
interesting, and are believed to include a small but significant amount of
empirically learned toxicological knowledge.
- [18]
- V. S. Costa, A. Srinivasan, R. C. Camacho,
H. Blockeel, B.Demoen, G. Janssens, J. Struyf, H. Vandecasteele, and W. Van
Laer.
Query Transformations for Improving the Efficiency of ILP Systems.
Journal of Machine Learning Research, 4(Aug):465-491, 2003.
Relatively simple transformations can speed up the execution of
queries for data mining considerably. While some ILP systems use such
transformations, relatively little is known about them or how they relate to
each other. This paper describes a number of such transformations. Not all of
them are novel, but there have been no studies comparing their efficacy. The
main contributions of the paper are: (a) it clarifies the relationship
between the transformations; (b) it contains an empirical study of what can
be gained by applying the transformations; and (c) it provides some guidance
on the kinds of problems that are likely to benefit from the
transformations.
- [19]
- C.D. Page and A. Srinivasan.
ILP: A Short Look Back and a Longer Look Forward.
Journal of Machine Learning Research, 4(Aug):415-430, 2003.
Inductive logic programming (ILP) is built on a foundation laid by
research in machine learning and computational logic. Armed with this strong
foundation, ILP has been applied to important and interesting problems in the
life sciences, engineering and the arts. In turn, the applications have
brought into focus the need for more research into specific topics. We
enumerate five of these: (1) novel search methods; (2) incorporation of
explicit probabilities; (3) incorporation of special-purpose reasoners; (4)
parallel execution using commodity components; and (5) enhanced human
interaction. It is our hypothesis that progress in each of these areas can
greatly improve the contributions that can be made with ILP; and that, with
assistance from research workers in other areas, significant progress in each
of these areas is possible.
- [20]
- A. Srinivasan.
The
Applicability to ILP of Results Concerning the Ordering of Binomial
Populations.
In Proceedings of the Twelfth International Conference on Inductive Logic
Programming (ILP2002), LNAI, Berlin, 2002. Springer.
Results obtained in the 1950's on the ordering of k binomial
populations are directly applicable to computing the size of data sample
required to identify the best amongst k theories in an ILP search. The
basic problem considered is that of selecting the best one of k (k geq
2) binomial populations on the basis of the observed frequency of `success'
on n observations. In the statistical formulation of the problem the
experimenter specifies that he or she would like to guarantee that the
probability of correct selection is at least P* (0 leq P* < 1)
whenever the true difference between the best and second best population is
at least d* (0 < d* leq 1). The principal result of interest is
that for large values of k, the value of n required to meet any fixed
specification (d*,P*) is approximately equal to some constant
multiple of ln k. Depending on the k, d* and P*
specified, this value of n can be substantially lower than that obtained
within the corresponding PAC formulation. Applied to the search conducted by
an ILP algorithm, the result guarantees with confidence level P*, that
the true accuracy of the theory selected will be no more than d* below
that of the best theory. In this paper, we review this basic result and test
its predictions empirically. We also comment on the relevance to ILP of more
recent work in the field.
- [21]
- F. Zelezny, A. Srinivasan, and C.D. Page.
Lattice-Search Runtime Distributions May Be Heavy-Tailed.
In Proceedings of the Twelfth International Conference on Inductive Logic
Programming (ILP2002), LNAI, Berlin, 2002. Springer.
Recent empirical studies show that runtime distributions of
backtrack procedures for solving hard combinatorial problems often have
intruiging properties. Unlike standard distributions (such as the normal),
such distributions decay slower than exponentially and have ``heavy tails''.
Procedures characterised by heavy-tailed runtime distributions exhibit large
variability in efficiency, but a very straightforward method called rapid randomized restarts has been designed to essentially improve their
average performance. We show on two experimental domains that heavy-tailed
phenomena can be observed in ILP, namely in the search for a clause in the
subsumption lattice. We also reformulate the technique of randomized rapid
restarts to make it applicable to ILP and show that it can reduce the average
search-time.
- [22]
- C. Helma, R.D. King, S. Kramer, and
A. Srinivasan.
The Predictive Toxicology Challenge 2000-2001.
Bioinformatics, 17:107-108, 2001.
(Application Note).
We initiated the Predictive Toxicology Challenge (PTC) to stimulate
the development of advanced SAR techniques for predictive toxicology models.
The goal of this challenge is to predict the rodent carcinogenicity of new
compounds based on the experimental results of the US National Toxicology
Program (NTP). Submissions will be evaluated on quantitative and qualitative
scales to select the most predictive models and those with the highest
toxicological relevance.
- [23]
- R. D. King, A. Srinivasan, and L. DeHaspe.
WARMR: A Data Mining Tool for
Chemical Data.
Computer Aided Molecular Design, 15:173-181, 2001.
Data mining techniques are becoming increasingly important in
chemistry as databases become too large to examine manually. Data mining
methods from the field of Inductive Logic Pr ogramming (ILP) have potential
advantages for structural chemical data. In this paper we present Warmr, the
first ILP data mining algorithm to be applied to chemoinformatic data. We
illustrate the value of Warmr by applying it to a well studied database of
chemical compounds tested for carcinogenicity in rodents. Data mining was
used to find all frequent substructures in the database , and knowledge of
these frequent substructures is shown to add value to the database. One use
of the frequent substructures was to convert them into probabilistic
prediction rules relating compound description to carcinogenesis. These rules
were found to be accurate on test data and to give some insight into the
relationship between structure and activity in carcinogenesis. In addition,
the substructures were used to prove that there existed no accurate
probabilistic rule, based purely on atom-bond substructure with less than
seven conditions, that could predict carcinogenicity. This result put a lower
bound on the complexity of the relationship between chemical structure and
carcinogenicity. Only by using a data mining algorithm, and by doing a
complete search, is it possible to prove such a result. The frequent
substructures were also found to add value by increasing the accuracy of
statistical and machine learning programs that were trained to predict
chemical carcinogenicity. We conclude that Warmr, and ILP data mining methods
generally, are an important new tool for analysing chemical
databases.
- [24]
- S.H. Muggleton, C.H. Bryant,
A. Srinivasan, A. Whittaker, S. Topp, and C. Rawlings.
Are
grammatical representations useful for learning from biological sequence
data?.
Computational Biology, 8(5):493-522, 2001.
This paper investigates whether Chomsky-like grammar
representations are useful for learning cost-effective, comprehensible
predictors or members of biological sequence families. The Inductive Logic
Programming (ILP) Bayesian approach to learning from positive examples is
used to generate a grammar for recognising a class of proteins known as human
neuropeptide precursors (NPPs). Collectively, five of the co-authors of this
paper have extensive expertise on NPPs and general bioinformatics methods.
Their motivation in generating a NPP grammar was that non of the existing
bioinformatic methods could provide sufficient cost-savings during the search
for new NPPs. Prior to this project experienced specialists at SmithKline
Beechan had tried for many months to hand-code such a grammar without
success. Our best predictor makes the search for novel NPPs more than 100
times more efficient than randomly selecting proteins for synthesis and
testing them for biological activity. As far as these authors are aware, this
is both the first biological grammar learnt using ILP and the first
real-world scientific application of the ILP Bayesian approach to learning
from positive examples. A group of features is derived from this grammar.
Other groups of features of NPPs are derived using other learning strategies.
Amalgams of these groups are formed. A recognition model is generated for
each amalgam using C4.5 and C4.5rules and its performance measured using both
predictive accuracy and a new cost function, Relative Advantage (RA). The
highest RA was achieved by a model which includes grammar-derived features.
This RA is significantly higher than the best RA achieved without the use of
the grammar-derived features. Predictive accuracy is not a good measure of
performance for this domain because it does not discriminate well between NPP
recognition models: despite covering varying numbers of (the rare) positives,
all the models are awarded a similar (high) score by predictive accuracy
because they all exclude most of the abundant negatives.
- [25]
- A. Srinivasan.
Extracting context-sensitive models in Inductive Logic Programming.
Machine Learning, 44:301-324, 2001.
Given domain-specific background knowledge and data in the form of
examples, an Inductive Logic Programming (ILP) system extracts models in the
data-analytic sense. We view the model-selection step facing an ILP system as
a decision problem, the solution of which requires knowledge of the context
in which the model is to be deployed. In this paper, ``context'' will be
defined by the current specification of the prior class distribution and the
client's preferences concerning errors of classification. Within this
restricted setting, we consider the use of an ILP system in situations where:
(a) contexts can change regularly. This can arise for example, from changes
to class distributions or misclassification costs; and (b) the data are from
observational studies. That is, they may not have been collected with any
particular context in mind. Some repercussions of these are: (a) any one
model may not be the optimal choice for all contexts; and (b) not all the
background information provided may be relevant for all contexts. Using
results from the analysis of Receiver Operating Characteristic curves, we
investigate a technique that can equip an ILP system to reject those models
that cannot possibly be optimal in any context. We present empirical results
from using the technique to analyse two datasets concerned with the toxicity
of chemicals (in particular, their mutagenic and carcinogenic properties).
Clients can and typically do, approach such datasets with quite different
requirements. For example, a synthetic chemist would require models with a
low rate of commission errors which could be used to direct efficiently the
synthesis of new compounds. A toxicologist on the other hand, would prefer
models with a low rate of omission errors. This would enable a more complete
identification of toxic chemicals at a calculated cost of misidentification
of non-toxic cases as toxic. The approach adopted here attempts to obtain a
solution that contains models that are optimal for each such user according
to the cost function that he or she wishes to apply. In doing so, it also
provides one solution to the problem of how the relevance of background
predicates is to be assessed in ILP.
- [26]
- A. Srinivasan.
Four Suggestions and a Rule Concerning the Application of ILP.
In Nada Lavrac and Saso Dzeroski, editors, Relational Data Mining,
pages 365-374. Springer-Verlag, Berlin, 2001.
Since the late 1980s there has been a sustained research effort
directed at investigating the application of Inductive Logic Programming
(ILP) to problems in biology and chemistry. This essay is a personal view of
some interesting issues that have arisen during my involvement in this
enterprise. Many of the concerns of the broader field of Knowledge Discovery
from Databases manifest themselves during the application of ILP to analyse
bio-chemical data. Addressing them in this microcosm has given me some
directions on the wider application of ILP, and I present these here in the
form of four suggestions and one rule. Readers are invited to consider them
in the context of a hypothetical Recommended Codes and Practices for the
application of ILP.
- [27]
- V. S. Costa, A. Srinivasan, and R. C.
Camacho.
A
note on two simple transformations for improving the efficiency of an ILP
system.
In J. Cussens and A. Frisch, editors, Proceedings of the Tenth
International Conference on Inductive Logic Programming (ILP2000),
LNAI, pages 225-242, Berlin, 2000. Springer.
Inductive Logic Programming (ILP) systems have had noteworthy
successes in extracting comprehensible and accurate models for data drawn
from a number of scientific and engineering domains. These results suggest
that ILP methods could enhance the model-construction capabilities of
software tools being developed for the emerging discipline of ``knowledge
discovery from databases.'' One significant concern in the use of ILP for
this purpose is that of efficiency. The performance of modern ILP systems is
principally affected by two issues: (1) they often have to search through
very large numbers of possible rules (usually in the form of definite
clauses); (2) they have to score each rule on the data (usually in the form
of ground facts) to estimate ``goodness''. Stochastic and greedy approaches
have been proposed to alleviate the complexity arising from each of these
issues. While these techniques can result in order-of-magnitude improvements
in the worst-case search complexity of an ILP system, they do so at the
expense of exactness. As this may be unacceptable in some situations, we
examine two methods that result in admissible transformations of clauses
examined in a search. While the methods do not alter the size of the search
space (that is, the number of clauses examined), they can alleviate the
theorem-proving effort required to estimate goodness. The first
transformation simply involves eliminating literals using a weak test for
redundancy. The second involves partitioning the set of literals within a
clause into groups that can be executed independently of each other. The
efficacy of these transformations are evaluated empirically on a number of
well-known ILP datasets. The results suggest that for problems that require
the use of highly non-determinate predicates, the transformations can provide
significant gains as the complexity of clauses sought
increases.
- [28]
- S. H. Muggleton, C. H. Bryant, and
A. Srinivasan.
Learning chomsky-like grammars for biological sequence families.
In Proceedings of the Seventeenth International Conference on Machine
Learning, San Francisco, CA, 2000. Morgan Kaufmann.
This paper presents a new method of measuring performance when
positives are rare and investigates whether Chomsky-like grammar
representations are useful for learning accurate comprehensible predictors of
members of biological sequence families. The positive-only learning framework
of the Inductive Logic Programming (ILP) system CProgol is used to generate a
grammar for recognising a class of proteins known as human neuropeptide
precursors (NPPs). As far as these authors are aware, this is both the first
biological grammar learnt using ILP and the first real-world scientific
application of the positive-only learning framework of CProgol. Performance
is measured using both predictive accuracy and a new cost function, Relative
Advantage (RA). The RA results show that searching for NPPs by using our best
NPP predictor as a filter is more than 100 times more efficient than randomly
selecting proteins for synthesis and testing them for biological activity.
The highest RA was achieved by a model which includes grammar-derived
features. This RA is significantly higher than the best RA achieved without
the use of the grammar-derived features.
- [29]
- S. H. Muggleton, C. H. Bryant, and
A. Srinivasan.
Measuring performance when positives are rare: Relative Advantage versus
predictive accuracy - a biological case-study.
In Proceedings of the Eleventh European Conference on Machine
Learning, LNCS, Berlin, 2000. Springer.
This paper presents a new method of measuring performance when
positives are rare and investigates whether Chomsky-like grammar
representations are useful for learning accurate comprehensible predictors of
members of biological sequence families. The positive-only learning framework
of the Inductive Logic Programming (ILP) system CProgol is used to generate a
grammar for recognising a class of proteins known as human neuropeptide
precursors (NPPs). Performance is measured using both predictive accuracy and
a new cost function, Relative Advantage (RA). The RA results show that
searching for NPPs by using our best NPP predictor as a filter is more than
100 times more efficient than randomly selecting proteins for synthesis and
testing them for biological activity. Predictive accuracy is not a good
measure of performance for this domain because it does not discriminate well
between NPP recognition models: despite covering varying numbers of (the
rare) positives, all the models are awarded a similar (high) score by
predictive accuracy because they all exclude most of the abundant
negatives.
- [30]
- L. Todorovski, S. Dzeroski,
A. Srinivasan, J. Whiteley, and D. Gavaghan.
Discovering the Structure of Partial Differential Equations from Example
Behavior.
In Proceedings of the Seventeenth International Conference on Machine
Learning, San Francisco, CA, 2000. Morgan Kaufmann.
One of the most powerful and widely accepted analytical formalisms
for modeling biological and physical systems is that of the partial
differential equation (PDE). Establishing an acceptable PDE model for a
dynamic system occupies a major portion of the work of the mathematical
modeler. There are two main aspects to this activity. First, an appropriate
structure has to be determined for the equations involved (the model
identification problem). Second, acceptably accurate values for parameters
are to be determined (the parameter estimation problem). Of these, the first
is more challenging, and is the focus of this paper. We propose a method for
discovering the structure of PDE models from example behavior. For simple PDE
models, we illustrate that a straight-forward adaptation of existing equation
discovery methods is sufficient. However, complex PDE models require a more
sophisticated approach: a two-stage method is described in the paper. The
efficacy of the approach is demonstrated initially by rediscovering the PDE
models for several artificial problems. We also use it to obtain the
structure of the classic FitzHugh-Nagumo model. This represents a very wide
class of biological systems, making the model discovery method of interest to
scientists concerned with the enterprise of obtaining a mathematical
understanding of dynamic processes occurring in the life
sciences.
- [31]
- A. Srinivasan.
A
study of two probabilistic methods for searching large spaces with ILP.
Technical Report PRG-TR-16-00, Oxford University Computing Laboratory, Oxford,
2000.
Given sample data and background knowledge encoded in the form of
logic programs, a predictive Inductive Logic Programming (ILP) system
attempts to find a set of rules (or clauses) for predicting classification
labels in the data. Most present-day systems for this purpose rely on some
variant of a generate-and-test procedure that repeatedly examines a set of
potential candidates (termed here as the ``search space'') and selects one or
more clauses according to some criterion of ``goodness''. The worst-case
time-complexity of such systems depends critically on: (1) the size of the
search space; and (2) the cost of estimating the goodness of a clause. This
paper is concerned with addressing the first issue and is motivated by two
principal factors. First, the representation adopted by an ILP system often
engenders a search space whose size dominates complexity calculations.
Straightforward arguments show that examining fewer clauses should lead to
faster execution times. Second, evidence from feature-based methods suggests
that increased search also increases the probability of ``flukes'' -- rules
whose goodness values are high simply due to chance effects. We examine two
simple probabilistic schemes that restrict the search space for a predictive
ILP system by sacrificing optimality. The first, ``stochastic clause
selection'' is derived from a technique termed ordinal optimisation and
randomly selects a fixed-size sample of clauses from the search space which,
with high probability, contains a good clause. This directly addresses the
concern of searching intractably large spaces. The second, ``iterative beam
search'' is a direct adaptation of a technique developed for feature-based
methods to counter the increased chance of flukes. An ILP system equipped
with each of these methods is used to construct hypotheses for three
real-life datasets concerning problems from biochemistry (mutagenesis and
carcinogenesis) and engineering (finite-element mesh design). In each case,
we obtain estimates of the predictive accuracies of the hypotheses and the
time for their construction. These are then compared against estimates
obtained with two extreme search options, namely greedy search and restricted
branch-and-bound. For the problems considered, the results here show that
there is no significant gain in accuracy to be accrued with admissible
techniques like branch-and-bound, and that time savings comparable to that
from greedy search are possible with the use of a random search scheme like
stochastic clause selection.
- [32]
- A. Srinivasan.
The Aleph Manual.
Available at http://www.comlab.ox.ac.uk/oucl/ research/areas/machlearn/Aleph/,
1999.
- [33]
- A. Srinivasan, R.D. King, and
S. Muggleton.
The role of background knowledge: using a problem from chemistry to examine
the performance of an ILP program.
Technical Report PRG-TR-08-99, Oxford University Computing Laboratory, Oxford,
1999.
Inductive Logic Programming (ILP) systems construct explanations
for data in terms of domain-specific background information. How does the
quality of this information affect the performance of an ILP system? Results
from experiments concerned with learning simple programs for list processing
suggest that performance is sensitive to the type and amount of background
knowledge provided. In particular, background knowledge that contains large
amounts of information that is known to be irrelevant to the problem being
considered can, and typically does, prevent an ILP system in its search for
an correct explanation. In this paper, we examine a more realistic scenario
where the relevance of background predicates is graded from being more to
less relevant to the problem considered, and no ``correct explanation'' is
known. Theoretical or empirical answers to the performance of an ILP system
under such conditions address more directly the concerns of a practitioner.
Using a problem from chemistry, this paper reports on an empirical study of
the change in the performance of an ILP program as increasingly generic
chemical definitions are added to background knowledge. The problem examined
is the task of obtaining rules describing the mutagenic activity of small
molecules. The experiments describe the use of the ILP program Progol, and
the effect of adding atom and bond structure, bulk molecular properties, 2
and 3-dimensional descriptions of the molecules as background knowledge. The
results show that adding highly relevant background predicates causes
significant improvements in predictive accuracy, simplicity and time for
theory construction. These effects are less pronounced, although still
noticeable, along the scales of accuracy and simplicity as the background
knowledge includes increasingly more generic -- and hence not as directly
relevant -- chemical knowledge. For a practitioner concerned with use of ILP,
the results suggest that good ``first-cut'' results can be obtained quickly
by a simple exclusion of information known to be less
relevant.
- [34]
- A. Srinivasan.
A
study of two sampling methods for analysing large datasets with ILP.
Data Mining and Knowledge Discovery, 3(1):95-123, 1999.
This paper is concerned with problems that arise when submitting
large quantities of data to analysis by an Inductive Logic Programming (ILP)
system. Complexity arguments usually make it prohibitive to analyse such
datasets in their entirety. We examine two schemes that allow an ILP system
to construct theories by sampling from this large pool of data. The first,
``subsampling'', is a single-sample design in which the utility of a
potential rule is evaluated on a randomly selected sub-sample of the data.
The second, ``logical windowing'', is multiple-sample design that tests and
sequentially includes errors made by a partially correct theory. Both schemes
are derived from techniques developed to enable propositional learning
methods (like decision trees) to cope with large datasets. The ILP system
CProgol, equipped with each of these methods, is used to construct theories
for two datasets -- one artificial (a chess endgame) and the other naturally
occurring (a language tagging problem). In each case, we ask the following
questions of CProgol equipped with sampling: (1) Is its theory comparable in
predictive accuracy to that obtained if all the data were used (that is, no
sampling was employed)?; and (2) Is its theory constructed in less time than
the one obtained with all the data? For the problems considered, the answers
to these questions is ``yes''. This suggests that an ILP program equipped
with an appropriate sampling method could begin to address problems
satisfactorily that have hitherto been inaccessible simply due to data
extent.
- [35]
- A. Srinivasan.
Note
on the location of optimal classifiers in n-dimensional ROC space.
Technical Report PRG-TR-2-99, Oxford University Computing Laboratory, Oxford,
1999.
The comparative study of classifier performance is a worthwhile
concern in Machine Learning. Empirical comparisons typically examine unbiased
estimates of predictive accuracy of different algorithms -- the assumption
being that the classifier with the highest accuracy would be the ``optimal''
choice of classifier for the problem. The qualification on optimality is
needed here, as choice is restricted to the classifiers being compared, and
the estimates are typically subject to sampling errors. Comparisons based on
predictive accuracy overlook two important practical concerns, namely (a)
class distributions cannot be specified precisely. Distribution of classes in
the training set are thus rarely matched exactly on new data; and (b) that
the costs of different types of errors may be unequal. Using techniques
developed in signal detection, Provost and Fawcett describe an elegant method
for the comparative assessment of binary classifiers that takes these
considerations into account. Their principal result is that optimal
classifiers lie on the edge of the convex hull of points representing the
classifier performance in a Lorentz diagram. The authors leave open the
question of whether this result extends to classifiers that discriminate
between more than two classes. Here we show that the result that optimal
classfiers are located on the edge of the convex hull does extend to
arbitrary number of classes. This follows directly from results about convex
sets that form the basis of linear programming algorithms.
- [36]
- A. Srinivasan and R.C. Camacho.
Numerical reasoning with an ILP program capable of lazy evaluation and
customised search.
Journal of Logic Programming, 40(2,3):185-214, 1999.
Using problem-specific background knowledge, computer programs
developed within the framework of Inductive Logic Programming (ILP) have been
used to construct restricted first-order logic solutions to scientific
problems. However, their approach to the analysis of data with substantial
numerical content has been largely limited to constructing clauses that: (a)
provide qualitative descriptions (``high'', ``low'' etc.) of the values of
response variables; and (b) contain simple inequalities restricting the
ranges of predictor variables. This has precluded the application of such
techniques to scientific and engineering problems requiring a more
sophisticated approach. A number of specialised methods have been suggested
to remedy this. In contrast, we have chosen to take advantage of the fact
that the existing theoretical framework for ILP places very few restrictions
of the nature of the background knowledge. We describe two issues of
implementation that make it possible to use background predicates that
implement well-established statistical and numerical analysis procedures. Any
improvements in analytical sophistication that result are evaluated
empirically using artificial and real-life data. Experiments utilising
artificial data are concerned with extracting constraints for response
variables in the text-book problem of balancing a pole on a cart. They
illustrate the use of clausal definitions of arithmetic and trigonometric
functions, inequalities, multiple linear regression, and numerical
derivatives. A non-trivial problem concerning the prediction of mutagenic
activity of nitroaromatic molecules is also examined. In this case, expert
chemists have been unable to devise a model for explaining the data. The
result demonstrates the combined use by an ILP program of logical and
numerical capabilities to achieve an analysis that includes linear modelling,
clustering and classification. In all experiments, the predictions obtained
compare favourably against benchmarks set by more traditional methods of
quantitative methods, namely, regression and neural-network.
- [37]
- A. Srinivasan and R.D. King.
Feature construction with Inductive Logic Programming: a study of quantitative
predictions of biological activity aided by structural attributes.
Data Mining and Knowledge Discovery, 3(1):37-57, 1999.
Recently, computer programs developed within the field of Inductive
Logic Programming (ILP) have received some attention for their ability to
construct restricted first-order logic solutions using problem-specific
background knowledge. Prominent applications of such programs have been
concerned with determining ``structure-activity'' relationships in the areas
of molecular biology and chemistry. Typically the task here is to predict the
``activity'' of a compound (for example, toxicity), from its chemical
structure. A summary of the research in the area is: (a) ILP programs have
largely been restricted to qualitative predictions of activity (``high'',
``low'' etc.); (b) When appropriate attributes are available, ILP programs
have equivalent predictivity to standard quantitative analysis techniques
like linear regression. However ILP programs usually perform better when such
attributes are unavailable; and (c) By using structural information as
background knowledge, an ILP program can provide comprehensible explanations
for biological activity. This paper examines the use of ILP programs as a
method of ``discovering'' new attributes. These attributes could then be used
by methods like linear regression, thus allowing for quantitative predictions
while retaining the ability to use structural information as background
knowledge. Using structure-activity tasks as a test-bed, the utility of ILP
programs in constructing new features was evaluated by examining the
prediction of biological activity using linear regression, with and without
the aid of ILP learnt logical attributes. In three out of the five data sets
examined the addition of ILP attributes produced statistically better
results. In addition six important structural features that have escaped the
attention of the expert chemists were discovered. The method used here to
construct new attributes is not specific to the problem of predicting
biological activity, and the results obtained suggest a wider role for ILP
programs in aiding the process of scientific discovery.
- [38]
- A. Srinivasan and R.D. King.
Using Inductive Logic Programming to construct Structure-Activity
Relationships.
In G.C. Gini and A.R. Katrizsky, editors, Predictive toxicology of
chemicals: experiences and impact of AI tools (Papers from the 1999 AAAI
Spring Symposium), pages 64-73. AAAI Press, Menlo Park, CA, 1999.
The existence and rapid growth of chemical databases have brought
into focus the utility of methods that can assist the discovery of predictive
patterns in data, and communicating them in a manner designed to provoke
insight. This has turned attention to machine learning techniques capable of
extracting ``symbolic'' descriptions from data. At the cutting-edge of such
techniques is Inductive Logic Programming (ILP). Given a set of observations
and background knowledge encoded as a set of logical descriptions, an ILP
system attempts to construct explanations for the observations. The
explanations are in the same language as the observations and background
knowledge -- usually a subset of first-order logic. The use of first-order
logic contrasts with algorithms like decision-trees, and neural networks
which employ simple propositional logic representations. The increased
representation power along with the flexibility to include background
knowledge -- which can even include propositional algorithms -- allow a form
of data analysis and decision-support that is, in principle, unmatched by
first-generation methods. Biochemical applications of ILP have largely been
concerned with determining ``structure-activity'' relationships (SARs). The
task here is to obtain rules that predict the activity of a compound, like
toxicity, from its chemical structure. The representation language adopted by
ILP systems allows the development of compact, chemist-friendly ``theories'',
and ILP systems have progressively been shown to be capable of handling 1, 2,
and 3-dimensional descriptions of chemical structure. Empirical results in
predicting mutagenicity and carcinogenicity suggest that structure-activity
relations found by ILP systems achieve at least the same predictive power of
traditional SAR techniques, with fewer limitations (like the need for
alignment, pre-determination of structural features etc.) In some cases, they
have found novel structural features that significantly improve the
predictive capabilities of traditional 1 and 2-dimensional SAR methods. Here
we summarise the progress achieved so far in the use of ILP in these areas,
including ideas emerging from a recent toxicology prediction challenge which
suggest that a combination of ILP and established prediction methods can
provide a powerful form of relating chemical activity to
structure.
- [39]
- A. Srinivasan, R.D. King, and D.W.
Bristol.
An
assessment of submissions made to the Predictive Toxicology Evaluation
Challenge.
In Proceedings of the Sixteenth International Conference on Artificial
Intelligence (IJCAI-99). Morgan Kaufmann, Los Angeles, CA, 1999.
Constructing ``good'' models for chemical carcinogenesis was
identified in IJCAI-97 as providing a substantial challenge to ``knowledge
discovery'' programs. Attention was drawn to a comparative exercise which
called for predictions on the outcome of 30 rodent carcinogenicity
bioassays. This -- the Predictive Toxicology Evaluation (or PTE) Challenge --
was seen to provide AI programs with an opportunity to participate in an
enterprise of scientific merit, and a yardstick for comparison against strong
competition. Here we provide an assessment of the machine learning (ML)
submissions made. Models submitted are assessed on: (1) their accuracy, in
comparison to models developed with expert collaboration; and (2) their
explanatory value for toxicology. The principal findings were: (a) using
structural information available from a standard modelling package,
layman-devised features, and outcomes of established biological tests,
results from ML-derived models were at least as good as those with
expert-derived techniques. This was surprising; (b) the combined use of
structural and biological features by ML-derived models was unusual, and
suggested new avenues for toxicology modelling. This was also unexpected; and
(c) significant effort was required to interpret the output of even the most
``symbolic'' of ML-derived models. Much of this could have been alleviated
with measures for converting the results into a more ``toxicology-friendly''
form. As it stands, their absence is sufficient to prevent a whole-hearted
acceptance of these promising methods by toxicologists. This suggests that ML
techniques have been able to respond -- not fully, but nevertheless
substantially -- to the PTE Challenge.
- [40]
- A. Srinivasan, R.D. King, and D.W.
Bristol.
An
assessment of ILP-assisted models for toxicity and the PTE-3
experiment.
In S. Dzeroski and P.A. Flach, editors, Proceedings of the Ninth
International Workshop on Inductive Logic Programming (ILP99), volume
1634 of LNAI, pages 291-302, Berlin, 1999. Springer.
The Predictive Toxicology Evaluation (or PTE) Challenge provided
Machine Learning techniques with the opportunity to compete against
specialised techniques for toxicology prediction. Toxicity models that used
findings from ILP programs have performed creditably in the PTE-2 experiment
proposed under this challenge. We report here on an assessment of such models
along scales of: (1) quantitative performance, in comparison to models
developed with expert collaboration; and (2) potential explanatory value for
toxicology. Results appear to suggest the following: (a) across of range of
class distributions and error costs, some explicit models constructed with
ILP-assistance appear closer to optimal than most expert-assisted ones. Given
the paucity of test-data, this is to be interpreted cautiously; (b) a
combined use of propositional and ILP techniques appears to yield models that
contain unusual combinations of structural and biological features; and (c)
significant effort was required to interpret the output, strongly indicating
the need to invest greater effort in transforming the output into a
``toxicologist-friendly'' form. Based on the lessons learnt from these
results, we propose a new predictive toxicology evaluation experiment --
PTE-3 -- which will address some important shortcomings of the previous
study.
- [41]
- P. Finn, S. Muggleton, D. Page, and
A. Srinivasan.
Pharmacophore Discovery using the Inductive Logic Programming system
Progol.
Machine Learning, 30:241-270, 1998.
This paper is a case study of a machine aided knowledge discovery
process within the general area of drug design. More specifically, the paper
described a sequence of experiments in which an Inductive Logic Programming
(ILP) system is used for pharmacophore discovery. Within drug design, a
pharmacophore is a description of the substructure of a ligand (a small
molecule) which is responsible for medicinal activity. This medicinal
activity is produced by interaction between the ligand and the binding site
for the target protein. ILP was chosen by the domain expert (first author) at
Pfizer since active molecules are most naturally described, in relational
terms, as requiring a substructure with various 3-D relations which hold
among the atoms involved. The results described in this paper build on
previous investigations into prediction of mutagenecity using ILP with a 2-D
(bond connectivity only) representation of molecules. The case study supports
general lessons for knowledge discovery, as well as more specific lessons for
pharmacophore discovery, the use of ILP for 3-D problems, and for the
particular medicinal activity of ACE inhibition, a treatment for
hypertension.
- [42]
- S.H. Muggleton, A. Srinivasan, R.D.
King, and M.J.E. Sternberg.
Biochemical knowledge discovery using Inductive Logic Programming.
In H. Motoda, editor, First Conference on Discovery Science.
Springer-Verlag, 1998.
Machine Learning algorithms are being increasingly used for
knowledge discovery tasks. Approaches can be broadly divided by
distinguishing discovery of procedural from that of declarative knowledge.
Client requirements determine which of these is appropriate. This paper
discusses an experimental application of machine learning in an area related
to drug design. The bottleneck here is in finding appropriate constraints to
reduce the large number of candidate molecules to be synthesised and tested.
Such constraints can be viewed as declarative specifications of the
structural elements necessary for high medicinal activity and low toxicity.
The first-order representation used within Inductive Logic Programming (ILP)
provides an appropriate description language for such constraints. Within
this application area knowledge accreditation requires not only a
demonstration of predictive accuracy but also, and crucially, a certification
of novel insight into the structural chemistry. This paper describes an
experiment in which the ILP system Progol was used to obtain structural
constraints associated with mutagenicity of molecules. In doing so Progol
found a new indicator of mutagenicity within a subset of previously published
data. This subset was already known not to be amenable to statistical
regression, though its complement was adequately explained by a linear model.
According to the combined accuracy/explanation criterion provided in this
paper, on both subsets comparative trials show that Progol's
structurally-oriented hypotheses are preferable to those of other machine
learning algorithms.
- [43]
- R.R. Saith, A. Srinivasan, D. Michie, and
I.L. Sargent.
The relationship between embryo, oocyte and follicular features and the
developmental potential of human ivf embryos.
Human Reproduction Update, 4(2):121-134, 1998.
- [44]
- R. King and A. Srinivasan.
The discovery of indicator variables for QSAR using inductive logic
programming.
Journal of Computer-Aided Molecular Design, 11:571-580, 1997.
- [45]
- A. Srinivasan, R.D. King, S.H. Muggleton,
and M.J.E. Sternberg.
The Predictive Toxicology Evaluation Challenge.
In Proceedings of the Fifteenth International Conference on Artificial
Intelligence (IJCAI-97). Morgan Kaufmann, Los Angeles, CA, 1997.
Can an AI program contribute to scientific discovery? An area where
this gauntlet has been thrown is that of understanding the mechanisms of
chemical carcinogenesis. One approach is to obtain Structure-Activity
Relationships (SARs) relating molecular structure to cancerous activity.
Vital to this are the rodent carcinogenicity tests conducted within the US
National Toxicology Program by the National Institute of Environmental Health
Sciences (NIEHS). This has resulted in a large database of compounds
classified as carcinogens or otherwise. The Predictive-Toxicology Evaluation
project of the NIEHS provides the opportunity to compare carcinogenicity
predictions on previously untested chemicals. This presents a formidable
challenge for programs concerned with knowledge discovery. Desirable features
of this problem are: (1) involvement in genuine scientific discovery; (2)
availability of a large database with expert-certified classifications; (3)
strong competition from methods used by chemists; and (4) participation in
true blind trials, with results available by next IJCAI. We describe the
materials and methods constituting this challenge, and provide some initial
benchmarks. These show the Inductive Logic Programming tool Progol to be
competitive with current state-of-the-art. The challenge described here is
aimed at encouraging AI programs to avail themselves the opportunity of
contributing to an enterprise with immediate scientific
value.
- [46]
- A. Srinivasan and R.D. King.
Carcinogenesis predictions using ILP.
In N. Lavrac and S. Dzeroski, editors, Proceedings of the Seventh
International Workshop on Inductive Logic Programming (ILP97), volume
1297 of LNAI, pages 273-287, Berlin, 1997. Springer.
A version also in Intelligent Data Analysis in Medicine, Kluwer.
Obtaining accurate structural alerts for the causes of chemical
cancers is a problem of great scientific and humanitarian value. This paper
follows up on earlier research that demonstrated the use of Inductive Logic
Programming (ILP) for predictions for the related problem of mutagenic
activity amongst nitroaromatic molecules. Here we are concerned with
predicting carcinogenic activity in rodent bioassays using data from the U.S.
National Toxicology Program conducted by the National Institute of
Environmental Health Sciences. The 330 chemicals used here are significantly
more diverse than the previous study, and form the basis for obtaining
Structure-Activity Relationships (SARs) relating molecular structure to
cancerous activity in rodents. We describe the use of the ILP system Progol
to obtain SARs from this data. The rules obtained from Progol are comparable
in accuracy to those from expert chemists, and more accurate than most
state-of-the-art toxicity prediction methods. The rules can also be
interpreted to give clues about the biological and chemical mechanisms of
carcinogenesis, and make use of those learnt by Progol for mutagenesis.
Finally, we present details of, and predictions for, an ongoing international
blind trial aimed specifically at comparing prediction methods. This trial
provides ILP algorithms an opportunity to participate at the leading-edge of
scientific discovery.
- [47]
- R.D. King, S.H. Muggleton, A. Srinivasan,
and M.J.E. Sternberg.
Structure-activity relationships derived by machine learning: The use of
atoms and their bond connectivities to predict mutagenicity by inductive
logic programming.
Proc. of the National Academy of Sciences, 93:438-442, 1996.
- [48]
- R.D. King and A. Srinivasan.
Prediction of rodent carcinogenicity bioassays from molecular structure using
inductive logic programming.
Environmental Health Perspectives, 104(5):1031-1040, 1996.
- [49]
- A. Srinivasan, S.H. Muggleton, R.D.
King, and M.J.E. Sternberg.
Theories for mutagenicity: a study of first-order and feature based induction.
Artificial Intelligence, 85:277-299, 1996.
A classic problem from chemistry is used to test a conjecture that
in domains for which data are most naturally represented by graphs, theories
constructed with Inductive Logic Programming (ILP) will significantly
outperform those using simpler feature-based methods. One area that has long
been associated with graph-based or structural representation and reasoning
is organic chemistry. In this field, we consider the problem of predicting
the mutagenic activity of small molecules: a property that is related to
carcinogenicity, and an important consideration in developing less hazardous
drugs. By providing an ILP system with progressively more structural
information concerning the molecules, we compare the predictive power of the
logical theories constructed against benchmarks set by regression, neural,
and tree-based methods.
- [50]
- S. Muggleton, C.D. Page, and
A. Srinivasan.
An
initial experiment into stereochemistry-based drug design using ILP.
In S.H. Muggleton, editor, Proceedings of the Sixth Inductive Logic
Programming Workshop, volume 1314 of LNAI, pages 25-40,
Berlin, 1996. Springer-Verlag.
Extended version appears in ML Journal, 1998.
Previous applications of Inductive Logic Programming to drug design
have not addressed stereochemistry, or the three-dimensional aspects of
molecules. While some success is possible without consideration of
stereochemistry, researchers within the pharmaceutical industry consider
stereochemistry to be central to most drug design problems. This paper
reports on an experimental application of the ILP system P-Progol to
stereochemistry-based drug design. The experiment tests whether P-Progol can
identify the structure responsible for the activity of ACE
(angiotensin-converting enzyme) inhibitors from 28 positive examples, that
is, from 28 molecules that display the activity of ACE inhibition. ACE
inhibitors are a widely-used form of medication for the treatment of
hypertension. It should be stressed that this structure was already known
prior to the experiment and therefore is not a new discovery; the experiment
was proposed by a researcher within the pharmaceutical industry to test the
applicability of ILP to stereochemistry-based drug design. While the result
of the experiment is quite positive, one challenge remains before ILP can be
applied to a multitude of drug design problems.
- [51]
- A. Srinivasan and R.D. King.
Feature construction with inductive logic programming: a study of quantitative
predictions of biological activity aided by structural attributes.
In S.H. Muggleton, editor, Proceedings of the Sixth Inductive Logic
Programming Workshop, volume 1314 of LNAI, pages 89-104,
Berlin, 1996. Springer-Verlag.
Extended version appears in Data Mining and Knowledge Discovery, 1999.
Recently, computer programs developed within the field of Inductive
Logic Programming (ILP) have received some attention for their ability to
construct restricted first-order logic solutions using problem-specific
background knowledge. Prominent applications of such programs have been
concerned with determining ``structure-activity'' relationships in the areas
of molecular biology and chemistry. Typically the task here is to predict the
``activity'' of a compound, like toxicity, from its chemical structure.
Research in the area shows that: (a) ILP programs have been restricted to
qualitative predictions of activity (``high'', ``low'' etc.); (b) When
appropriate attributes are available, ILP programs have not been able to
better the performance of standard quantitative analysis techniques like
linear regression. However ILP programs perform creditably when such
attributes are unavailable; and (c) When both are applicable, ILP programs
are usually slower than their propositional counterparts. This paper examines
the use of ILP programs, not for obtaining theories complete for the sample,
but as a method of ``discovering'' new attributes. These could th en be used
by methods like linear regression, thus allowing for quantitative predictions
and the ability to use structural information as background knowledge. Using
structure-activity tasks as a test-bed the utility of ILP programs in
constructing new features was evaluated by examining the prediction of
chemical activity using linear regression, with and without the aid of ILP
learnt logical attributes. In three out of the five datasets examined the
addition of ILP attributes produced statistically better results (P <
0.01). In addition six important structural features that have escaped the
attention of the expert chemists were discovered. The method used here to
construct new attributes is not specific to the problem of predicting
biological activity, and the results obtained suggest a useful role for ILP
programs in aiding the process of scientific discovery.
- [52]
- M.J.E. Sternberg, R.D. King,
A. Srinivasan, and S.H. Muggleton.
Drug Design by Machine Learning.
In D. Michie S. Muggleton and K. Furukawa, editors, Machine Intelligence
15. Oxford University Press, Oxford, 1996.
- [53]
- M. Bain and A. Srinivasan.
Inductive logic programming with large-scale unstructured data.
In D. Michie S. Muggleton and K. Furukawa, editors, Machine Intelligence
14, pages 233-267. Oxford University Press, Oxford, 1995.
We report some recent developments from an ongoing project in which
a chess endgame domain is providing benchmark experimental tests for the
study of concept learning. The King and Rook against King (KRK) endgame is
simple enough in chess terms but provides concept learning tasks which can be
demanding, as evidenced in previous studies by a number of authors. For
learning systems these tasks have highlighted problems of representation,
such as the ability to express the structural relationships to be found in
learning examples, and other issues like correctness, compression and
comprehensibility. Our current focus is on Inductive Logic Programming
methods which are based on previously developed systems for the
generalisation and specialisation of normal logic programs. In the current
work we are principally concerned with improving these methods to be able to
handle more examples during learning. The main contribution of this work
consists of a new method of incremental learning, together with new results
on a set of concept learning problems taken from the KRK endgame
domain.
- [54]
- R.D. King, A.Srinivasan, and M.J.E.
Sternberg.
Relating chemical activity to structure: an examination of ILP successes.
New Gen. Comput., 13(3,4):411-433, 1995.
An important test-bed for Inductive Logic Programming (ILP) systems
has been the task of relating the activity of chemical compounds to their
structure. In this paper we examine the structure-activity problems that have
been addressed by ILP, and evaluate empirically the extent to which a
first-order representation was required. This is done by comparing ILP
theories against those constructed by standard linear regression and a
decision-tree learner. When propositional encodings are feasible for the
feature-based algorithms, we present evidence that they are capable of
matching the predictive accuracies of an ILP theory. However, as the
diversity of compounds considered increases, propositional encodings become
intractable. In such cases, our results provide support for the claim that
ILP programs will continue to construct accurate, understandable theories.
Based on this evidence, we propose future work to realise fully the potential
of ILP in structure-activity problems.
- [55]
- A. Srinivasan, S.H. Muggleton, and R.D.
King.
Comparing the use of background knowledge by inductive logic programming
systems.
In L. De Raedt, editor, Proceedings of the Fifth International Inductive
Logic Programming Workshop. Katholieke Universteit, Leuven, 1995.
Withdrawn from publication. New version submitted to IEEE Trans on Knowledge
and Data Engineering, 1999.
- [56]
- R.R. Saith, A. Srinivasan, I.L. Sargent,
and D.H. Barlow.
Embryo selection: can Artificial Intelligence help?
In Tenth Annual Meeting of the European Society of Human Reproduction and
Embryology (ESHRE), 1994.
Extended version appears in Human Reproduction Update, 1998.
- [57]
- A. Srinivasan, S.H. Muggleton, and
M.E. Bain.
The
justification of logical theories based on data compression.
In D. Michie S. Muggleton and K. Furukawa, editors, Machine Intelligence
13, pages 87-121. Oxford University Press, Oxford, 1994.
Non-demonstrative or inductive reasoning is a crucial component in
the skills of a learner. A leading candidate for this form of reasoning
involves the automatic formation of hypotheses. Initial successes in the
construction of propositional theories have now been followed by algorithms
that attempt to generalise sentences in the predicate calculus. An important
defect in these new-generation systems is the lack of a clear model for
theory justification. In this paper we describe a method of evaluating the
significance of a hypothesis based on the degree to which it allows
compression of the observed data with respect to prior knowledge. This can be
measured by comparing the lengths of the input and output tapes of a
reference Turing machine which will generate the examples from the hypothesis
and a set of derivational proofs. The model extends an earlier approach of
Muggleton by allowing for noise. The truth values of noisy instances are
switched by making use of correction codes. The utility of compression as a
significance measure is evaluated empirically in three independent domains.
In particular, the results show that the existence of compression
distinguishes a larger number of significant clauses than other significance
tests. The method also appears to distinguish noise as incompressible
data.
- [58]
- A. Srinivasan, S.H. Muggleton, R.D.
King, and M.J.E. Sternberg.
Mutagenesis: ILP experiments in a non-determinate biological domain.
In S. Wrobel, editor, Proceedings of the Fourth International Inductive
Logic Programming Workshop, pages 217-232. Gesellschaft fur
Mathematik und Datenverarbeitung MBH, 1994.
GMD-Studien Nr 237.
This paper describes the use of Inductive Logic Programming as a
scientific assistant. In particular, it details the application of the ILP
system Progol to discovering structural features that can result in
mutagenicity in small molecules. To discover these concepts, Progol
only had access to the atomic and bond structure of the molecules. With such
a primitive description and no further assistance from chemists, Progol corroborated some existing knowledge and proposed a new structural
alert for mutagenicity in compounds. In the process, the experiments act as a
case study in which, even with extremely limited background knowledge, an
Inductive Logic Programming tool firstly, complements a complex statistical
model developed by skilled chemists, and secondly, continues to provide
understandable theories when the statistical model fails. The experiments
also constitute the first demonstrations of a prototype of the Progol
system. Progol allows the construction of hypotheses with bounded
non-determinacy by performing a best-first search within the subsumption
lattice. The results here provide evidence that such searches are both viable
and desirable.
- [59]
- M. Sternberg, J. Hirst, R. Lewis,
R. King, A. Srinivasan, and S. Muggleton.
Application of machine learning to protein structure prediction and drug
design.
In S. Schulze-Kremer, editor, Advances in Molecular
Bioinformatics, pages 1-8. IOS Press, 1994.
- [60]
- James Cussens, Anthony Hunter, and
Ashwin Srinivasan.
Generating explicit orderings for non-monotonic logics.
In Proc. of the Eleventh National Conference on Artificial Intelligence
(AAAI-93), pages 420-425. MIT Press, 1993.
For non-monotonic reasoning, explicit orderings over formulae offer
an important solution to problems such as `multiple extensions'. However, a
criticism of such a solution is that it is not clear, in general, from where
such orderings should be obtained. Here we show how orderings can be derived
from statistical information about the domain which the formulae cover. For
this we provide an overview of prioritized logics -- a general class of
logics that incorporate explicit orderings over formulae. This class of
logics has been shown elsewhere to capture a wide variety of proof-theoretic
approaches to non-monotonic reasoning, and in particular, to highlight the
roles of preferences -- both explicit and implicit -- in such proof theory.
We take one particular prioritised logic, called SF logic, and describe an
experimental approach for comparing this logic with an important example of a
logic that does not use explicit orderings of preference -- namely Horn
clause logic with negation-as-failure. Finally, we present the results of
this comparison, showing how SF logic is more sceptical and more accurate
than negation-as-failure.
- [61]
- G. Edwards, P. Compton, R. Malor,
A. Srinivasan, and L. Lazarus.
PEIRS: a pathologist maintained expert system for the interpretation of
chemical pathology reports.
Pathology, 25:27-34, 1993.
- [62]
- A. Srinivasan and J.A. Richards.
Analysis of gis spatial data using knowledge based methods.
International Journal of Remote Sensing, 7(6):479-500, 1993.
A symbolic approach to analysis of mixed data types found in the
spatial database of a Geographic Information System is developed. Unlike
statistical techniques, that are essentially based on the manipulation of
numeric quantities, the method here attempts to emulate photo-interpretation
by manipulating symbolic representations of photo-interpretive knowledge.
This representation consists of a mixture of normal and `defeasible' rules.
The latter enables analysis to proceed in the face of uncertainty by making
(and possibly revising) assumptions about the content of the scene being
analysed. The analysis procedure is decomposed into three stages. The first
deals with the analysis of data from groups of sensors (sources) providing
the same type of information, the second combines the inferences of these
individual analyses, and the third imposes spatial consistency constraints.
This decomposition implies a structuring of the knowledge, an issue that
appears to be essential for a knowledge-based analysis to be efficient.
Examples are presented which demonstrate the system's ability to handle mixed
multi-spectral and radar imagery, and images at different spatial
resolutions.
- [63]
- P. Compton, G. Edwards, B. Kang, L. Lazarus,
R. Malor, P. Preston, and A. Srinivasan.
Ripple down rules: turning knowledge acquisition into knowledge maintenance.
Artificial Intelligence in Medicine, 4:463-475, 1992.
- [64]
- R.D. King, S. Muggleton, A.Srinivasan,
C. Feng, R.A. Lewis, and M.J.E. Sternberg.
Drug design using inductive logic programming.
In Proceedings of the 26th Hawaii International Conference on System
Sciences, pages 646-655, Los Alamitos, 1993. IEEE Computer Society
Press.
- [65]
- S.H. Muggleton, A. Srinivasan, and M.E.
Bain.
Compression, significance and accuracy.
In D. Sleeman and P. Edwards, editors, Ninth International Machine
Learning Conference, pages 338-347, San Mateo, CA, 1992. Morgan
Kaufmann.
Inductive Logic Programming (ILP) involves learning relational
concepts from examples and background knowledge. To date all ILP learning
systems make use of tests inherited from propositional and decision tree
learning for evaluating the significance of hypotheses. None of these
significance tests take account of the relevance or utility of the background
knowledge. In this paper we describe a method, called HP-compression, of
evaluating the significance of a hypothesis based on the degree to which it
allows compression of the observed data with respect to the background
knowledge. This can be measured by comparing the lengths of the input and
output tapes of a reference Turing machine which will generate the examples
from the hypothesis and a set of derivational proofs. The model extends an
earlier approach of Muggleton by allowing for noise. The truth values of
noisy instances are switched by making use of correction codes. The utility
of compression as a significance measure is evaluated empirically in three
independent domains. In particular, the results show that the existence of
positive compression distinguishes a larger number of significant clauses
than other significance tests The method is also shown to reliably
distinguish artificially introduced noise as incompressible
data.
- [66]
- A. Srinivasan, S.H. Muggleton, and M.E.
Bain.
Distinguishing noise from exceptions in non-monotonic learning.
In S. Muggleton and K. Furukawa, editors, Second International Inductive
Logic Programming Workshop (ILP92). Institute for New Generation
Computer Technology, 1992.
It is important for a learning program to have a reliable method of
deciding whether to treat errors as noise or to include them as exceptions
within a growing first-order theory. We explore the use of an
information-theoretic measure to decide this problem within the non-monotonic
learning framework defined by Closed-World-Specialisation. The approach
adopted uses a model that consists of a reference Turing machine which
accepts an encoding of a theory and proofs on its input tape and generates
the observed data on the output tape. Within this model, the theory is said
to ``compress'' data if the length of the input tape is shorter than that of
the output tape. Data found to be incompressible are deemed to be ``noise''.
We use this feature to implement a compression-guided specialisation
procedure that searches for the best-fitting theory for the data (that is,
the one with the shortest input tape length). The approach is empirically
evaluated on the standard Inductive Logic Programming problem of learning
classification rules for the KRK chess end-game.
- [67]
- A. Srinivasan, S.H. Muggleton, and
M.E. Bain.
Compression-guided correction of first-order theories.
In ECAI92 Workshop on Logical Approaches to Learning. 1992.
Shorter version of ILP92 paper.
- [68]
- P. Compton, A.Srinivasan, G. Edwards,
R. Malor, and L. Lazarus.
Knowledge base maintenance without a knowledge engineer.
In J. Liebowitz, editor, First World Congress on Expert Systems,
pages 668-675. Pergamon, London, 1991.
- [69]
- P. Compton, A.Srinivasan, G.Edwards,
R.Malor, and L.Lazarus.
Ripple down rules.
In IJCAI Workshop on Representing Knowledge in Medical Decision Support
Systems, pages 7.1-7.8, 1991.
- [70]
- A. Srinivasan and J.A. Richards.
Rule-based techniques for multi-source classification.
International Journal of Remote Sensing, 11:502-525, 1990.
- [71]
- A. Srinivasan and J.A. Richards.
Knowledge based multi-source analysis.
In Fifth Australasian Remote Sensing Conference, 1990.
- [72]
- A. Srinivasan, C.A. Sammut, and
J.A. Richards.
A knowledge-based system for multi-source classification in remote sensing.
In Fifth Australian Conference on the Application of Expert
Systems, pages 102-119, 1989.