Predicting protein secondary structure
Introduction to secondary structure prediction
Predicting the three-dimensional shape of proteins from their amino acid
sequence is widely believed to be one of the hardest unsolved problems
in molecular biology. It is also of considerable interest to pharmaceutical
companies since a protein's shape generally determines its function as
an enzyme.
The learning problem described here is taken from [Muggleton S., King R.D.,
and Sternberg M.J.E. (1992)]. The task is to learn rules that identify
whether a position in a protein is in an alpha-helix. Points of relevance
are:
- Each amino acid is denoted by a lower case character. There are 20 such
  amino acids.
- Positive examples state which positions of chosen proteins are in an
  alpha-helix. Negative examples state the positions that are not in an
  alpha-helix.
- The following background knowledge is provided:
  - position(A,B,C). The residue of protein A at position B is C.
  - octf(A,B,C,D,E,F,G,H,I). Arithmetic information that allows indexing
    groups of nine adjacent positions in a protein: positions A to I
    occur in sequence.
  - alpha_triplet(A,B,C). Arithmetic information that allows indexing
    groups of three adjacent positions in a protein.
  - alpha_pair(A,B). Arithmetic information that allows indexing a
    pair of adjacent positions in a protein.
  - alpha_pair4(A,B). Arithmetic information that allows indexing
    a pair of positions separated by 4 positions in a protein.
  - Physical and chemical properties of individual residues are described
    by unary predicates. These properties include hydrophobicity,
    hydrophilicity, charge, size, polarity, whether a residue is aliphatic
    or aromatic, whether it is a hydrogen donor or acceptor, etc.
  - Sizes, hydrophobicities, polarities, etc. are represented by constants
    such as polar0 and polar1.
  - Relations between the constants, such as less_than(polar0,polar1), are
    also provided as background knowledge.
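To make the shape of this background knowledge concrete, here is a minimal
Python sketch that generates facts of the kind listed above from a single
sequence. It is not how the Golem files were produced. The protein name p1,
the example sequence, numbering positions from 1, the offsets passed to
index_facts, and emitting less_than for every ordered pair are all
assumptions made for illustration. The offsets in the actual data files may
differ from a literal reading of the descriptions above.

```python
from itertools import combinations


def position_facts(protein: str, sequence: str) -> list[str]:
    """One position(A,B,C) fact per residue; positions start at 1 (assumed)."""
    return [
        f"position({protein},{index},{residue})."
        for index, residue in enumerate(sequence, start=1)
    ]


def index_facts(name: str, offsets: tuple[int, ...], length: int) -> list[str]:
    """Facts name(p+o1,p+o2,...) for every start p whose indices fit in 1..length."""
    facts = []
    for start in range(1, length - max(offsets) + 1):
        args = ",".join(str(start + offset) for offset in offsets)
        facts.append(f"{name}({args}).")
    return facts


def less_than_facts(ordered_constants: list[str]) -> list[str]:
    """less_than(X,Y) for every pair where X precedes Y in the given order.

    Listing all pairs (rather than only neighbours) is an assumption.
    """
    return [
        f"less_than({low},{high})."
        for low, high in combinations(ordered_constants, 2)
    ]


if __name__ == "__main__":
    sequence = "mkvlaagiwq"  # hypothetical 10-residue sequence
    print(*position_facts("p1", sequence)[:3], sep="\n")
    # Nine consecutive positions, read literally from the octf description.
    print(*index_facts("octf", tuple(range(9)), len(sequence)), sep="\n")
    # Offsets (0, 1) are a literal reading of "adjacent" (assumed).
    print(*index_facts("alpha_pair", (0, 1), 4), sep="\n")
    print(*less_than_facts(["polar0", "polar1"]), sep="\n")
```

Passing the offsets as a parameter keeps the sketch usable if the real
index predicates turn out to use different spacings.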
The Golem dataset
The data files we provide are as used in the original Golem experiments,
and are downloadable as one
compressed TAR file. Within this file, background knowledge files have
a ".b" suffix, positive example files have a ".f" suffix, and negative
example files have a ".n" suffix.
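The suffix convention above can be used to sort the archive's contents by
role. The following Python sketch uses the standard tarfile module. The
role names and the helper functions are hypothetical, and the archive path
is left to the caller. Opening with mode "r:*" lets tarfile detect gzip,
bzip2 or xz compression; an archive packed with the older Unix compress
tool (a .Z file) would need to be uncompressed first.

```python
import tarfile
from collections import defaultdict

# Suffix-to-role mapping taken from the description above; role names are ours.
ROLES = {".b": "background", ".f": "positive", ".n": "negative"}


def group_members(archive: tarfile.TarFile) -> dict[str, list[str]]:
    """Group the names of regular files in the archive by suffix role."""
    groups = defaultdict(list)
    for member in archive.getmembers():
        if not member.isfile():
            continue  # skip directories, links, and other special entries
        for suffix, role in ROLES.items():
            if member.name.endswith(suffix):
                groups[role].append(member.name)
    return dict(groups)


def group_archive(path: str) -> dict[str, list[str]]:
    """Open the archive at `path` (compression auto-detected) and group its files."""
    with tarfile.open(path, mode="r:*") as archive:
        return group_members(archive)
```

Files with any other suffix are ignored, so extra material in the archive
does not affect the grouping.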
Bibliography
Muggleton S., King R.D., and Sternberg M.J.E. (1992).
Predicting protein secondary structure using inductive logic programming.
Protein Engineering, 5:647-657.