Investigation into the role of sequence-driven-features and amino acid indices for the prediction of structural classes of proteins
Abstract
The work undertaken in this thesis is towards the development of a representative set of sequence-driven features for the prediction of structural classes of proteins. Proteins are biological molecules that make living things function. To determine the function of a protein, its structure must be known, because the structure dictates its physical capabilities. A protein is generally classified into one of four main structural classes, namely all-α, all-β, α + β or α / β, which are based on the arrangement and gross content of its secondary structure elements. Current methods assign structural classes by manual inspection, which is a slow process. To address this problem, this thesis is concerned with the automated prediction of structural classes of proteins and the extraction of a small but robust set of sequence-driven features using amino acid indices. The first main study undertook a comprehensive analysis of the largest collection of sequence-driven features, comprising an existing set of 1479 descriptor values grouped into ten feature groups. The results show that composition-based feature groups are the most representative of the four main structural classes, achieving a predictive accuracy of 63.87%. This finding led to the second main study, the development of the generalised amino acid composition (GAAC) method, in which amino acid index values are used to weight the corresponding amino acids. The GAAC method achieves a higher accuracy of 68.02%. The third study refined the amino acid indices database, which resulted in the highest accuracy of 75.52%.

The main contributions of this thesis are as follows. First, the development of four computationally extracted sequence-driven feature sets based on the underused amino acid indices; two of these methods, GAAC and the hybrid method, improve on traditional sequence-driven features in terms of both smaller, refined feature sizes and classification accuracy. Second, the development of six novel non-redundant sets of the amino acid indices dataset, each of which is more representative than the original database. Third, the construction of two large datasets at 25% and 40% homology, consisting of over 5000 and 7000 protein samples, respectively. Finally, a public webserver has been developed at http://www.generalised-protein-sequence-features.com, which allows biologists and bioinformaticians to extract GAAC sequence-driven features from any input protein sequence.
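
The idea of weighting amino acid composition by an amino acid index can be illustrated with a minimal Python sketch. The abstract does not give the exact GAAC formula, so this is only one plausible reading: each residue's relative frequency is multiplied by its index value. The function names `amino_acid_composition` and `weighted_composition` are hypothetical, and the Kyte-Doolittle hydropathy scale is used here purely as an example index; it is not necessarily one of the indices selected in the thesis.

```python
from collections import Counter

# The 20 standard amino acids in a fixed order, so every feature vector
# has the same layout regardless of the input sequence.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Example amino acid index: Kyte-Doolittle hydropathy values.
# Assumption: chosen for illustration only.
KYTE_DOOLITTLE = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def amino_acid_composition(sequence):
    """Return the relative frequency of each standard amino acid."""
    residues = [ch for ch in sequence.upper() if ch in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    total = len(residues)
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def weighted_composition(sequence, index):
    """Weight each residue's frequency by its amino acid index value.

    Hypothetical reading of the weighting step; the thesis may define
    the GAAC features differently.
    """
    composition = amino_acid_composition(sequence)
    return [composition[aa] * index[aa] for aa in AMINO_ACIDS]


if __name__ == "__main__":
    features = weighted_composition("MKTAYIAKQR", KYTE_DOOLITTLE)
    for aa, value in zip(AMINO_ACIDS, features):
        print(f"{aa}: {value:.3f}")
```

Under this reading, each amino acid index yields a 20-dimensional feature vector per sequence, which could then be passed to any standard classifier to predict one of the four structural classes.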