Alignment-Free Probabilistic Proteomics: Patterns to Functionality

Abstract

Major Histocompatibility Complex class I (MHC I) proteins, known in humans as Human Leukocyte Antigen class I (HLA I), are responsible for antigen presentation to T lymphocytes. MHCs interact with T Cell Receptors (TCRs) and serve as crucial immune regulators in vertebrates. The three main sub-classes of HLA class I proteins (HLA-A, HLA-B, HLA-C) are encoded at three different loci. Because genes within the MHC I class are co-dominant, an individual has up to six different alleles of HLA class I protein present on the surface of their cells. The genetic diversity of HLA class I in the human population can be linked to differences in immunological response. Using a combination of established bioinformatic and machine learning tools, we addressed the challenge of analysing an HLA class I protein data set in order to determine the ability of these proteins to bind to specific antigens. To achieve this, we created three-dimensional models of HLA class I variants using homology modelling techniques. These models were then placed in three-dimensional grids in order to calculate the electrostatic fields around the protein domains. The resulting multi-dimensional data were analysed using unsupervised machine learning techniques: linear Principal Component Analysis (PCA) and two nonlinear methods, the auto-encoder neural network (NLPCA) and the Gaussian Process Latent Variable Model (GPLVM). These methods accomplished the task of distinguishing between the HLA protein sub-classes (A, B and C). In addition, the results obtained with GPLVM dimensionality reduction suggested that the electrostatic potential calculation may add information necessary for identifying HLA super-types. However, this method is not robust enough to be conclusive on its own.

Sequence alignment methods are not free from assumptions. The results they provide are influenced by the choice of substitution matrix, because numerical values are assigned to the differences between the primary structures of the compared biomolecules. The growth in the number of known sequences, driven by the development of Next Generation Sequencing techniques, has created an additional challenge: the computational time required. As an alternative to sequence alignment, we implemented methods from time series analysis, information and chaos theory, and statistical physics to translate information from amino acid sequences into numerical vectors, in order to predict similarity in protein structure and function. We transformed a data set of 9693 amino acid sequences belonging to 100 protein families by replacing each amino acid with numerical values representing its physicochemical and biochemical properties. On that basis, we calculated multiple multidimensional vectors of alignment-free protein descriptors using measures such as approximate and sample entropy, persistence, and Hurst and Lyapunov exponents. Linear Discriminant Analysis, a supervised learning technique used to assess the ability of the developed protocols to correctly assign proteins to their functional groups, showed an efficiency of over 99% in the best case.
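
The encoding step described above can be illustrated with a minimal sketch. It is not the protocol used in the thesis: it assumes a single property scale (the Kyte-Doolittle hydropathy index), the commonly used sample entropy defaults of m = 2 and r = 0.2 times the standard deviation, and hypothetical function names (`encode`, `sample_entropy`). The abstract does not state which property scales or parameters were actually used.

```python
import math

# Kyte-Doolittle hydropathy index; chosen here only as an illustrative
# property scale (an assumption, not necessarily the scale used in the thesis).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence, scale=KYTE_DOOLITTLE):
    """Replace each amino acid letter with its numerical property value.

    Raises KeyError for letters outside the 20 standard amino acids.
    """
    return [scale[residue] for residue in sequence.upper()]


def sample_entropy(series, m=2, r=None):
    """Sample entropy of a numerical series (no alignment involved).

    m is the template length; r is the matching tolerance, defaulting to
    0.2 times the population standard deviation of the series.
    """
    n = len(series)
    if n <= m + 1:
        raise ValueError("series is too short for the chosen template length")
    if r is None:
        mean = sum(series) / n
        r = 0.2 * math.sqrt(sum((x - mean) ** 2 for x in series) / n)

    def count_matches(length):
        # Count template pairs (i < j) whose Chebyshev distance is within r.
        # The same n - m starting positions are used for both lengths.
        count = 0
        for i in range(n - m):
            for j in range(i + 1, n - m):
                distance = max(
                    abs(series[i + k] - series[j + k]) for k in range(length)
                )
                if distance <= r:
                    count += 1
        return count

    matches_m = count_matches(m)
    matches_m_plus_1 = count_matches(m + 1)
    if matches_m == 0 or matches_m_plus_1 == 0:
        return float("inf")  # undefined when no template matches are found
    return -math.log(matches_m_plus_1 / matches_m)


if __name__ == "__main__":
    # Hypothetical short peptide, used only to show the call pattern.
    signal = encode("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
    print(sample_entropy(signal))
```

A perfectly regular series yields a sample entropy of zero, and less predictable series yield larger values. In an alignment-free protocol, one such value per property scale and per measure would be collected into the descriptor vector that is passed to the classifier.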

Divisions: College of Engineering & Physical Sciences
Additional Information: Copyright © Ewa Magdalena Grela, 2022. Ewa Magdalena Grela asserts her moral right to be identified as the author of this thesis. This copy of the thesis has been supplied on condition that anyone who consults it is understood to recognise that its copyright rests with its author and that no quotation from the thesis and no information derived from it may be published without appropriate permission or acknowledgement. If you have discovered material in Aston Publications Explorer which is unlawful e.g. breaches copyright, (either yours or that of a third party) or any other law, including but not limited to those relating to patent, trademark, confidentiality, data protection, obscenity, defamation, libel, then please read our Takedown Policy and contact the service immediately.
Institution: Aston University
Last Modified: 28 Jun 2024 08:23
Date Deposited: 27 May 2024 07:01
Completed Date: 2022-12
Authors: Grela, Ewa
