Analysis of class c g-protein coupled receptors using supervised classification methods
- KÖNIG, CAROLINE LEONORE
- René Alquézar Mancho Director
- Alfredo Vellido Alacena Director
Defence university: Universitat Politècnica de Catalunya (UPC)
Fecha de defensa: 30 October 2018
- Juan Miguel García Gómez Chair
- Ricard Gavaldà Mestre Secretary
- Rodrigo Santamaría Vicente Committee member
Type: Thesis
Abstract
G protein-coupled receptors (GPCRs) are cell membrane proteins with a key role in regulating the function of cells. This is the result of their ability to transmit extracellular signals, which makes them relevant for pharmacology and has led, over the last decade, to active research in the field of proteomics. The current thesis specifically targets class C of GPCRs, which are relevant in therapies for various central nervous system disorders, such as Alzheimer’s disease, anxiety, Parkinson’s disease and schizophrenia. The investigation of protein functionality often relies on the knowledge of crystal three dimensional (3-D) structures, which determine the receptor’s ability for ligand binding responsible for the activation of certain functionalities in the protein. The structural information is therefore paramount, but it is not always known or easily unravelled, which is the case of eukaryotic cell membrane proteins such as GPCRs. In the face of the lack of information about the 3-D structure, research is often bound to the analysis of the primary amino acid sequences of the proteins, which are commonly known and available from curated databases. Much research on sequence analysis has focused on the quantitative analysis of their aligned versions, although, recently, alternative approaches using machine learning techniques for the analysis of alignment-free sequences have been proposed. In this thesis, we focus on the differentiation of class C GPCRs into functional and structural related subgroups based on the alignment-free analysis of their sequences using supervised classification models. In the first part of the thesis, the main topic is the construction of supervised classification models for unaligned protein sequences based on physicochemical transformations and n-gram representations of their amino acid sequences. These models are useful to assess the internal data quality of the externally labeled dataset and to manage the label noise problem from a data curation perspective. In its second part, the thesis focuses on the analysis of the sequences to discover subtype- and region-speci¿c sequence motifs. For that, we carry out a systematic analysis of the topological sequence segments with supervised classification models and evaluate the subtype discrimination capability of each region. In addition, we apply different types of feature selection techniques to the n-gram representation of the amino acid sequence segments to find subtype and region specific motifs. Finally, we compare the findings of this motif search with the partially known 3D crystallographic structures of class C GPCRs.