Protein Sequences and Protein Structures

Combining Disparate Data Types

27.08.2015, 13:58

Combining disparate types of data opens many opportunities. The example below investigates protein variability and evolution by combining protein structures with sequences.

Large-scale genome sequencing projects, together with the advent of individual-organism and metagenome sequencing, are accumulating enormous numbers of protein sequences. In some cases there are tens of thousands of sequences related to a single protein. Together with the 100,000+ structures in the Protein Data Bank (PDB), these are remarkable data for comprehending the important relationships between structure and sequence. Understanding sequence conservation is important for understanding protein evolution and, ultimately, phenomics. These big data sets offer unprecedented data mining opportunities for developing a deeper understanding of protein evolution.

We pioneered such approaches with protein structures in 1985 by extracting potentials for interacting amino acids, at a time when only 42 protein structures were available; these were sufficient for extracting the counts of the 190 types of amino acid pairs [1]. The large new data sets await clever new applications by data miners. One of our new projects uses these data to identify tight clusters of closely interacting amino acids and to characterize their sequence and geometric variabilities. Amino acid substitutions in proteins can be understood significantly better by considering the closely interacting groups of amino acids within structures, which have been combined naturally for favorable collective multibody interactions and tight packing. Two amino acids that are distant in sequence may fold into close contact in the native structure. Because they are close, if one of them is replaced by a smaller amino acid, one of its neighbors may be replaced by a larger one to maintain protein stability. In densely packed proteins, these correlated relationships involve more than simple pairs. In our project, information derived from multiple sequence alignments (MSAs) will be used to expand the number of physical clusters taken from structures by substituting amino acids according to the sequence alignments (see Figure 1).
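A tight cluster of the kind described above can be sketched as the set of residues lying within a distance cutoff of a chosen central residue. The following is a minimal sketch, not the authors' procedure: the function name `contact_cluster`, the one-point-per-residue representation, the toy coordinates, and the 6.5 Å cutoff are all assumptions made for illustration.

```python
import math

def contact_cluster(coords, center, cutoff=6.5):
    """Return indices of residues within `cutoff` of the central residue.

    coords: list of (x, y, z) tuples, one representative point per residue
            (for example a side-chain centroid; this choice is an assumption).
    center: index of the central residue.
    cutoff: distance threshold in the same units as coords (hypothetical value).
    """
    cx = coords[center]
    return [
        i for i, point in enumerate(coords)
        if i != center and math.dist(cx, point) <= cutoff
    ]

# Toy coordinates, invented for illustration only.
toy_coords = [
    (0.0, 0.0, 0.0),   # residue 0: chosen as the cluster center
    (3.8, 0.0, 0.0),   # residue 1: sequence neighbor, in contact
    (7.6, 0.0, 0.0),   # residue 2: beyond the cutoff
    (0.0, 5.0, 0.0),   # residue 3: in contact
    (0.0, 0.0, 20.0),  # residue 4: far away
]

cluster = contact_cluster(toy_coords, center=0)
print(cluster)
```

In a real structure, residues 0 and 3 could be far apart in sequence yet in the same cluster, which is exactly the situation in which compensating substitutions are expected.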

By applying the sequence alignment to generate a larger number of possible clusters, we directly include evolutionary information. In a protein, amino acids co-evolve with other amino acids in ways that compensate for the changes that are introduced. Characterizing these compensating changes across a large set of proteins will permit a better understanding of the interactions. The co-evolving residues can be identified in multiple sequence alignments of a given protein from different biological sources. These groups of correlated mutations can give insights into the structure and function of a protein. Even the intricacies of allostery, and how proteins move and respond to other molecules, could be meaningfully investigated with these large sets of data.
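One simple way to flag candidate co-evolving positions in an alignment is the mutual information between two alignment columns. The sketch below is illustrative only: mutual information is chosen here as a common covariation measure and is an assumption, not necessarily the measure used in the project, and the function name and the toy alignment are hypothetical.

```python
import math
from collections import Counter

def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    alignment: list of equal-length aligned sequences (strings).
    High values mean that the residue at column i is informative about
    the residue at column j, as expected for correlated substitutions.
    """
    n = len(alignment)
    pair_counts = Counter((seq[i], seq[j]) for seq in alignment)
    counts_i = Counter(seq[i] for seq in alignment)
    counts_j = Counter(seq[j] for seq in alignment)

    mi = 0.0
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

# Toy alignment, invented for illustration: columns 0 and 1 change together
# (small A pairs with L, G pairs with the larger W), column 2 is conserved.
toy_msa = ["ALK", "ALK", "GWK", "GWK"]

print(mutual_information(toy_msa, 0, 1))  # covarying pair of columns
print(mutual_information(toy_msa, 0, 2))  # conserved column carries no signal
```

A fully conserved column gives zero mutual information with every other column, so this measure highlights covariation rather than conservation. Raw mutual information is also inflated by small samples and shared ancestry, which the toy example ignores.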

Figure 1: Including additional clusters based on sequence alignments. A physical cluster of 8 close residues (residues 1-4 and 10-13, identified in blue in the table headings) is shown within the dashed circle at the top, with the central residue in red. Sequence alignments show variations at the highlighted positions. We will use these additional sequences from reliable multiple sequence alignments to include additional clusters with these specific sequence changes. The changed sequences in the additional clusters will significantly enhance the present studies by providing a large increase in the number of clusters of interacting residues that are available for this project, far beyond the number taken directly from experimental structures (modified from [4]).
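The expansion step illustrated in Figure 1 can be sketched as reading off, for every aligned sequence, the residues found at the columns that make up one physical cluster. This is a minimal sketch under stated assumptions: the cluster positions are taken to be already mapped to alignment columns, sequences with a gap at any cluster position are skipped, and the function name and toy alignment are hypothetical.

```python
def expand_cluster(cluster_columns, alignment, gap_chars="-."):
    """Return the distinct residue combinations observed at the cluster columns.

    cluster_columns: 0-based alignment columns of the residues in one cluster.
    alignment: list of equal-length aligned sequences (strings).
    Sequences with a gap at any cluster column are skipped (an assumption).
    """
    variants = []
    seen = set()
    for seq in alignment:
        residues = tuple(seq[col] for col in cluster_columns)
        if any(res in gap_chars for res in residues):
            continue
        if residues not in seen:
            seen.add(residues)
            variants.append(residues)
    return variants

# Toy alignment, invented for illustration only.
toy_alignment = [
    "MKVLAGEW",  # sequence of the experimental structure
    "MKILAGDW",  # two substitutions inside the cluster
    "MRVLA-EW",  # gap at a cluster column, so it is skipped
    "MKVLAGEW",  # duplicate of the first sequence
]
toy_cluster = [1, 2, 5, 6]

for variant in expand_cluster(toy_cluster, toy_alignment):
    print("".join(variant))
```

Each returned tuple is one additional cluster: the geometry comes from the experimental structure, and the residue identities come from a homologous sequence.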

Contact clusters can be selected from a set of PDB structures chosen to have different CATH topologies. The CATH database [2] is a semiautomatic, hierarchical classification of the protein domains in structures from the Protein Data Bank. The four main levels of the classification are Class, Architecture, Topology, and Homologous superfamily. Protein structures that share the same topology level have particular structural features in common. The current version of the CATH database (version 4.0) includes 69,058 annotated PDB entries. There are alternative ways to obtain an MSA: from Pfam, or by using multiple sequence alignment programs such as MUSCLE and Clustal Omega. Pfam [3] is a database of curated protein families, each of which is defined by two alignments and a profile hidden Markov model (HMM). Protein sequences within one family are aligned according to their functional regions, commonly termed domains. The current version of the database is Pfam 27.0, which contains a total of 14,831 families. Table 1 shows a partial list of Pfam families and the number of sequences in each family.

The structural clusters and their sequences will capture the complex evolutionary information from the sequence alignments. The phylogenetic information could even be used as a weighting scheme for the clusters. The tight clusters can be quite specific: they no longer depend just on the types of amino acid pairs involved, but on larger pieces of structure, so they are protein-specific (different clusters for different proteins). Thus, a protein-specific cluster set can be derived separately for each protein. Previously, Sander, Marks, and Onuchic succeeded in predicting structural contact pairs of amino acids from sequence data [4-6]. By utilizing the strength of the inferred couplings, they developed predictors of residue-residue proximity that have proven useful for protein structure prediction. The multi-body clusters described above provide a significantly more cooperative representation than do pairwise clusters, and they also show impressive gains in threading calculations. Thus, we expect them to be superior for distinguishing the importance of specific clusters.
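The text above suggests that phylogenetic information could weight the clusters. As a simpler stand-in, and not a phylogenetic method, the sketch below down-weights redundant sequences by pairwise identity, so that many near-identical homologs do not dominate the cluster statistics. The function names, the 0.8 identity threshold, and the toy alignment are assumptions made for illustration.

```python
def identity(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences are identical."""
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / len(seq_a)

def sequence_weights(alignment, threshold=0.8):
    """Weight each sequence by 1 / (number of sequences similar to it).

    A sequence counts as similar when its identity is at least `threshold`
    (a hypothetical value). Every sequence is similar to itself, so the
    count is never zero.
    """
    weights = []
    for seq in alignment:
        n_similar = sum(1 for other in alignment if identity(seq, other) >= threshold)
        weights.append(1.0 / n_similar)
    return weights

# Toy alignment, invented for illustration: three near-identical sequences
# and one divergent sequence.
toy_family = [
    "ACDEFGHIKL",
    "ACDEFGHIKL",
    "ACDEFGHIKV",
    "WWWWWWWWWW",
]

weights = sequence_weights(toy_family)
print(weights)
print(sum(weights))  # effective number of sequences
```

A cluster variant read from a given sequence could then be counted with that sequence's weight instead of a weight of one.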

Figure 2: Myoglobin structure showing three sets of functionally related amino acids, marked in red, cyan, and blue, identified by their residue substitution patterns.

When the amino acids are no longer required to be spatially close to one another, conserved groups of amino acids from an MSA can still be collectively correlated with protein function. Myoglobin is shown in Figure 2 (PDB: 2mgm), with three sets of amino acids highlighted in red, cyan, and blue. Each set is connected to the function of the protein. The amino acid substitutions in an MSA may suggest novel mechanisms by which such disparate sets of amino acids act collectively to achieve certain functions, mechanisms that may not have been known from previous research. Understanding these becomes more important as more proteins are used as drugs.

Table 1: Pfam protein family ID and the number of sequences in each multiple sequence alignment.

Overall, the interpretation of sequence data becomes significantly more meaningful when the data are combined with structural information. There are many important opportunities in data mining from combining diverse data types.

References
[1] Miyazawa S, Jernigan RL (1985) Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules 18:534-552.
[2] Sillitoe I, Cuff AL, Dessailly BH, Dawson NL, Furnham N, et al. (2013) New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures. Nucleic Acids Res 41:D490-498.
[3] Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, et al. (2014) The Pfam protein families database. Nucleic Acids Res 42 (Database issue):D222-D230.
[4] Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, et al. (2011) Protein 3D structure computed from evolutionary sequence variation. PLoS One 6:e28766.
[5] Marks DS, Hopf TA, Sander C (2012) Protein structure prediction from sequence variation. Nat Biotechnol 30:1072-1080.
[6] Morcos F, Jana B, Hwa T, Onuchic JN (2013) Coevolutionary signals across protein lineages help capture multiple protein conformations. Proc Natl Acad Sci USA 110:20533-20538.

© 2015 Jia K, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Citation: Kejue Jia and Robert L. Jernigan (2015) Combining Disparate Data Types: Protein Sequences and Protein Structures. J Data Mining Genomics Proteomics 6: e117. doi:10.4172/2153-0602.1000e117.

Author:
Kejue Jia and Robert L. Jernigan
Bioinformatics and Computational Biology Program Department of Biochemistry, Biophysics and Molecular Biology
Iowa State University, Ames, IA 50011
Corresponding author: Robert L. Jernigan
E-mail: [email protected]
