Swiss-Prot
Institute of Lifelong Learning, University of Delhi
Subject: Bioinformatics
Lesson: Swiss-Prot
Lesson Developer: Monika Jaggi
College/Department: Kirori Mal College, University of Delhi
Table of Contents
Chapter: Swiss-Prot
• Introduction
• History
• Features
• Retrieval of Protein sequences
• Structure of a Swiss-Prot entry
• Important tools associated with UniProtKB/Swiss-Prot
• Services associated with UniProtKB/Swiss-Prot
• TrEMBL
• Summary
• Exercise
• Glossary
• References
Introduction
A biological database is a collection of biological information stored in an electronic
format that can be accessed easily from anywhere in the world. These databases can be
classified into various categories depending upon data type, data source, maintainer
status, etc. A variety of databases contain nucleotide and/or protein sequence data
pertinent to a specific gene, whereas protein databases are specific to proteins. There are
three important publicly accessible protein databases: Protein Information Resource (PIR),
Swiss-Prot and Protein Data Bank (PDB). Whereas PIR and Swiss-Prot contain protein
sequences, PDB is a structural database of biomolecules. PIR is considered a primary
database, whereas Swiss-Prot falls into the secondary database category. The aim of this
chapter is to explain the Swiss-Prot database and strategies for retrieving information
from it. Some of the tools and databases linked to each entry will also be discussed
briefly.
History
Swiss-Prot is an annotated protein sequence database that was created and managed by
Amos Bairoch, beginning in 1986. It was established collaboratively by the Department of
Medical Biochemistry at the University of Geneva and the European Molecular Biology
Laboratory (EMBL). It moved to the European Bioinformatics Institute (EBI) in 1994 and,
in April 1998, became part of the Swiss Institute of Bioinformatics (SIB) (Bairoch and
Apweiler, 1998). In 1996, TrEMBL was added as an automatically annotated supplement to
the Swiss-Prot database (Bairoch and Apweiler, 1996). Since 2002, Swiss-Prot has been
maintained by the UniProt consortium, and information about a protein sequence can be
accessed via the UniProt website (http://www.uniprot.org/) (Apweiler et al., 2004). The
Universal Protein Resource (UniProt) is a comprehensive protein sequence catalog
maintained by a consortium comprising EBI, SIB and PIR (UniProt Consortium, 2009).
Features
The Swiss-Prot database is characterized by high-quality annotation, which comes at the
price of lower coverage. It provides information about the function of a protein, its domain
structure, post-translational modifications (PTMs), etc. In other words, it aims to give
comprehensive information about a specific protein. The Swiss-Prot database is curated to
make it non-redundant, so it contains only one entry per protein. As a result, Swiss-Prot
is much smaller than DNA sequence databases. Figure 1 shows how the size of the database
has grown. High-quality annotation and minimal redundancy distinguish Swiss-Prot from
other protein sequence databases.
There are four main features of Swiss-Prot:
1. High-quality annotation: This is achieved by creating the protein sequence
entries manually. Each entry passes through six stages:
a. Sequence curation: Identical sequences are identified through BLAST
searches, and sequences from the same gene and the same organism are
merged into a single entry. This step ensures that the sequence is
complete, correct and ready for further curation.
b. Sequence analysis: The sequence is analysed with various sequence analysis
tools. Computer predictions are reviewed manually, and the important results
are selected for integration.
c. Literature curation: Important publications related to the sequence are
retrieved from literature databases. The full text of each article is read
manually, and relevant information is extracted and added to the entry.
d. Family-based curation: Putative homologs are identified by reciprocal
BLAST searches and phylogenetic resources. They are then evaluated and
curated, and annotation is propagated across homologous proteins to ensure
data consistency.
e. Evidence attribution: All information added to the sequence entry during
manual annotation is linked to its original source, so that users can trace
the origin of the data and evaluate it.
f. Quality assurance, integration and update: Each fully annotated entry
undergoes quality assurance before integration into Swiss-Prot and is
updated as new data become available.
Figure 1: The exponential growth of the Swiss-Prot database in the period 1987-2012.
Source: http://web.expasy.org/docs/relnotes/relstat.html.
2. Minimal redundancy: During manual annotation, all entries belonging to the
same gene and the same organism are merged into a single entry containing the
complete information. This results in minimal redundancy.
3. Integration with other databases: Swiss-Prot is presently cross-referenced
to more than 50 specialized databases. This extensive interlinking allows
Swiss-Prot to act as a connecting link between various biological databases.
4. Documentation: Swiss-Prot contains a large number of index files and
specialized documentation files. The 'Documentation file' section provides an
updated descriptive list of all document files.
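The merging that underlies minimal redundancy (feature 2) can be pictured with a small sketch. The Python below is only an illustration. It assumes a hypothetical, simplified record holding a gene name, an organism, a sequence and a source, and it assumes that the longest sequence in a group is kept. Real Swiss-Prot curation is manual and far more involved, and none of the names used here come from Swiss-Prot itself.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class RawRecord:
    """Hypothetical submitted record; not the real Swiss-Prot format."""
    gene: str
    organism: str
    sequence: str
    source: str  # e.g. the submission or publication that reported it


@dataclass
class MergedEntry:
    """One entry per (gene, organism) pair, keeping every source for traceability."""
    gene: str
    organism: str
    sequence: str
    sources: list = field(default_factory=list)


def merge_records(records):
    """Group records by gene and organism and keep one entry per group.

    Assumption for illustration only: the longest sequence in a group is
    taken as the representative one. Curators resolve conflicts by hand.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec.gene.upper(), rec.organism.lower())].append(rec)

    merged = []
    for group in groups.values():
        best = max(group, key=lambda r: len(r.sequence))
        merged.append(
            MergedEntry(
                gene=best.gene,
                organism=best.organism,
                sequence=best.sequence,
                sources=sorted({r.source for r in group}),
            )
        )
    return merged


# Toy data: made-up gene names, organisms and sequences.
records = [
    RawRecord("geneA", "Organism one", "MKTAYI", "submission A"),
    RawRecord("geneA", "Organism one", "MKTAYIAKQR", "submission B"),
    RawRecord("geneA", "Organism two", "MKSAYL", "submission C"),
]
entries = merge_records(records)
for entry in entries:
    print(entry.gene, entry.organism, len(entry.sequence), entry.sources)
```

Keeping the list of sources on the merged entry mirrors the evidence attribution stage: the single entry still points back to every record it was built from.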
Retrieval of Protein Sequences
1. Open http://www.uniprot.org/, the UniProt database home page, in a web browser.
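Sequences can also be retrieved by a program rather than through the browser. The sketch below is a minimal illustration in Python, not part of the lesson's own procedure. It assumes that UniProt's REST service returns a single entry in FASTA format at the address held in UNIPROT_FASTA_URL, and the function names are hypothetical helpers written for this example. The demonstration at the end parses a made-up FASTA record so that it runs without network access.

```python
import urllib.request

# Assumption: UniProt's REST service serves single entries in FASTA format at
# this address. The lesson itself only describes the website interface.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def build_fasta_url(accession):
    """Return the download URL for one UniProtKB accession."""
    return UNIPROT_FASTA_URL.format(accession=accession.strip().upper())


def parse_fasta(text):
    """Split FASTA text into a list of (header, sequence) pairs."""
    entries = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries


def fetch_sequence(accession, timeout=30):
    """Download one entry and return its (header, sequence) pair.

    Needs network access; urllib raises HTTPError if the request fails.
    """
    with urllib.request.urlopen(build_fasta_url(accession), timeout=timeout) as resp:
        text = resp.read().decode("utf-8")
    return parse_fasta(text)[0]


# Offline demonstration on a made-up FASTA record (not a real entry).
sample = ">sp|X00000|DEMO_TOY Toy protein\nMKTAYIAK\nQRQISFVK\n"
parsed = parse_fasta(sample)
print(parsed)
```

With network access, a call such as fetch_sequence("P12345") would be expected to return the header and sequence of that entry, provided the assumed address is still valid.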