Protein Informatics
and Cheminformatics
10.1 Protein informatics
10.2 Cheminformatics
10.1 Protein informatics
10.1.1 Introduction
Protein informatics is the collection and analysis of information
about proteins using the techniques of information technology.
It has been of tremendous help in identifying the geometrical
location of the functional site, the biochemical function and the
biological function of hypothetical proteins. In addition, it has
led to the determination of the tertiary structures of many
hypothetical proteins whose molecular functions could not be
understood using conventional methods. Heterogeneous databases
and various descriptors of amino acid sequences, tertiary
structures and pathways on the proteome scale have also helped
in developing protein informatics.
Chapter 10
Chapter 10.indd 256 09/01/2025 15:18:32
Reprint 2025-26
10.1.2 Protein data types
Computational extraction of information requires raw protein
data. These data can be of the following types:
(i) Microscopic image of heat-denatured protein
aggregate
(ii) Protein in solution form
(iii) Protein sequence as output of Matrix Assisted Laser
Desorption Ionisation (MALDI)
(iv) Assembled protein sequence
(v) Protein crystal structure in Protein Data Bank (PDB)
format
(vi) Protein-protein, protein-ligand or protein-nucleotide
interaction file
(vii) Nuclear Magnetic Resonance (NMR) data and Mass
Spectrometry (MS) data
(viii) Protein sequences derived directly from genomic
sequences, for which there is no known evidence of
existence (hypothetical proteins)
These types of protein data can be used to obtain useful
information, for example:
(i) The multi-fractal property of microscopic images of
heat-denatured protein aggregates is used for designing
protein markers.
(ii) Protein data in solution are useful for analysing
physico-chemical properties and kinetics.
(iii) Fragmented short sequences of proteins from MALDI
are used to find the full-length sequence.
(iv) Protein crystal structures are used to study
mutations and interactions.
(v) PDB, NMR and MS data are also used for predicting
the structure of non-crystallised proteins (directly
from the sequence).
(vi) Proteins with no known evidence of existence
(hypothetical proteins) can be identified from
genomic sequences.
(vii) Network mapping of proteins provides information
about possible targets for the treatment of different
diseases.
In order to carry out protein informatics analysis,
the following two basic facilities are required:
(i) Availability of raw data from various databases,
such as NCBI, PDB, ChEMBL, BioModels, etc.
(ii) Informatics tools and techniques for the analyses.
Some of the well-known techniques are:
(a) image analysis by wavelet techniques, (b)
sequence similarity and homology calculations, (c)
structure optimisation techniques, (d) data analysis
by statistical and machine learning techniques such as
Artificial Neural Network (ANN), Support Vector
Machine (SVM) and Hidden Markov Model (HMM),
(e) network mapping techniques, and (f) Systems
Biology Markup Language (SBML).
10.1.3 Computational prediction of protein
structures
Protein structure prediction using bioinformatics tools
aims to explore how amino acid sequences specify
the structure of proteins and how these proteins bind to
substrates and other molecules to perform their functions.
Predicting the structure of a protein (including that of a
hypothetical protein) is possible even when only the gene
sequence is known, i.e., in the absence of an experimentally
determined protein sequence. Many computational tools
are available from different sources for predicting the
structural and physico-chemical properties of proteins.
The major advantages of computational methods are the
short time frame involved, low cost and the feasibility of
high-throughput screening.
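When only the gene sequence is available, the protein sequence is usually obtained first by translating the coding region with the genetic code. The following is a minimal sketch of that step, assuming the standard genetic code, a coding sequence that starts in reading frame 1, and no introns; the names `CODON_TABLE` and `translate` are illustrative and do not come from any particular tool.

```python
# Standard genetic code, built from the conventional T, C, A, G base ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with T
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(coding_sequence: str) -> str:
    """Translate a coding sequence (frame 1) up to the first stop codon."""
    seq = coding_sequence.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for pos in range(0, len(seq) - 2, 3):
        residue = CODON_TABLE.get(seq[pos:pos + 3], "X")  # X marks an unknown codon
        if residue == "*":  # stop codon ends the protein
            break
        protein.append(residue)
    return "".join(protein)


print(translate("ATGGCCAAATAA"))
```

The resulting one-letter amino acid string is the kind of input that the sequence-based prediction tools described in this section work on.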
10.1.3.1 Primary structure prediction
Protein primary structure prediction involves physico-chemical
characterisation such as isoelectric point,
extinction coefficient, instability index, aliphatic index and
grand average hydropathy. All these can be calculated with
the ProtParam tool of the ExPASy Proteomics Server.
Some of these physico-chemical properties are
described in brief below.
Isoelectric point: The isoelectric point (pI) is the pH at
which the surface of the protein is covered with charge but
the net charge of the protein is zero. At pI, proteins are stable and
compact. A computed pI value of less than 7 (pI<7)
indicates that the protein is acidic.
A pI greater than 7 (pI>7) indicates that the protein is basic
in character. The computed isoelectric point (pI) is
useful for developing the buffer system for purification by
the isoelectric focusing method.
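The idea of a pH at which the net charge is zero can be illustrated numerically. The sketch below estimates pI by bisection on the net charge given by the Henderson-Hasselbalch relation. The pKa values are approximate, illustrative figures assumed for this example; they are not the values used by ProtParam, so the numbers it returns will differ somewhat from that tool.

```python
# Illustrative pKa values (assumed, approximate figures; not ProtParam's set).
POSITIVE_PKA = {"N_TERM": 9.6, "K": 10.5, "R": 12.5, "H": 6.0}
NEGATIVE_PKA = {"C_TERM": 2.3, "D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}


def net_charge(sequence: str, ph: float) -> float:
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    seq = sequence.upper()
    charge = 0.0
    for group, pka in POSITIVE_PKA.items():
        count = 1 if group == "N_TERM" else seq.count(group)
        charge += count / (1.0 + 10 ** (ph - pka))  # protonated fraction
    for group, pka in NEGATIVE_PKA.items():
        count = 1 if group == "C_TERM" else seq.count(group)
        charge -= count / (1.0 + 10 ** (pka - ph))  # deprotonated fraction
    return charge


def isoelectric_point(sequence: str, tolerance: float = 0.001) -> float:
    """Find the pH where the net charge crosses zero, by bisection on [0, 14]."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged: pI lies at a higher pH
        else:
            high = mid  # negatively charged (or zero): pI lies at a lower pH
    return (low + high) / 2.0


def classify(pi: float) -> str:
    """Label a protein as acidic (pI < 7) or basic (pI > 7)."""
    return "acidic" if pi < 7 else "basic" if pi > 7 else "neutral"
```

Bisection works here because the net charge only decreases as the pH rises, so there is a single crossing point.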
The aliphatic index: The aliphatic index (AI) is
defined as the relative volume of a protein occupied
by aliphatic side chains (A, V, I and L). It is regarded as a
positive factor for the increase of thermal stability of
globular proteins. A very high aliphatic index
indicates that the protein may be stable over a wide
temperature range.
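As a rough illustration, the aliphatic index can be computed from the mole percentages of the four aliphatic residues. The sketch below assumes the commonly cited formula AI = X(Ala) + 2.9 X(Val) + 3.9 (X(Ile) + X(Leu)); the coefficients 2.9 and 3.9 come from that formula and are not stated in the text above.

```python
def aliphatic_index(sequence: str) -> float:
    """Aliphatic index from mole percentages of A, V, I and L."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")

    def mole_percent(residue: str) -> float:
        return 100.0 * seq.count(residue) / len(seq)

    # Coefficients weight each side chain by its volume relative to alanine.
    return (
        mole_percent("A")
        + 2.9 * mole_percent("V")
        + 3.9 * (mole_percent("I") + mole_percent("L"))
    )
```

A sequence with no A, V, I or L residues gives an index of 0, and the value rises as the aliphatic content increases.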
The instability index: The instability index provides
an estimate of the stability of a protein in a test tube. There are
certain dipeptides whose occurrence is significantly
different in unstable proteins compared with
stable ones. The method assigns an instability weight value
to each dipeptide. Using these weight values it is possible to
compute an instability index. A protein whose instability
index is smaller than 40 is predicted to be stable; a value
above 40 predicts that the protein may be unstable.
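The calculation described above can be sketched as a sum of dipeptide weights scaled by sequence length, assuming the usual form II = (10 / L) × (sum of the weights of all consecutive dipeptides), where L is the number of residues. The weight table itself is not reproduced here: `dipeptide_weights` is a hypothetical argument that the caller must fill from the published table, and treating a missing dipeptide as weight 0 is an assumption of this sketch.

```python
from typing import Mapping


def instability_index(sequence: str, dipeptide_weights: Mapping[str, float]) -> float:
    """Instability index: (10 / L) * sum of weights over consecutive dipeptides."""
    seq = sequence.upper()
    if len(seq) < 2:
        raise ValueError("need at least two residues to form a dipeptide")
    total = sum(
        dipeptide_weights.get(seq[i:i + 2], 0.0)  # assumption: unknown pair counts as 0
        for i in range(len(seq) - 1)
    )
    return (10.0 / len(seq)) * total


def is_predicted_stable(index: float) -> bool:
    """Apply the threshold from the text: below 40 is predicted to be stable."""
    return index < 40.0
```

The function only becomes meaningful once the full table of 400 dipeptide weights is supplied.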
The Grand Average Hydropathy (GRAVY) value: The
GRAVY value for a peptide or
protein is calculated as the sum of the hydropathy values of
all the amino acids, divided by the number of residues in
the sequence. A low GRAVY value indicates the
possibility of better interaction with water.
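The GRAVY calculation is a simple average, and it can be sketched as follows. The example assumes the Kyte-Doolittle hydropathy scale, which the text above does not name; the function name `gravy` is illustrative.

```python
# Kyte-Doolittle hydropathy values (assumed scale for this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Sum of hydropathy values divided by the number of residues."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    # A KeyError here means the sequence contains a non-standard residue code.
    return sum(KYTE_DOOLITTLE[residue] for residue in seq) / len(seq)
```

On this scale, negative values indicate a more hydrophilic sequence and positive values a more hydrophobic one.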
10.1.3.2 Secondary Structure Prediction
Protein secondary structure has been studied
intensely, since it helps to reveal the functions of
proteins with unknown structures. In addition, it has been
shown that the prediction of protein secondary structure is
a step towards protein 3-dimensional structure prediction.
APSSP, CFSSP, SOPMA, and GOR are common protein
secondary structure prediction tools.
10.1.3.3 Three dimensional (3D) Structure
Prediction
The following three computational methods are commonly
used to predict protein 3D structure.
Homology modelling: For homology modelling,
the amino acid sequence of a protein with unknown
structure is aligned against sequences of proteins
with known structures. High degrees of homology (very
similar sequences across and between the proteins) can
be used to determine the global structure of the protein
with unknown structure and place it into a certain fold
category. Lower degrees of homology may still be used
to determine the local structures, an example being the
Chou-Fasman method for predicting secondary structure.
An advantage of homology modelling methods is that they do
not depend on knowledge of the physical determinants.
MODELLER and SWISS-MODEL are commonly used tools
for homology modelling.
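One simple way to quantify the degree of similarity between a target and a candidate template is the percent identity of their alignment. The sketch below assumes the two sequences have already been aligned (same length, with `-` marking gaps) and counts identity over columns where neither sequence has a gap. Tools differ in which denominator they use, so this is one convention among several, and the function name is illustrative.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over alignment columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Target aligned against a hypothetical template sequence.
print(round(percent_identity("ACDE-FG", "ACDQKFG"), 1))
```

The higher the value, the more closely the template's structure can be expected to resemble that of the target.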
Fold prediction: With the method called ‘threading’,
the sequence of a protein with unknown structure is
forced to take the backbone conformation
of a protein with known structure. These
methods tend to be more compute-intensive than homology
modelling methods, but they give more confidence in the
physical viability of the results. LIBELLULA and Threader
are commonly used tools for this method.
De novo protein structure prediction: This is an
algorithmic process by which the tertiary structure of a protein is
predicted from its primary amino acid sequence. QUARK
is a computer algorithm for ab initio protein structure
prediction and protein peptide folding, which aims to
construct the correct protein 3D model from the amino acid
sequence only.
The computationally elucidated structure of a protein
is recorded as atomic coordinates in Protein Data Bank
files. The three-dimensional coordinates are stored in a
type of text file, namely a PDB file with the file extension .pdb,
in the Protein Data Bank (PDB) database. The database contains data
from X-ray crystallography, NMR and a few theoretical
structure models.
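Because a PDB file is plain text, the atomic coordinates can be read with a few lines of code. The sketch below assumes the fixed-column layout of `ATOM` and `HETATM` records in the legacy PDB format; the column positions are taken from that format rather than from the text above, and the names `Atom` and `parse_atoms` are illustrative.

```python
from typing import List, NamedTuple


class Atom(NamedTuple):
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atoms(pdb_text: str) -> List[Atom]:
    """Extract atom names and x, y, z coordinates from PDB-format text."""
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")):
            continue  # skip headers, remarks and other record types
        atoms.append(
            Atom(
                name=line[12:16].strip(),          # columns 13-16
                residue=line[17:20].strip(),       # columns 18-20
                chain=line[21],                    # column 22
                residue_number=int(line[22:26]),   # columns 23-26
                x=float(line[30:38]),              # columns 31-38
                y=float(line[38:46]),              # columns 39-46
                z=float(line[46:54]),              # columns 47-54
            )
        )
    return atoms
```

Reading by column position rather than splitting on whitespace matters because adjacent fields in a PDB record can run together without a separating space.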
Domain prediction: A domain is a distinct functional
and/or structural unit of a protein. It is an independent folding
unit of a polypeptide chain that also carries a specific function.
Domains are often identified as recurring (sequence or
structure) units, which may exist in various contexts.
Domains provide the most valuable information for the
prediction of protein structure, function, evolution, and
design. The most common tools for domain prediction are
InterProScan of EMBL and CDD search of NCBI.