Protein Informatics and Cheminformatics Chapter Notes | Biotechnology for Class 11 - NEET

Chapter Notes - Protein Informatics and Cheminformatics

Protein Informatics

  • Protein informatics involves collecting information about proteins using information technology techniques.
  • It aids in identifying the geometrical location of functional sites, biochemical functions, and biological roles of hypothetical proteins.
  • It has facilitated the determination of tertiary structures of hypothetical proteins, which were previously difficult to understand using conventional methods.
  • Heterogeneous databases and descriptors of amino acid sequences, tertiary structures, and pathways on a proteome scale have been instrumental in advancing protein informatics.

Protein Data Types

  • Protein informatics relies on raw protein data for computational information extraction, which includes the following types:
      • Microscopic images of heat-denatured protein aggregates, used to study multi-fractal properties for designing protein markers.
      • Protein in solution form, useful for analyzing physico-chemical properties and kinetics information.
      • Protein sequence output from Matrix Assisted Laser Desorption Ionisation (MALDI), where fragmented short sequences are used to determine the full-length sequence.
      • Assembled protein sequence, providing a complete sequence for further analysis.
      • Protein crystal structure in Protein Data Bank (PDB) format, used to study mutations and interactions.
      • Protein-protein, protein-ligand, or protein-nucleotide interaction files, providing insights into molecular interactions.
      • Nuclear Magnetic Resonance (NMR) and Mass Spectrometry (MS) data, used for predicting the structure of non-crystallized proteins directly from sequences.
      • Protein sequences derived from genomic sequences without known evidence of existence (hypothetical proteins), used to identify such proteins.
  • Applications of these data types include:
      • Designing protein markers using multi-fractal properties of heat-denatured protein aggregates.
      • Analyzing physico-chemical properties and kinetics from protein solutions.
      • Reconstructing full-length sequences from MALDI-derived fragments.
      • Studying mutations and interactions using protein crystal structures.
      • Predicting structures of non-crystallized proteins using PDB, NMR, and MS data.
      • Identifying hypothetical proteins from genomic sequences.
      • Network mapping of proteins to identify potential treatment targets for diseases.
  • Protein informatics analysis requires two basic facilities:
      • Availability of raw data from databases such as NCBI, PDB, ChEMBL, and BioModels.
      • Informatics tools and techniques, including:
          • Image analysis using wavelet techniques.
          • Sequence similarity and homology calculations.
          • Structure optimization techniques.
          • Data analysis using statistical and machine learning techniques such as Artificial Neural Network (ANN), Support Vector Machine (SVM), and Hidden Markov Model (HMM).
          • Network mapping techniques.
          • Systems Biology Mark-up Language (SBML).

Computational Prediction of Protein Structures

  • Protein structure prediction uses bioinformatics tools to determine how amino acid sequences define protein structures and their interactions with substrates and other molecules.
  • It enables structure prediction of proteins, including hypothetical ones, even when only the gene sequence is known, without the protein sequence.
  • Computational methods offer advantages like shorter time frames, low cost, and feasibility for high-throughput screening.

Primary Structure Prediction


Primary structure prediction involves physico-chemical characterization of proteins, including isoelectric point, extinction coefficient, instability index, aliphatic index, and grand average hydropathy (GRAVY).
These properties are calculated using the ProtParam tool of the ExPASy Proteomics Server.

  • Isoelectric Point (pI): The pH at which a protein's surface carries no net charge; at this pH the protein is stable and compact. A pI < 7 indicates an acidic protein, while a pI > 7 indicates a basic protein. The computed pI aids in developing buffer systems for purification by isoelectric focusing.
  • Aliphatic Index (AI): Measures the relative volume occupied by aliphatic side chains (alanine, valine, isoleucine, leucine) in a protein, and correlates positively with thermal stability. A high AI suggests stability across a wide temperature range.
  • Instability Index: Estimates protein stability in a test tube by analyzing dipeptide occurrences. A value below 40 predicts a stable protein, while a value above 40 suggests instability.
  • Grand Average Hydropathy (GRAVY): Calculated as the sum of hydropathy values of all amino acids divided by sequence length. A low (more negative) GRAVY value indicates a more hydrophilic protein that interacts better with water.
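
The GRAVY and aliphatic index definitions above can be sketched in a few lines of Python. This is only an illustration, not the ProtParam implementation: it assumes the Kyte-Doolittle hydropathy scale for GRAVY and Ikai's formula for the aliphatic index (AI = X(Ala) + 2.9 X(Val) + 3.9 [X(Ile) + X(Leu)], with X in mole percent). The function names and the example peptide are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
# (assumed scale for this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Sum of hydropathy values of all residues divided by sequence length."""
    seq = sequence.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(sequence: str) -> float:
    """Aliphatic index from the mole percentages of Ala, Val, Ile and Leu."""
    seq = sequence.upper()

    def mole_percent(aa: str) -> float:
        return 100.0 * seq.count(aa) / len(seq)

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


if __name__ == "__main__":
    peptide = "MKVLAAGIV"  # hypothetical peptide, for illustration only
    print("GRAVY:", round(gravy(peptide), 3))
    print("Aliphatic index:", round(aliphatic_index(peptide), 1))
```

A negative GRAVY from this function points to a hydrophilic sequence and a positive one to a hydrophobic sequence, which matches the interpretation given above.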

Secondary Structure Prediction

  • Secondary structure prediction is crucial for understanding protein functions, especially for proteins with unknown structures.
  • It serves as a step toward predicting three-dimensional (3D) protein structures.
  • Common tools for secondary structure prediction include APSSP, CFSSP, SOPMA, and GOR.

Three Dimensional (3D) Structure Prediction

Three computational methods are commonly used for 3D protein structure prediction:

  • Homology Modelling: Aligns the amino acid sequence of a protein with unknown structure against sequences of proteins with known structures. High sequence homology determines the global structure, placing the protein in a fold category. Lower homology predicts local structures, e.g., using the Chou-Fasman method for secondary structure prediction. It does not rely on physical determinants. Common tools include MODELLER and SWISS-MODEL.
  • Fold Prediction (Threading): Forces the sequence of a protein with unknown structure to adopt the backbone conformation of a protein with known structure. More computationally intensive than homology modelling but provides higher confidence in physical viability. Common tools include LIBELLULA and Threader.
  • De Novo Protein Structure Prediction: Predicts tertiary structure from the primary amino acid sequence using algorithms. QUARK is a tool for ab initio structure prediction and protein peptide folding, constructing 3D models from sequences alone.
  • Computationally predicted structures are stored as atomic coordinates in Protein Data Bank (PDB) files, with the .pdb extension, containing data from X-ray crystallography, NMR, and theoretical models.
  • Domain Prediction: Identifies distinct functional or structural units of a protein, which fold independently and carry specific functions. Domains are recurring sequence or structure units and provide insights into protein structure, function, evolution, and design. Common tools include InterProScan (EMBL-EBI) and CDD search (NCBI).
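
To make the "atomic coordinates in PDB files" point above concrete, here is a minimal sketch of reading ATOM and HETATM records from PDB-format text. It assumes the standard fixed-width column layout of the PDB format (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-54). The `Atom` type and function names are hypothetical, and the sample record is made up for illustration.

```python
from typing import List, NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one fixed-width ATOM/HETATM record (Python slices are 0-based)."""
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain_id=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
    )


def read_atoms(pdb_text: str) -> List[Atom]:
    """Collect every ATOM/HETATM record and skip all other record types."""
    return [parse_atom_line(line)
            for line in pdb_text.splitlines()
            if line.startswith(("ATOM", "HETATM"))]


if __name__ == "__main__":
    # Build an illustrative record with a format string so the columns line up.
    sample = "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f" % (
        1, " N", "MET", "A", 1, 27.340, 24.430, 2.614)
    for atom in read_atoms(sample + "\nEND\n"):
        print(atom)
```

For real work an established parser is usually the better choice; the sketch only shows why fixed columns, not whitespace splitting, are the safe way to read this format.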

Cheminformatics

  • Cheminformatics uses computational and informational techniques to address chemistry-related problems, integrating principles from physics, chemistry, biology, mathematics, biochemistry, statistics, and informatics.
  • Also known as chemoinformatics or chemical informatics, it is widely applied in drug discovery to evaluate large numbers of compounds for interactions with target cellular molecules.
  • Over the past two decades, cheminformatics has advanced conceptually and technically, with applications in chemical, pharmaceutical, and biotechnology industries, particularly in computer-aided drug design (CADD) for molecules with specific biological and therapeutic properties.
  • Cheminformatics specialists manage data on physical properties, 3D molecular and crystal structures, and chemical reaction pathways.
  • It handles virtual libraries of chemical databases, including hypothetical compounds, with information on synthesis methods and predicted stability of reaction products.
  • Virtual screening applies chemical and physical principles to identify and evaluate candidates from large libraries of real and virtual molecules for specific properties or reactions, which are then verified in laboratory studies.

Storing and Managing the Chemical Data

  • Numerous groups and organizations maintain chemical compound databases, some publicly available for free and others commercially accessible.
  • These databases, containing millions of compounds and reactions, are searchable in seconds due to robust computational power and tools.
  • Virtual molecule libraries, with billions of entries, include compounds not documented in literature but synthesizable using advanced combinatorial techniques.
  • The Chemical Abstracts Service (CAS), a division of the American Chemical Society, maintains the world’s largest collection of chemical information and serves as a universal standard for chemical names and structures.
  • The CAS Registry includes over 219 million organic and inorganic substances, more than 70 million protein and nucleic acid sequences, and over 8 billion property values. It is updated daily with data from the global literature in biomedical sciences, chemistry, engineering, and material science.
  • Popular chemical databases include:
      • PubChem: Maintains information on substances, compounds, and BioAssays.
      • ZINC: Contains compounds for virtual screening, including features like molecular weight and log P.
      • ChEMBL: Provides comprehensive data on bioactive, drug-like small molecules and drug targets.
      • NCI: Offers small molecule structures, useful for cancer and AIDS research.
      • ChemDB: Includes chemicals with predicted or experimentally determined physicochemical properties, such as 3D structure, melting temperature, and solubility.
      • ChemSpider: Aggregates unique chemical entities from diverse data sources.
      • BindingDB: A database of small molecule binding affinities for protein targets.
      • DrugBank: Combines detailed drug data (chemical, pharmacological, pharmaceutical) with drug target information (sequence, structure, pathway).
      • PharmGKB: A pharmacogenomics knowledge resource with clinical drug molecule information.
      • SuperDrug: Contains 3D structures of active ingredients in essential marketed drugs.

Why Do We Need Cheminformatics?

  • Cheminformatics tools navigate vast chemical resources, including hundreds of millions of compounds, properties, and reactions, to identify suitable compounds for specific purposes.
  • Pharmaceutical companies use cheminformatics for in silico drug design, followed by synthesis and testing.
  • The chemical manufacturing industry employs cheminformatics to design chemicals with new properties and to predict their efficacy and toxicity before market release.

How to Store Information on Chemical Compounds?

  • Chemical compounds can be drawn on paper or using software with predefined templates to create standard geometric structures and reactions, stored as image files (e.g., jpg, tif) or documents (e.g., doc, pdf).
  • Such storage is inadequate for research requiring deep analysis of bond angles, rotational flexibility, and other molecular properties.
  • Chemical structures are stored as molecular graphs, representing atoms as nodes and bonds as edges.
  • The node-edge approach is used to model molecular pathways, such as glycolysis and the Krebs cycle, at a higher level.
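
A minimal sketch of the molecular graph idea described above: atoms are stored as nodes, bonds as edges with a bond order. The class and method names are hypothetical, hydrogens are left implicit, and only ethanol's heavy-atom skeleton (C-C-O) is built as an example.

```python
from collections import defaultdict


class MoleculeGraph:
    """Atoms are nodes (indexed from 0); bonds are undirected edges."""

    def __init__(self):
        self.elements = {}              # atom index -> element symbol
        self.bonds = defaultdict(dict)  # atom index -> {neighbour: bond order}

    def add_atom(self, element: str) -> int:
        index = len(self.elements)
        self.elements[index] = element
        return index

    def add_bond(self, a: int, b: int, order: int = 1) -> None:
        # Store the edge in both directions because bonds are undirected.
        self.bonds[a][b] = order
        self.bonds[b][a] = order

    def neighbours(self, index: int):
        return sorted(self.bonds[index])

    def degree(self, index: int) -> int:
        return len(self.bonds[index])


def ethanol() -> MoleculeGraph:
    """Heavy-atom graph of ethanol: C-C-O (hydrogens implicit)."""
    mol = MoleculeGraph()
    c1 = mol.add_atom("C")
    c2 = mol.add_atom("C")
    o = mol.add_atom("O")
    mol.add_bond(c1, c2)
    mol.add_bond(c2, o)
    return mol


if __name__ == "__main__":
    mol = ethanol()
    for index, element in mol.elements.items():
        print(index, element, "bonded to", mol.neighbours(index))
```

Unlike an image or a document file, this representation can be queried directly, for example to list the neighbours of an atom or to search for a substructure.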

Searching the Structures

  • Many commercial cheminformatics databases originate from academic research projects.
  • Basic searches retrieve chemical structures by their physical and chemical properties, for example all compounds within a specific boiling point range.
  • Substructure retrieval identifies compounds with specific functional groups, like methyl groups, benzene rings, or alkene backbones, through subgraph isomorphism (embedding a small graph into a larger one).
  • A two-stage search process is common:
      • First Stage: Filters out molecules unlikely to match the substructure query, eliminating most candidates.
      • Second Stage: Performs detailed subgraph isomorphism to identify molecules matching the substructure.
  • Molecule screens use binary strings (bitstrings) of 0s and 1s for efficient filtering.
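
The two-stage search above can be sketched as follows. This is a toy illustration under stated assumptions, not how any particular database implements it: molecules are heavy-atom graphs with bond orders ignored, the bitstring is built by hashing element and bonded-pair features into 64 bits with `zlib.crc32`, and the second stage is a plain backtracking subgraph match. All function names are hypothetical.

```python
import zlib

N_BITS = 64  # length of the screening bitstring (assumed for this sketch)


def make_molecule(elements, bonds):
    """Return (elements, adjacency) where adjacency maps index -> set of indices."""
    adjacency = {i: set() for i in range(len(elements))}
    for a, b in bonds:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return list(elements), adjacency


def fingerprint(mol) -> int:
    """Set one bit per element type and per bonded element pair."""
    elements, adjacency = mol
    features = set(elements)
    for a, neighbours in adjacency.items():
        for b in neighbours:
            features.add("-".join(sorted((elements[a], elements[b]))))
    bits = 0
    for feature in features:
        bits |= 1 << (zlib.crc32(feature.encode()) % N_BITS)
    return bits


def is_substructure(query, target) -> bool:
    """Backtracking search for an embedding of the query graph in the target."""
    q_elements, q_adj = query
    t_elements, t_adj = target
    mapping = {}  # query atom index -> target atom index

    def extend(q: int) -> bool:
        if q == len(q_elements):
            return True
        for t in range(len(t_elements)):
            if t in mapping.values() or t_elements[t] != q_elements[q]:
                continue
            # Every already-mapped query neighbour must be bonded to t.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                if extend(q + 1):
                    return True
                del mapping[q]
        return False

    return extend(0)


def substructure_search(query, library):
    """Stage 1: bitstring screen. Stage 2: subgraph match on the survivors."""
    q_bits = fingerprint(query)
    candidates = [name for name, mol in library.items()
                  if (fingerprint(mol) & q_bits) == q_bits]
    return [name for name in candidates if is_substructure(query, library[name])]


if __name__ == "__main__":
    library = {
        "ethanol": make_molecule("CCO", [(0, 1), (1, 2)]),
        "methylamine": make_molecule("CN", [(0, 1)]),
        "acetic acid": make_molecule("CCOO", [(0, 1), (1, 2), (1, 3)]),
        "propane": make_molecule("CCC", [(0, 1), (1, 2)]),
    }
    c_o_fragment = make_molecule("CO", [(0, 1)])
    print(substructure_search(c_o_fragment, library))
```

The screen can let a non-matching molecule through (hash collisions set the same bit), but it never rejects a true match, because every feature of the query also occurs in any molecule that contains it. That is why the cheap first stage is safe to run before the expensive second one.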

Searching the Reactions

  • Chemists search reaction databases to check whether a compound has been synthesized, to identify reaction conditions, and to explore different reaction pathways from a starting material to a target product.
  • Searches may include parameters like solvents, pH, temperature, and pressure.
  • Complex queries integrate multiple criteria, e.g., finding reactions using glucose at 37°C.
  • Atom mapping is a key feature, establishing correspondence between reactant and product atoms.
  • Cheminformatics tools and databases allow retrieval of reactions where specific substructures are converted into products.
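
A query that combines several criteria, such as "reactions using glucose at 37°C", can be sketched as a filter over reaction records. The `Reaction` record, the `search_reactions` function, and the three entries in `TOY_REACTIONS` are all hypothetical; the conditions listed are placeholders for illustration and are not taken from any database.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Reaction:
    reactants: Tuple[str, ...]
    products: Tuple[str, ...]
    solvent: str
    temperature_c: float
    ph: float


# Illustrative records only; the conditions are placeholder values.
TOY_REACTIONS = [
    Reaction(("glucose", "ATP"), ("glucose 6-phosphate", "ADP"), "water", 37.0, 7.4),
    Reaction(("glucose",), ("ethanol", "carbon dioxide"), "water", 30.0, 5.0),
    Reaction(("ethanol", "acetic acid"), ("ethyl acetate", "water"), "none", 70.0, 2.0),
]


def search_reactions(reactions,
                     reactant: Optional[str] = None,
                     product: Optional[str] = None,
                     solvent: Optional[str] = None,
                     temperature_c: Optional[float] = None,
                     tolerance_c: float = 0.5):
    """Return the reactions that satisfy every criterion that was supplied."""
    hits = []
    for rxn in reactions:
        if reactant is not None and reactant not in rxn.reactants:
            continue
        if product is not None and product not in rxn.products:
            continue
        if solvent is not None and solvent != rxn.solvent:
            continue
        if (temperature_c is not None
                and abs(rxn.temperature_c - temperature_c) > tolerance_c):
            continue
        hits.append(rxn)
    return hits


if __name__ == "__main__":
    for rxn in search_reactions(TOY_REACTIONS, reactant="glucose", temperature_c=37):
        print(" + ".join(rxn.reactants), "->", " + ".join(rxn.products))
```

Criteria left as `None` are simply ignored, so the same function covers both a simple lookup by product and a more complex combined query.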

Pharmacophore

  • A pharmacophore describes molecular features critical for a ligand’s recognition by a biological target to trigger a response, as defined by IUPAC.
  • It includes steric and electronic features ensuring optimal interactions with the target.
  • Pharmacophore models explain how structurally diverse ligands bind to a single receptor.
  • A 3D pharmacophore specifies spatial orientations of features like positively and negatively charged groups, rings, and hydrophobic regions.
  • It is a conceptual framework, not a physical molecule, defining pharmacophore points (steric, electrostatic, hydrophobic properties) needed for therapeutic molecule-target interactions.

Lipinski's Rule of Five (RO5)

  • Proposed by Christopher A. Lipinski in 1997, this rule outlines key molecular properties for orally active drugs.
  • An ideal drug should be biodegradable, non-toxic, stable, free of side effects, uniformly distributed in cells, controllably released, cost-effective, and easily excreted.
  • Criteria for an orally active drug (should not violate more than one rule):
      • No more than 5 hydrogen bond donors.
      • No more than 10 hydrogen bond acceptors.
      • Molecular weight below 500 Daltons.
      • Octanol-water partition coefficient (log P) less than 5.
  • RO5 applies only to oral drugs, not intramuscular or intravenous drugs.
  • Compounds are scored from 0 to 4 according to the number of RO5 criteria they satisfy; scores below 3 indicate unsuitability for further analysis.
  • RO5 does not apply to natural products or semisynthetic natural products.
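
The RO5 criteria and the 0 to 4 score above can be sketched as a small check. The descriptor values are assumed to be computed elsewhere (for example by a cheminformatics toolkit); the `Descriptors` type and function names are hypothetical, and the numbers in the usage example are approximate or invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    h_bond_donors: int
    h_bond_acceptors: int
    molecular_weight: float  # in Daltons
    log_p: float             # octanol-water partition coefficient


def ro5_checks(d: Descriptors) -> dict:
    """Evaluate each of the four criteria as listed in the text above."""
    return {
        "h_bond_donors <= 5": d.h_bond_donors <= 5,
        "h_bond_acceptors <= 10": d.h_bond_acceptors <= 10,
        "molecular_weight < 500": d.molecular_weight < 500,
        "log_p < 5": d.log_p < 5,
    }


def ro5_score(d: Descriptors) -> int:
    """Number of criteria satisfied, from 0 to 4."""
    return sum(ro5_checks(d).values())


def passes_ro5(d: Descriptors) -> bool:
    """At most one violation is allowed, i.e. a score of 3 or 4."""
    return ro5_score(d) >= 3


if __name__ == "__main__":
    # Approximate descriptor values for aspirin, for illustration only.
    aspirin = Descriptors(h_bond_donors=1, h_bond_acceptors=4,
                          molecular_weight=180.16, log_p=1.2)
    # A made-up large, lipophilic compound.
    hypothetical = Descriptors(h_bond_donors=7, h_bond_acceptors=12,
                               molecular_weight=810.0, log_p=6.1)
    for label, d in (("aspirin", aspirin), ("hypothetical", hypothetical)):
        print(label, ro5_score(d), passes_ro5(d))
```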

The Journey of a Drug

  • Nature provides a vast array of active compounds with therapeutic potential, and scientific methods help identify promising molecules.
  • Drug discovery and development is a long, expensive, and risky process, involving discovery, development, and delivery phases.
  • Virtual screening is an in silico approach to select compounds from billions for specific purposes, such as drug discovery or industrial applications.
  • Virtual screening involves scoring, ranking, and extracting structures using computational methods, with filters to eliminate undesirable compounds.
  • Filters become increasingly stringent, narrowing down to a small group of molecules with desired properties.
  • Virtual screening includes:
      • General filters to identify drug-like compounds with desired Absorption, Distribution, Metabolism, and Excretion (ADME) properties.
      • Ligand-based methods, including machine learning and pharmacophore-based searches.
      • Structure-based methods, such as protein-ligand docking.
  • Compounds passing virtual screening undergo biological screening, synthesis, and testing.
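
The filter-then-rank funnel described above can be sketched as a list of filters applied in order, followed by a ranking step. Everything here is hypothetical: the compound records, the field names, the filter thresholds, and the docking-style score, for which this sketch assumes that a lower (more negative) value is better.

```python
def run_funnel(compounds, filters, score_key, top_n):
    """Apply each filter in turn, then rank the survivors by score.

    Returns the top-ranked compounds and a report of how many
    compounds remained after each stage.
    """
    survivors = list(compounds)
    report = [("input", len(survivors))]
    for name, keep in filters:
        survivors = [c for c in survivors if keep(c)]
        report.append((name, len(survivors)))
    # Assumption: lower (more negative) scores indicate better candidates.
    ranked = sorted(survivors, key=lambda c: c[score_key])[:top_n]
    return ranked, report


if __name__ == "__main__":
    # Made-up compounds and values, for illustration only.
    library = [
        {"name": "cpd-1", "mol_weight": 320.0, "log_p": 2.1, "toxic_flag": False, "dock": -8.4},
        {"name": "cpd-2", "mol_weight": 640.0, "log_p": 3.0, "toxic_flag": False, "dock": -9.9},
        {"name": "cpd-3", "mol_weight": 410.0, "log_p": 6.2, "toxic_flag": False, "dock": -7.0},
        {"name": "cpd-4", "mol_weight": 280.0, "log_p": 1.5, "toxic_flag": True, "dock": -9.1},
        {"name": "cpd-5", "mol_weight": 450.0, "log_p": 4.0, "toxic_flag": False, "dock": -9.0},
    ]
    filters = [
        ("molecular weight < 500", lambda c: c["mol_weight"] < 500),
        ("log P < 5", lambda c: c["log_p"] < 5),
        ("no toxicity flag", lambda c: not c["toxic_flag"]),
    ]
    top, report = run_funnel(library, filters, score_key="dock", top_n=2)
    for stage, remaining in report:
        print(stage, remaining)
    print([c["name"] for c in top])
```

Ordering the cheap property filters before the expensive scoring step mirrors the narrowing described above: most compounds are removed early, and only a small group reaches the costly stage.
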
The document Protein Informatics and Cheminformatics Chapter Notes | Biotechnology for Class 11 - NEET is a part of the NEET Course Biotechnology for Class 11.

FAQs on Protein Informatics and Cheminformatics Chapter Notes - Biotechnology for Class 11 - NEET

1. What is the difference between protein informatics and cheminformatics?
Ans. Protein informatics focuses on the analysis and interpretation of protein data, including structure, function, and interactions, while cheminformatics deals with chemical data, particularly the storage, retrieval, and analysis of chemical compounds and their properties.
2. How can protein informatics tools assist in drug discovery?
Ans. Protein informatics tools can help identify potential drug targets by analyzing protein structures and functions, predicting how drugs will interact with these proteins, and facilitating the design of new compounds that can effectively bind to these targets.
3. What are some common databases used in protein informatics?
Ans. Common databases include UniProt for protein sequences and functional information, Protein Data Bank (PDB) for 3D structures, and STRING for protein-protein interaction data, which provide valuable resources for researchers in the field.
4. What role does cheminformatics play in computational drug design?
Ans. Cheminformatics plays a crucial role in computational drug design by allowing researchers to model and simulate chemical interactions, optimize lead compounds, and analyze large datasets of chemical properties to identify promising candidates for further study.
5. How do machine learning techniques apply to protein and cheminformatics?
Ans. Machine learning techniques are applied in both fields to predict protein structures, classify protein functions, and analyze chemical properties, enabling researchers to make more informed decisions and accelerate the discovery process in drug development and bioinformatics.