Presented at the Bioinformatics Seminar at the University of Arkansas, Little Rock on November 5, 2021.
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a popular chemical database at the National Library of Medicine, National Institutes of Health. Arguably, PubChem is one of the largest chemical information resources in the public domain, with 111 million unique chemical structures, 1.39 million biological assays, and 292 million biological activity result outcomes. It also contains significant amounts of scientific research data and the inter-relationships between chemicals, proteins, genes, scientific literature, patents, and more. PubChem is a key resource for big data in chemistry and has been used in many studies for developing bioactivity and toxicity prediction models, discovering polypharmacologic (multi-target) ligands, and identifying new macromolecule targets of compounds (for drug-repurposing or off-target side effect prediction). It has also been used for cheminformatics education as well as chemical health and safety training. This presentation provides a high-level overview of PubChem’s data, tools, and services.
This document discusses cheminformatics, which involves the use of computer software and databases to manage chemical compound data and properties for applications in drug discovery. It defines cheminformatics as combining chemical synthesis, biological screening, and data mining to guide the drug development process. The document outlines the history and evolution of cheminformatics from chemical information to modern applications. It also discusses key companies involved in cheminformatics and related areas like quantitative structure-activity relationships and chemical libraries.
Prediction of the three-dimensional structure of a given protein sequence (the target protein), based on an alignment to one or more homologous (template) proteins for which an X-ray or NMR structure is available.
Drug and Chemical Databases 2018 - Drug Discovery
Girinath Pillai
Latest collection of chemical and drug databases for biological research and drug design studies. Includes database statistics, links, and overview data, with an introduction to CADD.
The ZINC database was developed by John Irwin as a curated collection of commercially available, annotated small molecules with 3D structures for virtual screening. Investigators in pharmaceutical companies, biotech companies, and research universities use ZINC for virtual screening because it aims to represent molecules in their biologically relevant 3D forms. The database is continuously updated, and static subsets are released quarterly.
PyMol is a molecular graphics program that allows visualization and manipulation of protein structures. It takes protein data files in PDB format as input and outputs the 3D protein structure that can be visualized, animated, exported, and analyzed through various features and commands. PyMol is open-source, cross-platform software that is effective for protein structure analysis and commonly used in research.
PubChem and its application for cheminformatics education
Sunghwan Kim
PubChem is a public chemical database maintained by the U.S. National Institutes of Health containing information on small molecules, lipids, nucleic acids, and other chemical substances. It receives over 5 million unique users per month, many of whom are students. PubChem has potential as an educational resource given its popularity, sustainability, and zero cost to students. PubChem collaborates with academic partners to develop resources like an online cheminformatics course and chemical safety summaries.
Molecular Representation, Similarity and Search
Rajarshi Guha
This document discusses molecular representation and similarity. It outlines different ways to represent molecules, such as explicitly showing atoms and bonds or more compact implicit representations. Methods to quantify similarity between molecules are presented, including fingerprints to encode structural features and calculate similarity metrics like Tanimoto scores. Applications like virtual screening and library design rely on assessing molecular similarity. Both 2D and 3D representations have advantages and limitations in evaluating biochemical relevance.
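The fingerprint-based similarity described above can be illustrated with a minimal sketch. It assumes each fingerprint is represented as a set of "on" bit positions; the bit positions below are hypothetical placeholders, not fingerprints of real molecules, and the value returned for two empty fingerprints is a convention chosen for this example.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    # |A ∩ B| / |A ∪ B|, with the union size written as |A| + |B| - |A ∩ B|
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical on-bit positions for two molecules (illustrative only).
mol_a = {1, 4, 7, 9, 15}
mol_b = {1, 4, 8, 9, 20, 31}

score = tanimoto(mol_a, mol_b)
print(f"Tanimoto similarity: {score:.3f}")
```

A score of 1.0 means the two fingerprints have identical on-bits, and 0.0 means they share none. In practice, fingerprints would come from a cheminformatics toolkit rather than hand-written sets.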
Cheminformatics is the application of computer science to solve chemical problems. It involves acquiring chemical data through experiments or simulations, managing the information in databases, and analyzing the data. Key aspects of cheminformatics include computer-assisted synthesis design, representing chemical structures digitally, and using mathematical models to analyze chemical data. Cheminformatics plays an important role in drug discovery by aiding processes like target identification, lead discovery, and molecular modeling.
Protein structure prediction involves computational methods to determine a protein's 3D structure from its amino acid sequence. Ab initio methods use physics-based calculations of potential energy to predict the most stable conformation. Comparative methods leverage databases of known protein structures, searching for sequences with similar folds. Homology modeling relies on the assumption that related proteins share similar folds, allowing prediction based on matches to distant evolutionary relatives. Protein threading compares local segments of the sequence to structural fragments in databases.
In Vitro ADMET Considerations for Drug Discovery and Lead Generation
OSUCCC - James
This document provides an overview of in vitro ADMET (absorption, distribution, metabolism, excretion, toxicity) assays that are used during drug discovery and development. Key points:
- In vitro assays are designed to mimic what happens to a compound in vivo and provide early data on absorption, distribution, metabolic transformations, potential toxicity, and more.
- Common assays examine solubility, permeability, protein binding, metabolic stability, metabolism pathways, toxicity, and effects of transporters and drug-drug interactions.
- The data generated from these assays are used throughout the drug development process to inform compound selection, design better candidates, and identify liabilities early. Understanding a compound's properties helps optimize the likelihood of success.
Drug design is a process used in the biopharmaceutical industry to discover and develop new drug compounds.
A variety of computational methods are used to identify novel compounds and to design compounds for selectivity and safety.
Structure-based drug design, ligand-based drug design, and homology-based methods are used, depending on how much information is available about the drug targets and potential drug compounds.
This document discusses systems biology and some of its tools. It defines systems biology as the study of interactions between parts of biological systems to understand how they function. Biological networks involve interactions between pathways. Networks can be modeled as nodes and edges. Tools described for modeling and analyzing networks include Cytoscape for visualization, CellDesigner for drawing networks, and STRING for protein-protein interaction data. Databases of pathways, interactions and models are also listed.
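The idea of modeling a biological network as nodes and edges can be sketched in a few lines. This is a minimal illustration using plain dictionaries rather than any of the tools named above; the node names and interaction pairs are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical interaction pairs; the node names are placeholders, not real proteins.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]


def build_network(edge_list):
    """Return an undirected adjacency map: node -> set of neighbouring nodes."""
    adjacency = defaultdict(set)
    for u, v in edge_list:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


network = build_network(edges)

# Degree = number of interaction partners; the highest-degree node is a candidate "hub".
degree = {node: len(neighbours) for node, neighbours in network.items()}
hub = max(degree, key=degree.get)
print(degree, hub)
```

An adjacency map keeps neighbour lookups cheap, which is usually what simple network statistics such as degree need.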
The document discusses structure-based drug design (SBDD). It first provides background on drug design and SBDD. It then describes some key aspects of SBDD, including using the 3D structure of the biological target obtained from techniques like X-ray crystallography and NMR spectroscopy. It also discusses ligand-based and receptor-based drug design approaches. The document then outlines the typical steps involved in SBDD, including target selection, ligand selection, target preparation, docking, and evaluation of results. Finally, it discusses some molecular docking techniques and the scoring functions used to predict binding.
Bioinformatics plays a key role in drug discovery by enabling researchers to efficiently analyze large amounts of biological data and computationally simulate drug-target interactions. Some important applications of bioinformatics in drug discovery include virtual high-throughput screening of compound libraries against protein targets to identify potential drug leads, analyzing genetic and protein sequences to infer evolutionary relationships and identify drug targets, and using homology modeling to predict the 3D structures of targets to aid in drug design when experimental structures are unknown.
Introduction
Overview
Reductionist approach
Holistic approach
What is systems biology?
○ Advantages of Systems Biology
Tools of holistic approach
○ Proteomics, Transcriptomics and Metabolomics
Conclusion
References
Presented at the Fall 2020 American Chemical Society (ACS) National Meeting (Virtual) on August 20, 2020.
Sunghwan Kim, Jian Zhang, Paul Thiessen, Asta Gindulyte, Pertti J. Hakkinen & Evan Bolton
National Library of Medicine, National Institutes of Health, Rockville, Maryland, United States
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical information resource at the U.S. National Institutes of Health. It collects chemical information from 700+ data sources and disseminates the collected data to the public free of charge. Arguably, PubChem contains the largest amount of chemical information available in the public domain, with more than 250 million depositor-provided substance descriptions, 100 million unique chemical structures, and 265 million bioactivity outcomes from one million assays covering around twenty thousand unique protein target sequences.
Included in the many types of content in PubChem is toxicological information about chemicals, e.g., human and animal toxicity, ecotoxicity, exposure limits, exposure symptoms, and antidote & emergency treatment. Notably, a substantial amount of toxicological information from resources formerly offered by the TOXicology data NETwork (TOXNET) is now integrated into PubChem, e.g., the Hazardous Substances Data Bank (HSDB), LactMed, and LiverTox. In addition, PubChem contains a large amount of bioactivity and toxicity screening data that can be used to build toxicity prediction models based on statistical and machine-learning approaches. This presentation provides an overview of PubChem’s toxicological information as well as tools and services that help users exploit this information. It also describes how open data in PubChem can be used to develop prediction models for chemical toxicity.
PubChem as a resource for chemical information training
Sunghwan Kim
Presented at the 257th American Chemical Society (ACS) National Meeting in Orlando, FL (March 31, 2019). [CINF 13]
==== Abstract ====
Libraries at many large academic institutions provide chemical information training programs for students. However, these programs are based on commercial chemical information resources, which come with non-trivial subscription fees. These fees are often too expensive for small organizations, including many primarily undergraduate institutions (PUIs) and community colleges (CCs). This leads to a disparity among students in access to chemical information as well as in learning opportunities. This issue may be addressed at least in part by developing free online training programs based on public chemical databases, such as PubChem (https://pubchem.ncbi.nlm.nih.gov). PubChem has great potential as an online resource for chemical education, but it also has important issues that students and teachers should keep in mind, such as data accuracy, data provenance, structure standardization, and terminology. In this presentation, we will discuss various aspects of PubChem as a resource for chemical information training.
PubChem: A Public Chemical Information Resource for Big Data Chemistry
Sunghwan Kim
A web-seminar jointly organized by KWSE (Korean Woman Scientists & Engineers) and KWiSE (Korean-American Women in Science and Engineering). Presented on July 27, 2021.
Exploiting PubChem for drug discovery based on natural products
Sunghwan Kim
Presented at the 256th American Chemical Society (ACS) National Meeting in Boston, MA (August 19, 2018).
==== Abstract ====
PubChem is one of the largest sources of publicly available chemical information, with more than 242.3 million depositor-provided substance descriptions, 94.7 million unique chemical structures, and 234.8 million bioactivity outcomes from 1.25 million assays covering around ten thousand unique protein target sequences. This presentation provides an overview of PubChem’s data, tools, and services useful for drug discovery based on natural products.
PubChem contains a large amount of bioactivity data, most of which are generated from high-throughput screening (HTS). However, these data also include a substantial amount of bioactivity information extracted from scientific articles published in journals in the chemical biology, medicinal chemistry, and natural product domains, thanks to data contribution by other databases like ChEMBL, Guide to Pharmacology, BindingDB, and PDBbind. In addition, through data integration with other databases such as DrugBank, HSDB, and HMDB, PubChem contains a wide range of annotations useful for drug discovery, including pharmacology, toxicology, drug target, metabolism, chemical vendors, scientific articles, patents, and many others.
PubChem supports various types of chemical structure searches, including identity search, 2-D and 3-D similarity searches, substructure and superstructure searches, and molecular formula search. It also provides multiple programmatic access routes, including E-Utilities, Power User Gateway (PUG), PUG-SOAP, PUG-REST, and PUG-View, allowing one to build an automated workflow that takes advantage of information contained in PubChem. In addition, through PubChemRDF, users can integrate PubChem’s data into their own in-house data on a local computing machine.
Searching for chemical information using PubChem
Sunghwan Kim
Presented at the 257th American Chemical Society (ACS) National Meeting in Orlando, FL (April 1, 2019). [CHED 303]
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical database that provides information on a broad range of chemical entities, including small molecules, lipids, carbohydrates, and (chemically modified) amino acid and nucleic acid sequences (including siRNA and miRNA). With three million unique users per month at peak, PubChem is ranked as one of the most visited chemistry websites in the world. A substantial number of PubChem users are between the ages of 18 and 24 and are likely to be undergraduate or graduate students at academic institutions. Therefore, PubChem has great potential as an online resource for chemical education. In this talk, we will present “PubChem Search”, a new web interface that allows users to quickly find desired chemical information. This interface supports chemical name search as well as various types of chemical structure search, including identity/similarity search, superstructure/substructure search, and molecular formula search. Using PubChem Search, it is also possible to search for journal articles or patent documents that mention a given chemical. The hits returned from a search can be downloaded to local machines or further refined or analyzed in conjunction with other PubChem tools and services. In this presentation, we will demonstrate how the PubChem Search interface can be used to search beyond Google for chemical information of interest.
PubChem: a public chemical information resource for big data chemistry
Sunghwan Kim
PubChem is a public resource containing over 100 million unique chemical structures and 268 million bioactivity outcomes from assays. It aggregates data from over 750 sources and contains extensive information on chemical properties, biological activities, and related literature and patents. Users can search and access this data interactively through web interfaces or programmatically. As an example, bioactivity data from PubChem was used to develop predictive models for small molecule interactions with the retinoid X receptor alpha protein, achieving AUC scores over 0.7.
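The AUC score mentioned above measures how well a model ranks active compounds ahead of inactive ones. The sketch below is not the model from the presentation; it only shows how a ROC AUC can be computed from labels and predicted scores, using the pairwise-ranking definition. The labels and scores are made-up illustrative values.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (active, inactive) pairs ranked correctly; ties count 0.5."""
    actives = [s for y, s in zip(labels, scores) if y == 1]
    inactives = [s for y, s in zip(labels, scores) if y == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive compound")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


# Made-up example: 1 = active, 0 = inactive, with hypothetical model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The double loop is quadratic in the number of compounds, so a rank-based implementation from a statistics library would be preferable for large assay data sets.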
Revolution in the Connectivity Between Medicinal Chemistry and Biology
Chris Southan
This document provides a summary of PubChem and related open cheminformatics resources and their role in connecting medicinal chemistry and biology. It discusses how PubChem has revolutionized the field by providing a central repository for chemical structures linked to biological data. Key points include how PubChem has accelerated output in medicinal chemistry research and enabled new approaches like chemical systems biology by making vast amounts of chemical and biological data openly accessible and searchable.
This presentation was given at a Triangle Area Mass Spectrometry meeting on 01/29/2019 in Research Triangle Park, North Carolina, to provide a general overview of the CompTox Chemicals Dashboard to an audience of mass spectrometrists and people interested in the capabilities of the dashboard for chemical forensics, structure identification, etc.
NCBI Minute: Integrating PubChem into Your Chemistry Teaching
Sunghwan Kim
NCBI Webinar delivered online on May 9, 2018.
PubChem is one of the most visited chemistry web sites in the world, with more than 2.9 million unique users per month. This NCBI Minute shows how you can integrate PubChem into your chemistry teaching as a cheminformatics education resource. In addition to learning about tools and services for search, analysis, and download of chemical information, you will see how PubChem has been incorporated into Cheminformatics OLCC (On-Line Chemistry Courses), an intercollegiate hybrid course.
Presented online at KSEA - Virginia Washington Metro Regional Conference 2020 (VWMRC 2020) (May 9, 2020)
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a popular chemical information resource, visited by millions of unique users per month. It contains chemical data from more than 700 data sources and disseminates these data to the public free of charge. Arguably, it is the largest source of publicly available chemical information, containing more than 250 million depositor-provided substance descriptions, 100 million unique chemical structures, and 260 million bioactivity outcomes from one million assays covering around ten thousand unique protein target sequences. This presentation provides an overview of PubChem’s data, tools, and services useful for drug discovery.
The immense quantity of bioactivity data in PubChem can be used to develop computational models to predict bioactivities of small molecules. While these data are primarily generated from high-throughput screening (HTS), they also include a substantial amount of bioactivity information extracted from peer-reviewed journal articles. In addition, through data integration with other databases, PubChem has a wide range of annotations useful for drug discovery, including pharmacology, toxicology, drug target, metabolism, chemical vendors, scientific articles, patents, and many others.
PubChem supports various types of chemical structure searches, including identity, 2-D and 3-D similarity, substructure, superstructure, and molecular formula. It also provides multiple programmatic access routes, including E-Utilities, Power User Gateway (PUG), PUG-SOAP, PUG-REST, and PUG-View, allowing one to build an automated workflow that takes advantage of information contained in PubChem. In addition, through PubChemRDF, users can integrate PubChem data with their own.
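As a rough illustration of programmatic access, the sketch below builds a PUG-REST request URL for looking up properties of a compound by name. The helper function name is hypothetical, and the URL layout (`compound/name/<name>/property/<list>/<format>`) and property names are assumptions based on the publicly documented PUG-REST pattern; check the current PubChem documentation before relying on them. The code only constructs the URL and does not send a request.

```python
from urllib.parse import quote

# Base address of the PUG-REST service (assumed from the public PubChem documentation).
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def compound_property_url(name, properties, output="JSON"):
    """Build a PUG-REST URL requesting the given properties for a compound found by name.

    Hypothetical helper for illustration; it does not contact PubChem.
    """
    encoded_name = quote(name, safe="")  # percent-encode spaces and special characters
    property_list = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/name/{encoded_name}/property/{property_list}/{output}"


url = compound_property_url("aspirin", ["MolecularFormula", "MolecularWeight"])
print(url)
```

The resulting URL could be fetched with any HTTP client as one step of an automated workflow. Scripts that send many requests should respect PubChem's usage policies.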
This document describes the CompTox Chemistry Dashboard, a publicly accessible website developed by the EPA's National Center for Computational Toxicology that provides data on over 762,000 chemicals. The dashboard contains experimental and predicted physicochemical property data, environmental fate and transport data, toxicity data, and models. It allows users to search, view detailed chemical pages, access prediction reports, perform batch searches, and will integrate additional predicted properties and data in the future. The goal is to provide a central resource for computational toxicology data to support chemical safety assessments.
A presentation given at the 5th Metabolomics of North America webinar on September 8th, 2023. Provides an overview of the cheminformatics support provided by the DSSTox database, the CompTox Chemicals Dashboard, and multiple other web-based applications in development.
Tens of thousands of chemicals are currently in commerce, and hundreds more are introduced every year. Because current chemical testing is resource intensive, only a small fraction of chemicals have been adequately evaluated for potential human health effects. New technologies and computational tools have shown promise for closing this knowledge gap. In the U.S. EPA’s ToxCast effort, the use of ~700 high-throughput in vitro assays has broadly characterized the biological activity and potential mechanisms of ~1,800 chemicals. Coupling the high-throughput in vitro assays with additional in vitro pharmacokinetic assays and in vitro-to-in vivo extrapolation modeling allows conversion of in vitro bioactive concentrations to estimates of an administered dose (mg/kg/day). High-throughput exposure models are generating exposure estimates based on key aspects of chemical production, fate, transport, and personal use. The path for incorporating new approach methods and technologies for prioritization and assessment of chemical alternatives poses multiple scientific challenges. These challenges include sufficient coverage of toxicological mechanisms to meaningfully interpret negative test results, development of increasingly relevant test systems, computational modeling to integrate experimental data, characterizing uncertainty, and efficient validation of the test systems and computational models. The presentation will cover progress at the U.S. EPA in the development and application of these technologies and approaches in evaluating alternatives and systematically addressing each of these challenges. This abstract does not necessarily reflect U.S. EPA policy.
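The conversion of an in vitro bioactive concentration to an administered dose can be sketched with a simple reverse-dosimetry calculation. This is a simplified illustration, not the EPA's implementation: it assumes the steady-state plasma concentration scales linearly with dose, and the function name and numeric inputs are hypothetical.

```python
def oral_equivalent_dose(bioactive_conc_uM, css_uM_at_unit_dose):
    """Estimate the dose (mg/kg/day) whose steady-state plasma concentration
    equals the in vitro bioactive concentration.

    css_uM_at_unit_dose is the predicted steady-state concentration (uM)
    produced by a dose of 1 mg/kg/day; linear scaling with dose is assumed.
    """
    if css_uM_at_unit_dose <= 0:
        raise ValueError("steady-state concentration must be positive")
    return bioactive_conc_uM / css_uM_at_unit_dose


# Hypothetical values: an assay active at 3.0 uM and a chemical predicted to
# reach 1.5 uM at steady state for every 1 mg/kg/day administered.
dose = oral_equivalent_dose(bioactive_conc_uM=3.0, css_uM_at_unit_dose=1.5)
print(f"Oral equivalent dose: {dose} mg/kg/day")
```

Comparing such a dose estimate with a predicted exposure is one way the bioactivity and exposure estimates described above can be brought onto the same scale.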
This presentation was given at the ASMS Sanibel Conference "Unraveling the Exposome" and provided a general overview of the dashboard and how it integrates with many of the projects that we support, with a special focus on list generation, mass and formula searching based on MS-Ready structures, and some of the prototypes that we have been developing to support non-targeted analysis.
The document discusses how the EPA's CompTox Chemicals Dashboard can be used to support mass spectrometry analyses for structure identification. The Dashboard contains data on over 800,000 chemicals including properties, lists, and links to other resources. It allows searching by formula, structure, and mass to find related chemicals. Candidate structures can be ranked using metadata. Predicted mass spectra from over 800,000 structures may also be accessible. The Dashboard integrates data to help identify unknown chemicals detected by mass spectrometry.
The CompTox Chemicals Dashboard is an open chemistry resource and web-based application containing data for ~900,000 substances. While it pales in comparison to other online resources containing many tens of millions of chemical substances, it represents two decades of effort to aggregate and curate chemical data to deliver access to physicochemical properties, environmental fate and transport data, in vitro and in vivo toxicity data, and consumer product data for over 500,000 products. Associated with the chemical substance collections are specific lists based on chemical classes (e.g. polychlorinated biphenyls (PCBs)), usage categories (e.g. flame retardants, pesticides), specific environmental classes of interest (e.g. disinfectant by-products), and regulatory lists (e.g. TSCA, Toxics Release Inventory data). The underlying database expands daily as a result of the efforts of curators who continue to harvest data from peer-reviewed publications, regulatory documents, and relevant online databases. The cheminformatics architecture allows for mapping between parent chemicals and related substances, including metabolites and environmental degradants. The dashboard has become a valuable resource in the identification of chemical substances in the environment.
High-resolution mass spectrometry (HRMS) and non-targeted analysis (NTA) are of increasing interest in chemical forensics for the identification of emerging contaminants and chemical signatures of interest. Our research using HRMS for non-targeted and suspect screening analyses utilizes Advanced Search capabilities, including mass- and formula-based searches. A specific type of data mapping in the underpinning database, using “MS-Ready” structures, has proven to be a valuable approach for structure identification: it links structures that can be identified via HRMS with related substances, in the form of salts and other multi-component mixtures, that are available in commerce. These MS-Ready structures have been used as an input set for computational MS-fragmentation to provide a database against which to search experimental data for spectral matching. This presentation will provide an overview of the CompTox Chemicals Dashboard and its underlying data, and of how it supports structure identification and non-targeted analysis in chemical forensics. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
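A mass-based search of the kind described above can be sketched as follows. This is not the Dashboard's implementation: it is a minimal illustration that computes monoisotopic masses from molecular formulas and keeps the candidates within a ppm tolerance of an observed neutral mass. The three-compound candidate list, the 5 ppm default, and the function names are assumptions made for the example, and only C, H, N, and O are supported.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes; only a few elements are listed.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def monoisotopic_mass(formula):
    """Sum isotope masses for a simple formula such as 'C8H10N4O2' (no brackets or charges)."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass


def search_by_mass(candidates, observed_mass, ppm=5.0):
    """Return names of candidates whose monoisotopic mass lies within +/- ppm of observed_mass."""
    tolerance = observed_mass * ppm / 1e6
    return [
        name
        for name, formula in candidates.items()
        if abs(monoisotopic_mass(formula) - observed_mass) <= tolerance
    ]


# Small illustrative candidate list (name -> molecular formula).
candidates = {"caffeine": "C8H10N4O2", "theobromine": "C7H8N4O2", "glucose": "C6H12O6"}

# Observed neutral monoisotopic mass (assumed to be already corrected for the adduct).
hits = search_by_mass(candidates, observed_mass=194.0804)
print(hits)
```

Because several structures can share a formula and therefore a mass, a mass match only narrows the candidate list; ranking by additional metadata or spectral matching is still needed for identification.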
Presentation for Texas A&M Superfund Research Center virtual learning series, Big Data in Environmental Science and Toxicology. More details at https://superfund.tamu.edu/big-data-session-2-aug-18-2021/
Identification of unknowns in mass spectrometry based non-targeted analyses (NTA) requires the integration of complementary pieces of data to arrive at a confident, consensus structure. Researchers use chemical reference databases, spectral matching, fragment prediction tools, retention time prediction tools, and a variety of other data to arrive at tentative, probable, and confirmed, if possible, identifications. With the diverse, robust data contained within the US EPA’s CompTox Chemistry Dashboard (https://comptox.epa.gov), the goal of this research is to identify and implement a harmonized identification tool and workflow using previously generated chemistry data. Data has been compiled from product use, functional use prediction models, environmental media occurrence prediction models, and PubMed references, among other sources. We will report on our development of a visualization tool whereby users can visualize the relative contribution of identification-based metrics on a list of candidate structures and observe the greatest likelihood of occurrence. These data and visualization tools support NTA identification via the Dashboard and demonstrate an open, accessible tool for all users of HRMS data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
PubChem for drug discovery and chemical biology
Chris Southan
This document provides an overview of the PubChem database for academic drug discovery and chemical biology. It describes PubChem's large content of over 97 million compounds and 3.4 million with bioactivity results. It highlights drug-related resources in PubChem like ChEMBL and the Guide to Pharmacology. It also demonstrates several use cases, including searching structures extracted from patents, linking between papers and chemistry, and getting probes mapped into PubChem.
The National Center for Computational Toxicology at the EPA has developed the CompTox Chemistry Dashboard to provide public access to toxicity and chemical property data. The dashboard integrates data from high-throughput screening, predicted properties, chemical hazard assessments, literature searches, and exposure information for over 760,000 chemicals. It allows users to search for chemicals individually or in batches, access detailed chemical pages, view predicted and experimental properties, and download open data. The dashboard is meant to be an integration hub that makes NCCT's data and models more accessible and reusable to other scientists.
PubChem for chemical information literacy training
Sunghwan Kim
Presented at the American Chemical Society Fall 2021 National Meeting (August 23, 2021; virtual).
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical information resource that collects chemical information from 780+ data sources. It is visited by millions of users every month, and many of them are undergraduate or graduate students at academic institutions. While PubChem has great potential as an online resource for chemical education, it also has important issues that are not familiar to students and educators, including data accuracy, data provenance, structure standardization, and terminology. In this presentation, various aspects of PubChem as a chemical education resource will be discussed, with a special emphasis on how to help students develop chemical information literacy skills.
PubChem for drug discovery in the age of big data and artificial intelligence
Sunghwan Kim
Presented at the American Chemical Society Middle Atlantic Regional Meeting (MARM) 2021 (June 10, 2021).
==== Abstract ====
With the emergence of the age of big data and artificial intelligence, biomedical research communities have a great interest in exploiting the massive amount of chemical and biological data available in the public domain. PubChem (https://pubchem.ncbi.nlm.nih.gov) is one of the largest sources of publicly available chemical information, with 270+ million substance descriptions, 110+ million unique compounds, and 285+ million bioactivity outcomes from more than one million biological assay experiments. PubChem provides a wide range of chemical information, including structure, pharmacology, toxicology, drug target, metabolism, chemical vendors, patents, regulations, clinical trials, and many others. These contents can be accessed interactively through web browsers as well as programmatically using computer scripts. They can also be downloaded in bulk through the PubChem File Transfer Protocol (FTP) site. PubChem data has been used in many studies for developing bioactivity and toxicity prediction models, discovering polypharmacologic (multi-target) ligands, and identifying new macromolecule targets of compounds (for drug-repurposing or off-target side effect prediction). This presentation provides an overview of PubChem data, tools, and services useful for drug discovery.
Cheminformatics Online Chemistry Course (OLCC): A Community Effort to Introdu...
Sunghwan Kim
Presented at the American Chemical Society (ACS) Spring 2021 National Meeting (Virtual, April 16, 2021).
==== Abstract ====
Computer and informatics skills to handle an ever-increasing amount of chemical information are considered important for students pursuing STEM careers in the age of big data. However, many schools do not offer a cheminformatics course or alternative training opportunities. The Cheminformatics Online Chemistry Course (OLCC) is a community effort to introduce cheminformatics content into the undergraduate chemistry curriculum. It is a highly collaborative teaching project involving instructors at multiple schools as well as external cheminformatics experts recruited across sectors, including academia, government, and industry. Three Cheminformatics OLCCs were offered in the Fall 2015, Spring 2017, and Fall 2019 semesters. In each OLCC, the instructors at participating schools would meet face-to-face with the students, while external cheminformatics experts engaged through online discussions across campuses with both the instructors and students. All the material created in the course has been made available at the open education repositories of LibreTexts and CCCE websites for other institutions to adapt to their future needs. This presentation describes the instructional approaches of the Cheminformatics OLCC project and the lessons learned from this community effort. We also discuss future directions for this project as well as cheminformatics education in general, including pedagogy, resources, and course content.
Cheminformatics Education with PubChem
Sunghwan Kim
Presented on November 13, 2020, as part of the "Integrating Bioinformatics Education Series" (https://ualr.edu/bioinformatics/education-series/), organized by the Arkansas IDeA Network of Biomedical Research Excellence (Arkansas INBRE) (https://inbre.uams.edu/).
Sunghwan Kim
National Library of Medicine, National Institutes of Health, Rockville, Maryland, United States
PubChem as an Emerging Toxicological Information Resource
Sunghwan Kim
Presented on October 20, 2020 at the 9th American Society for Cellular and Computational Toxicology (ASCCT) National Meeting.
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical information resource at the U.S. National Institutes of Health. It collects chemical information from 750+ data sources and disseminates it to the public free of charge. Arguably, PubChem contains the largest amount of chemical information available in the public domain, with more than 265 million depositor-provided substance descriptions, 100 million unique chemical structures, and 270 million bioactivity outcomes from one million assays covering around twenty thousand unique protein target sequences.
Included in the many types of content in PubChem is toxicological information about chemicals, e.g., human and animal toxicity, ecotoxicity, exposure limits, exposure symptoms, and antidote & emergency treatment. Notably, a substantial amount of toxicological information from resources formerly offered by the TOXicology data NETwork (TOXNET) is now integrated into PubChem, e.g., the Hazardous Substances Data Bank (HSDB), Genetic Toxicology Data Bank (Gene-Tox), Chemical Carcinogenesis Research Information System (CCRIS), LactMed, and LiverTox. In addition, PubChem contains a large amount of bioactivity and toxicity screening data that can be used to build toxicity prediction models based on statistical and machine-learning approaches. This presentation provides an overview of PubChem’s toxicological information and describes how open data in PubChem can be used to develop prediction models for chemical toxicity.
PubChem as a resource for chemical information education
Sunghwan Kim
Presented at the Fall 2020 American Chemical Society (ACS) National Meeting (Virtual) on August 20, 2020.
Sunghwan Kim & Evan Bolton
National Library of Medicine, National Institutes of Health, Rockville, Maryland, United States
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical information resource that contains one of the largest corpora of publicly available chemical information. It is one of the top five most visited chemistry websites in the world, with more than four million unique users per month (as of April 2020). Because many PubChem users are undergraduate students at academic institutions, PubChem has great potential as an online resource for chemical education. However, it also has important issues with data accuracy, data provenance, structure standardization, terminology, and so on, because PubChem is essentially a data aggregator that collects heterogeneous data from 700+ data sources in various domains. This presentation will discuss various aspects of PubChem as a chemical information education resource. In particular, it will focus on how to help students develop the ability to critically assess chemical information available in PubChem and other public databases.
Chemical Health and Safety Information in PubChem
Sunghwan Kim
Presented at the 258th American Chemical Society (ACS) National Meeting in San Diego, CA (August 26, 2019).
Risk assessment in laboratories requires ready access to health and safety (H&S) information for many different chemicals used in laboratory work. Because chemical H&S data in the public domain are scattered across many websites, it is essential to create a centralized data repository that collects, organizes, and disseminates these data. An example is PubChem (https://pubchem.ncbi.nlm.nih.gov), developed and maintained by the U.S. National Institutes of Health.
PubChem contains a substantial corpus of H&S information of chemicals collected from authoritative government agencies and international organizations. PubChem’s H&S data include flammability, toxicity, exposure limits, exposure symptoms, first aid, handling, clean-up procedure, GHS symbols, and more. In addition, for 100,000+ compounds, PubChem provides a tailored data view called the Laboratory Chemical Safety Summary (LCSS), which presents pertinent H&S data for a given compound. The complete list of chemicals with an LCSS can be accessed through the PubChem LCSS project webpage (https://pubchemdocs.ncbi.nlm.nih.gov/lcss/) or the PubChem Classification Browser (https://pubchem.ncbi.nlm.nih.gov/classification/#hid=72). If desired, LCSS data can be downloaded from the LCSS page for each compound, or in bulk from the PubChem LCSS project webpage, enabling local annotation of the data to support specific procedures in place at an institution. The LCSS page can be readily accessed from a mobile device using a chemical QR code.
Chemical Structure Standardization and Synonym Filtering in PubChem
Sunghwan Kim
Presented at the 258th American Chemical Society (ACS) National Meeting in San Diego, CA (August 26, 2019).
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical data repository that provides information on various chemical entities, including small molecules, siRNA, miRNA, peptides, lipids, carbohydrates, chemically modified biologics, etc. One of the most commonly requested tasks in PubChem is to search for a compound by chemical name (also commonly called “chemical synonym”). PubChem performs this task by looking up chemical synonym-structure associations provided by individual depositors to PubChem. These name-structure associations are used to create links between chemicals and Medical Subject Headings (MeSH) terms, which in turn are used to generate associations between chemicals and PubMed articles. The accuracy of these depositor-provided synonym-structure associations is dependent upon two important quality control methods used in PubChem: (1) chemical structure standardization and (2) synonym filtering based on crowd voting. In this presentation, we will discuss the two quality control methods and their effects on the chemical synonym-structure associations.
Development of machine learning-based prediction models for chemical modulato...
Sunghwan Kim
Presented at the 2018 Research Festival at the National Institutes of Health (NIH) in Bethesda, MD (September 13, 2018).
==== Abstract ====
The retinoid X receptor (RXR) is a nuclear hormone receptor that functions as a transcription factor with roles in development, cell differentiation, metabolism, and cell death. Chemicals that interfere with the RXR signaling pathway may cause adverse effects on human health. In this study, public-domain bioactivity data available in PubChem (https://pubchem.ncbi.nlm.nih.gov) were used to develop machine learning-based prediction models for chemical modulators of RXR-alpha, which is a subtype of RXR that plays a role in metabolic signaling pathways, dermal cysts, cardiac development, insulin sensitization, etc. The models were constructed from quantitative high-throughput screening (qHTS) data from the Tox21 project, using popular supervised machine learning methods (including support vector machine, random forest, neural network, k-nearest neighbors, decision tree, and naïve Bayes). The general applicability of the developed models was evaluated with external data sets from ChEMBL and the NCATS Chemical Genomics Center (NCGC). This study showcases how open data in the public domain can be used to develop prediction models for bioactivity of small molecules.
Using open bioactivity data for developing machine-learning prediction models...
Sunghwan Kim
Presented at the 256th American Chemical Society (ACS) National Meeting in Boston, MA (August 22, 2018).
==== Abstract ====
The retinoid X receptor (RXR) is a nuclear hormone receptor that functions as a transcription factor with roles in development, cell differentiation, metabolism, and cell death. Chemicals that interfere with the RXR signaling pathway may cause adverse effects on human health. In this study, open bioactivity data available at PubChem (https://pubchem.ncbi.nlm.nih.gov) were used to develop prediction models for chemical modulators of RXR-alpha, which is a subtype of RXR that plays a role in metabolic signaling pathways, dermal cysts, cardiac development, insulin sensitization, etc. The models were constructed from quantitative high-throughput screening (qHTS) data from the Tox21 project, using various supervised machine learning methods (including support vector machine, random forest, neural network, k-nearest neighbors, decision tree, and naïve Bayes). The performance of the models was evaluated with an external data set containing bioactivity data submitted by ChEMBL and the NCATS Chemical Genomics Center (NCGC). This study showcases how open data in the public domain can be used to develop prediction models for chemical toxicity.
Searching for patent information in PubChem
Sunghwan Kim
Presented at the 256th American Chemical Society (ACS) National Meeting in Boston, MA (August 19, 2018).
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical information resource, containing more than 242 million chemical substance descriptions, 94 million unique compounds, and 234 million bioactivities determined from 1.25 million assay experiments. Importantly, data contribution from multiple sources, including IBM, SureChEMBL, ScripDB, NextMove, and BindingDB, allows PubChem to provide links to patent documents that mention chemicals. Currently, PubChem offers links between about 6.7 million patent documents and more than 20 million unique chemical structures, with over 137 million compound-patent links, covering primarily U.S. patents, with some European, World Intellectual Property Organization, and Japanese patent documents. This presentation will provide an overview of the patent information in PubChem as well as best practices for using it.
How can you access PubChem programmatically?
Sunghwan Kim
Presented at the 255th American Chemical Society (ACS) National Meeting in New Orleans, LA (March 19, 2018).
Building automated workflows that exploit the vast amount of data contained in PubChem requires programmatic access to the data through application programming interfaces (APIs). PubChem provides several programmatic access routes to its data, including Entrez Utilities (E-Utilities or E-Utils), PubChem Power User Gateway (PUG), PUG-SOAP, PUG-REST, PUG-View, and a REST-ful interface to PubChemRDF. This presentation provides an overview of these programmatic access tools, including recent updates, limitations, usage policies, and best practices.
*References*
(1) PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem, Nucleic Acids Research, 2015, 43(W1):W605–W611. https://doi.org/10.1093/nar/gkv396
(2) An update on PUG-REST: RESTful interface for programmatic access to PubChem, Nucleic Acids Research, 2018, 46(W1):gky294. https://doi.org/10.1093/nar/gky294
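As a rough illustration of the PUG-REST route mentioned in the abstract above, the sketch below only builds a request URL and does not contact the server. The helper name `property_url` and the example compound are illustrative choices, not part of the presentation; the URL layout follows the general PUG-REST pattern of input, operation, and output, and should be checked against the current PubChem documentation and usage policies before use.

```python
from urllib.parse import quote

# Base address of the PUG-REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(name, properties, output="JSON"):
    """Build a PUG-REST URL requesting properties for a compound name.

    `property_url` is a hypothetical helper written for this sketch.
    """
    encoded_name = quote(name, safe="")  # percent-encode spaces etc.
    wanted = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/name/{encoded_name}/property/{wanted}/{output}"


url = property_url("aspirin", ["MolecularFormula", "MolecularWeight"])
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client; scripts that send many requests should throttle themselves according to PubChem's usage policies.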
This is a presentation about electrostatic force. The topic is from the Class 8 "Force and Pressure" lesson from NCERT. I think it might be helpful for you. The presentation has four parts: introduction, types, examples, and demonstration. The demonstration should be done by yourself.
just download it to see!
Collaborative Team Recommendation for Skilled Users: Objectives, Techniques, ...
Hossein Fani
Collaborative team recommendation involves selecting users with certain skills to form a team who will, more likely than not, accomplish a complex task successfully. To automate the traditionally tedious and error-prone manual process of team formation, researchers from several scientific spheres have proposed methods to tackle the problem. In this tutorial, while providing a taxonomy of team recommendation works based on their algorithmic approaches to model skilled users in collaborative teams, we perform a comprehensive and hands-on study of the graph-based approaches that comprise the mainstream in this field, then cover the neural team recommenders as the cutting-edge class of approaches. Further, we provide unifying definitions, formulations, and evaluation schema. Last, we introduce details of training strategies, benchmarking datasets, and open-source tools, along with directions for future works.
Lunar Mobility Drivers and Needs - Artemis
Sérgio Sacani
NASA’s new campaign of lunar exploration will see astronauts visiting sites of scientific or strategic
interest across the lunar surface, with a particular focus on the lunar South Pole region.[1] After landing
crew and cargo at these destinations, local mobility around landing sites will be key to movement of
cargo, logistics, science payloads, and more to maximize exploration returns.
NASA’s Moon to Mars Architecture Definition Document (ADD)[2] articulates the work needed to achieve
the agency’s human lunar exploration objectives by decomposing needs into use cases and functions.
Ongoing analysis of lunar exploration needs reveals demands that will drive future concepts and elements.
Recent analysis of integrated surface operations has shown that the transportation of cargo on the
surface from points of delivery to points of use will be particularly important. Exploration systems will
often need to support deployment of cargo in close proximity to other surface infrastructure. This cargo
can range from the crew logistics and consumables described in the 2023 “Lunar Logistics Drivers and
Needs” white paper,[3] to science and technology demonstrations, to large-scale infrastructure that
requires precision relocation.
Molecular biology of abiotic stress tolerance in plants
rushitahakik1
### Molecular Biology of Abiotic Stress Tolerance in Plants
Abiotic stress refers to the non-living environmental factors that can cause significant harm to plants, including drought, salinity, extreme temperatures, heavy metals, and oxidative stress. Understanding the molecular biology underlying abiotic stress tolerance is crucial for developing crops that can withstand these conditions, ensuring food security in the face of climate change and environmental degradation. Here, we explore the key molecular mechanisms, pathways, and genetic strategies plants use to cope with abiotic stress.
#### 1. Signal Perception and Transduction
**1.1. Signal Perception:**
Plants possess various sensors and receptors to detect abiotic stress signals. For instance, membrane-bound receptors such as receptor-like kinases (RLKs) and ion channels play critical roles in sensing changes in environmental conditions.
**1.2. Signal Transduction Pathways:**
Upon sensing abiotic stress, plants activate complex signal transduction pathways that involve:
- **Calcium Signaling:** Changes in cytosolic calcium levels act as secondary messengers. Calcium-binding proteins, such as calmodulins (CaMs) and calcineurin B-like proteins (CBLs), decode these signals and activate downstream responses.
- **Reactive Oxygen Species (ROS) Signaling:** ROS are produced under stress and function as signaling molecules. Controlled ROS production is crucial for activating defense mechanisms, while excessive ROS can cause cellular damage.
- **Mitogen-Activated Protein Kinase (MAPK) Cascades:** These cascades amplify the stress signal and regulate the expression of stress-responsive genes.
#### 2. Transcriptional Regulation
**2.1. Transcription Factors (TFs):**
TFs are pivotal in regulating the expression of genes involved in stress responses. Key TF families include:
- **AP2/ERF (APETALA2/ETHYLENE RESPONSE FACTOR):** Involved in drought and salinity tolerance.
- **NAC (NAM, ATAF, and CUC):** Play roles in responding to dehydration and high salinity.
- **bZIP (Basic Leucine Zipper):** Associated with responses to various stresses, including drought and oxidative stress.
- **WRKY:** Participate in the regulation of genes involved in stress responses and pathogen defense.
**2.2. Epigenetic Regulation:**
Epigenetic modifications, such as DNA methylation, histone modifications, and chromatin remodeling, influence gene expression without altering the DNA sequence. These modifications can lead to the activation or repression of stress-responsive genes.
#### 3. Stress-Responsive Genes and Proteins
**3.1. Osmoprotectants:**
Plants accumulate osmoprotectants like proline, glycine betaine, and sugars (e.g., trehalose) to maintain cellular osmotic balance under stress conditions.
**3.2. Antioxidant Defense:**
To mitigate oxidative stress, plants enhance the production of antioxidants, such as superoxide dismutase (SOD), catalase (CAT), and peroxidases, which scavenge harmful ROS.
El Nuevo Cohete Ariane de la Agencia Espacial Europea-6_Media-Kit_english.pdf
Champs Elysee Roldan
Europe must have autonomous access to space to realise its ambitions on the world stage and
promote knowledge and prosperity.
Space is a natural extension of our home planet and forms an integral part of the infrastructure
that is vital to daily life on Earth. Europe must assert its rightful place in space to ensure its
citizens thrive.
As the world’s second-largest economy, Europe must ensure it has secure and autonomous access to
space, so it does not depend on the capabilities and priorities of other nations.
Europe’s longstanding expertise in launching spacecraft and satellites has been a driving force behind
its 60 years of successful space cooperation.
In a world where everyday life – from connectivity to navigation, climate and weather – relies on
space, the ability to launch independently is more important than ever before. With the launch of
Ariane 6, Europe is not just sending a rocket into the sky, we are asserting our place among the
world’s spacefaring nations.
ESA’s Ariane 6 rocket succeeds Ariane 5, the most dependable and competitive launcher for decades.
The first Ariane rocket was launched in 1979 from Europe’s Spaceport in French Guiana and Ariane 6 will continue the adventure.
Putting Europe at the forefront of space transportation for nearly 45 years, Ariane is a triumph of engineering and the prize of great European industrial and political
cooperation. Ariane 1 gave way to more powerful versions 2, 3 and 4. Ariane 5 served as one of the world’s premier heavy-lift rockets, putting single or multiple
payloads into orbit – the cargo and instruments being launched – and sent a series of iconic scientific missions to deep space.
The decision to start developing Ariane 6 was taken in 2014 to respond to the continued need to have independent access to space, while offering efficient
commercial launch services in a fast-changing market.
ESA, with its Member States and industrial partners led by ArianeGroup, is developing new technologies for new markets with Ariane 6. The versatility of Ariane 6
adds a whole new dimension to its very successful predecessors.
Possible Anthropogenic Contributions to the LAMP-observed Surficial Icy Regol...
Sérgio Sacani
This work assesses the potential of midsized and large human landing systems to deliver water from their exhaust
plumes to cold traps within lunar polar craters. It has been estimated that a total of between 2 and 60 T of surficial
water was sensed by the Lunar Reconnaissance Orbiter Lyman Alpha Mapping Project on the floors of the larger
permanently shadowed south polar craters. This intrinsic surficial water sensed in the far-ultraviolet is thought to be
in the form of a 0.3%–2% icy regolith in the top few hundred nanometers of the surface. We find that the six past
Apollo Lunar Module midlatitude landings could contribute no more than 0.36 T of water mass to this existing,
intrinsic surficial water in permanently shadowed regions (PSRs). However, we find that the Starship landing
plume has the potential, in some cases, to deliver over 10 T of water to the PSRs, which is a substantial fraction
(possibly >20%) of the existing intrinsic surficial water mass. This anthropogenic contribution could possibly
overlay and mix with the naturally occurring icy regolith at the uppermost surface. A possible consequence is that
the origin of the intrinsic surficial icy regolith, which is still undetermined, could be lost as it mixes with the
extrinsic anthropogenic contribution. We suggest that existing and future orbital and landed assets be used to
examine the effect of polar landers on the cold traps within PSRs.
Dahlgren, Thorne and Stebbins System of Classification of Angiosperms
Gurjant Singh
The Dahlgren, Thorne, and Stebbins system of classification is a modern method for categorizing angiosperms (flowering plants) based on phylogenetic relationships. Developed by botanists Rolf Dahlgren, Robert Thorne, and G. Ledyard Stebbins, this system emphasizes evolutionary relationships and incorporates extensive morphological and molecular data. It aims to provide a more accurate reflection of the genetic and evolutionary connections among angiosperm families and orders, facilitating a better understanding of plant diversity and evolution. This classification system is a valuable tool for botanists, researchers, and horticulturists in studying and organizing the vast diversity of flowering plants.
Ethical considerations play a crucial role in research, ensuring the protection of participants and the integrity of the study. Here are some subject-specific ethical issues that researchers need
Testing the Son of God Hypothesis (Jesus Christ)
Robert Luk
Instead of answering the God hypothesis, we investigate the Son of God hypothesis. We developed our own methodology to deal with existential statements instead of universal statements, unlike science. We discuss the existence of the supernatural and find that there is strong evidence for it. Given that the supernatural exists, we report on miracles investigated in the past related to the Son of God. A Bayesian methodology is used to calculate the combined degree of belief in the Son of God hypothesis. We also report tests of the occurrences of words and numbers in the Bible to suggest the likelihood of some special numbers occurring, supporting the Son of God hypothesis. We also include a table showing past occurrences of miracles in hundred-year periods over about 1000 years. The miracles we have looked at include the Shroud of Turin, Eucharistic miracles, Marian apparitions, incorruptible corpses, etc.
1. PubChem and Big Data Chemistry
Sunghwan Kim, Ph.D., M.Sc.
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
Email: sunghwan.kim@nih.gov
2. Outline
1. What Is PubChem?
2. What Does PubChem Have?
3. Exploring Chemical Information in PubChem
4. Programmatic Access to PubChem
5. Bioactivity Prediction Model Building with PubChem Data
6. PubChem and COVID-19 Conspiracy Theories
7. Summary
4. PubChem Is a Public Chemical Information Resource
https://pubchem.ncbi.nlm.nih.gov
Public chemical database at NIH.
Contains information on various chemical entities:
• (Drug-like) small molecules
• siRNAs & miRNAs
• Carbohydrates
• Lipids
• Peptides
• Chemically modified macromolecules
• ……
5. PubChem Is a Data Aggregator
PubChem Sources: https://pubchem.ncbi.nlm.nih.gov/sources
800+ data sources:
• Government agencies
• Academic institutions
• Publishers
• Pharma companies
• Chemical vendors
• Scientific databases
Users:
o Biomedical researchers
• Chemical biology
• Medicinal chemistry
• Drug design & discovery
• Cheminformatics
o Data scientists
o Patent agents/examiners
o Chemical safety officers
o Educators/librarians
o Students
8. History of PubChem
NIH Molecular Libraries Program (MLP)
Common Fund project.
Aimed to provide academic researchers with high-throughput screening (HTS) resources for drug discovery.
Had three components (subprojects):
• Large, shared compound library
• HTS centers at academic institutions
• Central data repository (PubChem)
10. History of PubChem
PubChem was launched in 2004 as a component of MLP.
All Common Fund projects are supported only up to 10 years.
PubChem evolved to play a dual role:
• As a data archive
• As a knowledgebase
MLP components:
• Large, shared compound library
• HTS centers at academic institutions
• Central data repository (PubChem)
12. User Demographics
(June 2020 through May 2021)
[Bar chart: number of users (millions) by age group]
• 18-24: 36.5%
• 25-34: 27.4%
• 35-44: 13.5%
• 45-54: 10.4%
• 55-64: 6.7%
• 65+: 5.4%
34.64% of total users
~40% of PubChem users are aged between 18 and 24 (likely to be college students).
25. Multiple data collections in PubChem
• Compound: unique chemical structures
• Substance: depositor-provided chemical data
• BioAssay: assay descriptions & test results
• Protein, Gene, Pathway, Patent: chemical data associated with a protein/gene/pathway/patent
[Figure labels: Archive, Archive, Knowledgebase]
26. PubChem Statistics
As of November 2021, PubChem contains:
• 276 million substance descriptions
• 111 million unique chemical structures
• 292 million biological activity test results
• 1.4 million biological assays, covering 21 thousand unique protein sequence targets.
(Arguably) the largest corpus of publicly available chemical information from 800+ data sources.
27. PubChem’s Chemical Space
Property | Lipinski’s Rule of 5 (Ro5) for drug-likeness (a) | Congreve’s Rule of 3 (Ro3) for lead-likeness (b)
Molecular weight (Da) | ≤500 | ≤300
Octanol–water partition coefficient (Log P) | ≤5 | ≤3
Number of H-bond donors | ≤5 | ≤3
Number of H-bond acceptors | ≤10 | ≤3
Number of rotatable bonds | N/A | ≤3
Polar surface area (PSA, Å²) | N/A | ≤60
(a) Lipinski et al., Adv. Drug Delivery Rev. 1997, 23(1–3), 3-25.
(b) Congreve et al., Drug Discov. Today, 2003, 8(19), 876-877.
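The thresholds in the table above can be written as two simple filters. This is a minimal sketch that assumes every criterion must be met (no violations tolerated) and that the descriptor values have already been computed elsewhere. The class name, field names, and both example molecules are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container)."""
    mol_weight: float       # Da
    log_p: float            # octanol-water partition coefficient
    hbd: int                # number of H-bond donors
    hba: int                # number of H-bond acceptors
    rotatable_bonds: int
    psa: float              # polar surface area, in square angstroms


def passes_ro5(d: Descriptors) -> bool:
    """Rule of 5 thresholds as listed on the slide, applied strictly."""
    return d.mol_weight <= 500 and d.log_p <= 5 and d.hbd <= 5 and d.hba <= 10


def passes_ro3(d: Descriptors) -> bool:
    """Rule of 3 thresholds as listed on the slide, applied strictly."""
    return (d.mol_weight <= 300 and d.log_p <= 3 and d.hbd <= 3
            and d.hba <= 3 and d.rotatable_bonds <= 3 and d.psa <= 60)


# Made-up descriptor values, for illustration only.
fragment = Descriptors(150.2, 1.2, 1, 2, 2, 40.0)
drug_like = Descriptors(420.0, 3.8, 2, 6, 7, 95.0)

for label, d in [("fragment", fragment), ("drug_like", drug_like)]:
    print(label, "Ro5:", passes_ro5(d), "Ro3:", passes_ro3(d))
```

Because the Ro3 limits are tighter on every shared property, any molecule that passes this Ro3 filter also passes the Ro5 filter.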
28. PubChem’s Chemical Space
• All compounds: 110.6 million (100%)
• Lipinski’s Rule of 5 (Ro5): 78.9 million (71.36%)
• Congreve’s Rule of 3 (Ro3): 11.7 million (10.57%)
30. Bioactivity Data in PubChem
• All compounds: 110.7 million (100.00%)
• Not tested: 107.0 million (96.73%)
• Tested: 3.6 million (3.27%)
o Active (AC ≤ 1 nM): 74 thousand (0.07%)
o Active (1 nM < AC ≤ 1 µM): 777.5 thousand (0.70%)
o Active (others): 635.2 thousand (0.57%)
o Inactive: 2.1 million (1.93%)
AC: activity concentration (e.g., IC50, EC50, Ki, Kd, etc.)
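The potency bins for active compounds can be expressed as a small function. This sketch assumes activity concentrations have been converted to nanomolar and that actives without a reported AC fall into the "others" bin; that second assumption is an interpretation of the slide, not something it states. The function name is hypothetical.

```python
from typing import Optional


def bin_active_compound(ac_nm: Optional[float]) -> str:
    """Assign an active compound to a potency bin by its AC in nanomolar."""
    if ac_nm is None:
        # Assumption: actives with no reported AC are counted as "others".
        return "Active (others)"
    if ac_nm <= 1.0:
        return "Active (AC <= 1 nM)"
    if ac_nm <= 1000.0:  # 1 uM equals 1000 nM
        return "Active (1 nM < AC <= 1 uM)"
    return "Active (others)"


for value in (0.5, 250.0, 5000.0, None):
    print(value, "->", bin_active_compound(value))
```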
33. Bioactivity Data in PubChem
High-Throughput Screening data
• From Molecular Libraries Program and other HTS projects.
• Many inactives
• False hits (e.g., aggregators, autofluorescent compounds)
• Typically measured at a single concentration
Literature-extracted data
• From manual curation or data mining
• No (or few) inactives
• Provided by various PubChem depositors, including ChEMBL, PDBbind, BindingDB, and Guide to Pharmacology
34. Availability of compounds for subsequent experiments
• Virtual screening hits should be synthesizable or purchasable.
• PubChem contains “real” molecules (not “virtual” molecules).
• At least one data contributor claims to have each compound and/or information about it.
• Some of these contributors are chemical vendors (e.g., Sigma Aldrich).
36. Availability of compounds for subsequent experiments
Two important aspects of PubChem records (in the context of “compound availability”)
Non-live compounds:
Not searchable although they exist.
No associated substances due to:
o Mistakenly submitted substances
o Incorrect information
o No intention to share
Legacy designation:
The depositor no longer keeps its records up to date.
o Discontinued funding, low business priority, …
42. 42
Line notations for chemical structures
Simplified molecular-input line-entry system (SMILES)
CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5
IUPAC International Chemical Identifier (InChI)
InChI=1S/C17H21NO/c1-18(2)13-14-19-17(15-9-5-3-6-10-15)16-11-7-4-8-12-16/h3-12,17H,13-14H2,1-2H3
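An InChI string is a sequence of layers separated by "/". The sketch below splits the InChI shown on this slide into those layers. It is a minimal illustration that assumes a simple standard InChI in which each layer prefix occurs only once; the function name split_inchi_layers is hypothetical and not part of any PubChem tool.

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into version, formula, and prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    layers = {
        "version": parts[0],
        "formula": parts[1] if len(parts) > 1 else "",
    }
    # Remaining layers start with a one-letter prefix,
    # e.g. c = connectivity, h = hydrogen atoms.
    for part in parts[2:]:
        layers[part[0]] = part[1:]
    return layers


# The InChI from the slide, written as one string.
INCHI = (
    "InChI=1S/C17H21NO/c1-18(2)13-14-19-17(15-9-5-3-6-10-15)"
    "16-11-7-4-8-12-16/h3-12,17H,13-14H2,1-2H3"
)

layers = split_inchi_layers(INCHI)
for name, value in layers.items():
    print(f"{name}: {value}")
```

The formula layer (C17H21NO) can be read directly from the string, which is one practical difference from SMILES.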
46. 46
Identity Search
Depending on what you mean by “identical molecules”, you will get different search results.
What is the definition of “identity”?
→ Different tautomeric states,
Different stereoisomers,
Different isotopes,
Salt forms or mixtures, …
(ex) CHCl3 vs. CDCl3:
Both have the same chemical properties but different spectroscopic properties.
Users can search PubChem using different “nuances” of structural identity.
Chemical Structure Search
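The different "nuances" of identity can also be selected programmatically. The sketch below only builds request URLs; it assumes the PUG-REST fastidentity operation and its identity_type option (values such as same_connectivity, same_isotope, same_stereo_isotope) as described in the PUG-REST documentation, and it uses CID 6212 for chloroform, which should be verified before use. The helper name identity_search_url is hypothetical.

```python
from urllib.parse import urlencode

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def identity_search_url(cid: int, identity_type: str) -> str:
    """Build a PUG-REST identity-search URL for a query compound (by CID)."""
    query = urlencode({"identity_type": identity_type})
    return f"{BASE}/compound/fastidentity/cid/{cid}/cids/JSON?{query}"


# Assumed CID for chloroform (CHCl3); check it in PubChem before relying on it.
CHLOROFORM_CID = 6212

# Looser notion of identity: same atom connectivity, isotopes ignored.
loose_url = identity_search_url(CHLOROFORM_CID, "same_connectivity")
# Stricter notion of identity: isotopes must also match.
strict_url = identity_search_url(CHLOROFORM_CID, "same_isotope")

print(loose_url)
print(strict_url)
```

If the option behaves as documented, the looser search would be expected to return CDCl3 together with CHCl3, while the stricter one would not.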
47. 47
Substructure Search
• use a substructure as a query
• search for compounds that contain the query substructure.
Superstructure Search
• use a superstructure as a query
• search for compounds that are contained in the query superstructure.
Chemical Structure Search
48. 48
When do you use substructure searches?
ex. when you want to find all molecules that
have a particular molecular scaffold.
Cephalosporins
(a class of β-lactam antibiotics)
Substructure/Superstructure Search
Chemical Structure Search
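A substructure or superstructure query can be expressed as a URL. The sketch below assumes the PUG-REST fastsubstructure and fastsuperstructure operations with a SMILES query; the query O=C1CCN1 (azetidin-2-one, the bare β-lactam ring) is only an illustrative scaffold and is not the cephalosporin core shown on the slide. The helper name structure_search_url is hypothetical.

```python
from urllib.parse import quote

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
ALLOWED = ("fastsubstructure", "fastsuperstructure")


def structure_search_url(kind: str, smiles: str) -> str:
    """Build a substructure or superstructure search URL from a SMILES query."""
    if kind not in ALLOWED:
        raise ValueError(f"kind must be one of {ALLOWED}")
    # SMILES can contain characters such as '=', '#', or '/' that must be
    # percent-encoded before they are placed in a URL path.
    encoded = quote(smiles, safe="")
    return f"{BASE}/compound/{kind}/smiles/{encoded}/cids/JSON"


# Illustrative query: the four-membered beta-lactam ring.
BETA_LACTAM = "O=C1CCN1"

sub_url = structure_search_url("fastsubstructure", BETA_LACTAM)
print(sub_url)
```

A scaffold this small matches a very large number of compounds, so a real query would normally use a more specific scaffold.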
49. 49
Similarity Search
Why do we need similarity search?
• There is a huge imbalance of available information among compounds in PubChem.
For example, among 110.7 million compounds in PubChem,
- 3.6 million compounds (3.27 %) have been tested in at least one assay.
- 1.5 million compounds (1.34 %) have been found active in at least one assay.
- The remaining 107.1 million compounds (96.7 %) have not been tested in any assay.
• Bioactivities of these compounds may be predicted from structurally similar compounds with
known bioactivities.
• “Similarity Principle” : structurally similar compounds are likely to have similar biological
properties.
Chemical Structure Search
50. 50
How can you quantify similarity?
• Similarity is very subjective and context-dependent.
• There are many different ways to quantify similarity.
• Different similarity methods will recognize different flavors of similarity.
• PubChem uses two different similarity measures.
- 2-D similarity based on molecular fingerprints.
- 3-D similarity based on rapid-overlay of chemical structures (ROCS).
Similarity Search
Chemical Structure Search
52. 52
PubChem 2-D Similarity
• PubChem 881-bit binary fingerprints:
Each bit position represents the presence (=1) or
absence (=0) of a predefined molecular fragment.
• Structural similarity between two molecules is computed using the Tanimoto equation:
$$\mathrm{Tanimoto} = \frac{N_{AB}}{N_A + N_B - N_{AB}}$$
$N_A$: # bits set for molecule A
$N_B$: # bits set for molecule B
$N_{AB}$: # bits set for both
• Tanimoto score ranges from 0 (for no similarity) to 1 (for identical molecules).
Chemical Structure Search
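The Tanimoto equation can be sketched directly on sets of "on" bit positions. The bit positions below are made up for illustration and are not real PubChem fingerprint bits; returning 0.0 for two empty fingerprints is a convention chosen here, not something stated on the slide.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto score from the sets of bit positions set in each fingerprint."""
    n_a = len(bits_a)             # bits set for molecule A
    n_b = len(bits_b)             # bits set for molecule B
    n_ab = len(bits_a & bits_b)   # bits set for both
    denominator = n_a + n_b - n_ab
    if denominator == 0:
        return 0.0  # both fingerprints empty (assumed convention)
    return n_ab / denominator


# Hypothetical fingerprints: positions (0..880) of bits that are set.
mol_a = {1, 5, 9, 20}
mol_b = {5, 9, 20, 33, 40}

score = tanimoto(mol_a, mol_b)
print(score)
```

Here N_A = 4, N_B = 5, and N_AB = 3, so the score is 3 / (4 + 5 - 3) = 0.5.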
53. 53
PubChem 3-D Similarity
Three similarity measures:
• Shape-Tanimoto (ST): 3-D overlap between steric shapes of molecules
• Color-Tanimoto (CT): 3-D overlap between “feature” atoms
(H-bond donors/acceptors, Cationic/Anionic centers, rings and hydrophobes)
• Combo-Tanimoto (ComboT): the sum of ST and CT
Both ST and CT range from 0 to 1, and ComboT ranges from 0 to 2 (without normalization to 1).
Chemical Structure Search
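As a small worked example of the ComboT definition, the sketch below adds ST and CT after checking their ranges. The function name combo_tanimoto is hypothetical; the input values are the ST and CT scores quoted in this deck for Sulindac vs. Indomethacin.

```python
def combo_tanimoto(st: float, ct: float) -> float:
    """ComboT is the plain sum of shape (ST) and color (CT) Tanimoto scores."""
    for name, value in (("ST", st), ("CT", ct)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return st + ct  # ranges from 0 to 2, not normalized to 1


# Scores quoted in this deck for Sulindac vs. Indomethacin.
combo = combo_tanimoto(st=0.92, ct=0.52)
print(round(combo, 2))
```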
54. 54
PubChem 3-D Similarity
Chemical Structure Search
3-D similarity quantification involves optimization of superposition between two molecules:
• ST-optimization: finds the superposition that maximizes the ST score between them.
• CT-optimization: considers both CT and ST scores during the optimization.
55. 55
Why does PubChem use two different similarities?
• 2-D similarity comparison is much faster than
3-D similarity comparison
- 2-D: 10^6 comparisons per second
- 3-D: 10^2 ~ 10^3 comparisons per second
• However, 2-D similarity methods often fail to
recognize structural similarity that can be
easily recognized by 3-D similarity methods.
Chemical Structure Search
Example: CID 1548887 (Sulindac) vs. CID 3715 (Indomethacin): 2D = 0.39, ST = 0.92, CT = 0.52.
Both are non-steroidal anti-inflammatory drugs (NSAIDs) and cyclooxygenase inhibitors.
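To make the speed difference concrete, the sketch below estimates how long one query would take to compare against every compound at the quoted rates. It assumes a single process and the 110.7 million compound count quoted earlier in this deck; real search services are parallelized, so these are back-of-the-envelope numbers only.

```python
N_COMPOUNDS = 110.7e6      # compound count quoted earlier in this deck
RATE_2D = 1e6              # 2-D comparisons per second
RATE_3D_SLOW = 1e2         # 3-D comparisons per second (lower bound)
RATE_3D_FAST = 1e3         # 3-D comparisons per second (upper bound)


def scan_time_seconds(n_compounds: float, comparisons_per_second: float) -> float:
    """Time to compare one query against n_compounds at a fixed rate."""
    return n_compounds / comparisons_per_second


t_2d = scan_time_seconds(N_COMPOUNDS, RATE_2D)
t_3d_fast = scan_time_seconds(N_COMPOUNDS, RATE_3D_FAST)
t_3d_slow = scan_time_seconds(N_COMPOUNDS, RATE_3D_SLOW)

print(f"2-D: {t_2d:.1f} seconds")
print(f"3-D: {t_3d_fast / 3600:.2f} to {t_3d_slow / 3600:.2f} hours")
```

Under these assumptions a 2-D scan takes about two minutes, while a 3-D scan takes between roughly one day and two weeks.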
56. 56
Gene/Protein/Pathway Summary
Suppose that you want to:
o Retrieve ALL active compounds
against a given protein/gene/pathway target
(e.g., HMGCR=3-hydroxy-3-methylglutaryl-CoA reductase).
• To identify common chemical scaffolds responsible for bioactivity.
• To build a quantitative structure-activity relationship (QSAR) model.
→Gene/Protein/Pathway Summary
• Provides a target-centric view of PubChem data.
• Organizes all data available in PubChem for a given
gene/protein/pathway.
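The same target-centric retrieval can be approximated programmatically in two steps: list the assays for a gene symbol, then list the active compounds of each assay. The sketch below only builds the URLs; it assumes the PUG-REST assay/target/genesymbol input and the cids_type=active option as described in the PUG-REST documentation. The helper names and the AID value are hypothetical placeholders.

```python
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def aids_for_gene_url(gene_symbol: str) -> str:
    """URL listing assay IDs (AIDs) whose target matches a gene symbol."""
    return f"{BASE}/assay/target/genesymbol/{gene_symbol}/aids/JSON"


def active_cids_url(aid: int) -> str:
    """URL listing compounds (CIDs) that were active in a given assay."""
    return f"{BASE}/assay/aid/{aid}/cids/JSON?cids_type=active"


# Step 1: find assays for the example target from the slide.
step1 = aids_for_gene_url("HMGCR")
# Step 2: for each AID returned by step 1, fetch its active compounds.
# 12345 is a placeholder AID, not a real HMGCR assay.
step2 = active_cids_url(12345)

print(step1)
print(step2)
```

The union of the CID lists from step 2 would then serve as the starting set for scaffold analysis or QSAR modeling.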
58. 58
Patent Summary
Suppose that you want to:
o Retrieve ALL chemicals mentioned in a given patent document.
→Patent Summary page
• Provides a list of chemicals “mentioned” in the patent application/grant.
• No information on why they are mentioned.
(e.g., as a subject matter or as a prior art?)
• Other information, including:
- Title, abstract, date, inventor, …
- International patent classification (IPC) codes
60. 60
https://pubchem.ncbi.nlm.nih.gov/classification
Browse PubChem data using a classification of interest.
Search for records annotated with the desired classification/term.
A few examples of supported ontologies/classifications.
• MeSH (Medical Subject Headings)
• ChEBI (Chemical Entities of Biological Interest)
• FDA Pharm Classes
• PubChem Compound Table of Contents
• PubChem BioAssay Classification
• WHO ATC (Anatomical Therapeutic Chemical Classification System) Code
• WIPO International Patent Classification
Classification Browser
67. 67
PubChem users have very diverse
backgrounds/interests.
PubChem’s web interfaces are optimized
to perform commonly requested tasks
interactively.
Everything you can do with PubChem
through the web browser can be
automated through PubChem’s
programmatic interfaces.
Programmatic access enables one to do
much more complicated tasks that cannot
be done through the web browser.
68. 68
Multiple programmatic access routes
Two major programmatic access methods
o PUG-REST (primarily for computed properties).
https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest
o PUG-View (primarily for text information).
https://pubchemdocs.ncbi.nlm.nih.gov/pug-view
Request volume limitation:
o No more than 5 requests per second
(See more at: https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access$_RequestVolumeLimitations)
o Violators/abusers may be blocked for a certain period of time.
Programmatic access routes shown in the diagram: Entrez Utilities (E-Utils), Power User Gateway (PUG), PUG-SOAP, PUG-REST, PubChemRDF REST, PUG-View
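One simple way to respect the request volume limit in a script is to space requests evenly. The sketch below is a client-side throttle, not an official PubChem component; the class name Throttle and its parameters are hypothetical, and the clock and sleep functions are injectable only so the behavior can be checked without waiting.

```python
import time


class Throttle:
    """Space calls at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0  # earliest time the next call may start

    def wait(self):
        """Block until the next request is allowed, then reserve the slot."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


throttle = Throttle(max_per_second=5)
for request_number in range(3):
    throttle.wait()  # call this right before each PUG-REST or PUG-View request
    print("request", request_number, "may be sent now")
```

A more conservative setting (for example, 4 per second) leaves some margin for timing jitter.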
74. Data sets
Source               Description                                                Set             Compounds  Actives  Inactives
Tox21 (AID 1159531)  Quantitative HTS (qHTS) data for 10K compounds;            Training (90%)  4,916      471      4,445
                     predominantly inactive                                     Test (10%)      547        53       494
ChEMBL (45 assays)   Data extracted from journal articles;                      External 1      222        205      17
                     predominantly active
NCGC (2 assays)      qHTS data; predominantly inactive; some overlap w/ Tox21   External 2      489        20       469
Each source goes through a preprocessing step before use.
All data available in PubChem.
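The class imbalance in these data sets can be summarized with a few lines of arithmetic. The sketch below uses only the active and inactive counts listed on the slide; the function name summarize is hypothetical.

```python
# (actives, inactives) for each data set, as listed on the slide.
DATA_SETS = {
    "Training": (471, 4445),
    "Test": (53, 494),
    "External 1 (ChEMBL)": (205, 17),
    "External 2 (NCGC)": (20, 469),
}


def summarize(actives: int, inactives: int):
    """Return total size, fraction of actives, and inactive-to-active ratio."""
    total = actives + inactives
    return total, actives / total, inactives / actives


for name, (actives, inactives) in DATA_SETS.items():
    total, active_fraction, ratio = summarize(actives, inactives)
    print(f"{name}: n={total}, actives={active_fraction:.1%}, "
          f"inactive-to-active ratio={ratio:.2f}")
```

The training and test sets both contain a little under 10% actives, while the ChEMBL-derived set is dominated by actives, which is why a metric that is robust to imbalance matters for the evaluation.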
77. Molecular descriptors
• Generated using PaDEL [Yap CW (2011). J. Comput. Chem., 32 (7): 1466-1474]
Model Building
Abbreviation  Name                         Length
AP            AtomPairs 2D Fingerprint     780
ESTAT         Estate fingerprint           79
EXTFP*        CDK Extended Fingerprint     1,024
FP*           CDK fingerprint              1,024
GOFP*         CDK graph only fingerprint   1,024
KR            Klekota-Roth fingerprint     4,860
MACCS         MACCS fingerprint            166
PUB           PubChem fingerprint          881
SUB           Substructure fingerprint     307
* Hashed fingerprints
78. Machine-learning algorithms (implemented in scikit-learn)
Abbreviation  Name                    Hyperparameters optimized
NB            Naïve Bayes             α (10^-10 ~ 1)
DT            Decision tree           max_depth_range (3 ~ 7); min_samples_split_range (3 ~ 7); min_samples_leaf_range (2 ~ 6)
kNN           K-Nearest neighbors     weights (uniform, minkowski, jaccard); n_neighbors (1 ~ 25)
RF            Random forest           n_estimators (10 ~ 200)
SVM           Support vector machine  C (2^-10 ~ 2^10); γ (2^-10 ~ 2^10)
NN            Neural network          solver (lbfgs or adam); α (10^-7 ~ 10^7)
10-fold cross-validation was used for hyperparameter optimization.
Model Building
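The hyperparameter search can be pictured as a grid of candidate values evaluated by 10-fold cross-validation. The sketch below is library-free and only illustrates the bookkeeping: it assumes a step of one in the exponent for the SVM grid (the slide gives only the range) and uses contiguous, unshuffled folds. It is not the scikit-learn code used in the study, and the function names are hypothetical.

```python
from itertools import product


def power_grid(base: float, lowest_exp: int, highest_exp: int, step: int = 1):
    """Values base**e for exponents from lowest_exp to highest_exp inclusive."""
    return [base ** e for e in range(lowest_exp, highest_exp + 1, step)]


def k_fold_indices(n_samples: int, k: int = 10):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


# SVM grid: every (C, gamma) pair with both values in 2^-10 ... 2^10.
c_values = power_grid(2.0, -10, 10)
gamma_values = power_grid(2.0, -10, 10)
svm_grid = list(product(c_values, gamma_values))

# 10 folds over the 4,916 training compounds.
folds = list(k_fold_indices(4916, k=10))
print(len(svm_grid), "SVM settings;", len(folds), "folds")
```

Each grid point would be scored by its mean validation AUC across the folds, and the best-scoring setting kept. In practice the folds would usually be shuffled and stratified by activity class.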
79. Model Performance Evaluation
Area under the receiver operating characteristic curve (AUC)
→ Used for hyperparameter optimization.
Balanced Accuracy (BACC):
$$\mathrm{BACC} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right) = \frac{1}{2}\left(\mathrm{SENS} + \mathrm{SPEC}\right)$$
$$\text{Sensitivity (SENS)} = \frac{TP}{TP + FN}$$
$$\text{Specificity (SPEC)} = \frac{TN}{TN + FP}$$
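A short sketch shows why balanced accuracy is informative for imbalanced data. It scores a trivial classifier that labels every compound inactive, using the test-set counts quoted in this deck (53 actives, 494 inactives); the function names are hypothetical.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true actives that were predicted active."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of true inactives that were predicted inactive."""
    return tn / (tn + fp)


def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity and specificity."""
    return 0.5 * (sensitivity(tp, fn) + specificity(tn, fp))


# A classifier that calls everything inactive on the test set:
# all 53 actives are missed, all 494 inactives are correct.
tp, fn, tn, fp = 0, 53, 494, 0

plain_accuracy = (tp + tn) / (tp + fn + tn + fp)
bacc = balanced_accuracy(tp, fn, tn, fp)
print(f"accuracy = {plain_accuracy:.3f}, BACC = {bacc:.3f}")
```

Plain accuracy looks high (about 0.90) even though no active is found, whereas BACC stays at 0.5, the value expected for an uninformative model.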
80. Performance of the models
• AUC scores of ≥0.7 were observed for models developed using PubChem/MACCS/CDK-FP fingerprints with NN/SVM/RF/kNN.
• Maximum AUC score (0.77): PubChem fingerprint with RF.
• A similar trend was observed for the performance in terms of BACC scores (not shown here).
Area under ROC curve (AUC)
Model Performance Evaluation
81. General applicability of the models
[Figure: Area under ROC curve (AUC) for the ChEMBL and NCGC data sets; inactive-to-active ratio = 1]
90. 91
Continuation Application (same invention, additional claims)
• Adds new claims to a pending parent application (i.e., neither granted nor abandoned).
• Cannot change the specification of the invention.
• Has the same priority date as the parent application.
Continuation-in-Part (CIP) Application (modified invention, additional claims)
• Increases the scope of the application without having to file an entirely new application (and consequently losing the original filing date).
• Adds "enhancements" to the original invention disclosed in the parent application.
• New claims may also be added:
- Claims concerning the original invention: the same priority date as the parent application.
- Claims concerning the enhancement: the priority date is the filing date of the CIP application.
91. Patent applications, publications, and granted patents (diagram)
Provisional application: 62/240,783 (10/13/2015)
Patent applications: 15/293,211 (10/13/2016); 15/495,485 (04/24/2017); 16/273,141 (02/11/2019); 16/704,844 (12/05/2019); 16/876,114 (05/17/2020)
Patent application publications: 2017/0229149 A1 (8/10/2017); 2019/0325914 A1 (10/24/2019); 2020/0126593 A1 (4/23/2020); 2020/0279585 A1 (9/3/2020)
Patents (granted):
o 10,242,713 B2 (02/11/2019): System and method for using, processing, and displaying biometric data (20 claims)
o 10,522,188 B2 (12/31/2019): System and method for using, processing, and displaying biometric data (30 claims)
o 10,910,016 B2 (2/2/2021): System and method for using, processing, and displaying biometric data (20 claims)
o 11,024,339 B2 (6/1/2021): System and method for testing for COVID-19 (17 claims)
Links between applications in the diagram: C, C, C, CIP (C = continuation, CIP = continuation-in-part).
• The pre-pandemic applications are about a generic system/method that deals with biometric data.
• The post-pandemic application includes a modified invention and additional claims specific to COVID-19.
92. 93
How to deal with this type of misinformation
Consider PubChem as an information locator.
PubChem data are from other data sources.
More detailed information may be available at the original data source.
It is highly recommended to check the original data source.
93. 94
Provides students with training/learning opportunity for technology transfer.
Many students are not familiar with patents (in contrast to copyright/plagiarism).
In general, when there is some sort of domain-specific data that students can access,
there should be some introductory training opportunity for it.
How to deal with this type of misinformation
94. 95
• PubChem is one of the largest sources of publicly available chemical
information.
• PubChem is a data aggregator, which collects chemical information from
hundreds of data sources.
• PubChem contains chemical information useful for drug discovery.
• In addition to bioactivity data generated through high-throughput screenings,
PubChem contains a substantial amount of bioactivity information extracted
from scientific articles.
• Chemical vendor and patent information for compounds in PubChem helps
prioritize hit compounds for further screening.
Summary
95. 96
• PubChem supports multiple programmatic access routes to its data, allowing
for automating complicated and specialized tasks beyond what PubChem’s
web interface supports.
• PubChem data can be used for developing computational prediction models
for bioactivity or toxicity of molecules, in conjunction with machine learning
methods.
• PubChem is used by millions of users, but some of them often misinterpret or
misunderstand PubChem data, which needs to be addressed by PubChem
as well as at a community level.
Summary
96. 97
Acknowledgements
The PubChem Team
Evan Bolton, Jie Chen, Tiejun Chung, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Ben Shoemaker, Zhi Sun, Paul Thiessen, Bo Yu, Leonid Zaslavsky, Jian Zhang
Collaborators
Prof. Robert Belford (UALR)
Prof. Ehren Bucholtz (U. of Health Sciences and Pharmacy in St. Louis)
ACS CHED Committee on Computers in Chemical Education (CCCE)
Funding
Intramural Research Program of the National Library of Medicine
97. Thank you for your attention.
Questions?
Sunghwan Kim
(sunghwan.kim@nih.gov)