PubChem and Big Data Chemistry

Sunghwan Kim, Ph.D., M.Sc.
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
Email: sunghwan.kim@nih.gov

2
Outline
1. What Is PubChem?
2. What Does PubChem Have?
3. Exploring Chemical Information in PubChem
4. Programmatic Access to PubChem
5. Bioactivity Prediction Model Building with PubChem Data
6. PubChem and COVID-19 Conspiracy Theories
7. Summary

4
PubChem Is a Public Chemical Information Resource
 https://pubchem.ncbi.nlm.nih.gov
 Public chemical database at NIH.
 Contains information on various chemical entities:
• (Drug-like) small molecules
• siRNAs & miRNAs
• Carbohydrates
• Lipids
• Peptides
• Chemically modified macromolecules
• ……

5
PubChem Is a Data Aggregator
PubChem Sources: https://pubchem.ncbi.nlm.nih.gov/sources
800+ data sources:
• Government agencies
• Academic institutions
• Publishers
• Pharma companies
• Chemical vendors
• Scientific databases
Users
o Biomedical Researchers
• Chemical biology
• Medicinal chemistry
• Drug design & discovery
• Cheminformatics
o Data scientists
o Patent agents/examiners
o Chemical safety officers
o Educators/librarians
o Students

6
History of PubChem
 NIH Molecular Libraries Program (MLP)
 Common Fund project.

 Aimed to provide academic researchers with high-throughput
screening (HTS) resources for drug discovery.

 Had three components (subprojects):
• Large, shared compound library
• HTS centers at academic institutions
• Central data repository (PubChem)

 PubChem was launched in 2004 as a component of MLP.
 All Common Fund projects are supported only up to 10 years.
 PubChem evolved to play a dual role:
 As a data archive
 As a knowledgebase

11
Monthly Usage Statistics (Unique Interactive Users Only)
[Figure: unique monthly users (millions, 0 to 6) over time. Source: Google Analytics]
 5 million unique interactive users per month at peak (Oct. 2020)
 Programmatic requests are not included, so these statistics are a lower bound.

12
User Demographics
(June 2020 through May 2021)
Age      Share of users
18-24    36.5%
25-34    27.4%
35-44    13.5%
45-54    10.4%
55-64    6.7%
65+      5.4%
[Figure: number of users (millions, 0 to 6) by age group; 34.64% of total users]
~40% of PubChem users are aged between 18 and 24 (likely to be college students).

14
PubChem Data Content
• Structures and properties
• Spectra
• Chemical health & safety
• Bioactivity
• Chemical vendors & synthesis
• Drugs
• Clinical trials
• Patents
• Scientific articles

23
Dual Role of PubChem
Archive Knowledgebase


25
Multiple data collections in PubChem
Collection                        Content                                                        Role
Compound                          Unique chemical structures                                     Knowledgebase
Substance                         Depositor-provided chemical data                               Archive
BioAssay                          Assay descriptions & test results                              Archive
Protein, Gene, Pathway, Patent    Chemical data associated with a protein/gene/pathway/patent    Knowledgebase

26
PubChem Statistics
 As of November 2021, PubChem contains:
• 276 million substance descriptions
• 111 million unique chemical structures
• 292 million biological activity test results
• 1.4 million biological assays, covering 21 thousand unique protein sequence targets.
 (Arguably) the largest corpus of publicly available chemical information, drawn from 800+ data sources.

27
PubChem’s Chemical Space
Property                                       Lipinski’s Rule of 5 (Ro5)    Congreve’s Rule of 3 (Ro3)
                                               for drug-likeness a           for lead-likeness b
Molecular weight (g/mol)                       ≤500                          ≤300
Octanol–water partition coefficient (log P)    ≤5                            ≤3
Number of H-bond donors                        ≤5                            ≤3
Number of H-bond acceptors                     ≤10                           ≤3
Number of rotatable bonds                      N/A                           ≤3
Polar surface area (PSA, Å²)                   N/A                           ≤60
a Lipinski et al., Adv. Drug Delivery Rev. 1997, 23(1–3), 3-25.
b Congreve et al., Drug Discov. Today, 2003, 8(19), 876-877.
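
The thresholds above translate directly into a filter. The following is a minimal sketch, not PubChem's own implementation: the `Descriptors` class, the function names, and the example values are hypothetical and chosen only for illustration. It counts Ro5 violations (0 violations corresponds to passing Ro5) and checks all six Ro3 criteria.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Hypothetical container for the six properties used by Ro5 and Ro3."""
    mol_weight: float       # molecular weight (g/mol)
    log_p: float            # octanol-water partition coefficient
    h_donors: int           # number of H-bond donors
    h_acceptors: int        # number of H-bond acceptors
    rotatable_bonds: int    # number of rotatable bonds
    psa: float              # polar surface area (square angstroms)


def ro5_violations(d):
    """Count how many of the four Ro5 criteria are violated (0 = passes Ro5)."""
    checks = [
        d.mol_weight <= 500,
        d.log_p <= 5,
        d.h_donors <= 5,
        d.h_acceptors <= 10,
    ]
    return sum(not ok for ok in checks)


def passes_ro3(d):
    """Return True only if all six Ro3 criteria are met."""
    return (
        d.mol_weight <= 300
        and d.log_p <= 3
        and d.h_donors <= 3
        and d.h_acceptors <= 3
        and d.rotatable_bonds <= 3
        and d.psa <= 60
    )


# Illustrative values only (not retrieved from PubChem).
small = Descriptors(138.1, 2.3, 2, 3, 1, 57.5)
large = Descriptors(650.0, 6.2, 2, 8, 9, 120.0)
print(ro5_violations(small), passes_ro3(small))
print(ro5_violations(large), passes_ro3(large))
```

Counting violations rather than returning a single yes/no makes it straightforward to group compounds by the number of Ro5 criteria they fail.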

28
Set                           Compounds        % of all
All compounds                 110.6 million    100%
Lipinski’s Rule of 5 (Ro5)    78.9 million     71.36%
Congreve’s Rule of 3 (Ro3)    11.7 million     10.57%

29
Category    Compounds        % of all
Ro5         78.9 million     71.36%
Ro5−1       18.9 million     17.08%
Ro5−2       10.2 million     9.26%
Ro5−3       2.3 million      2.05%
Ro5−4       0.28 million     0.25%
Ro5 + Ro5−1 = 88.44%

30
Bioactivity Data in PubChem
Category                           Compounds         % of all
All compounds                      110.7 million     100.00%
Not tested                         107.0 million     96.73%
Tested                             3.6 million       3.27%
  Active (AC ≤ 1 nM)               74 thousand       0.07%
  Active (1 nM < AC ≤ 1 µM)        777.5 thousand    0.70%
  Active (others)                  635.2 thousand    0.57%
  Inactive                         2.1 million       1.93%
AC: activity concentration (e.g., IC50, EC50, Ki, Kd, etc.)
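
The activity classes above can be expressed as a small function. This is a minimal sketch under stated assumptions: activity concentrations are given in nM, and "Active (others)" is taken to cover actives with an AC above 1 µM or with no reported AC, which is an interpretation rather than something the slide states. The function name is hypothetical.

```python
def activity_class(outcome, ac_nm=None):
    """Assign a tested compound to one of the activity classes listed above.

    outcome: "active" or "inactive"
    ac_nm:   activity concentration in nM, or None if not reported
    """
    if outcome == "inactive":
        return "Inactive"
    if outcome != "active":
        raise ValueError("outcome must be 'active' or 'inactive'")
    if ac_nm is not None and ac_nm <= 1.0:
        return "Active (AC <= 1 nM)"
    if ac_nm is not None and ac_nm <= 1000.0:
        return "Active (1 nM < AC <= 1 uM)"
    # Assumption: weaker actives and actives without a reported AC land here.
    return "Active (others)"


print(activity_class("active", 0.5))
print(activity_class("active", 250.0))
print(activity_class("active"))
print(activity_class("inactive"))
```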

31
High-Throughput Screening data
• From Molecular Libraries Program and other HTS projects.
• Many inactives
• False hits (e.g., aggregators, autofluorescent compounds)
• Typically measured at a single concentration
Literature-extracted data
• From manual curation or data mining
• No (or few) inactives
• Provided by various PubChem depositors including: ChEMBL, PDBbind, BindingDB, Guide to Pharmacology

34
Availability of compounds for subsequent experiments
• Virtual screening hits should be synthesizable or purchasable.
• PubChem contains “real” molecules (not “virtual” molecules).
• At least one data contributor claims to have the compound and/or information about it.
• Some of these contributors are chemical vendors (e.g., Sigma Aldrich).

35
 Two important aspects of PubChem records
(in the context of “compound availability”)
 Non-live compounds:
 Not searchable although they exist.
 No associated substances due to:
o Mistakenly submitted substances
o Incorrect information
o No intention to share

 Legacy designation:
 The depositor no longer keeps its records up to date.
o Discontinued funding, low business priority, …

37
3. Exploring Chemical Information in PubChem

38
Text Query
 Chemical name
 Gene/protein name
 Pathway name
 Patent ID
 CAS registry number
 PubChem record ID
(CID, SID, AID)

39
Multiple collections are searched simultaneously.
https://pubchem.ncbi.nlm.nih.gov/#query=%22salicylic%20acid%22

40
Compound Summary for salicylic acid (CID 338)
https://pubchem.ncbi.nlm.nih.gov/compound/338

41
Chemical Structure Query
 SMILES
 InChI/InChIKey

42
Line notations for chemical structures
Simplified molecular-input line-entry system (SMILES)
CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5
IUPAC International Chemical Identifier (InChI)
InChI=1S/C17H21NO/c1-18(2)13-14-19-17(15-9-5-3-6-10-15)16-11-7-4-8-12-16/h3-12,17H,13-14H2,1-2H3

43
Multiple types of chemical structure search
 Identity
 2-D similarity
 3-D similarity
 Substructure
 Superstructure

44
Chemical Structure Search
 Identity Search
 Depending on what you mean by “identical molecules”, you will get different search results.
 What is the definition of “identity”?
→ Different tautomeric states, different stereoisomers, different isotopes, salt forms or mixtures, …

 Different isotopes:
(ex) CHCl3 vs. CDCl3: both have the same chemical properties but different spectroscopic properties.
 Users can search PubChem using different “nuances” of structural identity.

47
 Substructure Search
• use a substructure as a query
• search for compounds that contain the query substructure.
 Superstructure Search
• use a superstructure as a query
• search for compounds that are contained in the query superstructure.

48
 When do you use substructure searches?
ex. when you want to find all molecules that
have a particular molecular scaffold.
Cephalosporins
(a class of β-lactam antibiotics)
Substructure/Superstructure Search

49
 Similarity Search
 Why do we need similarity search?
• There is a huge imbalance of available information among compounds in PubChem.
For example, among 110.7 million compounds in PubChem,
- 3.6 million compounds (3.27 %) have been tested in at least one assay.
- 1.5 million compounds (1.34 %) have been found active in at least one assay.
• The remaining 107.0 million compounds (96.73%) have not been tested in any assay.
• Bioactivities of these compounds may be predicted from structurally similar compounds with
known bioactivities.
• “Similarity Principle” : structurally similar compounds are likely to have similar biological
properties.

50
 How can you quantify similarity?
• Similarity is very subjective and context-dependent.
• There are many different ways to quantify similarity.
• Different similarity methods will recognize different flavors of similarity.
• PubChem uses two different similarity measures.
- 2-D similarity based on molecular fingerprints.
- 3-D similarity based on rapid-overlay of chemical structures (ROCS).
 Similarity Search

51
 PubChem 2-D Similarity
• PubChem 881-bit binary fingerprints:
Each bit position represents the presence (=1) or
absence (=0) of a predefined molecular fragment.

• Structural similarity between two molecules is computed using the Tanimoto equation:
\[ \mathrm{Tanimoto} = \frac{N_{AB}}{N_A + N_B - N_{AB}} \]
N_A: number of bits set for molecule A
N_B: number of bits set for molecule B
N_AB: number of bits set for both
• Tanimoto score ranges from 0 (for no similarity) to 1 (for identical molecules).
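
The Tanimoto equation above can be evaluated directly on the positions of the set bits. The following is a minimal sketch: each fingerprint is represented as a Python set of "on" bit positions (an illustrative choice, not PubChem's storage format), and returning 0 when neither fingerprint has any bit set is an assumed convention.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto score from two collections of set-bit positions."""
    bits_a, bits_b = set(bits_a), set(bits_b)
    n_a = len(bits_a)            # N_A: bits set for molecule A
    n_b = len(bits_b)            # N_B: bits set for molecule B
    n_ab = len(bits_a & bits_b)  # N_AB: bits set for both
    denominator = n_a + n_b - n_ab
    if denominator == 0:
        return 0.0  # assumed convention for two empty fingerprints
    return n_ab / denominator


# Hypothetical bit positions, for illustration only.
print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 2 shared bits out of 4 distinct bits
```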

53
 PubChem 3-D Similarity: three similarity measures
• Shape-Tanimoto (ST): 3-D overlap between steric shapes of molecules
• Color-Tanimoto (CT): 3-D overlap between “feature” atoms
(H-bond donors/acceptors, cationic/anionic centers, rings and hydrophobes)
• Combo-Tanimoto (ComboT): the sum of ST and CT
 Both ST and CT range from 0 to 1, and ComboT ranges from 0 to 2 (without normalization to 1).

54
 3-D similarity quantification involves optimization of superposition between two molecules:
• ST-optimization: finds the superposition that maximizes the ST score between them.
• CT-optimization: considers both CT and ST scores during the optimization.

55
 Why does PubChem use two different similarity measures?
• 2-D similarity comparison is much faster than 3-D similarity comparison
- 2-D: 10^6 comparisons per second
- 3-D: 10^2 ~ 10^3 comparisons per second
• However, 2-D similarity methods often fail to recognize structural similarity that can be easily recognized by 3-D similarity methods.
Example: CID 1548887 (Sulindac) vs. CID 3715 (Indomethacin): 2-D = 0.39, ST = 0.92, CT = 0.52.
Both are non-steroidal anti-inflammatory drugs (NSAIDs) and cyclooxygenase inhibitors.

56
Gene/Protein/Pathway Summary
 Suppose that you want to:
o Retrieve ALL active compounds
against a given protein/gene/pathway target
(e.g., HMGCR=3-hydroxy-3-methylglutaryl-CoA reductase).
• To identify common chemical scaffolds responsible for bioactivity.
• To build a quantitative structure-activity relationship (QSAR) model.
→Gene/Protein/Pathway Summary
• Provides a target-centric view of PubChem data.
• Organizes all data available in PubChem for a given
gene/protein/pathway.

58
Patent Summary
 Suppose that you want to:
o Retrieve ALL chemicals mentioned in a given patent document.
→Patent Summary page
• Provides a list of chemicals “mentioned” in the patent application/grant.
• No information on why they are mentioned.
(e.g., as a subject matter or as a prior art?)
• Other information, including:
- Title, abstract, date, inventor, …
- International patent classification (IPC) codes

60
 https://pubchem.ncbi.nlm.nih.gov/classification
 Browse PubChem data using a classification of interest.
 Search for records annotated with the desired classification/term.
 A few examples of supported ontologies/classifications.
• MeSH (Medical Subject Headings)
• ChEBI (Chemical Entities of Biological Interest)
• FDA Pharm Classes
• PubChem Compound Table of Contents
• PubChem BioAssay Classification
• WHO ATC (Anatomical Therapeutic Chemical Classification System) Code
• WIPO International Patent Classification
Classification Browser

63
 Identifier Exchange Service
https://pubchemdocs.ncbi.nlm.nih.gov/identifier-exchange-service
 Score Matrix Service
https://pubchemdocs.ncbi.nlm.nih.gov/identifier-exchange-service
 Standardization Service
https://pubchem.ncbi.nlm.nih.gov/standardize/standardize.cgi
 PubChem Data Sources (https://pubchem.ncbi.nlm.nih.gov/sources)
 PubChem Widgets (https://pubchemdocs.ncbi.nlm.nih.gov/widgets)
 PubChem Upload (https://pubchem.ncbi.nlm.nih.gov/upload/)
 PubChem Blog (https://pubchemblog.ncbi.nlm.nih.gov)
 PubChemDocs (https://pubchemdocs.ncbi.nlm.nih.gov)
Other Tools & Services

64
4. Programmatic Access to PubChem

65
 PubChem users have very diverse
backgrounds/interests.
 PubChem’s web interfaces are optimized
to perform commonly requested tasks
interactively.

 Everything you can do with PubChem
through the web browser can be
automated through PubChem’s
programmatic interfaces.

 Programmatic access enables one to do
much more complicated tasks that cannot
be done through the web browser.

68
 Multiple programmatic access routes:
• Entrez Utilities (E-Utils)
• Power User Gateway (PUG)
• PUG-SOAP
• PUG-REST
• PubChemRDF REST
• PUG-View
 Two major programmatic access methods
o PUG-REST (primarily for computed properties).
https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest
o PUG-View (primarily for text information).
https://pubchemdocs.ncbi.nlm.nih.gov/pug-view
 Request volume limitation:
o No more than 5 requests per second
(See more at: https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access$_RequestVolumeLimitations)
o Violators/abusers may be blocked for a certain period of time.
5. Showcase: Bioactivity Prediction Model Building with PubChem Data

Retinoid X Receptor α (RXRA)
[Figure: RXRA structure, PDB ID: 1FBY]
 Involved in regulation of gene expression in various biological processes.
 Potential roles in:
• metabolic signaling pathways
• skin alopecia (spot baldness)
• dermal cysts
• cardiac development
• insulin sensitization
• ……
 Let’s build binary classifiers (i.e., active vs. inactive) for chemical modulators of RXRA.

Data sets (all data available in PubChem)
Source                 Description                                                   Set (after preprocessing)    Compounds    Actives    Inactives
Tox21 (AID 1159531)    Quantitative HTS (qHTS) data for 10K compounds;               Training (90%)               4,916        471        4,445
                       predominantly inactive                                        Test (10%)                   547          53         494
ChEMBL (45 assays)     Data extracted from journal articles; predominantly active    External 1                   222          205        17
NCGC (2 assays)        qHTS data; some overlap w/ Tox21                              External 2                   489          20         469
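
The slides do not say how the 90%/10% split of the Tox21 data was drawn. One common approach is a stratified split that keeps the active-to-inactive ratio the same in both parts. The following is a minimal sketch of that idea, offered as an assumption rather than the author's procedure; the function name and seed are hypothetical. Taking 90% of each class (rounded down) for training happens to reproduce the class counts shown above.

```python
import random


def stratified_split(labels, seed=0):
    """Split indices 90/10 within each class; returns (train_idx, test_idx)."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = len(idx) * 9 // 10  # 90% of this class, rounded down
        train.extend(idx[:n_train])
        test.extend(idx[n_train:])
    return sorted(train), sorted(test)


# Class sizes taken from the slide: 471 + 53 actives, 4,445 + 494 inactives.
labels = [1] * 524 + [0] * 4939
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), sum(labels[i] for i in train_idx))
print(len(test_idx), sum(labels[i] for i in test_idx))
```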

Model Building
 Molecular descriptors
• Generated using PaDEL [Yap CW (2011). J. Comput. Chem., 32 (7): 1466-1474]
Abbreviation    Name                           Length
AP              AtomPairs 2D fingerprint       780
ESTAT           Estate fingerprint             79
EXTFP*          CDK extended fingerprint       1,024
FP*             CDK fingerprint                1,024
GOFP*           CDK graph only fingerprint     1,024
KR              Klekota-Roth fingerprint       4,860
MACCS           MACCS fingerprint              166
PUB             PubChem fingerprint            881
SUB             Substructure fingerprint       307
* Hashed fingerprints

Model Building
 Machine-learning algorithms (implemented in scikit-learn)
Abbreviation    Name                      Hyperparameters optimized
NB              Naïve Bayes               α (10^-10 ~ 1)
DT              Decision tree             max_depth_range (3 ~ 7); min_samples_split_range (3 ~ 7); min_samples_leaf_range (2 ~ 6)
kNN             k-nearest neighbors       weights (uniform, minkowski, jaccard); n_neighbors (1 ~ 25)
RF              Random forest             n_estimators (10 ~ 200)
SVM             Support vector machine    C (2^-10 ~ 2^10); γ (2^-10 ~ 2^10)
NN              Neural network            solver (lbfgs or adam); α (10^-7 ~ 10^7)
 10-fold cross-validation was used for hyperparameter optimization.

Model Performance Evaluation
 Area under the receiver operating characteristic curve (AUC)
→ Used for hyperparameter optimization.
 Balanced accuracy (BACC):
\[ \mathrm{BACC} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right) = \frac{1}{2}\left(\mathrm{SENS} + \mathrm{SPEC}\right) \]
 Sensitivity (SENS):
\[ \mathrm{SENS} = \frac{TP}{TP + FN} \]
 Specificity (SPEC):
\[ \mathrm{SPEC} = \frac{TN}{TN + FP} \]
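
The metrics above can be computed from the four confusion-matrix counts. The following is a minimal sketch; the function names are hypothetical. The example uses the class sizes of the test set (53 actives, 494 inactives) with a trivial classifier that labels everything inactive, which shows why balanced accuracy is more informative than plain accuracy on imbalanced data.

```python
def sensitivity(tp, fn):
    """SENS = TP / (TP + FN): fraction of actives predicted active."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """SPEC = TN / (TN + FP): fraction of inactives predicted inactive."""
    return tn / (tn + fp)


def balanced_accuracy(tp, fn, tn, fp):
    """BACC = (SENS + SPEC) / 2."""
    return 0.5 * (sensitivity(tp, fn) + specificity(tn, fp))


# A classifier that predicts "inactive" for all 547 test compounds.
tp, fn, tn, fp = 0, 53, 494, 0
plain_accuracy = (tp + tn) / (tp + fn + tn + fp)
print(round(plain_accuracy, 3))           # high, despite finding no actives
print(balanced_accuracy(tp, fn, tn, fp))  # no better than chance
```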

Model Performance Evaluation: Area under ROC curve (AUC)
 Performance of the models
 AUC scores of ≥0.7 were observed for models developed using PubChem/MACCS/CDK-FP fingerprints with NN/SVM/RF/kNN.
 Maximum AUC score (0.77): PubChem fingerprint with RF
 A similar trend was observed for performance in terms of BACC scores (not shown here).

General applicability of the models
[Figure: area under ROC curve (AUC) for the NCGC and ChEMBL data sets; inactive-to-active ratio = 1]

83
6. PubChem and COVID-19 Conspiracy Theories

84
Page Views for Hydroxychloroquine (in 2020)
[Figure: page views (0 to 14,000) by month, Jan to Dec 2020, for hydroxychloroquine, remdesivir, and dexamethasone]
Annotated events:
• 3/17: Univ. of Minnesota begins testing hydroxychloroquine.
• 3/30: Emergency Use Authorization of hydroxychloroquine.
• 4/8: Trump said “What do you have to lose?”
• 5/18: Trump said he had been taking it.
• 7/28: Trump said he still thought it worked.
Drugs used for standard treatment of COVID-19 had fewer page views.

86
Source: https://silview.media/2020/10/04/atomic-bombshell-rothschilds-patented-covid-19-biometric-tests-in-2015-and-2017
System and Method for
Testing for COVID-19
(US 2020279585 A1)
Priority date:
2015-10-13

87
Source: https://www.reuters.com/article/uk-factcheck-patent/fact-check-rothschild-did-not-patent-a-test-for-covid-19-in-2015-and-2017-idUSKBN27C34O

90
This is a continuation-in-part application.

91
Continuation Application (same invention, additional claims)
 Adds new claims to a pending parent application (i.e., neither granted nor abandoned).
 Cannot change the specification of the invention.
 Has the same priority date as the parent application.
Continuation-in-Part (CIP) Application (modified invention, additional claims)
 Increases the scope of the application without having to file an entirely new application (and consequently losing the original filing date).
 Adds "enhancements" to the original invention disclosed in the parent application.
 New claims may also be added:
• Claims concerning the original invention: the same priority date as the parent application.
• Claims concerning the enhancement: the priority date is the filing date of the CIP application.

[Figure: patent family diagram; C = continuation, CIP = continuation-in-part]
Provisional: 62/240,783 (10/13/2015)
Patent applications: 15/293,211 (10/13/2016); 15/495,485 (04/24/2017); 16/273,141 (02/11/2019); 16/704,844 (12/05/2019); 16/876,114 (05/17/2020)
Patent application publications: 2017/0229149 A1 (8/10/2017); 2019/0325914 A1 (10/24/2019); 2020/0126593 A1 (4/23/2020); 2020/0279585 A1 (9/3/2020)
Patents (granted): 10,242,713 B2 (02/11/2019); 10,522,188 B2 (12/31/2019); 10,910,016 B2 (2/2/2021); 11,024,339 B2 (6/1/2021)
Titles: “System and method for using, processing, and displaying biometric data” (20 claims); “System and method for testing for COVID-19” (17 claims)
Application links shown in the figure: C, C, C, CIP
• The pre-pandemic applications are about a generic system/method that deals with biometric data.
• The post-pandemic application includes a modified invention and additional claims specific to COVID-19.

93
How to deal with this type of misinformation
 Consider PubChem as an information locator.
 PubChem data are from other data sources.
 More detailed information may be available at the original data source.
 It is highly recommended to check the original data source.

94
 Provide students with training/learning opportunities in technology transfer.
 Many students are not familiar with patents (in contrast to copyright/plagiarism).
 In general, when students can access some sort of domain-specific data, there should be an introductory training opportunity for it.

95
Summary
• PubChem is one of the largest sources of publicly available chemical information.
• PubChem is a data aggregator, which collects chemical information from hundreds of data sources.
• PubChem contains chemical information useful for drug discovery.
• In addition to bioactivity data generated through high-throughput screenings, PubChem contains a substantial amount of bioactivity information extracted from scientific articles.
• Chemical vendor and patent information for compounds in PubChem helps prioritize hit compounds for further screening.
• PubChem supports multiple programmatic access routes to its data, allowing for automating complicated and specialized tasks beyond what PubChem’s web interface supports.
• PubChem data can be used for developing computational prediction models for bioactivity or toxicity of molecules, in conjunction with machine learning methods.
• PubChem is used by millions of users, but some of them often misinterpret or misunderstand PubChem data, which needs to be addressed by PubChem as well as at a community level.

97
Acknowledgements
 The PubChem Team
Evan Bolton Jia He Thiessen Paul Zhi Sun
Jie Chen Siqian He Bo Yu
Tiejun Chung Qingliang Li Leonid Zaslavsky
Asta Gindulyte Ben Shoemaker Jian Zhang
 Collaborators
Prof. Robert Belford (UALR)
Prof. Ehren Bucholtz (U. of Health Sciences and Pharmacy in St. Louis)
ACS CHED Committee on Computers in Chemical Education (CCCE)
 Funding
Intramural Research Program of the National Library of Medicine

Thank you for your attention.
Questions?
Sunghwan Kim
(sunghwan.kim@nih.gov)
