What Are InChI and InChIKey in LC-MS/MS?

When working with LC-MS/MS metabolomics or small-molecule identification, you will frequently encounter three major structure formats: SMILES, InChI, and InChIKey.

Although they all describe chemical structures, they serve very different purposes in databases, searching, and data sharing.

In practical LC-MS workflows, understanding the difference between them helps when using databases such as PubChem, HMDB, MassBank, ChemSpider, or GNPS.


Quick Summary

  • SMILES → Human-readable chemical structure notation
  • InChI → Standardized structure identifier developed by IUPAC
  • InChIKey → Short hashed version of InChI optimized for database search

The most important point is this:

InChI improves structure standardization, while InChIKey improves searchability.


What Is SMILES?

SMILES (Simplified Molecular Input Line Entry System) represents molecular structures using text strings.

Example:

CC(=O)OC1=CC=CC=C1C(=O)O

This is the SMILES notation for aspirin.

Advantages of SMILES

  • Compact and readable
  • Easy for cheminformatics software to parse and generate
  • Widely used in scripting and databases
  • Convenient for structure editing

Limitations

The same molecule can have multiple valid SMILES strings depending on atom ordering.

For example:

CCO

and

OCC

both represent ethanol.

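This inconsistency becomes problematic for large-scale database matching, because a plain string comparison treats CCO and OCC as two different compounds unless both are first canonicalized by the same toolkit.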
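The ethanol case can be checked with a cheminformatics toolkit. The sketch below assumes RDKit is installed and uses its default canonical SMILES output. The function name `canonical_smiles` is only illustrative. Canonical SMILES is toolkit-specific, so another toolkit may canonicalize to a different string.

```python
def canonical_smiles(smiles: str) -> str:
    """Return RDKit's canonical SMILES for an input SMILES string."""
    # Imported inside the function so a script can still load
    # on machines where RDKit is not installed.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    # MolToSmiles produces canonical SMILES by default.
    return Chem.MolToSmiles(mol)
```

Calling `canonical_smiles("CCO")` and `canonical_smiles("OCC")` should return the same string, which is what makes a string comparison meaningful again.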


[Figure: The chemical structure of ethanol converted into SMILES, InChI, and InChIKey formats, comparing their roles in structure representation, standardization, database searching, and LC-MS/MS compound annotation workflows.]


What Is InChI?

InChI stands for:

International Chemical Identifier

It was developed by IUPAC to create a standardized and reproducible chemical identifier.

Example (aspirin):

InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)

Unlike SMILES, InChI follows strict normalization rules.

This means:

  • the same structure yields the same standard InChI string, regardless of how it was drawn or how its atoms were numbered
  • database interoperability improves
  • duplicate entries are reduced
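Part of what makes InChI reproducible is its layered layout: a version prefix, a molecular formula, and then layers that each start with a one-letter prefix (for example `c` for connectivity and `h` for hydrogens). As a rough sketch, the ethanol InChI can be split into those layers with plain string handling. The helper name is illustrative, and the code assumes each layer prefix appears only once, which does not hold for every InChI.

```python
def split_inchi_layers(inchi: str) -> dict[str, str]:
    """Split an InChI string into version, formula, and prefixed layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("Not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:
            # The first character is the layer prefix, e.g. 'c' or 'h'.
            layers[part[0]] = part[1:]
    return layers


ETHANOL_INCHI = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
print(split_inchi_layers(ETHANOL_INCHI))
```

For real work, a toolkit or the official InChI software is the safer way to interpret layers. This only illustrates the layout.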

Why InChI Matters in LC-MS/MS

In metabolomics and unknown compound annotation, researchers often combine results from multiple databases.

Different databases may store:

  • different names
  • different synonyms
  • different SMILES strings

However, standardized InChI identifiers make cross-database comparison much more reliable.

This becomes especially important when:

  • merging spectral libraries
  • validating metabolite annotations
  • comparing vendor software results
  • exporting annotation tables

What Is InChIKey?

An InChI string can become very long.

That creates problems for:

  • web searching
  • indexing
  • database keys
  • URLs

To solve this, InChIKey was introduced.

Example:

BSYNRYMUTXBXSQ-UHFFFAOYSA-N

This is the InChIKey for aspirin.
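The URL problem is easy to see in practice. The sketch below uses only the Python standard library to check whether an identifier survives percent-encoding unchanged, and builds (without sending) a PubChem PUG REST lookup URL from an InChIKey. The helper names are illustrative, and the URL pattern should be checked against the current PubChem documentation before relying on it.

```python
from urllib.parse import quote

ASPIRIN_INCHI = (
    "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
)
ASPIRIN_INCHIKEY = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"


def needs_escaping(identifier: str) -> bool:
    """True if the identifier changes when percent-encoded for a URL path."""
    return quote(identifier, safe="") != identifier


def pubchem_cid_lookup_url(inchikey: str) -> str:
    """Build a PUG REST URL that asks PubChem for CIDs matching an InChIKey."""
    base = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
    return f"{base}/compound/inchikey/{quote(inchikey, safe='')}/cids/JSON"


print(needs_escaping(ASPIRIN_INCHI))     # InChI contains '=', '/', '(' and ','
print(needs_escaping(ASPIRIN_INCHIKEY))  # only uppercase letters and hyphens
print(pubchem_cid_lookup_url(ASPIRIN_INCHIKEY))
```

The InChI string needs escaping and the InChIKey does not, which is a large part of why the key works better in URLs and search boxes.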


InChI vs InChIKey

Feature                     | InChI     | InChIKey
Human readable              | Partially | No
Full structural information | Yes       | No
Fixed length                | No        | Yes
Database indexing           | Moderate  | Excellent
Web search friendly         | Poor      | Excellent

The key idea:

InChIKey is a fixed-length, 27-character hash of the full InChI string. It is not a compressed copy of the structure, so it cannot be decoded back into an InChI.
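The 27 characters follow a fixed pattern: a 14-letter block hashed from the molecular skeleton, an 8-letter block hashed from the remaining layers (such as stereochemistry and isotopes), a standard/non-standard flag, a version letter, and a final protonation letter. A minimal parser, with illustrative names, might look like this:

```python
import re
from typing import NamedTuple

INCHIKEY_RE = re.compile(r"([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])")


class InChIKeyParts(NamedTuple):
    connectivity: str   # 14-letter hash of the molecular skeleton
    remainder: str      # 8-letter hash of the remaining layers
    is_standard: bool   # 'S' = standard InChI, 'N' = non-standard
    version: str        # 'A' corresponds to InChI version 1
    protonation: str    # 'N' means neutral


def parse_inchikey(key: str) -> InChIKeyParts:
    """Validate the InChIKey layout and split it into its parts."""
    match = INCHIKEY_RE.fullmatch(key.strip())
    if match is None:
        raise ValueError(f"Not a valid InChIKey layout: {key!r}")
    first, second, flag, version, protonation = match.groups()
    return InChIKeyParts(first, second, flag == "S", version, protonation)


print(parse_inchikey("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
```

This only checks the layout. It cannot tell whether a key corresponds to a real compound.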


Why InChIKey Is Important for Database Search

In LC-MS/MS workflows, InChIKey is commonly used because it is:

  • short
  • standardized
  • searchable
  • database-friendly

Many public databases store an InChIKey for every compound and let you search by it, alongside their own accession numbers.

Examples include:

  • PubChem
  • HMDB
  • ChemSpider
  • GNPS
  • MassBank

If two databases contain the same compound, matching the InChIKey is often the fastest way to confirm identity consistency.
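A simple way to express that comparison in code is to check for an exact key match first, then fall back to the first 14-letter block, which is shared by entries that differ only in layers such as stereochemistry or isotopic labeling. The function names below are illustrative, and the fallback is a judgment call: it is useful when MS/MS cannot distinguish stereoisomers, but it is not proof of identity.

```python
def connectivity_block(inchikey: str) -> str:
    """First 14-letter block of an InChIKey (hash of the molecular skeleton)."""
    return inchikey.split("-")[0]


def match_level(key_a: str, key_b: str) -> str:
    """Classify two InChIKeys as 'exact', 'same skeleton', or 'different'."""
    if key_a == key_b:
        return "exact"
    if connectivity_block(key_a) == connectivity_block(key_b):
        return "same skeleton"
    return "different"


ASPIRIN = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
L_ALANINE = "QNAYBMKLOCPYGJ-REOHSXIBSA-N"
ALANINE_NO_STEREO = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"

print(match_level(ASPIRIN, ASPIRIN))
print(match_level(L_ALANINE, ALANINE_NO_STEREO))
print(match_level(ASPIRIN, L_ALANINE))
```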


Practical Example in LC-MS Annotation

Suppose your LC-MS software suggests:

  • Aspirin
  • Acetylsalicylic acid
  • 2-Acetoxybenzoic acid

These may appear as different names, but they all share the same InChIKey.

That allows you to:

  • remove duplicates
  • unify annotations
  • compare external databases reliably
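As a small sketch of the duplicate-removal step, annotations can be grouped by InChIKey so that synonyms collapse into one entry. The input list here is made up for illustration. In a real pipeline it would come from your annotation table.

```python
from collections import defaultdict


def group_by_inchikey(annotations: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (name, InChIKey) pairs so each key maps to all of its names."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name, inchikey in annotations:
        groups[inchikey].append(name)
    return dict(groups)


# Hypothetical annotation output: three names for aspirin, one for ethanol.
ANNOTATIONS = [
    ("Aspirin", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
    ("Acetylsalicylic acid", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
    ("2-Acetoxybenzoic acid", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
    ("Ethanol", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"),
]

for key, names in group_by_inchikey(ANNOTATIONS).items():
    print(key, names)
```

Four rows reduce to two compounds, with every synonym kept under its key.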

Typical Workflow in Metabolomics

A simplified workflow often looks like this:

  1. Detect precursor m/z
  2. Search candidate molecular formulas
  3. Predict or compare MS/MS fragments
  4. Retrieve candidate structures
  5. Compare InChIKey across databases
  6. Finalize annotation confidence

This is why many LC-MS annotation pipelines internally rely on standardized identifiers rather than compound names alone.
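Parts of this workflow can be sketched in a few lines. The example below covers only the precursor mass filter (steps 1 and 4) and the InChIKey comparison (step 5). It assumes an [M+H]+ adduct, a 5 ppm tolerance, and two hypothetical candidate lists. Formula generation and MS/MS scoring are left out.

```python
PROTON_MASS = 1.007276  # Da; converts an [M+H]+ m/z to a neutral mass


def ppm_error(observed: float, theoretical: float) -> float:
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def filter_by_mass(precursor_mz: float, candidates: list[dict],
                   tol_ppm: float = 5.0) -> list[dict]:
    """Keep candidates whose monoisotopic mass fits an assumed [M+H]+ ion."""
    neutral_mass = precursor_mz - PROTON_MASS
    return [c for c in candidates
            if abs(ppm_error(neutral_mass, c["mono_mass"])) <= tol_ppm]


def consensus_inchikeys(hits_by_database: dict[str, list[dict]]) -> set[str]:
    """InChIKeys that every database reports among its hits."""
    key_sets = [{hit["inchikey"] for hit in hits}
                for hits in hits_by_database.values()]
    return set.intersection(*key_sets) if key_sets else set()


# Hypothetical candidate records, as two databases might return them.
CANDIDATES = {
    "database_a": [
        {"name": "Aspirin", "mono_mass": 180.04226,
         "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
        {"name": "Ethanol", "mono_mass": 46.04186,
         "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
    ],
    "database_b": [
        {"name": "Acetylsalicylic acid", "mono_mass": 180.04226,
         "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
    ],
}

hits = {db: filter_by_mass(181.0495, records)
        for db, records in CANDIDATES.items()}
print(consensus_inchikeys(hits))
```

The two databases use different names for the surviving candidate, but the shared InChIKey shows they agree on the compound.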


Common Misunderstanding

A very common misconception is:

“InChIKey contains the full structure.”

It does not.

InChIKey is a hashed representation designed for indexing and search efficiency.

The complete structural information exists in the full InChI.


Final Thoughts

For practical LC-MS/MS interpretation:

  • Use SMILES for structure handling and cheminformatics workflows
  • Use InChI for standardized structural representation
  • Use InChIKey for database searching and cross-platform matching

In metabolomics and small-molecule annotation, InChIKey has effectively become the universal “chemical search ID” across many public databases.

Understanding this distinction makes database interpretation, annotation merging, and spectral library comparison much more reliable.


FAQ

What is the difference between SMILES and InChI?

SMILES is designed to represent chemical structures in a compact and human-readable format, while InChI is designed to create a standardized identifier for consistent database matching.

In practice:

  • SMILES is easier for manual editing and scripting
  • InChI is better for standardized compound comparison

Why do LC-MS databases use InChIKey instead of InChI?

Full InChI strings can become very long and difficult to index efficiently.

InChIKey solves this problem by providing:

  • fixed-length identifiers
  • fast database indexing
  • easier web searching
  • simpler duplicate detection

That is why most public metabolomics databases expose InChIKey as a searchable field, usually alongside their own accession IDs.


Can two different compounds share the same InChIKey?

In theory, hash collisions are possible because InChIKey maps an arbitrarily long InChI string onto a fixed-length hash.

However, collisions are extremely rare in practical LC-MS and metabolomics workflows.

For most analytical applications, InChIKey is considered sufficiently unique.
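To put a rough number on "extremely rare", a birthday-bound estimate can be computed. The sketch below assumes the 14-letter first block carries 65 bits of hash, as described in the InChI technical documentation, and that hash values are uniformly distributed. Both are simplifying assumptions, and the result is the chance of at least one collision anywhere in a collection, not the chance that a given lookup is wrong.

```python
import math

FIRST_BLOCK_BITS = 65  # assumed hash size of the 14-letter first block


def collision_probability(n_compounds: int, hash_bits: int = FIRST_BLOCK_BITS) -> float:
    """Birthday-bound estimate of at least one collision among n uniform hashes."""
    pairs = n_compounds * (n_compounds - 1) / 2
    # 1 - exp(-x), computed stably for very small x.
    return -math.expm1(-pairs / 2**hash_bits)


for n in (10**6, 10**8, 10**9):
    print(f"{n:>13,d} compounds: {collision_probability(n):.2e}")
```

Under these assumptions the estimate stays tiny for collections of millions of compounds and only reaches the percent range at around a billion entries.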


Does InChIKey contain the full molecular structure?

No.

InChIKey is only a hash of the full InChI string.

The structure cannot be reconstructed from the key alone. Recovering it requires looking the key up in a database or resolver service that stores the matching InChI.


Why can the same molecule have multiple SMILES strings?

SMILES depends on atom ordering and writing conventions.

Different software tools may generate different valid SMILES notations for the same compound.

This is one reason standardized identifiers such as InChI were developed.


Which format is best for LC-MS/MS spectral library searching?

Most spectral libraries and public databases rely heavily on InChIKey because it enables:

  • fast searching
  • cross-database matching
  • duplicate removal
  • standardized annotation workflows

However, many tools still store SMILES internally for structure visualization and cheminformatics calculations.


Is InChI better than SMILES for metabolomics annotation?

For database consistency and annotation merging, yes.

For structure editing or cheminformatics scripting, SMILES is often more convenient.

In real workflows, both are commonly used together.


Can LC-MS software automatically generate InChIKey?

Yes.

Many modern LC-MS and cheminformatics platforms can generate:

  • SMILES
  • InChI
  • InChIKey

automatically after structure assignment or database matching.

Examples include workflows connected to PubChem, HMDB, RDKit, GNPS, or ChemSpider.
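As one illustration, RDKit can derive all three identifiers from a single input structure. This is a sketch that assumes an RDKit build with InChI support. The function name is illustrative and not part of any LC-MS vendor software.

```python
def identifiers_from_smiles(smiles: str) -> dict[str, str]:
    """Return canonical SMILES, InChI, and InChIKey for an input SMILES."""
    # Imported inside the function so a script can still load
    # on machines where RDKit is not installed.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    inchi = Chem.MolToInchi(mol)
    return {
        "smiles": Chem.MolToSmiles(mol),
        "inchi": inchi,
        "inchikey": Chem.InchiToInchiKey(inchi),
    }
```

Passing the aspirin SMILES from earlier in this article, `CC(=O)OC1=CC=CC=C1C(=O)O`, should give back the aspirin InChIKey `BSYNRYMUTXBXSQ-UHFFFAOYSA-N`.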


Why is InChIKey useful in metabolomics papers?

Compound names can vary significantly between databases and publications.

Using InChIKey helps ensure:

  • reproducibility
  • unambiguous compound reporting
  • easier cross-study comparison
  • reliable database linkage

This is especially important for large untargeted metabolomics datasets.


Can peptides use InChI or InChIKey?

Yes, InChI and InChIKey can be generated for peptides, but they are more commonly used for small molecules and metabolites.

Proteomics workflows usually rely more heavily on:

  • amino acid sequences
  • FASTA identifiers
  • peptide-spectrum matches (PSMs)

rather than InChI-based identifiers.

