
Getting Started with GNPS Data Using Python and Pandas

How to download, understand, clean, and analyze the GNPS JSON dataset

by Markus Schmitt

If you've ever analyzed Mass Spectrometry data, you might have come across the Global Natural Products Social Molecular Networking (GNPS) site. It's a wonderful resource, offering a powerful platform where you can upload your data and get back results. But sometimes, it's more useful to download their data and work with it directly.

[You can find a Jupyter Notebook version of this article here.]

Let’s see how to make sense of this large dataset and what cleaning it needs so that further analysis isn’t skewed.

This article is a guide to using Jupyter Notebook, Python, and Pandas to:

  • Get the main “ALL_GNPS” dataset: a 4GB JSON collection of all publicly available mass spectrometry data in GNPS, including peaks, assignments, and structures;
  • Load it into a Pandas dataframe;
  • Understand all the columns and how best to work with them;
  • Normalize and clean the data and understand some caveats;
  • Understand the InChI and SMILES columns and their hashed indexes;
  • Find unique and partially unique rows using the InChIKey;
  • Analyze the peaks by generating histograms.

This guide will get you comfortable with the dataset’s structure, show you how to manipulate the data, and give you a solid base for tackling your own problems.

Downloading the JSON data from GNPS

The first thing you'll need is the JSON file. This contains all of the GNPS data, which you can grab from https://gnps-external.ucsd.edu/gnpslibrary/ALL_GNPS.json

Copy that URL and start up a Jupyter Notebook. You could use pandas or requests to download the file, but we recommend installing the wget utility instead: it shows the progress of your download, which is handy because the full file takes a few minutes to download, even on a fast connection.

# install wget
!pip3 install wget
# >>> Requirement already satisfied: wget in ./jupenv/lib/python3.8/site-packages (3.2)

import wget

all_gnps_url = "https://gnps-external.ucsd.edu/gnpslibrary/ALL_GNPS.json"
wget.download(all_gnps_url)

# >>> 100% [..................................................] 3933553559 / 3933553559
# >>> 'ALL_GNPS.json'
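If you'd rather not install an extra utility, here's a minimal sketch of the same download using requests (the other option mentioned above). It streams the response to disk in chunks, so the ~4GB file never has to fit in memory, though you give up wget's progress display.

import requests

all_gnps_url = "https://gnps-external.ucsd.edu/gnpslibrary/ALL_GNPS.json"

# Stream the ~4GB response straight to disk in 1MB chunks
with requests.get(all_gnps_url, stream=True) as response:
    response.raise_for_status()
    with open("ALL_GNPS.json", "wb") as f:
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)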

Reading the JSON file as a Pandas dataframe

To analyze and manipulate the data, read the JSON file into a Pandas dataframe.

Import the pandas and json libraries, read in the file, and display the first few rows to see what the data looks like, as follows.

%%time
import pandas as pd
import json

filename = "ALL_GNPS.json"
df = pd.read_json(filename)
df.head()

# >>> CPU times: user 24.8 s, sys: 5.18 s, total: 30 s
# >>> Wall time: 29.2 s

This reads the entire file into memory, so you'll need more than 4GB of free RAM, and it might take a couple of minutes. Once it has completed, you'll see a table containing the first five rows of the dataset and some of the columns (note the ... column in the middle, where some columns are hidden in this summary view).

[Figure: Pandas’ summary view of the dataframe. Note that not all columns are displayed.]
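Since the whole file now lives in memory, it can be worth checking how much RAM the dataframe actually occupies. A quick sketch; deep=True also measures the Python string objects, so it takes a while on a frame this size:

# Approximate in-memory footprint of the dataframe, in GB
df.memory_usage(deep=True).sum() / 1e9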

To find out more about the data, you can look at the shape of the table.

df.shape
# >>> (651979, 38)

The data has just over 650,000 rows and 38 columns. This dataset is often updated, so you'll likely have a few more rows.

You could view all the columns in the summary view by running pd.set_option("display.max_columns", 50). But it’s often easier to simply view a list of the column names as follows.

df.columns

# >>> Index(['spectrum_id', 'source_file', 'task', 'scan', 'ms_level',
       'library_membership', 'spectrum_status', 'peaks_json', 'splash',
       'submit_user', 'Compound_Name', 'Ion_Source', 'Compound_Source',
       'Instrument', 'PI', 'Data_Collector', 'Adduct', 'Scan', 'Precursor_MZ',
       'ExactMass', 'Charge', 'CAS_Number', 'Pubmed_ID', 'Smiles', 'INCHI',
       'INCHI_AUX', 'Library_Class', 'SpectrumID', 'Ion_Mode', 'create_time',
       'task_id', 'user_id', 'InChIKey_smiles', 'InChIKey_inchi',
       'Formula_smiles', 'Formula_inchi', 'url', 'annotation_history'],
      dtype='object')

Some of the most useful columns include:

  • peaks_json: A list of all peaks associated with the spectrum. Each peak is a pair of m/z (first element) and intensity (second element);
  • Ion_Mode: A string value that separates the data into ionization modes;
  • InChIKey_smiles: A hash of the SMILES representation of the structure;
  • InChIKey_inchi: A hash of the InChI representation of the structure;
  • PI: The principal investigator;
  • SpectrumID: A unique ID for each spectrum in the dataset;
  • Compound_Name: A free-text field that generally contains the compound name and sometimes contains additional metadata such as “Putative Digalactosyl Diacylglycerol (DGDG); 14:0/18:5”;
  • Precursor_MZ: The mass-to-charge ratio of the precursor (parent) ion.
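To get a first feel for these fields, you can view just a handful of them together. A quick sketch; any subset of the column names above works:

# Peek at a few of the most useful columns for the first rows
cols = ["SpectrumID", "Compound_Name", "Ion_Mode", "Adduct", "Precursor_MZ"]
df[cols].head()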

Let's take a closer look at some of these.

Understanding the peaks_json data

Perhaps the most interesting column is peaks_json. You’ll probably want to compare this to your own spectral data. You can view all the peak data for the spectrum at index 12 as follows (this is a usefully short one to view as a sample).

df.peaks_json[12]

'[[114.997688,18.585751],[120.076920,11.440020],[177.034821,19.897190],[216.851959,15.350990],[373.162964,88.845261],[374.164093,48.464008],[374.236267,12.960160],[453.903107,38.544971]]'

This looks like a nested Python list. But it’s actually still a string at this point, as Pandas hasn't parsed the nested JSON structures for you. If you try to get a specific value by index, you'll only get the character at that position of the string. For example, the following code returns '4', as that’s the character at index 4 of the string.

df.peaks_json[12][4]
# >>> '4'

You can easily fix this by parsing each data value as JSON. This will convert it into a more useful Python data object. As the dataset is large, this step may take several minutes to process.

%%time
df.peaks_json = df.peaks_json.apply(json.loads)

# >>> CPU times: user 1min 43s, sys: 2.91 s, total: 1min 46s
# >>> Wall time: 1min 46s

Now the data is stored as a list of lists of floats, which means you can retrieve data points more easily. The same index now returns the m/z and intensity measurements of the peak at position 4.

df.peaks_json[12][4]
# >>> [373.162964, 88.845261]
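Now that each entry is a real Python list, you can plot a single spectrum directly. Here's a minimal sketch using matplotlib (which we also install and use for the histograms later); it unzips the peak pairs into m/z and intensity values:

from matplotlib import pyplot as plt

# Unzip the [m/z, intensity] pairs for the spectrum at index 12
mz, intensity = zip(*df.peaks_json[12])

# One vertical stem per peak: m/z on the x-axis, intensity on the y-axis
plt.stem(mz, intensity)
plt.xlabel("m/z")
plt.ylabel("Intensity")
plt.show()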

We'll look at the peaks_json column in more depth soon. But first let's do some cleaning on the other columns.

Fixing the Ion_Mode inconsistencies

Because the data is user-contributed, it's not always consistent. A good example of this is Ion_Mode, which contains inconsistencies in spacing and capitalization. These could skew your analysis if you’re trying to select all the positive or all the negative examples, so you’ll want to normalize the column to use consistent names.

Take a look at the unique values in this column:

df.Ion_Mode.value_counts()

Positive         513823
Negative          85079
positive          36162
negative          14732
                   1284
 Positive           381
N/A                 377
 Negative           139
Positive-20eV         2
Name: Ion_Mode, dtype: int64

Let’s normalize these to just three values ("Positive", "Negative", "N/A") as follows:

def norm_ion(inp):
    inp = inp.lower()
    if "positive" in inp:
        return "Positive"
    elif "negative" in inp:
        return "Negative"
    else:
        return "N/A"

df['Ion_Mode'] = df.Ion_Mode.apply(norm_ion)
df.Ion_Mode.value_counts()

Positive    550368
Negative     99950
N/A           1661
Name: Ion_Mode, dtype: int64

This overwrites the Ion_Mode column with the normalized values. You can see that everything now fits cleanly into one of the three options.
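Now that the values are normalized, filtering by ionization mode is reliable. For example, to work with only the positive-mode spectra:

# Select only the positive-mode spectra
df_positive = df[df.Ion_Mode == "Positive"]
df_positive.shape
# >>> (550368, 38)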

Understanding InChIKey_inchi and InChIKey_smiles

InChI and SMILES are two different ways to represent a chemical structure, and there is some debate as to which is better. We'll focus more on InChI and less on SMILES because InChI is more consistent. Both are useful compact representations that can act as an index to the full spectrum.

InChI is canonicalized (there can only be a single valid representation), while a SMILES string varies based on the researcher who prepared it. Let’s have a look at the difference in an example:

df["INCHI"][1]
# >>> 'InChI=1S/C45H73N5O10S3/c1-14-17-24(6)34(52)26(8)37-25(7)30(58-13)18-31-46-29(19-61-31)39-49-45(12,21-62-39)43-50-44(11,20-63-43)42(57)48-32(22(4)15-2)35(53)27(9)40(55)59-36(23(5)16-3)38(54)47-33(28(10)51)41(56)60-37/h19,22-28,30,32-37,51-53H,14-18,20-21H2,1-13H3,(H,47,54)(H,48,57)/t22-,23-,24+,25-,26-,27+,28+,30-,32-,33-,34-,35-,36-,37-,44+,45+/m0/s1'

df["Smiles"][1]
# >>> 'CCC[C@@H](C)[C@@H]([C@H](C)[C@@H]1[C@H]([C@H](Cc2nc(cs2)C3=N[C@](CS3)(C4=N[C@](CS4)(C(=O)N[C@H]([C@H]([C@H](C(=O)O[C@H](C(=O)N[C@H](C(=O)O1)[C@@H](C)O)[C@@H](C)CC)C)O)[C@@H](C)CC)C)C)OC)C)O'

SMILES and InChI also have a hashed representation (called an “InChIKey”). This is a more compact representation, useful for deduplication, among other things. In many cases, the hashes derived from the SMILES and the InChI are the same. The first group of characters is a hash of the connectivity, the second covers the stereochemistry and other layers, and the final section represents the protonation layer.

[Figure: The three parts of an InChIKey: the connectivity block, the block covering stereochemistry and other layers, and the protonation character.]

As before, this data is user-contributed so it doesn't always look as you'd expect. Specifically, the SMILES and InChI hashes do not always match, though they should. You can read more about how the hash is constructed in this article.

The second row of the dataset (index 1) is an example where the InChIKeys from the InChI and the SMILES match completely, as expected.

df["InChIKey_inchi"][1]
# >>> 'KNGPFNUOXXLKCN-ZNCJFREWSA-N'
df["InChIKey_smiles"][1]
# >>> 'KNGPFNUOXXLKCN-ZNCJFREWSA-N'
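You can pull the three blocks apart with a plain string split. A quick sketch using the key above:

# Split an InChIKey into its three hyphen-separated blocks
key = df["InChIKey_inchi"][1]  # 'KNGPFNUOXXLKCN-ZNCJFREWSA-N'
connectivity, other_layers, protonation = key.split("-")
print(connectivity)  # >>> KNGPFNUOXXLKCN (main layer / connectivity)
print(other_layers)  # >>> ZNCJFREWSA (stereochemistry and other layers)
print(protonation)   # >>> N (protonation layer)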

Let’s look at an example where the hashes do not match. We can find one of these at index 78 of our dataset. Print out both the SMILES and the InChI representation by running the following code:

print(df['InChIKey_inchi'][78] + "\n"  + df['InChIKey_smiles'][78])
# >>> VCQKVUNLUXMOER-FNMPZSNGSA-N
# >>> VCQKVUNLUXMOER-XWXFOZCPSA-N

Note how the first group of characters matches, but the second doesn’t (FNM… vs XWX…). One of these likely has a stereochemistry assignment error. But it's difficult to tell which one without looking this up in a different database, such as PubChem.
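To gauge how common such mismatches are, you can compare the two key columns across the whole dataset. A rough sketch; it assumes missing keys are stored as empty strings or the literal string "None", which is what the value counts in the next section suggest:

# Count rows where both InChIKeys are present but disagree
missing = {"", "None"}
present = ~df["InChIKey_inchi"].isin(missing) & ~df["InChIKey_smiles"].isin(missing)
(present & (df["InChIKey_inchi"] != df["InChIKey_smiles"])).sum()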

Finding unique and partially-unique structures

Use the InChIKey_inchi column to find duplicates as follows:

# Find rows with completely unique hashes
df["InChIKey_inchi"].value_counts()

                              276355
None                            11972
AIONOLUJZLIMTK-UHFFFAOYSA-N      1063
RWXIFXNRCLMQCD-UHFFFAOYSA-N       988
LPLVUJXQOOQHMX-UHFFFAOYSA-N       943
                                ...  
MALFODICFSIXPO-KFKQDBFTSA-N         1
IVUFRBWVXKXZMM-UHFFFAOYSA-N         1
JWPAJVNQGTWPMI-UHFFFAOYSA-N         1
XLGNZQXMDVSKOV-UHFFFAOYSA-N         1
NMKCCEHYPWXWLG-ATTOZUNXSA-N         1
Name: InChIKey_inchi, Length: 20721, dtype: int64

So of the 650,000+ rows, there are only around 20,000 distinct InChIKeys. The first 14 characters represent the main layer of the molecule, so it's also interesting to look at unique counts for just that block. Do this as follows:

# Find rows which have unique main layer / connectivity hashes
df['InChIKey_inchi'].apply(lambda x: x.split("-")[0]).value_counts()

                 276355
None               11972
AIONOLUJZLIMTK      1118
RWXIFXNRCLMQCD       995
LPLVUJXQOOQHMX       987
                   ...  
QHYBPAALYZNWPK         1
RNEHQZRKZJSYOL         1
ACSWAJLDOHJFNA         1
XLTWUPQILPHRFU         1
BTXKFKMDVYRJEU         1
Name: InChIKey_inchi, Length: 17633, dtype: int64

You can see there are around 17,600 distinct connectivity hashes. The duplicate entries aren't necessarily useless, so we'll leave them in place and keep in mind that they exist.
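If your analysis does call for one spectrum per structure, you can deduplicate on the key. A minimal sketch; it keeps the first occurrence of each key, and which representative to keep is a judgment call:

# Keep one row per InChIKey (the first occurrence)
df_unique = df.drop_duplicates(subset="InChIKey_inchi")
df_unique.shape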

How many peaks should my samples have?

You probably noticed that some of the samples have hundreds or thousands of peaks, while others have only a few. You can quickly analyze the distribution of the number of peaks per sample as follows:

num_peaks = df.peaks_json.apply(len)

# !pip3 install matplotlib
from matplotlib import pyplot as plt
num_peaks.hist()
[Figure: Histogram of the number of peaks per sample. Only one bar, between 0 and 50,000, is visible; the majority of the data has only a small number of peaks.]

You can see that the large majority of the samples have relatively few peaks. In fact, most have fewer than 500 peaks recorded.

num_peaks.count()
# >>> 651979

num_peaks[num_peaks < 500].count()
# >>> 618776

Another histogram of only the samples with fewer than 500 peaks shows that the data still skews heavily toward samples with fewer peaks.

num_peaks[num_peaks < 500].hist()
[Figure: Number of peaks for samples with < 500 peaks, in buckets of 100. The data still skews left, with about 400,000 samples in the left-most bucket and a quick fall-off to below 50,000.]

Samples with 1–100 peaks are generally “normal.” You can be more confident of the data for samples in this range, because small-molecule spectra that contain hundreds of peaks can indicate noisy spectra. Remember that comparisons done on data containing too much noise might not be valid.
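If you want to restrict an analysis to that range, the num_peaks series from above doubles as a filter. A short sketch; the 1–100 window is just the rule of thumb mentioned here, not a hard rule:

# Keep only spectra whose peak count falls in the 1-100 range
df_normal = df[(num_peaks >= 1) & (num_peaks <= 100)]
df_normal.shape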

Where next?

You now have a good grasp of how to work with GNPS data using Python and Pandas.

For more advanced analysis, take a look at Omigami, an open source Python and R package that gives you access to scalable APIs for the latest metabolomics algorithms.
