phenomedb.compounds

PhenomeDB enables the storage of annotation metadata such as chemical references and classes, and has a data model and import processes capable of harmonising annotations to their analytical specificity.

The minimum information required for import is the compound name (as annotated) and the InChI (if available). If the specificity of the annotation is low, multiple compounds and InChIs can be recorded per annotation. With this minimum information, PhenomeDB can look up and record the following external references and classes and make them queryable and reportable.

Databases: PubChem, ChEBI, ChEMBL, ChemSpider, LipidMAPS, HMDB

Classes: LipidMAPS, HMDB, ClassyFIRE
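To make the minimum import information concrete, the following is a minimal sketch of how one annotation with one or more candidate compounds could be represented. The class names `CompoundInput` and `AnnotationInput` are hypothetical and are not part of PhenomeDB; they only illustrate the "name plus optional InChI, possibly several per annotation" shape described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CompoundInput:
    """One candidate compound for an annotation (hypothetical container)."""
    name: str
    inchi: Optional[str] = None  # the InChI is only supplied "if available"


@dataclass
class AnnotationInput:
    """An annotation as reported, with one or more candidate compounds."""
    annotated_name: str
    compounds: List[CompoundInput] = field(default_factory=list)

    def is_ambiguous(self) -> bool:
        # A low-specificity annotation maps to more than one compound.
        return len(self.compounds) > 1


# A specific annotation: exactly one compound.
ethanol = AnnotationInput(
    annotated_name="Ethanol",
    compounds=[CompoundInput("Ethanol", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")],
)

# A low-specificity annotation: two isomers share the annotated name.
propanol = AnnotationInput(
    annotated_name="Propanol",
    compounds=[
        CompoundInput("1-Propanol", "InChI=1S/C3H8O/c1-2-3-4/h4H,2-3H2,1H3"),
        CompoundInput("2-Propanol", "InChI=1S/C3H8O/c1-3(2)4/h3-4H,1-2H3"),
    ],
)
```

Here `propanol.is_ambiguous()` is true because two InChIs are recorded against a single annotated name.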

Figure: The ImportCompoundTask overview, which looks up compound metadata and populates the database

Compound metadata can be imported from PeakPantheR region-of-interest (ROI) files for LC-MS annotations. Recent versions of these can be found in ./phenomedb/data/compounds/.

To import the ROI compound data, use the tasks ImportROICompounds and ImportROILipids.
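As a rough sketch of what an ROI import call could look like, the snippet below builds keyword arguments that mirror the ImportROICompounds constructor signature listed in this reference and passes them to the task. The file name, assay name, version and username are placeholders, and the helper `run_roi_import` is hypothetical. It calls `process()` because that is the main method documented here; whether tasks are normally launched that way or through a pipeline wrapper is not stated in this section.

```python
# Keyword names mirror the documented ImportROICompounds signature.
# All values are placeholders and must be replaced with real ones.
roi_import_kwargs = {
    "roi_file": "./phenomedb/data/compounds/<your-roi-file>.csv",
    "assay_name": "<assay-name>",
    "roi_version": "<roi-version>",
    "username": "<username>",
}


def run_roi_import(kwargs):
    """Instantiate the import task and call its documented process() method."""
    # Imported lazily so this snippet can be read without PhenomeDB installed.
    from phenomedb.compounds import ImportROICompounds

    task = ImportROICompounds(**kwargs)
    task.process()
    return task
```

Calling `run_roi_import(roi_import_kwargs)` requires a configured PhenomeDB installation and database, so it is left uncalled here.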

IVDr annotation metadata can be imported using ImportBrukerBiLISACompounds and ImportBrukerBiQuantCompounds. The source data are available in ./phenomedb/data/compounds/.

Once imported, compounds and compound classes can be explored using the Compound View UI.

Figure: The Compound List View, showing a searchable, paginated table of imported compounds

Figure: The Compound View, showing the imported information for one compound, with links to external databases

class phenomedb.compounds.AddMissingClassyFireClasses(username=None, task_run_id=None, db_env=None, db_session=None, execution_date=None, pipeline_run_id=None, upstream_task_run_id=None)

Update the Compound properties and external references from the InChI lookups

Parameters:

CompoundTask (phenomedb.compounds.CompoundTask) – The Base CompoundTask

load_data()

Load the data

loop_and_map_data()

Loop over compounds and update the references

class phenomedb.compounds.CleanROIFile(roi_file=None, roi_dtypes=None, assay_name=None, merged_file=None, replace_fields=False, pipeline_run_id=None, replace_missing=False, fields_to_replace=[], fields_to_ignore=[], cpds_to_replace=[], cpds_to_ignore=[], username=None, task_run_id=None, db_env=None, db_session=None, execution_date=None)

Clean an ROI file. Takes an ROI file, checks IDs from source, adds missing fields, and writes the results out to a log file.

Parameters:

CompoundTask (phenomedb.compounds.CompoundTask) – The Base CompoundTask

Raises:
  • Exception

  • ROICleanCheckFail

check_cas(subrow, pubchem_data)

Check the CAS number from pubchem against the row data

Parameters:
  • subrow (pandas.Series) – The subrow containing the data

  • pubchem_data (dict) – The pubchem record

Returns:

The subrow containing the data

Return type:

pandas.Series

check_chebi(subrow)

Check ChEBI using the InChI

Parameters:

subrow (pandas.Series) – The subrow containing the data

Returns:

The subrow containing the data

Return type:

pandas.Series

check_chembl(subrow, inchi_key)

Check found ChEMBL against the ChEMBLID

Parameters:
  • subrow (pandas.Series) – The subrow containing the data

  • inchi_key (str) – The InChI Key to use to search ChEMBL

Returns:

The subrow containing the data

Return type:

pandas.Series

check_chemspider(subrow, inchi_key)

Check the found ChemSpider ID against the ChemSpider ID in the row.

Only works for those without IDs.

Parameters:
  • subrow (pandas.Series) – The subrow containing the data

  • inchi_key (str) – The InChI Key to use to search ChemSpider

Returns:

The subrow containing the data

Return type:

pandas.Series

check_classyfire(subrow, inchi_key)

Check classyfire fields

Parameters:
  • subrow (pandas.Series) – The subrow containing the data

  • inchi_key (str) – The InChI Key to use to search CLASSYFIRE

Returns:

The subrow containing the data

Return type:

pandas.Series

check_field(found_value, field_name, subrow)

Check an individual field’s value and update if specified to

Parameters:
  • found_value (object) – The value of the property found via DB lookup

  • field_name (str) – The name of the field to check

  • subrow (pandas.Series) – The subrow containing the fields

Raises:
  • ROICleanCheckFail

Returns:

The subrow containing the fields

Return type:

pandas.Series
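The check-and-optionally-replace behaviour can be sketched as below. This is an illustration only: it uses a plain dict instead of a pandas.Series, the exception `FieldCheckFail` is a hypothetical stand-in for ROICleanCheckFail, the column names are invented, and the branching rules are assumptions inferred from the CleanROIFile options (`replace_fields`, `replace_missing`, `fields_to_replace`, `fields_to_ignore`), not the library's actual logic.

```python
class FieldCheckFail(Exception):
    """Hypothetical stand-in for the ROICleanCheckFail error."""


def check_field(found_value, field_name, subrow, *, replace_fields=False,
                replace_missing=False, fields_to_replace=(),
                fields_to_ignore=()):
    """Compare a looked-up value with the value stored in the row.

    Returns a (possibly updated) copy of the row and raises FieldCheckFail
    when the two values disagree and replacement is not enabled.
    """
    updated = dict(subrow)  # never mutate the caller's row

    # Ignored fields and failed lookups leave the row untouched.
    if field_name in fields_to_ignore or found_value is None:
        return updated

    current = updated.get(field_name)

    # The row has no value: fill it only if the settings allow it.
    if current in (None, ""):
        if replace_missing:
            updated[field_name] = found_value
        return updated

    if current == found_value:
        return updated

    # The values disagree: replace when allowed, otherwise report it.
    if replace_fields or field_name in fields_to_replace:
        updated[field_name] = found_value
        return updated

    raise FieldCheckFail(
        f"{field_name}: row has {current!r}, lookup found {found_value!r}"
    )
```

Returning a copy keeps each check side-effect free, which makes it easier to log the before and after values for a row.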

check_fields(row)

Check the fields for a row. Adds warnings, and replaces if settings specify to

Parameters:

row (pandas.Series) – The row from the Dataframe

Returns:

The row from the Dataframe

Return type:

pandas.Series

check_hmdb(subrow, inchi_key)

Check HMDB using inchi_key

Parameters:
  • subrow (pandas.Series) – The subrow containing the data

  • inchi_key (str) – The InChI Key to use to search HMDB

Returns:

The subrow containing the data

Return type:

pandas.Series

check_kegg(subrow, pubchem_cid)

Check KEGG using pubchem id

Parameters:
  • subrow (pandas.Series) – The subrow containing the data

  • pubchem_cid (str) – The Pubchem CID to search KEGG with

Returns:

The subrow containing the data

Return type:

pandas.Series

check_lipidmaps(subrow, inchi_key)

Check the lipidmaps fields

Parameters:
  • subrow (pandas.Series) – The subrow containing the data

  • inchi_key (str) – The InChI Key to use to search lipidmaps

Returns:

The subrow containing the data

Return type:

pandas.Series

check_logP_RDKit(subrow)

Check the logP calculated in RDKit against the one in the ROI file

Parameters:

subrow (pandas.Series) – The subrow containing the data

Returns:

The subrow containing the data

Return type:

pandas.Series

check_merged_file(subrow)

Check the mergedName against the merged_file

Parameters:

subrow (pandas.Series) – The subrow containing the data

Returns:

The subrow containing the data

Return type:

pandas.Series

check_pubchem(subrow, inchi_key)

Check PubChem using the InChI Key

Parameters:
  • subrow (pandas.Series) – The subrow containing the data

  • inchi_key (str) – The InChI Key to search with

Returns:

The subrow containing the data

Return type:

pandas.Series

check_refmet(subrow, inchi_key=None)

Check refmet using inchi_key or mass range

Parameters:
  • subrow (pandas.Series) – The subrow containing the data

  • inchi_key (str) – The InChI Key to use to search REFMET

Returns:

The subrow containing the data

Return type:

pandas.Series

get_assay()

Get assay

Raises:

Exception

load_data()

Loads the necessary files

loop_and_map_data()

Loop and map dataset

process()

Main method

remove_whitespace_and_weird_characters()

Remove whitespace and strip commonly used weird characters

class phenomedb.compounds.CompoundTask(**kwargs)

The CompoundTask base class, from which the other CompoundTasks inherit. Loads the lookup files used as a reference, and has methods for checking various databases (KEGG, HMDB, REFMET, CHEBI, LIPIDMAPS, PUBCHEM, CHEMBL, CAS).

Parameters:

Task (phenomedb.task.Task) – The Task base class

Raises:
  • Exception

Returns:

The CompoundTask class

Return type:

phenomedb.compounds.CompoundTask

add_cas_from_hmdb(compound, lookup_field=None, lookup_value=None)

Add CAS from HMDB for a Compound. Uses Compound.inchi_key by default

Parameters:
  • compound (phenomedb.models.Compound) – The Compound to update

  • lookup_field (str, optional) – The search field, defaults to None

  • lookup_value (str, optional) – The search value, defaults to None

Returns:

CAS

Return type:

str

add_or_update_chebi(compound)

Add or update chebi based on a Compound

Parameters:

compound (phenomedb.models.Compound) – Compound to add or update ChEBI for

Raises:

Exception

Returns:

Found ChEBI id

Return type:

str

add_or_update_chembl(compound)

Add or update ChEMBL for compound

Parameters:

compound (phenomedb.models.Compound) – Compound to update

Returns:

ChEMBL ID

Return type:

str

add_or_update_classyfire(compound)

Add or update the Classyfire references and classes

Parameters:

compound (phenomedb.models.Compound) – The Compound to update

add_or_update_hmdb(compound, lookup_field=None, lookup_value=None)

Add or update HMDB ID + Groups using Compound.inchi_key or a lookup field

Parameters:
  • compound (phenomedb.models.Compound) – The Compound to update

  • lookup_field (str, optional) – The search field, defaults to None

  • lookup_value (str, optional) – The search value, defaults to None

Returns:

HMDB ID

Return type:

str

add_or_update_kegg(compound, pubchem_cid=None, kegg_id=None)

Add or update KEGG using pubchem_cid or kegg_id

Parameters:
  • compound (phenomedb.models.Compound) – The Compound to update

  • pubchem_cid (str, optional) – The Pubchem CID to use for searching, defaults to None

  • kegg_id (str, optional) – The KEGG ID to use for searching, defaults to None

Returns:

kegg_id

Return type:

str

add_or_update_lipid_maps(compound, lookup_field=None, lookup_value=None)

Add or update LipidMAPS IDs and Groups.

Parameters:
  • compound (phenomedb.models.Compound) – The Compound to update

  • lookup_field (str, optional) – The search field to use, defaults to None

  • lookup_value (str, optional) – The search value, defaults to None

Returns:

LipidMAPs ID

Return type:

str

add_or_update_ontology_ref(ontology_source, accession_number, field, model_id)

Add or update an ontology ref

Parameters:
  • ontology_source (phenomedb.models.OntologySource) – The OntologySource

  • accession_number (str) – The accession number for the ontology

  • field (str) – The model field to map the ontology to

  • model_id (int) – The ID of the mapped model

add_or_update_pubchem_from_api(compound)

Add or update pubchem info for a Compound. Updates all the Compound properties including mass, chemical_formula, IUPAC and smiles.

Parameters:

compound (phenomedb.models.Compound) – The Compound to update

Returns:

The matching pubchem CID

Return type:

str

add_stereo_group(compound)

Add the stereo group for the compound

Parameters:

compound (phenomedb.models.Compound) – The compound to add

build_subrows(row)

Take a row from an ROI file and create subrows, where each subrow contains the information for one compound. Required because ROI files can contain ‘Annotation’ information that maps to multiple ‘Compounds’ (i.e. unique InChIs)

Parameters:

row (pandas.Series) – The row to split into multiple rows.
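The row-splitting idea can be sketched as follows. This is not the library's implementation: it works on a plain dict rather than a pandas.Series, and the column names `cpdName` and `InChI` as well as the `|` delimiter are assumptions chosen for illustration.

```python
def build_subrows(row, *, name_field="cpdName", inchi_field="InChI", sep="|"):
    """Split one annotation row into one subrow per InChI.

    Every other column is copied unchanged into each subrow.
    """
    inchis = [p.strip() for p in str(row.get(inchi_field, "")).split(sep) if p.strip()]
    names = [p.strip() for p in str(row.get(name_field, "")).split(sep) if p.strip()]

    # No InChI to split on: the row already describes a single entry.
    if not inchis:
        return [dict(row)]

    subrows = []
    for position, inchi in enumerate(inchis):
        subrow = dict(row)
        subrow[inchi_field] = inchi
        if len(names) == len(inchis):
            # One name per InChI: pair them up by position.
            subrow[name_field] = names[position]
        else:
            # Otherwise keep the annotation name as it was given.
            subrow[name_field] = row.get(name_field, "")
        subrows.append(subrow)
    return subrows


row = {
    "cpdName": "1-Propanol|2-Propanol",
    "InChI": "InChI=1S/C3H8O/c1-2-3-4/h4H,2-3H2,1H3|InChI=1S/C3H8O/c1-3(2)4/h3-4H,1-2H3",
    "rt": 1.23,
}
subrows = build_subrows(row)
```

Each resulting subrow carries exactly one InChI, so downstream lookups can treat it as a single compound.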

calculate_log_p(inchi)

Calculate logP using RDKit

Parameters:

inchi (str) – InChI for compound

Returns:

logP

Return type:

float

find_chebi(inchi)

Find a ChEBI based on inchi. If there are multiple InChIs in the row,

Parameters:

inchi (str) – The InChI to search

Returns:

ChEBI ID

Return type:

str

generate_inchi_key(inchi)

Generate inchi_key using RDKit

Parameters:

inchi (str) – InChI for compound

Returns:

InChI key for compound

Return type:

str
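When working with generated keys, a cheap sanity check on their shape can catch InChIs and InChI Keys being mixed up before any lookup is attempted. The helper below is a hypothetical addition, not part of PhenomeDB. It checks only the textual layout of an InChI Key (14 letters, 10 letters, 1 letter, separated by hyphens) and does not verify that the key corresponds to a real structure.

```python
import re

# Layout of an InChI Key: a 14-letter block, a 10-letter block and a final
# single letter, all uppercase and separated by hyphens (27 characters).
INCHI_KEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchi_key(value):
    """Return True if the value has the textual shape of an InChI Key."""
    return isinstance(value, str) and INCHI_KEY_PATTERN.fullmatch(value) is not None
```

For example, `looks_like_inchi_key("LFQSCWFLJHTTHZ-UHFFFAOYSA-N")` (the InChI Key of ethanol) is true, while passing an InChI string is false.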

get_cas_from_pubchem(pubchem_data)

Get CAS from pubchem

Parameters:

pubchem_data (dict) – The pubchem record

Returns:

The pubchem record

Return type:

dict

get_classyfire_reference(value)

Get the Classyfire ontology reference

Parameters:

value (str) – The Classyfire ID to get the reference from

Returns:

The Classyfire ontology reference

Return type:

str

get_from_chebi(inchi)

Get from ChEBI by inchi

Parameters:

inchi (str) – The InChI

Returns:

The ChEBI ID

Return type:

str

get_from_chembl(inchi_key)

Get from ChEMBL by inchi_key

Parameters:

inchi_key (str) – The InChI Key

Returns:

ChEMBL ID

Return type:

str

get_from_classyfire(inchi_key)

Get content from classyfire

Parameters:

inchi_key (str) – InChI key to search with

Returns:

data from classyfire

Return type:

dict

get_from_hmdb(inchi_key)

Get the HMDB row by inchi_key

Parameters:

inchi_key (str) – The InChI key to search by

Returns:

The HMDB dataset row number

Return type:

int

get_from_kegg(pubchem_cid)

Get from Kegg

Parameters:

pubchem_cid (int) – The Pubchem CID

Returns:

kegg id

Return type:

str

get_from_lipidmaps(inchi_key)

Get from lipidmaps by inchi_key

Parameters:

inchi_key (str) – The InChI Key to search by

Returns:

The row number from lipidmaps dataset

Return type:

int

get_from_pubchem(inchi_key)

Get from pubchem by inchi_key

Parameters:

inchi_key (str) – The inchi_key to search by

Returns:

The pubchem result

Return type:

dict

get_from_pubchem_api(lookup_field, lookup_value)

Get from pubchem by lookup_field and lookup_value

Parameters:
  • lookup_field (str) – The field to search with

  • lookup_value (str) – The value to search with

Returns:

the found record

Return type:

dict
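A lookup by field and value maps naturally onto the PubChem PUG REST URL scheme, where the field is the input namespace. The sketch below only builds the request URL and sends nothing. That PhenomeDB uses PUG REST in this way is an assumption, and the helper name `pubchem_compound_url` is hypothetical.

```python
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# Namespaces that can be passed directly in the URL path. Lookups by full
# InChI are left out because they are normally sent in a POST body.
URL_NAMESPACES = {"cid", "name", "inchikey"}


def pubchem_compound_url(lookup_field, lookup_value, output="JSON"):
    """Build a PUG REST compound URL for the given namespace and value."""
    if lookup_field not in URL_NAMESPACES:
        raise ValueError(f"unsupported lookup field: {lookup_field!r}")
    # Percent-encode the value so spaces and slashes cannot break the path.
    encoded = quote(str(lookup_value), safe="")
    return f"{PUG_REST_BASE}/compound/{lookup_field}/{encoded}/{output}"
```

The returned URL can be fetched with any HTTP client. Rate limiting and error handling are left to the caller.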

get_from_refmet(subrow, inchi_key=None)

Get from refmet by subrow or inchi_key

Parameters:
  • subrow (pandas.Series) – The row of the ROI file

  • inchi_key (str, optional) – The InChI Key, defaults to None

Returns:

The row number of the refmet database

Return type:

int

get_inchi_key_from_pubchem_or_hmdb(inchi, hmdb_id)

Get an inchi_key from an InChI via Pubchem or HMDB_ID

Parameters:
  • inchi (str) – The InChI to find the inchi_key.

  • hmdb_id (int) – The HMDBID

Returns:

The InChI Key

Return type:

str

get_lipid_maps_reference(value)

Get the lipid maps ontology reference from the ID

Parameters:

value (str) – The LipidMAPS ID to strip the ontology reference from

Returns:

The LipidMAPS ontology reference

Return type:

str
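LIPID MAPS classification labels are commonly written with the class code in square brackets, for example "Fatty Acyls [FA]". Assuming the input value has that shape (the exact input format expected by this method is not documented here), extracting the reference could look like the sketch below. The function body is an illustration, not the library's code.

```python
import re

# Matches a trailing bracketed code such as "[FA]" or "[GP01]".
_BRACKET_CODE = re.compile(r"\[([A-Z0-9]+)\]\s*$")


def get_lipid_maps_reference(value):
    """Return the bracketed class code from a label, or None if absent."""
    match = _BRACKET_CODE.search(value)
    return match.group(1) if match else None
```

With this sketch, "Glycerophosphocholines [GP01]" yields "GP01", and a label without a bracketed code yields None.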

get_or_add_compound_external_db(compound, external_db_name, database_ref)

Get or add CompoundExternalDB.

Parameters:
  • compound (phenomedb.models.Compound) – The Compound to update

  • external_db_name (str) – The name of the db to update

  • database_ref (str) – The database ref to add

get_pubchem_prop(pubchem_compound, label, name=None)

Get a pubchem property based on its label and name

Parameters:
  • pubchem_compound (dict) – The pubchem record

  • label (str) – The label to search for

  • name (str, optional) – The name to search for, defaults to None

Returns:

The property

Return type:

object

get_pubchem_view_from_api(pubchem_cid)

Get a pubchem view record (more info) by pubchem_cid

Parameters:

pubchem_cid (int) – The Pubchem CID

Returns:

The found record

Return type:

dict

load_data()

Load the databases + the ExternalDB.ids

loop_and_map_data()

Override this method in your CompoundTasks

parse_pubchem_value(value_dict)

Parse a pubchem value into its defined type.

Parameters:

value_dict (dict) – The value dict of the record.

Returns:

The converted typed value (str, int, or float)

Return type:

object
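To illustrate how label/name lookups and typed values fit together, here is a sketch over the layout PubChem uses for compound records, where each property has a `urn` with a `label` and optional `name`, and a `value` keyed by type (`sval`, `ival` or `fval`). The sample record is hand-written with illustrative values, and the function bodies are assumptions rather than the library's implementation. In particular, returning the parsed value from `get_pubchem_prop` is a choice made for this sketch.

```python
def parse_pubchem_value(value_dict):
    """Convert a PubChem value dict into a str, int or float."""
    for key, cast in (("sval", str), ("ival", int), ("fval", float)):
        if key in value_dict:
            return cast(value_dict[key])
    return None  # no recognised typed value present


def get_pubchem_prop(pubchem_compound, label, name=None):
    """Return the first property matching the label (and name, if given)."""
    for prop in pubchem_compound.get("props", []):
        urn = prop.get("urn", {})
        if urn.get("label") != label:
            continue
        if name is not None and urn.get("name") != name:
            continue
        return parse_pubchem_value(prop.get("value", {}))
    return None


# Hand-written sample in the PubChem record layout (illustrative values).
sample_record = {
    "props": [
        {"urn": {"label": "InChIKey", "name": "Standard"},
         "value": {"sval": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"}},
        {"urn": {"label": "Molecular Formula"}, "value": {"sval": "C2H6O"}},
        {"urn": {"label": "Log P", "name": "XLogP3"}, "value": {"fval": -0.1}},
        {"urn": {"label": "Count", "name": "Hydrogen Bond Donor"},
         "value": {"ival": 1}},
    ]
}
```

Passing `name` narrows the match when several properties share a label, as with the various "Count" entries in a real record.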

process()

Process method. Loads the data and then maps it.
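The load-then-map flow, with subclasses overriding `loop_and_map_data()`, follows the template method pattern. The sketch below shows that pattern in isolation. `SketchCompoundTask` and `UppercaseNames` are hypothetical classes that do not touch the database or inherit from the real CompoundTask.

```python
class SketchCompoundTask:
    """Hypothetical stand-in for the CompoundTask base class."""

    def __init__(self):
        self.data = []
        self.results = []

    def load_data(self):
        # The real base class loads lookup databases; this one loads nothing.
        self.data = []

    def loop_and_map_data(self):
        raise NotImplementedError("override loop_and_map_data() in a subclass")

    def process(self):
        # Fixed order: load first, then map.
        self.load_data()
        self.loop_and_map_data()


class UppercaseNames(SketchCompoundTask):
    """Toy subclass that 'maps' compound names to upper case."""

    def load_data(self):
        self.data = ["glucose", "lactate"]

    def loop_and_map_data(self):
        self.results = [name.upper() for name in self.data]


task = UppercaseNames()
task.process()
```

Keeping the sequencing in `process()` means a subclass only has to say what to load and how to map it.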

update_annotation(cpd_name, feature_dict, version)

Update a compound annotation with the config to store. Used for adding the ion types from a ROI row

Parameters:
  • cpd_name (str) – The name of the annotated compound

  • feature_dict (dict) – The config dictionary

  • version (str) – The version of the Annotation

update_name_to_refmet(compound, lookup_field=None, lookup_value=None)

Update the compound name to refmet. Optionally use a lookup_field. Defaults to Compound.inchi_key

Parameters:
  • compound (phenomedb.models.Compound) – The Compound to update the name for.

  • lookup_field (str, optional) – The field to use for searching, defaults to None

  • lookup_value (str, optional) – The value to use for searching, defaults to None

Returns:

refmet_name

Return type:

str

class phenomedb.compounds.ExportCompoundsToCSV(methods=None, username=None, task_run_id=None, output_file_path=None, annotation_config_field=None, db_env=None, db_session=None, execution_date=None, pipeline_run_id=None, upstream_task_run_id=None)

Export Compounds To CSV Class

Parameters:

CompoundTask (phenomedb.compounds.CompoundTask) – The Base CompoundTask

build_dataset()

Build the dataset from the compounds

build_row(compound)

Build a row from a Compound

Parameters:

compound (phenomedb.models.Compound) – Compound to build a row from

process()

Export all the compounds in the db
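The row-building and export steps can be sketched with the standard library's csv module. The column names and the dict-based compound below are assumptions for illustration. The real task works on phenomedb.models.Compound objects and its output columns are not listed in this reference.

```python
import csv
import io

# Hypothetical output columns.
FIELDNAMES = ["name", "inchi_key", "pubchem_cid"]


def build_row(compound):
    """Build one CSV row from a compound, leaving missing fields blank."""
    return {key: compound.get(key, "") for key in FIELDNAMES}


def build_dataset(compounds):
    """Build the full list of rows from the compounds."""
    return [build_row(compound) for compound in compounds]


def export_to_csv(compounds, handle):
    """Write a header line followed by one line per compound."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(build_dataset(compounds))


buffer = io.StringIO()
export_to_csv(
    [{"name": "Ethanol", "inchi_key": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
      "pubchem_cid": "702"}],
    buffer,
)
```

Writing to a file instead of a `StringIO` only requires opening the target with `newline=""`, as the csv module documentation recommends.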

class phenomedb.compounds.ImportBrukerBiLISACompounds(bilisa_file=None, version=None, username=None, task_run_id=None, db_env=None, db_session=None, execution_date=None, pipeline_run_id=None, upstream_task_run_id=None)

Import the Bruker BILISA Lipoprotein fractions. The file is in test_data/compounds/. These are imported as Annotations, not Compounds, as they represent centrifuged fractions of lipoproteins, not individual compounds.

Parameters:

CompoundTask (phenomedb.compounds.CompoundTask) – The Base CompoundTask

add_annotation(row)

Add Annotation

Parameters:

row (pandas.Series) – The row from the file

load_data()

Loads the annotation method and then the assay.

loop_and_map_data()

Loop and map data

class phenomedb.compounds.ImportBrukerBiQuantCompounds(version=None, biquant_compounds_file=None, username=None, task_run_id=None, db_env=None, db_session=None, execution_date=None, pipeline_run_id=None, upstream_task_run_id=None)

Import Bruker BI-QUANT Compounds.

Parameters:

CompoundTask (phenomedb.compounds.CompoundTask) – The Base CompoundTask

add_compound_and_mappings(row)

Add compound and mappings

Parameters:

row (pandas.Series) – The row from the file

load_data()

Load the data

loop_and_map_data()

Loop and map data

class phenomedb.compounds.ImportCompoundsFromCSV(input_file=None, username=None, task_run_id=None, db_env=None, db_session=None, execution_date=None, pipeline_run_id=None, upstream_task_run_id=None)

load_data()

Load the databases + the ExternalDB.ids

loop_and_map_data()

Override this method in your CompoundTasks

class phenomedb.compounds.ImportROICompounds(roi_file=None, assay_name=None, roi_version=None, update_names=False, task_run_id=None, username=None, pipeline_run_id=None, upstream_task_run_id=None, na_values=None, na_none=None, db_env=None, db_session=None, execution_date=None, missing_lipid_classes=False)

Import ROI Compounds

Parameters:

CompoundTask (phenomedb.compounds.CompoundTask) – The Base CompoundTask

add_or_update_compound(row, annotation, inchi, inchi_key, chemical_formula, monoisotopic_mass, log_p, iupac, smiles, sub_cpd_name=None)

Adds or updates a compound

Parameters:
  • row (pandas.Series) – The row from the ROI dataframe

  • annotation (phenomedb.models.Annotation) – The Annotation object

  • inchi (str) – The InChI for the Compound

  • inchi_key (str) – The InChI Key for the Compound

  • chemical_formula (str) – The chemical formula for the Compound

  • monoisotopic_mass (float) – The monoisotopic_mass of the Compound

  • log_p (float) – The logP (partition coefficient) of the Compound

  • iupac (str) – The IUPAC identifier of the Compound

  • smiles (str) – The SMILES string of the Compound

  • sub_cpd_name (str, optional) – The split cpd_name of the Annotation, defaults to None

Returns:

The Compound

Return type:

phenomedb.models.Compound

add_or_update_compound_from_subrow(row, annotation)

Add or update compound from subrow

Parameters:
  • row (pandas.Series) – The row from the ROI dataframe

  • annotation (phenomedb.models.Annotation) – The Annotation object

import_row(row)

Imports a row from the file.

  1. Breaks the row into subrows and imports them separately.

  2. Finds or adds Annotation + harmonised annotation

  3. Adds compounds + identifiers

  4. Adds groups

Parameters:

row (pandas.Series) – The row from the ROI dataframe

Returns:

The row from the ROI dataframe

Return type:

pandas.Series

load_data()

Loads data

loop_and_map_data()

Loop and map the data

process()

Main method

class phenomedb.compounds.ImportStandardsV1(standards_file=None, username=None, task_run_id=None, db_env=None, db_session=None, execution_date=None, pipeline_run_id=None, upstream_task_run_id=None)

load_data()

Load the databases + the ExternalDB.ids

loop_and_map_data()

Override this method in your CompoundTasks

class phenomedb.compounds.ParseHMDBXMLtoCSV(input_file_path=None, output_file_path=None, hmdb_type=None, username=None, task_run_id=None, pipeline_run_id=None, upstream_task_run_id=None)

Parse HMDB XML to CSV, used for simpler lookups

Parameters:

CompoundTask (phenomedb.compounds.CompoundTask) – The Base CompoundTask

download_file()

Downloads the HMDB file

process()

Main method. Downloads file or uses cache, then loops and builds a CSV

reset_found_fields()

Resets the fields when they are found

Returns:

found field dictionary

Return type:

dict
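A streaming XML-to-CSV conversion with a per-record "found fields" dictionary can be sketched as follows. The sample document is hand-written in the style of an HMDB export, the three output columns are an arbitrary choice, and the function bodies are assumptions rather than the task's actual code.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ("accession", "name", "inchikey")


def reset_found_fields():
    """Return a fresh dictionary with every tracked field empty."""
    return {field: "" for field in FIELDS}


def local_name(tag):
    # Strip a "{namespace}" prefix if the parser added one.
    return tag.rsplit("}", 1)[-1]


def parse_metabolites(xml_source, csv_handle):
    """Stream metabolite records from XML and write one CSV line each."""
    writer = csv.DictWriter(csv_handle, fieldnames=FIELDS)
    writer.writeheader()
    found = reset_found_fields()
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        tag = local_name(elem.tag)
        if tag == "metabolite":
            writer.writerow(found)
            found = reset_found_fields()
            elem.clear()  # release the finished record's children
        elif tag in FIELDS and not found[tag]:
            # Keep only the first occurrence, so nested repeats such as
            # secondary accessions do not overwrite the primary value.
            found[tag] = (elem.text or "").strip()


SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<hmdb xmlns="http://www.hmdb.ca">
  <metabolite>
    <accession>HMDB0000108</accession>
    <secondary_accessions><accession>HMDB00108</accession></secondary_accessions>
    <name>Ethanol</name>
    <inchikey>LFQSCWFLJHTTHZ-UHFFFAOYSA-N</inchikey>
  </metabolite>
</hmdb>"""

output = io.StringIO()
parse_metabolites(io.BytesIO(SAMPLE), output)
```

Using `iterparse` and clearing each finished record keeps memory use roughly constant, which matters for multi-gigabyte exports.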

class phenomedb.compounds.ParseKEGGtoPubchemCIDTask(output_file_path=None, compound_type=None, test=False, task_run_id=None, username=None, pipeline_run_id=None, upstream_task_run_id=None)

This task parses KEGG and builds a dataframe of KEGGID -> Pubchem CID lookups

Parameters:

CompoundTask (phenomedb.compounds.CompoundTask) – The Base CompoundTask

extract_and_set_compound_id(name)

Extract KEGG ID from name

Parameters:

name (str) – KEGG element name

extract_kegg_ids()

Extracts the KEGG IDs from the brite codes

get_pubchem_cid(pubchem_sid)

Get pubchem CID from pubchem SID

Parameters:

pubchem_sid (str) – Pubchem SID

Returns:

Pubchem CID

Return type:

str

get_pubchem_sid(kegg_id)

Get pubchem SID from kegg id

Parameters:

kegg_id (str) – KEGG ID

Returns:

Pubchem SID

Return type:

str
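The two-step mapping (KEGG ID to PubChem SID, then SID to CID) can be sketched against the public KEGG and PubChem REST interfaces. Whether this task uses these exact endpoints is an assumption. The helper names are hypothetical, and the snippet only builds URLs and parses a sample response without making any network request.

```python
def kegg_conv_url(kegg_id):
    """URL of the KEGG 'conv' operation mapping a compound to PubChem."""
    return f"https://rest.kegg.jp/conv/pubchem/cpd:{kegg_id}"


def parse_kegg_conv(text):
    """Parse tab-separated 'cpd:<id>' / 'pubchem:<sid>' lines into a dict."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        source, target = parts
        if source.startswith("cpd:") and target.startswith("pubchem:"):
            mapping[source[len("cpd:"):]] = target[len("pubchem:"):]
    return mapping


def pubchem_sid_to_cid_url(pubchem_sid):
    """URL of the PUG REST request listing the CIDs for a substance ID."""
    return (
        "https://pubchem.ncbi.nlm.nih.gov/rest/pug/"
        f"substance/sid/{pubchem_sid}/cids/JSON"
    )


# Illustrative response text in the KEGG conv format.
sample_response = "cpd:C00031\tpubchem:3333\n"
sids = parse_kegg_conv(sample_response)
```

The intermediate step exists because KEGG cross-references PubChem substance records (SIDs), not compound records (CIDs).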

loop_into_brite_fields(element)

Recurse into brite fields to extract all required parameters

Parameters:

element (object) – element to recurse into
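Recursing into a BRITE hierarchy and pulling KEGG compound IDs out of leaf names can be sketched as below. It assumes the hierarchy is a nested structure of `name` and `children` keys, as in KEGG's JSON export of a BRITE file, and that leaf names begin with a compound ID (a "C" followed by five digits). The function and the sample tree are hypothetical.

```python
import re

# KEGG compound identifiers: "C" followed by five digits.
KEGG_COMPOUND_ID = re.compile(r"\bC\d{5}\b")


def collect_compound_ids(element, path=()):
    """Return (kegg_id, class_path) pairs for every leaf under element."""
    name = element.get("name", "")
    children = element.get("children")

    # A leaf: try to extract the compound ID from its name.
    if not children:
        match = KEGG_COMPOUND_ID.search(name)
        return [(match.group(0), path)] if match else []

    # A branch: recurse, extending the class path with this level's name.
    found = []
    for child in children:
        found.extend(collect_compound_ids(child, path + (name,)))
    return found


sample_tree = {
    "name": "br08001",
    "children": [
        {"name": "Carbohydrates", "children": [
            {"name": "Monosaccharides", "children": [
                {"name": "C00031  D-Glucose"},
            ]},
        ]},
    ],
}
pairs = collect_compound_ids(sample_tree)
```

Carrying the path down the recursion records each compound's class hierarchy alongside its ID.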

parse_kegg_compound_class(brite_code)

Parse a KEGG compound class by brite code

Parameters:

brite_code (str) – The code to search with

process()

Main method

class phenomedb.compounds.UpdateCompoundRefs(username=None, task_run_id=None, pipeline_run_id=None, upstream_task_run_id=None)

Update the Compound properties and external references from the InChI lookups

Parameters:

CompoundTask (phenomedb.compounds.CompoundTask) – The Base CompoundTask

load_data()

Load the data

loop_and_map_data()

Loop over compounds and update the references