phenomedb.compounds
PhenomeDB enables the storage of annotation metadata such as chemical references and classes, and has a data model and import processes capable of harmonising annotations to their analytical specificity.
The minimum information required for import is the compound name (as annotated) and the InChI (if available). If the specificity of the annotation is low, multiple compounds and InChIs can be recorded per annotation. With this minimum information, PhenomeDB can look up and record the following external references and classes and make them queryable and reportable.
Databases: PubChem, ChEBI, ChEMBL, ChemSpider, LipidMAPS, HMDB
Classes: LipidMAPS, HMDB, ClassyFire
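The relationship between one annotation and several candidate compounds can be pictured with a small sketch. The class names CompoundRecord and AnnotationRecord are hypothetical stand-ins invented for this illustration; they are not the PhenomeDB models, and the real schema may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CompoundRecord:
    """One candidate compound (hypothetical stand-in, not a PhenomeDB model)."""
    name: str
    inchi: Optional[str] = None  # recorded only if available


@dataclass
class AnnotationRecord:
    """One annotation, as named by the analyst, with its candidate compounds."""
    cpd_name: str
    compounds: List[CompoundRecord] = field(default_factory=list)

    def is_low_specificity(self) -> bool:
        # More than one candidate means the annotation cannot
        # distinguish between the listed structures.
        return len(self.compounds) > 1


# An annotation that cannot distinguish the two propanol isomers.
propanol = AnnotationRecord(
    cpd_name="Propanol",
    compounds=[
        CompoundRecord("1-Propanol", "InChI=1S/C3H8O/c1-2-3-4/h4H,2-3H2,1H3"),
        CompoundRecord("2-Propanol", "InChI=1S/C3H8O/c1-3(2)4/h3-4H,1-2H3"),
    ],
)
print(propanol.is_low_specificity())
```

Keeping the annotated name separate from the candidate structures is what allows a single low-specificity annotation to carry several InChIs.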
The ImportCompoundTask overview, which looks up compound metadata and populates the database
Compound metadata can be imported from PeakPantheR region-of-interest (ROI) files for LC-MS annotations. Recent versions of these can be found in ./phenomedb/data/compounds/.
To import the ROI compound data, use the tasks ImportROICompounds and ImportROILipids.
IVDr annotation metadata can be imported using ImportBrukerBiLISACompounds and ImportBrukerBiQuantCompounds. The source data are available in ./phenomedb/data/compounds/.
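An ROI import might be started from Python roughly as follows. This is a minimal sketch under stated assumptions: the keyword names come from the ImportROICompounds signature in this reference, but the file name, assay name, and version are hypothetical placeholders, the helper functions are invented for illustration, and calling process() directly (rather than through a pipeline) is an assumption.

```python
def roi_import_kwargs(roi_file, assay_name, roi_version, username=None):
    """Collect keyword arguments named in the ImportROICompounds signature."""
    return {
        "roi_file": roi_file,
        "assay_name": assay_name,
        "roi_version": roi_version,
        "username": username,
    }


def run_roi_import(**kwargs):
    """Instantiate and run the task; needs a configured PhenomeDB install."""
    # Imported lazily so this sketch loads even without PhenomeDB present.
    from phenomedb.compounds import ImportROICompounds

    task = ImportROICompounds(**kwargs)
    # Assumption: process() is the entry point ("Main method" in this reference).
    task.process()
    return task


# Hypothetical values: substitute a real ROI file, assay name, and version.
kwargs = roi_import_kwargs(
    roi_file="./phenomedb/data/compounds/example_ROI.csv",
    assay_name="EXAMPLE-ASSAY",
    roi_version="1.0",
)
# run_roi_import(**kwargs)  # uncomment on a machine with PhenomeDB configured
```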
Once imported, compounds and compound classes can be explored using the Compound View UI.
The Compound List View, showing a searchable, paginated table of imported compounds
The Compound View, showing the imported information for one compound, with links to external databases
- class phenomedb.compounds.AddMissingClassyFireClasses(username=None, task_run_id=None, db_env=None, db_session=None, execution_date=None, pipeline_run_id=None, upstream_task_run_id=None)
Update the Compound properties and external references from the InChI lookups
- Parameters:
CompoundTask (phenomedb.compounds.CompoundTask) – The Base CompoundTask
- load_data()
Load the data
- loop_and_map_data()
Loop over compounds and update the references
- class phenomedb.compounds.CleanROIFile(roi_file=None, roi_dtypes=None, assay_name=None, merged_file=None, replace_fields=False, pipeline_run_id=None, replace_missing=False, fields_to_replace=[], fields_to_ignore=[], cpds_to_replace=[], cpds_to_ignore=[], username=None, task_run_id=None, db_env=None, db_session=None, execution_date=None)
Clean an ROI file. Takes an ROI file, checks its IDs against the source databases, adds missing fields, and writes the output to a log file
- Parameters:
CompoundTask (phenomedb.compounds.CompoundTask) – The Base CompoundTask
- Raises:
Exception – [description]
ROICleanCheckFail – [description]
- check_cas(subrow, pubchem_data)
Check the CAS number from PubChem against the row data
- Parameters:
subrow (pandas.Series) – The subrow containing the data
pubchem_data (dict) – The PubChem record
- Returns:
The subrow containing the data
- Return type:
pandas.Series
- check_chebi(subrow)
Check ChEBI using the InChI
- Parameters:
subrow (pandas.Series) – The subrow containing the data
- Returns:
The subrow containing the data
- Return type:
pandas.Series
- check_chembl(subrow, inchi_key)
Check the ChEMBL ID found via lookup against the ChEMBLID field
- Parameters:
subrow (pandas.Series) – The subrow containing the data
inchi_key (str) – The InChI Key to use to search ChEMBL
- Returns:
The subrow containing the data
- Return type:
pandas.Series
- check_chemspider(subrow, inchi_key)
Check the ChemSpider ID found via lookup against the ChemSpider ID in the row.
Only works for those without IDs.
- Parameters:
subrow (pandas.Series) – The subrow containing the data
inchi_key (str) – The InChI Key to use to search ChemSpider
- Returns:
The subrow containing the data
- Return type:
pandas.Series
- check_classyfire(subrow, inchi_key)
Check the ClassyFire fields
- Parameters:
subrow (pandas.Series) – The subrow containing the data
inchi_key (str) – The InChI Key to use to search ClassyFire
- Returns:
The subrow containing the data
- Return type:
pandas.Series
- check_field(found_value, field_name, subrow)
Check an individual field’s value and update it if configured to do so
- Parameters:
found_value (object) – The value of the property found via DB lookup
field_name (str) – The name of the field to check
subrow (pandas.Series) – The subrow containing the fields
- Raises:
ROICleanCheckFail – [description]
- Returns:
The subrow containing the fields
- Return type:
pandas.Series
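The kind of comparison check_field performs can be sketched as below. This is not the PhenomeDB implementation: the decision rules are assumptions inferred from the CleanROIFile constructor options (replace_fields, replace_missing, fields_to_ignore), the exception class is a local stand-in, and a plain dict stands in for the pandas.Series subrow.

```python
class ROICleanCheckFail(Exception):
    """Local stand-in for the exception of the same name."""


def check_field(found_value, field_name, subrow, replace_fields=False,
                replace_missing=False, fields_to_ignore=()):
    """Compare a looked-up value with the row value (assumed logic)."""
    if field_name in fields_to_ignore:
        return subrow
    current = subrow.get(field_name)
    if current in (None, ""):
        # Nothing recorded in the file: fill it in only if allowed to.
        if replace_missing:
            subrow[field_name] = found_value
        return subrow
    if current != found_value:
        if not replace_fields:
            raise ROICleanCheckFail(
                f"{field_name}: file has {current!r}, lookup found {found_value!r}"
            )
        subrow[field_name] = found_value
    return subrow


# Placeholder identifiers, not real database accessions.
subrow = {"cpd_name": "example", "hmdb_id": "", "chebi_id": "OLD_ID"}
check_field("NEW_HMDB_ID", "hmdb_id", subrow, replace_missing=True)
check_field("NEW_ID", "chebi_id", subrow, replace_fields=True)
print(subrow)
```

Raising on a mismatch by default, and overwriting only when explicitly asked, keeps a cleaning run from silently changing curated values.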
- check_fields(row)
Check the fields for a row. Adds warnings, and replaces values if the settings specify to do so
- Parameters:
row (pandas.Series) – The row from the Dataframe
- Returns:
The row from the Dataframe
- Return type:
pandas.Series
- check_hmdb(subrow, inchi_key)
Check HMDB using inchi_key
- Parameters:
subrow (pandas.Series) – The subrow containing the data
inchi_key (str) – The InChI Key to use to search HMDB
- Returns:
The subrow containing the data
- Return type:
pandas.Series
- check_kegg(subrow, pubchem_cid)
Check KEGG using the PubChem CID
- Parameters:
subrow (pandas.Series) – The subrow containing the data
pubchem_cid (str) – The PubChem CID to search KEGG with
- Returns:
The subrow containing the data
- Return type:
pandas.Series
- check_lipidmaps(subrow, inchi_key)
Check the LipidMAPS fields
- Parameters:
subrow (pandas.Series) – The subrow containing the data
inchi_key (str) – The InChI Key to use to search LipidMAPS
- Returns:
The subrow containing the data
- Return type:
pandas.Series
- check_logP_RDKit(subrow)
Check the logP calculated in RDKit against the one in the ROI file
- Parameters:
subrow (pandas.Series) – The subrow containing the data
- Returns:
The subrow containing the data
- Return type:
pandas.Series
- check_merged_file(subrow)
Check the mergedName against the merged_file
- Parameters:
subrow (pandas.Series) – The subrow containing the data
- Returns:
The subrow containing the data
- Return type:
pandas.Series
- check_pubchem(subrow, inchi_key)
Check PubChem using inchi_key
- Parameters:
subrow (pandas.Series) – The subrow containing the data
inchi_key (str) – The InChI Key to search with
- Returns:
The subrow containing the data
- Return type:
pandas.Series
- check_refmet(subrow, inchi_key=None)
Check RefMet using inchi_key or mass range
- Parameters:
subrow (pandas.Series) – The subrow containing the data
inchi_key (str) – The InChI Key to use to search RefMet
- Returns:
The subrow containing the data
- Return type:
pandas.Series
- get_assay()
Get assay
- Raises:
Exception – [description]
- load_data()
Loads the necessary files
- loop_and_map_data()
Loop and map dataset
- process()
Main method
- remove_whitespace_and_weird_characters()
Remove whitespace and strip commonly used weird characters
- class phenomedb.compounds.CompoundTask(**kwargs)
The CompoundTask base class, from which the other compound tasks inherit. Loads the lookup files to be used as a reference, and has methods for checking various databases (KEGG, HMDB, RefMet, ChEBI, LipidMAPS, PubChem, ChEMBL, CAS)
- Parameters:
Task (phenomedb.task.Task) – The Task base class
- Raises:
Exception – [description]
- Returns:
The CompoundTask class
- Return type:
phenomedb.compounds.CompoundTask
- add_cas_from_hmdb(compound, lookup_field=None, lookup_value=None)
Add CAS from HMDB for a Compound. Uses Compound.inchi_key by default
- Parameters:
compound (phenomedb.models.Compound) – The Compound to update
lookup_field (str, optional) – The search field, defaults to None
lookup_value (str, optional) – The search value, defaults to None
- Returns:
CAS
- Return type:
str
- add_or_update_chebi(compound)
Add or update ChEBI based on a Compound
- Parameters:
compound (phenomedb.models.Compound, optional) – Compound to add or update ChEBI for, defaults to None
- Raises:
Exception – [description]
- Returns:
Found ChEBI id
- Return type:
str
- add_or_update_chembl(compound)
Add or update ChEMBL for a Compound
- Parameters:
compound (phenomedb.models.Compound) – Compound to update
- Returns:
ChEMBL ID
- Return type:
str
- add_or_update_classyfire(compound)
Add or update the ClassyFire references and classes
- Parameters:
compound (phenomedb.models.Compound) – The Compound to update
- add_or_update_hmdb(compound, lookup_field=None, lookup_value=None)
Add or update HMDB ID + Groups using Compound.inchi_key or a lookup field
- Parameters:
compound (phenomedb.models.Compound) – The Compound to update
lookup_field (str, optional) – The search field, defaults to None
lookup_value (str, optional) – The search value, defaults to None
- Returns:
HMDB ID
- Return type:
str
- add_or_update_kegg(compound, pubchem_cid=None, kegg_id=None)
Add or update KEGG using pubchem_cid or kegg_id
- Parameters:
compound (phenomedb.models.Compound) – The Compound to update
pubchem_cid (str, optional) – The PubChem CID to use for searching, defaults to None
kegg_id (str, optional) – The KEGG ID to use for searching, defaults to None
- Returns:
kegg_id
- Return type:
str
- add_or_update_lipid_maps(compound, lookup_field=None, lookup_value=None)
Add or update LipidMAPS IDs and Groups.
- Parameters:
compound (phenomedb.models.Compound) – The Compound to update
lookup_field (str, optional) – The search field to use, defaults to None
lookup_value (str, optional) – The search value, defaults to None
- Returns:
LipidMAPS ID
- Return type:
str
- add_or_update_ontology_ref(ontology_source, accession_number, field, model_id)
Add or update an ontology ref
- Parameters:
ontology_source (phenomedb.models.OntologySource) – The OntologySource
accession_number (str) – The accession number for the ontology
field (str) – The model field to map the ontology to
model_id (int) – The ID of the mapped model
- add_or_update_pubchem_from_api(compound)
Add or update PubChem info for a Compound. Updates all the Compound properties, including mass, chemical_formula, IUPAC and SMILES.
- Parameters:
compound (phenomedb.models.Compound) – The Compound to update
- Returns:
The matching PubChem CID
- Return type:
str
- add_stereo_group(compound)
Add the stereo group for the compound
- Parameters:
compound (phenomedb.models.Compound) – The compound to add
- build_subrows(row)
Take a row from an ROI file and create subrows, where each subrow contains the information for one compound. Required because ROI files can contain ‘Annotation’ information that maps to multiple ‘Compounds’ (i.e. unique InChIs)
- Parameters:
row (pandas.Series) – The row to split into multiple rows.
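Splitting one annotation row into per-compound subrows could look like the sketch below. The column names, the "|" separator, and the use of plain dicts instead of pandas.Series are all assumptions made for illustration; the real ROI file layout is not described here. A pipe is chosen because ";" and "/" already occur inside InChI strings.

```python
def build_subrows(row, multi_fields=("cpd_name", "inchi"), sep="|"):
    """Split one annotation row into one subrow per compound (assumed layout)."""
    split = {f: [v.strip() for v in str(row[f]).split(sep)] for f in multi_fields}
    n = max(len(values) for values in split.values())
    for f, values in split.items():
        if len(values) not in (1, n):
            raise ValueError(f"{f}: expected 1 or {n} values, got {len(values)}")
    subrows = []
    for i in range(n):
        subrow = dict(row)  # copy shared columns into every subrow
        for f, values in split.items():
            # A field with a single value applies to every subrow.
            subrow[f] = values[i] if len(values) == n else values[0]
        subrows.append(subrow)
    return subrows


# Hypothetical row: one annotation covering both propanol isomers.
row = {
    "annotation": "Propanol",
    "cpd_name": "1-Propanol | 2-Propanol",
    "inchi": "InChI=1S/C3H8O/c1-2-3-4/h4H,2-3H2,1H3 | "
             "InChI=1S/C3H8O/c1-3(2)4/h3-4H,1-2H3",
}
for subrow in build_subrows(row):
    print(subrow["cpd_name"], subrow["inchi"])
```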
- calculate_log_p(inchi)
Calculate logP using RDKit
- Parameters:
inchi (str) – InChI for compound
- Returns:
logP
- Return type:
float
- find_chebi(inchi)
Find a ChEBI based on inchi. If there are multiple InChIs in the row,
- Parameters:
inchi (str) – The InChI to search
- Returns:
ChEBI ID
- Return type:
str
- generate_inchi_key(inchi)
Generate inchi_key using RDKit
- Parameters:
inchi (str) – InChI for compound
- Returns:
InChI key for compound
- Return type:
str
- get_cas_from_pubchem(pubchem_data)
Get CAS from PubChem
- Parameters:
pubchem_data (dict) – The PubChem record
- Returns:
The PubChem record
- Return type:
dict
- get_classyfire_reference(value)
Get the Classyfire ontology reference
- Parameters:
value (str) – The Classyfire ID to get the reference from
- Returns:
The Classyfire ontology reference
- Return type:
str
- get_from_chebi(inchi)
Get from ChEBI by inchi
- Parameters:
inchi (str) – The InChI
- Returns:
The ChEBI ID
- Return type:
str
- get_from_chembl(inchi_key)
Get from ChEMBL by inchi_key
- Parameters:
inchi_key (str) – The InChI Key
- Returns:
ChEMBL ID
- Return type:
str
- get_from_classyfire(inchi_key)
Get content from classyfire
- Parameters:
inchi_key (str) – InChI key to search with
- Returns:
data from classyfire
- Return type:
dict
- get_from_hmdb(inchi_key)
Get from HMDB row by inchi_key
- Parameters:
inchi_key (str) – The InChI key to search by
- Returns:
The HMDB dataset row number
- Return type:
int
- get_from_kegg(pubchem_cid)
Get from KEGG
- Parameters:
pubchem_cid (int) – The PubChem CID
- Returns:
KEGG ID
- Return type:
str
- get_from_lipidmaps(inchi_key)
Get from LipidMAPS by inchi_key
- Parameters:
inchi_key (str) – The InChI Key to search by
- Returns:
The row number from the LipidMAPS dataset
- Return type:
int
- get_from_pubchem(inchi_key)
Get from PubChem by inchi_key
- Parameters:
inchi_key (str) – The inchi_key to search by
- Returns:
The PubChem result
- Return type:
dict
- get_from_pubchem_api(lookup_field, lookup_value)
Get from PubChem by lookup_field and lookup_value
- Parameters:
lookup_field (str) – The field to search with
lookup_value (str) – The value to search with
- Returns:
The found record
- Return type:
dict
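A field/value lookup of this kind maps naturally onto PubChem's PUG REST URL scheme, where the field is the input namespace (for example cid, name, or inchikey). The sketch below only builds the URL so that it runs offline; whether PhenomeDB uses PUG REST directly or a client library is not stated in this reference, and the helper name is hypothetical.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_compound_url(lookup_field, lookup_value):
    """Build a PUG REST URL that returns the compound record as JSON."""
    # Percent-encode the value so it is safe as a single path segment.
    value = quote(str(lookup_value), safe="")
    return f"{PUG_REST}/compound/{lookup_field}/{value}/JSON"


# InChIKey of ethanol, used here only as a well-known example.
url = pubchem_compound_url("inchikey", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N")
print(url)
```

Full InChI strings contain slashes, so they are better sent in a POST body than embedded in the URL path; InChIKeys and CIDs do not have that problem.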
- get_from_refmet(subrow, inchi_key=None)
Get from RefMet by subrow or inchi_key
- Parameters:
subrow (pandas.Series) – The row of the ROI file
inchi_key (str, optional) – The InChI Key, defaults to None
- Returns:
The row number of the RefMet database
- Return type:
int
- get_inchi_key_from_pubchem_or_hmdb(inchi, hmdb_id)
Get an inchi_key from an InChI via PubChem or an HMDB ID
- Parameters:
inchi (str) – The InChI to find the inchi_key for.
hmdb_id (int) – The HMDB ID
- Returns:
The InChI Key
- Return type:
str
- get_lipid_maps_reference(value)
Get the LipidMAPS ontology reference from the ID
- Parameters:
value (str) – The LipidMAPS ID to strip the ontology reference from
- Returns:
The LipidMAPS ontology reference
- Return type:
str
- get_or_add_compound_external_db(compound, external_db_name, database_ref)
Get or add CompoundExternalDB.
- Parameters:
compound (phenomedb.models.Compound) – The Compound to update
external_db_name (str) – The name of the db to update
database_ref (str) – The database ref to add
- get_pubchem_prop(pubchem_compound, label, name=None)
Get a PubChem property based on its label and name
- Parameters:
pubchem_compound (dict) – The PubChem record
label (str) – The label to search for
name (str, optional) – The name to search for, defaults to None
- Returns:
The property
- Return type:
object
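In PubChem's JSON compound records, properties sit in a "props" list, each with a "urn" carrying a label (and sometimes a name) plus a "value". A lookup by label and name could be sketched as follows; this is an illustration with a hand-written, heavily trimmed record, not the PhenomeDB implementation.

```python
def get_pubchem_prop(pubchem_compound, label, name=None):
    """Return the value dict of the first property matching label (and name)."""
    for prop in pubchem_compound.get("props", []):
        urn = prop.get("urn", {})
        if urn.get("label") != label:
            continue
        if name is None or urn.get("name") == name:
            return prop.get("value")
    return None  # no matching property


# Hand-written fragment in the shape of a PubChem record for ethanol.
record = {
    "props": [
        {"urn": {"label": "IUPAC Name", "name": "Preferred"},
         "value": {"sval": "ethanol"}},
        {"urn": {"label": "Molecular Formula"},
         "value": {"sval": "C2H6O"}},
    ]
}
print(get_pubchem_prop(record, "IUPAC Name", "Preferred"))
print(get_pubchem_prop(record, "Molecular Formula"))
```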
- get_pubchem_view_from_api(pubchem_cid)
Get a PubChem view record (more info) by pubchem_cid
- Parameters:
pubchem_cid (int) – The PubChem CID
- Returns:
The found record
- Return type:
dict
- load_data()
Load the databases + the ExternalDB.ids
- loop_and_map_data()
Override this method in your CompoundTasks
- parse_pubchem_value(value_dict)
Parse a PubChem value into its defined type.
- Parameters:
value_dict (dict) – The value dict of the record.
- Returns:
The converted typed value (str, int, or float)
- Return type:
object
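PubChem value dicts tag their type in the key: "ival" for integers, "fval" for floats, and "sval" for strings. A conversion along these lines is sketched below, as an assumption about what the method does rather than a copy of it.

```python
def parse_pubchem_value(value_dict):
    """Convert a PubChem value dict to int, float, or str based on its key."""
    converters = (("ival", int), ("fval", float), ("sval", str))
    for key, convert in converters:
        if key in value_dict:
            return convert(value_dict[key])
    return None  # unrecognised value type, for example binary data


print(parse_pubchem_value({"sval": "C2H6O"}))
print(parse_pubchem_value({"ival": 3}))
```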
- process()
Process method. Loads the data and then maps it.
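The load-then-map flow is a template-method pattern: process() fixes the order of the steps, and subclasses supply load_data() and loop_and_map_data(). The sketch below uses a stand-in base class (SketchCompoundTask, a hypothetical name) so that it runs without PhenomeDB or a database; the real CompoundTask does considerably more.

```python
class SketchCompoundTask:
    """Stand-in base class mirroring the documented process() flow."""

    def __init__(self):
        self.data = []
        self.results = []

    def load_data(self):
        """Load whatever the task needs; the base version loads nothing."""
        self.data = []

    def loop_and_map_data(self):
        raise NotImplementedError("override this method in subclasses")

    def process(self):
        # The order is fixed here; subclasses only fill in the steps.
        self.load_data()
        self.loop_and_map_data()


class NormaliseNames(SketchCompoundTask):
    """Example subclass: tidy up a list of compound names."""

    def load_data(self):
        self.data = ["  glucose ", "Lactate"]

    def loop_and_map_data(self):
        self.results = [name.strip().lower() for name in self.data]


task = NormaliseNames()
task.process()
print(task.results)  # ['glucose', 'lactate']
```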
- update_annotation(cpd_name, feature_dict, version)
Update a compound annotation with the config to store. Used for adding the ion types from an ROI row
- Parameters:
cpd_name (str) – The name of the annotated compound
feature_dict (dict) – The config dictionary
version (str) – The version of the Annotation
- update_name_to_refmet(compound, lookup_field=None, lookup_value=None)
Update the compound name to the RefMet name. Optionally use a lookup_field. Defaults to Compound.inchi_key
- Parameters:
compound (phenomedb.models.Compound) – The Compound to update the name for.
lookup_field (str, optional) – The field to use for searching, defaults to None
lookup_value (str, optional) – The value to use for searching, defaults to None
- Returns:
refmet_name
- Return type:
str
- class phenomedb.compounds.ExportCompoundsToCSV(methods=None, username=None, task_run_id=None, output_file_path=None, annotation_config_field=None, db_env=None, db_session=None, execution_date=None, pipeline_run_id=None, upstream_task_run_id=None)
Export Compounds To CSV Class
- Parameters:
CompoundTask (phenomedb.compounds.CompoundTask) – The Base CompoundTask
- build_dataset()
Build the dataset from the compounds
- build_row(compound)
Build a row from a Compound
- Parameters:
compound (phenomedb.models.Compound) – Compound to build a row from
- process()
Export all the compounds in the db
- class phenomedb.compounds.ImportBrukerBiLISACompounds(bilisa_file=None, version=None, username=None, task_run_id=None, db_env=None, db_session=None, execution_date=None, pipeline_run_id=None, upstream_task_run_id=None)
Import the Bruker BILISA Lipoprotein fractions. The file is in test_data/compounds/. These are imported as Annotations, not Compounds, as they represent centrifuged fractions of lipoproteins, not individual compounds.
- Parameters:
CompoundTask (phenomedb.compounds.CompoundTask) – The Base CompoundTask
- add_annotation(row)
Add Annotation
- Parameters:
row (pandas.Series) – The row from the file
- load_data()
Loads the annotation method and then assay. The
- loop_and_map_data()
Loop and map data
- class phenomedb.compounds.ImportBrukerBiQuantCompounds(version=None, biquant_compounds_file=None, username=None, task_run_id=None, db_env=None, db_session=None, execution_date=None, pipeline_run_id=None, upstream_task_run_id=None)
Import Bruker BI-QUANT Compounds.
- Parameters:
CompoundTask (phenomedb.compounds.CompoundTask) – The Base CompoundTask
- add_compound_and_mappings(row)
Add compound and mappings
- Parameters:
row (pandas.Series) – The row from the file
- load_data()
Load the data
- loop_and_map_data()
Loop and map data
- class phenomedb.compounds.ImportCompoundsFromCSV(input_file=None, username=None, task_run_id=None, db_env=None, db_session=None, execution_date=None, pipeline_run_id=None, upstream_task_run_id=None)
- load_data()
Load the databases + the ExternalDB.ids
- loop_and_map_data()
Override this method in your CompoundTasks
- class phenomedb.compounds.ImportROICompounds(roi_file=None, assay_name=None, roi_version=None, update_names=False, task_run_id=None, username=None, pipeline_run_id=None, upstream_task_run_id=None, na_values=None, na_none=None, db_env=None, db_session=None, execution_date=None, missing_lipid_classes=False)
Import ROI Compounds
- Parameters:
CompoundTask (phenomedb.compounds.CompoundTask) – The Base CompoundTask
- add_or_update_compound(row, annotation, inchi, inchi_key, chemical_formula, monoisotopic_mass, log_p, iupac, smiles, sub_cpd_name=None)
Adds or updates a compound
- Parameters:
row (pandas.Series) – The row from the ROI dataframe
annotation (phenomedb.models.Annotation) – The Annotation object
inchi (str) – The InChI for the Compound
inchi_key (str) – The InChI Key for the Compound
chemical_formula (str) – The chemical formula for the Compound
monoisotopic_mass (float) – The monoisotopic_mass of the Compound
log_p (float) – The logP (partition coefficient) of the Compound
iupac (str) – The IUPAC identifier of the Compound
smiles (str) – The SMILES string of the Compound
sub_cpd_name (str, optional) – The split cpd_name of the Annotation, defaults to None
- Returns:
The Compound
- Return type:
phenomedb.models.Compound
- add_or_update_compound_from_subrow(row, annotation)
Add or update compound from subrow
- Parameters:
row (pandas.Series) – The row from the ROI dataframe
annotation (phenomedb.models.Annotation) – The Annotation object
- import_row(row)
Imports a row from the file.
Breaks the row into subrows and imports them separately.
Finds or adds Annotation + harmonised annotation
Adds compounds + identifiers
Adds groups
- Parameters:
row (pandas.Series) – The row from the ROI dataframe
- Returns:
The row from the ROI dataframe
- Return type:
pandas.Series
- load_data()
Loads data
- loop_and_map_data()
Loop and map the data
- process()
Main method
- class phenomedb.compounds.ImportStandardsV1(standards_file=None, username=None, task_run_id=None, db_env=None, db_session=None, execution_date=None, pipeline_run_id=None, upstream_task_run_id=None)
- load_data()
Load the databases + the ExternalDB.ids
- loop_and_map_data()
Override this method in your CompoundTasks
- class phenomedb.compounds.ParseHMDBXMLtoCSV(input_file_path=None, output_file_path=None, hmdb_type=None, username=None, task_run_id=None, pipeline_run_id=None, upstream_task_run_id=None)
Parse HMDB XML to CSV, used for simpler lookups
- Parameters:
CompoundTask (phenomedb.compounds.CompoundTask) – The Base CompoundTask
- download_file()
Downloads the HMDB file
- process()
Main method. Downloads file or uses cache, then loops and builds a CSV
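Because HMDB XML exports are large, a streaming parse is the usual way to loop over them and build a CSV. The sketch below uses the standard library's iterparse on a tiny inline document. The namespace URI, the choice of fields, and the function name are assumptions, and the sample records are placeholders rather than real HMDB entries.

```python
import csv
import io
import xml.etree.ElementTree as ET

HMDB_NS = "{http://www.hmdb.ca}"  # assumed default namespace of the export
FIELDS = ("accession", "name", "inchikey")


def hmdb_xml_to_csv(xml_file, csv_file):
    """Stream <metabolite> elements from xml_file and write one CSV row each."""
    writer = csv.DictWriter(csv_file, fieldnames=FIELDS)
    writer.writeheader()
    count = 0
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == HMDB_NS + "metabolite":
            writer.writerow(
                {f: (elem.findtext(HMDB_NS + f) or "") for f in FIELDS}
            )
            elem.clear()  # free memory held by the finished record
            count += 1
    return count


SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<hmdb xmlns="http://www.hmdb.ca">
  <metabolite><accession>HMDB_EXAMPLE_1</accession>
    <name>Example metabolite A</name><inchikey>EXAMPLE-KEY-A</inchikey></metabolite>
  <metabolite><accession>HMDB_EXAMPLE_2</accession>
    <name>Example metabolite B</name><inchikey>EXAMPLE-KEY-B</inchikey></metabolite>
</hmdb>"""

out = io.StringIO(newline="")
n_rows = hmdb_xml_to_csv(io.BytesIO(SAMPLE), out)
print(out.getvalue())
```

A flat CSV keyed on InChIKey can then be loaded once and searched quickly, which fits the "simpler lookups" purpose stated for this class.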
- reset_found_fields()
Resets the fields when they are found
- Returns:
found field dictionary
- Return type:
dict
- class phenomedb.compounds.ParseKEGGtoPubchemCIDTask(output_file_path=None, compound_type=None, test=False, task_run_id=None, username=None, pipeline_run_id=None, upstream_task_run_id=None)
This task parses KEGG and builds a dataframe of KEGG ID -> PubChem CID lookups
- Parameters:
CompoundTask (phenomedb.compounds.CompoundTask) – The Base CompoundTask
- extract_and_set_compound_id(name)
Extract KEGG ID from name
- Parameters:
name (str) – KEGG element name
- extract_kegg_ids()
Extracts the KEGG IDs from the brite codes
- get_pubchem_cid(pubchem_sid)
Get PubChem CID from PubChem SID
- Parameters:
pubchem_sid (str) – PubChem SID
- Returns:
PubChem CID
- Return type:
str
- get_pubchem_sid(kegg_id)
Get PubChem SID from KEGG ID
- Parameters:
kegg_id (str) – KEGG ID
- Returns:
PubChem SID
- Return type:
str
- loop_into_brite_fields(element)
Recurse into brite fields to extract all required parameters
- Parameters:
element (object) – element to recurse into
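KEGG BRITE hierarchies in JSON form are nested objects with a "name" and an optional "children" list, and compound leaves have names that begin with the KEGG compound ID. Recursing into such a structure could be sketched as follows. The function name, the return shape, and the sample hierarchy are hypothetical, and the real task may extract more fields.

```python
import re

# KEGG compound IDs are "C" followed by five digits.
KEGG_COMPOUND_ID = re.compile(r"^(C\d{5})\b")


def collect_kegg_ids(element, path=()):
    """Recurse into a BRITE element; return (kegg_id, class_path) pairs."""
    children = element.get("children")
    if children is None:
        # Leaf node: try to read a compound ID from the start of the name.
        match = KEGG_COMPOUND_ID.match(element["name"])
        return [(match.group(1), path)] if match else []
    found = []
    for child in children:
        found.extend(collect_kegg_ids(child, path + (element["name"],)))
    return found


# Hand-written hierarchy in the BRITE JSON shape; class names are made up.
brite = {
    "name": "br_example",
    "children": [
        {"name": "Example class",
         "children": [{"name": "C00001  H2O"}, {"name": "C00002  ATP"}]},
    ],
}
for kegg_id, class_path in collect_kegg_ids(brite):
    print(kegg_id, " > ".join(class_path))
```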
- parse_kegg_compound_class(brite_code)
Parse a KEGG compound class by brite code
- Parameters:
brite_code (str) – The code to search with
- process()
Main method
- class phenomedb.compounds.UpdateCompoundRefs(username=None, task_run_id=None, pipeline_run_id=None, upstream_task_run_id=None)
Update the Compound properties and external references from the InChI lookups
- Parameters:
CompoundTask (phenomedb.compounds.CompoundTask) – The Base CompoundTask
- load_data()
Load the data
- loop_and_map_data()
Loop over compounds and update the references