vardb.metadata_wrangling package

This package retrieves and parses metadata from internal projects to produce vardb .loader files, which are then used to load the data into vardb.

  • Each MetadataCollector is responsible for assembling the metadata for a single pipeline. It makes calls to other databases through their connections class, and/or scrapes the filesystem for the data and metadata for the pipeline.
  • The main function in this package is make_loader, which collects the metadata and optionally compares it to a previous version so that only new and changed data are loaded.

Submodules

vardb.metadata_wrangling.configuration module

class vardb.metadata_wrangling.configuration.Config(config=None)

Bases: object

evaluate(key, *args, **kwargs)

Evaluates the function parameter stored under key with the given arguments, and returns the result

Parameters:
  • key – function key
  • args – function positional arguments
  • kwargs – function keyword arguments
Returns:

the return value of the function

get(key)

Gets the parameter associated with key in the configuration

Parameters:key
Returns:value of Config key

keys()

Returns keys

Returns:keys

set(key, val)

Sets a key in the configuration dictionary

Parameters:
  • key
  • val

update(config)

Update the Config object with new data

Parameters:config – a dictionary of key-value pairs
Raises:ValueError if a function parameter is not a function recognized in locals

validate(required_keys)

Makes sure that all of the required keys are defined

Parameters:required_keys
Raises:ValueError if some keys are not defined
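
The documented Config interface can be illustrated with a minimal stand-in. This is only a sketch under stated assumptions, not the vardb implementation: ConfigSketch is a hypothetical class, values are kept in a plain dictionary, and evaluate assumes the stored value is already a callable (the real update apparently resolves function parameters against locals, which is not modelled here).

```python
class ConfigSketch(object):
    """Hypothetical stand-in that mirrors the documented Config methods."""

    def __init__(self, config=None):
        self._values = {}
        if config is not None:
            self.update(config)

    def get(self, key):
        # Return the parameter stored under key.
        return self._values[key]

    def set(self, key, val):
        # Store a single key in the configuration dictionary.
        self._values[key] = val

    def keys(self):
        return list(self._values.keys())

    def update(self, config):
        # Assumption: a plain merge; no lookup of function names is done.
        for key, val in config.items():
            self.set(key, val)

    def validate(self, required_keys):
        # Raise ValueError if any required key is undefined.
        missing = [key for key in required_keys if key not in self._values]
        if missing:
            raise ValueError("Missing required keys: " + ", ".join(missing))

    def evaluate(self, key, *args, **kwargs):
        # Assumption: the value stored under key is a callable.
        return self._values[key](*args, **kwargs)


config = ConfigSketch({"project": "example_project"})
config.set("data_type", "vcall")
config.set("make_label", lambda name: name.upper())
config.validate(["project", "data_type"])  # passes silently
label = config.evaluate("make_label", "abc")
```

Calling validate before any collection starts means a missing key fails early with a single message listing everything that is undefined.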

vardb.metadata_wrangling.get_bam_cnvs module

Locates all bam_CNVs-bam file pairs for the controlfreec pipeline. This is necessary because controlfreec is not currently tracked in a database. The bam_CNVs and bam files are used to look up metadata on BioApps and LIMS.

exception vardb.metadata_wrangling.get_bam_cnvs.GetBamCNVsException

Bases: exceptions.Exception

vardb.metadata_wrangling.get_bam_cnvs.get_bam_cnvs(bam_cnv_pattern)

Locates the bam_cnvs and bam file pairs under a particular search pattern.

Returns:A pandas dataframe containing the BioApps lookup path (originating merged bam file path), library name, output data path, pipeline, and pipeline version for a given pair of bam_cnvs and bam files.

vardb.metadata_wrangling.helpers module

vardb.metadata_wrangling.helpers.get_patient_identifier(df)
vardb.metadata_wrangling.helpers.get_pog_controlfreec_library_name(df)
vardb.metadata_wrangling.helpers.get_pog_gene_model(df)
vardb.metadata_wrangling.helpers.get_pog_id(df)

vardb.metadata_wrangling.loader_maker module

Creates loader files to be used by vardb.variant_file_loaders.load_files to load data and metadata to vardb.

vardb.metadata_wrangling.loader_maker.make_loader(output_directory, project, query, previous_metadata_file=None, debug=False)

Creates a loader file containing all records that match a project and analysis query and need to be loaded to vardb. All metadata associated with the project and query is collected and then compared with the results from a previous day (read from previous_metadata_file). The rows to load include all new, changed, and deleted rows in the new metadata relative to the previous metadata. If previous_metadata_file is not specified, all rows in the current metadata are added to the loader file.

Parameters:
  • output_directory – Destination for new metadata and loader files
  • project – project
  • query – analysis to query for (e.g. vcall)
  • previous_metadata_file – path to metadata file created on a previous day
  • debug – True if you want to suppress errors for debugging purposes
Returns:

path to loader file (None if no modified records were found)

vardb.metadata_wrangling.locate_metadata_changes module

Locates changes between two (variant data) metadata DataFrames, including row changes, deletions, and additions.

vardb.metadata_wrangling.locate_metadata_changes.locate_metadata_changes(old_metadata, new_metadata)

Finds new/changed data to be loaded to the database

Parameters:
  • old_metadata – Includes all metadata found for a project and pipeline at time of last loading to vardb
  • new_metadata – All new metadata for the same project and pipeline
Returns:

A dataframe with just the new and changed metadata, or None if no changes occurred. This is used to make a loader file.
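
The comparison described above can be sketched with plain Python data structures. The real function operates on pandas DataFrames; this illustration uses lists of dictionaries so that it runs with the standard library alone. The function name find_new_and_changed and the key column analysis_id are hypothetical, and deleted rows are not handled in this sketch.

```python
def find_new_and_changed(old_rows, new_rows, key="analysis_id"):
    """Return the rows of new_rows that are new or changed, or None.

    A row is "new" if its key is absent from old_rows, and "changed"
    if the key exists but any other field differs.
    """
    old_by_key = {row[key]: row for row in old_rows}
    changed = [row for row in new_rows if old_by_key.get(row[key]) != row]
    return changed or None


old = [
    {"analysis_id": 1, "status": "production"},
    {"analysis_id": 2, "status": "production"},
]
new = [
    {"analysis_id": 1, "status": "production"},  # unchanged
    {"analysis_id": 2, "status": "obsolete"},    # changed
    {"analysis_id": 3, "status": "production"},  # new
]

to_load = find_new_and_changed(old, new)
nothing = find_new_and_changed(new, new)
```

Here to_load holds the rows for analyses 2 and 3, and nothing is None, matching the documented behaviour of returning None when no changes are found.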

vardb.metadata_wrangling.metadata_collector module

MetadataCollector is a base class with common functionality for assembling, cleaning, and extracting information from various database sources. A MetadataCollector subclass must be defined for each new data type. Any information that cannot be obtained directly from the databases can be specified in the Config object.

class vardb.metadata_wrangling.metadata_collector.ControlFreeCCollector(config, debug=False)

Bases: vardb.metadata_wrangling.metadata_collector.MetadataCollector

Collects metadata associated with controlfreec pipeline

data_type = 'controlfreec'

class vardb.metadata_wrangling.metadata_collector.ExpressionCollector(config, debug=False)

Bases: vardb.metadata_wrangling.metadata_collector.MetadataCollector

Collects metadata associated with the gene coverage pipeline

collect_metadata()

The main function of the metadata collector. This collects metadata for all analyses matching the specified project and data type as defined in the config. Data is obtained by querying BioApps, LIMS and optionally the file system. Validates metadata and returns a Metadata object.

Returns:a Metadata object with the collected metadata
Modifies:self.metadata
data_type = 'expression'

class vardb.metadata_wrangling.metadata_collector.Metadata(df=None, path=None, debug=False)

Bases: object

Metadata is a class for storing metadata information for loading to vardb. It takes either a dataframe or a path. It loads the data, validates it, and adds default values.

difference(old_metadata)

Finds the difference between this metadata and another Metadata object

Parameters:old_metadata – a Metadata object to compare to
Returns:
k = 'production'

output_to_tsv(output_path)

Writes the given DataFrame to a tab-delimited file in the specified load file directory.

Parameters:output_path – full path to destination file

class vardb.metadata_wrangling.metadata_collector.MetadataCollector(config, debug=False)

Bases: object

Abstract class which defines the common operations needed to collect metadata for loading to vardb.

collect_metadata()

The main function of the metadata collector. This collects metadata for all analyses matching the specified project and data type as defined in the config. Data is obtained by querying BioApps, LIMS and optionally the file system. Validates metadata and returns a Metadata object.

Returns:a Metadata object with the collected metadata
Modifies:self.metadata
data_type

classmethod factory(config, data_type, debug=False)

Returns the correct MetadataCollector subclass which corresponds to the data_type requested

Parameters:
  • config – a Config object which contains all required parameters to fully specify the MC class
  • data_type – the data type that is to be collected
  • debug – True if you want to suppress errors for debugging purposes
Returns:

MC subclass corresponding to the data type

Raises:

MetadataCollectorException if essential information is missing from config
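
One common way to implement such a factory is to match the requested data_type against the data_type attribute of each subclass. The sketch below shows that pattern under several assumptions: BaseCollector, VCallSketch, ExpressionSketch, and CollectorError are hypothetical stand-ins, the factory returns an instance rather than the class itself, and an unknown data type is treated as an error. The actual vardb lookup may differ.

```python
class CollectorError(Exception):
    """Hypothetical stand-in for MetadataCollectorException."""


class BaseCollector(object):
    """Hypothetical stand-in for the MetadataCollector base class."""

    data_type = None

    def __init__(self, config, debug=False):
        self.config = config
        self.debug = debug

    @classmethod
    def factory(cls, config, data_type, debug=False):
        # Pick the direct subclass whose data_type matches the request.
        for subclass in cls.__subclasses__():
            if subclass.data_type == data_type:
                return subclass(config, debug=debug)
        raise CollectorError("No collector for data type %r" % data_type)


class VCallSketch(BaseCollector):
    data_type = "vcall"


class ExpressionSketch(BaseCollector):
    data_type = "expression"


collector = BaseCollector.factory({"project": "example_project"}, "vcall")
```

With this design, supporting a new data type only requires defining a subclass with its own data_type value; the factory itself does not need to change.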

metadata = None

exception vardb.metadata_wrangling.metadata_collector.MetadataCollectorException

Bases: exceptions.Exception

class vardb.metadata_wrangling.metadata_collector.ReviewedSomaticCNVCollector(config, debug=False)

Bases: vardb.metadata_wrangling.metadata_collector.MetadataCollector, vardb.metadata_wrangling.metadata_collector.TCFilter

Collects metadata associated with the reviewed somatic CNV pipeline

collect_metadata()

The main function of the metadata collector. This collects metadata for all analyses matching the specified project and data type as defined in the config. Data is obtained by querying BioApps, LIMS and optionally the file system. Validates metadata and returns a Metadata object.

Returns:a Metadata object with the collected metadata
Modifies:self.metadata
data_type = 'somatic_cnv'

class vardb.metadata_wrangling.metadata_collector.ReviewedSomaticLOHCollector(config, debug=False)

Bases: vardb.metadata_wrangling.metadata_collector.MetadataCollector, vardb.metadata_wrangling.metadata_collector.TCFilter

Collects metadata associated with the somatic LOH pipeline.

collect_metadata()

The main function of the metadata collector. This collects metadata for all analyses matching the specified project and data type as defined in the config. Data is obtained by querying BioApps, LIMS and optionally the file system. Validates metadata and returns a Metadata object.

Returns:a Metadata object with the collected metadata
Modifies:self.metadata
data_type = 'somatic_loh'

class vardb.metadata_wrangling.metadata_collector.SomaticSmallVariantCollector(config, debug=False)

Bases: vardb.metadata_wrangling.metadata_collector.MetadataCollector

Collects metadata associated with strelka and mutationseq pipelines

data_type = 'small_somatic'

class vardb.metadata_wrangling.metadata_collector.TCFilter

Bases: object

Collection of routines for filtering somatic cnv pipelines by the reviewed tumour content

filter_metadata(bioapps_df, tumour_df)

get_tumour_content(output_data_path)

Retrieves tumour content from path for somatic_cnv pipeline

Returns:tumour content

class vardb.metadata_wrangling.metadata_collector.VCallCollector(config, debug=False)

Bases: vardb.metadata_wrangling.metadata_collector.MetadataCollector

Collects metadata associated with the vcall pipeline

data_type = 'vcall'

vardb.metadata_wrangling.metadata_collector.throw_exception(msg, debug)

Logs the error message, and raises MetadataCollectorException if debug is False

Parameters:
  • msg (str) – error message
  • debug (bool) – True to log the error without raising the exception; False to log the error and raise MetadataCollectorException
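
The log-then-maybe-raise behaviour can be sketched with the standard logging module. The names throw_exception_sketch and CollectorError and the logger name are hypothetical; the logging destination used by vardb is not specified here, so this sketch simply logs to whatever handlers are configured.

```python
import logging

logger = logging.getLogger("metadata_wrangling_sketch")


class CollectorError(Exception):
    """Hypothetical stand-in for MetadataCollectorException."""


def throw_exception_sketch(msg, debug):
    """Always log msg as an error; raise only when debug is False."""
    logger.error(msg)
    if not debug:
        raise CollectorError(msg)


# In debug mode the problem is recorded but processing continues.
throw_exception_sketch("missing library name", debug=True)
```

Routing every error through one helper keeps the debug switch in a single place, so collectors do not need their own conditional logic around each raise.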