vardb.variant_file_loaders package

This package is responsible for loading the data files and their metadata to the variant database. To prevent data integrity issues, where the data and metadata are out of sync, we require that the data and metadata be loaded at the same time, in a single transaction.
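The single-transaction requirement can be illustrated with a minimal sketch. It uses the standard-library sqlite3 module as a stand-in for the real vardb connection. The table names, columns, and the load_together function are hypothetical and not part of vardb. The point is only that the metadata row and the data rows are committed together or not at all.

```python
import sqlite3


def load_together(conn, metadata, rows, simulate=False):
    """Insert one metadata record and its data rows as a single unit.

    Hypothetical helper: either every insert is committed or none is.
    """
    try:
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO analysis (library_name, pipeline) VALUES (?, ?)",
            (metadata["library_name"], metadata["pipeline"]),
        )
        analysis_id = cur.lastrowid
        cur.executemany(
            "INSERT INTO variant (analysis_id, chrom, pos) VALUES (?, ?, ?)",
            [(analysis_id, chrom, pos) for chrom, pos in rows],
        )
        if simulate:
            # A simulated load exercises the inserts, then discards them.
            conn.rollback()
        else:
            conn.commit()
    except Exception:
        # Any failure undoes the metadata insert as well as the data inserts.
        conn.rollback()
        raise


# Illustrative in-memory schema (an assumption, not the vardb schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE analysis (id INTEGER PRIMARY KEY, library_name TEXT, pipeline TEXT)"
)
conn.execute(
    "CREATE TABLE variant (analysis_id INTEGER, chrom TEXT NOT NULL, pos INTEGER NOT NULL)"
)
conn.commit()
```

If a data row fails, as with a NULL position in this sketch, the metadata insert is rolled back too, so no analysis record is left without its data.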

The main input to the load_files function is the loader file. It is a tab-delimited file with columns for all of the required metadata, as well as the full path to the data file. The metadata_wrangling package is responsible for creating this file from metadata obtained from other GSC databases, or by scraping the file system if the pipeline is not tracked in a database.
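As a rough illustration of the loader file layout, the sketch below parses a small tab-delimited text with the standard csv module. The header row, the subset of columns shown, the paths, and the read_loader_rows helper are all assumptions for the example. The actual file produced by metadata_wrangling may differ.

```python
import csv
import io

# Hypothetical two-row loader file; the data path is the first column.
LOADER_TEXT = (
    "output_data_path\tlibrary_name\tpipeline\tpipeline_version\n"
    "/path/to/a.vcf\tLIB1\tsnv_caller\t1.0\n"
    "/path/to/b.vcf\tLIB2\tsnv_caller\t1.0\n"
)


def read_loader_rows(handle):
    """Yield (data_path, metadata_dict) for each data line of a loader file."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)  # assumes the first line names the columns
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        yield fields[0], dict(zip(header, fields))


rows = list(read_loader_rows(io.StringIO(LOADER_TEXT)))
for data_path, record in rows:
    print(data_path, record["library_name"])
```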

The metadata is stored and validated using the SampleMetaData class. This class contains all of the metadata requirements for vardb, as well as their expected type, and any defaults.
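The idea of a dictionary-like class driven by per-column requirements can be sketched as follows. This is not the SampleMetaData implementation. MetaDataSketch is a hypothetical, much reduced stand-in that borrows three column definitions from the documented columns attribute and shows how type coercion, defaults, and a completeness check might fit together.

```python
from collections import namedtuple

try:
    from collections.abc import MutableMapping  # Python 3
except ImportError:
    from collections import MutableMapping  # Python 2

# Mirrors the documented Column(type, default, required) signature.
Column = namedtuple("Column", ["type", "default", "required"])


class MetaDataSketch(MutableMapping):
    """Hypothetical, reduced illustration of a validated metadata mapping."""

    columns = {
        "library_name": Column(str, None, True),
        "diagnosis_age": Column(int, None, False),
        "gender": Column(str, "Unknown", False),
    }

    def __init__(self, *args, **kwargs):
        self._data = {}
        self.update(dict(*args, **kwargs))

    def __setitem__(self, key, value):
        if key not in self.columns:
            raise KeyError("unknown metadata column: %s" % key)
        # Coerce the value to the column's declared type.
        self._data[key] = self.columns[key].type(value)

    def __getitem__(self, key):
        if key in self._data:
            return self._data[key]
        # Fall back to the column default; unknown keys raise KeyError.
        return self.columns[key].default

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self.columns)

    def __len__(self):
        return len(self.columns)

    def complete(self):
        """Return True if every required column has a value."""
        return all(
            self[name] is not None
            for name, col in self.columns.items()
            if col.required
        )


meta = MetaDataSketch(diagnosis_age="42")
```

In this sketch, meta["diagnosis_age"] is stored as an integer, meta["gender"] falls back to its default, and meta.complete() stays False until library_name is set.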

The information about each type of data file and how it is to be parsed and loaded to vardb is contained in the variant_data_files classes.

Submodules

vardb.variant_file_loaders.loader module

Loads data and metadata specified in a loader file to a variant database

exception vardb.variant_file_loaders.loader.LoaderException

Bases: exceptions.Exception

vardb.variant_file_loaders.loader.download_unannotated_snps_indels(hawq, unannotated_vcf_name)

Saves all unannotated snps and indels to file for annotating offline

Parameters:
  • hawq – a vardb loader/pivotal connection object
  • unannotated_vcf_name – output path
Raises:

LoaderException if the query is unsuccessful

vardb.variant_file_loaders.loader.load_file(file, metadata, hawq, simulate=False)

Loads a single file to vardb

Parameters:
  • file – variant file object (see the variant_data_files package)
  • metadata – the metadata to load to sample and analysis
  • hawq – a vardb Loader connection
  • simulate – True if you want to just simulate, False if you want to actually load
Raises:

LoaderException if file is not loaded properly

vardb.variant_file_loaders.loader.load_files(hawq, load_filename, simulate=False, **options)

Loads the files listed in the loader file “load_filename” into vardb

Parameters:
  • hawq – A vardb Loader object
  • load_filename – Tab-delimited file in which each line holds the parameters needed for the data file type. The first parameter MUST be the data path.
  • simulate – True if you want to just simulate, False if you want to actually load
Returns:

number of files not loaded
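One way to read the documented return value is as a failure count accumulated over the per-file loads. The sketch below shows that pattern with stand-ins only. The local LoaderException class, count_failed_loads, and fake_load_one are hypothetical, and the real load_files may handle errors differently.

```python
class LoaderException(Exception):
    """Local stand-in for vardb's LoaderException."""


def count_failed_loads(records, load_one, simulate=False):
    """Try to load every record and return how many could not be loaded."""
    not_loaded = 0
    for record in records:
        try:
            load_one(record, simulate=simulate)
        except LoaderException as err:
            # One bad file does not stop the remaining files from loading.
            not_loaded += 1
            print("not loaded: %s (%s)" % (record, err))
    return not_loaded


def fake_load_one(record, simulate=False):
    """Hypothetical per-file loader that rejects anything but .vcf paths."""
    if not record.endswith(".vcf"):
        raise LoaderException("unsupported file type")


failed = count_failed_loads(["a.vcf", "b.txt", "c.vcf"], fake_load_one, simulate=True)
```

A caller could run once with simulate=True, check that the count is zero, and only then repeat the call with simulate=False.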

vardb.variant_file_loaders.loader.load_vcf_annotations(hawq, annotated_vcf_name, simulate=False, truncate=True)

Loads an annotations vcf file to the snp_eff table

Parameters:
  • hawq – A vardb loader/pivotal connection object
  • annotated_vcf_name – filename of annotated vcf file
  • simulate – True if you want to simulate loading and not actually load to the database
  • truncate – True if you want to truncate the unannotated_snps_indels table after the annotations are successfully loaded

vardb.variant_file_loaders.loader.remove_file(output_data_path, hawq, simulate=False)

vardb.variant_file_loaders.sample_metadata module

class vardb.variant_file_loaders.sample_metadata.Column(type, default, required)

Bases: object

Holds the type, default, and required flag for one column of the metadata

class vardb.variant_file_loaders.sample_metadata.SampleMetaData(*args, **kwargs)

Bases: _abcoll.MutableMapping

SampleMetaData is a dictionary-type class which contains the metadata for files loaded to vardb. It has routines to clean data, check for valid data, as well as to input defaults.

analysis_items()
analysis_keys()
analysis_values()
columns = {
    'aligner': type: <type 'str'>, default: None, required: True,
    'analysis_date': type: <type 'datetime.datetime'>, default: None, required: True,
    'analysis_object_id': type: <type 'int'>, default: None, required: False,
    'analysis_object_type': type: <type 'str'>, default: None, required: False,
    'anatomic_site': type: <type 'str'>, default: None, required: True,
    'anonymous_patient_id': type: <type 'str'>, default: None, required: True,
    'auxiliary_analysis': type: <type 'str'>, default: None, required: False,
    'cancer': type: <type 'bool'>, default: None, required: True,
    'cancer_stage': type: <type 'str'>, default: None, required: False,
    'cancer_subtype': type: strlist, default: None, required: False,
    'control_type': type: <type 'str'>, default: None, required: False,
    'developmental_stage': type: <type 'str'>, default: Unknown, required: False,
    'diagnosis_age': type: <type 'int'>, default: None, required: False,
    'disease_status': type: <type 'str'>, default: None, required: None,
    'diseased': type: <type 'bool'>, default: None, required: True,
    'entry_date': type: <type 'str'>, default: now, required: False,
    'ethnicity': type: <type 'str'>, default: Unknown, required: False,
    'exon_capture_version': type: <type 'str'>, default: None, required: False,
    'fixation': type: <type 'str'>, default: Unknown, required: False,
    'gender': type: <type 'str'>, default: Unknown, required: False,
    'gene_model': type: <type 'str'>, default: None, required: False,
    'genome_reference': type: <type 'str'>, default: None, required: True,
    'input_data_path': type: <type 'str'>, default: None, required: False,
    'is_cell_line': type: <type 'bool'>, default: False, required: False,
    'library_construction_protocol': type: <type 'str'>, default: None, required: False,
    'library_name': type: <type 'str'>, default: None, required: True,
    'library_strategy': type: <type 'str'>, default: None, required: True,
    'md5sum': type: <type 'str'>, default: None, required: True,
    'microbial_status': type: <type 'str'>, default: None, required: False,
    'output_data_path': type: <type 'str'>, default: None, required: True,
    'pathology_alias': type: <type 'str'>, default: None, required: False,
    'pathology_type': type: <type 'str'>, default: None, required: False,
    'patient_id': type: <type 'str'>, default: None, required: True,
    'permission': type: <type 'str'>, default: GENERAL_USE, required: False,
    'pipeline': type: <type 'str'>, default: None, required: True,
    'pipeline_version': type: <type 'str'>, default: None, required: True,
    'ploidy': type: <type 'str'>, default: None, required: False,
    'pre_existing_condition': type: <type 'str'>, default: None, required: False,
    'production': type: <type 'bool'>, default: True, required: False,
    'project': type: <type 'str'>, default: None, required: True,
    'read_length': type: intlist, default: None, required: False,
    'reference_library': type: <type 'str'>, default: None, required: False,
    'seq_tumour_content': type: <type 'float'>, default: None, required: False,
    'sequencing_centre': type: <type 'str'>, default: BCGSC, required: False,
    'sequencing_protocol': type: <type 'str'>, default: BCGSC, required: False,
    'source_type': type: <type 'str'>, default: Internal, required: False,
    'sow': type: <type 'str'>, default: None, required: False,
    'tumour_content': type: <type 'float'>, default: None, required: False
}
complete()

Checks to make sure all required fields are present

Returns: True if all required data is filled in, False if there is still missing data
default(key)
input_keys = ['patient_id', 'anonymous_patient_id', 'library_name', 'permission', 'project', 'sow', 'ethnicity', 'gender', 'developmental_stage', 'diagnosis_age', 'cancer', 'diseased', 'is_cell_line', 'cancer_stage', 'pathology_type', 'pathology_alias', 'cancer_subtype', 'pre_existing_condition', 'microbial_status', 'anatomic_site', 'fixation', 'tumour_content', 'seq_tumour_content', 'ploidy', 'library_strategy', 'library_construction_protocol', 'read_length', 'sequencing_protocol', 'exon_capture_version', 'sequencing_centre', 'source_type', 'control_type', 'reference_library', 'genome_reference', 'aligner', 'gene_model', 'pipeline', 'pipeline_version', 'auxiliary_analysis', 'input_data_path', 'output_data_path', 'analysis_object_type', 'analysis_object_id', 'production']
items() → list of D's (key, value) pairs, as 2-tuples
iteritems() → an iterator over the (key, value) items of D
iterkeys() → an iterator over the keys of D
itervalues() → an iterator over the values of D
key = 'entry_date'
keys() → list of D's keys
required(key)
sample_items()
sample_keys()
sample_values()
type(key)
values() → list of D's values