vardb.variant_file_loaders package
This package loads data files and their metadata into the variant database. To prevent integrity issues in which the data and metadata fall out of sync, both must be loaded at the same time, in a single transaction.
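The single-transaction requirement can be illustrated with a minimal sketch. It uses Python 3 and the standard-library sqlite3 module as a stand-in for the real database connection; the table layout and the load_atomically helper are assumptions made for illustration and are not part of vardb.

```python
import sqlite3


def load_atomically(conn, metadata_row, data_rows):
    """Insert one metadata row and its data rows in a single transaction.

    Returns True if everything was committed, False if it was rolled back.
    """
    try:
        # As a context manager, the connection commits on success and
        # rolls back if any statement inside the block raises.
        with conn:
            conn.execute(
                "INSERT INTO sample (library_name, md5sum) VALUES (?, ?)",
                metadata_row,
            )
            conn.executemany(
                "INSERT INTO variant (library_name, chrom, pos) VALUES (?, ?, ?)",
                data_rows,
            )
    except sqlite3.Error:
        return False
    return True


# Hypothetical schema, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (library_name TEXT PRIMARY KEY, md5sum TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE variant (library_name TEXT NOT NULL, chrom TEXT NOT NULL, pos INTEGER NOT NULL)"
)
conn.commit()

good = load_atomically(conn, ("LIB_A", "0123abcd"), [("LIB_A", "1", 100), ("LIB_A", "2", 200)])
# The second data row is invalid (pos is NULL), so the metadata row is rolled back too.
bad = load_atomically(conn, ("LIB_B", "4567ef01"), [("LIB_B", "1", 300), ("LIB_B", "1", None)])

sample_count = conn.execute("SELECT COUNT(*) FROM sample").fetchone()[0]
variant_count = conn.execute("SELECT COUNT(*) FROM variant").fetchone()[0]
```

After the failed second load, neither the metadata row nor any of the data rows for LIB_B remain, which is the property the package relies on.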
The main input to the load_files function is the loader file. It is a tab-delimited file with columns for all of the required metadata, as well as the full path to the data file. The metadata_wrangling package is responsible for creating this file from metadata obtained from other GSC databases, or by scraping the file system if the pipeline is not tracked in a database.
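As a rough illustration of the loader file layout, the sketch below reads tab-delimited lines with the standard-library csv module (Python 3). It follows the load_files description in taking the first field as the data path. The absence of a header row, the read_loader_lines helper, the column order, and the example paths and values are all assumptions.

```python
import csv
import io


def read_loader_lines(handle, metadata_columns):
    """Yield (data_path, metadata) pairs from a tab-delimited loader file.

    Assumptions: no header row, data path in the first field, and the
    remaining fields in the order given by metadata_columns.
    """
    for line_number, fields in enumerate(csv.reader(handle, delimiter="\t"), 1):
        if not any(field.strip() for field in fields):
            continue  # skip blank lines
        data_path, values = fields[0], fields[1:]
        if len(values) != len(metadata_columns):
            raise ValueError(
                "line %d: expected %d metadata fields, found %d"
                % (line_number, len(metadata_columns), len(values))
            )
        yield data_path, dict(zip(metadata_columns, values))


# Hypothetical loader file contents; paths and values are made up.
example_text = (
    "/path/to/sample_a.vcf\tLIB_A\thypothetical_pipeline\n"
    "\n"
    "/path/to/sample_b.vcf\tLIB_B\thypothetical_pipeline\n"
)
parsed = list(read_loader_lines(io.StringIO(example_text), ["library_name", "pipeline"]))
```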
The metadata is stored and validated using the SampleMetaData class. This class defines every metadata field that vardb requires, along with each field's expected type and default value, if any.
Each type of data file, and how it is parsed and loaded to vardb, is described by the classes in the variant_data_files package.
Submodules
vardb.variant_file_loaders.loader module
Loads data and metadata specified in a loader file to a variant database
exception vardb.variant_file_loaders.loader.LoaderException
Bases: exceptions.Exception
vardb.variant_file_loaders.loader.download_unannotated_snps_indels(hawq, unannotated_vcf_name)
Saves all unannotated snps and indels to file for annotating offline
Parameters:
- hawq – a vardb loader/pivotal connection object
- unannotated_vcf_name – output path
Raises: LoaderException if the query is unsuccessful
vardb.variant_file_loaders.loader.load_file(file, metadata, hawq, simulate=False)
Loads a single file to vardb
Parameters:
- file – variant file object (see the variant_data_files package)
- metadata – the metadata to load to sample and analysis
- hawq – a vardb Loader connection
- simulate – True to simulate the load only, False to actually load
Raises: LoaderException if the file is not loaded properly
vardb.variant_file_loaders.loader.load_files(hawq, load_filename, simulate=False, **options)
Loads the files listed in the loader file “load_filename” into vardb
Parameters:
- hawq – a vardb Loader object
- load_filename – tab-delimited file in which each line has the parameters needed for the data file type. The first parameter MUST be the data path.
- simulate – True to simulate the load only, False to actually load
Returns: number of files not loaded
vardb.variant_file_loaders.loader.load_vcf_annotations(hawq, annotated_vcf_name, simulate=False, truncate=True)
Loads an annotations vcf file to the snp_eff table
Parameters:
- hawq – A vardb loader/pivotal connection object
- annotated_vcf_name – filename of annotated vcf file
- simulate – True if you want to simulate loading and not actually load to the database
- truncate – True if you want to truncate the unannotated_snps_indels table after the annotations are successfully loaded
vardb.variant_file_loaders.loader.remove_file(output_data_path, hawq, simulate=False)
vardb.variant_file_loaders.sample_metadata module
class vardb.variant_file_loaders.sample_metadata.Column(type, default, required)
Bases: object
Contains data on each column of the metadata
class vardb.variant_file_loaders.sample_metadata.SampleMetaData(*args, **kwargs)
Bases: _abcoll.MutableMapping
SampleMetaData is a dictionary-type class which contains the metadata for files loaded to vardb. It has routines to clean data, check for valid data, as well as to input defaults.
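To make the described behaviour concrete, here is a minimal sketch of a dictionary-type metadata class with per-column type, default, and required settings. It is written for Python 3 (collections.abc rather than the Python 2 _abcoll module) and is not the actual SampleMetaData code: MetaDataSketch, its three-column table, and the coercion and default handling are assumptions chosen for illustration.

```python
from collections import namedtuple
from collections.abc import MutableMapping

# Mirrors the documented Column(type, default, required) signature.
Column = namedtuple("Column", ["type", "default", "required"])


class MetaDataSketch(MutableMapping):
    """Hypothetical, cut-down analogue of SampleMetaData."""

    # Three columns copied from the documented table; the real class has many more.
    columns = {
        "library_name": Column(str, None, True),
        "diagnosis_age": Column(int, None, False),
        "gender": Column(str, "Unknown", False),
    }

    def __init__(self, *args, **kwargs):
        self._data = {}
        # Fill in defaults first so explicit values can override them.
        for key, column in self.columns.items():
            if column.default is not None:
                self._data[key] = column.default
        self.update(dict(*args, **kwargs))

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        if key not in self.columns:
            raise KeyError("unknown metadata column: %r" % (key,))
        # Coerce to the declared type, e.g. "42" -> 42 for an int column.
        self._data[key] = self.columns[key].type(value)

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def complete(self):
        """Return True if every required column has a value."""
        return all(
            key in self._data
            for key, column in self.columns.items()
            if column.required
        )


meta = MetaDataSketch(diagnosis_age="42")
complete_before = meta.complete()
meta["library_name"] = "LIB_A"
complete_after = meta.complete()
```

Subclassing MutableMapping means that update, keys, items, and values come for free once the five abstract methods are defined, which matches the dictionary-style methods listed for the class.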
analysis_items()
analysis_keys()
analysis_values()
columns = {
'aligner': type: <type 'str'>, default: None, required: True,
'analysis_date': type: <type 'datetime.datetime'>, default: None, required: True,
'analysis_object_id': type: <type 'int'>, default: None, required: False,
'analysis_object_type': type: <type 'str'>, default: None, required: False,
'anatomic_site': type: <type 'str'>, default: None, required: True,
'anonymous_patient_id': type: <type 'str'>, default: None, required: True,
'auxiliary_analysis': type: <type 'str'>, default: None, required: False,
'cancer': type: <type 'bool'>, default: None, required: True,
'cancer_stage': type: <type 'str'>, default: None, required: False,
'cancer_subtype': type: strlist, default: None, required: False,
'control_type': type: <type 'str'>, default: None, required: False,
'developmental_stage': type: <type 'str'>, default: Unknown, required: False,
'diagnosis_age': type: <type 'int'>, default: None, required: False,
'disease_status': type: <type 'str'>, default: None, required: None,
'diseased': type: <type 'bool'>, default: None, required: True,
'entry_date': type: <type 'str'>, default: now, required: False,
'ethnicity': type: <type 'str'>, default: Unknown, required: False,
'exon_capture_version': type: <type 'str'>, default: None, required: False,
'fixation': type: <type 'str'>, default: Unknown, required: False,
'gender': type: <type 'str'>, default: Unknown, required: False,
'gene_model': type: <type 'str'>, default: None, required: False,
'genome_reference': type: <type 'str'>, default: None, required: True,
'input_data_path': type: <type 'str'>, default: None, required: False,
'is_cell_line': type: <type 'bool'>, default: False, required: False,
'library_construction_protocol': type: <type 'str'>, default: None, required: False,
'library_name': type: <type 'str'>, default: None, required: True,
'library_strategy': type: <type 'str'>, default: None, required: True,
'md5sum': type: <type 'str'>, default: None, required: True,
'microbial_status': type: <type 'str'>, default: None, required: False,
'output_data_path': type: <type 'str'>, default: None, required: True,
'pathology_alias': type: <type 'str'>, default: None, required: False,
'pathology_type': type: <type 'str'>, default: None, required: False,
'patient_id': type: <type 'str'>, default: None, required: True,
'permission': type: <type 'str'>, default: GENERAL_USE, required: False,
'pipeline': type: <type 'str'>, default: None, required: True,
'pipeline_version': type: <type 'str'>, default: None, required: True,
'ploidy': type: <type 'str'>, default: None, required: False,
'pre_existing_condition': type: <type 'str'>, default: None, required: False,
'production': type: <type 'bool'>, default: True, required: False,
'project': type: <type 'str'>, default: None, required: True,
'read_length': type: intlist, default: None, required: False,
'reference_library': type: <type 'str'>, default: None, required: False,
'seq_tumour_content': type: <type 'float'>, default: None, required: False,
'sequencing_centre': type: <type 'str'>, default: BCGSC, required: False,
'sequencing_protocol': type: <type 'str'>, default: BCGSC, required: False,
'source_type': type: <type 'str'>, default: Internal, required: False,
'sow': type: <type 'str'>, default: None, required: False,
'tumour_content': type: <type 'float'>, default: None, required: False
}
complete()
Checks to make sure all required fields are present
Returns: True if all required data is filled in, False if there is still missing data
default(key)
input_keys = ['patient_id', 'anonymous_patient_id', 'library_name', 'permission', 'project', 'sow', 'ethnicity', 'gender', 'developmental_stage', 'diagnosis_age', 'cancer', 'diseased', 'is_cell_line', 'cancer_stage', 'pathology_type', 'pathology_alias', 'cancer_subtype', 'pre_existing_condition', 'microbial_status', 'anatomic_site', 'fixation', 'tumour_content', 'seq_tumour_content', 'ploidy', 'library_strategy', 'library_construction_protocol', 'read_length', 'sequencing_protocol', 'exon_capture_version', 'sequencing_centre', 'source_type', 'control_type', 'reference_library', 'genome_reference', 'aligner', 'gene_model', 'pipeline', 'pipeline_version', 'auxiliary_analysis', 'input_data_path', 'output_data_path', 'analysis_object_type', 'analysis_object_id', 'production']
items() → list of D's (key, value) pairs, as 2-tuples
iteritems() → an iterator over the (key, value) items of D
iterkeys() → an iterator over the keys of D
itervalues() → an iterator over the values of D
key = 'entry_date'
keys() → list of D's keys
required(key)
sample_items()
sample_keys()
sample_values()
type(key)
values() → list of D's values