Variant Representation

The goal is to create a useful JSON-LD Item hierarchy for working with variants. To do this, we create several different item types that are related to each other. More relationships may be added later, but these items form a logically related group and so are discussed in depth in this document. Previously, these types were generated from the mapping table, but they are now maintained directly.

  1. GeneAnnotationField - these are annotation fields on Genes that come from Daniel’s annotation DB but are processed from the gene mapping table, from this point on referred to as the gene table.
  2. Gene - the gene item schema is dynamically built from the gene table. The gene items come from Daniel’s annotation DB and are posted as is and validated against the schema.
  3. AnnotationField - these are annotation fields on Variants that also come from Daniel’s annotation DB but are processed from the variant mapping table, from this point on referred to as the variant table.
  4. VariantConsequence - this is an item that encompasses information on the consequence of a variant. It has a well-defined schema and items that are generated by the wranglers.
  5. Variant - the variant item schema was built from the variant table but is now directly maintained. These items represent a catalogued discrepancy between an expectation set by the reference genome and an observed/measured reality that has been seen to occur in some individuals. Most commonly in our data, this is an SNV, but in principle it could be any kind of variant. To generate these items we process annotated VCF files, which have passed through the bioinformatics pipeline and have then been annotated by Daniel’s annotation DB. They are again validated against the schema on post.
  6. VariantSample - the variant_sample schema (and embeds) was built from the variant table but is now directly maintained. These items encompass information about the sample(s) that registered this variant and are also read from the VCF. Since multiple people could have been sequenced, the variant samples on a particular VCF do not necessarily have the variant genotype (but the proband does).
  7. StructuralVariant - same idea as variant, except it represents a CNV/SV call.
  8. StructuralVariantSample - same idea as variant_sample, except a sample for a structural_variant.

The last step of this pipeline, which will be discussed below, is ingesting these elements. The following 5 things constitute an “ingestion version”:

  1. Variant Table (–> AnnotationFields)
  2. Gene Table (–> GeneAnnotationFields)
  3. Gene List (–> Genes)
  4. (Collection of) VCF(s) (–> Variants, VariantSamples)
  5. Set of VariantConsequence items

All items ingested as part of the first 4 will be given a ‘MUTANNO_Version’ field, which will hold the annotation DB version used to ingest. Note that only one of these ‘versions’ can exist in our system since the schemas are dynamically generated. The relevant items below also have the following links:

  1. Variants link to Gene and VariantConsequence items
  2. VariantSample links to Variant (and thus Gene and VariantConsequence)

See ingestion.py for information on code structure. Note that this structure is now deprecated in favor of maintaining the schemas directly. The only step that still runs is VCF ingestion.

Ingestion Step 1: Gene/Variant Table Intake

The first thing we need to do is build the annotation field items and generate the schemas for variant, variant_sample and gene. The code that does this mostly lives in variant_table_intake.py and is extended/repurposed in gene_table_intake.py. Since the two tables are similar, very few code differences are needed to parse them. Note that Gene must come before Variant!

Ingestion Step 2: Gene Ingestion

Once we have our schemas, we can ingest/post the genes. These come directly from Daniel’s annotation DB so not much work is required. They are posted as is; see ingest_genes.py.

Ingestion Step 2.5: VariantConsequences

All variant consequence items must be posted prior to ingesting the VCF, as these variants will linkTo the appropriate consequence items. These items should change very infrequently as changes to them will cause cascading invalidation and may have revision history implications.

Ingestion Step 3: VCF Ingestion

The last part of the process is ingesting the variants from VCF files and forming appropriate links. Links to Gene and VariantConsequence will happen automatically. Links from VariantSample to Variant are created manually. This requires both parsing the VCF file and formatting the items appropriately based on the schema. See ingest_vcf.py.

VCF Parsing Details

After producing the schemas it is time to ingest the annotated VCF. This file has a complicated structure described below. This step is written in a more object-oriented way, with VCFParser as the main class containing several methods specific to VCF parsing. Helper functions handle specific steps and culminate in the run method, which processes an entire VCF file, producing all the variant and variant sample items. An overview of the steps is below.

  1. Read VCF Metadata. This includes splitting VCF fields into annotation and non-annotation fields, that way we know which fields will require additional post processing.
  2. Parse standard VCF fields. These are easily acquired as there is nothing special about them. The variant sample item consists entirely of these fields.
  3. Parse annotation fields. These are much trickier because they are formatted differently and must be encoded a certain way to not break the VCF specification. More on this follows in the VCF specification.
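As a rough illustration of step 1, the sketch below separates annotation fields from other INFO fields using only the header lines. The exact layout of the MUTANNO header line is not shown in this document, so the `##MUTANNO=<ID=...>` form used here is an assumption, as are the function name and the sample header. It is not the VCFParser implementation.

```python
import re

# Matches header lines of the form ##TAG=<ID=NAME,...> and captures TAG and NAME.
# The MUTANNO line layout is assumed to mirror the standard ##INFO layout.
_HEADER_ID = re.compile(r"^##(?P<tag>\w+)=<ID=(?P<id>[^,>]+)")


def split_info_fields(header_lines):
    """Return (annotation_ids, other_info_ids) from VCF '##' header lines."""
    tagged = set()   # IDs that carry a MUTANNO tag
    info_ids = []    # IDs declared by INFO tags, in file order
    for line in header_lines:
        match = _HEADER_ID.match(line)
        if not match:
            continue  # e.g. ##fileformat lines
        if match["tag"] == "MUTANNO":
            tagged.add(match["id"])
        elif match["tag"] == "INFO":
            info_ids.append(match["id"])
    annotation = [name for name in info_ids if name in tagged]
    other = [name for name in info_ids if name not in tagged]
    return annotation, other


# Hypothetical header, for illustration only.
header = [
    "##fileformat=VCFv4.2",
    '##MUTANNO=<ID=VEP,Version="hypothetical">',
    '##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count">',
    '##INFO=<ID=VEP,Number=.,Type=String,Description="hypothetical format text">',
]
print(split_info_fields(header))  # (['VEP'], ['AC'])
```

Fields in the first list would need the annotation post-processing described in step 3; fields in the second list can be read as standard VCF fields.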

Annotated VCF Specification

Below is an outline of the annotated VCF structure with an example on how exactly it is processed.

VCF-Specific Restrictions

For the annotated VCF we make use of INFO fields to encapsulate our annotations. This field is part of the VCF structure and has the following restrictions on values within the field (e.g. ‘AC=2;VEP=1|2…’ etc).

  1. String format (conversion to the type specified on the Mapping Table is done later)
  2. No whitespace (tabs, spaces or otherwise)
  3. No semicolon (delineates fields in the INFO block)
  4. No equals sign (=), which separates a field name from its value in the INFO block, e.g. AC=2;VEP=1,2,3;
  5. Commas can only be used to separate annotation values
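The restrictions above can be checked mechanically before a value is written into INFO. The following is a minimal sketch, not code from the repository; the function name and the restriction labels are hypothetical.

```python
import re

# Characters that may not appear inside a single INFO value.
# Commas are not rejected here because they are allowed as the
# separator between annotation values.
_FORBIDDEN = {
    "whitespace": re.compile(r"\s"),
    "semicolon": re.compile(r";"),
    "equals": re.compile(r"="),
}


def info_value_violations(value):
    """Return the names of the restrictions that `value` breaks."""
    if not isinstance(value, str):
        return ["not a string"]
    return [name for name, pattern in _FORBIDDEN.items() if pattern.search(value)]


print(info_value_violations("1|2|3"))  # []
print(info_value_violations("a b;c"))  # ['whitespace', 'semicolon']
```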

Our Restrictions

Annotation fields that should be processed as such must be marked with a MUTANNO tag in the VCF metadata as below.

Annotation fields that have MUTANNO tags must also have a corresponding INFO tag. If the annotation is multi-valued, this tag must specify the format, and that format must be pipe (|) separated. An example of each is below.

If an annotation field can have multiple entries, as is the case with VEP, these entries must be comma separated as consistent with the VCF requirements. See raw row entry below.

If an annotation field within a sub-embedded object is an array, such as vep_domains, those entries must be tilde (~) separated and no further nesting is allowed.

Separator Summary

  1. Tab separates VCF specific fields and is thus restricted.
  2. Semicolon separates different annotation fields within INFO and is thus restricted.
  3. Comma separates sub-embedded objects within a single INFO field (such as VEP) and cannot be used in any other way.
  4. Pipe separates multi-valued annotation fields and cannot be used in any other way.
  5. Tilde separates the elements of array fields within a sub-embedded object, such as vep_domains, and cannot be used in any other way.

Parsing Example

Given these restrictions, below is a detailed walkthrough of how the VCF parser handles the annotation fields under this specification. A truncated example entry is below. Assume we are able to grab the appropriate MUTANNO/INFO header information. New lines are inserted for readability but are not present in the actual file.

The first line is the VCF field header. Fields other than INFO are readily accessible. All annotation fields are collapsed into the INFO section. FORMAT and HG002 follow after INFO. The fields below are tab separated as consistent with the VCF specification. A tab separates the last part of the data above and the INFO data below.

These annotations are all single valued and are thus processed directly as strings. Conversion to actual types is done later.
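A minimal sketch of this first pass over the INFO block, assuming nothing beyond the separator rules: split on semicolons to get the fields, then on the first equals sign to get the name and the raw string value. The function name and the sample INFO string are made up for illustration.

```python
def split_info(info):
    """Split a raw INFO string into a {field name: raw string value} dict."""
    fields = {}
    for entry in info.strip().split(";"):
        if not entry:
            continue  # tolerate a trailing semicolon
        name, sep, value = entry.partition("=")
        # Flag-style INFO entries have no '='; store them as an empty string.
        fields[name] = value if sep else ""
    return fields


# Hypothetical INFO block. Values stay strings at this stage.
print(split_info("AC=2;VEP=1|2,3|4;"))  # {'AC': '2', 'VEP': '1|2,3|4'}
```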

Above is a VEP annotation entry that is both multi-valued and has multiple entries. To parse this we first split on the comma to get the groups. Newlines are inserted to visualize the groups. We then split on pipe since the fields are pipe separated. Even if a field is blank, a pipe must be present for that field, otherwise we will not be able to determine which fields go with which values. Once we have all the fields, we then go through each one and post-process. If it is an array field (not shown in this example but consistent with the tilde rule above) then we split again on tilde to determine the array elements, otherwise the field value is cast to the appropriate type.
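The comma, pipe, tilde sequence described above can be sketched as follows. This is an illustration under stated assumptions, not the VCFParser code: the function name, its parameters, the field names vep_gene and vep_distance, and the sample value are all hypothetical, and array elements are left as strings here.

```python
def parse_multi_valued(raw, field_names, array_fields=(), casts=None):
    """Parse one multi-valued INFO value into a list of dicts.

    raw          -- the raw string, e.g. the part after 'VEP='
    field_names  -- sub-field names in the order given by the INFO header format
    array_fields -- names of sub-fields whose values are tilde-separated arrays
    casts        -- optional {name: callable} used to convert scalar values
    """
    casts = casts or {}
    entries = []
    for group in raw.split(","):       # comma: one group per entry
        values = group.split("|")      # pipe: one value per sub-field
        if len(values) != len(field_names):
            raise ValueError(
                f"expected {len(field_names)} pipe-separated values, "
                f"got {len(values)}: {group!r}"
            )
        entry = {}
        for name, value in zip(field_names, values):
            if value == "":
                continue               # blank field: pipe present, no value
            if name in array_fields:
                entry[name] = value.split("~")             # tilde: array elements
            else:
                entry[name] = casts.get(name, str)(value)  # cast scalar value
        entries.append(entry)
    return entries


# Hypothetical sub-fields and value, for illustration only.
names = ["vep_gene", "vep_distance", "vep_domains"]
raw = "GENE1|10|dom_a~dom_b,GENE2||"
print(parse_multi_valued(raw, names, array_fields={"vep_domains"},
                         casts={"vep_distance": int}))
# [{'vep_gene': 'GENE1', 'vep_distance': 10, 'vep_domains': ['dom_a', 'dom_b']}, {'vep_gene': 'GENE2'}]
```

The length check is what makes the "a pipe must be present even for blank fields" rule enforceable: a group with a missing pipe fails loudly instead of shifting values into the wrong fields.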

How to Provision Annotations

This section will describe how to “provision annotations”, which roughly means the process of ingesting annotation related items to the portal. Note that the paths in the commands that follow may change.

Local Machine

Follow the steps below. The whole process takes 30-45 minutes to run.

  1. Start up back-end resources: make deploy1
  2. Start up waitress: make deploy2
  3. (If first time) Download genes: make download-genes
  4. Load annotations: make deploy3

Output

The ingestion command uses tqdm to show progress bars, so you can tell what stage of the process is currently ongoing. At the end the output will look something like the below.

100%|███████████████| 284/284 [00:09<00:00, 30.90gene_annotation_fields/s]

100%|███████████████| 21873/21873 [20:12<00:00, 18.04genes/s]

100%|███████████████| 340/340 [00:18<00:00, 18.79variant_annotation_fields/s]

46variants [00:18,  2.44variants/s]

ERROR:encoded.commands.variant_ingestion:Encountered VCF format error: could not convert string to float: '18,0,19,0'

The error at the end is expected with the latest VCF. If a different error occurs, it should come with a reasonable description. As an example, the error from the output above is:

ERROR:encoded.commands.variant_ingestion:Encountered VCF format error: could not convert string to float: '18,0,19,0'

It tells you exactly which module logged the error (src/encoded/commands/variant_ingestion.py), what type of error it was (VCF format error) and what caused it (a failed string-to-float conversion, which Python raises as a ValueError). Errors like these should be reported, along with the VCF row which threw the error (the 47th variant in the VCF, since we posted 46). In this case that line has an actual VCF spec validation error.
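The message text in that log line is the one Python's built-in float() produces when given a comma-separated string. The small sketch below reproduces it; the helper name is hypothetical and this is not the ingestion code itself.

```python
def describe_float_conversion(raw):
    """Return the converted float, or a description of the exception raised."""
    try:
        return float(raw)
    except ValueError as exc:
        return f"{type(exc).__name__}: {exc}"


print(describe_float_conversion("18.5"))       # 18.5
print(describe_float_conversion("18,0,19,0"))  # ValueError: could not convert string to float: '18,0,19,0'
```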

Ingesting Additional VCFs

To ingest more VCFs with the current setup, use the variant-ingestion command. See src/encoded/commands/variant_ingestion.py.