VCF Expression Annotator

The VCF Expression Annotator will take an output file from Cufflinks, Kallisto, or StringTie and add the data from that file to your VCF. The expression file type is specified using kallisto, stringtie, or cufflinks in the list of positional parameters.

In addition, the type of expression data, either gene or transcript, needs to be specified. The expression value is then written to the GX or TX field, respectively.

The input VCF needs to be annotated by VEP with gene and transcript information so that the VCF Expression Annotator can match a variant’s Ensembl gene or transcript identifier in the VCF to the one in the expression file. When running in gene mode, Ensembl gene IDs, not gene names, are used. Depending on the expression software used, the transcript identifiers might contain version numbers. To add transcript version numbers to your VEP annotation, use the --transcript_version flag when running VEP. Alternatively, use the --ignore-ensembl-id-version flag of the VCF Expression Annotator to ignore the version of Ensembl gene and transcript IDs when finding the matching entry in your expression file.
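
The effect of ignoring ID versions can be illustrated with a short Python sketch. It assumes, following the help text for --ignore-ensembl-id-version, that the version is the final period followed by a number. The helper name strip_ensembl_version is hypothetical and is not part of the tool.

```python
import re

# Hypothetical helper, not the tool's actual code: drops a trailing
# ".<digits>" version suffix from an Ensembl-style identifier.
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_ensembl_version(identifier):
    """Return the identifier without its trailing version, if any."""
    return _VERSION_SUFFIX.sub("", identifier)


# With versions ignored, a versioned ID from the expression file and an
# unversioned ID from the VEP annotation compare equal.
vcf_id = "ENST00001234"
expression_id = "ENST00001234.3"
print(strip_ensembl_version(expression_id) == strip_ensembl_version(vcf_id))
```

Identifiers without a version suffix pass through unchanged, so the same normalization can be applied to both sides of the comparison.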

The VCF Expression Annotator also accepts a custom tab-delimited (TSV) file as the expression file. This TSV file must contain one column with gene or transcript Ensembl IDs and one column with the expression values. It must also contain a header line that identifies the contents of each column. The --id-column and --expression-column parameters must match the headers of the gene/transcript identifier column and the expression value column, respectively. To use this option, set the expression file format to custom. Please note that when running in gene mode, the ID column must contain Ensembl gene IDs, not gene names.
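
As a rough illustration of the custom format, the following Python sketch reads a small TSV by its header names. The file content, the column names gene_id and tpm, and the function read_expression are all made up for this example. It shows only how a header line lets the two columns be selected by name, not how the tool itself parses the file.

```python
import csv
import io

# Made-up example content; "gene_id" and "tpm" are illustrative column
# headers that would be passed as --id-column and --expression-column.
EXAMPLE_TSV = (
    "gene_id\ttpm\tlength\n"
    "ENSG00000000001\t12.5\t1500\n"
    "ENSG00000000002\t0.0\t900\n"
)


def read_expression(handle, id_column, expression_column):
    """Map each identifier to its expression value, using the header line."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = {id_column, expression_column} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(
            "column(s) not found in header: " + ", ".join(sorted(missing))
        )
    return {row[id_column]: float(row[expression_column]) for row in reader}


values = read_expression(io.StringIO(EXAMPLE_TSV), "gene_id", "tpm")
print(values)
```

Additional columns, such as length in this example, are simply ignored because only the two named columns are read.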

By default the output VCF will be written to a .tx.vcf or .gx.vcf file next to your input VCF file. You can set a different output file name using the --output-vcf parameter.
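
The default naming can be sketched in Python as follows. This is an assumption about how the file ending is derived, not the tool's confirmed behavior: the sketch inserts .gx or .tx before the last .vcf in the input path. The function name default_output_path is hypothetical.

```python
def default_output_path(input_vcf, mode):
    """Insert ".gx" or ".tx" before the last ".vcf" of the input path.

    Assumed behavior for illustration only; mode is "gene" or "transcript".
    """
    tag = {"gene": "gx", "transcript": "tx"}[mode]
    head, sep, tail = input_vcf.rpartition(".vcf")
    if not sep:
        # No ".vcf" in the name: append the full ending (an assumption).
        return input_vcf + "." + tag + ".vcf"
    return head + "." + tag + ".vcf" + tail


print(default_output_path("/data/sample.vcf", "gene"))
print(default_output_path("/data/sample.vcf", "transcript"))
```

Because only the file name changes, the result stays in the same directory as the input VCF. Passing --output-vcf avoids relying on this convention.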

Usage

usage: vcf-expression-annotator [-h] [-i ID_COLUMN] [-e EXPRESSION_COLUMN]
                                [-s SAMPLE_NAME] [-o OUTPUT_VCF]
                                [--ignore-ensembl-id-version]
                                input_vcf expression_file
                                {kallisto,stringtie,cufflinks,custom}
                                {gene,transcript}

A tool that will add the data from several expression tools' output files to
the VCF INFO column. Supported tools are StringTie, Kallisto, and Cufflinks.
There also is a ``custom`` option to annotate with data from any tab-delimited
file.

positional arguments:
  input_vcf             A VEP-annotated VCF file
  expression_file       A TSV file containing expression estimates
  {kallisto,stringtie,cufflinks,custom}
                        The file format of the expression file to process. Use
                        `custom` to process file formats not explicitly
                        supported. The `custom` option requires the use of the
                        --id-column and --expression-column arguments.
  {gene,transcript}     The type of expression data in the expression_file

optional arguments:
  -h, --help            show this help message and exit
  -i ID_COLUMN, --id-column ID_COLUMN
                        The column header in the expression_file for the
                        column containing gene/transcript ids. Required when
                        using the `custom` format.
  -e EXPRESSION_COLUMN, --expression-column EXPRESSION_COLUMN
                        The column header in the expression_file for the
                        column containing expression data. Required when using
                        the `custom` format.
  -s SAMPLE_NAME, --sample-name SAMPLE_NAME
                        If the input_vcf contains multiple samples, the name
                        of the sample to annotate.
  -o OUTPUT_VCF, --output-vcf OUTPUT_VCF
                        Path to write the output VCF file. If not provided,
                        the output VCF file will be written next to the input
                        VCF file with a .tx.vcf or .gx.vcf file ending.
  --ignore-ensembl-id-version
                        Assumes that the final period and number denotes the
                        Ensembl ID version and ignores it (i.e. for
                        "ENST00001234.3" - ignores the ".3").