VCF Expression Annotator
The VCF Expression Annotator takes an output file from Cufflinks, Kallisto,
or StringTie and adds the data from that file to your VCF. The expression file type is
specified using kallisto, stringtie, or cufflinks in the list of positional parameters.
In addition, the type of expression data, either gene or transcript, needs to
be specified. The expression value is then written to the GX or TX field,
respectively.
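As an illustration, the positional arguments can be assembled as in the sketch below. The helper build_command and the file names are hypothetical and not part of the tool; only the argument order and the allowed values are taken from the usage text.

```python
import shlex

# Allowed values, taken from the tool's usage text.
FORMATS = {"kallisto", "stringtie", "cufflinks", "custom"}
MODES = {"gene", "transcript"}


def build_command(input_vcf, expression_file, file_format, mode):
    """Assemble the argument list for vcf-expression-annotator (hypothetical helper)."""
    if file_format not in FORMATS:
        raise ValueError(f"unsupported expression file format: {file_format}")
    if mode not in MODES:
        raise ValueError(f"unsupported expression data type: {mode}")
    # Positional order: input_vcf expression_file {format} {gene,transcript}.
    # The custom format additionally requires --id-column and
    # --expression-column, which this sketch does not handle.
    return ["vcf-expression-annotator", input_vcf, expression_file, file_format, mode]


# The file names are placeholders, not files shipped with the tool.
cmd = build_command("input.vcf", "abundance.tsv", "kallisto", "transcript")
print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run once the tool is installed; with transcript as the data type, the values would end up in the TX field.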
The input VCF needs to be annotated by VEP with gene and transcript information so
that the VCF Expression Annotator can match a variant’s Ensembl gene and transcript
identifiers in the VCF to those in the expression file. When running in gene mode,
Ensembl IDs, not gene names, are used. Depending on the expression software used,
the transcript identifiers might contain version numbers. To add transcript version
numbers to your VEP annotation, use the --transcript_version flag when running VEP.
You can also use the --ignore-ensembl-id-version flag of the VCF Expression Annotator
to ignore the version of Ensembl gene and transcript IDs when finding the matching
entry in your expression file.
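The effect of ignoring ID versions can be sketched as follows. This is an illustration of the described behavior, not the tool's actual implementation; strip_version and find_expression are hypothetical names, and the identifiers are made up.

```python
import re

# Matches a trailing ".<digits>" suffix, e.g. the ".3" in "ENST00001234.3".
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_version(ensembl_id):
    """Return the identifier without its trailing version number, if any."""
    return _VERSION_SUFFIX.sub("", ensembl_id)


def find_expression(ensembl_id, expression_by_id, ignore_version=False):
    """Look up an expression value, optionally ignoring ID versions."""
    if not ignore_version:
        return expression_by_id.get(ensembl_id)
    wanted = strip_version(ensembl_id)
    for candidate, value in expression_by_id.items():
        if strip_version(candidate) == wanted:
            return value
    return None


expression = {"ENST00001234.3": 12.5}
print(find_expression("ENST00001234", expression))                       # None
print(find_expression("ENST00001234", expression, ignore_version=True))  # 12.5
```

Without version stripping, an unversioned ID from the VCF does not match a versioned ID in the expression file, which is why either adding versions in VEP or ignoring them in the annotator is needed.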
The VCF Expression Annotator also accepts a custom tab-delimited (TSV) file as the
expression file. This TSV file needs to contain one column with Ensembl gene or
transcript IDs and one column with the expression values. It must also contain a
header line that identifies the contents of each column. The --id-column and
--expression-column parameters must match the headers of the gene/transcript
identifier column and the expression value column, respectively. To use this option,
set the expression file format to custom. Please note that when running in gene mode,
the ID column needs to contain Ensembl gene IDs, not gene names.
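A minimal sketch of how such a custom file maps onto the two column parameters is shown below. The column headers gene_id and tpm, the gene IDs, and the helper read_expression are all made up for this example; the tool's own parsing may differ.

```python
import csv
import io

# Hypothetical custom expression file; headers and IDs are invented.
custom_tsv = (
    "gene_id\tsymbol\ttpm\n"
    "ENSG00000000001\tGENE_A\t12.5\n"
    "ENSG00000000002\tGENE_B\t0.0\n"
)


def read_expression(handle, id_column, expression_column):
    """Map IDs to expression values using the named header columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    for column in (id_column, expression_column):
        if column not in header:
            raise ValueError(f"column {column!r} not found in header")
    return {row[id_column]: float(row[expression_column]) for row in reader}


values = read_expression(io.StringIO(custom_tsv), "gene_id", "tpm")
print(values)
```

For a file with these headers, the annotator would be run with the custom format, gene as the data type, --id-column gene_id, and --expression-column tpm. The extra symbol column is ignored because only the two named columns are read.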
By default, the output VCF is written to a .tx.vcf or .gx.vcf file next to
your input VCF file. You can set a different output file name using the
--output-vcf parameter.
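The default naming can be approximated as in the sketch below. It assumes that a trailing .vcf in the input name is replaced by the new ending rather than appended to; that detail is an assumption not stated in the text above, and default_output_path is a hypothetical helper, so check the tool's actual output before relying on it.

```python
from pathlib import Path


def default_output_path(input_vcf, mode):
    """Guess the default output path (assumed naming scheme)."""
    suffix = {"gene": ".gx.vcf", "transcript": ".tx.vcf"}[mode]
    path = Path(input_vcf)
    # Assumption: a trailing ".vcf" is replaced rather than appended to.
    stem = path.name[:-len(".vcf")] if path.name.endswith(".vcf") else path.name
    # Keep the input file's directory so the result sits next to the input.
    return path.with_name(stem + suffix)


print(default_output_path("/data/sample.vcf", "transcript"))
```

Passing --output-vcf avoids depending on this naming at all.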
Usage
usage: vcf-expression-annotator [-h] [-i ID_COLUMN] [-e EXPRESSION_COLUMN]
                                [-s SAMPLE_NAME] [-o OUTPUT_VCF]
                                [--ignore-ensembl-id-version]
                                input_vcf expression_file
                                {kallisto,stringtie,cufflinks,custom}
                                {gene,transcript}
A tool that will add the data from several expression tools' output files to
the VCF INFO column. Supported tools are StringTie, Kallisto, and Cufflinks.
There also is a ``custom`` option to annotate with data from any tab-delimited
file.
positional arguments:
  input_vcf             A VEP-annotated VCF file
  expression_file       A TSV file containing expression estimates
  {kallisto,stringtie,cufflinks,custom}
                        The file format of the expression file to process. Use
                        `custom` to process file formats not explicitly
                        supported. The `custom` option requires the use of the
                        --id-column and --expression-column arguments.
  {gene,transcript}     The type of expression data in the expression_file

optional arguments:
  -h, --help            show this help message and exit
  -i ID_COLUMN, --id-column ID_COLUMN
                        The column header in the expression_file for the
                        column containing gene/transcript ids. Required when
                        using the `custom` format.
  -e EXPRESSION_COLUMN, --expression-column EXPRESSION_COLUMN
                        The column header in the expression_file for the
                        column containing expression data. Required when using
                        the `custom` format.
  -s SAMPLE_NAME, --sample-name SAMPLE_NAME
                        If the input_vcf contains multiple samples, the name
                        of the sample to annotate.
  -o OUTPUT_VCF, --output-vcf OUTPUT_VCF
                        Path to write the output VCF file. If not provided,
                        the output VCF file will be written next to the input
                        VCF file with a .tx.vcf or .gx.vcf file ending.
  --ignore-ensembl-id-version
                        Assumes that the final period and number denotes the
                        Ensembl ID version and ignores it (i.e. for
                        "ENST00001234.3" - ignores the ".3").