File information: gene annotations
A key component of ErmineJ is the gene annotations, which you provide as input. Gene annotations enter ErmineJ in two ways: through the main annotation file you provide at startup, and through your “custom” gene sets, which live in your ermineJ.data directory.
When ErmineJ starts up, you are asked to provide an “annotation file”. The annotation file provides three types of information important to the operation of ermineJ:
- It provides mappings between probes and genes on microarray platforms. This need is predicated on the assumption that your input data files are keyed by the probes. If this is not true, then you won’t be using this feature, but you must still provide an annotation file.
- It provides human-readable descriptions for the probes or genes. This is needed even if you aren’t using a microarray.
- It provides Gene Ontology annotations for the genes. (These can be omitted if you are supplying other gene groups.)
Note: Even if you are not using GO, you must still provide at least a minimal annotation file that lists the genes you are using.
Note: Gene sets that have only one gene will not be loaded. In addition, by default the user interface hides sets that are empty, even though the software knows about them (such as GO terms). You can reveal these with the context (pop-up) menu in the table or tree view.
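The single-gene rule in the note above can be illustrated with a short Python sketch. This is not ErmineJ's own code: the function name `build_gene_sets`, the `min_genes` parameter, and the in-memory layout (probe ID mapped to symbol, name, and GO ids) are assumptions made for the example, and the identifiers are placeholders. It groups genes by GO id using the gene symbol, so two probes for the same gene count once, and then drops any set with fewer than two genes.

```python
from collections import defaultdict


def build_gene_sets(annotations, min_genes=2):
    """Group gene symbols by GO id and drop sets smaller than min_genes.

    annotations: {probe_id: (gene_symbol, gene_name, [go_ids])} (assumed layout).
    """
    gene_sets = defaultdict(set)
    for symbol, _name, go_ids in annotations.values():
        for go_id in go_ids:
            # Sets hold gene symbols, so several probes for one gene count once.
            gene_sets[go_id].add(symbol)
    return {go_id: genes for go_id, genes in gene_sets.items()
            if len(genes) >= min_genes}


# Placeholder data: two probes for GENE_A, one for GENE_B.
annotations = {
    "probe_1": ("GENE_A", "Example gene A", ["GO:0000001", "GO:0000002"]),
    "probe_2": ("GENE_A", "Example gene A", ["GO:0000001"]),
    "probe_3": ("GENE_B", "Example gene B", ["GO:0000001"]),
}
gene_sets = build_gene_sets(annotations)
```

In this example only the first GO id survives, because the second is annotated to a single gene even though it appears on a probe of a gene with two probes.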
You can use annotation files we provide. You can also use files you have created in the appropriate format. The files can be gzipped or zipped. There is no need to unpack them.
You inform ermineJ of which format you are using with the pull-down menu used at startup. As of ErmineJ 3.0, you can get annotation files from within the software by clicking on the “Get from Gemma” button on the startup screen (this relies on fetching data from Gemma so requires an internet connection). See also the startup screen documentation.
The list of platforms provided includes those in Gemma which have annotations available. Any files for platforms not from Gemma will be found here.
Using files we provide
In ermineJ, we refer to these as “ermineJ format”, but the files are very simple and useful in other contexts.
You can download annotation files from Gemma here. Note: We recommend using the annotation files that have “No parents”. For more explanation, see the note on “parents” below.
We provide annotation files, sourced from Gemma, for some popular (and many not-so-popular) microarray platforms. These files contain the probe (or probe set) identifiers, the gene symbols and names, and GO membership information. For our current annotations, this means that the Gene Ontology terms associated with each gene are listed. The ‘parent’ terms of each listed term are also implicitly included, so that genes associated with very specific terms are also counted in the less specific categories.
If you are not using an expression array platform, we provide some “generic” annotation files that are keyed to official gene symbols.
For species or platforms we don’t support, ask us for assistance or set it up yourself. The files are not hard to prepare if you have Gene Ontology (or other gene set descriptor) annotations available.
For species we support, you can often create an annotation file for a new platform by pulling information out of our existing files with a simple Perl script.
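As a rough illustration of that approach (written in Python rather than Perl), the sketch below builds rows for a new platform from the rows of an existing annotation file, looking genes up by symbol. It assumes you already have a mapping from the new platform's probe IDs to gene symbols. The function names, the header labels, and all identifiers are hypothetical; the four-column layout follows the format description in this document.

```python
def build_annotation_rows(probe_to_symbol, existing_rows):
    """Build rows for a new platform from rows of an existing annotation file.

    probe_to_symbol: {new_probe_id: gene_symbol} for the new platform (assumed input).
    existing_rows: iterable of (probe, symbol, name, go_field) tuples.
    """
    # Index the existing annotations by gene symbol; the first row seen wins.
    by_symbol = {}
    for _probe, symbol, name, go_field in existing_rows:
        by_symbol.setdefault(symbol, (name, go_field))

    new_rows = []
    for probe, symbol in probe_to_symbol.items():
        # Genes missing from the existing file get a blank name and GO field.
        name, go_field = by_symbol.get(symbol, ("", ""))
        new_rows.append((probe, symbol, name, go_field))
    return new_rows


def format_rows(rows, header=("ProbeID", "GeneSymbol", "GeneName", "GOTerms")):
    """Render rows as tab-delimited text with a one-line header."""
    lines = ["\t".join(header)]
    lines.extend("\t".join(row) for row in rows)
    return "\n".join(lines) + "\n"


existing = [
    ("old_1", "GENE_A", "Example gene A", "GO:0000001|GO:0000002"),
    ("old_2", "GENE_B", "Example gene B", ""),
]
new_rows = build_annotation_rows({"new_1": "GENE_A", "new_2": "GENE_C"}, existing)
text = format_rows(new_rows)
```

Here `new_2` maps to a gene that the existing file does not cover, so its row is written with an empty description and no GO ids.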
Description of the format
- The file is tab-delimited text. Comma-delimited files or Excel spreadsheets (for example) are not supported.
- There is a one-line header included in the file for readability.
- The first column contains the probe identifier. The probe IDs must exactly match the ones you provide in your Gene score file. Any probes not having an entry will be ignored. If you are not using probes, this will probably contain gene symbols. The main requirement here is that it matches the identifiers you provide in your input data files.
- The second column usually contains a gene symbol. This should not be blank. If the gene name is not known, a sequence identifier or arbitrary code can be used instead. This is used to determine whether a gene has more than one probe, as well as providing information for display purposes.
- The third column contains the gene name (or description). This can be blank. It is only used for display purposes.
- The fourth column contains a delimited list of GO identifiers. These include the “GO:” prefix. Thus they read “GO:0494494” and not “494494”. The ids within this field can be delimited by spaces, commas, or pipe (‘|’) symbols. This field can be blank if there are no GO annotations (or if you aren’t using GO).
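If you want to inspect an annotation file in your own scripts, the format above is straightforward to read. The following Python sketch is one possible reader, not the parser ErmineJ uses; the function name `parse_annotation`, the header labels, and the identifiers in the example are placeholders chosen for illustration.

```python
import csv
import io
import re

# The GO column may be delimited by spaces, commas, or pipes.
_GO_DELIMITERS = re.compile(r"[\s,|]+")


def parse_annotation(handle):
    """Return {probe_id: (gene_symbol, gene_name, [go_ids])} from an open text file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the one-line header
    annotations = {}
    for row in reader:
        if not row or not row[0].strip():
            continue  # ignore blank lines
        # Pad short rows so trailing optional columns may be missing.
        probe, symbol, name, go_field = (row + ["", "", ""])[:4]
        go_ids = [t for t in _GO_DELIMITERS.split(go_field.strip()) if t]
        annotations[probe.strip()] = (symbol.strip(), name.strip(), go_ids)
    return annotations


# Placeholder content with a header, mixed delimiters, and blank optional fields.
example = (
    "ProbeID\tGeneSymbol\tGeneName\tGOTerms\n"
    "probe_1\tGENE_A\tExample gene A\tGO:0000001|GO:0000002\n"
    "probe_2\tGENE_A\t\tGO:0000001, GO:0000003\n"
    "probe_3\tGENE_B\tExample gene B\t\n"
)
annotations = parse_annotation(io.StringIO(example))
```

The same function works on a real file opened in text mode, for example with `open(path, newline="")`.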
Using files you create
Annotation files that you created can be used so long as they adhere to one of the accepted formats. There are a few things to consider:
- The probe IDs must exactly match the ones you provide in your input data files (gene scores and raw data). Any probes not having an entry will be ignored.
- The gene symbols are used internally by the software to decide which genes are present on the array more than once. Therefore, if two probes refer to the same gene, make sure the symbol you use is the same for both probes. (It doesn’t actually matter what the symbol is).
- The gene names or descriptions are optional, and blank values will just show up as “No description” or something similar.
- In the ermineJ format, the GO ids must be in “long” format (with the GO: prefix). The GO terms themselves should be omitted. The parents of all terms listed are automatically included in the analysis (subject to other constraints such as the maximum Gene Set size you set in the analysis), so there is no need to list these explicitly.
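Before loading a file you created, it can help to check it against the points above. The Python sketch below is a hypothetical helper, not part of ErmineJ: `check_annotations`, its message wording, and the in-memory layout (probe ID mapped to symbol, name, and GO ids) are assumptions for the example. It reports probes from your input data that have no annotation entry, blank gene symbols, and GO ids that are not in “long” format.

```python
import re

# "Long" format: the GO: prefix followed by digits.
GO_LONG = re.compile(r"^GO:\d+$")


def check_annotations(annotations, data_probe_ids):
    """Return a list of human-readable problems found in the annotations.

    annotations: {probe_id: (gene_symbol, gene_name, [go_ids])} (assumed layout).
    data_probe_ids: probe IDs used in your gene score or raw data files.
    """
    problems = []
    for probe in data_probe_ids:
        if probe not in annotations:
            problems.append(f"{probe}: no annotation entry; it will be ignored")
    for probe, (symbol, _name, go_ids) in annotations.items():
        if not symbol:
            problems.append(f"{probe}: blank gene symbol")
        for go_id in go_ids:
            if not GO_LONG.match(go_id):
                problems.append(f"{probe}: GO id {go_id!r} is not in long format")
    return problems


# Placeholder data: probe_9 is unannotated, probe_2 has two problems.
annotations = {
    "probe_1": ("GENE_A", "", ["GO:0000001"]),
    "probe_2": ("", "", ["0000002"]),
}
problems = check_annotations(annotations, ["probe_1", "probe_9"])
```

An empty result means none of these particular checks failed; it does not guarantee that ErmineJ will accept the file.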
Note on using annotation files with “parents”
In the “No parents” version of the Gemma annotation files, only “direct” annotations to the genes are listed, omitting those inferred via parent-child relations in the Gene Ontology. ErmineJ rapidly computes the parent-associated terms at startup. Using the “No parents” versions has two advantages. First, it results in faster startup times for ermineJ. Second, it ensures that the direct annotations are interpreted correctly with respect to the version of the Gene Ontology you loaded at startup. Changes to the GO structure are common.
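To make the idea of inferring parent terms concrete, here is a minimal Python sketch of how direct annotations can be expanded once a term hierarchy is available. It is an illustration only and does not describe how ErmineJ implements this step. The function name `with_parents`, the `parents` mapping, and the term names are all made up for the example; in practice the hierarchy would come from the Gene Ontology version you loaded.

```python
def with_parents(direct_terms, parents):
    """Return the direct terms plus every ancestor reachable through `parents`.

    parents: {term: [parent terms]}, a hypothetical stand-in for the GO structure.
    """
    seen = set()
    stack = list(direct_terms)
    while stack:
        term = stack.pop()
        if term in seen:
            continue  # a term can be reached along more than one path
        seen.add(term)
        stack.extend(parents.get(term, ()))
    return seen


# Made-up hierarchy: leaf -> mid -> root, and other -> root.
toy_parents = {
    "term_leaf": ["term_mid"],
    "term_mid": ["term_root"],
    "term_other": ["term_root"],
}
expanded = with_parents(["term_leaf"], toy_parents)
```

Because the expansion depends on the hierarchy passed in, the same direct annotations yield different inferred terms when the GO structure changes, which is why storing only direct annotations keeps a file consistent with whichever GO version is loaded.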