Gene set files
User-defined gene sets can be created from within ErmineJ, or imported using several different formats. This way you can define gene sets yourself or use other schemes such as KEGG by providing an appropriate file for ErmineJ.
Knowing the formats lets you define gene sets outside of ErmineJ. When ErmineJ starts up, it looks for these files in a predefined location (see below) and loads them.
Gene set files created by ErmineJ are saved in the directory ermineJ.data/genesets (e.g., C:/Documents and Settings/[your user name]/ermineJ.data/genesets). You should place your own “handmade” gene set files in this location so they are automatically visible to the software.
Note! If you create a gene set when using one platform, and then switch to another next time you run ErmineJ, ErmineJ will try to load your old gene sets. If any probes on the previous design match the identifiers on the current one, the gene set will be loaded to the extent it can. We may change this in a future version of ErmineJ, to provide more species and platform information in each file. Let us know if this is important to you.
Note: Gene sets that have only one gene will not be shown. In addition, by default the user interface hides sets that are empty, even though the software knows about them (such as GO terms). You can reveal these with the context (pop-up) menu in the table or tree view.
Option 1: ErmineJ-native format
This simple format lets you store one or more gene sets in a single file. The members of each set are identified either by probe (handy for mapping to expression arrays) or by gene symbols that match the ones in your annotation file. Here is a sample:
# this is a comment
probe
MyGeneSet
Genes I Like
36495_at
271_s_at
37983_at
34071_at
128_at
129_g_at
206_at
38466_at
32017_at
346_s_at
32018_at
====
probe
MySecondGeneSet
More genes I like
37983_at
34071_at
128_at
129_g_at
206_at
38466_at
32017_at
346_s_at
(or download the sample as a file)
The full description of the format is as follows.
- The file is plain text (ASCII)
- There can be more than one gene set defined, demarcated by “===” on a line by itself.
- Lines beginning with “#” are ignored.
- Blank lines are ignored.
Within each gene set definition, you must declare at least four non-blank, non-comment lines:
- The first line describes the type of identifier in the file and is either “probe” or “gene”. With “probe”, the identifiers must match those in the first column of your annotation file. With “gene”, they should match the symbols in the second column of your annotation file.
- The second line is the unique ID or name of the gene set. This name must be distinct from other groups used in the session (including GO terms).
- The third line is a longer description of the gene set. There is no limit to the length of this description but in practice it should be just a few words.
- The fourth and subsequent lines are the identifiers (probe ids or official gene names).
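
The rules above can be sketched as a small reader, which may be handy for checking a handmade file before placing it in the genesets directory. This is a minimal sketch, not ErmineJ's own parser: the function name and the returned dictionary keys are hypothetical, and it assumes that any line made up only of three or more “=” characters is a separator (the description says “===” while the sample uses “====”).

```python
def parse_native_gene_sets(text):
    """Parse ErmineJ-native gene set text into a list of dicts.

    Assumption (not confirmed by the documentation): a separator is any
    line consisting solely of three or more '=' characters.
    """
    gene_sets = []
    block = []  # non-blank, non-comment lines of the current gene set

    def flush():
        if not block:
            return
        if len(block) < 4:
            raise ValueError(
                f"a gene set needs at least 4 lines, got {len(block)}: {block}"
            )
        id_type, name, description, *members = block
        if id_type not in ("probe", "gene"):
            raise ValueError(f"first line must be 'probe' or 'gene', got {id_type!r}")
        gene_sets.append(
            {
                "id_type": id_type,
                "name": name,
                "description": description,
                "members": members,
            }
        )
        block.clear()

    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        if len(line) >= 3 and set(line) == {"="}:
            flush()  # separator: finish the current gene set
            continue
        block.append(line)
    flush()  # the last gene set has no trailing separator
    return gene_sets


if __name__ == "__main__":
    sample = """# this is a comment
probe
MyGeneSet
Genes I Like
36495_at
271_s_at
====
probe
MySecondGeneSet
More genes I like
37983_at
"""
    for gene_set in parse_native_gene_sets(sample):
        print(gene_set["name"], gene_set["id_type"], len(gene_set["members"]))
```

The sketch does not check that names are unique within a session or that identifiers exist in your annotation file; ErmineJ itself decides that when it loads the file.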
Option 2: Tab-delimited format
This is a slightly simpler alternative to the native format, with the limitation that only gene symbols are supported. Your annotation file will be used to work out which probes are relevant. Here is a sample. Like the files described for Option 1, these files should be placed in your ermineJ.data/genesets directory, where they will automatically be detected and imported by the software.
- The file is tab-delimited ASCII text
- There is one gene set defined per line
- On each line, the fields are:
  - A unique gene set identifier
  - A description (can be blank, but cannot be omitted)
  - The remaining fields are interpreted as gene symbols (keyed to the second column of your annotation file).
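
As an illustration of the field layout just described, here is a minimal sketch for writing and reading one line of this format. The function names are hypothetical and this is not ErmineJ code; it only shows how a blank description still occupies the second field.

```python
def format_gene_set_line(set_id, description, symbols):
    """Build one tab-delimited line: identifier, description, then gene symbols."""
    fields = [set_id, description, *symbols]
    for field in fields:
        if "\t" in field or "\n" in field:
            raise ValueError(f"field may not contain a tab or newline: {field!r}")
    return "\t".join(fields)


def parse_gene_set_line(line):
    """Split one line into (identifier, description, list of gene symbols)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 2:
        raise ValueError("a line needs at least an identifier and a description field")
    set_id, description, *symbols = fields
    # Drop empty fields, e.g. those produced by trailing tabs.
    return set_id, description, [s for s in symbols if s]


if __name__ == "__main__":
    # The description is blank here, but its field is still present,
    # which shows up as two consecutive tabs in the output.
    line = format_gene_set_line("MyGeneSet", "", ["ALOX15", "ALOX12"])
    print(repr(line))
    print(parse_gene_set_line(line))
```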
Option 3: Import files containing lists of genes using the “Define new gene set” menu item
This method has the benefit of requiring a very simple format, but you must load the files one at a time using ErmineJ’s graphical interface. (If this is a pain, a simple Perl or Python script can convert the lists into one of the other formats.)
The file in this case is just a list of genes, with one on each line. The names must be the gene symbols that are used in your gene annotation file. Other symbols will be ignored. Here’s an example with just three genes:
alox12b
ALOX15
alox12
A full description of the file format is:
- Each file describes just one gene set.
- The file is plain text (ASCII)
- Each line contains the official symbol of one gene
- Capitalization is ignored
- Blank lines and symbols not found in the current array design are ignored
When the file is loaded, the list of genes is converted to a list of probes. You will be given the chance to edit the gene list and give it a name before finalizing it.
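
If loading many lists one at a time is impractical, the conversion script mentioned above might look like the following sketch, which turns simple one-gene-per-line lists into a single file with one gene set per line: a unique identifier, a description, then the gene symbols, all separated by tabs. The function names are hypothetical, and using the file name as the gene set identifier and "Imported from ..." as the description are illustrative choices only. Symbols are written unchanged, since this page states that capitalization is ignored only for lists imported through the menu.

```python
from pathlib import Path


def gene_list_to_tab_line(set_id, description, list_text):
    """Convert the text of a one-gene-per-line list into one tab-delimited line."""
    # Blank lines are skipped, as in the simple list format.
    symbols = [line.strip() for line in list_text.splitlines() if line.strip()]
    if not symbols:
        raise ValueError(f"gene list for {set_id!r} is empty")
    return "\t".join([set_id, description, *symbols])


def convert_gene_lists(list_paths, out_path):
    """Write one line per input list file to out_path; return the number written."""
    count = 0
    with open(out_path, "w", encoding="ascii") as out:
        for path in map(Path, list_paths):
            line = gene_list_to_tab_line(
                path.stem,                      # hypothetical choice of identifier
                f"Imported from {path.name}",   # hypothetical description
                path.read_text(encoding="ascii"),
            )
            out.write(line + "\n")
            count += 1
    return count


if __name__ == "__main__":
    print(repr(gene_list_to_tab_line("lipox", "Lipoxygenases", "alox12b\nALOX15\nalox12\n")))
```

The resulting file would then go in your ermineJ.data/genesets directory. Note that this route skips the interactive step in which you can review and edit the gene list before it is finalized.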