Input file formats: gene scores
A “gene score” is any value that is assigned to the genes in an experiment and that represents some measure of “quality” or “interest”. Examples might be a t-test p-value or a fold change. These scores must be computed separately and then supplied to the software for analysis.
! Make sure your gene scores are on a sensible scale
- If your gene scores are raw p-values (the most common case), be sure to either negative-log-transform your values yourself or use the -log options in the software.
- If your gene scores are “fold-change”, you might want to use the absolute value of log(fold change).
- If your gene scores are NOT raw p-values, then make sure you select the right combination of settings to get your data interpreted correctly.
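As a rough illustration of the first two points, the transformations can be done before the score file is written. This is only a sketch: the function names are hypothetical, and the choice of base 10 for p-values and base 2 for fold changes is an assumption, since the text above does not specify a base. If you transform the values yourself, do not also enable the -log options in the software.

```python
import math


def neg_log10_p(p):
    """Return -log10(p) for a raw p-value in (0, 1]. Base 10 is an assumption."""
    if not 0.0 < p <= 1.0:
        raise ValueError(f"p-value out of range (0, 1]: {p}")
    return -math.log10(p)


def abs_log2_fold_change(fold_change):
    """Return |log2(fold change)| for a positive ratio. Base 2 is an assumption."""
    if fold_change <= 0.0:
        raise ValueError(f"fold change must be positive: {fold_change}")
    return abs(math.log2(fold_change))


# Made-up values, for illustration only.
raw_p_values = {"probe_A": 0.001, "probe_B": 0.5}
fold_changes = {"probe_A": 4.0, "probe_B": 0.5}

p_scores = {probe: neg_log10_p(p) for probe, p in raw_p_values.items()}
fc_scores = {probe: abs_log2_fold_change(fc) for probe, fc in fold_changes.items()}
```

After either transformation, larger scores correspond to "more interesting" genes: a smaller p-value gives a larger score, and a two-fold decrease scores the same as a two-fold increase.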
! You need a complete set of gene scores, not just the “selected gene list”
Unlike some software packages, ErmineJ requires a complete set of gene scores, rather than just those for the “selected genes”. Thus, if your assay has 12,000 genes (or probes), you will provide 12,000 gene scores. A caveat to this is that if you have filtered your data to exclude “unexpressed” genes (for example), you might only have gene scores for some genes/probes. This is perfectly fine and the analysis will be based only on the probes for which you provide data. However, the analysis is really only valid if you include all the probes that you performed your gene selection analysis on.
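One way to check the completeness requirement described above is to compare the identifiers in your score file with the identifiers you actually ran your gene selection analysis on. The sketch below is not part of ErmineJ; `coverage_report` and the probe names are hypothetical and only show the idea.

```python
def coverage_report(assayed_ids, scored_ids):
    """Compare the probes that were analysed with the probes that have scores."""
    assayed = set(assayed_ids)
    scored = set(scored_ids)
    return {
        # Analysed probes that have no score in the file.
        "missing": sorted(assayed - scored),
        # Scored probes that were not part of the analysis.
        "unexpected": sorted(scored - assayed),
    }


# Made-up identifiers, for illustration only.
report = coverage_report(
    assayed_ids=["probe_A", "probe_B", "probe_C"],
    scored_ids=["probe_A", "probe_C", "probe_Z"],
)
```

An empty "missing" list suggests that every probe you analysed has a score; anything listed there is a probe whose score was left out of the file.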
As of ErmineJ 3, when using the ‘ORA’ method you have the option to use a simple “hit list” of genes (a “quick list”), rather than preparing a score file yourself. Caution: If you use this feature, the “non-hits” will be all the rest of the genes listed in your annotation file. That might not be appropriate if the annotation file includes genes that were not assayed in your experiment. This is most likely to be a problem if your annotation file is a list of all the genes in the genome.
The ‘correlation’ method does not need gene scores.
! Make sure your gene scores are in the right format
The gene score file is a simple tab-delimited text file, minimally having just two columns (like this example, which is from the Affymetrix HG-U95A microarray design). For a simple case, please use the following rules:
- A one-line header is expected (if you don’t include the header, the first line of data will be skipped). It doesn’t matter what the header says; it can even be a blank line.
- The first column contains the unique probe or gene identifiers, one identifier per row. Duplicate identifiers will be ignored (currently, the first instance encountered is used). If you have more than one score per gene, filter your file so there is just one, or provide a unique identifier such as A.1 and A.2 (for multiple occurrences of gene A). How to combine such scores is a setting you make during your analysis.
- The second column contains the scores. Non-numeric values such as “NaN” or “#NUM!” are interpreted as zero. If you want to avoid this you should remove those rows from your file entirely.
- Files with more than two columns are also fine.
- The file CANNOT be an Excel spreadsheet. Use “Save as… text” in Excel. For detailed instructions, see this page.
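The rules above can be applied with a short script before loading the file. The following is a minimal sketch, not an ErmineJ tool: `clean_scores`, `render_score_file`, the header text, and the probe identifiers are all hypothetical. It drops rows whose score is not a finite number (so that they are not read as zero), keeps only the first remaining row for each identifier, and produces tab-delimited text with a one-line header.

```python
import csv
import io
import math


def clean_scores(rows):
    """Keep the first row per identifier that has a finite numeric score."""
    seen = set()
    cleaned = []
    for identifier, raw_score in rows:
        identifier = identifier.strip()
        if not identifier or identifier in seen:
            continue  # blank or duplicate identifier
        try:
            score = float(raw_score)
        except (TypeError, ValueError):
            continue  # e.g. "#NUM!" or an empty cell
        if not math.isfinite(score):
            continue  # e.g. "NaN" or "inf"
        seen.add(identifier)
        cleaned.append((identifier, score))
    return cleaned


def render_score_file(rows, header=("Probe", "Score")):
    """Return the rows as tab-delimited text with a one-line header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up rows, for illustration only.
raw_rows = [
    ("probe_A", "0.0012"),
    ("probe_B", "NaN"),
    ("probe_C", "#NUM!"),
    ("probe_A", "0.9"),  # duplicate identifier: dropped
    ("probe_D", "0.25"),
]
cleaned = clean_scores(raw_rows)
text = render_score_file(cleaned)
```

To save the result, write `text` to a plain-text file, for example with `open("scores.txt", "w", encoding="utf-8")`. The file name is arbitrary.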
The table below shows the basic idea. Remember that the header can contain anything; it doesn’t have to follow this example.