I need help using ErmineJ, what do I do? Please go here for instructions on how to ask us for help. If you are having problems, be sure to provide the information requested there, such as your log file and information about your computer system.
Is ErmineJ only for expression data? No. ErmineJ can be used to analyze rankings of genes from any experiment. While the original use of ErmineJ (back in 2003 or so) was for analyzing expression data from microarrays, the software is very generic. You only need a list of genes that you tested and their “scores”, and some kind of organization of the genes into “gene sets”. You don’t need to use GO (though GO support is built in). If you are creative you can use it to analyze non-biological data, as the statistical methods simply work with lists of things organized into groups. If you need help figuring out how to make ermineJ work for your situation, let us know.
Can I use ErmineJ with RNAseq data? Yes. See the previous question on “Is ErmineJ only for expression data”. The relevant gene annotations can be found for some organisms by using the “Generic” platforms in Gemma. The list of platforms can be accessed from the GUI.
Can I use ErmineJ with something besides GO? Yes. ErmineJ allows you to define your own gene sets. This allows you to use schemes like KEGG if you can obtain an appropriate annotation file (we do not provide them).
I just have genes, not probe ids, can I use ermineJ? Sure. You can use the “Generic” gene list annotations instead of a microarray platform annotation file. We provide files keyed by gene symbols, Ensembl and Entrez IDs. The full list of platforms is here, but you can access them directly from the software.
Do I input the list of gene scores for the significantly changed genes, or all the genes in my experiment (e.g., microarray or genome)? All of them (but see next question). The resampling, precision-recall and ROC methods use the gene scores for all the genes. The built-in overrepresentation analysis will use the threshold you set in the gene score threshold parameter to select genes, unless you use a ‘quick list’. If your data have been filtered, the output panel does not list the actual number of genes (or probes) in your gene sets. See this page for a clarification.
I don’t have scores for my genes, I just have a ‘hit list’, can I use the software? Yes. Before ErmineJ 3.0, this required you to make up scores for all the genes such that the ones on your hit list are selected by the threshold you set. For example, your hit list genes would have the score 1 and all the other genes the score -1, and you would set the threshold to 0. This is still a recommended approach because it lets you be explicit about which genes you consider the “non-hits”. However, by popular request, ErmineJ introduces a new feature, “quick lists”, that simplifies the procedure. See the documentation on ORA for details.
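The made-up-scores approach described above can be sketched in a few lines of Python. This is only an illustration: the gene names, the function name and the tab-delimited two-column layout are assumptions, so check the input file documentation for the exact format ErmineJ expects.

```python
def hit_list_to_scores(all_genes, hit_list):
    """Give every tested gene a score: 1 for hits, -1 for all other genes."""
    hits = set(hit_list)
    return [(gene, 1 if gene in hits else -1) for gene in all_genes]


def to_tab_delimited(scored_genes):
    """Render (gene, score) pairs as tab-delimited text lines."""
    return [f"{gene}\t{score}" for gene, score in scored_genes]


# Hypothetical example data: every gene you tested, and the subset you call hits.
all_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
hit_list = ["GENE_B", "GENE_D"]

scored = hit_list_to_scores(all_genes, hit_list)
lines = to_tab_delimited(scored)
```

With a threshold of 0 and larger scores treated as better, only the genes scored 1 are selected, and the genes scored -1 are explicitly the “non-hits”.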
What are the system requirements for running ErmineJ? It should run on any computer that can run Java 8 (including Mac, Windows and Linux), assuming you have a decent amount of RAM (8 GB should be fine). See the manual for more information, including advice on avoiding memory problems.
Which gene scoring method should I use? If you are looking at differential expression (a common case), we recommend using the gene score resampling, precision recall, or the ROC methods. ORA is most appropriate when you have a natural threshold for gene selection, such as “on chromosome 2”. Otherwise, changing the threshold can alter the results, sometimes dramatically. If you are interested in clustering genes, the correlation method might be appropriate. There is more information on choosing methods on the help pages for each method.
Should I run all methods and combine the results? We don’t really recommend running all the methods and looking for overlaps, though running multiple methods and comparing the results might be useful. Familiarize yourself with the different methods to be aware of their properties. It is particularly important to recognize that correlation analysis works very differently from the other methods, and ORA analysis results can be sensitive to the threshold used.
How should I deal with up-regulated vs. down-regulated genes? I want them analyzed separately. To make your input p-values “aware” of the direction of change, you typically need to use a one-tailed test to generate your p-values. ErmineJ doesn’t handle that internally; you have to provide your own p-values. So there would be one list of p-values for up-regulation and one for down-regulation, each using one tail of the distribution from which the p-values are determined. Alternatively, you can use the fold changes. Our recommendation is that you use p-values, especially if that is how you are ranking genes for other parts of your project. Some people do use fold change, and you have to decide which will best answer your question: if you think fold change is a better representation of the ranking of genes, then use fold change.
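If you already have two-sided p-values and the direction of change for each gene, one common way to obtain the two one-tailed lists is sketched below. It assumes a symmetric test statistic (such as a t statistic); the function name is hypothetical, and your own statistics package may provide one-tailed p-values directly.

```python
def one_tailed_p_values(two_sided_p, effect_sign):
    """Split a two-sided p-value into (p_up, p_down).

    Assumes a symmetric test statistic. effect_sign > 0 means the gene went
    up, effect_sign < 0 means it went down.
    """
    half = two_sided_p / 2.0
    if effect_sign > 0:
        return half, 1.0 - half
    if effect_sign < 0:
        return 1.0 - half, half
    # No observed change: neither direction is favoured.
    return 0.5, 0.5


# Hypothetical gene: two-sided p = 0.02, expression went up.
p_up, p_down = one_tailed_p_values(0.02, +1)
```

The `p_up` values for all genes would go in one gene score file and the `p_down` values in another, and each file is then analyzed separately.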
The messages at the bottom of the window go by too fast to read. You can read all messages (and other information) in the log file. You can view the log file from within ermineJ by opening the “help” menu and choosing “view log”. The log file for the current or last run of ermineJ is stored on your hard drive in a file called ermineJ.log, which should be in your home directory.
All my gene set p-values are coming out as zero. What is wrong? This can happen if you have not set the parameters for the analysis correctly. In particular, if your input file contains raw p-values, make sure the “larger scores are better” box is unchecked in the last analysis wizard step.
What is the meaning of the “score” for each gene set? During analysis, each gene set is given a score which, along with the size of the gene set, is used to determine the statistical significance of the gene set. The meaning of this score depends on the type of analysis and is explained here.
How do I know if a gene set p value is ‘significant’? One has to consider the problem of multiple testing, but because the gene sets often overlap, simple methods will be too conservative.
The software implements a false discovery rate (FDR) algorithm, and p values are shown in different colors according to the level of false discovery rate (see this page). However, this method still assumes the gene sets are independent and will lead to a slightly conservative estimate of the FDR.
If you want to use Bonferroni correction, you can either use the command-line interface or compute the corrected values yourself. To compute the Bonferroni-corrected p values, multiply your gene set p values by the number of gene sets analyzed, capping the result at 1. Then you can use a threshold of 0.05 (which would not otherwise be reasonable).
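Computing the Bonferroni correction yourself takes only a few lines. The sketch below assumes the list passed in contains one p value for every gene set that was analyzed; the function name and example values are made up for illustration.

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests, capping the result at 1."""
    n = len(p_values)  # number of gene sets analyzed
    return [min(1.0, p * n) for p in p_values]


# Hypothetical p values for three gene sets.
corrected = bonferroni([0.001, 0.02, 0.5])
significant = [p < 0.05 for p in corrected]
```

Gene sets that were not run should not be counted, since the multiplier is the number of gene sets actually analyzed.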
What is multifunctionality? This refers to the fact that some genes are in more gene sets (e.g., GO groups) than others. This has a substantial effect on the type of analysis ErmineJ does. This topic is covered in more detail in the manual.
My data set has no “significant” genes, but when I run ermineJ with resampling I get significant gene sets. How do I interpret these results? Carefully. ErmineJ’s resampling methods are based on analyzing your gene expression data in isolation, without reference to any theoretical background model. It is capable of identifying the most interesting gene sets in your data, but this does not necessarily mean those gene sets are interesting in general. We recommend using such findings as an exploratory method for helping identify potential genes of interest, but because no gene is statistically significant on its own, the results for any given gene would have to be dealt with on a case-by-case basis using considerations other than statistical significance.
For a gene set to be considered statistically significant outside the context of your data set, one reasonable null hypothesis would yield a distribution of p-values in the gene set that is uniform (though this assumes the genes are independent). A background distribution of scores is displayed in the details view of each class. When the blue line (your gene scores) is far to the right of the expected distribution under the null (grey line), you can have higher confidence in the importance of that gene set. The reason we do not use this method exclusively is that in data sets where many genes have changed, every gene set will appear significant. Reasonable interpretation of the results requires consideration of both types of analyses.
A related issue is that gene sets whose members have correlated transcription in all conditions (that is, not relevant to your study) can sometimes yield misleading results. The remedy to this problem is to do resampling over the samples, but ermineJ currently does not support this. In the meantime, when you have weak or no clear differential expression effect on the genes in a gene set, and the genes in that set are highly correlated, you should view the results carefully. Typically this affects gene sets like the ribosomal proteins, histones or proteasome, whose members are always highly correlated, at least in RNA transcript data. You can test for gene sets like this using correlation scoring, though this seems likely to give helpful results only if the differential expression effects in your experiment are weak.
Data sets with weak signals can also be more heavily influenced by multifunctionality, which tends to mask subtle signals in favor of generic ones.
How do I find out which genes are in a gene set without doing an analysis? Several ways. You can double-click on a gene set, or use the “View/Modify Gene set” tools, available from the main ‘analysis’ menu, or use the context-specific pop-up menu on the gene set list. See this page.
When I open the output file in Notepad it looks like a mess of letters and numbers. How do I view this file? The output might not look right when viewed in a text editor or a web browser. It will look correct in Excel if you open it as a tab delimited file.
My excel spreadsheet with my gene scores isn’t being accepted as input. You cannot input Excel spreadsheets; you have to save them as text first. See this page for details.
I don’t see an annotation file for my microarray or genome. What should I do? Through the Gemma system, we provide annotations for hundreds of microarray platforms and several popular genomes. Feel free to let us know if you need help with annotation files.
Why are the numbers of genes in a group in the table not matching my annotation file? Because of propagation of annotations in GO. The GO hierarchy requires that a gene annotated with a “low-level” term (say, a leaf term) inherits all of the parent annotations in the hierarchy, all the way to the root. This causes an expansion in the number of annotations per gene. ErmineJ always computes the propagation during the loading of the annotation file. If your file is already “propagated” then nothing should change (unless the version of GO you used is different than the one you load into ErmineJ). The annotation files we offer that are labeled “noparents” are the unpropagated ones, and they will tend to load faster. Note: We occasionally hear doubts about whether propagation of annotations is the right thing to do, but those harboring such doubts are misinformed.
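To see why propagation increases the counts, consider the toy example below. It is a sketch of the general idea only, not ErmineJ's code: the term names and the `parents` mapping are invented, and a real GO graph would be loaded from an ontology file.

```python
def propagate(direct_terms, parents):
    """Return the direct terms plus every ancestor reachable via `parents`."""
    seen = set()
    stack = list(direct_terms)
    while stack:
        term = stack.pop()
        if term in seen:
            continue  # already visited through another path in the hierarchy
        seen.add(term)
        stack.extend(parents.get(term, ()))
    return seen


# Hypothetical hierarchy: one leaf term with two parents that share a root.
parents = {
    "leaf": ["mid_1", "mid_2"],
    "mid_1": ["root"],
    "mid_2": ["root"],
    "root": [],
}

# A gene annotated only with "leaf" ends up in four gene sets.
all_terms = propagate({"leaf"}, parents)
```

Running the function on an already propagated set of terms returns the same set, which is why loading a propagated file changes nothing as long as the GO version matches.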
Why are some groups “Not run” in the results column? Because they are either too small, too big, or were in an excluded part of the GO hierarchy. Check your analysis settings.
When doing an analysis, why did I see a warning “Attempt to take the log of a non-positive value”? If your gene score file contains negative or zero values, and you check the “negative log-transform gene scores” box, you will see this error. If this concerns you, you should clean up your gene score file. If not, ermineJ sets these values to a small number (10^-15).
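The behavior described above can be illustrated with the following sketch. It is not ErmineJ's implementation: the function name is hypothetical and the use of base-10 logarithms is an assumption made for the example.

```python
import math

SMALL_VALUE = 1e-15  # the replacement value mentioned above


def negative_log_transform(score, floor=SMALL_VALUE):
    """Return -log10(score), substituting `floor` for non-positive scores.

    Base 10 is an assumption made for this illustration.
    """
    if score <= 0:
        score = floor
    return -math.log10(score)


# A p-value of 0 cannot be log-transformed directly, so it is replaced first.
transformed = [negative_log_transform(p) for p in (0.05, 0.0, -1.0)]
```

Note that a zero or negative score ends up with a very large transformed value, which is why it is worth checking whether such values in your file are intentional.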
When doing an analysis, why did I see a warning “Some probes in your gene score file don’t match the ones in the annotation file”? This happens if your gene annotation file isn’t the right one for the platform you used (in which case your analysis will not work well). The analysis still proceeds when this happens, but we suggest you check that the annotation file you are using is the correct one for the gene score or data matrix files you are providing.
When doing an analysis, why did I see a warning “Non-numeric gene scores(s) ( ‘#NUM!’ ) found for input file. These are set to an initial value of zero.”, or something similar? This warning appears if the column you selected for your gene score file contains non-numeric data. This can happen if you have missing or invalid values (sometimes appearing as ‘#NUM!’ in Microsoft Excel), but can also happen if you have chosen the wrong column in your data. Because such values are interpreted as zeros, you should be careful that your file contains the data you want. If you want to avoid problems you should remove those rows from your file entirely.
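One way to remove the offending rows before loading the file is sketched below. The row layout (identifier in the first column, score in the second) and the function name are assumptions for the example; adjust `score_column` to match your file.

```python
import math


def split_numeric_rows(rows, score_column=1):
    """Separate rows with a finite numeric score from rows without one."""
    kept, dropped = [], []
    for row in rows:
        try:
            value = float(row[score_column])
        except (ValueError, IndexError):
            dropped.append(row)  # text such as '#NUM!', blanks, or short rows
            continue
        if math.isfinite(value):
            kept.append(row)
        else:
            dropped.append(row)  # 'NaN' and 'inf' parse as floats but are not usable
    return kept, dropped


# Hypothetical rows as they might look after splitting each line on tabs.
rows = [
    ["g1", "0.01"],
    ["g2", "#NUM!"],
    ["g3", ""],
    ["g4", "NaN"],
    ["g5", "2.5e-3"],
]
kept, dropped = split_numeric_rows(rows)
```

Inspecting `dropped` before discarding it is a quick way to spot whether the wrong column was chosen, since in that case nearly every row ends up there.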
The resampling is taking too long, any suggestions? This problem primarily affects the correlation score analysis, which is very computationally intensive.
There are a few ways to speed things up. One is to uncheck the “Always use full resampling” checkbox in the analysis setup wizard. This enables approximations which, while potentially yielding less accurate p-values, greatly increase processing speed. The results you get with this box unchecked will be reasonably similar to what you would get with full resampling.
Another tip is to not look at very large values for the “maximum class size” value. We usually use a value on the order of 200.
Finally, you can reduce the number of iterations. You might try temporarily setting this to 10000 while you experiment with the other settings, and then, once you find a setting you like, you can crank up the iterations for the “final” run (for example to 200000, which still should not take very long for the gene-score based resampling).
Why do my resampling results vary slightly from run to run? This happens either when too few iterations are run, or sometimes when using the approximation methods. To ensure the highest accuracy, set the number of iterations to a larger value (say, 200,000 for gene score resampling) and check the “Always use full resampling” checkbox in the analysis setup wizard.
What happens to probes/genes that don’t have any annotations? They are ignored entirely.
For resampling analysis of gene scores, how is the gene score threshold determined? This is a trick question. There is no threshold used. Instead, all genes in a set contribute to the score for the gene set. For more details on how this works, see this page.
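The idea that every gene in a set contributes, with no threshold, can be illustrated with a simplified resampling sketch. This is not ErmineJ's implementation: it assumes larger scores are better, uses the mean as the gene set score, and the function name and example data are invented.

```python
import random


def resampling_p_value(set_scores, all_scores, iterations=2000, seed=0):
    """Estimate how often a random set of the same size scores at least as well.

    Assumes larger scores are better and summarizes a set by its mean score.
    """
    rng = random.Random(seed)
    size = len(set_scores)
    observed = sum(set_scores) / size
    at_least_as_good = 0
    for _ in range(iterations):
        sample = rng.sample(all_scores, size)  # random "gene set" of equal size
        if sum(sample) / size >= observed:
            at_least_as_good += 1
    return observed, at_least_as_good / iterations


# Hypothetical scores for 100 genes; the gene set holds the three best ones.
all_scores = [float(i) for i in range(100)]
observed, p_value = resampling_p_value([99.0, 98.0, 97.0], all_scores)
```

Because the estimate is a proportion over random draws, it changes slightly with the random seed and becomes more stable as the number of iterations grows, which is the same reason resampling results can vary a little from run to run.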
What happens if there are two probes for one gene? A single score has to be arrived at for the genes. ErmineJ offers two ways to do this. See this page.
Why are the numbers of genes and probes shown in the output panel incorrect when my data set has been filtered? The numbers shown refer to the annotation file. If you hover over a set’s p-value in a Result Run column, you get a tooltip that shows you the actual values used for that analysis. For more information see this page.
What do the different icons mean in the tree view? The icons are used to indicate where to look to find “significant” gene sets. The specific icons are explained here.
What do the different colors mean in the table and tree views? The colors indicate different levels of statistical significance, as explained here.