Gene multifunctionality (or just “multifunctionality” for short) refers to the fact that some genes have more “functions” than others. Recently our lab showed (Gillis and Pavlidis, 2011) how this has a profound effect on the analysis of functional genomics data (see also Gillis and Pavlidis 2012). A manuscript describing the effects on enrichment-type analyses is in press:
Ballouz S., Pavlidis P., Gillis J. Using predictive specificity to determine when gene set analysis is biologically meaningful (2016) Nucleic Acids Research in press.
ErmineJ has several features designed to make it easier to detect and understand the effect of multifunctionality on your analysis. These are detailed in the help sections relevant to the features. In particular see:
- Information about multifunctionality in the main view is described here.
- The multifunctionality diagnostics available from the main window Analysis menu.
- The gene set multifunctionality information in the details windows.
This page provides some background information on how multifunctionality is computed and how to interpret it.
For the purposes of ErmineJ, a “function” is defined by the annotations a gene has. For standard uses of ErmineJ, this means gene ontology terms. Whether those annotations are a “true” representation of the multifunctionality of a gene is of no consequence for the effects on the analysis. For more discussion of this issue see the articles linked above.
Computing the multifunctionality of a gene
While the number of annotations (e.g. GO terms) a gene has is a simple measure of multifunctionality, ErmineJ uses the definition given in Gillis and Pavlidis (2011). This definition takes into account the size of GO groups. Thus a gene that is a member of many small GO groups is more multifunctional than a gene that is a member of the same number of large GO groups. This definition arises from a mathematical proof of the “optimal” ranking of genes (under one reasonable definition of “optimal”), but it makes sense that large, “general” functions be counted less. In practice this doesn’t make that much of a difference and you will notice the computed multifunctionality is correlated with the number of GO terms. If it makes it easier to understand, just think of multifunctionality as the number of GO terms.
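To make the size weighting concrete, here is a minimal sketch in Python. It is not ErmineJ's implementation. It assumes, based on our reading of Gillis and Pavlidis (2011), that each group a gene belongs to contributes 1 / (genes in the group × genes not in the group); check the paper before relying on the exact form. The function name, the gene names and the group identifiers are all made up for illustration.

```python
from collections import defaultdict


def gene_multifunctionality(annotations):
    """Score each gene from a dict mapping gene -> set of group ids.

    Assumed weighting (see lead-in): each group adds 1 / (n_in * n_out),
    where n_in is the group size and n_out the number of genes outside it.
    """
    n_genes = len(annotations)

    # Count how many genes belong to each group.
    group_sizes = defaultdict(int)
    for groups in annotations.values():
        for group in groups:
            group_sizes[group] += 1

    scores = {}
    for gene, groups in annotations.items():
        total = 0.0
        for group in groups:
            n_in = group_sizes[group]
            n_out = n_genes - n_in
            if n_out == 0:
                continue  # a group containing every gene is uninformative
            total += 1.0 / (n_in * n_out)
        scores[gene] = total
    return scores


# Hypothetical toy annotations: S1 and S2 are small groups (2 genes each),
# L1 and L2 are large groups (4 genes each), out of 8 genes in total.
toy = {
    "geneA": {"S1", "S2"},
    "geneB": {"L1", "L2"},
    "geneC": {"L1", "L2"},
    "geneD": {"L1", "L2"},
    "geneE": {"L1", "L2", "S1"},
    "geneF": {"S2"},
    "geneG": set(),
    "geneH": set(),
}

scores = gene_multifunctionality(toy)
for gene in sorted(scores, key=scores.get, reverse=True):
    print(gene, round(scores[gene], 4))
```

In this toy data geneA and geneB both have two annotations, but geneA scores higher because its groups are smaller. Genes without annotations score zero.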
Note: In ErmineJ 3.0, multifunctionality is always computed using all the annotations available in your input annotation file, even if you choose not to use those annotations in your analyses. Thus if you only check the “biological process” aspect, multifunctionality (and corrections for it) are still based on all the annotations. While it might be argued that the overall annotations are a good estimate of multifunctionality, since we treat it as a property of the annotations, it might be sensible to compute multifunctionality based only on the annotations analyzed. This might be especially problematic if you only analyze your “custom” groups and don’t want to use GO at all.
We may change this in a future release. If it bothers you, let us know! In the meantime you could use an annotation file that includes only the aspects you intend to use. We make available “biological-process only” annotation files. To use KEGG or some other scheme as “user-defined” groups, you could provide an annotation file that doesn’t have any annotations (ErmineJ will still require the GO XML file, even if it isn’t used). For instructions on defining custom gene sets, see this page.
Computing the multifunctionality of a gene group
Some gene groups (e.g. GO terms) tend to contain genes which are also members of many other GO groups. Such “correlated” GO terms are particularly problematic. ErmineJ computes the multifunctionality of a gene group as described in Gillis and Pavlidis (2011). The score is the area under the receiver operating characteristic curve obtained by comparing the genes in the group to the ranking provided by the multifunctionality definition given above. Thus, a value of 0.5 indicates that the genes in the group are not biased towards multifunctionality; 1.0 is the maximum bias; and 0.0 (not observed in practice) would mean the genes in the group are highly “unifunctional”. However, note that gene group multifunctionality is not used for correction; it is simply displayed for information purposes.
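The area under the ROC curve described above can be sketched as follows. This is an illustration rather than ErmineJ's code: it uses the standard equivalence between the AUC and the fraction of (member, non-member) gene pairs in which the member has the higher multifunctionality score, with ties counted as one half. The function name and the toy scores are hypothetical.

```python
def group_multifunctionality_auc(gene_scores, group_genes):
    """AUC of a gene group against per-gene multifunctionality scores.

    gene_scores: dict mapping gene -> multifunctionality score.
    group_genes: iterable of genes in the group.
    """
    members = set(group_genes) & set(gene_scores)
    inside = [gene_scores[g] for g in members]
    outside = [s for g, s in gene_scores.items() if g not in members]
    if not inside or not outside:
        raise ValueError("need at least one gene inside and one outside the group")

    # Count pairs where the group member outranks the non-member.
    wins = 0.0
    for s_in in inside:
        for s_out in outside:
            if s_in > s_out:
                wins += 1.0
            elif s_in == s_out:
                wins += 0.5  # ties count as half a win
    return wins / (len(inside) * len(outside))


# Hypothetical scores for four genes.
toy_scores = {"a": 0.9, "b": 0.8, "c": 0.2, "d": 0.1}

auc_high = group_multifunctionality_auc(toy_scores, ["a", "b"])
auc_low = group_multifunctionality_auc(toy_scores, ["c", "d"])
auc_mixed = group_multifunctionality_auc(toy_scores, ["a", "d"])
print(auc_high, auc_low, auc_mixed)
```

The pairwise loop is quadratic, which is fine for a sketch. A rank-based calculation gives the same value and scales better for genome-sized inputs.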
There are two ways that ErmineJ uses multifunctionality to highlight its importance. These ways are demonstrated in more detail on the relevant help pages. Here we address some of the theoretical considerations.
Multifunctionality bias in gene scores – When you provide ErmineJ with a ranked list of genes, presumably that list came from some experimental result. Given any experiment, a priori, we expect multifunctional genes to “show up”. This should be intuitive: a gene that is involved in many functions will be found to be “relevant” in many types of experiments.
The GO terms associated with a gene can be thought of as a rough proxy measure for the true multifunctionality of a gene (since the true multifunctionality is not readily observable, and the apparent multifunctionality is what is actually exposed to enrichment methods). Using the measure of multifunctionality based on GO, ErmineJ will tell you if your input list is “enriched” for multifunctional genes. In our experience, it is not all that common for an input list to be strongly correlated with the multifunctionality ranking, but even very weak correlations can have strong effects. ErmineJ helps you determine if this is likely to be a problem in your data set, and shows you the impact it has on the results.
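One way to check your own ranking for this kind of bias is to correlate your gene scores with the per-gene multifunctionality scores. The sketch below uses a Spearman rank correlation, which is our choice for illustration; this page does not say which statistic ErmineJ itself reports. The function names and the toy values are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks for values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all values are tied")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical inputs, listed in the same gene order:
# experimental gene scores and multifunctionality scores.
experiment_scores = [5.1, 3.2, 0.4, 2.8, 1.0]
multifunctionality = [0.30, 0.05, 0.01, 0.20, 0.02]

rho = spearman(experiment_scores, multifunctionality)
print(round(rho, 3))
```

As noted above, even a weak correlation can matter, so a small value here is not by itself a reason to ignore multifunctionality.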
FAQs about multifunctionality
- Is multifunctionality really a problem? We think that having the results of a functional genomics study skewed by a small number of genes is a bad thing. We are not claiming that multifunctionality is incorrect or “bad”. Genes with many functions really are biologically important, but in computational analyses they “distract” the algorithms from other genes of potential interest.
- Isn’t this just a problem with GO? No. We have observed the same phenomenon with KEGG and other gene grouping schemes. Within GO, it is not specific to the “biological process” aspect, nor to particular evidence codes. The problem is that not every gene has the same number of functions, and there are overlaps. This is presumably at least partly a biological fact (however you want to define “function”). To a first approximation, if every gene had the same number of functional annotations, the problem would go away.
- Can’t I just filter out the multifunctional genes? We don’t recommend this as a preprocessing step; instead you should let ErmineJ help you. First, there is no clear separation between “multifunctional” and “non-multifunctional” genes. Second, highly multifunctional genes are important, so removing them entirely could be counterproductive. ErmineJ has methods to measure the impact of multifunctionality in your data, and to reduce it in carefully designed ways.
- My most-enriched gene set is full of highly multifunctional genes from my results, but ErmineJ didn’t do any correction. Is that a bug? Probably not (but feel free to check with us). This can happen if your gene ranking (or hit list) is not multifunctionality-biased, or if there are very few or no significantly enriched gene sets (say, a single one at an FDR of 0.1). The reasoning is that if your data is not biased, there is no need for correction, and if there is no enrichment, there is no need for correction. Despite this, you are right to think that there is something fishy about that top group, which is both highly multifunctional and marginally significant. Part of the solution to the problem of multifunctionality is being aware of its effects on interpretation. When there are multifunctionality effects, the term that is most enriched is potentially irrelevant to what is actually going on in the experiment.