LitInspector Background

Frisch M, Klocke B, Haltmeier M, Frech K (2009)
LitInspector: literature and signal transduction pathway mining in PubMed abstracts
Nucleic Acids Res.
[PUBMED: 19417065]

Gene recognition

The gene recognition in LitInspector is based on the comprehensive gene synonym lists provided by NCBI's Entrez Gene. These lists are complemented by Genomatix' own synonym databases, assembled over several years, which contain additional synonyms as well as deprecated synonyms that were found to produce predominantly wrong taggings.
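The dictionary-based tagging described above can be sketched roughly as follows. This is a minimal illustration, not LitInspector's actual implementation: the function name `tag_genes`, the tiny synonym table, and the placeholder gene ID 0 are assumptions made for this example. Only the mapping of "WT1" to gene ID 7490 is taken from this page.

```python
import re

# Hypothetical synonym table: synonym -> set of gene IDs.
# "WT1" -> 7490 appears on this page; the ID 0 is an invented placeholder.
SYNONYMS = {
    "WT1": {7490},
    "CO2": {0},
}

# Synonyms assumed to be deprecated because they mostly produce wrong taggings.
DEPRECATED = {"CO2"}

TOKEN = re.compile(r"[A-Za-z0-9]+")


def tag_genes(text):
    """Return (synonym, gene_ids) hits in text order, skipping deprecated synonyms."""
    hits = []
    for match in TOKEN.finditer(text):
        token = match.group()
        if token in DEPRECATED:
            continue  # deprecated synonyms are never tagged
        gene_ids = SYNONYMS.get(token)
        if gene_ids:
            hits.append((token, sorted(gene_ids)))
    return hits
```

In this sketch the deprecated list acts as a simple blacklist applied before the dictionary lookup, so a synonym can stay in the source lists while being excluded from tagging.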

Homonyms and ambiguous synonyms

Many gene synonyms are ambiguous, i.e. one synonym is used for multiple genes or even in a completely different, "non-gene" context. For instance, the synonym "MBP" is mentioned in about 6800 PubMed abstracts. MBP is a homonym: it is used for three different genes and, in addition, appears in the scientific literature in a "non-gene" context as an abbreviation. Even human experts may have difficulty resolving some homonyms and ambiguities.

Therefore, a main challenge of automatic gene data mining is disambiguation, especially of short gene synonyms. LitInspector uses a combination of automatic disambiguation modules, context databases manually curated by Genomatix, and semi-automatically generated, manually edited filtering lists. Disambiguation of gene homonyms makes use of the occurrence of further gene synonyms in the same abstract as well as automatically and manually generated context lists.
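As a rough illustration of context-based disambiguation, the sketch below scores each possible meaning of an ambiguous synonym by how many of its context words occur in the abstract. It covers only the context-list idea, not the use of further gene synonyms in the same abstract, and it is not LitInspector's algorithm. The function name `resolve`, the two sense labels, and the context word lists are assumptions invented for this example.

```python
import re

# Hypothetical context lists for the ambiguous synonym "MBP".
# Sense labels and context words are illustrative assumptions.
MBP_CONTEXT = {
    "myelin basic protein (gene)": {
        "myelin", "oligodendrocyte", "demyelination", "sclerosis",
    },
    "maltose-binding protein (non-gene)": {
        "maltose", "fusion", "tag", "purification", "coli",
    },
}


def resolve(context_lists, abstract, min_score=1):
    """Pick the sense whose context words overlap most with the abstract.

    Returns None if no sense reaches min_score or if the best score is tied,
    so an unclear mention stays unresolved instead of being guessed.
    """
    words = set(re.findall(r"[a-z]+", abstract.lower()))
    scores = {sense: len(words & ctx) for sense, ctx in context_lists.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_sense, best_score = ranked[0]
    if best_score < min_score:
        return None
    if len(ranked) > 1 and ranked[1][1] == best_score:
        return None
    return best_sense
```

Returning `None` for ties and low scores is a deliberate choice in this sketch: an unresolved mention can still be checked by a human reader, whereas a wrong assignment would silently add to the error rate.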

Although every effort is made in LitInspector to resolve ambiguities, automatic data mining programs unavoidably show a certain error rate. The advantage of LitInspector over purely graphical or tabular representations is that the scientist retains full control over the software-processed data, because the sentences containing the identified and highlighted synonyms are directly verifiable. In many cases a human expert will recognize wrongly assigned synonyms simply by scanning the sentence or abstract context. If you discover erroneously annotated synonyms, we would appreciate your feedback, especially if a synonym causes errors in a large number of abstracts, as in the case of "CO2". This would help us to improve the next LitInspector release.

Assignment of organism information

LitInspector makes use of the organism information provided within the MeSH terms. However, for the most recent abstracts the MeSH annotation is not yet complete, and for other publications organism information is generally not available. For some publications it is hard to identify the organism even if the complete paper is scanned. To make sure that no publications are skipped because no organism is annotated, LitInspector uses only soft criteria for the organism assignment. For mammalian gene tagging, LitInspector uses all abstracts and excludes only those for which a "non-mammalian" organism (e.g. Caenorhabditis, Xenopus, or plants) is annotated in the MeSH terms.

Example: for a recent paper the MeSH organism information is not yet annotated. If LitInspector identifies a synonym in the abstract, e.g. "WT1", this synonym is annotated for all mammalian organisms for which a gene synonym "WT1" is known: Homo sapiens (gene ID 7490), Mus musculus (22431), and Rattus norvegicus (24883). Consistent with that, even if a mammalian organism such as Homo sapiens is annotated in the MeSH terms, the abstract is also tagged for all other mammalian organisms such as Mus musculus and Rattus norvegicus. Only in papers with a "non-mammalian" MeSH organism annotation such as "Xenopus" is WT1 not annotated for the mammalian species. In the case of Caenorhabditis, only certain journals that contain a Caenorhabditis annotation in the MeSH terms are assigned to this organism.
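The soft organism criterion for mammalian gene tagging can be sketched as a simple exclusion filter. The three WT1 gene IDs are the ones given above. The function name `assign_mammalian_genes`, the exact MeSH term strings, and the decision to exclude an abstract as soon as any "non-mammalian" term is present (even alongside a mammalian one) are assumptions of this sketch, not documented LitInspector behavior.

```python
# Gene IDs for the synonym "WT1", as listed on this page.
WT1_GENES = {
    "Homo sapiens": 7490,
    "Mus musculus": 22431,
    "Rattus norvegicus": 24883,
}

# Hypothetical set of MeSH organism terms treated as "non-mammalian".
NON_MAMMALIAN_MESH = {"Caenorhabditis", "Xenopus", "Plants"}


def assign_mammalian_genes(mesh_terms, genes_by_organism):
    """Soft organism filter for one abstract.

    The abstract is excluded only if a non-mammalian organism is annotated.
    Otherwise, including when no organism is annotated at all, the synonym
    is tagged for every mammalian organism that has this synonym.
    """
    if NON_MAMMALIAN_MESH & set(mesh_terms):
        return {}
    return dict(genes_by_organism)
```

For example, an abstract without any MeSH organism term and an abstract annotated only with "Humans" both yield all three WT1 gene IDs, while an abstract annotated with "Xenopus" yields none.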

Tissue, disease, and small molecule tagging

The tissue, disease, and small molecule (drug) tagging is based on UMLS (Unified Medical Language System) as source for the controlled vocabulary and synonyms.

Signal transduction pathway mining

The LitInspector signal transduction pathway mining is based on Genomatix' proprietary, manually curated database of pathway components (example: WNT) and keywords (examples: "signal transduction" or "signaling cascade"). Currently, the database comprises nearly 500 signaling pathways and 75 pathway keywords. Canonical pathways from BioCarta, STKE, or KEGG are assigned and hyperlinked to most of the signaling pathways. Please note that these graphics provide an overview and do not necessarily contain the query genes. For pathway mining, the PubMed database is scanned for co-occurrence of the user input gene and the Genomatix pathway components and keywords at sentence level.
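A sentence-level co-occurrence scan of this kind might look like the sketch below. It is an illustration under stated assumptions, not the LitInspector implementation: the function names, the two-entry pathway table, and the naive sentence splitter are invented for this example, and it assumes that the query gene, a pathway component, and a pathway keyword must all occur in the same sentence. The two keywords are the ones quoted above.

```python
import re
from collections import Counter

# Hypothetical mini-database: pathway name -> component terms (lowercase).
PATHWAYS = {"WNT": {"wnt"}, "TP53": {"tp53", "p53"}}
# Two of the pathway keywords quoted on this page.
KEYWORDS = ("signal transduction", "signaling cascade")


def split_sentences(text):
    """Naive splitter; breaks on abbreviations such as 'E. coli'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def mine_pathways(query_gene, abstracts):
    """Count references per pathway.

    abstracts maps a reference ID to its abstract text. A reference counts
    for a pathway if at least one sentence contains the query gene, one of
    the pathway's components, and one pathway keyword.
    """
    refs = {name: set() for name in PATHWAYS}
    gene = query_gene.lower()
    for ref_id, text in abstracts.items():
        for sentence in split_sentences(text):
            lowered = sentence.lower()
            tokens = set(re.findall(r"[a-z0-9]+", lowered))
            if gene not in tokens:
                continue
            if not any(keyword in lowered for keyword in KEYWORDS):
                continue
            for name, components in PATHWAYS.items():
                if components & tokens:
                    refs[name].add(ref_id)
    return Counter({name: len(ids) for name, ids in refs.items() if ids})
```

Because the result is a `Counter`, calling `most_common()` on it lists the pathways with the most references first. Each reference is counted once per pathway, no matter how many of its sentences match.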

The output table is sorted by the number of references found, since a higher number of references is assumed to provide stronger evidence. In addition, the user has full access to the references and can verify the software-predicted data by clicking the link to NCBI's PubMed.

An identified association of the query gene to a pathway can have several possible meanings:

The advantage of automatic pathway mining compared to manually curated databases and static pathway associations is that the results are always up to date. This advantage comes at the cost of a certain error rate, which is inherent to all automatic text mining systems. Moreover, the pathway mining does not indicate the direction of the gene-pathway associations. The LitInspector pathway mining provides a current overview of possible pathway associations and potential interactions of the query gene. It also provides the literature references, which allow direct verification by the scientist.

Example: signal transduction pathway associations and potential interactions of WT1 (Wilms tumor 1).
The result table is sorted by the number of references for a pathway. In the case of WT1, the most references were found for WNT (Wingless type) signaling (7 references).

For verification the scientist has access to all references by clicking the hyperlinked numbers.

In case of the WT1 example:

7 references were found for WNT signaling.
6 references for Beta catenin signaling.
3 references for ABL signaling.
3 references for PKA signaling.
3 references for TP53 signaling.
2 references for BCL2 signaling.