Nucleic Acids Research Advance Access originally published online on December 12, 2008
Nucleic Acids Research 2009 37(3):771-777; doi:10.1093/nar/gkn986
© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Identifying and classifying biomedical perturbations in text
Raul Rodriguez-Esteban*,
Phoebe M. Roberts and
Matthew E. Crawford
Pfizer Research Technology Center, 620 Memorial Dr., Cambridge, MA 02139, USA
*To whom correspondence should be addressed. Email: Raul.Rodriguez-Esteban{at}pfizer.com
Received September 11, 2008. Revised November 11, 2008. Accepted November 21, 2008.
ABSTRACT
Molecular perturbations provide a powerful toolset for biomedical
researchers to scrutinize the contributions of individual molecules
in biological systems. Perturbations qualify the context of
experimental results and, despite their diversity, share properties
in different dimensions in ways that can be formalized. We propose
a formal framework to describe and classify perturbations that
allows accumulation of knowledge in order to inform the process
of biomedical scientific experimentation and target analysis.
We apply this framework to develop a novel algorithm for automatic
detection and characterization of perturbations in text and
show its relevance in the study of gene–phenotype associations
and protein–protein interactions in diabetes and cancer.
Analyzing perturbations introduces a novel view of the multivariate
landscape of biological systems.
INTRODUCTION
In the early days of biological research, mutations that caused
discernable phenotypes were the primary tool for understanding
how a biological system worked—in the absence of a mutation
a gene was invisible. Today, biologists are armed with a whole
arsenal of tools to regulate gene, mRNA, and protein abundance
and activity, thereby promoting the discovery of mechanisms
and how a system gone awry can lead to disease (1). Among these
are tools for suppressing the activity of a gene or gene product
(e.g. site-directed mutagenesis, RNA interference, small molecule
inhibitors) or enhancing activity (e.g. activating mutations
or receptor agonists). Markedly different approaches can be used
to perturb biological systems with similar effects. For instance,
interfering with protein activity using small-molecule inhibitors
should produce a phenotype similar to reducing the abundance of
the corresponding mRNA with antisense oligonucleotides (2).
Likewise, similar responses are expected whether increases in
intracellular protein concentration are achieved via an inducible
promoter or by addition of recombinant protein (3). As such,
perturbations form the core of understanding how biological
systems work, how diseases arise, and how they can be treated.
Any serious attempt to analyze a biological process starts by
identification and characterization of perturbations that have
been used in prior work. This task requires a framework that
can be systematically applied and that is amenable to both manual
and automatic means.
Currently, there is no established categorization that sufficiently represents the range of described experimental manipulations beyond high-level semantic and grammatical classifications (4,5) or descriptions of techniques (6). For example, the closest concept we have found is altered expression, defined as an altered expression level of a gene/protein (7). We believe that this concept is overly specific and fails to cover important phenomena such as changes in protein activity or gene mutations. We propose, instead, taking the existing concept of perturbation and broadening it to comprise the range of terms used in text to indicate changes in the abundance or activity of DNA, RNA and proteins. Perturbations, in this new formulation, would refer to a collection of phenomena in a manner analogous to the way protein–protein interactions refer to biological phenomena of different types (e.g. bind, activate, inhibit). Since this proposition, like any other, needs to be tested for validity and utility, we have applied it to a case study involving gene–phenotype associations in disease and have developed a mining algorithm that detects the diverse forms in which perturbations appear in text. In this work we therefore introduce both a new way to understand a crucial part of biology and a new text-mining method tailored to its extraction.
MATERIALS AND METHODS
We created three corpora that we named design,
test and analysis. As an initial step,
we created the design corpus to develop an analytical framework
for annotation. The purpose of this corpus was to identify challenges
in the annotation process and to refine guidelines that would
help the annotators in choosing their evaluations. Annotating
perturbations requires at times thorough knowledge of experimental
biology, which can only be captured and organized within a solid
framework. Therefore we sought to perform a preliminary analysis
on a test corpus to improve on subsequent annotations. The design
corpus was not used for any other purpose. This corpus was limited
to sentences that included disease-related gene–phenotype
relationships. Using the semantic relationship nomenclature
of Tsai et al. (8), we selected reports in which the agent
that deliberately performs an action is represented by a gene
or protein, and the patient that is the recipient
of the action corresponds to disease phenotypes. The information
we sought stands in contrast to associative relationships, such
as elevated protein levels correlating with disease activity.
To create the design corpus, our initial query matched Medline sentences containing ordered triplets of a gene or protein name, a causative verb and a phenotype related to cancer or diabetes. Each member of the triplet was separated within the sentence by a maximum of three words. The retrieval was performed using Linguamatics I2E version 3.0 (Cambridge, UK). This software package has the ability to retrieve sentences from text that include word patterns established by user queries. The queries may include both syntactic and semantic constructs. Semantically, term classes can be defined by combining external ontologies and adding term morphological variations and regular expression patterns. Syntactically, it recognizes part-of-speech and shallow parsing constructs such as noun phrases, verbs and prepositions. In terms of scope, queries can be confined to different document or text sections, including abstracts, titles or sentences. For example, a sentence-level query may comprise a term class protein, a list of verbs (I2E can automatically generate morphological variants) and a list of phenotype objects. The gene/protein thesaurus was internally developed and based on BioThesaurus (9). Forty verbs that signal causality (e.g. inhibition, stimulation) were compiled manually, as was a set of disease-related phenotypes relevant to cancer (e.g. tumorigenesis, vascularization) and diabetes (e.g. serum glucose levels, weight gain).
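The ordered-triplet constraint described above can be illustrated with a minimal Python sketch. It is not the I2E query itself: the tiny term lists, the simple tokenizer and the function name `find_triplets` are assumptions made purely for illustration, and only single-token terms are handled.

```python
import re

# Hypothetical, tiny term lists. The study used a BioThesaurus-based
# thesaurus, forty causative verbs and curated phenotype lists.
GENES = {"p53", "insulin", "egfr"}
VERBS = {"inhibits", "induces", "stimulates", "reduces"}
PHENOTYPES = {"tumorigenesis", "vascularization", "apoptosis"}

MAX_GAP = 3  # maximum number of words allowed between triplet members


def tokenize(sentence):
    """Lowercase the sentence and split it into word-like tokens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())


def find_triplets(sentence, max_gap=MAX_GAP):
    """Return (gene, verb, phenotype) triplets found in sentence order,
    with at most max_gap words between consecutive members."""
    tokens = tokenize(sentence)
    hits = []
    for i, gene in enumerate(tokens):
        if gene not in GENES:
            continue
        # The verb must follow the gene with at most max_gap words between.
        for j in range(i + 1, min(i + max_gap + 2, len(tokens))):
            if tokens[j] not in VERBS:
                continue
            # The phenotype must follow the verb under the same constraint.
            for k in range(j + 1, min(j + max_gap + 2, len(tokens))):
                if tokens[k] in PHENOTYPES:
                    hits.append((gene, tokens[j], tokens[k]))
    return hits
```

A sentence such as "Overexpression of p53 strongly inhibits tumorigenesis in mice" satisfies the constraint, whereas a sentence in which the verb appears more than three words after the gene name does not.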
A set of 100 retrieved sentences and relationships were annotated by three PhD-level evaluators for relevance and direction of perturbation (see Table 1 for examples). Relevance annotation was used to mark relationships that should be eliminated from the corpus for being irrelevant to the intent of the retrieval query, e.g. if the gene was not acting as an agent. Direction indicated the type of perturbation (increased, decreased or unknown) relative to the starting state. For example, if a gene is added back to cells that carry a deletion in that gene to restore the wild-type state, it is noted as an increase, because the abundance of the gene was increased relative to the starting state. Sentences were annotated as unknown if there was no stated perturbation, or direction could not be inferred at the sentence level. It is worth noting that although an unknown direction could be frequently resolved by reviewing the abstract or the full text of the article, we strictly limited our scope to evidence found at the sentence level.
While most sentences contained straightforward descriptions
of perturbations (e.g. mutations in gene A or
protein B inhibition), the broad range of perturbations
in the literature, combined with the complex grammatical structures
found in biomedical text, made some sentence–relationship
pairs difficult to annotate consistently, mainly due to differences
in knowledge of experimental descriptions and settings by the
evaluators. When inter-annotator agreement proves to be challenging,
other groups have adopted strategies to further improve the
final gold standard annotation set (10). For this work, sentences
deemed difficult to evaluate were set aside for later annotation
by discussion and group consensus.
Therefore, the gold standard was comprised of annotations with three-way agreement from the individual assessment, plus the consensus annotations. When each evaluator's set of straightforward independent annotations (n = 237 out of 300 annotations, 79%) was compared to the gold standard, agreement averaged 92.9%. Sentence–relationships for which there was no agreement or only two-way agreement were discussed and annotated by consensus (14%, n = 14). While overall inter-annotator agreement cannot be measured with the annotation procedure devised, the agreement metrics described are helpful as proxies to the level of ambiguity of the task. Pairs that are only evaluated by consensus differ from the rest largely in the knowledge required from the evaluator to elucidate the annotation, rather than in the factual ambiguity of the sentence. Hence, discussion and consensus assessment of pairs yields a better annotation set than individual annotation. An agreement of 92.9% can be considered high for the task. Only 7% (n = 7) of sentence–relationship pairs were marked irrelevant.
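The consolidation procedure can be sketched in a few lines of Python: pairs on which all three evaluators independently agree go directly into the gold standard, and the rest are queued for consensus discussion. The data layout and the function names `split_by_agreement` and `agreement_with_gold` are hypothetical and are not taken from the study.

```python
def split_by_agreement(annotations):
    """annotations maps a pair id to the three evaluators' labels
    (None marks a pair an evaluator set aside as not straightforward).
    Returns (gold, needs_consensus)."""
    gold, needs_consensus = {}, []
    for pair_id, labels in annotations.items():
        unanimous = None not in labels and len(set(labels)) == 1
        if unanimous:
            gold[pair_id] = labels[0]
        else:
            needs_consensus.append(pair_id)
    return gold, needs_consensus


def agreement_with_gold(evaluator_labels, gold):
    """Fraction of an evaluator's own annotations that match the gold standard."""
    scored = [(pid, lab) for pid, lab in evaluator_labels.items()
              if lab is not None and pid in gold]
    if not scored:
        return 0.0
    return sum(gold[pid] == lab for pid, lab in scored) / len(scored)
```

Once the queued pairs have been resolved by discussion and added to `gold`, `agreement_with_gold` gives the per-evaluator figure that is averaged in the text.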
Following the same guidelines, we created the test corpus. Each of the three evaluators assessed different sets of 500 sentence–relationship pairs, 250 for diabetes and 250 for cancer. Sentences deemed not straightforward were left without annotation (n = 126, 8.4%) and later annotated by consensus. Only 6.3% of all relationships (n = 95) were considered irrelevant, demonstrating the high precision of the retrieval query. Overall, the procedure used to construct the corpus ensured the high quality of the 1405 sentence–relationship annotations. Genes were mapped to standard nomenclature using our protein/gene thesaurus. We measured the accuracy of this mapping using 250 sentences from the corpus. Accuracy was 89%.
We performed detection of increased, decreased or unknown perturbations using machine learning techniques. For that purpose, we constructed a vector of features for each relationship–sentence pair using the test corpus above. The vector was composed of different sets of features. One of them represented token presence as a binary vector of token weights $w_i$, $d = \{w_1, \ldots, w_{|T|}\}$, where $T$ is the set of sentence tokens: a weight has value 1 if the token is present in the sentence and value 0 otherwise, an approach called set of words (SOW) (11). Tokens were created by stemming and tokenization of the sentence words. Hyphenated names were considered both as single and separate tokens in the SOW in order to capture affixes like anti-. Since proximity to the gene name can be important in determining a token's role in describing a perturbation, the sentence was further divided into several sections relative to the gene name's position. Each section was characterized by a set of features using SOW. The sections considered were: n tokens before the gene name, where n = {5, 10, all}, and n tokens after the gene name, where n = {10, all}. We noted that many perturbation descriptions were adjacent or very close to the gene name (e.g. overexpression of p53). Hence, we included a feature with the distance between the gene name and the beginning of the sentence (a distance of 0 or 1 may indicate that the perturbation is unknown, e.g. TNF-α induces apoptosis ... or The p53 gene ...). We also created a set of features using a small perturbation ontology developed independently over a disease-agnostic set of retrieved sentences. If a member of the ontology was present in the sentence a feature was added with value 1, otherwise with value 0. All the feature sets described were integrated in a single feature vector for each sentence. An algorithm based on the principles of maximum entropy (12, http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html) was trained using 80% of the test corpus, randomly sampled from the cancer and diabetes sets, and tested over the remaining 20% (results were averaged over 10 runs). The training allowed the algorithm to predict the presence of a perturbation and its direction. Performance measures were evaluated and compared favorably to an SVM algorithm (13).
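The feature sets described above could be assembled roughly as in the following Python sketch. It is an illustration under stated assumptions, not the authors' implementation: stemming is omitted, the ontology terms are hypothetical stand-ins, the feature names are invented, and the sparse dictionary treats any absent binary feature as 0.

```python
import re

# Hypothetical stand-in for the small perturbation ontology.
ONTOLOGY = ("overexpression", "knockout", "inhibition", "antisense", "anti")
WINDOWS_BEFORE = (5, 10, None)  # None stands for "all tokens"
WINDOWS_AFTER = (10, None)


def words(sentence):
    """Lowercase the sentence and split it into tokens (no stemming here)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())


def expand(tokens):
    """Set of words: every token, plus the parts of hyphenated tokens so
    that affixes such as 'anti-' become features of their own."""
    bag = set()
    for tok in tokens:
        bag.add(tok)
        if "-" in tok:
            bag.update(tok.split("-"))
    return bag


def featurize(sentence, gene):
    """Build a sparse feature dict; absent binary features count as 0."""
    tokens = words(sentence)
    gene = gene.lower()
    # Position of the first token that is, or contains, the gene name.
    # Raises StopIteration if the gene is not mentioned.
    pos = next(i for i, t in enumerate(tokens) if gene in (t, *t.split("-")))
    features = {"sow=" + tok: 1 for tok in expand(tokens)}
    for n in WINDOWS_BEFORE:
        window = tokens[:pos] if n is None else tokens[max(0, pos - n):pos]
        for tok in expand(window):
            features["before_%s=%s" % (n or "all", tok)] = 1
    for n in WINDOWS_AFTER:
        window = tokens[pos + 1:] if n is None else tokens[pos + 1:pos + 1 + n]
        for tok in expand(window):
            features["after_%s=%s" % (n or "all", tok)] = 1
    # Distance between the gene name and the beginning of the sentence.
    features["gene_distance_from_start"] = pos
    bag = expand(tokens)
    for term in ONTOLOGY:
        features["onto=" + term] = int(term in bag)
    return features
```

Feature dictionaries of this kind can then be passed to any maximum entropy (multinomial logistic regression) trainer; the toolkit cited in the text is one option.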
To create the analysis corpus, we extended the scope of our methodology used to retrieve the test corpus by eliminating the disease-specific phenotype constraints in the retrieval query (hence, phenotypes were not included in the query). We applied the query to Medline diabetes abstracts after 1996 and retrieved 359 385 relationships related to different conditions and phenotypes. This output was run through the machine-learning algorithm to create a wide-scope, disease-agnostic set of 191 240 perturbation predictions. To create a diabetes-specific subset, only sentences from Medline abstracts containing the word diabetes in their MeSH descriptors were included.
RESULTS
There were significant differences both in technique and direction
between the cancer and diabetes perturbations in the test corpus,
with decreased perturbations more prevalent in cancer.
Table 2 shows the performance of the perturbation–detection algorithm
built using different combinations of feature sets. Our algorithm
detected perturbations with an F-measure of 79.4%. Detection of
increased and decreased perturbations had a lower F-measure,
72.9% and 71.2%, respectively. When we excluded the perturbation
ontology in feature generation, results were only slightly lower,
whereas when we exclusively used the perturbation ontology,
results were much lower, notably due to reduced recall. Straightforward
relationships (91.6% of the total) were those that evaluators
annotated without consultation with other annotators. These
relationships were less challenging for humans, and the algorithm
had better performance over this subset than overall. Because no
previous work on perturbation detection exists, these results
cannot be compared directly, but they fall in ranges typical of other
successful biomedical text mining tools.
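For reference, the F-measure for one class can be computed from gold and predicted labels as in the short Python sketch below. It assumes the balanced F-measure (harmonic mean of precision and recall); the text does not state which weighting was used, and the function name is hypothetical.

```python
def f_measure(gold, predicted, target):
    """Balanced F-measure for one class (e.g. 'increased')."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == target and p == target for g, p in pairs)  # true positives
    fp = sum(g != target and p == target for g, p in pairs)  # false positives
    fn = sum(g == target and p != target for g, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a drop in recall alone is enough to lower the F-measure substantially, which is the pattern reported for the ontology-only feature set.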
To determine whether the disease impacted the frequency of perturbation
types, we compared cancer and diabetes literature. The diabetes
literature was significantly enriched in increased perturbations,
whereas the cancer literature showed an even distribution between
increased and decreased perturbations. In cancer, more perturbations
were performed with antisense oligonucleotide (n = 9), antibody (n = 16)
and RNA interference (n = 26) than in diabetes (antisense, n = 3;
antibody, n = 2; RNAi, n = 0). Perturbations in diabetes were more
frequently described as injections (n = 132) and/or administered by
dose (n = 57) than in cancer (dose, n = 3; injection, n = 6).
Perturbations in cancer were also more frequently described as being
in vivo (n = 12) or in vitro (n = 14) than in diabetes (in vivo, n = 5;
in vitro, n = 0). We examined the sentences
for frequently occurring terms, grouped by their level of regulation
(i.e. protein, mRNA or DNA), if known.
Table 3 illustrates the
wide variation in terms and affixes used in both diseases. Many
of the terms related to dosing and routes of administration
(e.g. administration, intracerebroventricular, injection) show
a strong dominance in the diabetes literature compared to the
cancer literature. Although these usually indicate an increasing
perturbation, there can be exceptions, such as systemic delivery
of an inhibitor.
Table 3. Ontology occurrence count in sentences from the test corpus, separated by diabetes and cancer phenotypes
The phenotypes in this study were selected based on prevalence
in the literature without regard to their use in vivo or in vitro.
For instance, in diabetes a change in blood glucose levels
signals a change in disease state, clearly a phenotype
monitored following perturbation in vivo. Likewise, in cancer
a change in the number of metastases is exclusively described
in in vivo systems. In contrast, insulin sensitivity can be
used to describe both in vivo and in vitro systems. For cancer,
cell proliferation and apoptosis rates are also described both
in vivo and in vitro.
If our hypothesis that perturbed genes and proteins form the underpinnings of disease mechanisms is correct, these entities should be well represented in pathway diagrams and among drug candidates. We applied our trained algorithm to a set of 14 345 sentence–relationship facts for genes from our analysis corpus belonging to a diabetes pathway (Figure 1). Predictably, our algorithm found a strong correlation between the number of Medline abstracts in which a gene is mentioned and the number of times it is described as perturbed ($r^2 = 0.70$). The results in Figure 1 show the difference in intensity and modality in which each gene is described or pursued experimentally. Some genes were typically more increased or decreased, often reflecting their roles in therapeutics. Examples of significantly increased genes ($P < 10^{-6}$) were insulin, interleukin 2 and parathyroid hormone. Among the decreased genes were epidermal growth factor receptor; caspase 8; and plasminogen activator, urokinase receptor.
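The squared correlation between abstract mentions and perturbation counts can be computed as in the following Python sketch. It assumes a plain Pearson correlation on raw counts; whether the study transformed the counts first is not stated, and the function name is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def r_squared(xs, ys):
    """Squared Pearson correlation between xs and ys."""
    return pearson_r(xs, ys) ** 2
```

Here `xs` would hold, for each gene, the number of abstracts mentioning it, and `ys` the number of times it is described as perturbed.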

Figure 1. Perturbations extracted from genes involved in the Diabetes Mellitus Type II pathway in the Kyoto Encyclopedia of Genes and Genomes (21). Differences in perturbation direction are large for some members of the pathway; note that the scale bar is logarithmic.
We compared detected perturbation counts from the analysis corpus
against the gene–disease associations included in the
OMIM Morbid Map (OMIM online reference,
http://www.ncbi.nlm.nih.gov/omim/).
The average abstract mention and gene perturbation counts were
higher for genes with MorbidMap associations (t-test, $P < 10^{-4}$).
This was consistent with our expectation that genes
associated with disease would be the subject of deeper study.
However, genes that had been perturbed numerous times were not
necessarily linked to disease. For example, out of the 100 most
perturbed genes, 54 were not linked to disease in OMIM MorbidMap,
including at the top such genes as jun oncogene, interleukin
4, fibroblast growth factor 2 (basic), colony stimulating factor
2 (granulocyte–macrophage), colony stimulating factor
3 (granulocyte) and mitogen-activated protein kinase 3.
DISCUSSION
Perturbations are relevant for areas like the study of gene–disease
associations, protein–protein interactions (PPI) or gene
regulatory networks. Gene–disease association extraction
studies have largely focused on simply detecting associations
rather than characterizing them (7,14). PPI and gene regulation
extraction studies have side-stepped perturbation types, without
considering anything further than the experimental technique
(15,16). We note that PPI relationships are frequently devoid
of perturbations and the focus is on how a PPI was detected,
which does not necessarily involve a causative relationship.
The Proteomics Standards Initiative – Molecular Interactions
(PSI-MI) ontology (6) is a comprehensive effort to describe
molecular interactions. This ontology, while including a detailed
experimental preparation section, lacks expressivity in describing
perturbations generically. Similarly to Bundschus et al. (7),
it includes a section on expression level with entries for
under expressed, over expressed and physiological.
The lack of a general framework for recognizing, characterizing and classifying perturbations is surprising when one considers their importance in phenomena encountered experimentally. Researchers with interest in characterizing previous and current perturbation work on a biological system face the challenge of a naturalist trying to deal with animal species without a Linnaean taxonomy. This is reflected in the methodological landscape that was set in the early literature in the fields of biomedical ontologies and text mining. For example, the comprehensive text mining tool MedLEE (17) only considered the state of a gene or protein, where the state has an adjectival role such as mutated in the phrase mutated X. Perturbation descriptions, however, should be considered carefully. Observe the differences between the sentences (i) X activates Y. and (ii) Inhibition of X activates Y. From the point of view of classic PPI extraction both relationships are equivalent and can be represented with the triplet X activate Y. The perturbation in sentence (ii), however, signals that it is likely that protein X is inhibiting protein Y instead. We have called this phenomenon reversal. Given the results of the present assessment, a review of the relationship data available should be considered under this model.
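The reversal phenomenon can be made concrete with a small Python sketch. The cue lists, the signed-relation encoding and the function name `effective_relation` are illustrative assumptions, and the output should be read as the relationship the sentence likely implies rather than as an established fact.

```python
# Hypothetical cue lists used only for this illustration.
ACTIVATING = {"activates", "induces", "stimulates"}
INHIBITING = {"inhibits", "represses", "blocks"}
DECREASE_CUES = {"inhibition", "knockdown", "silencing", "deletion"}


def effective_relation(perturbation, agent, verb, target):
    """Return (agent, sign, target), where sign is +1 if the agent likely
    activates the target and -1 if it likely inhibits it.
    perturbation is the cue attached to the agent, or None if absent."""
    if verb in ACTIVATING:
        sign = 1
    elif verb in INHIBITING:
        sign = -1
    else:
        raise ValueError("unknown verb: %r" % verb)
    # A decreasing perturbation of the agent reverses the stated effect.
    if perturbation in DECREASE_CUES:
        sign = -sign
    return agent, sign, target
```

With this encoding, sentence (i) yields `("X", 1, "Y")` while sentence (ii) yields `("X", -1, "Y")`, even though classic PPI extraction would represent both with the same triplet.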
We have focused our methodology on gene–phenotype associations in disease but the principles shown are applicable to other well-known areas, such as PPI, as well as less explored ones such as identification of biomarkers or cellular processes. A perturbation taxonomy, like the one described in Table 4, could capture the different dimensions that may be of interest to the inquiring scientist. This taxonomy has four annotation types: relevance, direction, molecule and effect. Relevance annotation marks relationships that are irrelevant to the intent of the retrieval query. Direction distinguishes between perturbations that represent an increase or a decrease over starting levels. Unknown direction annotation is intended for perturbations whose direction cannot be inferred at the sentence level. Molecule annotation characterizes the type of molecule being primarily affected by the perturbation: gene, RNA or protein. Expression annotation is used for a change of expression level without clarifying whether the change is in RNA or protein. Effect annotation differentiates between changes in activity or function and changes in abundance. The following examples illustrate these annotations: A gene mutation is a decrease in gene activity, where the function/activity of a gene is specifically understood as making a wild-type transcript. Exogenous addition of a gene via viral transfection, plasmid transformation, etc., is an increase in gene abundance. A gene duplication is an increase in gene abundance while a knockout is a decrease in gene abundance. Dominant negative is a mutation in a gene, which indicates that, compared to wild-type, it has defective function. Silencing, knock-down and antisense all apply to a decrease in RNA abundance. An antibody blocking a protein decreases the protein activity or function. Interfering with a protein binding another protein is a decrease in a protein's function or activity. The terms treatment, incubation, recombination and synthetic refer to exogenous addition of protein.
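The four annotation types lend themselves to a simple data structure, sketched below in Python. The class and field names are hypothetical, Table 4 may organize the taxonomy differently, and the example entries encode only mappings stated explicitly in the text.

```python
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    INCREASE = "increase"
    DECREASE = "decrease"
    UNKNOWN = "unknown"


class Molecule(Enum):
    GENE = "gene"
    RNA = "RNA"
    PROTEIN = "protein"
    EXPRESSION = "expression"  # RNA or protein not specified


class Effect(Enum):
    ACTIVITY = "activity or function"
    ABUNDANCE = "abundance"


@dataclass(frozen=True)
class Perturbation:
    relevant: bool
    direction: Direction
    molecule: Molecule
    effect: Effect


# Example annotations taken from the descriptions in the text.
EXAMPLES = {
    "gene mutation": Perturbation(True, Direction.DECREASE, Molecule.GENE, Effect.ACTIVITY),
    "gene duplication": Perturbation(True, Direction.INCREASE, Molecule.GENE, Effect.ABUNDANCE),
    "knockout": Perturbation(True, Direction.DECREASE, Molecule.GENE, Effect.ABUNDANCE),
    "silencing": Perturbation(True, Direction.DECREASE, Molecule.RNA, Effect.ABUNDANCE),
    "blocking antibody": Perturbation(True, Direction.DECREASE, Molecule.PROTEIN, Effect.ACTIVITY),
}
```

Keeping direction, molecule and effect as separate fields lets perturbations be grouped along any one dimension, for example all decreases regardless of technique.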
Perturbations evolve, notably as new techniques are developed
and targets are identified. We expect perturbations to be subject
to trends and popularity variations similar to those in other
aspects of biomedicine (18–20).
FUNDING
Funding for open access charge: Pfizer Systems Biology.
Conflict of interest statement. None declared.
ACKNOWLEDGEMENTS
We would like to thank Michael L. McGlashen for his continued
support to the work that led to this manuscript.
REFERENCES
1. Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P. Manipulating Proteins, DNA and RNA. In: Molecular Biology of the Cell (2002) 4th edn. New York: Garland.
2. Evans R, Naber C, Steffler T, Checkland T, Keats J, Maxwell C, Perry T, Chau H, Belch A, Pilarski L, et al. Aurora A kinase RNAi and small molecule inhibition of Aurora kinases with VE-465 induce apoptotic death in multiple myeloma cells. Leuk. Lymphoma (2008) 49:559–569.
3. Providence KM, Higgins SP, Mullen A, Battista A, Samarakoon R, Higgins CE, Wilkins-Port CE, Higgins PJ. SERPINE1 (PAI-1) is deposited into keratinocyte migration "trails" and required for optimal monolayer wound repair. Arch. Dermatol. Res. (2008) 300:303–310.
4. Friedman C, Kra P, Rzhetsky A. Two biomedical sublanguages: a description based on the theories of Zellig Harris. J. Biomed. Inform. (2002) 35:222–235.
5. Pyysalo S, Ginter F, Heimonen J, Björne J, Boberg J, Järvinen J, Salakoski T. BioInfer: a corpus for information extraction in the biomedical domain. BMC Bioinformatics (2007) 8:50.
6. Hermjakob H, Montecchi-Palazzi L, Bader G, Wojcik J, Salwinski L, Ceol A, Moore S, Orchard S, Sarkans U, von Mering C, et al. The HUPO PSI's molecular interaction format--a community standard for the representation of protein interaction data. Nat. Biotechnol. (2004) 22:177–183.
7. Bundschus M, Dejori M, Stetter M, Tresp V, Kriegel HP. Extraction of semantic biomedical relations from text using conditional random fields. BMC Bioinformatics (2008) 9:207.
8. Tsai RT, Chou WC, Su YS, Lin YC, Sung CL, Dai HJ, Yeh IT, Ku W, Sung TY, Hsu WL. BIOSMILE: a semantic role labeling system for biomedical verbs using a maximum-entropy model with automatically generated template features. BMC Bioinformatics (2007) 8:325.
9. Liu H, Hu ZZ, Zhang J, Wu C. BioThesaurus: a web-based thesaurus of protein and gene names. Bioinformatics (2006) 22:103–105.
10. Colosimo ME, Morgan AA, Yeh AS, Colombe JB, Hirschman L. Data preparation and interannotator agreement: BioCreAtIvE task 1B. BMC Bioinformatics (2005) 6(Suppl. 1):S12.
11. Sebastiani F. Machine learning in automated text categorization. ACM Computing Surveys (2002) 34:1–47.
12. Berger AL, Della Pietra VJ, Della Pietra SA. A maximum entropy approach to natural language processing. Comput. Linguist. (1996) 22:39–71.
13. Joachims T. Making large-scale support vector machine learning practical. In: Schölkopf B, Burges C, Smola A, eds. Advances in Kernel Methods: Support Vector Machines (1998) Cambridge, MA: MIT Press. 392.
14. Chun HW, Tsuruoka Y, Kim JD, Shiba R, Nagata N, Hishiki T, Tsujii J. Automatic recognition of topic-classified relations between prostate cancer and genes using MEDLINE abstracts. BMC Bioinformatics (2006) 7(Suppl. 3):S4.
15. Rzhetsky A, Koike T, Kalachikov S, Gomez SM, Krauthammer M, Kaplan SH, Kra P, Russo JJ, Friedman C. A knowledge model for analysis and simulation of regulatory networks. Bioinformatics (2000) 16:1120–1128.
16. Beisswanger E, Lee V, Kim JJ, Rebholz-Schuhmann D, Splendiani A, Dameron O, Schulz S, Hahn U. Gene Regulation Ontology (GRO): design principles and use cases. Stud. Health Technol. Inform. (2008) 136:9–14.
17. Friedman C, Alderson PO, Austin JH, Cimino JJ, Johnson SB. A general natural-language text processor for clinical radiology. J. Am. Med. Inform. Assoc. (1994) 1:161–174.
18. Cokol M, Iossifov I, Weinreb C, Rzhetsky A. Emergent behavior of growing knowledge about molecular interactions. Nat. Biotechnol. (2005) 23:1243–1247.
19. Pfeiffer T, Hoffmann R. Temporal patterns of genes in scientific publications. Proc. Natl Acad. Sci. USA (2007) 104:12052–12056.
20. Cokol M, Rodriguez-Esteban R. Visualizing evolution and impact of biomedical fields. J. Biomed. Inform. (2008) 41:1050–1052.
21. Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. (2000) 28:27–30.
