NLProt: extracting protein names and sequences from papers
1 CUBIC and 2 NorthEast Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA, 3 Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 Saint Nicholas Avenue, New York, NY 10032, USA and 4 Institute of Physical Biochemistry, University Witten/Herdecke, Stockumer Strasse 10, 58448 Witten, Germany
* To whom correspondence should be addressed. Tel: +1 212 305 4018; Fax: +1 212 305 7932; Email: mika{at}cubic.bioc.columbia.edu
Received February 12, 2004; Revised March 26, 2004; Accepted April 12, 2004
Automatically extracting protein names from the literature and linking these names to the associated entries in sequence databases is becoming increasingly important for annotating biological databases. NLProt is a novel system that combines dictionary- and rule-based filtering with several support vector machines (SVMs) to tag protein names in PubMed abstracts. When considering partially tagged names as errors, NLProt still reached a precision of 75% at a recall of 76%. By many criteria our system outperformed other tagging methods significantly; in particular, it proved very reliable even for novel names. Names encountered particularly frequently in Drosophila, such as white, wing and bizarre, constitute an obvious limitation of NLProt. Our method is available both as an Internet server and as a program for download (http://cubic.bioc.columbia.edu/services/NLProt/). Input can be PubMed/MEDLINE identifiers, authors, titles and journals, as well as collections of abstracts, or entire papers.
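The strict scoring mentioned above, in which a partially tagged name counts as an error, can be illustrated with a minimal sketch. This is not the NLProt evaluation code: the function name `strict_precision_recall`, the representation of names as (start, end) character offsets, and the example spans are all assumptions made for illustration.

```python
def strict_precision_recall(gold, predicted):
    """Exact-match precision and recall over tagged name spans.

    Spans are hypothetical (start, end) character offsets. A predicted
    span counts as correct only if it matches a gold span exactly, so a
    partially tagged name is scored as an error.
    """
    gold_set = set(gold)
    pred_set = set(predicted)
    true_positives = len(gold_set & pred_set)
    # Guard against division by zero when either set is empty.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Illustrative data only: four annotated names, four predictions.
gold = [(0, 3), (10, 14), (20, 32), (40, 45)]
# (20, 27) covers only part of the gold name (20, 32): counted as wrong.
predicted = [(0, 3), (10, 14), (20, 27), (50, 55)]

precision, recall = strict_precision_recall(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this criterion the partial match (20, 27) lowers both precision and recall, because it is a wrong prediction and the gold name (20, 32) also goes unfound. A more lenient scheme that credits overlapping spans would report higher figures for the same output.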
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated.