Nucleic Acids Research Advance Access first published online on August 30, 2008
This version published online on September 8, 2008

Nucleic Acids Research, doi:10.1093/nar/gkn546

© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


Database Issue

Déjà vu: a database of highly similar citations in the scientific literature

Mounir Errami1,*, Zhaohui Sun1, Tara C. Long2, Angela C. George2 and Harold R. Garner1,2

1Division of Translational Research and 2McDermott Center for Human Growth and Development, The University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390-9185, USA

*To whom correspondence should be addressed. Tel: +1 214 648 5992; Fax: +1 214 648 1445; Email: mounir.errami{at}utsouthwestern.edu

Received August 7, 2008. Accepted August 9, 2008.


    ABSTRACT
 
In the scientific research community, plagiarism and covert multiple publications of the same data are considered unacceptable because they undermine public confidence in scientific integrity. Yet little has been done to help authors and editors identify highly similar citations, which sometimes may represent cases of unethical duplication. For this reason, we have created Déjà vu, a publicly available database of highly similar Medline citations identified by the text similarity search engine eTBLAST. Following manual verification, highly similar citation pairs are classified into various categories ranging from duplicates with different authors to sanctioned duplicates. Déjà vu records also contain user-provided commentary and supporting information to substantiate each document's categorization. Déjà vu and eTBLAST are available to authors, editors, reviewers, ethicists and sociologists to study, intercept, annotate and deter questionable publication practices. These tools are part of a sustained effort to enhance the quality of Medline as ‘the’ biomedical corpus. The Déjà vu database is freely accessible at http://spore.swmed.edu/dejavu. The tool eTBLAST is also freely available at http://etblast.org.


    INTRODUCTION
 
Authorship of scientific papers is one of the most valuable currencies for scientists and engineers, and is an asset not only for climbing the corporate or academic ladder (1) but also, most importantly, for securing funding for academic laboratories. The fierce competition in most scientific disciplines and the increasing necessity to publish may lead authors to engage in questionable behavior such as publishing a single piece of work more than once, emulating the style of another person's work or copying its content. Duplicate publication may be useful to provide wider access to the scientific community or to report important updates to surveys or clinical trials, but publications that simply reproduce a previous work with virtually identical results and conclusions often lack the novelty to justify additional publication. The latter types of duplicate publication are considered unethical because they undermine public confidence in scientific integrity. Others have previously described additional duplicate publication behaviors referred to as ‘salami slicing’ (dissecting a scientific work into multiple least publishable units) and ‘meat extenders’ (building on a previous publication with new data that would not be publishable alone) (2–4). Most previous studies of duplicate publication have been limited to a particular scientific field where duplication was painstakingly identified manually, underscoring the need for an automated method to detect putative duplications (5–16).

We have established a method to identify highly similar citations in Medline, the comprehensive literature database of life sciences and biomedical information, using the text similarity search engine eTBLAST (17,18). We were able to statistically calibrate eTBLAST to identify citations that have unusually high similarity, which were then saved in Déjà vu pending manual inspection (19,20).


    CONTENT AND METHODS
 
Identification of highly similar citations
Technical details describing the detection of highly similar citations and its application to the entire Medline database have been reported previously (19,20). Briefly, the method that has contributed the preponderance of entries in Déjà vu involves ‘eTBLASTing’ each Medline citation against its most related article (a feature available from Medline). Upon comparison, citation pairs whose similarity exceeds predetermined thresholds are flagged as highly similar and stored in Déjà vu awaiting manual verification by human curators.
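
The flagging step can be pictured with a minimal sketch. The scoring used by eTBLAST is described in the cited reports (17–20) and is not reproduced here; as a stand-in, the sketch uses a simple Jaccard word-overlap score. The function names and the 0.5 threshold are hypothetical placeholders, not the statistically calibrated values used to populate Déjà vu.

```python
import re

def tokenize(text):
    """Lowercase a text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard_similarity(text_a, text_b):
    """Stand-in similarity: shared words divided by total distinct words."""
    words_a, words_b = tokenize(text_a), tokenize(text_b)
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

# Hypothetical threshold; the real thresholds were statistically calibrated.
SIMILARITY_THRESHOLD = 0.5

def flag_pair(citation, most_related, threshold=SIMILARITY_THRESHOLD):
    """Return True when a citation and its most related article exceed the threshold."""
    return jaccard_similarity(citation, most_related) > threshold
```

A pair flagged in this way would only be a candidate; in Déjà vu every flagged pair still awaits manual verification.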

Manual classification of highly similar citations
Déjà vu was designed and developed to allow for collaborative work among multiple curators. It was also necessary to define a broad, flexible and extensible classification scheme to accommodate a wide range of highly similar documents dealing with all areas of biomedical research, reflecting different publication behaviors, styles and agreements. Upon manual verification, highly similar citation pairs were classified into one or more of the categories listed and defined in Table 1. In particular, we sought to distinguish between appropriate and inappropriate duplication, a process which is admittedly subjective. A pair of duplicates with different authors may indicate potential plagiarism, while two publications with shared authors may indicate multiple publication of the same study. Updates to clinical trials or survey-type research are instances where complete duplication is not necessarily inappropriate. Similarly, studies with different outcomes using similar phraseology may bring valuable new information. Errata, which may or may not be tagged as such in Medline, are most similar to the initial record, often involving only a typographical correction. All of these determinations are difficult or impossible to accomplish computationally, and thus are best made by human curators.
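
The author-based distinction drawn above can be sketched as a small pre-sorting helper. The function below only compares two author lists to suggest which question a curator might ask first. The label strings and the name normalization are hypothetical, do not reproduce the categories of Table 1, and leave the final judgement to human curators.

```python
def normalize_author(name):
    """Reduce an author name to a lowercase form without periods or commas."""
    return " ".join(name.replace(".", "").replace(",", " ").lower().split())

def author_overlap_hint(authors_a, authors_b):
    """Suggest a starting point for curation from the two author lists."""
    shared = {normalize_author(a) for a in authors_a} & {
        normalize_author(b) for b in authors_b
    }
    if shared:
        # Shared authors: possibly multiple publication of the same study.
        return "shared authors"
    # No shared authors: possibly plagiarism.
    return "different authors"
```

Such a hint cannot separate an erratum or a clinical trial update from an inappropriate duplicate, which is why it is only a starting point for manual review.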


Table 1. Déjà vu content by category and category definitions

 
Déjà vu in numbers
All data collected have been consolidated into a web-accessible database, available at http://spore.swmed.edu/dejavu. As of 22 July 2008, Déjà vu contains a total of 74 760 records, of which 5645 have been manually inspected (Table 1). Déjà vu has received over 40 000 visits since 1 January 2008 and currently receives an average of about 2000 visits per month.


    QUERIES AND INTERFACE
 
The Déjà vu interface was built using Python (http://python.org) and the Django web framework (http://djangoproject.com). Data are stored in a backend MySQL database (http://mysql.com). Déjà vu was designed to allow real-time collaborative annotation by multiple curators, who need not be programmers to add comments and updates or create new records.

On the Déjà vu website, users can: (i) browse Déjà vu entries with no specific search method (each entry links to the scientific citation, along with the full text when freely available); (ii) perform generic searches within the Déjà vu content by authors, address, title word, abstract word, year and comment word; (iii) perform detailed searches by specifying search criteria specific to PMID, journal names, title words, abstract, address and year; (iv) filter and view Déjà vu results in a particular category or identified by particular authors (same or different), language, availability of full text, discovery method, etc.; (v) send comments or reports to contest a record or submit a potential duplication to be reviewed by human curators; and (vi) access statistics using different filters including category, language, country, journals, etc.
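
A generic search of the kind listed in (ii) can be pictured as a case-insensitive word match across several fields of each record. The sketch below is illustrative only: the field names and the sample records are hypothetical and do not describe the actual Déjà vu schema or the queries issued by the website.

```python
# Hypothetical field names, loosely following the search options listed above.
SEARCH_FIELDS = ("authors", "address", "title", "abstract", "year", "comment")

def generic_search(records, word, fields=SEARCH_FIELDS):
    """Return the records in which `word` occurs in any of the given fields."""
    needle = word.lower()
    return [
        record for record in records
        if any(needle in str(record.get(field, "")).lower() for field in fields)
    ]

# Made-up sample records, for illustration only.
records = [
    {"title": "Example study A", "authors": "Doe J", "year": 2005, "comment": "update"},
    {"title": "Example study B", "authors": "Roe R", "year": 2007, "comment": ""},
]
```

Restricting `fields` to a single entry, for example `fields=("year",)`, would approximate the more detailed searches described in (iii).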

For each duplicate record, a viewing window presents citations side-by-side with similarities or differences highlighted (Figure 1), providing a user-friendly interface to search, browse and facilitate rapid and rigorous interpretation of the results. Déjà vu data are also available for data mining in two formats: comma-separated values and a MySQL script to recreate the MySQL database.
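
The highlighting of similarities can be approximated by marking every word of one text that also occurs in the other. The sketch below is not the Déjà vu implementation: the function name is hypothetical, and shared words are wrapped in asterisks rather than colored as in the web interface.

```python
import re

def highlight_shared(text, other, mark="*"):
    """Wrap each word of `text` that also occurs in `other` with `mark`."""
    other_words = set(re.findall(r"\w+", other.lower()))

    def wrap(match):
        word = match.group(0)
        return f"{mark}{word}{mark}" if word.lower() in other_words else word

    return re.sub(r"\w+", wrap, text)
```

Calling the function once in each direction gives the two halves of a side-by-side view.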


Figure 1. The Déjà vu citation presentation output. (A) Browsing interface for database content. (B) Query box to search duplicate records by author names, title, abstract, year of publication and comment words. (C) List of records in Déjà vu including PMIDs, author names, publication date and links to Medline citations and free full text when available. (D) Category filters to browse records in a particular category. (E) Side-by-side view of a duplicate record highlighting overlapping keywords in blue. (F) Miscellaneous information for each article involved.

 

    CONCLUSION AND FUTURE DIRECTIONS
 
The Déjà vu database is the first of its kind to publicly present cases of highly similar citations in Medline. In addition to presenting the list of highly similar citations, a goal of Déjà vu is to help scientists study in depth the behaviors of authors, the characteristics underlying multiple publications and the related ethics issues surrounding the process of scientific publication. A friendly interface provides users with various browsing options along with a graphical representation of the overlapping information between citations. Ultimately, Déjà vu may act as a deterrent to the unethical practice of duplication.

Further work, currently in progress, that will substantially improve Déjà vu includes the following. (i) A streamlined process to update Déjà vu on a daily basis. (ii) A more collaborative approach for recruitment and qualification of topical experts as volunteer curators for specific publication areas. (iii) New methods to better address the question most often asked by authors introduced to Déjà vu: ‘Am I in it, or has my work been duplicated?’ Authors can now check if their work has been duplicated by submitting their abstracts one by one directly to eTBLAST, which then flags highly similar citations for the authors to pursue. Utilities are being developed to allow authors to scan their entire bibliography at once (retrieved using Medline Entrez keyword queries) to obtain a list of highly similar citations for each citation entered. Authors will also be able to automatically submit suspicious highly similar citations found by this process directly to Déjà vu curators. (iv) Additional literature databases. Currently, the duplications in Déjà vu were all obtained from Medline citations. Other literature databases will be added as they are scanned by eTBLAST, including the Institute of Physics, NASA and NIH CRISP.


    FUNDING
 
P.O’B. Montgomery Distinguished Chair (to H.G.); the Hudson Foundation (to H.G.); National Institutes of Health/National Library of Medicine grant (R01 LM009758-01 to H.R.G.). Funding for open access charge: P.O’B. Montgomery Distinguished Chair.


    ACKNOWLEDGEMENTS
 
The authors thank David Trusty for computer administrative support, Dr John Loadsman as a substantial contributing curator, Dr Wayne Fisher for useful comments and discussions and Linda Gunn for administrative assistance. They also wish to thank numerous Déjà vu users who have reported inaccuracies or have alerted them to questionable publications.


    REFERENCES
 

  1. Budinger TF, Budinger MD. Ethics of Emerging Technologies, Scientific Facts and Moral Challenges (2006) NJ: John Wiley and Sons.

  2. Broad WJ. The publishing game: getting more for less. Science (1981) 211:1137–1139.

  3. Huth EJ. Irresponsible authorship and wasteful publication. Ann. Intern. Med. (1986) 104:257–259.

  4. von Elm E, Poglia G, Walder B, Tramer MR. Different patterns of duplicate publication: an analysis of articles used in systematic reviews. J. Am. Med. Assoc. (2004) 291:974–980.

  5. Schein M, Paladugu R. Redundant surgical publications: tip of the iceberg? Surgery (2001) 129:655–661.

  6. Rosenthal EL, Masdon JL, Buckman C, Hawn M. Duplicate publications in the otolaryngology literature. Laryngoscope (2003) 113:772–774.

  7. Roig M. Re-using text from one's own previously published papers: an exploratory study of potential self-plagiarism. Psychol. Rep. (2005) 97:43–49.

  8. Mojon-Azzi SM, Jiang X, Wagner U, Mojon DS. Redundant publications in scientific ophthalmologic journals: the tip of the iceberg? Ophthalmology (2004) 111:863–866.

  9. Kostoff RN, Johnson D, Rio JAD, Bloomfield LA, Shlesinger MF, Malpohl G, Cortes HD. Duplicate publication and ‘paper inflation’ in the Fractals literature. Sci. Eng. Ethics (2006) 12:543–554.

  10. Gotzsche PC. Multiple publication of reports of drug trials. Eur. J. Clin. Pharmacol. (1989) 36:429–432.

  11. Durani P. Duplicate publications: redundancy in plastic surgery literature. J. Plast. Reconstr. Aesthet. Surg. (2006) 59:975–977.

  12. Chennagiri RJR, Critchley P, Giele H. Duplicate publication in the Journal of Hand Surgery. J. Hand Surg. (2004) 29:625–628.

  13. Bloemenkamp DG, Walvoort HC, Hart W, Overbeke AJ. [Duplicate publication of articles in the Dutch Journal of Medicine in 1996]. Ned. Tijdschr. Geneeskd. (1999) 143:2150–2153.

  14. Blancett SS, Flanagin A, Young RK. Duplicate publication in the nursing literature. Image J. Nurs. Sch. (1995) 27:51–56.

  15. Barnard H, Overbeke AJ. [Duplicate publication of original manuscripts in and from the Nederlands Tijdschrift voor Geneeskunde]. Ned. Tijdschr. Geneeskd. (1993) 137:593–597.

  16. Bailey BJ. Duplicate publication in the field of otolaryngology-head and neck surgery. Otolaryngol. Head Neck Surg. (2002) 126:211–216.

  17. Lewis J, Ossowski S, Hicks J, Errami M, Garner HR. Text similarity: an alternative way to search MEDLINE. Bioinformatics (2006) 22:2298–2304.

  18. Errami M, Wren JD, Hicks JM, Garner HR. eTBLAST: a web server to identify expert reviewers, appropriate journals and similar publications. Nucleic Acids Res. (2007) 35:W12–W15.

  19. Errami M, Garner H. A tale of two citations. Nature (2008) 451:397–399.

  20. Errami M, Hicks JM, Fisher W, Trusty D, Wren JD, Long TC, Garner HR. Deja vu–a study of duplicate citations in Medline. Bioinformatics (2008) 24:243–249.

