Skip Navigation

Nucleic Acids Research 2004 32(Web Server Issue):W257-W260; doi:10.1093/nar/gkh396
This Article
Right arrow Abstract Freely available
Right arrow Print PDF (236K) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Commercial Re-use Guidelines
for Open Access NAR Content
Google Scholar
Right arrow Articles by Vega, V. B.
Right arrow Articles by Lin, C.-Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Vega, V. B.
Right arrow Articles by Lin, C.-Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2004, the authors
Nucleic Acids Research, Vol. 32, Web Server issue © Oxford University Press 2004; all rights reserved

BEARR: Batch Extraction and Analysis of cis-Regulatory Regions

Vinsensius B. Vega*, Dhinoth Kumar Bangarusamy, Lance D. Miller, Edison T. Liu and Chin-Yo Lin

Genome Institute of Singapore, 60 Biopolis Street, #02-01, Singapore 138672

* To whom correspondence should be addressed. Tel: +65 6478 8041; Fax: +65 6478 9058; Email: vegav{at}gis.a-star.edu.sg
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors

Received February 9, 2004; Revised and Accepted March 24, 2004


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 SYSTEM DESIGN
 REGULATORY REGION EXTRACTION
 SEQUENCE ANALYSIS
 ANALYSIS EXAMPLES
 FUTURE DEVELOPMENTS
 REFERENCES
 
Transcription factors play important roles in regulating biological and disease processes. Microarray technology has enabled researchers to simultaneously monitor changes in the expression of thousands of transcripts. By identifying specific transcription factor binding sites in the cis-regulatory regions of differentially expressed genes, it is then possible to identify direct targets of transcription factors, model transcriptional regulatory networks and mine the dataset for relevant targets for experimental and clinical manipulation. We have developed web-based software to assist biologists in efficiently carrying out the analysis of microarray data from studies of specific transcription factors. Batch Extraction and Analysis of cis-Regulatory Regions, or BEARR, accepts gene identifier lists from microarray data analysis tools and facilitates identification, extraction and analysis of regulatory regions from the large amount of data that is typically generated in these types of studies. The software is publicly available at http://giscompute.gis.a-star.edu.sg/~vega/BEARR1.0/.


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 SYSTEM DESIGN
 REGULATORY REGION EXTRACTION
 SEQUENCE ANALYSIS
 ANALYSIS EXAMPLES
 FUTURE DEVELOPMENTS
 REFERENCES
 
Microarray technology has been applied extensively to interrogate changes in gene expression patterns that underlie biological and disease processes. The observed alterations in transcript levels are mediated by intracellular signalling pathways and downstream transcription factors that interact with specific regulatory regions of the genome and the transcription machinery. Therefore, DNA microarrays are especially well suited for experiments where specific transcription factors are known to play important roles. Transcription factors that are involved in cancers, such as p53 and Myc, have been studied using this new technology. By combining expression data and computational analysis of regulatory regions, it is then possible to begin identifying the molecular mechanisms of transcriptional regulation, to map the network of transcription factors and their target genes which affect the phenotypes being studied, and to generate additional hypothesis for testing in the laboratory [see comments in (1)].

Some online computational tools for mammalian gene regulatory region extraction and analysis are available to researchers, although with varying degrees of accessibility, capability and focus. The most comprehensive tools, including TOUCAN (2), EZRetrieve (3), the Genomatix suite (www.genomtix.de) and TRANSPLORER (www.biobase.de) from BIOBASE (the former two are from academic sources and the latter two are only available commercially for batch analysis), usually employ annotation-based sequence retrieval and position weight matrices, typically from the TRANSFAC database (4) or specialized databases, for sequence extraction (e.g. the upstream region extraction tool developed by the Harvard-Lipper Center, http://arep.med.harvard.edu/labgc/adnan/hsmmupstream) and binding site identification, respectively. There are also standalone [e.g. MEME (5), BioProspector (6), MDScan and REDUCE] and integrated [e.g. MotifSampler and phylogenetic footprinting modules in TOUCAN] motif discovery tools designed to uncover conserved sequences in extracted regulatory regions (7,8). While the sophisticated approaches in these programs represent innovations in comprehensive binding site prediction and discovery and are useful for detailed analysis and mining of refined or idealized datasets—for example members of the same gene family or known targets of specific transcription factors—nevertheless, based on our experience, they may be of limited use against the large amount of ‘noisy’ data generated in microarray studies that often confound the interpretation of the analytical results or fail to yield conserved regulatory sequences. Whereas these tools favour comprehensive approaches for identifying all known binding sites or conserved sequence discovery in the dataset, we propose that a critical and complementary step prior to a more comprehensive analysis and experimental validation is a hypothesis-driven and focused computational strategy based on the experimental design. Furthermore, although batch options are available for some of the tools noted, they have not been optimized for batch operations and are not suitable for the large amount of data that can be generated from genome-scale studies. Therefore, we created Batch Extraction and Analysis of cis-Regulatory Regions, or BEARR, to assist biologists in performing batch extraction and analysis of cis-Regulatory regions of hundreds or even thousands of differentially expressed genes identified in microarray studies. Here, we describe the system design and functionalities.


    SYSTEM DESIGN
 TOP
 ABSTRACT
 INTRODUCTION
 SYSTEM DESIGN
 REGULATORY REGION EXTRACTION
 SEQUENCE ANALYSIS
 ANALYSIS EXAMPLES
 FUTURE DEVELOPMENTS
 REFERENCES
 
The system is divided into two parts: (i) the regulatory region sequence extraction module and (ii) the sequence analysis module. Essentially, the sequence extraction module generates the nucleotide sequences, using annotation and genomic sequence databases, to be queried by the analysis module, based on transcription factor binding sites or any conserved regions defined by the user. The modular system design allows for maximum component reusability and extensibility. To ensure platform independence and ease of installation, no specialized library or programs were employed. An interactive web-based user interface was developed for ease of use, interoperability and accessibility. Figure 1 shows a screenshot of the web interface. The system schema of BEARR is also available in the FAQ section of website.



View larger version (104K):
[in this window]
[in a new window]
 
Figure 1. BEARR web interface.

 

    REGULATORY REGION EXTRACTION
 TOP
 ABSTRACT
 INTRODUCTION
 SYSTEM DESIGN
 REGULATORY REGION EXTRACTION
 SEQUENCE ANALYSIS
 ANALYSIS EXAMPLES
 FUTURE DEVELOPMENTS
 REFERENCES
 
RefSeq accession numbers, recognizable by the NM_ and XM_prefixes, are the primary gene identifiers employed in the modules, although the system is capable of translating other identifiers from microarray analysis tools, including the UniGene cluster ID and the IMAGE clone ID. The list of gene identifiers can be easily copied from the output files of most expression analysis tools and pasted into the input window of the extraction module. This input feature allows users the flexibility of utilizing the optimal microarray data analysis tools to identify differentially expressed genes and then efficiently moving into the extraction and analysis of their respective regulatory regions. Currently, BEARR accepts identifiers for human and mouse genes.

The sequence extraction module utilizes NCBI's LocusLink and RefSeq (9) databases to identify the genes and pinpoint their loci within the genome. These annotations were chosen for their comprehensiveness, in terms of number of annotated genes, and their consistency with the current state of the NCBI contig databases, the underlying genomic sequence database used in BEARR. Using the transcription start site (TSS), defined as the 5'-most nucleotide in the reference transcript, and the 3' terminus of the transcript as reference points, the module can extract user-defined regions both upstream and downstream of the references. Extraction speed has been enhanced by utilizing downloaded databases on the BEARR server rather than accessing the information on the NCBI servers via the internet. The resulting sequences are saved in the FASTA format and are downloadable by users. It is important to note that the TSS locations annotated in LocusLink and RefSeq may only approximate true start sites due to incomplete information at the 5' ends of some reference sequences. The large number of annotated genes in LocusLink and RefSeq, however, provides the advantage of maximizing the number of regulatory regions we are able to extract for further analysis. To better annotate human start sites, we have incorporated the information from DBTSS (10), a database of experimentally extended 5' sequences in human transcripts, to assist users in orienting the locations of predicted sites with reference to the more accurate TSS. It is also expected that as the sequence and annotation information is updated regularly in LocusLink and RefSeq, the accuracy of the current extraction method will improve.


    SEQUENCE ANALYSIS
 TOP
 ABSTRACT
 INTRODUCTION
 SYSTEM DESIGN
 REGULATORY REGION EXTRACTION
 SEQUENCE ANALYSIS
 ANALYSIS EXAMPLES
 FUTURE DEVELOPMENTS
 REFERENCES
 
Once the desired sequences are obtained by the extraction module, the sequence analysis subsystem queries the input for putative transcription factor binding sites. Although it may be possible to map all known binding sites by incorporating information from TFD (11), TRANSFAC (12) and TRRD (13), BEARR is specifically designed to efficiently detect a focused group, as determined by experimental design or the microarray data, of transcription factor binding sites. We have, however, provided links to the information in TFD and TRANSFAC to enable users to extract consensus sites or matrices of their choice for use in BEARR. Users also have the option of inputting their own binding sites derived from the latest experimental data rather than relying on the potentially dated information in the databases. Extracted sequences can be queried using single or multiple consensus binding sites, including tandem sites, or position weight matrices of binding sites [PWMs; (14)]. Stringency is adjusted by defining the number of mismatches or spacer sequences allowed in the consensus searches or the similarity score threshold in the PWM output. The PWM score signifies the goodness of the match, where higher scores correspond to better matches. An optional empirical P-value calculation is also available for users who wish to assess the statistical significance of the motifs being identified against the null hypothesis of finding the motifs in randomly generated sequences. The analysis results are displayed in a tab-delimited text file and include the detected putative binding sites, number of hits and their positions relative to either 5' end or 3' terminus. Additional columns of PWM score, empirical P-value, and DBTSS 5'-extension are also included under the appropriate settings (see Figure 2). Further information of the analysis module can be found in the Help sections of the website.



View larger version (18K):
[in this window]
[in a new window]
 
Figure 2. Sample output from BEARR when used to identify potential functional Estrogen Response Element, based on TRANSFAC ERE position weight matrix. Output of consensus pattern searches is similar, with the exception of PWM score and empirical P-value columns.

 
In some cases, the user may find it necessary to redefine the published consensus sequences or matrices for binding sites because the reported derivations may be based on outdated or incomplete information. It is also highly recommended that the users optimize the stringency thresholds, using a test set of known target genes of the transcription factors of interest, prior to the analysis of the extracted sequences. These parameters will directly impact the sensitivity and the specificity of the analysis. For further detailed and comprehensive analysis of the extracted sequences, users can upload the FASTA sequence files into other available regulatory region analysis tools [e.g. MEME (5)] described above from the links on the website.


    ANALYSIS EXAMPLES
 TOP
 ABSTRACT
 INTRODUCTION
 SYSTEM DESIGN
 REGULATORY REGION EXTRACTION
 SEQUENCE ANALYSIS
 ANALYSIS EXAMPLES
 FUTURE DEVELOPMENTS
 REFERENCES
 
BEARR was originally created to assist the discovery of potential estrogen receptor binding sites in the regulatory regions of hormone-responsive genes identified in our microarray experiments. We have subsequently used BEARR to analyse expression data from studies of other nuclear receptors and transcription factor families in order to identify target genes and downstream pathways. In general, this software is well suited for the analysis of data generated from study designs where transcription factor activity is manipulated experimentally. The software has also been applied in the analysis of the role of transcription factors identified in microarray studies of clinical samples in mediating the observed alterations in gene expression profiles of diseased tissues. In addition to the analysis of regulatory regions of genes identified in array studies, genome-wide surveys (all genes annotated in LocusLink and RefSeq) of transcription factor binding sites of interest, within defined regulatory regions, have been carried out to determine the frequency and distribution of putative binding sites. In a typical analysis flow, a user starts by formulating the underlying hypothesis, followed by designing and carrying out the appropriate experiments while at the same time studying the literature for related transcription factor binding sites to construct an initial binding site model. The experiments enable the user to group interesting genes, from which BEARR tries to identify potential functional binding sites based on the input transcription factor binding site model. Refinement of the model might also be necessary. Further biological investigations and experimentations could be performed on the binding sites found. Examples, tutorials and analysis workflows can be found in the online supplementary material on the website.


    FUTURE DEVELOPMENTS
 TOP
 ABSTRACT
 INTRODUCTION
 SYSTEM DESIGN
 REGULATORY REGION EXTRACTION
 SEQUENCE ANALYSIS
 ANALYSIS EXAMPLES
 FUTURE DEVELOPMENTS
 REFERENCES
 
As more genomes are sequenced and annotated in LocusLink and RefSeq, additional databases will be integrated into BEARR to enable users to extract and analyse regulatory regions of genes from other model organisms. These upgrades will enhance the comparative analysis of conserved transcription factor binding sites. Other areas of development include motif search algorithms that will allow users to identify conserved sequences within the regulatory regions of co-regulated genes without prior knowledge of binding sites, i.e. consensus sequences or defined matrices. These additional functions will necessitate the development of a graphical results output for optimal data visualisation and analysis.


    Notes
 
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated.


    REFERENCES
 TOP
 ABSTRACT
 INTRODUCTION
 SYSTEM DESIGN
 REGULATORY REGION EXTRACTION
 SEQUENCE ANALYSIS
 ANALYSIS EXAMPLES
 FUTURE DEVELOPMENTS
 REFERENCES
 

  1. Werner,T. ( (2001) ) The Promoter Connection. Nature Genet., , 29, , 105–106.[CrossRef][Web of Science][Medline]

  2. Aerts,S., Thijs,G., Coessens,B., Staes,M., Moreau,Y. and De Moor,B. ( (2003) ) TOUCAN: deciphering the cis-regulatory logic of coregulated genes. Nucleic Acids Res., , 31, , 1753–1764.[Abstract/Free Full Text]

  3. Zhang,H., Ramanathan,Y., Soteropoulos,P., Recce,M.L. and Toias,P.P. ( (2002) ) EZRetrieve: a web-server for batch retrieval of coordinate-specific human DNA sequences and underscoring putative transcription factor-binding sites. Nucleic Acids Res., , 30, , e121.[Abstract/Free Full Text]

  4. Wingender,E., Dietze,P., Karas,H., and Knuppel,R. ( (1996) ) TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res., , 28, , 316–319.

  5. Bailey,T.L. and Elkan,C. ( (1994) ) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, CA, pp. 28–36.

  6. Liu,X., Brutlag,D.L. and Liu,J.S. ( (2001) ) BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Proceedings of the Pacific Symposium on Biocomputing, World Scientific Publishing, Hawaii, pp. 127–138.

  7. Liu,X.S., Brutlag,D.L., and Liu,J.S. ( (2002) ) An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nature Biotech., , 20, , 835–839.[Web of Science][Medline]

  8. Bussemaker,H.J., Li,H. and Siggia,E.D. ( (2001) ) Regulatory element detection using correlation with expression. Nature Genet., , 27, , 167–171.[CrossRef][Web of Science][Medline]

  9. Pruitt,K.D. and Maglott,D.R. ( (2001) ) RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acid Res., , 29, , 137–140.[Abstract/Free Full Text]

  10. Suzuki,Y., Yamashita,R., Nakai,K., and Sugano,S. ( (2002) ) DBTSS: DataBase of human Transcriptional Start Sites and full-length cDNAs. Nucleic Acids Res., , 30, , 328–331.[Abstract/Free Full Text]

  11. Ghosh,D. ( (1993) ) Status of the transcription factors database (TFD). Nucleic Acids Res., , 21, , 3117–3118.[Abstract/Free Full Text]

  12. Wingender,E., Dietze,P., Karas,H. and Knuppel,R. ( (1996) ) TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res., , 28, , 316–319.

  13. Kolchanov,N.A., Ignatieva,E.V., Ananko,E.A., Podkolodnaya,O.A., Stepanenko,I.L., Merkulova,T.I., Pozdnyakov,M.A., Podkolodny,N.L., Naumochkin,A.N. and Romashchenko,A.G. ( (2002) ) Transcription Regulatory Regions Database (TRRD): its status in 2002. Nucleic Acids Res., , 30, , 312–317.[Abstract/Free Full Text]

  14. Stormo,G.D. ( (1990) ) Consensus patterns in DNA. Methods Enzymol., , 183, , 211–221[Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Nucleic Acids ResHome page
I. J. Donaldson and B. Gottgens
TFBScluster web server for the identification of mammalian composite regulatory elements.
Nucleic Acids Res., July 1, 2006; 34(Web Server issue): W524 - W528.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow Print PDF (236K) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Commercial Re-use Guidelines
for Open Access NAR Content
Google Scholar
Right arrow Articles by Vega, V. B.
Right arrow Articles by Lin, C.-Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Vega, V. B.
Right arrow Articles by Lin, C.-Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?