Purdue University, Purdue Libraries

R for Molecular Biosciences: Databases

This guide was created to support R for Molecular Biosciences, an introductory undergraduate course in data science.

The National Center for Biotechnology Information


The National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources, including GenBank, Entrez, MyNCBI, PubMed, BLAST, Electronic PCR, and Cancer Chromosomes, among many others. The links given here go through the Purdue Libraries proxy so that you can access articles from off campus with your Purdue ID.

PubMed is the NCBI interface to MEDLINE, a database of over 20 million journal articles.

The Gene Expression Omnibus (GEO) is a public functional genomics data repository containing both raw and processed microarray and sequencing data. GEO provides tools to search, analyze and acquire these data. GEO can be searched either by experiment (dataset) or by gene (GEO profiles).

The Sequence Read Archive (SRA) is a repository for raw sequencing data from a variety of platforms. Data deposit and acquisition require special tools provided by NCBI.

The Gene database integrates data for many species.  This is a great place to start to find structural and functional information about a gene and the encoded protein or RNA.

OMIM (Online Mendelian Inheritance in Man) is a database that links genes to phenotypes or diseases. 
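
Several of the NCBI databases above can also be searched programmatically through NCBI's Entrez E-utilities. The following is a minimal sketch that only builds an ESearch request URL and does not contact NCBI. The helper name `build_esearch_url` and the example search term are illustrative assumptions; `pubmed`, `gene`, `gds` (GEO DataSets), `sra` and `omim` are Entrez database names.

```python
from urllib.parse import urlencode

# Base address of the ESearch endpoint of NCBI's Entrez E-utilities.
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Return an ESearch URL for one Entrez database (hypothetical helper).

    db     -- Entrez database name, e.g. "pubmed", "gene", "gds", "sra", "omim"
    term   -- search term written in Entrez query syntax
    retmax -- maximum number of record identifiers to return
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    # urlencode percent-encodes brackets and turns spaces into "+".
    return ESEARCH_BASE + "?" + urlencode(params)


# Example: look up the human TP53 record in the Gene database.
url = build_esearch_url(
    "gene", "TP53[Gene Name] AND Homo sapiens[Organism]", retmax=5
)
print(url)
```

Opening the printed URL in a browser returns the matching record identifiers. Scripts that send many requests should follow NCBI's usage guidelines for the E-utilities.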

Gene Ontology Consortium

The Gene Ontology (GO) project is a collaborative bioinformatics project with the goal of providing complete and consistent descriptions of gene products across all organisms. The descriptions fall into three large categories (ontologies): Biological Process, Cellular Component and Molecular Function. Accordingly, each gene product is likely to have at least three descriptions. However, some gene products have no descriptions whereas others can have a dozen or more.

This database is a great resource to help you understand all aspects of a gene's function in the context of a normal organism. You can also use the GO database to identify genes of similar function. Finally, a GO enrichment analysis is standard practice for anyone who performs a differential expression analysis because it can help you identify groups of genes with similar function within your results.
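
One common way to test a single GO term for enrichment is a one-sided hypergeometric test: given a background of N genes, K of which carry the term, and a list of n differentially expressed genes, k of which carry the term, it asks how likely it is to see k or more annotated genes by chance. The sketch below illustrates that calculation only. The function name and all gene counts are hypothetical, and dedicated enrichment tools additionally correct for testing many terms at once.

```python
from math import comb


def enrichment_p_value(N, K, n, k):
    """One-sided hypergeometric p-value P(X >= k) (illustrative helper).

    N -- number of genes in the background set
    K -- background genes annotated with the GO term
    n -- genes in the differentially expressed list
    k -- differentially expressed genes annotated with the GO term
    """
    total = comb(N, n)  # number of ways to draw n genes from the background
    upper = min(K, n)   # the overlap cannot exceed either set size
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / total


# Hypothetical counts: 20000 background genes, 150 carry the term,
# 400 genes are differentially expressed, 12 of those carry the term.
p = enrichment_p_value(N=20000, K=150, n=400, k=12)
print(f"p-value for enrichment: {p:.3g}")
```

A small p-value suggests that the term is over-represented among the differentially expressed genes relative to the background.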

This database will be important for two projects in class. 

Cell Miner

Cell Miner is a web application and database developed by the National Cancer Institute to provide access to the NCI-60 cell line data. This panel of 60 cell lines was chosen to represent specific cancers. For more than 20 years, the Developmental Therapeutics Program has screened thousands of drugs and potential drugs against these cell lines to help identify possible therapeutic agents to treat cancer. The Cell Miner database contains drug sensitivity data, gene expression data and genomic data for these cell lines.

This database will be used for one class exercise. 

European Bioinformatics Institute


According to the European Bioinformatics Institute, "EMBL-EBI provides freely available data from life science experiments, performs basic research in computational biology and offers an extensive user training programme, supporting researchers in academia and industry."


Ensembl is a database of vertebrate and other select eukaryotic genomes. Its content is similar to that of the NCBI databases, but Ensembl has a more modern interface.

The ArrayExpress archive is a repository for functional genomics data, including microarray and sequencing data. It is similar to GEO, and some data are available on both sites.

University of California Santa Cruz Genomics Institute

The UCSC Genome Bioinformatics site contains the reference sequence and working draft assemblies for a large collection of genomes.

The UCSC Genome Browser provides access to a wealth of functional human genomics data. This tool allows researchers to browse through the human genome, viewing a wide range of data types. Researchers can also upload their own data tracks or download data from UCSC. There are free OpenHelix tutorials for the UCSC Genome Browser.

The UCSC Table Browser is a tool to find and download functional genomics data for a wide range of genomes. This is a complex tool that requires some training and experience to use effectively. There is a good user's guide and an OpenHelix tutorial.

BioMart


"The BioMart project provides free software and data services to the international scientific community in order to foster scientific collaboration and facilitate the scientific discovery process. The project adheres to the open source philosophy that promotes collaboration and code reuse."

Multiple databases use BioMart software, and one of the most generally useful is available at Ensembl.
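
BioMart servers accept queries written as a small XML document that names a dataset, optional filters, and the attributes (columns) to return. The following is a minimal sketch that builds such a query for the Ensembl BioMart and the matching request URL without contacting the server. The helper name `build_biomart_query` and the chosen filter value are illustrative assumptions; `hsapiens_gene_ensembl`, `chromosome_name`, `ensembl_gene_id` and `external_gene_name` are names used by the Ensembl human gene dataset.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Address of the Ensembl BioMart query service.
MARTSERVICE_URL = "https://www.ensembl.org/biomart/martservice"


def build_biomart_query(dataset, attributes, filters=None):
    """Return a BioMart XML query as a string (hypothetical helper).

    dataset    -- BioMart dataset name, e.g. "hsapiens_gene_ensembl"
    attributes -- list of attribute (column) names to return
    filters    -- optional dict mapping filter names to values
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",   # tab-separated output
        "header": "1",        # include a header row
        "uniqueRows": "1",    # drop duplicate rows
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in (filters or {}).items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Example: gene IDs and gene names for human chromosome 21.
query_xml = build_biomart_query(
    "hsapiens_gene_ensembl",
    ["ensembl_gene_id", "external_gene_name"],
    filters={"chromosome_name": "21"},
)
request_url = MARTSERVICE_URL + "?" + urlencode({"query": query_xml})
print(query_xml)
```

The same query can be assembled interactively on the Ensembl BioMart web page, which is usually the easier starting point for exploring the available filters and attributes.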

InterMine Databases


InterMine is open source software designed specifically for the creation of complex biological databases. InterMine also provides tools to query these databases. Like BioMart, InterMine has been adopted by multiple research communities. Generally, a research community will adopt either BioMart or InterMine; for example, Gramene maintains a BioMart of biological data for plants, whereas Araport provides ThaleMine, an InterMine database for Arabidopsis research.

Protein Databases


The Worldwide Protein Data Bank (wwPDB) is a global collaboration that ensures the PDB archive of 3D structural data for proteins and nucleic acids is uniform and available globally. The RCSB PDB website is maintained by the Research Collaboratory for Structural Bioinformatics, located in the United States. This website provides resources to access and analyze 3D structural data.

Plant-related Databases

“Araport is a one-stop-shop for Arabidopsis thaliana genomics. Araport offers gene and protein reports with orthology, expression, interactions and the latest annotation, plus analysis tools, community apps, and web services. Araport is 100% free and open-source. Registered members can save their analysis, publish science apps, and post announcements.”

 SoyBase, the USDA-ARS soybean genetic database, is a comprehensive repository for professionally curated genetics, genomics and related data resources for soybean.  

SALAD is a motif-based database of protein annotations for plant comparative genomics. It contains information on the proteome data sets of rice, sorghum, Arabidopsis thaliana, grape, a lycophyte, a moss, algae, and yeast.

 The Plant Transcription Factor Database (PlnTFDB) provides putatively complete sets of transcription factors (TFs) and other transcriptional regulators  in plant species whose genomes have been completely sequenced and annotated.

The Plant microRNA Database (PMRD) integrates plant miRNA data deposited in public databases, collected from the literature, and generated in-house.

NIH Data Sharing Repositories

Data sharing is a critical requirement of many funding agencies and journals. The National Institutes of Health has an ever-evolving list of Data Sharing Repositories that researchers can use to satisfy this requirement.

Nucleic Acids Research Annual Special Issues for Bioinformatics

Nucleic Acids Research publishes annual issues on Databases and Web Servers for biological sciences.  Nucleic Acids Research also maintains other Special Collections.