
Using Scholarly APIs for Text Data Mining

 

The sheer volume of information available on the internet can make it difficult to quickly gather the data we're really looking for. Luckily, a number of scholarly publishers and vendors offer APIs that let users with basic programming skills retrieve large volumes of information in a more usable format.

 

What is an API?

API stands for application programming interface. An API is a set of rules that allows a user to interface with and request data from a third-party application. Scholarly APIs offer valuable "back end" access to data that might not be easy, or even possible, to gather from the normal user interface. APIs are particularly useful when you want to extract a large amount of data programmatically.
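As a rough illustration of what "requesting data" from an API looks like, the sketch below builds a search URL for the CrossRef REST API and reads titles out of a JSON response. To keep it runnable offline, it parses a hand-written sample response rather than contacting the service; the sample DOIs and titles are made up, and the helper names `build_query_url` and `extract_titles` are hypothetical, chosen only for this example.

```python
import json
from urllib.parse import urlencode

# Public CrossRef REST endpoint for searching works.
BASE_URL = "https://api.crossref.org/works"


def build_query_url(query, rows=5):
    """Return the URL a client would request for a keyword search."""
    return BASE_URL + "?" + urlencode({"query": query, "rows": rows})


def extract_titles(response_text):
    """Pull (DOI, title) pairs out of a CrossRef-style JSON response."""
    payload = json.loads(response_text)
    items = payload.get("message", {}).get("items", [])
    results = []
    for item in items:
        # CrossRef returns titles as a list; it may be empty or missing.
        titles = item.get("title") or ["(untitled)"]
        results.append((item.get("DOI"), titles[0]))
    return results


# Hand-written sample response (made-up records) so the sketch
# runs without network access.
SAMPLE_RESPONSE = json.dumps({
    "status": "ok",
    "message": {
        "items": [
            {"DOI": "10.1000/example.1", "title": ["A Hypothetical Article"]},
            {"DOI": "10.1000/example.2", "title": []},
        ]
    },
})

if __name__ == "__main__":
    print(build_query_url("text mining"))
    for doi, title in extract_titles(SAMPLE_RESPONSE):
        print(doi, "-", title)
```

In practice the URL would be fetched with an HTTP client and the response body passed to `extract_titles`; other APIs return different formats (Atom or XML, for example) and need a different parser.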

Interested in using scholarly APIs, but not sure how to get started? Check out this how-to guide for instructions on how to begin, even if you don't have a background in programming.

 

 

What is text data mining?

Text data mining refers to the process of extracting concise, useful information from large volumes of text. Because of the sheer number of articles available on the internet, we rely on computers to extract relevant information from their contents.
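One of the simplest forms of text mining is counting which terms appear most often across a set of documents. The sketch below is a minimal example of that idea, not a full text mining workflow: the sample abstracts are invented, the stop-word list is deliberately tiny, and the function names are hypothetical.

```python
import re
from collections import Counter

# Tiny, illustrative stop-word list; real projects use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "for", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(documents, n=3):
    """Count non-stop-word tokens across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in STOP_WORDS)
    return counts.most_common(n)


# Invented sample text standing in for abstracts returned by an API.
SAMPLE_ABSTRACTS = [
    "Mining of text data in the library.",
    "Text mining with an API.",
    "The API returns text.",
]

if __name__ == "__main__":
    for term, count in top_terms(SAMPLE_ABSTRACTS):
        print(term, count)
```

The same pattern scales to thousands of abstracts or full-text articles once they have been retrieved through a publisher's API.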

Often we rely on APIs in order to extract the data that we want to mine. Because of the licensing agreements Purdue has with various publishers, using text scrapers or crawlers is typically prohibited, and users must employ the publisher's API in order to access this information.

 

Acquisitions

 

Are you interested in text mining a resource that isn't on this list? Let us know, and we'll find out what options might be available for you.

 

List of APIs for Purdue Libraries Resources

Resource | Description | Access | Result Format | Limitations | Registration | Contact
arXiv | Provides access to metadata and article abstracts for the e-prints hosted on arXiv.org. | REST | Atom | none | none required | arXiv help
BioMed Central | Provides access to both metadata and full-text content for the open access journals published by BioMed Central. | REST | XML, JSON | none | key required | info@biomedcentral.com
CrossRef | Provides access to metadata records with CrossRef DOIs, covering about 75 million scholarly works from around 5,000 publishers. | REST | JSON | none | none | labs@crossref.org
Digital Public Library of America | Provides metadata on items and collections indexed by the DPLA. Also includes partner data from Harvard, New York Public Library, ARTstor, and others. | REST | JSON-LD | none | key required | codex@dp.la
HathiTrust (Bibliographic API) | Provides bibliographic and rights information for items in the HathiTrust Digital Library. Please note that this API is not intended for bulk retrieval of records. | REST | MARC-XML, JSON | no specific limits; intended only for small numbers of items | none | feedback@issues.hathitrust.org
HathiTrust (Data API) | Provides access to HathiTrust and Google digitized texts of public domain works. | REST, rsync | XML, JSON | no specific limits; see their policies on data use | key required | feedback@issues.hathitrust.org
IEEE Xplore | Provides metadata for IEEE Xplore articles. | REST | XML | max 200 results per query | must subscribe to, or be a member of an institution that subscribes to, IEEE Xplore | onlinesupport@ieee.org
JSTOR | Not a true API, but provides access to the full text of documents available on JSTOR for computational purposes. | Web Interface | CSV | max 1,000 articles per dataset; users requiring more may contact JSTOR directly | on-site registration required | Data for Research help
Nature Blogs | Provides access to metadata and indexing information for Nature Blogs posts, as well as full-text access to a number of news stories. | REST | JSON, Atom | 2 calls per second; 5,000 calls per day; Atom requests limited to 100 items | none | developers@nature.com
Nature OpenSearch | Provides access to bibliographic data for content hosted on Nature.com. | REST | SRU, JSON, Atom, Turtle | 2 calls per second; 5,000 calls per day | none | developers@nature.com
National Library of Medicine | NLM offers 29 separate APIs for accessing a wide variety of content from various NLM databases. | varies | varies | varies | varies | varies
OECD | Provides access to a selection of OECD datasets. | REST | SDMX-JSON | max 1,000,000 results per query; max URL length of 1,000 characters | none | OECD.Stats help
PubMed | Provides access to the information stored in 38 NCBI databases. | REST | XML | max 3 requests per second | none required, but registration is recommended | eutilities@ncbi.nlm.nih.gov
ScienceDirect and Scopus | Provides access to full-text content from ScienceDirect and Scopus, as well as seven other APIs with various functionalities. | REST | XML | none | key required | Elsevier Developers help
Springer | Provides access to full-text open access Springer content, as well as metadata from other Springer resources. | REST | XML, JSON | none | key required | support.api@springer.com
UN Comtrade | Provides access to data and metadata from the UN Comtrade database. | REST, SOAP | XML, CSV | none | some access is IP-authenticated; IP address must be associated with a subscribing institution | comtrade@un.org
Web of Science | Provides access to metadata and record information within the Web of Science Core Collection. | SOAP | XML | depends on host institution's subscriptions; Purdue is entitled to the Basic API and restricted to Purdue-authored publications | must subscribe or be associated with an institution with a subscription | Web of Science help
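
Several of the APIs above limit how quickly you may send requests (for example, 3 requests per second for PubMed and 2 calls per second for the Nature APIs). The sketch below shows one simple way a script might stay under such a limit by spacing calls at least 1/n seconds apart. It is an illustration under stated assumptions rather than any provider's recommended client: the `Throttle` class and `fetch_all` helper are hypothetical names, and `fetch` stands in for whatever function actually requests a record.

```python
import time


class Throttle:
    """Space calls at least 1 / max_per_second seconds apart."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def fetch_all(identifiers, fetch, throttle):
    """Call fetch(identifier) for each identifier, pausing between calls."""
    results = []
    for identifier in identifiers:
        throttle.wait()
        results.append(fetch(identifier))
    return results


if __name__ == "__main__":
    # str.upper is a stand-in for a real request function.
    throttle = Throttle(max_per_second=3)
    print(fetch_all(["a", "b", "c"], str.upper, throttle))
```

Daily quotas (such as 5,000 calls per day) and per-query result caps need separate handling, for example by counting calls and paging through results.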

 

The information in this table was partially adapted from MIT's APIs for Scholarly Resources LibGuide by Mark Clemente.