Using Scholarly APIs for Text Data Mining

⚠️ NOTE: OFF-CAMPUS API ACCESS

A number of scholarly APIs require an on-campus IP address to authenticate. For more information about how to access scholarly APIs from off-campus, please visit our FAQ page.

The sheer volume of information available online can make it difficult to gather the data you actually need. Fortunately, many scholarly publishers and vendors offer APIs that let users with basic programming skills retrieve large volumes of information in a structured, more usable format.

What is an API?

API stands for application programming interface. An API is a set of rules that allows a user to interface with and request data from a third-party application. Scholarly APIs offer valuable "back end" access to data that might not be easy, or even possible, to gather from the normal user interface. APIs are particularly useful when you want to extract a large amount of data programmatically.
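
To make the idea of requesting data from an API concrete, here is a minimal Python sketch that builds an HTTP GET request URL for CrossRef, one of the services listed on this page. The helper name `build_crossref_query` is hypothetical, and the `query` and `rows` parameters come from CrossRef's public REST API rather than from this guide, so check the provider's documentation before relying on them.

```python
from urllib.parse import urlencode

# Public CrossRef REST endpoint for searching works.
CROSSREF_WORKS_URL = "https://api.crossref.org/works"


def build_crossref_query(search_terms, rows=5):
    """Return a CrossRef search URL for the given terms (hypothetical helper)."""
    # urlencode escapes spaces and special characters so the URL is valid.
    params = {"query": search_terms, "rows": rows}
    return CROSSREF_WORKS_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_crossref_query("text data mining"))
```

Opening the resulting URL in a browser, or fetching it with a library such as `urllib.request`, should return JSON metadata for matching works.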

Interested in using scholarly APIs, but not sure how to get started? Check out this how-to guide for instructions on how to begin, even if you don't have a background in programming.

What is text data mining?

Text data mining refers to the process of extracting concise, useful information from large volumes of text. Because so many articles are available online, far more than anyone could read, we rely on computers to extract relevant information from their contents.

Often we rely on APIs in order to extract the data that we want to mine. Because of the licensing agreements Purdue has with various publishers, using text scrapers or crawlers is typically prohibited, and users must employ the publisher's API in order to access this information.
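
As a toy illustration of text data mining, the Python sketch below counts how often each term appears in a handful of abstracts. The sample sentences, the stop-word list, and the function name `term_counts` are all made up for this example. In practice the text would come from a publisher's API, and the analysis would usually be more sophisticated.

```python
import re
from collections import Counter

# Made-up sample text standing in for abstracts retrieved through an API.
SAMPLE_ABSTRACTS = [
    "Text mining extracts patterns from large text collections.",
    "APIs give programmatic access to text and metadata.",
]

# A tiny, illustrative stop-word list; real projects use fuller lists.
STOP_WORDS = {"a", "and", "from", "give", "the", "to"}


def term_counts(documents):
    """Count how often each non-stop-word appears across all documents."""
    counts = Counter()
    for doc in documents:
        # Lowercase the text and keep only runs of letters as words.
        words = re.findall(r"[a-z]+", doc.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


if __name__ == "__main__":
    for term, count in term_counts(SAMPLE_ABSTRACTS).most_common(3):
        print(term, count)
```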

List of APIs for Purdue Libraries Resources

Resource	Description	Access	Result Format	Registration	Terms of Use / Limits	Contact
arXiv	Provides access to metadata and article abstracts for the e-prints hosted on arXiv.org.	HTTP GET	Atom 1.0	none	Terms of Use	arXiv API Google Group
Cambridge Structural Database System (WebCSD)	Provides access to CSD data.	Python module	Python objects	some access is restricted to subscribing institutions; contact us for license key if necessary	Conditions of Use	support@ccdc.cam.ac.uk
CrossRef	Provides access to metadata records with CrossRef DOIs, covering about 75 million scholarly works from around 5000 publishers.	HTTP GET	JSON	none	License and Etiquette	CrossRef support
Digital Public Library of America	Provides metadata on items and collections indexed by the DPLA. Also includes partner data from Harvard, New York Public Library, ARTstor, and others.	HTTP GET	JSON-LD	key required	no specific limitations, however they reserve the right to limit or block disruptive use	codex@dp.la
HathiTrust (Bibliographic API)	Provides bibliographic and rights information for items in the HathiTrust Digital Library. Please note that this API is not intended for bulk-retrieval of records.	HTTP GET	MARC-XML, JSON	none	Acceptable Use Policy	feedback@issues.hathitrust.org
HathiTrust (Data API)	Provides access to HathiTrust and Google digitized texts of public domain works.	rsync	XML, JSON	key required	Acceptable Use Policy	feedback@issues.hathitrust.org
IEEE Xplore	Provides metadata for IEEE Xplore articles.	HTTP GET	XML	key required	200 results per query	onlinesupport@ieee.org
JSTOR	Not a true API, but provides access to full text of documents available on JSTOR for computational purposes.	Web Interface	CSV	on-site registration required	25,000 documents per dataset; users requiring more may contact JSTOR directly	support@jstor.com
National Library of Medicine	NLM offers 29 separate APIs for accessing a wide variety of content from various NLM databases.	varies	varies	varies	varies	varies
NYTimes	Provides access to New York Times data. Their APIs include Archive, Article Search, Books, Most Popular, Semantic, Times Newswire, TimesTags, and Top Stories.	HTTP GET	JSON	key required	Terms of Use	code@nytimes.com
OECD	Provides access to a selection of OECD datasets.	HTTP GET	SDMX-JSON	none	1,000,000 results per query; URL length of 1,000 characters	OECD.Stats help
PubMed	Provides access to the information stored in 38 NCBI databases.	HTTP GET	XML	none required, but registration is recommended	3 requests per second	eutilities@ncbi.nlm.nih.gov
ScienceDirect and Scopus	Provides access to full-text content from ScienceDirect and Scopus, as well as seven other APIs with various functionalities.	varies	varies	key required	no specific limits, however please see their policies on data use	integrationsupport@elsevier.com
Springer Nature Metadata API	Provides access to metadata for online Springer Nature documents.	HTTP GET	JSON, Prism Aggregate (PAM)	register through the Springer Nature developer portal	Springer Nature TDM Policy	tdm@springernature.com
Springer Nature Open Access API	Provides access to both metadata and full-text content for open access Springer Nature documents.	HTTP GET	JSON, Prism Aggregate (PAM)	register through the Springer Nature developer portal	Springer Nature TDM Policy	tdm@springernature.com
Ulrichsweb	Provides access to the Ulrichsweb Global Serials Directory.	HTTP GET	XML, JSON	contact us for API key	Terms of Use
UN Comtrade	Provides access to data and metadata from the UN Comtrade database.	HTTP GET	SDMX	some access is IP-authenticated; IP address must be associated with a subscribing institution	1 query per second; 100 queries per hour; 100,000 records per query	comtrade@un.org
Web of Science	Provides access to metadata and record information within the Web of Science Core Collection.	SOAP	XML	must subscribe or be associated with an institution with a subscription	depends on host institution's subscriptions	Web of Science support

The information in this table was partially adapted from MIT's APIs for Scholarly Resources LibGuide by Mark Clemente.
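
Several rows in the table list request limits, such as 3 requests per second for PubMed. One simple way to stay within such a limit is to pause between calls. The Python sketch below is an illustrative helper, not part of any vendor's library. The class name `Throttle` and its parameters are assumptions made for this example.

```python
import time


class Throttle:
    """Space calls at least 1 / max_per_second seconds apart."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        # clock and sleep are injectable so the class can be tested
        # without actually waiting.
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


if __name__ == "__main__":
    throttle = Throttle(max_per_second=3)
    for request_number in range(5):
        throttle.wait()  # call this right before each API request
        print("sending request", request_number)
```

Calling `wait()` immediately before each request keeps a script's pace at or below the chosen rate. Limits and terms differ by provider, so confirm the current values in each provider's documentation.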