Purdue University, Purdue Libraries

Applied Big Data Workshop: Big Data Life Cycle

Activities in the Big Data life cycle


At the Planning stage, the potential volume and growth of the data make it necessary to discuss which data will be selected for preservation. Keeping all raw data may be required when experiments are too costly to reproduce. In some cases, the volume and velocity of the data preclude preserving the raw data at all. In other cases, it is cheaper and easier to rerun a simulation or a sequencer to regenerate the raw data than to preserve it.


The Acquire activity reflects how data is produced, generated, and ingested in the research process. Data may be acquired from remote sensors, from computational simulations, or by downloading from external sources such as a disciplinary repository or the Twitter API (Application Programming Interface).


Preparing datasets for analysis is a time-consuming step with Big Data, and its complexity is often overlooked. When this preparation involves reformatting, cleaning, and integrating datasets, it is often called data wrangling.
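The three wrangling steps named above (reformatting, cleaning, and integrating) can be illustrated with a minimal sketch. The station readings, column names, and site table below are hypothetical sample data invented for this illustration, and real Big Data wrangling would normally rely on distributed tooling rather than in-memory lists.

```python
import csv
import io

# Hypothetical raw inputs: two small sources with inconsistent formatting.
READINGS_CSV = """station,temp_f,date
 A1 ,68.0,2020-01-05
a1,,2020-01-06
B2,50.0,2020-01-05
"""

STATIONS_CSV = """station,site
A1,North Field
B2,River Gauge
"""


def clean_readings(text):
    """Reformat and clean: normalize IDs, drop empty measurements, convert F to C."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        station = row["station"].strip().upper()  # reformatting: consistent IDs
        if not row["temp_f"].strip():
            continue  # cleaning: skip rows with no measurement
        temp_c = round((float(row["temp_f"]) - 32) * 5 / 9, 1)
        rows.append({"station": station, "date": row["date"], "temp_c": temp_c})
    return rows


def integrate(readings, stations_text):
    """Integrate: attach the site name from the second source to each reading."""
    sites = {r["station"]: r["site"] for r in csv.DictReader(io.StringIO(stations_text))}
    return [dict(r, site=sites.get(r["station"], "unknown")) for r in readings]


wrangled = integrate(clean_readings(READINGS_CSV), STATIONS_CSV)
for row in wrangled:
    print(row)
```

Keeping each step in its own function makes it easier to document later what was done to the data and in what order.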

Examples of Big Data Life Cycle

This figure presents the Big Data life cycle from the point of view of a project.  Researchers understand the research life cycle, but often confuse it with the data curation life cycle.  This diagram looks at research from the point of view of data curation.  The Describe and Assure activities are presented outside the cycle to emphasize that they should be present at every step of the Big Data life cycle.

Activities in the Big Data life cycle


Describing the data and processes used in the analysis at every step, that is, capturing the provenance trace, is crucial for Big Data. The earlier curation-related tasks are planned in the data management life cycle, the easier they are to execute.


Documenting data sources, experimental conditions, instruments and sensors, simulation scripts, dataset processing steps, and analysis parameters and thresholds ensures not only much-needed transparency of the research, but also data discovery and future use in science. This documentation also provides a basis and a justification for decision-making.
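One possible way to capture such a description at a single processing step is a small machine-readable provenance record. The sketch below is only an illustration: the field names, the source label, and the threshold parameters are assumptions made for this example, not a standard schema.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(data: bytes, source: str, process: str, parameters: dict) -> dict:
    """Describe one step: where the data came from, what was run, and with which settings."""
    return {
        "source": source,
        "process": process,
        "parameters": parameters,
        # A checksum ties the description to the exact bytes that were processed.
        "sha256": hashlib.sha256(data).hexdigest(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical dataset and processing step, used only for illustration.
raw = b"station,temp_c\nA1,20.0\n"
record = provenance_record(
    raw,
    source="hypothetical-sensor-network",
    process="threshold_filter",
    parameters={"min_temp_c": -40, "max_temp_c": 60},
)
print(json.dumps(record, indent=2))
```

Writing one such record per step, alongside the data it describes, produces a simple provenance trace that can be read back later.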



The Analysis activity is the domain of the scientists performing research. Statistical methods, and machine learning in particular, feature prominently with Big Data. However, reproducing results requires recording and preserving the parameters of experiments, including simulation scripts, along with the entire computational environment.
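As a rough sketch of what recording the parameters and the computational environment might look like, the function below snapshots the interpreter version, the platform, selected package versions, and the analysis parameters. The parameter names and the package list are hypothetical, and a full capture of the environment (for example, a container image) goes well beyond this.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(parameters, packages=()):
    """Snapshot the analysis parameters together with the software environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": versions,
        "parameters": parameters,
    }


# Hypothetical analysis parameters, used only for illustration.
snapshot = capture_environment({"seed": 42, "threshold": 0.5}, packages=["numpy"])
print(json.dumps(snapshot, indent=2))
```

Storing this snapshot next to the results makes it possible to tell later which settings and software versions produced them.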


The Preservation activity includes the creation of pipelines or workflows that track dependencies between data and processes and allow raw data to be linked to the results in a publication. Preservation activities should aim to capture data transformations in order to address the challenges of Big Data.
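The idea of tracking dependencies between data and processes can be sketched as a small lineage table that maps each derived artifact to the process that produced it and to its inputs. All file and script names below are hypothetical; workflow systems record this kind of information in their own formats.

```python
# Each derived artifact maps to the process that produced it and its inputs.
# All names are hypothetical examples.
LINEAGE = {
    "cleaned.csv": {"process": "clean.py", "inputs": ["raw_sensor.csv"]},
    "model_fit.json": {"process": "fit.py", "inputs": ["cleaned.csv", "params.yaml"]},
    "figure_2.png": {"process": "plot.py", "inputs": ["model_fit.json"]},
}


def trace_to_origin(artifact, lineage):
    """Return the original inputs (artifacts with no recorded producer) behind an artifact."""
    if artifact not in lineage:
        return {artifact}
    origins = set()
    for parent in lineage[artifact]["inputs"]:
        origins |= trace_to_origin(parent, lineage)
    return origins


def processing_steps(artifact, lineage, seen=None):
    """List the processes behind an artifact, upstream steps first."""
    seen = set() if seen is None else seen
    steps = []
    if artifact in lineage and artifact not in seen:
        seen.add(artifact)
        for parent in lineage[artifact]["inputs"]:
            steps += processing_steps(parent, lineage, seen)
        steps.append(lineage[artifact]["process"])
    return steps


print(sorted(trace_to_origin("figure_2.png", LINEAGE)))
print(processing_steps("figure_2.png", LINEAGE))
```

Starting from a published figure, the first function recovers the original inputs it depends on, and the second lists the transformations that were applied along the way.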


The Discover activity refers to the set of procedures that ensures that datasets relevant to a particular analysis or collection can be found. At this stage, one must decide which data will be made discoverable. Integrating the results of different search methods (keyword-based, geospatial, metadata-based, and semantic) helps provide direct answers to user questions, rather than links to documents containing the information.
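The text does not say how results from different search methods should be integrated. As one illustration only, the sketch below merges ranked result lists with reciprocal rank fusion, a generic rank-merging technique chosen here as an assumption, not a method named by the source. The dataset identifiers and the result lists are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each list adds 1 / (k + rank) to a dataset's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, dataset_id in enumerate(ranking, start=1):
            scores[dataset_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from three separate search methods.
keyword_hits = ["ds-rainfall", "ds-soil", "ds-traffic"]
geospatial_hits = ["ds-soil", "ds-rainfall"]
metadata_hits = ["ds-soil", "ds-landcover"]

merged = reciprocal_rank_fusion([keyword_hits, geospatial_hits, metadata_hits])
print(merged)
```

Datasets that several methods rank highly rise to the top of the merged list, which is one simple way to combine evidence from independent searches.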