
Applied Big Data Workshop: Home

Big Data

Big Data is data that is too large, accumulates too quickly, has too much variety, or is subject to too many errors for standard analytical techniques, and therefore requires new forms of technology or analysis to make meaning from the data set. Big Data has also been defined as whatever data is too big for you to handle, which illustrates the relative nature of the term.

Big Data Resources

Definitions

Big Data has often been defined using 3 or 4 Vs (or more).  This practice originated with one of the earliest definitions of Big Data, written in 2001 by Doug Laney, now a Gartner analyst.  The Vs include volume, velocity, variety, and veracity.

Volume, or the size of the data, has been a thorny issue from the start.  The question of how big is big has a moving target for an answer: a consensus has emerged that volume characterizes data that exceeds the capacity of what can be stored with conventional means, and what seems big today will be small tomorrow[1].
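
As a minimal sketch of how volume is handled in practice, the Python example below computes a mean over a CSV file too large to load into memory by reading it in chunks with pandas; the file name big_measurements.csv and the column value are hypothetical placeholders, not data referenced by this guide.

    import pandas as pd

    # Hypothetical file assumed to be too large to load into memory at once.
    CSV_PATH = "big_measurements.csv"

    total, rows = 0.0, 0
    # read_csv with chunksize returns an iterator of DataFrames,
    # so only one chunk is held in memory at a time.
    for chunk in pd.read_csv(CSV_PATH, chunksize=100_000):
        total += chunk["value"].sum()
        rows += len(chunk)

    print(f"mean of 'value' over {rows:,} rows: {total / rows:.3f}")

The same pattern scales down gracefully: when the file does fit in memory, the loop simply runs once.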

Variety, or complexity, features prominently among the characteristics of Big Data.  Big Data often refers to vast amounts of unstructured data such as tweets, videos, and images (medical images, for instance).  In 2012, it was estimated that 85% of all data is unstructured and generated by humans[2].  Variety of formats and sources underlies the complexity of the analysis, as data from numerous heterogeneous sources must be processed and adequately integrated prior to analysis, especially in commercial enterprises.
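
To illustrate what integrating heterogeneous sources can look like, the sketch below joins a structured CSV table with features derived from an unstructured JSON-lines file of free-text notes; the file names, columns, and key (visits.csv, notes.jsonl, patient_id) are hypothetical.

    import json
    import pandas as pd

    # Hypothetical structured source: one row per patient visit.
    visits = pd.read_csv("visits.csv")          # e.g. patient_id, visit_date, diagnosis

    # Hypothetical unstructured source: free-text notes, one JSON object per line.
    notes = []
    with open("notes.jsonl", encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            # Derive simple structured features from the raw text.
            notes.append({"patient_id": record["patient_id"],
                          "note_words": len(record["text"].split())})
    notes = pd.DataFrame(notes)

    # Integrate the two sources on a shared key before any analysis.
    combined = visits.merge(notes, on="patient_id", how="left")
    print(combined.head())

In real pipelines this integration step (schema mapping, de-duplication, entity resolution) is often the most labor-intensive part of working with varied data.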

Velocity, the speed at which data accumulates, including its rate of change, presents challenges for storage, access, and analysis.  Speed creates flows of data that must be managed, organized, and analyzed within timeframes that redefine real time.  Thus, with Big Data, it is often more advantageous to move computing power and processing algorithms to the data than to bring the data to the computing nodes.
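
A small sketch of velocity-oriented processing follows: rather than storing the stream and analyzing it later, a running summary is updated as each reading arrives, so only a few numbers ever need to be kept.  The sensor_stream generator is a stand-in for a real data feed, not part of the cited definitions.

    import random

    def sensor_stream(n_readings):
        """Stand-in for a high-velocity feed of sensor readings."""
        for _ in range(n_readings):
            yield random.gauss(20.0, 2.0)

    # Incremental (streaming) mean: the computation stays with the data,
    # and the raw stream is never stored in full.
    count, mean = 0, 0.0
    for reading in sensor_stream(1_000_000):
        count += 1
        mean += (reading - mean) / count

    print(f"processed {count:,} readings, running mean = {mean:.2f}")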

The three Vs of Big Data were expanded by IBM and others with the concept of Veracity.  Veracity refers to the quality of Big Data, understood in terms of accuracy (reliable methods of data acquisition), completeness (are there duplicates or missing data?), consistency (are measurements and unit conversions accurate?), uncertainty about its sources, and model approximations[3].  Big Data can also be so full of errors, or noise, that its analysis becomes meaningless.  In addition, Big Data may be prone to overfitting, a case in which the learning algorithms used to analyze existing data are not robust to noise and lead to inaccurate predictions.  For Big Data to yield the insights it is expected to, users must be able to trust the data and its transformations.
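
As a sketch of basic veracity checks, the example below screens a small, invented dataset for duplicates, missing values, and inconsistent units before analysis; the columns and values are hypothetical.

    import pandas as pd

    # Hypothetical readings with a duplicate row, a missing value,
    # and temperatures recorded in mixed units.
    df = pd.DataFrame({
        "sensor_id": [1, 1, 2, 3, 3],
        "temp":      [21.5, 21.5, 70.2, None, 22.1],
        "unit":      ["C", "C", "F", "C", "C"],
    })

    # Completeness: how many duplicates and missing values are there?
    print("duplicate rows:", df.duplicated().sum())
    print("missing temps: ", df["temp"].isna().sum())

    # Consistency: convert Fahrenheit readings to Celsius.
    fahrenheit = df["unit"] == "F"
    df.loc[fahrenheit, "temp"] = (df.loc[fahrenheit, "temp"] - 32) * 5 / 9
    df["unit"] = "C"

    # Keep only records that survive the checks.
    clean = df.drop_duplicates().dropna(subset=["temp"])
    print(clean)

Checks like these do not guarantee trustworthy data, but they make the remaining uncertainty visible before any models are fit.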



[1] Ward, J. S., & Barker, A. (2013). Undefined By Data: A Survey of Big Data Definitions. arXiv preprint arXiv:1309.5821.

[2] Mills, S., Lucas, S., Irakliotis, L., Rappa, M., Carlson, T., & Perlowitz, B. (2012). Demystifying Big Data: A Practical Guide to Transforming the Business of Government. TechAmerica Foundation.

[3] Lukoianova, T., & Rubin, V. L. (2014). Veracity Roadmap: Is Big Data Objective, Truthful and Credible? Advances in Classification Research Online, 24(1), 4-15.


Your Data Specialist

School of Information Studies Professor

Megan Sapp Nelson
Contact:
3053 WALC
By Appointment

Make an appointment at https://sappnelson.youcanbook.me
765-494-2871
Skype: sappnelson
