
Applied Big Data Workshop

Big Data, storage decisions, analytical techniques

What is Veracity?

Data in doubt: uncertainty arising from data inconsistency and incompleteness, ambiguities, latency, deception, and model approximations.

Veracity

How do you know that the information kept in a Big Data project is accurate? How do you program to ensure that data stays accurate? How do you prevent outsiders from introducing incorrect or inaccurate information? This concept is called veracity, and it is among the most difficult challenges posed by working with large volumes of data in the cloud.

Factors impacting Veracity

When assessing the veracity of your data, consider the following factors and the questions they raise:

  • What does it mean to be able to trust your data?
  • If data is not trusted, what are the implications for the analysis?
  • Quality:
    • what is high/low quality?
    • how do you recognize it when you can't examine the data?
    • what are the impacts of missing data, and how do you minimize them? (a missing-data audit sketch follows this list)
    • code verification: how do you know whether the scripts return what you expect? (a small verification-test sketch follows this list)
  • Provenance: 
    • where does the data come from?
    • who is producing it?
    • what processes transform the data? (a minimal provenance-record sketch follows this list)
    • how do we deal with low-quality data (and what does this mean for the project you are working on)?
  • Documentation
    • implications of poor documentation
    • what needs to be documented varies by project
    • what do you need in the documentation to be able to tell that the data is of poor quality?
  • Infrastructure
    • how does this affect the reproducibility of the experiments or the analysis?
    • what happens when a new version of the VMs is installed? (a sketch for recording the software environment follows this list)
    • how can you tell whether the algorithms or scripts are broken, or whether data quality has degraded?
  • Data cleaning
    • how do you perform data cleaning, if at all?
    • what could constitute data cleaning in your project?
    • does it change across data formats (videos, photos, text)?
    • do you use algorithms? if so, which ones?
    • is it possible or desirable to do it by hand?
    • do you have a standard pipeline for data cleaning? (a minimal pipeline sketch follows this list)
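
The sketches below make a few of these questions concrete. They are illustrative only: every file name, column name, threshold, and helper function in them is a placeholder, not part of the workshop material.

One way to approach the missing-data question is to audit how much of each field is absent before any analysis starts. A minimal sketch, assuming tabular data loaded with pandas:

# Minimal missing-data audit for tabular data (sketch; pandas assumed available).
# "measurements.csv", the column names, and the 5% threshold are illustrative only.
import pandas as pd

df = pd.read_csv("measurements.csv")

# Fraction of missing values per column, sorted worst-first.
missing = df.isna().mean().sort_values(ascending=False)
print(missing)

# Flag columns whose missingness exceeds a project-specific threshold.
THRESHOLD = 0.05
suspect = missing[missing > THRESHOLD]
if not suspect.empty:
    print(f"Columns above {THRESHOLD:.0%} missing: {list(suspect.index)}")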
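
For the code-verification question, a common tactic is to run a script on a tiny input whose correct answer you have worked out by hand and assert that the output matches. A sketch with a hypothetical daily_average() transformation:

# Sketch of a verification test: run the transformation on a tiny,
# hand-checked input and assert the result is what you expect.
# daily_average() and the numbers below are hypothetical placeholders.

def daily_average(readings):
    """Average a list of sensor readings, ignoring None (missing) values."""
    valid = [r for r in readings if r is not None]
    return sum(valid) / len(valid) if valid else None

def test_daily_average_ignores_missing():
    # 3 valid readings, 1 missing; expected mean worked out by hand: 36 / 3 = 12.
    assert daily_average([10.0, None, 14.0, 12.0]) == 12.0

def test_daily_average_all_missing():
    assert daily_average([None, None]) is None

if __name__ == "__main__":
    test_daily_average_ignores_missing()
    test_daily_average_all_missing()
    print("All checks passed.")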
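
For provenance, one lightweight option is to write a small record next to every derived dataset stating where it came from, who produced it, and which step created it. A sketch with illustrative field names and file paths:

# Sketch: attach a simple provenance record to each transformation step.
# The field names, file paths, and the example values are illustrative only.
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(source, producer, transform, notes=""):
    """Describe where a derived dataset came from and how it was made."""
    return {
        "source": source,            # original file, sensor, or upstream dataset
        "producer": producer,        # person, team, or instrument that created it
        "transform": transform,      # name of the script or step applied
        "created_at": datetime.now(timezone.utc).isoformat(),
        "notes": notes,
    }

def fingerprint(path):
    """Hash the input file so later readers can check it was not swapped out."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()

# Usage (paths and values are placeholders):
# record = provenance_record("raw_readings.csv", "field team A", "clean_v1",
#                            notes="dropped duplicates and impossible values")
# record["source_sha256"] = fingerprint("raw_readings.csv")
# with open("raw_readings.provenance.json", "w") as fh:
#     json.dump(record, fh, indent=2)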
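
For the infrastructure questions, a small defence against silent VM or library upgrades is to record the software environment alongside each analysis run, so results can be tied to a specific configuration. A sketch that writes Python and package versions to a JSON file (the package list and file name are illustrative):

# Record the software environment alongside an analysis run (sketch).
# The package list and output file name are illustrative choices.
import json
import platform
import sys
from importlib import metadata

PACKAGES = ["numpy", "pandas"]  # whatever your pipeline actually depends on

env = {
    "python": sys.version,
    "platform": platform.platform(),
    "packages": {},
}
for pkg in PACKAGES:
    try:
        env["packages"][pkg] = metadata.version(pkg)
    except metadata.PackageNotFoundError:
        env["packages"][pkg] = "not installed"

with open("run_environment.json", "w") as fh:
    json.dump(env, fh, indent=2)
print(json.dumps(env, indent=2))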
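
For data cleaning, it often helps to express the steps as an explicit, ordered pipeline rather than ad hoc edits, so they can be documented, reviewed, and re-run. A minimal sketch for tabular data, with placeholder column names and rules:

# Minimal, explicit cleaning pipeline for tabular data (sketch).
# Column names ("temperature", "station_id") and the rules are placeholders.
import pandas as pd

def drop_exact_duplicates(df):
    return df.drop_duplicates()

def drop_rows_missing_key_fields(df):
    # Rows without a station_id cannot be linked to anything downstream.
    return df.dropna(subset=["station_id"])

def remove_physically_impossible_values(df):
    # Example rule: temperatures outside a plausible range are treated as errors.
    return df[df["temperature"].between(-90, 60)]

CLEANING_STEPS = [
    drop_exact_duplicates,
    drop_rows_missing_key_fields,
    remove_physically_impossible_values,
]

def clean(df):
    """Apply each cleaning step in order, logging how many rows it removes."""
    for step in CLEANING_STEPS:
        before = len(df)
        df = step(df)
        print(f"{step.__name__}: removed {before - len(df)} rows")
    return df

# Usage: cleaned = clean(pd.read_csv("raw_readings.csv"))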