What is “good” data?
- Defined consistently (definition of labels y is unambiguous)
- Covers the important cases (good coverage of inputs x)
- Has timely feedback from production data (the distribution reflects data drift and concept drift)
- Sized appropriately
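Two of the properties above, consistent labels and coverage of important cases, can be spot-checked mechanically. The following is a minimal sketch, not a standard tool: the function names, the `(x, y)` pair format, and the `min_count` threshold are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def find_inconsistent_labels(examples):
    """Return inputs x that received more than one distinct label y.

    `examples` is assumed to be an iterable of (x, y) pairs, where the
    same x may appear several times (e.g. labeled by different people).
    """
    labels_by_input = defaultdict(set)
    for x, y in examples:
        labels_by_input[x].add(y)
    # Keep only inputs whose labels disagree.
    return {x: sorted(ys) for x, ys in labels_by_input.items() if len(ys) > 1}


def find_undercovered_cases(cases, required_cases, min_count):
    """Return required cases that appear fewer than `min_count` times.

    `cases` is assumed to hold one case tag per example (hypothetical
    tags such as "quiet" or "car_noise"); `min_count` is an arbitrary
    threshold chosen by the team, not a recommended value.
    """
    counts = Counter(cases)  # missing cases count as 0
    return {c: counts[c] for c in required_cases if counts[c] < min_count}
```

For example, `find_inconsistent_labels([("a", "cat"), ("a", "dog"), ("b", "cat")])` flags input `"a"` because it carries two different labels, which suggests the labeling definition needs tightening.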
What kind of problem are we trying to solve?
What data sources already exist?
What privacy concerns are there?
Is the data public?
Where should we store the data?