The Imperative of Data Quality
Working with data means running daily processes that lead to important decisions. Using bad data or bad models will lead to bad decisions, which may prove catastrophic for an individual (e.g. a refused loan). It is therefore important to understand the sources of potential errors, in order to remove them.
Even in marketing, where the consequences of bad data seem less catastrophic, you are at risk of annoying your consumers, and of giving the impression that you don’t care about them.
To prevent this, keep in mind the three main requirements for data validity:
- The sample you use must be representative
- The attributes you work with must be (ethically) relevant
- Errors in data processing must be avoided
As these risks usually appear during the targeting & analysis phases of our marketing campaigns, we need to be particularly attentive to those moments.
Choosing a representative sample
When processing data, our conclusions are limited by the data that we have. We call this the Drunk Search effect (named after the drunk person who searches for his lost wallet at night under the streetlight, because that is the only place with light… even if he lost his wallet elsewhere!). This phenomenon leads to sampling biases which, over time, reinforce themselves:
- I’m only sending my ad to women
- Because 90% of the people in my database are women
- Because we only mailed women in the past…
You should always keep in mind the difference between the data that you have and the data you wish you had. Always try to balance important attributes. Are age, gender, or language likely to matter?
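One way to check this balance is to compare the share of each attribute value in your database with the share you expect in your target population. The snippet below is a minimal sketch, not a definitive method: the function names, the 10-point tolerance, the sample database, and the 50/50 target split are all assumptions made for illustration.

```python
from collections import Counter


def attribute_shares(records, attribute):
    """Return the share of each value of `attribute` among `records`."""
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def representation_gaps(sample, reference_shares, attribute, tolerance=0.10):
    """Return the values whose share in the sample differs from the
    reference share by more than `tolerance` (absolute difference)."""
    sample_shares = attribute_shares(sample, attribute)
    gaps = {}
    for value, expected in reference_shares.items():
        observed = sample_shares.get(value, 0.0)
        if abs(observed - expected) > tolerance:
            gaps[value] = (observed, expected)
    return gaps


# Hypothetical mailing database: 9 women and 1 man.
database = [{"gender": "F"}] * 9 + [{"gender": "M"}]
# Assumed split of the target market, for illustration only.
target = {"F": 0.5, "M": 0.5}

gaps = representation_gaps(database, target, "gender")
for value, (observed, expected) in gaps.items():
    print(f"{value}: {observed:.0%} in database vs {expected:.0%} expected")
```

A check like this only flags the imbalance. Deciding whether the gap matters, and how to correct it, remains a judgment call.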
Also remember that, like your data, the expectations of society evolve. Projections have limited reliability. The past population is not always the same as the future population. Therefore, analysis based on the past will work in the future only if the future is similar to the past.
Watch out for singularities (e.g. the Covid lockdown), but also for gradual drift (e.g. the age at which women have their first child drifted from 26 to 31 years old in 30 years).
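Gradual drift is easy to miss because no single month looks unusual. A simple safeguard is to compare an older slice of your data with a recent one. The sketch below assumes a numeric attribute and uses a plain difference of averages. The ages and the threshold are made up for illustration, and real monitoring would likely compare full distributions rather than means.

```python
from statistics import mean


def mean_shift(past_values, recent_values):
    """Return the difference between the recent and the past average."""
    return mean(recent_values) - mean(past_values)


def has_drifted(past_values, recent_values, threshold):
    """Flag drift when the averages differ by more than `threshold`."""
    return abs(mean_shift(past_values, recent_values)) > threshold


# Made-up ages at first child, for illustration only.
past_ages = [25, 26, 27, 26, 26]
recent_ages = [30, 31, 32, 31, 31]

shift = mean_shift(past_ages, recent_ages)
drifted = has_drifted(past_ages, recent_ages, threshold=2)
print(f"Average moved by {shift} years; drift flagged: {drifted}")
```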
Choosing the right attributes
When working with data, you are always limited by what is available to you. Additional attributes can be collected, but doing so can be time-consuming and costly.
Should you collect them anyway? To answer this question, you need to weigh the cost against the value. Here again, ethics should be part of the balance. Would the missing attributes help me reduce biases? Would they help me respect the consumer more? If the answer to these questions is ‘yes’, then that is certainly a value you’ll want to add to your processes.
When deciding what data to collect, think also about its relevance to your business. The law tells you to leave some attributes out, like race, sexual orientation, or religion. But most often, ethics are going to guide you: it is OK for a diaper brand to collect the age of a baby, but what about a car or a phone brand? The Front-Page Test we introduced earlier can help you make your choice.
Avoiding errors in data processing
Modern data processing has many faces:
- extracting sentiment from text
- recognizing faces from photos
- merging two records for the same person
Keep in mind that none of those techniques is perfect: sarcasm, doppelgangers & homonyms will always get in the way of your data quality.
Even for more trivial usages, errors can arise as early as the data entry process. A lot of human and subjective errors are possible:
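To illustrate the homonym problem in record merging, here is a minimal sketch using Python's standard `difflib` module. The record layout, the `same_person` helper, and the 0.85 threshold are assumptions for illustration. The idea is simply to never merge on name similarity alone, but to require a second identifier to agree as well.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_person(rec_a, rec_b, name_threshold=0.85):
    """Treat two records as the same person only if the names are
    similar AND the birthdates are identical."""
    names_match = name_similarity(rec_a["name"], rec_b["name"]) >= name_threshold
    birthdates_match = rec_a["birthdate"] == rec_b["birthdate"]
    return names_match and birthdates_match


# Hypothetical records, for illustration only.
original = {"name": "John Smith", "birthdate": "1980-04-12"}
typo = {"name": "Jon Smith", "birthdate": "1980-04-12"}
homonym = {"name": "John Smith", "birthdate": "1995-09-30"}

print(same_person(original, typo))     # same birthdate, near-identical name
print(same_person(original, homonym))  # identical name, different birthdate
```

Even this rule can fail: two different people can share both a name and a birthdate, and a typo in the birthdate will block a legitimate merge. Ambiguous cases are better sent to a human for review.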
- incorrect codes
- misunderstanding a scale order
- misunderstanding the meaning of a field
- inverting fields (last name vs first name, city vs zip code)
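Some of these errors can be caught at entry time with simple validation rules. The sketch below is only an illustration: the field names, the list of valid country codes, the 1-to-5 scale, and the postal code pattern are all assumptions. Note that a range check cannot detect an inverted scale (someone entering 1 for "best" when 1 means "worst"), because the value is still valid. That kind of error has to be prevented by clear labels on the form.

```python
import re

# Hypothetical reference data, for illustration only.
VALID_COUNTRY_CODES = {"BE", "FR", "NL"}
# Assumption: postal codes are 4 or 5 digits.
ZIP_PATTERN = re.compile(r"^\d{4,5}$")


def validate_entry(entry):
    """Return a list of problems found in one data entry record."""
    problems = []

    if entry.get("country") not in VALID_COUNTRY_CODES:
        problems.append("unknown country code")

    score = entry.get("satisfaction")
    if not isinstance(score, int) or not 1 <= score <= 5:
        problems.append("satisfaction outside the 1-5 scale")

    zip_code = str(entry.get("zip", ""))
    city = str(entry.get("city", ""))
    if not ZIP_PATTERN.match(zip_code):
        if ZIP_PATTERN.match(city):
            problems.append("city and zip look swapped")
        else:
            problems.append("invalid zip")

    return problems


good = {"country": "BE", "satisfaction": 4, "city": "Brussels", "zip": "1000"}
swapped = {"country": "BE", "satisfaction": 4, "city": "1000", "zip": "Brussels"}

print(validate_entry(good))
print(validate_entry(swapped))
```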
Even if unintentional, these errors lead to bad results, bad choices and bad decisions that can have consequences for the consumer. Think about credit scoring processes or decisions on whether to offer postpaid billing to a given client.
As a data processor, you have an ethical obligation to use the right data when making decisions that will impact your consumers. Remember that data subjects expect:
- validity: your sources need to be authoritative, complete, and timely
- access: data subjects need to have access to the data held about them
- accountability: sources need to be able to detect and correct mistakes and unintended consequences
It is crucial that we pay careful attention to the validity of our data and our processes. Otherwise, we will get bad results.
When the results are used to make decisions about data subjects, bad results can cause great harm. Data subjects expect us to solve this problem. It is an ethical priority.
This article was the last of our series on specific ethical problems. Join us next month for the conclusion of the whole series on data ethics.