Ethical data part 6 - The imperative of Data Quality

The Imperative of Data Quality

Working with data means running daily processes that feed important decisions. Bad data or bad models lead to bad decisions, which can prove catastrophic for an individual (e.g. a refused loan). It is therefore important to understand the sources of potential errors in order to remove them.

Even in marketing, where the consequences of bad data seem less catastrophic, you are at risk of annoying your consumers, and of giving the impression that you don’t care about them.

To prevent this, keep in mind three requirements that guard against the main risks to data validity:

The sample you use must be representative
The attributes you work with must be (ethically) relevant
Errors in data processing must be avoided

As these risks usually appear during the targeting & analysis phases of our marketing campaigns, we need to be particularly attentive to those moments.

Choosing a representative sample

When processing data, our conclusions are limited by the data we have. We call this the Drunk Search effect, named after the drunk person who searches for his lost wallet at night under the lamppost because that is the only place with light, even though he lost the wallet elsewhere. This phenomenon leads to sampling biases which, over time, reinforce themselves:

I’m only sending my ad to women
Why?
Because 90% of my database are women
Why?
Because we only mailed women in the past…

You should always keep in mind the difference between the data that you have, and the data you wish you had. Always try to balance important attributes. Are age, gender, language likely to matter?
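
As a rough illustration of the balance check described above, the sketch below compares the share of each attribute value in a database with the share you would expect in the population you want to reach. The function name, the 10-point tolerance, and the 50/50 reference split are assumptions made up for this example, not figures from a real campaign.

```python
from collections import Counter


def representation_gaps(sample, reference_shares, tolerance=0.10):
    """Return the attribute values whose share in `sample` differs from the
    expected share in `reference_shares` by more than `tolerance`.

    The result maps each flagged value to (observed_share, expected_share).
    """
    counts = Counter(sample)
    total = len(sample)
    gaps = {}
    for value, expected in reference_shares.items():
        observed = counts.get(value, 0) / total if total else 0.0
        if abs(observed - expected) > tolerance:
            gaps[value] = (observed, expected)
    return gaps


# Hypothetical data: a database that is 90% women, checked against an
# assumed 50/50 target market.
database_gender = ["F"] * 90 + ["M"] * 10
target_market = {"F": 0.5, "M": 0.5}

gaps = representation_gaps(database_gender, target_market)
for value, (observed, expected) in gaps.items():
    print(f"{value}: {observed:.0%} in database, {expected:.0%} expected")
```

A flagged gap does not say which side is wrong. It only signals that the database and the intended audience differ, which is the moment to ask "why?" as in the dialogue above.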

Also remember that, like your data, the expectations of society evolve. Projections have limited reliability. The past population is not always the same as the future population. Therefore, analysis based on the past will work in the future only if the future is similar to the past.

Watch out for singularities (e.g. the Covid lockdown), but also for gradual drift (e.g. the age at which women have their first child drifted from 26 to 31 years old in 30 years).
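
One simple way to notice gradual drift is to compare the average of an attribute in older data with the average in recent data. The sketch below does only that; the function names, the sample ages, and the 2-year threshold are hypothetical, and a real check would likely need more data and a proper statistical test.

```python
from statistics import mean


def mean_shift(past_values, recent_values):
    """Difference between the recent average and the past average."""
    return mean(recent_values) - mean(past_values)


def has_drifted(past_values, recent_values, max_shift):
    """True when the average moved by more than `max_shift` in either direction."""
    return abs(mean_shift(past_values, recent_values)) > max_shift


# Hypothetical ages at first child in an old and a recent customer sample.
ages_past = [25, 26, 27, 26]
ages_recent = [30, 31, 32, 31]

print(mean_shift(ages_past, ages_recent))
print(has_drifted(ages_past, ages_recent, max_shift=2))
```

If the check fires, a model or segmentation built on the older sample probably deserves a review before it is reused.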

Choosing the right attributes

When working with data, you are always limited by what is available to you. Additional attributes can be collected, but doing so can be time-consuming and costly.

Should you collect them anyway? To answer this question, you need to weigh cost against value. Here again, ethics should be part of the balance. Would the missing attributes help you reduce biases? Would they help you respect the consumer more? If the answer to these questions is ‘yes’, then that is certainly a value you’ll want to add to your processes.

When deciding what data to collect, think also about its relevance to your business. The law tells you to leave some attributes out, like race, sexual orientation, or religion. But most often, ethics will be your guide: it is OK for a diaper brand to collect the age of a baby, but what about a car or a phone brand? The Front-Page Test we introduced earlier can help you make your choice.

Avoiding errors in data processing

Modern data processing has many faces:

extracting sentiment from text
recognizing faces from photos
merging two records for the same person
…

Keep in mind that none of those techniques is perfect: sarcasm, doppelgangers and homonyms (different people sharing the same name) will always get in the way of your data quality.
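
To make the homonym problem concrete, here is a minimal sketch of a conservative matching rule for merging records: a shared name is never enough, and at least one other identifier has to agree. The field names (`name`, `email`, `birth_date`) and the rule itself are assumptions for illustration, not a recommended production algorithm.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def same_person(a, b):
    """Conservative match: the names must agree AND at least one additional
    identifier (email or birth date) must be present in both records and agree.
    """
    if normalize(a["name"]) != normalize(b["name"]):
        return False
    for key in ("email", "birth_date"):
        if a.get(key) and b.get(key) and normalize(a[key]) == normalize(b[key]):
            return True
    return False


# Hypothetical records: two homonyms, and one person entered twice.
anna_1 = {"name": "Anna Peeters", "email": "anna.p@example.com"}
anna_2 = {"name": "Anna Peeters", "email": "a.peeters@example.org"}
anna_3 = {"name": "anna  peeters", "email": "Anna.P@example.com"}

print(same_person(anna_1, anna_2))  # same name only, so not merged
print(same_person(anna_1, anna_3))  # same name and same email, so merged
```

This rule errs on the side of leaving duplicates unmerged, which is usually less harmful to the consumer than merging two different people.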

Even in more mundane uses, errors can arise as early as the data entry process. Many human and subjective errors are possible:

typos
incorrect codes
misunderstanding a scale order
misunderstanding the meaning of a field
inverting fields (name vs first name, city vs zip code)
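
Some of the entry errors listed above can be caught with simple checks at the moment a record is saved. The sketch below is one possible shape for such checks; the field names, the list of country codes, the 4-to-5-digit zip format, and the 1-to-5 satisfaction scale are all assumptions chosen for the example.

```python
import re

# Assumptions for this example only.
ALLOWED_COUNTRY_CODES = {"BE", "FR", "NL"}
ZIP_PATTERN = re.compile(r"^\d{4,5}$")


def validate_record(record):
    """Return a list of human-readable problems found in one record."""
    problems = []

    zip_code = str(record.get("zip", "")).strip()
    city = str(record.get("city", "")).strip()
    if ZIP_PATTERN.match(city) and not ZIP_PATTERN.match(zip_code):
        # The city field holds digits while the zip field does not.
        problems.append("city and zip look inverted")
    elif not ZIP_PATTERN.match(zip_code):
        problems.append("zip code has an unexpected format")

    if record.get("country") not in ALLOWED_COUNTRY_CODES:
        problems.append("unknown country code")

    score = record.get("satisfaction")
    if not isinstance(score, int) or not 1 <= score <= 5:
        problems.append("satisfaction is outside the 1-5 scale")

    return problems


# Hypothetical records: one with swapped fields, one clean.
swapped = {"zip": "Brussels", "city": "1000", "country": "BE", "satisfaction": 4}
clean = {"zip": "1000", "city": "Brussels", "country": "BE", "satisfaction": 4}

print(validate_record(swapped))
print(validate_record(clean))
```

Checks like these catch malformed values, but not a misunderstood scale order: a score of 1 entered by someone who thought 1 meant "best" is still a valid number. That kind of error has to be prevented by clear labels on the form itself.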

Even when unintentional, these errors lead to bad results, bad choices and bad decisions that can have consequences for the consumer. Think about credit scoring processes, or decisions on whether to offer a client postpaid billing.

As a data processor, you have an ethical obligation to use the right data when making a decision that will impact your consumers. Remember that data subjects expect:

validity: your sources need to be authoritative, complete, and timely
access: data subjects need to have access to the data held about them
accountability: sources need to be able to detect and correct mistakes and unintended consequences.

Conclusion

It is crucial that we pay careful attention to the validity of our data and our processes. Otherwise, we will get bad results.

When those results are used to make decisions about data subjects, they can cause great harm. Data subjects expect us to solve this problem. It is an ethical priority.

This article was the last of our series on specific ethical problems. Join us next month for the conclusion of the whole series on data ethics.
