Expectations of Kaggle competitions: ethics and provenance
Published: 18 Oct 2019
Kaggle is a great platform for learning about, discussing and working on data problems using Machine Learning. A fascinating part of Kaggle's appeal is its competitions section, which challenges Data Scientists and hobbyists to compete to solve a fun or pressing data problem. Some competitions come with extremely interesting datasets: in the world of ML, good datasets are like fuel to a furnace. However, not all competitions have extensive explanations, preambles or reflections on the project and its use. Here I'll look at what I'd recommend to competition setters, Kaggle itself, and those looking to compete in order to create better proposals, and why it's important.
TL;DR - Anyone using Machine Learning techniques should consider the impact of their work on society. Kaggle is an exciting source of learning and activity but shouldn't be exempt from reflective thinking.
Machine Learning (ML) is a boon to organisations with large datasets: it allows for useful predictions that can lead to positive or negative outcomes for society, whether intended or not. If you have listened to my podcast (www.machine-ethics.net) then you will be well aware of some of the issues, many of which are freely discussed in the news in connection with big stories like Cambridge Analytica.
I would presume that Kaggle, as a platform for developing ML code and sharing its products with companies, adheres to a strict policy on ethical conduct, data provenance, and explanation of the code's future use. If this is indeed the case, it isn't evident in many of the competitions' descriptions.
Why does this matter? Some competitions may be set by opaque organisations which hide their future intentions for the free work Kaggle users are conducting (the fact that it is free is another issue with the platform). For example, one such competition links to an institution sponsored by U.S. federal institutions. As this is not stipulated in their proposal, I can only presume that they are trying to hide this fact, which leads us to wonder into whose hands this code and research will land... (competition: https://www.kaggle.com/c/recognizing-faces-in-the-wild/overview/description).
The Recognizing Faces in the Wild competition, as linked above, is probably the worst example yet to appear on the platform, with no thought given to ethics (Robo-ethics and Data ethics), data provenance, data bias and social exploitation. One can instantly see how this information could be used to negative effect, and I worry that without better ethical policies, Kaggle's list of competitions, earnest or not, will be corrupted by insidious organisations acting without any social reflection on their work or with outright nefarious intent.
With that said, there are good examples of competitions whose setters do briefly reflect on their propositions and give information on future research, contribution, data bias and misuse. The Google Jigsaw competition is a good example: they acknowledge some of the above (though not misuse).
This is an open call to Kaggle, and indeed any data science competition, to address ethical considerations in competition proposals.
Consider including information on:
- Data provenance
  - Where is the data from?
  - Has it been ethically obtained?
  - What method of collection has been used?
  - How old is it?
  - Why is the data deemed relevant or correlative?
- Future usage
  - How else can this Kernel (code) be used?
  - What are the other applications of this project?
  - What might the consequences of this work look like?
  - What society are we building if this work is wildly successful?
  - Should this work even be done at all if it could be used for mass harm?
- Organisations and their benefactors
  - Information about the provenance of the setting organisations and their affiliates
- Acceptance of bias issues
  - Demonstrate that diversity and bias in your data have been considered
  - Consider the dogma that may take hold if the output is applied in the real world
The above is not an exhaustive list but my initial thoughts after reading over half the competitions listed on the site and thinking of relevant questions for them. Anecdotally, I've been interested in participating in Kaggle competitions in the past; however, the lack of information on the above has led me to steer away from any competition which isn't obviously innocuous.
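For competition setters who want something more concrete, the checklist could be captured as a structured record that shows which sections of a proposal are still blank. What follows is only a minimal sketch: the `EthicsStatement` class, its field names and the `missing_sections` helper are all hypothetical, invented here for illustration, and are not part of Kaggle or any existing tool.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class EthicsStatement:
    """Hypothetical disclosure record for a competition proposal."""
    # Data provenance
    data_source: str = ""
    ethical_basis: str = ""
    collection_method: str = ""
    data_age: str = ""
    relevance_rationale: str = ""
    # Future usage
    other_applications: str = ""
    potential_harms: str = ""
    # Organisations and their benefactors
    organisers_and_funders: str = ""
    # Acceptance of bias issues
    known_biases: str = ""


def missing_sections(statement: EthicsStatement) -> List[str]:
    """Return the names of fields left empty or containing only whitespace."""
    return [
        f.name
        for f in fields(statement)
        if not getattr(statement, f.name).strip()
    ]


# Made-up example values, purely for illustration.
example = EthicsStatement(
    data_source="Photos volunteered by participants",
    collection_method="Opt-in upload form",
    known_biases="Skews towards contributors from one region",
)

print(missing_sections(example))
```

A filled-in record says nothing about whether the answers are good ones; it only makes the gaps visible before a competition goes live.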
If you are a Kaggle employee, please heed the above. If you are an organisation looking to set a competition, then look through the list and contact me if you have questions. For anyone else, please DO NOT participate in FREE labour without first considering the cost of your time and the societal impact of your work.
An aside: I would love more Kaggle competitions to also have a requirement to display post-competition outcomes and reflections; it always seems like a bit of an anti-climax as you rarely find out what happens after the submissions.
I'd also change the name of Kaggle's code environment from Kernel to something else, as Kernel is super confusing: it already has a meaning to data scientists and coders alike.