Data ethics & mitigating algorithmic bias

From controlling access to information to how resources are allocated, the influence and control of algorithms are ubiquitous in our lives today. Despite technology’s ability to help humans, it can also perpetuate and exacerbate issues of systemic oppression and bias.

At Multitudes, we collect behavioural data that can be sensitive, so it’s important that we have robust data ethics principles (discussed in more detail here). In this blog post, we discuss the steps involved in developing a machine learning product, how biases can be introduced during this process (algorithmic bias), ways to mitigate or eliminate these biases, and how we’re putting these principles into practice at Multitudes.

*Harmful impacts of algorithmic bias gaining headlines in the press*

What does algorithmic bias actually mean?

Algorithmic bias refers to the tendency of algorithms to systematically and repeatedly produce outcomes that benefit one particular group over another. For example, consider the task of image classification using the following two images:

Image shows a bride from a Western wedding and another bride from an Indian wedding side by side

Both are typical wedding photographs, one from a Western wedding and another from an Indian wedding. However, when neural networks were trained on ImageNet, one of the world’s most widely used open-source datasets, containing more than 14 million images, they produced very different predictions for these two images (read the paper here). Predictions on the image of the Western bride included labels such as “bride”, “wedding”, “ceremony”. In contrast, for the woman wearing a traditional Indian wedding dress, the predicted labels were “costume”, “performing arts”, “event”. Though some may consider this example trivial, such errors are not uncommon, especially given the lack of diversity in the datasets that data scientists typically use to build models. So it is no surprise that there are already many examples in society where algorithms have harmed marginalized groups (see here, and here).

The Machine Learning Lifecycle

The machine learning (ML) lifecycle can be broken up into the following five steps:

1. Data collection and preparation
2. Feature engineering and selection
3. Model evaluation
4. Model interpretation and explainability
5. Model deployment

Biases can arise within each step in the ML lifecycle, so each step needs its own mitigations.


1. Data Collection and Preparation

A hand drawing of a computer with data feeding in

What the step means
‍This is the first step in the ML process, where one collects, labels and prepares data for modelling and analysis purposes.

Example of how bias can arise
Issues arise when the data collected doesn’t fully reflect the real world. For example, studies of such biases have shown that almost all of the most popular image datasets (ImageNet, COCO and Open Images) contain images mostly from Europe and North America, despite the majority of the world’s population being in Asia. As a result, models trained on these datasets perform worse for people from continents such as Asia and Africa.

Note that there are many more ways that bias can leak into the data collection process; this is just one example.

*Density maps showing the geographical distribution of images in popular image datasets. A world population density map is shown for reference (bottom-right).*

Example of an action to mitigate this
Ensure that the data is collected in a manner that reflects reality. In fact, because historical data is collected from a society in which systems of oppression operate, you might even want to oversample data from marginalized groups (or under-sample over-represented groups) in order to move towards more equitable datasets. One example of when you might want to do this is a facial recognition app: since there are fewer images of BIPOC folks, you might want to oversample images of them. A fantastic resource to learn more about equitable data collection practices is the paper “Datasheets for Datasets” by Timnit Gebru and co-authors, which proposes a framework for transparent and accountable data collection.
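To make the oversampling idea concrete, here is a minimal sketch using only the Python standard library. The function name `oversample_to_balance`, the `region` field and the toy dataset are all hypothetical, invented for illustration; a real pipeline would need to decide which groups to balance and by how much.

```python
import random
from collections import Counter

def oversample_to_balance(records, group_key, seed=0):
    """Duplicate randomly chosen records from smaller groups until every
    group is as large as the largest one."""
    rng = random.Random(seed)
    by_group = {}
    for record in records:
        by_group.setdefault(record[group_key], []).append(record)
    target = max(len(members) for members in by_group.values())
    balanced = []
    for members in by_group.values():
        balanced.extend(members)
        # Sample with replacement to top the group up to the target size.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced

# Hypothetical toy dataset: 8 images from one region, 2 from another.
images = [{"id": i, "region": "Europe"} for i in range(8)]
images += [{"id": 100 + i, "region": "Asia"} for i in range(2)]

balanced = oversample_to_balance(images, "region")
counts = Counter(record["region"] for record in balanced)
print(counts)
```

Keep in mind that duplicating records only re-weights the data you already have. It adds no new information about the under-represented group, so it complements collecting more representative data rather than replacing it.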


2. Feature Engineering and Selection

A hand drawn hand choosing between three options

What the step means
‍When building models for the purposes of prediction, we construct features. A feature is simply a characteristic of each data point that might help for the purposes of prediction. For example, if we are predicting the price of a house, a useful feature might be the number of bedrooms or the postcode the house is located in.

Example of how bias can arise
‍Consider a model that predicts whether a police officer should be deployed in a particular suburb based on past incarceration data: A data scientist may claim to have built a model which is “socially neutral” as they have removed all features that correspond to race, age, and gender. However, other features like postal code might also correlate with features such as race (because in the real world, suburbs are segregated by race). In fact, this study demonstrates the potential for predictive policing to propagate and exacerbate racial biases in law enforcement.

Example of an action to mitigate this
The easiest counter-measure is to critically examine which features correlate with one another. In the example above, despite removing features about race, age and gender from the model, one should still look for other features (such as postcode) that correlate with demographic features. It’s also important to initiate and maintain contact with communities and stakeholders from different marginalised groups and to take a participatory approach to ML. This paper introduces Community Based System Dynamics (CBSD) as a way to engage different groups in order to design fairer ML systems. When designing and deciding on features for a model, engage with the community who would be most impacted by it and get their feedback. Even then, it is not clear that this is sufficient to eliminate all biases.
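One simple way to look for proxy features is to ask how well a supposedly neutral feature lets you guess a demographic attribute. The sketch below compares two guessing strategies. The function `proxy_strength`, the field names and the toy rows are assumptions made up for this example, and this is only one of many possible association checks.

```python
from collections import Counter, defaultdict

def proxy_strength(rows, feature, protected):
    """Return (baseline, proxy) accuracy for guessing `protected`.

    baseline: always guess the most common protected value overall.
    proxy:    guess the most common protected value within each feature value.
    """
    overall = Counter(row[protected] for row in rows)
    baseline = overall.most_common(1)[0][1] / len(rows)

    per_value = defaultdict(Counter)
    for row in rows:
        per_value[row[feature]][row[protected]] += 1
    correct = sum(c.most_common(1)[0][1] for c in per_value.values())
    return baseline, correct / len(rows)

# Hypothetical, made-up data: two postcodes with very different group mixes.
rows = (
    [{"postcode": "1010", "group": "A"}] * 9
    + [{"postcode": "1010", "group": "B"}] * 1
    + [{"postcode": "2020", "group": "A"}] * 2
    + [{"postcode": "2020", "group": "B"}] * 8
)

baseline, proxy = proxy_strength(rows, "postcode", "group")
print(f"baseline={baseline:.2f} proxy={proxy:.2f}")
```

If the proxy accuracy is much higher than the baseline, the feature carries demographic information even though the demographic column itself was removed, and it deserves a closer look.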


3. Model Evaluation

A computer with a magnifying glass on top, showing a data trend, along with a golden heart

What the step means
This involves assessing how accurately a model predicts a certain outcome, for example how reliably a facial recognition system identifies a person’s face.

Example of how bias can arise
In a recent study, the authors discuss intersectional model analysis as a tool to assess model accuracy, inspired by the sociological framework of intersectionality.

“Intersectionality means that people can be subject to multiple, overlapping forms of oppression, which interact and intersect with each other.” - Kimberlé Crenshaw

In the “Gender Shades” project, researchers used this approach to examine companies that were selling facial recognition technologies that boasted accuracies of up to 90%. However, when the accuracy was broken down by intersectional subgroup, the error rates for darker-skinned women were as high as 34.7%, whereas for lighter-skinned men the error rate was only 0.8%. This is hardly something a multi-trillion-dollar business should be selling at scale, let alone promoting as “accurate”.

Example of an action to mitigate this
Data scientists need to advocate for measures of model performance that are broken down by intersectional subgroup. This is another reason why having a representative dataset matters: there needs to be enough data to evaluate the model’s accuracy for different demographic groups. The approach of using model cards discussed here is a great resource for evaluating model performance.
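As a rough illustration of what a breakdown by intersectional subgroup can look like, the sketch below computes an error rate for every combination of two attributes. The helper names, field names and numbers are all made up for this example and are not results from any real system.

```python
from collections import defaultdict

def error_rates_by_subgroup(examples, attributes):
    """Return {subgroup: error rate}, where a subgroup is one combination
    of values for the given attributes (e.g. skin tone x gender)."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for example in examples:
        subgroup = tuple(example[a] for a in attributes)
        totals[subgroup] += 1
        if example["predicted"] != example["actual"]:
            errors[subgroup] += 1
    return {subgroup: errors[subgroup] / totals[subgroup] for subgroup in totals}

def make_examples(skin, gender, n, n_wrong):
    """Build n toy examples for one subgroup, n_wrong of them misclassified."""
    return [
        {"skin": skin, "gender": gender, "actual": 1,
         "predicted": 0 if i < n_wrong else 1}
        for i in range(n)
    ]

# Made-up numbers for illustration only.
examples = (
    make_examples("lighter", "male", 40, 0)
    + make_examples("lighter", "female", 30, 1)
    + make_examples("darker", "male", 20, 2)
    + make_examples("darker", "female", 10, 4)
)

overall = sum(e["predicted"] != e["actual"] for e in examples) / len(examples)
print(f"overall error rate: {overall:.1%}")

rates = error_rates_by_subgroup(examples, ["skin", "gender"])
for subgroup, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(subgroup, f"{rate:.1%}")
```

In this toy data the overall error rate looks low, yet one subgroup fares far worse than the rest. That gap is exactly what a single headline accuracy number hides. Note also how few examples the worst-served subgroup has, which is why representative data matters for evaluation as well as training.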


4. Model Interpretation and Explainability

A flow chart coming from a rectangle with a heart inside, showing different flows

What the step means
Model explainability refers to how well people can understand why a machine learning model produced the results it did. The extent to which a model's results are explainable to stakeholders should be a key consideration when evaluating different models, especially in human-centric applications.

Example of how bias can arise
Many examples exist of individuals being unfairly impacted by the output of a model. In 2007, a teacher was fired from a Washington, DC school because of an algorithm: despite highly favourable reviews from students and parents, an opaque algorithm ranked her performance in the bottom 2% of all teachers.

Example of an action to mitigate this
When humans interact with ML systems, it is imperative that they understand exactly what personal data will be used and how, and also why a model is being used in the first place.

Product people, software developers, and designers should have a high-level understanding of the ML system they are building, so they can probe what data is being used, and all the ways that the model's predictions might impact an end user's decisions in the real world.

For data scientists, there are many tools available to help understand model behaviour, such as SHAP (SHapley Additive exPlanations), which estimates how much each feature contributed to a model’s prediction. When using techniques such as deep learning, where the models identify and abstract features in the data that humans wouldn’t be able to identify themselves, data scientists can use tools such as LIME (Local Interpretable Model-agnostic Explanations), which was designed to explain the predictions of any black-box model.
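To give a feel for how model-agnostic explanation works, here is a minimal sketch of permutation importance, a simpler technique than SHAP or LIME and not a substitute for either. It treats the model as a black box, shuffles one feature at a time, and measures how much accuracy drops. The `black_box` model, the feature names and the data are hypothetical, made up for this example.

```python
import random

def accuracy(model, rows, labels):
    """Share of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, feature, repeats=20, seed=0):
    """Average drop in accuracy when the values of `feature` are shuffled
    across rows, breaking any link between that feature and the label."""
    rng = random.Random(seed)
    base = accuracy(model, rows, labels)
    drops = []
    for _ in range(repeats):
        shuffled = [row[feature] for row in rows]
        rng.shuffle(shuffled)
        permuted = [{**row, feature: value} for row, value in zip(rows, shuffled)]
        drops.append(base - accuracy(model, permuted, labels))
    return sum(drops) / repeats

# Hypothetical black box: only looks at "bedrooms" and ignores "door_colour".
def black_box(row):
    return row["bedrooms"] >= 3

rows = [{"bedrooms": b, "door_colour": c}
        for b in range(1, 7) for c in ("red", "blue")]
labels = [black_box(row) for row in rows]

importance_bedrooms = permutation_importance(black_box, rows, labels, "bedrooms")
importance_door = permutation_importance(black_box, rows, labels, "door_colour")
print(f"bedrooms: {importance_bedrooms:.3f}, door_colour: {importance_door:.3f}")
```

A feature the model ignores shows no drop at all, while a feature it relies on shows a clear drop. The same check can flag a model that leans heavily on a proxy feature such as postcode.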


5. Model Deployment

A rectangle containing a bar chart, with 5 flows emerging from the top

What the step means
Once we’ve trained a model, verified that it works effectively, and completed the R&D process, we deploy it into production.

Example of how bias can arise
At this step, it is important to ensure that the model is being used for its intended purposes. Bias can creep in when there are inconsistencies between the problem a model was built to solve and the way it is used in the real world. This is especially the case when it is developed and evaluated in a totally self-contained environment, when in reality it exists as part of a complex social system with many decision-makers. For example, Microsoft’s chatbot Tay learnt racial slurs in less than 24 hours of being exposed to Twitter. Another issue is that the distribution of data in production shifts over time, a phenomenon known as data drift (closely related to concept drift, where the relationship between the inputs and the outcome being predicted changes). The result is a degradation in model performance.

Example of an action to mitigate this
It is necessary to consistently track the quality of the input data. Without robust monitoring in place, the distribution of the input data can become more biased over time, even if the model creators ensured diversity in the initial dataset. This makes the model perform worse for certain demographics, which can undo earlier work to manage ethical considerations appropriately. We can track this by comparing the distribution of new input data from production with the training data used in model development. It’s also important to label, version, and date the models being used in production so that it is easy to roll back or even switch off models that are performing poorly.
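Comparing the production distribution against the training distribution can be as simple as the sketch below, which uses total variation distance on a single categorical field. This is one possible metric among several, and the function names, the toy data and the alert threshold are all assumptions chosen for illustration.

```python
from collections import Counter

def shares(values):
    """Map each distinct value to its share of the list."""
    counts = Counter(values)
    return {value: n / len(values) for value, n in counts.items()}

def total_variation_distance(train_values, prod_values):
    """Half the sum of absolute differences between the two share
    distributions: 0 means identical, 1 means no overlap at all."""
    p, q = shares(train_values), shares(prod_values)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

# Hypothetical data: the training set was balanced, production traffic is not.
train_groups = ["A"] * 50 + ["B"] * 50
prod_groups = ["A"] * 80 + ["B"] * 20

ALERT_THRESHOLD = 0.1  # assumed value; tune it for your own system

drift = total_variation_distance(train_groups, prod_groups)
if drift > ALERT_THRESHOLD:
    print(f"input drift detected: {drift:.2f}")
```

Running a check like this on a schedule, per feature and per demographic field where you have one, gives an early signal that the model is seeing data it was not evaluated on.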


How we’re putting this into practice at Multitudes

The Multitudes team smiles holding ice cream

For us at Multitudes, one of our data principles is that if we get data from someone, we should make sure that they get value from it, and that we use the data transparently. In addition, we never show individual performance data – our data is always aggregated to the team or organisation level.

In our development of ML products, this is what we do at each step:

Data collection and preparation

In line with our data principles, we only collect data if we are able to provide value to those same people (reciprocity). For example, we show managers and team members the same data in our app, and our first focus has been to design our app to be useful for all people in a team as opposed to solely being useful for management.
When we do data labelling, we divvy it up across a diverse group of people. This is important to minimise the bias that our model learns from the labelled data. This is where the diversity of our team helps, which we’re always thinking about, especially in hiring.

Feature engineering & selection

For analysing the quality of feedback given during code review, we avoided using features like the demographic information of the feedback’s authors, because this could end up penalising certain groups of people.

Model evaluation

We commit to doing intersectional evaluation of all of our models on our own team (since we have our own demographic data and a diverse team). As part of that, we run models on our own data and examine the accuracy across people on our team who are from different demographic groups.

Model deployment

We conduct bad actor exercises where we identify nefarious ways that a bad actor could use data from our tool. This has helped us navigate difficult product considerations. For example, we have set a high bar for accuracy before we will deploy our model to production. This is because we thought through the ways that an inaccurate model could impact our users, especially those from marginalized groups, and found that high accuracy will be critical to minimising risk for those groups.

At the end of the day, humans are the ones who create algorithms, so we also recognize the importance of the broader culture and environment we create at Multitudes. Some things we consider are:

Thinking about the people we have in the room. This means reducing bias in recruitment and doing proactive outreach to under-represented populations when hiring or looking for user testers. It also means consciously creating an environment that listens to those different voices, e.g., considering the share of voice in our team meetings; and making sure to rotate “office housework” tasks, which people from marginalized groups are often expected to do without the appropriate compensation.
Committing to ongoing learning about oppression and privilege. For example, each member of our team has an individual allyship action plan that we commit to and hold ourselves accountable to; and at each quarterly team strategy week, we dedicate at least half a day to a workshop or class on anti-racism and DEI. We’ve shared some other examples and learnings from our first year here, and our CEO has discussed how we think about oppression in this podcast. As we progress on our journey towards becoming a truly equitable business, we will continue to share our learnings with you.

Conclusions

This article has been a broad overview of some of the ethical pitfalls of machine learning systems. We hope it provides points to consider when dealing with ML systems, as well as an example of how we’re implementing these mitigations so far at Multitudes.

However, the subject of “Equity and Accountability in AI” is a vast and well-studied field and we’ve hardly scratched the surface. We hope this encourages everyone from AI researchers to end-users and the general public to have sustained dialogue on the importance of ethical considerations when building and interacting with ML systems. Moreover, it’s worth noting that reducing algorithmic bias is not the full answer – the bigger, more important task is to dismantle systemic oppression. As individuals and as a collective, we can take action to create a more equitable world by making choices in what we consume, how we live, how we work, and who we vote for.


Resources to learn more

This article barely scratched the surface of this vast topic. Here are a bunch of resources that you can use to find out more about equity and accountability in data science - and we’ll keep sharing our learnings and approach as we go!

Examples of Unethical AI systems in society

Research Groups and Organisations

Toolkits, Code and Other Fun Stuff

Books