At Multitudes, we’ve spent a lot of time thinking about indicators of team success. Everything we measure in our tool is based on a combination of research plus our team’s experience – we’ve worked as developers, data scientists, engineering leaders, and coaches for engineering teams. In addition, equity and inclusion is at the heart of all that we do; I ran a diversity, equity, and inclusion consultancy before starting Multitudes, and our whole team is committed to making equity the default at work.
Read on for an overview of some of our key metrics – what they are, why they matter, and how we measure them.
We have several analyses that look at the flow of the work – things like how quickly the team collaborates to deliver work, where delivery is flowing smoothly, and where it’s getting blocked.
What it is: This metric shows how long it takes a team to get a piece of work production-ready. It’s an indicator of how long it takes to deliver value to customers.
Why it matters: Our <code-text>Time to Merge<code-text> metric is a subset of <code-text>Lead Time<code-text> (the time from first code commit until the code is running in production). Research from Google’s DORA unit (the DevOps Research and Assessment unit) shows that better <code-text>Lead Time<code-text> is correlated with better business outcomes – specifically, teams with a better <code-text>Lead Time<code-text> do work that is faster, more stable, and more secure. If you want to dive deeper into this, check out the Accelerate book.
How we calculate it: To calculate <code-text>Time to Merge<code-text>, we measure the number of hours from pull request (PR) creation to merge – this shows how long it takes the team to give feedback, make revisions, and then merge the PR. This measure excludes non-working hours, bot merges, and selfie-merges.
What good looks like: The DORA research showed that elite performers have a lead time of less than one day. Since <code-text>Time to Merge<code-text> is a subset of <code-text>Lead Time<code-text>, we also recommend that you aim to keep <code-text>Time to Merge<code-text> to less than a day, i.e., less than 24 hours.
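As a rough illustration of the calculation described above (a sketch, not Multitudes' actual implementation), the Python below measures hours from PR creation to merge and compares a team's figure against the 24-hour target. The `PullRequest` record, its field names, and the use of a median as the team-level summary are all assumptions made for the example. For brevity it skips bot merges and selfie-merges via flags supplied by the caller and does not subtract non-working hours.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional

TARGET_HOURS = 24  # "less than a day" target mentioned in the text


@dataclass
class PullRequest:
    """Hypothetical PR record; field names are illustrative only."""
    created_at: datetime
    merged_at: datetime
    merged_by_bot: bool = False
    is_selfie_merge: bool = False


def time_to_merge_hours(pr: PullRequest) -> float:
    """Raw hours from PR creation to merge (non-working hours NOT removed)."""
    return (pr.merged_at - pr.created_at).total_seconds() / 3600


def median_time_to_merge(prs: list[PullRequest]) -> Optional[float]:
    """Median Time to Merge, skipping bot merges and selfie-merges.

    The median is an assumption of this sketch; returns None if no PR qualifies.
    """
    eligible = [pr for pr in prs if not (pr.merged_by_bot or pr.is_selfie_merge)]
    if not eligible:
        return None
    return median(time_to_merge_hours(pr) for pr in eligible)


def meets_target(prs: list[PullRequest]) -> bool:
    """True when the median Time to Merge is under the 24-hour target."""
    value = median_time_to_merge(prs)
    return value is not None and value < TARGET_HOURS
```

With one PR merged after 6 hours and another after 30 hours, the median is 18 hours, so `meets_target` returns `True` even though one PR took longer than a day. That is a reminder that a single summary number can hide slow outliers.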
What it is: This shows how long people wait to get feedback on their PRs.
Why it matters: When people have to wait longer for feedback, it can disrupt their workflow: they’re more likely to start a new piece of work while waiting. When the feedback arrives, they have to context-switch, which makes it harder to remember what they did and often means each task takes longer to complete (for example, one study showed that it takes 10-15 minutes to get back into context). Moreover, there’s bias in how long different groups of people have to wait for feedback. For example, one study found that women had to wait longer than men for feedback.
How we calculate it: This metric is for teams where people will ping each other directly (e.g., on Slack) when a PR is ready for feedback. Specifically, this measures the number of hours from PR creation until the PR gets feedback (either in a review or a comment by a collaborator). This measure excludes time that the PR spends in a draft state, since the draft state indicates that the PR author is still finishing the work.
Like <code-text>Time to Merge<code-text>, the above measure excludes non-working hours, bot merges, and selfie-merges.
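To make the "excludes non-working hours" idea concrete, here is a minimal sketch, assuming a fixed Monday to Friday, 9:00 to 17:00 schedule and naive datetimes already in the author's local time. The function names and the `ready_at` parameter (the moment the PR left draft, or its creation time if it was never a draft) are hypothetical, and the real product may handle schedules quite differently.

```python
from datetime import datetime, time, timedelta
from typing import Optional


def working_hours_between(
    start: datetime,
    end: datetime,
    day_start: time = time(9),
    day_end: time = time(17),
) -> float:
    """Hours between start and end that fall on Mon-Fri within the workday."""
    if end <= start:
        return 0.0
    total = timedelta()
    day = start.date()
    while day <= end.date():
        if day.weekday() < 5:  # 0-4 are Monday to Friday
            window_start = max(start, datetime.combine(day, day_start))
            window_end = min(end, datetime.combine(day, day_end))
            if window_end > window_start:
                total += window_end - window_start
        day += timedelta(days=1)
    return total.total_seconds() / 3600


def time_to_first_response_hours(
    ready_at: datetime, feedback_times: list[datetime]
) -> Optional[float]:
    """Working hours from 'ready for review' to the earliest review or comment.

    Returns None when the PR has not received any feedback yet.
    """
    if not feedback_times:
        return None
    return working_hours_between(ready_at, min(feedback_times))
```

For a PR that becomes ready at 16:00 on a Friday and gets its first comment at 10:00 the following Monday, this counts one hour on Friday and one on Monday, so the wait is 2 working hours rather than 66 clock hours.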
What good looks like: Ideally, this should be less than one working day (8 hours) to ensure that the overall <code-text>Time to Merge<code-text> is within a 24-hour period.
In this group, we look at measures that reflect how well the people on a team are doing. Building a great product is a marathon, not a sprint, so we look at indicators of how sustainably people are working and how well the work environment supports people to be healthy and well.
What it is: This measure shows how often people are doing work late at night or on weekends. Given that more and more people are working flexible hours, our metric specifically focuses on work done during “hours of concern” – work that people did in the wee hours of the morning or on a non-working day.
Why it matters: Working long hours is a risk factor for burnout. Moreover, the longer someone works, the harder it is for them to solve challenging problems: a study from the Wharton School of Business and University of North Carolina demonstrated that our cognitive resources deplete over time, so we need breaks to refuel. At Multitudes, we’ve seen that the faster a team’s <code-text>Time to Merge<code-text>, the higher their <code-text>Out-of-Hours work<code-text> is likely to be – so it’s important for teams and leaders to keep an eye on both metrics together.
How we calculate it: We look at the number of pull requests that people created outside of their usual work hours. We localize our analysis, adjusting for each pull request author's time zone, and we also take into account each person’s preferred working days and hours. In the near future, we will be improving this to include comments and/or commits that were made out of hours too.
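The localization step can be sketched as follows. This is an illustrative example only: the `WorkSchedule` type, its defaults, and the rule that anything outside a person's preferred hours counts as out of hours are assumptions, and a fixed UTC offset stands in for a full time zone (real code would more likely use `zoneinfo.ZoneInfo`).

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta, timezone, tzinfo


@dataclass
class WorkSchedule:
    """Hypothetical per-person preferences; names and defaults are illustrative."""
    tz: tzinfo
    working_days: frozenset = frozenset({0, 1, 2, 3, 4})  # Monday=0 .. Friday=4
    day_start: time = time(9)
    day_end: time = time(17)


def is_out_of_hours(created_at_utc: datetime, schedule: WorkSchedule) -> bool:
    """True if a timezone-aware timestamp falls outside the person's local schedule."""
    local = created_at_utc.astimezone(schedule.tz)
    if local.weekday() not in schedule.working_days:
        return True
    return not (schedule.day_start <= local.time() < schedule.day_end)


def count_out_of_hours_prs(created_times: list[datetime], schedule: WorkSchedule) -> int:
    """Number of PRs created outside the author's working days and hours."""
    return sum(is_out_of_hours(t, schedule) for t in created_times)


# Example: an author working at a fixed UTC+12 offset.
schedule = WorkSchedule(tz=timezone(timedelta(hours=12)))
created = [
    datetime(2024, 3, 4, 22, 0, tzinfo=timezone.utc),  # Tuesday 10:00 local
    datetime(2024, 3, 4, 10, 0, tzinfo=timezone.utc),  # Monday 22:00 local
    datetime(2024, 3, 1, 23, 0, tzinfo=timezone.utc),  # Saturday 11:00 local
]
out_of_hours_count = count_out_of_hours_prs(created, schedule)
```

The first timestamp looks like late-night work in UTC but is mid-morning for the author, which is why the conversion to local time has to happen before any comparison with working hours.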
What good looks like: This should ideally be 0, with people doing as little work out of hours as possible. If this does rise above 0, it’s important to ensure that it doesn’t become a trend so that people aren't doing sustained days of long hours.
We also look at several indicators of collaboration. In this bucket, we’re examining who gets support and who’s not getting enough support. We also show the people who are doing a lot of work to support others, since this type of “glue” work is easy to miss but is critical for team success.
What it is: This looks at the comment ratio between the loudest and quietest voices on the team.
Why it matters: This measure shows whether everyone on the team is participating equally; this is an indicator of psychological safety. Google’s Project Aristotle research showed that psychological safety is the number one determinant of team performance, and that equal share of voice is a behavioral indicator of psychological safety. Our metric looks at this in practice: Does everyone have an equal share of voice in code reviews?
How we calculate it: We count the number of comments that each person has written and then divide the highest count by the lowest count.
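The ratio described above is simple enough to sketch in a few lines. The function name and inputs are hypothetical, and the handling of a teammate with zero comments (returning infinity) is an assumption of this example rather than documented behavior.

```python
import math
from collections import Counter


def participation_gap(comment_authors: list[str], team_members: list[str]) -> float:
    """Highest per-person comment count divided by the lowest.

    comment_authors holds one entry per comment written. Team members with no
    comments count as zero, in which case the gap is reported as math.inf
    (an assumption of this sketch).
    """
    if not team_members:
        raise ValueError("team_members must not be empty")
    counts = Counter(comment_authors)
    per_person = [counts[member] for member in team_members]
    lowest, highest = min(per_person), max(per_person)
    if lowest == 0:
        return math.inf
    return highest / lowest
```

For example, if three teammates wrote 10, 4, and 2 comments, the gap is 10 / 2 = 5, which sits right at the rule-of-thumb threshold.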
What good looks like: The smaller this number is, the better – a smaller gap means that people are contributing more equally. Our rule of thumb is to try to get this down to 5 or below.
What it is: This graph shows how much feedback each person gave on other people’s PRs, how much feedback they got on their own PRs, and how feedback flows between people.
Why it matters: Microsoft research showed that feedback in code reviews is important not only for improving the quality of the code, but also for knowledge transfer, greater team awareness, and problem-solving to identify other possible solutions. In addition, there’s unconscious bias in what kind of feedback people get, and in who gets actionable feedback. That’s why it’s important to visualize feedback, so that teams have clear data to help them make sure that they’re supporting everyone.
How we calculate it: We look at the number of comments and reviews that each person gave and received on their PRs. We then show how the feedback moves between people on the team.
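One way to picture the underlying data is as a tally of (giver, PR author) pairs. The sketch below is illustrative only: the event format and function name are assumptions, and it treats every review or comment as one unit of feedback while ignoring comments people leave on their own PRs.

```python
from collections import Counter


def feedback_flows(events: list[tuple[str, str]]):
    """Tally feedback from a list of (giver, pr_author) events.

    Returns three Counters: flows keyed by (giver, pr_author), totals given
    per person, and totals received per person. Self-feedback is skipped.
    """
    flows: Counter = Counter()
    given: Counter = Counter()
    received: Counter = Counter()
    for giver, pr_author in events:
        if giver == pr_author:
            continue  # comments on one's own PR are not feedback to others
        flows[(giver, pr_author)] += 1
        given[giver] += 1
        received[pr_author] += 1
    return flows, given, received
```

The `flows` counter is what a flow diagram would draw as weighted arrows between people, while `given` and `received` make it easy to spot someone who gives a lot of support but receives little, or the reverse.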
What good looks like: In the best teams, everyone is getting feedback and everyone is giving feedback, or at least asking questions about others’ work. In these teams, seniors give plenty of feedback to juniors and intermediates – and juniors and intermediates feel comfortable asking seniors questions.
That gives you a taste of what we measure and why. To learn about our other measures or see the tool in action, we invite you to try it out – you can sign up for our beta program here!