What we measure and why

At Multitudes, we’ve spent a lot of time thinking about indicators of team success. Everything we measure in our tool is based on conversations with experts on engineering management, research on how to support DEI at work (diversity, equity, and inclusion), and metrics frameworks like DORA and SPACE. In addition, our team has worked as developers, data scientists, engineering leaders, and coaches for engineering teams. Finally, equity and inclusion is at the heart of all that we do; our CEO, Lauren Peate, ran a diversity, equity, and inclusion consultancy before starting Multitudes, and our whole team is committed to unlocking the collective power of teams – a key part of which is ensuring that those teams are equitable.

Our focus is on showing a holistic view of team delivery, because productivity is about more than just speed and output. As the recent paper on SPACE metrics points out, metrics signal what is important to an organization – and flow metrics alone cannot capture critical dimensions like employee satisfaction, well-being, retention, collaboration, and knowledge-sharing. This is why we provide people metrics that look at well-being and collaboration, as well as our process metrics on flow of work, value delivery, and quality of work.

Read on for a deep dive into our metrics – what they are, why they matter, how we measure them, and how you can get the most out of them for your teams.

To see our metrics in action, try it out for yourself – you can sign up for our beta program here!

⭐️ A star indicates that the metric is one of the 4 Key Metrics published by Google's DevOps Research and Assessment (DORA) team.

Process Metrics

Our process metrics cover most common flow and velocity metrics, including most DORA metrics. We also aim to make our metrics as actionable as possible - so we try to show possible causes of trends, like bottlenecks during reviews, large PR sizes, and whether the team's focus is aligned with the highest priority work. Complementing speed of delivery, we're also interested in quality - how often bugs are released, and how quickly systems are restored after a failure.

Flow of Work

We have several analyses that look at the flow of the work – things like how quickly the team collaborates to deliver work, where delivery is flowing smoothly, and where it’s getting blocked.

⭐️ Time to Merge

👥 This metric can relate to performance. Since PRs are a team sport, we only show this at the team level, not the individual level.

A stylized line graph shows time to merge is high at 18 hours but is trending up.
  • What it is: This is our take on DORA's Lead Time, a metric that shows how long it takes a team to get a piece of work production-ready. It’s an indicator of how long it takes to deliver value to customers.
  • Why it matters: Our Time to Merge metric is a subset of Lead Time (the time from first code commit until the code is running in production). Research from Google’s DORA (DevOps Research and Assessment) team shows that better Lead Time is correlated with better business outcomes – specifically, teams with a faster Lead Time ship work that is faster, more stable, and more secure. If you want to dive deeper into this, check out the Accelerate book.
  • How we calculate it:  To calculate Time to Merge, we measure the number of hours from pull request (PR) creation to merge – this shows how long it takes the team to give feedback, make revisions, and then merge the PR. (There’s a rough code sketch of this calculation after this list.) A few additional notes:
    - Our focus is on PRs that the team collaborated on, so we exclude bot merges and selfie-merges (these are merges by the PR author with no comments or reviews by other collaborators).
    - We exclude the time that the PR spends in a draft state, since people use that feature to pair and brainstorm on work that’s not yet completed.
  • What good looks like: The DORA research showed that elite performers have a lead time of less than one day. Since Time to Merge is a subset of Lead Time, we also recommend that you aim to keep Time to Merge under a day, i.e., less than 24 hours.
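
For the technically curious, here’s a minimal Python sketch of the calculation described above. The field names are hypothetical stand-ins for data you’d pull from the GitHub API, the median aggregation is purely illustrative, and the sketch skips refinements such as excluding non-working hours.

```python
from datetime import datetime
from statistics import median

def time_to_merge_hours(prs):
    """Hours from PR creation to merge, per the rules above (sketch).

    Each PR is a dict with hypothetical fields:
      created_at, merged_at    - datetimes (merged_at is None if unmerged)
      draft_hours              - hours the PR spent in a draft state
      author, author_is_bot    - PR author login, and whether it's a bot
      reviewers_or_commenters  - set of logins who commented on or reviewed the PR
    """
    durations = []
    for pr in prs:
        if pr["merged_at"] is None or pr["author_is_bot"]:
            continue  # only merged, human-authored PRs count
        if not (pr["reviewers_or_commenters"] - {pr["author"]}):
            continue  # selfie-merge: nobody else commented or reviewed
        hours = (pr["merged_at"] - pr["created_at"]).total_seconds() / 3600
        hours -= pr["draft_hours"]  # exclude time spent as a draft
        durations.append(max(hours, 0.0))
    return median(durations) if durations else None  # median is illustrative only

example = [{
    "created_at": datetime(2024, 5, 1, 9, 0),
    "merged_at": datetime(2024, 5, 1, 16, 30),
    "draft_hours": 1.5,
    "author": "ana",
    "author_is_bot": False,
    "reviewers_or_commenters": {"ben"},
}]
print(time_to_merge_hours(example))  # 6.0
```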

Review Wait Time

  • What it is: This shows how long people wait to get feedback on their PRs.
  • Why it matters: This is one possible bottleneck for Time to Merge. When people have to wait longer for feedback, it can disrupt their workflow: they’re more likely to start a new piece of work while waiting. When the feedback does arrive, they have to context-switch, which makes it harder to remember what they did and often means each task takes longer to complete (for example, one study showed that it takes 10-15 minutes to get back into context).

    Moreover, there’s bias in how long different groups of people have to wait for feedback. For example, this research showed that women had to wait longer than men for feedback. This is why we do show this metric at the individual level — so that you can make sure that everyone is receiving feedback in a timely manner.
  • How we calculate it:  This measures the number of hours from PR creation until the PR gets feedback – a comment, review, or merge by someone other than the PR author. It excludes time that the PR spends in a draft state, since the draft state indicates that the PR author is still finishing the work. Like Time to Merge, this measure excludes non-working hours, bot comments/reviews/merges, and selfie-merges. (There’s a rough code sketch of this after the list.)
  • What good looks like: Ideally, this should be well under one working day – e.g. less than 6 hours – so that the overall Time to Merge stays within a 24-hour period. This includes the time taken to get to the PR, plus the time spent actively reviewing (60 to 90 minutes to catch 70-90% of defects, according to this Cisco study).
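
Here’s a rough Python sketch of the wait-time calculation. The field and event names are hypothetical, and for simplicity this sketch leaves out the non-working-hours exclusion described above.

```python
from datetime import datetime

def review_wait_hours(pr):
    """Hours from PR creation (or leaving draft) until the first feedback
    from someone other than the author (sketch; hypothetical field names)."""
    start = pr.get("ready_for_review_at") or pr["created_at"]
    feedback_times = [
        event["at"]
        for event in pr["events"]  # comments, reviews, and the merge itself
        if event["actor"] != pr["author"] and not event["actor_is_bot"]
    ]
    if not feedback_times:
        return None  # no feedback from anyone else yet
    return (min(feedback_times) - start).total_seconds() / 3600

pr = {
    "created_at": datetime(2024, 5, 1, 9, 0),
    "ready_for_review_at": datetime(2024, 5, 1, 10, 0),  # left draft at 10am
    "author": "ana",
    "events": [
        {"at": datetime(2024, 5, 1, 13, 0), "actor": "ben", "actor_is_bot": False},
        {"at": datetime(2024, 5, 1, 15, 0), "actor": "ana", "actor_is_bot": False},
    ],
}
print(review_wait_hours(pr))  # 3.0
```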

PR Size

👥 This metric can relate to performance. Since PRs are a team sport, we only show this at the team level, not the individual level.

  • What it is: How large your team's PRs are. We show two representations of this — Lines Changed and Files Changed.
  • Why it matters: This is another possible bottleneck for Time to Merge. We know that large PRs are harder to review, test, and manage in general. It is now generally accepted that keeping PR size down is best practice for faster reviews, fewer merge conflicts (and therefore easier collaboration), and simpler rollbacks if required. Learn more in this 2017 paper by Microsoft and the University of Victoria, and in Google’s own internal guidelines (they say “changelist” rather than “pull request”).
  • How we calculate it:  We show the median of the lines of code or the number of files changed per PR, depending on the option selected. We chose to provide 2 options here (instead of just lines of code) so you can get a more well-rounded view of the overall size. We recognise that these are both simple measures of "PR Size" which don't take into account edge cases such as lock files or automated formatters (examples where PR size may be large, but the PR is still easy to review and manage). However, in the majority of cases, the number of lines or files changed is a reasonable indicator of how long a PR may take to get merged. (There's a rough sketch of this calculation after this list.)
  • What good looks like: Many organizations like to enforce maximum limits on the lines of code (LOC) changed per PR, generally ranging from around 200 to 400. This study also found that PRs should be limited to 200-400 LOC; beyond that, the ability to effectively capture defects goes down. So we recommend keeping LOC under 300 as a good middle ground.

    Files changed varies - you can have a small number of LOC changed across many files, and it'd still be fairly easy to review. In our teams, we try to keep it under 10 files changed.
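
Here’s a rough sketch of the PR Size calculation. The field names mirror the GitHub API’s additions, deletions, and changed_files; counting “lines changed” as additions plus deletions is our assumption for this illustration.

```python
from statistics import median

def pr_size(prs, dimension="lines"):
    """Median PR size, as either lines changed or files changed (sketch)."""
    if dimension == "lines":
        # assumption: "lines changed" = additions + deletions
        sizes = [pr["additions"] + pr["deletions"] for pr in prs]
    else:
        sizes = [pr["changed_files"] for pr in prs]
    return median(sizes) if sizes else None

prs = [
    {"additions": 120, "deletions": 30, "changed_files": 4},
    {"additions": 400, "deletions": 80, "changed_files": 12},
    {"additions": 45, "deletions": 5, "changed_files": 2},
]
print(pr_size(prs, "lines"))  # 150
print(pr_size(prs, "files"))  # 4
```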

Value Delivery

⭐️ Merge Frequency

👥 This metric can relate to performance. Since PRs are a team sport, we only show this at the team level, not the individual level.

  • What it is: This is our take on DORA's Deployment Frequency. It shows the median number of PRs merged per person on a team, over time.
  • Why it matters: This is an indicator of the value we're providing to customers, because it shows the volume of work being released to production.
  • How we calculate it:  We count the number of PRs merged in each time period, divided by the number of people on the team. This normalization allows benchmarking of teams against industry standards, regardless of team size. We exclude PRs authored by bots. (There's a rough sketch of this calculation after this list.)
  • What good looks like: Google suggests that elite teams should be deploying multiple times a day. If we call that one deployment per day per team, that’s 5 deploys per week in a 5-day workweek. Dividing this by a rough approximation of team size (around 5 developers), and taking into account the fact that there's sometimes more than one PR included in a single deploy (for major features, it could be best practice to collect up lots of changes into a release branch), we recommend keeping this metric over 2 PRs merged per person per week.
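
Here’s a tiny sketch of the normalization, plus a rough check of the benchmark arithmetic above. It assumes bot-authored PRs have already been filtered out of the count.

```python
def merge_frequency(merged_pr_count, team_size, weeks=1):
    """Merged PRs per person per week (sketch).
    Bot-authored PRs are assumed to be excluded from merged_pr_count."""
    return merged_pr_count / team_size / weeks

# Rough check of the benchmark above: ~5 deploys a week for a ~5-person team,
# at roughly 2 PRs per deploy, works out to about 2 merged PRs per person per week.
print(merge_frequency(merged_pr_count=10, team_size=5))  # 2.0
```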

Types of Work

👥 This metric can relate to performance. Since PRs are a team sport, we only show this at the team level, not the individual level.

  • What it is: If you have integrated with Linear, this chart will show the number of issues moved to Done per week, color-coded by Project.
    🌱 Coming soon: We're adding a Jira integration, so Jira teams can also get this graph.
  • Why it matters: This metric gives you visibility over team velocity and focus. The height of the bar shows the overall number of tasks done, while the colors help you identify how the team's time was spent. Did bugs and tech debt reduce the time the team had to do feature work? Was the team focused on one or two key priorities, or were people scattered across a range of projects?
  • How we calculate it: We count the number of issues moved to Done in a given time period, and color-code these by their Linear Project. (There's a rough sketch of this after the usage examples below.)
    ‍🌱 Coming soon: Showing the total number of story points completed!
  • What good looks like:  This depends on what you're trying to achieve in a given cycle. See below for a list of some examples of how you might use this chart.

How you might use this chart:

  • You want to make sure that bug work isn’t taking too much time away from feature work. Create a project in Linear called Bug (or similar), add issues to it, and watch the chart display how many were completed over time.
  • You want to make sure that your team is working on the most important thing. You can hover over a project of interest, to highlight how many tickets were dedicated to that project. To reduce work in progress, you might set a goal to focus on only 2-3 projects at a time.
  • You want to minimize the amount of unplanned work that interrupts a cycle. You could create a project called Unplanned and see how much of your team's work this category takes up.
  • You want to make sure that you're consistently setting aside some time to chip away at tech debt. Create a project in Linear called Tech Debt (or similar), add issues to it, and aim for a consistent amount of Tech Debt issues being completed each time period.
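
For illustration, here’s a rough Python sketch of the counting behind this chart. The issue field names are hypothetical stand-ins for data from the Linear API.

```python
from collections import Counter, defaultdict
from datetime import datetime

def issues_done_by_project(issues):
    """Count issues moved to Done per ISO week, split by project (sketch).

    Each issue is a dict with hypothetical fields:
      completed_at - when the issue moved to Done (None if not yet done)
      project      - the Linear Project name
    """
    counts = defaultdict(Counter)
    for issue in issues:
        if issue["completed_at"] is None:
            continue
        year, week, _ = issue["completed_at"].isocalendar()
        counts[(year, week)][issue["project"]] += 1
    return counts

issues = [
    {"completed_at": datetime(2024, 5, 1), "project": "Checkout revamp"},
    {"completed_at": datetime(2024, 5, 2), "project": "Bug"},
    {"completed_at": None, "project": "Tech Debt"},
]
print(dict(issues_done_by_project(issues)))
# {(2024, 18): Counter({'Checkout revamp': 1, 'Bug': 1})}
```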

Quality of Work

⭐️ Change Failure Rate

👥 This metric can relate to performance. Since PRs are a team sport, we only show this at the team level, not the individual level.

  • What it is: The percentage of PRs merged that indicate that some kind of failure was released and had to be fixed.
  • Why it matters: This is our take on DORA's Change Failure Rate, which indicates the percentage of deployments that cause a failure in production. It's a lagging indicator of the quality of software that is being deployed - how often does it contain bugs that cause failures later?
  • How we calculate it:  We calculate the % of merged PRs that contain any of the words rollback, hotfix, revert, or [cfr] in the PR title, out of all merged PRs. We tested and chose these keywords to catch most cases where a PR was required to fix a previously-released bug, while minimizing false positives.

    We recognise that this proxies failures after the fact; it's not possible to know in the moment that a failure is being released into production – otherwise it wouldn't have been released in the first place! Also, incidents are not always easily tied to a specific PR or deployment. You can include the square-bracketed keyword [cfr] in your PR titles if you'd like more granular control over what gets counted in this chart. (There's a rough sketch of the keyword matching after this list.)
  • What good looks like: Google suggests that elite teams should aim for a change failure rate of 0%-15%.
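
Here’s a rough sketch of the keyword matching described above. The exact matching rules (for example, case-insensitivity and substring matching) are our assumptions for this illustration.

```python
import re

# keywords from the rule above; matching details are illustrative assumptions
FAILURE_PATTERN = re.compile(r"rollback|hotfix|revert|\[cfr\]", re.IGNORECASE)

def change_failure_rate(merged_pr_titles):
    """% of merged PRs whose title suggests a released failure was fixed (sketch)."""
    if not merged_pr_titles:
        return None
    failures = sum(bool(FAILURE_PATTERN.search(t)) for t in merged_pr_titles)
    return 100 * failures / len(merged_pr_titles)

print(change_failure_rate([
    "Add billing page",
    "Hotfix: null check on invoice totals",
    "Revert #123",
    "Update docs",
]))  # 50.0
```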

⭐️ Mean Time To Restore

🌱 Coming soon! This is a measure of how quickly an organization recovers from a production failure. It’s the last of the four main DORA metrics.

People Metrics

As we noted above, productivity is about more than just speed and output. Flow metrics alone cannot capture critical dimensions like employee satisfaction, well-being, retention, collaboration, and knowledge-sharing – which is why we complement our process metrics with people metrics on well-being and collaboration.

Wellbeing

In this group, we look at measures that reflect how well the people on a team are doing. Burnout is a huge issue in tech companies, with 60% of tech workers reporting that they’re burned out – and the COVID pandemic has only exacerbated this. That’s why we look at indicators of how sustainably people are working and how well the work environment supports people to be healthy and well.

Out-of-Hours Work

  • What it is: This measure shows how often people are working outside of their own preferred working hours. Given that more and more people are working flexible hours, our metric is configurable for different timezones and different preferred working hours and days.
  • Why it matters: Working long hours is a risk factor for burnout. Moreover, the longer someone works, the harder it is for them to solve challenging problems: a study from the Wharton School of Business and University of North Carolina demonstrated that our cognitive resources deplete over time, so we need breaks to refuel. At Multitudes, we’ve seen that the faster a team’s Time to Merge, the higher their Out-of-Hours Work is likely to be – so it’s important for teams and leaders to keep an eye on both metrics together, so they don’t over-optimize for speed and then burn out their team.
  • How we calculate it:  We look at the number of commits that people make outside of their usual working hours. By default, this is set to 8am-6pm, Monday to Friday, in each team member’s local time. This can be individually configured in Settings to account for different working hours and days. (There’s a rough sketch of this calculation after this list.)
  • What good looks like: On average over time, this should be 0, with people doing as little work out of hours as possible. If it does rise above 0, it’s important to ensure that it doesn’t become a trend, so that people aren't doing sustained stretches of long hours. Multiple weeks in which someone makes more than 10 out-of-hours commits might warrant some rebalancing of work or stricter prioritization!
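
For illustration, here’s a minimal Python sketch of the out-of-hours check for one person’s configured timezone and working hours. The defaults and the exact boundary handling are assumptions made for this sketch.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

def out_of_hours_commits(commit_times, tz, start_hour=8, end_hour=18,
                         working_days=(0, 1, 2, 3, 4)):  # Mon-Fri
    """Count commits made outside someone's configured working hours (sketch)."""
    local_tz = ZoneInfo(tz)
    count = 0
    for committed_at in commit_times:  # timezone-aware datetimes
        local = committed_at.astimezone(local_tz)
        in_hours = (local.weekday() in working_days
                    and start_hour <= local.hour < end_hour)
        if not in_hours:
            count += 1
    return count

commits = [
    datetime(2024, 5, 1, 22, 30, tzinfo=timezone.utc),  # 10:30am Thursday in Auckland: in hours
    datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc),    # 9pm Wednesday in Auckland: out of hours
]
print(out_of_hours_commits(commits, tz="Pacific/Auckland"))  # 1
```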

Collaboration

We also look at several indicators of collaboration. In this bucket, we’re examining who gets support and who’s not getting enough support. We also show the people who are doing a lot of work to support others. This type of “glue work” is easy to miss but is important for team success and benefits the whole organization.

These metrics show patterns in comments on GitHub. To see review patterns, you can turn on the Show reviews only filter; this will show only reviews with at least 1 comment, rather than all comments.

PR Participation Gap

  • What it is: This shows the absolute difference between the most and least frequent commenters on the team.
  • Why it matters: This measure shows how imbalanced team participation is in reviews and comments. More balanced participation is a behavioral indicator of psychological safety, which Google’s Project Aristotle research showed is the number one determinant of team performance.
  • How we calculate it:  We count the number of comments that each person has written and then show the range from the highest count to the lowest count. (There's a rough sketch of this calculation after this list.)
    - We exclude team members who wrote zero comments, because sometimes teams will have a few team members who are not on GitHub often, but included in the data.
    - We can only calculate this for teams with at least 2 people; for a team of one person, there is no gap to calculate.
  • What good looks like: The smaller the gaps are, the better – a smaller gap means that people are contributing more equally. Looking at distributions of participation gaps each week across various teams and organizations, we found that a threshold difference of 25 comments would be a reasonably realistic goal for most teams.
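
Here’s a minimal sketch of the gap calculation. It takes one login per comment, so team members with zero comments never appear, matching the exclusion rule above.

```python
from collections import Counter

def participation_gap(comment_authors):
    """Gap between the most and least frequent commenters on a team (sketch)."""
    counts = Counter(comment_authors)  # one login per comment
    if len(counts) < 2:
        return None  # need at least two commenters for a gap to exist
    return max(counts.values()) - min(counts.values())

print(participation_gap(["ana", "ana", "ana", "ben", "cli", "cli"]))  # 2
```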

PR Feedback Given

  • What it is: The number of comments written on PRs.
  • Why it matters: This visualizes who is giving the most support, since PR reviews and comments are a way to share knowledge and to encourage growth and learning opportunities. Giving feedback on PRs can be an example of glue work, the somewhat-invisible work that people do to lift up others on the team; our goal is to make this work more visible and valued on teams.
  • How we calculate it:  The total number of comments written on PRs, including comments on one's own PR. We include comments on your own PR because they are often in response to a reviewer question, so these can also contribute to learning and knowledge-sharing on the team.
  • What good looks like: While written communication styles differ between individuals, if a team does their code reviews on GitHub, then 10 comments per person per week is a good benchmark to hit. This is based on research from our own data, looking across 6 person-weeks of data for 10 randomly sampled orgs in the Multitudes dataset.

    Note that the trends we expect will vary by seniority. Senior engineers are expected to give more feedback than juniors, to share their knowledge across the team. However, juniors have a lot to offer in code reviews too, via a fresh perspective and clarifying questions (more here about why it’s important to include juniors in code reviews). That’s why we still recommend teams aim for more balanced participation across the team – it’s always good to make sure that your juniors feel comfortable speaking their mind and asking questions during code review.

PR Feedback Received

  • What it is: The number of comments received on PRs.
  • Why it matters:  Research shows that code review is important for knowledge-sharing and collaborative problem-solving; this metric helps you ensure that everyone on the team is getting the support and feedback they need. While this is crucial for juniors, continual learning and growth matters for seniors too. For an example, see this success story on how one of our customers increased how much feedback seniors were getting from their peers. In addition, there’s also bias in who gets good feedback. Specifically, people from marginalized groups are more likely to get less feedback, and lower-quality feedback. This is why it's important to have data to make sure everyone on the team is getting the support they need.
  • How we calculate it:  The total number of comments written on the PRs that you've authored, excluding comments you've written on your own PR (since you don't give feedback to yourself). (There's a rough sketch covering both Feedback Given and Feedback Received after this list.)
  • What good looks like: Similarly to PR Feedback Given, our benchmarks show that it’s good to aim for at least 10 comments per week to each person on the team. This is based on research from our own data, looking across 6 person-weeks of data for 10 randomly sampled orgs.

    Also, there are nuances – for example, juniors might receive more feedback than seniors.

    We recommend you use this data to focus on outliers. Someone getting very little feedback might not be getting enough support on their work. Someone getting lots of feedback might feel overwhelmed or could be the target of nitpicking.
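
Here’s a rough Python sketch covering both PR Feedback Given and PR Feedback Received, using hypothetical comment fields rather than the raw GitHub API shape.

```python
from collections import Counter

def feedback_given_and_received(comments):
    """Per-person counts of PR comments given and received (sketch).

    Each comment is a dict with hypothetical fields:
      author    - who wrote the comment
      pr_author - who authored the PR it was left on

    Given:    every comment a person wrote, including on their own PRs.
    Received: comments on a person's PRs, excluding their own comments."""
    given, received = Counter(), Counter()
    for comment in comments:
        given[comment["author"]] += 1
        if comment["author"] != comment["pr_author"]:
            received[comment["pr_author"]] += 1
    return given, received

comments = [
    {"author": "ben", "pr_author": "ana"},  # ben reviews ana's PR
    {"author": "ana", "pr_author": "ana"},  # ana replies on her own PR
    {"author": "ana", "pr_author": "ben"},  # ana reviews ben's PR
]
given, received = feedback_given_and_received(comments)
print(given)     # Counter({'ana': 2, 'ben': 1})
print(received)  # Counter({'ana': 1, 'ben': 1})
```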

Feedback Flows

  • What it is: This graph shows how much feedback each person gave on other people’s PRs, how much feedback they got on their own PRs, and how feedback flows between people.
  • Why it matters:  The top benefits of code reviews are improving code quality, knowledge-transfer, and learning. Moreover, there’s bias in who gets good feedback. Visualizing feedback flows can show us whether there are silos, and how we’re doing across the team at supporting each other.
  • How we calculate it:  We look at the number of comments and reviews that each person (or team) gave and received on their PRs. We then show how the feedback moves across people and teams.
  • What good looks like: In the best teams, everyone is giving feedback and everyone is receiving feedback, or at least asking questions about others’ work. In these teams, seniors give plenty of feedback to juniors and intermediates – and juniors and intermediates feel comfortable asking questions to seniors.

Empower your engineering managers to build the best team.

See Multitudes in action