At Multitudes, we’ve spent a lot of time thinking about indicators of team success. Everything we measure in our tool is based on conversations with experts on engineering management, research on how to support DEI at work (diversity, equity, and inclusion), and metrics frameworks like DORA and SPACE. In addition, our team has worked as developers, data scientists, engineering leaders, and coaches for engineering teams. Finally, equity and inclusion is at the heart of all that we do; our CEO, Lauren Peate, ran a diversity, equity, and inclusion consultancy before starting Multitudes, and our whole team is committed to unlocking the collective power of teams – a key part of which is ensuring that those teams are equitable.
Our focus is on how to show the holistic view of team delivery, because productivity is about more than just speed and output. As the recent paper on SPACE metrics points out, metrics signal what is important to an organization – and flow metrics alone cannot capture critical dimensions like employee satisfaction, well-being, retention, collaboration, and knowledge-sharing. This is why we provide not only all four DORA metrics but also people metrics that look at wellbeing and collaboration.
Read on for a deep dive into our metrics – what they are, why they matter, how we measure them, and how you can get the most out of them for your teams.
To see our metrics in action, try it out for yourself – you can sign up for our beta program here!
⭐️ A star indicates that the metric is one of the 4 Key Metrics published by Google's DevOps Research and Assessment (DORA) team.
👥 A two-person icon indicates that we only show this metric at a team level, not an individual level (e.g., PRs are a team sport!)
What good looks like
As you'll see below, our metrics show pre-defined benchmarks based on internal and external research. You can read more about the research behind these benchmarks in each metric section below. That said, because each team is different, we allow teams to customize targets.
Process Metrics
Our process metrics cover the most common flow and velocity metrics, including most of the DORA metrics. We also aim to make our metrics as actionable as possible, so we try to show possible causes of trends, like bottlenecks during reviews, large PR sizes, and whether the team's focus is aligned with the highest-priority work. Complementing speed of delivery, we're also interested in quality – how often bugs are released, and how quickly systems are restored after a failure.
Flow of Work
We have several analyses that look at the flow of the work – things like how quickly the team collaborates to deliver work, where delivery is flowing smoothly, and where it’s getting blocked.
⭐️ 👥 Lead Time
What it is: This is DORA's Lead Time for Changes (abbreviated to Lead Time in the app), a metric that shows how long it takes for a team to get a piece of work production-ready. It’s an indicator of how long it takes to deliver value to customers.
Why it matters: Lead Time is one of the top four indicators of software team performance, according to Google's DevOps Research and Assessment (DORA) team. Their research shows that a faster Lead Time is correlated with better business outcomes. Specifically, teams with a faster Lead Time do work that is better, more stable, and more secure. If you want to dive deeper into this, check out the Accelerate book. Note that this is closely related to Cycle Time; we measure Lead Time since that's recommended by DORA.
How we calculate it: To calculate Lead Time, we measure the number of hours from the first new commit on a pull request’s (PR’s) branch to PR merge – this shows how long it takes the team to write code, request and give feedback, make revisions, and then merge the PR.
What good looks like: Google's DORA research shows that elite performers have a Lead Time of less than 24 hours.
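If you want to sanity-check this against your own data, the core calculation is simple. Here is a minimal Python sketch, assuming you have already pulled the first-commit and merge timestamps for a PR; the names are ours for illustration, not Multitudes' actual schema.

```python
from datetime import datetime, timezone

def lead_time_hours(first_commit_at: datetime, merged_at: datetime) -> float:
    """Hours from the first new commit on a PR's branch to PR merge."""
    return (merged_at - first_commit_at).total_seconds() / 3600

# Example: a PR whose first commit was pushed Monday 09:00 UTC and merged Tuesday 14:30 UTC
first_commit = datetime(2024, 3, 4, 9, 0, tzinfo=timezone.utc)
merged = datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc)
print(f"Lead Time: {lead_time_hours(first_commit, merged):.1f} hours")  # 29.5 hours
```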
Additional calculation notes for Lead Time & subsets
Here are a few additional notes that affect the calculations for Lead Time and/or its components, Coding Time, Review Wait Time, and Editing Time.
These exclusions apply to all four metrics:
Our focus is on PRs that the team collaborated on, so we exclude bot merges and selfie-merges (PRs merged by the PR author, with no comments or reviews by other collaborators).
If you like, you can choose to exclude weekend hours from these calculations; simply toggle on “Exclude Weekend Hours”.
For just Lead Time and Coding Time, please note:
When you first join Multitudes, your historical data (the first 6 weeks) will show a lower time metric if your company does a lot of rebasing. This is because we can’t get original commits from the historical data in GitHub, so the rebased commit is taken as the first commit.
Once you integrate, we get events data from GitHub. This means we will get the original commits that are pushed to GitHub, even if your teams rebase or squash the commits later. Therefore, you might notice that your metrics are higher after the time that you onboard onto Multitudes, compared to your historical data.
How Lead Time is broken into its subsets
Lead Time is broken up into its three components, Coding Time, Review Wait Time, and Editing Time. Where these start and end can depend on the events in each PR's life cycle. Here is a typical PR timeline. Click the dropdown below it for more scenarios.
Click for more PR life cycle scenarios
👥 Coding Time
What it is: This shows how long the team spends writing code before asking for feedback.
Why it matters: Coding Time represents the first part of the development cycle in a team, and it can be a bottleneck that increases Lead Time. There are many reasons why Coding Time might be high, e.g. poorly-scoped work, external interruptions leading to less focus time, or a complicated piece of work.
How we calculate it: This measures the number of hours from the first new commit on a PR to when the PR is ready for review.
What defines the first new commit on a PR?
We exclude commits that were created earlier on other branches, and then pulled in to the PR’s head branch.
We take into account all commits that were pushed to GitHub, even if they are later squashed in a rebase. This means that even if you squash all your commits on a PR down to 1 commit before merging, we will still use the timestamp of your first original commit as the start of coding time, as long as the original commit was pushed previously.
However, if you squash your commits locally before pushing them to GitHub, we will only have data about the newly squashed commits.
What happens if first new commit time is after PR creation time?
If the first new commit on a PR comes after PR creation time, then the PR creation time is taken as the start of coding time, rather than the time of first commit.
This is so that Coding Time can capture the entire “draft time”. It makes sense to include the time that the PR spends in "draft" in this measure of time spent coding.
If the PR was created in a non-draft state, Coding Time is null. This is because it means the PR was ready for review upon creation, and Review Wait Time (the next metric in the PR life cycle) starts at the point where the PR is first ready for review.
See here for additional notes on how this metric is calculated, from Lead Time.
What good looks like: We recommend that Coding Time be under 4 hours. This threshold is based on an internal analysis conducted by Multitudes across 80,000 PRs from a diverse range of customers and comparing against the SPACE and DORA research.
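To make the branching rules above concrete, here is a rough Python sketch of one way to read them, assuming you have the relevant PR timestamps and draft flag to hand. The parameter names are illustrative, and this simplifies the exclusions described earlier.

```python
from datetime import datetime
from typing import Optional

def coding_time_hours(
    first_new_commit_at: Optional[datetime],
    pr_created_at: datetime,
    created_as_draft: bool,
    ready_for_review_at: Optional[datetime],
) -> Optional[float]:
    """Hours spent coding before the PR is ready for review (a simplified reading).

    - End of coding time: the "ready for review" event for draft PRs,
      or PR creation for PRs opened in a ready state.
    - Start of coding time: the first new commit, unless it lands after
      PR creation, in which case PR creation is used instead.
    - Returns None when the PR was opened ready for review before any new
      commit was pushed (it was reviewable from the moment it existed).
    """
    end = ready_for_review_at if created_as_draft else pr_created_at
    if end is None:
        return None  # still in draft, nothing to measure yet

    if first_new_commit_at is not None and first_new_commit_at <= pr_created_at:
        start = first_new_commit_at
    elif created_as_draft:
        start = pr_created_at  # capture the whole draft period
    else:
        return None  # opened ready for review, first new commit came later

    return (end - start).total_seconds() / 3600
```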
Review Wait Time
What it is: This shows how long people wait to get feedback on their PRs.
Why it matters: This is one possible bottleneck for Lead Time. When people have to wait longer for feedback, it can disrupt their workflow. They’re more likely to start a new piece of work while waiting for feedback. When the feedback does arrive, they have to context-switch, making it harder for them to remember what they did. This often means each task takes longer to complete (for example, one study showed that it takes 10-15 minutes to get back into context).
Moreover, there’s bias in how long different groups of people have to wait for feedback. For example, this research showed that women had to wait longer than men for feedback. This is why we do show this metric at the individual level — so that you can make sure that everyone is receiving feedback in a timely manner.
How we calculate it: We measure the number of hours from PR creation until the PR gets feedback. This could be a comment, review, or merge by someone other than the PR author. It excludes time that the PR spends in a draft state, since the draft state indicates that the PR author is still finishing the work. To be clear on some nuances:
Review Wait Time is null if the PR has no feedback. It ignores responses from bots and responses that came in after the merge (since we exclude selfie merges).
See here for some additional calculation notes that apply from Lead Time.
What good looks like: We recommend that Review Wait Time be under 4 hours. This threshold is based on an internal analysis conducted by Multitudes across 80,000 PRs from a diverse range of customers and comparing against the SPACE and DORA research.
👥 Editing Time
What it is: This metric shows how long code takes to get merged once feedback has been received.
Why it matters: As a measure of back-and-forth between the code author and those who are reviewing the code, Editing Time is important for understanding bottlenecks in Lead Time. A high Editing Time could mean that the team needs to improve how they scope work, the feedback received is confusing, the PRs being created are large, or there are other distractions preventing fast iteration. A low Editing Time indicates that the team is able to quickly action feedback and ship work once it has been reviewed.
How we calculate it: We measure the number of hours from first feedback on the PR to PR merge, i.e. the back-and-forth editing time. If there was no response before the merge, Editing Time is null. See here for some additional calculation notes that apply from Lead Time.
What good looks like: We recommend that Editing Time be under 16 hours. This threshold is based on an internal analysis conducted by Multitudes across 80,000 PRs from a diverse range of customers and comparing against the SPACE and DORA research.
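Because Review Wait Time and Editing Time split the rest of a PR's life at the first-feedback event, one sketch can illustrate both. This is a simplified Python reading of the rules above, with illustrative field names and without all of the exclusions that apply from Lead Time.

```python
from datetime import datetime
from typing import Optional

def first_feedback_at(
    feedback_events: list[dict],  # e.g. [{"actor": "sam", "at": datetime(...)}, ...]
    pr_author: str,
    merged_at: datetime,
    bots: set[str],
) -> Optional[datetime]:
    """Earliest comment/review/merge by someone other than the author, up to the merge."""
    times = [
        e["at"]
        for e in feedback_events
        if e["actor"] != pr_author and e["actor"] not in bots and e["at"] <= merged_at
    ]
    return min(times) if times else None

def review_wait_time_hours(
    ready_for_review_at: datetime, feedback_at: Optional[datetime]
) -> Optional[float]:
    """Hours from 'ready for review' (draft time excluded) to first feedback; None if no feedback."""
    if feedback_at is None:
        return None
    return (feedback_at - ready_for_review_at).total_seconds() / 3600

def editing_time_hours(feedback_at: Optional[datetime], merged_at: datetime) -> Optional[float]:
    """Hours of back-and-forth from first feedback to merge; None if no feedback before merge."""
    if feedback_at is None:
        return None
    return (merged_at - feedback_at).total_seconds() / 3600
```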
👥 PR Size
What it is: How large your team's PRs are. We show two representations of this — Lines Changed and Files Changed.
Why it matters: This is another possible bottleneck for Lead Time. We know that large PRs are harder to review, test, and manage in general. It is now generally accepted that keeping PR size down is best practice for faster reviews, fewer merge conflicts (and therefore easier collaboration), and simpler rollbacks if required. Learn more in this 2017 paper by Microsoft and the University of Victoria, and in Google’s own internal guidelines (they say “changelist” rather than “pull request”).
How we calculate it: We show the median of the lines of code or the number of files changed per PR depending on the option selected. We chose to provide 2 options here (instead of just lines of code) so you can get a more well-rounded view of the overall size. We recognise that these are both simple measures of "PR Size" which don't take into account edge cases such as lock files or automated formatters (examples where PR size may be large, but the PR is still easy to review and manage). However, in the majority of cases, the number of lines or files changed is a reasonable indicator of how long a PR may take to get merged.
What good looks like: Many organizations like to enforce maximum limits on the lines of code (LOC) changed per PR, generally ranging from around 200 to 400. This study also found that PRs should be limited to 200-400 LOC; beyond that, the ability to effectively capture defects goes down. So we recommend keeping LOC under 300 as a good middle ground.
Files changed varies – you can have a small number of LOC changed across many files, and it'd still be fairly easy to review. In our teams, we try to keep it under 10 files changed.
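For illustration, the median-based calculation behind this chart looks roughly like the sketch below; the numbers are made up.

```python
from statistics import median

# Illustrative per-PR data: lines changed = additions + deletions
prs = [
    {"lines_changed": 120, "files_changed": 4},
    {"lines_changed": 340, "files_changed": 11},
    {"lines_changed": 60, "files_changed": 2},
]

median_lines = median(pr["lines_changed"] for pr in prs)  # 120
median_files = median(pr["files_changed"] for pr in prs)  # 4
print(f"Median PR size: {median_lines} lines across {median_files} files")
```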
Value Delivery
⭐️ 👥 Merge Frequency
What it is: This is our take on DORA's Deployment Frequency. It shows the median number of PRs merged per person on a team, over time.
Why it matters: This is an indicator of the value we're providing to customers, because it shows the volume of work being released to production.
How we calculate it: We count the number of PRs merged in each time period, divided by the number of people on the team. This normalization is to allow benchmarking of teams against industry standards, regardless of team size. We exclude PRs authored by bots.
What good looks like: Google suggests that elite teams should be deploying multiple times a day. If we call that one deployment per day per team, that’s 5 deploys per week in a 5-day workweek. Dividing this by a rough approximation of team size (around 5 developers), and taking into account the fact that there's sometimes more than one PR included in a single deploy (for major features, it could be best practice to collect up lots of changes into a release branch), we recommend keeping this metric over 2 PRs merged per person per week.
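As a rough sketch of the weekly calculation, with made-up authors and an assumed bot list:

```python
# Illustrative: PRs merged by a 4-person team over one week, excluding bot-authored PRs
merged_pr_authors = ["ana", "ana", "bo", "chen", "dev", "bo", "renovate[bot]", "ana", "dev"]
bots = {"renovate[bot]"}
team_size = 4

human_merges = [author for author in merged_pr_authors if author not in bots]
merge_frequency = len(human_merges) / team_size
print(f"{merge_frequency:.1f} PRs merged per person this week")  # 2.0
```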
👥 Types of Work
What it is: This shows how many issues the team completed and what types of work they did.
Jira: this chart shows the number of issues moved to a Resolved status per week, broken down by Issue Type, specifically Story, Bug, or Task.
Linear: this chart shows the number of issues moved to Done per week, broken down by Project.
For both, you can hover over a specific section of the bar chart to get more details on how many tickets were dedicated to that type of work.
Why it matters: This metric gives you visibility over team velocity and how the team’s work was spread across different types of work. If a team is struggling to get their planned feature work done, this is a useful chart to consult to see what could be getting in the way and to understand whether the types of issues completed align with what was planned.
When people are interrupted on a project, it can take up to 23 minutes to get back on track (e.g., fully shift their thinking, remember where they left off, etc.). The more projects an individual holds, the more they therefore have to “context switch”, which can reduce overall productivity, while also increasing feelings of stress and frustration.
Across the team, the cost can really add up; one academic study found that developers working on 2+ projects spend 17% of their development effort on managing interruptions. Did the team have enough time for feature work or did bug work get in the way? In one survey of ~1000 developers, 44% said bugs were a key pain point in their day-to-day work and a main reason deployments were slow.
How we calculate it:
Jira: we count the number of issues moved to “Done” or a custom status in the Resolved category and color code by Issue Type
For example: if you have custom Jira statuses like “Testing”, “In Staging”, “Ready for Release”, and “Released”, and the last 2 statuses are both in the Resolved category, then we count the number of issues moved either to “Ready for Release” or “Released”
Linear: we count the number of issues moved to Done and color code by Project
With a Linear integration, you can create Projects to customize how you visualize the work that’s being completed. For example, you might create a Project called Unplanned and then see how much of your team’s work this takes up.
What good looks like: This depends on your team and product priorities. Many teams value consistency week-to-week, since it helps with their planning. It can also be helpful to watch for increases in bug work, since that can decrease the team’s time for feature work. Overall, the goal of this chart is to make sure your team is working on the most important thing(s) and getting work done at a reasonable pace.
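To illustrate the counting for a Jira-style setup (the Linear version groups by Project instead of Issue Type), here is a minimal sketch with made-up issue keys:

```python
from collections import Counter

# Illustrative issues that reached a Resolved-category status this week
resolved_issues = [
    {"key": "ENG-101", "issue_type": "Story"},
    {"key": "ENG-102", "issue_type": "Bug"},
    {"key": "ENG-103", "issue_type": "Task"},
    {"key": "ENG-104", "issue_type": "Story"},
]

by_type = Counter(issue["issue_type"] for issue in resolved_issues)
print(by_type)  # Counter({'Story': 2, 'Bug': 1, 'Task': 1})
```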
👥 Feature vs Maintenance Work
What it is: This shows the relative percentage of either Feature or Maintenance issues completed. For classification details, see How we calculate it below.
Why it matters: Delivery is about managing the balance between shipping new features and maintaining existing systems. If you neglect maintenance, your codebase and systems can “rot”, slowing down delivery of new features and degrading site reliability. On the other hand, spending too much time on maintenance can cause the team to miss delivery targets.
This is why visibility over where your team is spending their time is important, to make sure that the balance reflects your priorities.
How we calculate it:
Jira: we count the number of issues moved to a status in the Resolved category (more detail on what gets counted). We color code by Issue Type:
Issues with an Issue Type that contains the string bug, tech debt, or chore are shown on the graph in different shades of purple, to indicate that they’re all types of Maintenance work.
All other issues are grouped into the green Feature type.
Linear: we count the number of issues moved to Done (more detail on what gets counted). We color code by Project:
Projects that contain the string bug, tech debt, or chore are shown on the graph in different shades of purple, to indicate that they’re all types of Maintenance work.
All other projects are grouped into the green Feature type.
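The string-matching rule above boils down to something like the following sketch; treating the match as case-insensitive is our assumption here.

```python
MAINTENANCE_KEYWORDS = ("bug", "tech debt", "chore")

def work_category(label: str) -> str:
    """Classify a Jira Issue Type or Linear Project name as Feature or Maintenance."""
    lowered = label.lower()
    if any(keyword in lowered for keyword in MAINTENANCE_KEYWORDS):
        return "Maintenance"
    return "Feature"

print(work_category("Bug"))           # Maintenance
print(work_category("Tech Debt Q3"))  # Maintenance
print(work_category("Checkout v2"))   # Feature
```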
What good looks like: Many teams set aside an upfront “tech debt budget” or “maintenance budget” when planning upcoming work. Many will allocate 10-20% for maintenance, but this depends on the team. For example, teams focused on maintaining legacy code might budget 50% of story points or issues to maintenance work. Another approach is to allocate specific days, such as 1 day every week or fortnight. To learn more, check out this article on how to define and spend your tech debt budget and this one on reclaiming tech equity. Once you have defined a budget, it’s easy to use this chart to track real-world “spend”.
Quality of Work
⭐️ 👥 Change Failure Rate
What it is: The percentage of PRs merged that indicate that some kind of failure was released and had to be fixed.
Why it matters: This is our take on DORA's Change Failure Rate, which indicates the percentage of deployments that cause a failure in production. It's a lagging indicator of the quality of software that is being deployed - how often does it contain bugs that cause failures later?
How we calculate it: We calculate the % of merged PRs that contain any of the words rollback, hotfix, revert, or [cfr] in the PR title, out of all merged PRs. We tested and chose these keywords to catch most cases where a PR was required to fix a previously-released bug, while minimizing false positives.
We recognise that this proxies failures after the fact; this is because it's not actually possible to know if someone's releasing a failure into production in the moment, otherwise it wouldn't have been released in the first place! Also, incidents are not always easily tied to a specific PR or deployment. You can include the square-bracketed keyword [cfr] in your PR titles if you'd like more granular control over what gets counted in this chart.
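As a rough sketch of the keyword matching (we are assuming case-insensitive matching for illustration):

```python
import re

# Keywords that flag a merged PR as fixing a previously released failure
FAILURE_PATTERN = re.compile(r"rollback|hotfix|revert|\[cfr\]", re.IGNORECASE)

def change_failure_rate(merged_pr_titles: list[str]) -> float:
    """Percentage of merged PRs whose title contains a failure keyword."""
    if not merged_pr_titles:
        return 0.0
    failures = sum(1 for title in merged_pr_titles if FAILURE_PATTERN.search(title))
    return 100 * failures / len(merged_pr_titles)

titles = ["Add billing export", "Hotfix: null user id", "Revert #1042", "Improve docs"]
print(f"{change_failure_rate(titles):.0f}%")  # 50%
```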
⭐️ Mean Time to Restore
What it is: This is our take on DORA's Mean Time to Restore metric. It's a measure of how long it takes an organization to recover from an incident or failure in production. You will need to integrate with OpsGenie to get this metric. 🌱 Coming soon: We’re adding a PagerDuty integration, so PagerDuty teams can also get this graph.
Why it matters: This metric indicates the stability of your teams’ software. A higher Mean Time to Restore increases the risk of app downtime. This can further result in a higher Lead Time due to more time being taken up fixing outages, and ultimately impact your organization's ability to deliver value to customers. In this study by Nicole Forsgren (author of DORA and SPACE), high performing teams had the lowest times for Mean Time to Restore. The study also highlights the importance of organizational culture in maintaining a low Mean Time to Restore.
How we calculate it: To calculate Mean Time to Restore, we measure the time from when an incident was opened on OpsGenie, to the time when it was closed.
What good looks like: DORA research shows that elite performing teams have a Mean Time to Restore of less than 1 hour.
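A minimal sketch of the calculation, assuming you have each incident's opened and closed timestamps:

```python
from datetime import datetime, timezone
from statistics import mean

# Illustrative incidents: (opened_at, closed_at) pairs from an incident tool
incidents = [
    (datetime(2024, 3, 4, 10, 0, tzinfo=timezone.utc), datetime(2024, 3, 4, 10, 40, tzinfo=timezone.utc)),
    (datetime(2024, 3, 6, 22, 15, tzinfo=timezone.utc), datetime(2024, 3, 7, 0, 15, tzinfo=timezone.utc)),
]

restore_hours = [(closed - opened).total_seconds() / 3600 for opened, closed in incidents]
print(f"Mean Time to Restore: {mean(restore_hours):.2f} hours")  # 1.33 hours
```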
People Metrics
We understand that productivity is about more than just speed and output. As the recent paper on SPACE metrics points out, metrics signal what is important to an organization - and flow metrics alone cannot capture critical dimensions like employee satisfaction, well-being, retention, collaboration, and knowledge sharing. This is why we provide people metrics that look at well-being and collaboration, as well as our process metrics on flow of work, value delivery, and quality of work.
Wellbeing
In this group, we look at measures that reflect how well the people on a team are doing. Burnout is a huge issue in tech companies, with 60% of tech workers reporting that they’re burned out – and the COVID pandemic has only exacerbated this. That’s why we look at indicators of how sustainably people are working and how well the work environment supports people to be healthy and well.
Out-of-Hours Work
What it is: This measure shows how often people are working outside of their own preferred working hours. Given that more and more people are working flexible hours, our metric is configurable for different timezones and different preferred working hours and days.
Why it matters: Working long hours is a risk factor for burnout. Moreover, the longer someone works, the harder it is for them to solve challenging problems: a study from the Wharton School of Business and University of North Carolina demonstrated that our cognitive resources deplete over time, so we need breaks to refuel. At Multitudes, we’ve seen that the faster a team’s Lead Time, the higher their Out-of-Hours Work is likely to be – so it’s important for teams and leaders to keep an eye on both metrics together, so they don’t over-optimize for speed and then burn out their team.
How we calculate it: We look at the number of commits that people did outside of their usual working hours. By default, this is set to 8am-6pm, Monday to Friday, in each team member’s local time. This can be individually configured in Settings to account for different working hours and days.
What good looks like: On average over time, this should be 0, with people doing as little work out of hours as possible. If this does rise above 0, it’s important to ensure that it doesn’t become a trend, so that people aren't doing sustained days of long hours. If someone is making more than 5 out-of-hours commits per week for multiple weeks, it might warrant some rebalancing of work or stricter prioritization!
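For illustration, here is a rough sketch of the default working-hours check; the timezone handling and parameter names are ours, and the real configuration lives in Settings.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

WORK_START, WORK_END = 8, 18   # default working hours: 8am-6pm
WORK_DAYS = {0, 1, 2, 3, 4}    # Monday-Friday

def is_out_of_hours(commit_at_utc: datetime, author_tz: str) -> bool:
    """True if a commit falls outside the author's configured working hours."""
    local = commit_at_utc.astimezone(ZoneInfo(author_tz))
    outside_hours = not (WORK_START <= local.hour < WORK_END)
    return local.weekday() not in WORK_DAYS or outside_hours

# A commit at 08:30 UTC on a Friday is 21:30 local in Auckland, so it counts as out-of-hours
commit = datetime(2024, 3, 8, 8, 30, tzinfo=timezone.utc)
print(is_out_of_hours(commit, "Pacific/Auckland"))  # True
```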
Collaboration
We also look at several indicators of collaboration. In this bucket, we’re examining who gets support and who’s not getting enough support. We also show the people who are doing a lot of work to support others. This type of “glue work” is easy to miss but is important for team success and benefits the whole organization.
These metrics show patterns in comments on GitHub. To see review patterns, you can turn on the Show reviews only filter; this will show only reviews with at least 1 comment, rather than all comments.
PR Participation Gap
What it is: This shows the absolute difference between the most and least frequent commenters on the team.
Why it matters: This measure shows how imbalanced team participation is in reviews and comments. More balanced participation is a behavioral indicator of psychological safety, which Google’s Project Aristotle research showed is the number one determinant of team performance.
How we calculate it: We count the number of comments that each person has written and then show the range from the highest count to the lowest count.
- We exclude team members who wrote zero comments, because sometimes teams will have a few team members who are not on GitHub often but are included in the data.
- We can only calculate this for teams with at least 2 people; for a team of one person, there is no gap to calculate.
What good looks like: The smaller the gaps are, the better – a smaller gap means that people are contributing more equally. Looking at distributions of participation gaps each week across various teams and organizations, we found that a threshold difference of 25 comments would be a reasonably realistic goal for most teams.
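The gap itself is just the spread of the non-zero comment counts; here is a tiny sketch with made-up numbers:

```python
# Illustrative weekly comment counts per team member (zero-commenters are excluded)
comment_counts = {"ana": 18, "bo": 7, "chen": 3, "dev": 0}

nonzero = [count for count in comment_counts.values() if count > 0]
participation_gap = max(nonzero) - min(nonzero) if len(nonzero) >= 2 else None
print(participation_gap)  # 15
```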
PR Feedback Given
What it is: The number of comments written on PRs.
Why it matters: This visualizes who is giving the most support, since PR reviews and comments are a way to share knowledge and to encourage growth and learning opportunities. Giving feedback on PRs can be an example of glue work, the somewhat-invisible work that people do to lift up others on the team; our goal is to make this work more visible and valued on teams.
How we calculate it: The total number of comments written on PRs, including comments on one's own PR. We include comments on your own PR because they are often in response to a reviewer question, so these can also contribute to learning and knowledge-sharing on the team.
What good looks like: While written communication styles differ between individuals, for a team that does their code reviews on GitHub, 10 comments per person is a good benchmark to hit. This is based on research from our own data, looking across 6 person-weeks of data for 10 randomly sampled orgs in the Multitudes dataset.
Note that the trends we expect will vary by seniority. Senior engineers are expected to give more feedback than juniors, to share their knowledge across the team. However, juniors have a lot to offer in code reviews too, via a fresh perspective and clarifying questions (more here about why it’s important to include juniors in code reviews). That’s why we still recommend teams aim for more balanced participation across the team – it’s always good to make sure that your juniors feel comfortable speaking their mind and asking questions during code review.
PR Feedback Received
What it is: The number of comments received on PRs.
Why it matters: Research shows that code review is important for knowledge-sharing and collaborative problem-solving; this metric helps you ensure that everyone in the team is receiving the support and feedback that they need. While this is crucial for juniors, continual learning and growth matters for seniors too. For an example, see this success story on how one of our customers increased how much feedback seniors were getting from their peers. In addition, there’s also bias in who gets good feedback. Specifically, people from marginalized groups are more likely to get less feedback, and lower-quality feedback. This is why it's important to have data to make sure everyone on the team is getting the support they need.
How we calculate it: The total number of comments written on the PRs that you've authored, excluding comments you've written on your own PR (since you don't give feedback to yourself).
What good looks like: Similarly to PR Feedback Given, our benchmarks show that it’s good to aim for at least 10 comments per week to each person on the team. This is based on research from our own data, looking across 6 person-weeks of data for 10 randomly sampled orgs.
Also, there are nuances – for example, juniors might receive more feedback than seniors.
We recommend you use this data to focus on outliers. Someone getting very little feedback might not be getting enough support on their work. Someone getting lots of feedback might feel overwhelmed or could be the target of nitpicking.
Feedback Flows
What it is: This graph shows how much feedback each person gave on other people’s PRs, how much feedback they got on their own PRs, and how feedback flows between people.
Why it matters: The top benefits of code reviews are improving code quality, knowledge-transfer, and learning. Moreover, there’s bias in who gets good feedback. Visualizing feedback flows can show us whether there are silos, and how we’re doing across the team at supporting each other.
How we calculate it: We look at the number of comments and reviews that each person (or team) gave and received on their PRs. We then show how the feedback moves across people and teams.
What good looks like: In the best teams, everyone is giving feedback and everyone is receiving feedback, or at least asking questions about others’ work. In these teams, seniors give plenty of feedback to juniors and intermediates – and juniors and intermediates feel comfortable asking questions to seniors.
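For illustration, the underlying data is essentially a count of feedback events between each giver and PR author; here is a minimal sketch with made-up names:

```python
from collections import Counter

# Illustrative PR comments/reviews for one week: (feedback giver, PR author) pairs
feedback_events = [
    ("ana", "bo"), ("ana", "chen"), ("bo", "ana"),
    ("chen", "ana"), ("ana", "bo"), ("bo", "chen"),
]

flows = Counter(feedback_events)  # how feedback moves between people
given = Counter(giver for giver, _ in feedback_events)
received = Counter(author for _, author in feedback_events)

print(flows[("ana", "bo")])           # ana gave 2 pieces of feedback on bo's PRs
print(given["ana"], received["ana"])  # 3 given, 2 received
```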
Empower your engineering managers to build the best team.