What we measure and why

Update Dec 2022: Hi! This page is now deprecated. You can find ✨ the latest on What We Measure And Why in our help centre.

At Multitudes, we’ve spent a lot of time thinking about indicators of team success. Everything we measure in our tool is based on research and conversations with experts on engineering management, DEI, and metrics frameworks like DORA and SPACE. In addition, our team has worked as developers, data scientists, engineering leaders, and coaches for engineering teams. Finally, equity and inclusion are at the heart of all that we do; I ran a diversity, equity, and inclusion consultancy before starting Multitudes, and our whole team is committed to making equity the default at work.

We understand that productivity is about more than just speed and output. As the recent paper on SPACE metrics points out, metrics signal what is important to an organization - and flow metrics alone cannot capture critical dimensions like employee satisfaction, well-being, retention, collaboration, and knowledge sharing. This is why we provide people metrics that look at well-being and collaboration, as well as our process metrics on flow of work, value delivery, and quality of work.

Read on for a deep dive into our metrics – what they are, why they matter, how we measure them, and how you can get the most out of them for your teams.

To see our metrics in action, try it out for yourself – you can sign up for our beta program here!

⭐️ A star indicates one of the four Key Metrics published by Google's DevOps Research and Assessment (DORA) team.

Process metrics

Flow of Work

We have several analyses that look at the flow of the work – things like how quickly the team collaborates to deliver work, where delivery is flowing smoothly, and where it’s getting blocked.


⭐️ Time to Merge

👥 This metric can relate to performance. Since PRs are a team sport, we only show this at the team level, not the individual level.

A line graph at a team-level only showing time to merge is low but trending up.
  • What it is: This is our take on DORA's Lead Time, a metric that shows how long it takes a team to get a piece of work production-ready. It’s an indicator of how long it takes to deliver value to customers.
  • Why it matters: Our <code-text>Time to Merge</code-text> metric is a subset of <code-text>Lead Time</code-text> (the time from first code commit until the code is running in production). Research from Google’s DORA unit (the DevOps Research and Assessment unit) shows that better <code-text>Lead Time</code-text> is correlated with better business outcomes – specifically, teams with a better <code-text>Lead Time</code-text> do work that is faster, more stable, and more secure. If you want to dive deeper into this, check out the Accelerate book.
  • How we calculate it: To calculate <code-text>Time to Merge</code-text>, we measure the number of hours from pull request (PR) creation to merge – this shows how long it takes the team to give feedback, make revisions, and then merge the PR. A few additional notes (a rough sketch of the calculation follows this list):
    - Our focus is on PRs that the team collaborated on, so we exclude bot merges and selfie-merges (merges by the PR author with no comments or reviews by other collaborators).
    - We exclude the time that the PR spends in a draft state, since people use that feature to pair and brainstorm on work that’s not yet completed. 
  • What good looks like: The DORA research showed that elite performers have a lead time of less than one day. Since <code-text>Time to Merge</code-text> is a subset of <code-text>Lead Time</code-text>, we also recommend that you aim to keep <code-text>Time to Merge</code-text> to less than a day, i.e., less than 24 hours.
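
To make the exclusions concrete, here is a minimal Python sketch of the idea. The field names (created_at, merged_at, draft_duration, and so on) are illustrative assumptions, not our actual schema, and the real calculation also accounts for working hours and timezones.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical PR records -- field names are illustrative, not our actual schema.
prs = [
    {
        "author": "aroha",
        "created_at": datetime(2022, 11, 1, 9, 0),
        "merged_at": datetime(2022, 11, 1, 16, 30),
        "draft_duration": timedelta(hours=2),    # time the PR spent in a draft state
        "reviewers_or_commenters": {"sam"},      # collaborators other than the author
        "merged_by_bot": False,
    },
]

def time_to_merge_hours(pr):
    """Hours from PR creation to merge, excluding time spent as a draft."""
    elapsed = pr["merged_at"] - pr["created_at"] - pr["draft_duration"]
    return elapsed.total_seconds() / 3600

def is_collaborative_merge(pr):
    """Exclude bot merges and selfie-merges (no comments or reviews from others)."""
    others = pr["reviewers_or_commenters"] - {pr["author"]}
    return not pr["merged_by_bot"] and len(others) > 0

eligible = [time_to_merge_hours(pr) for pr in prs if is_collaborative_merge(pr)]
print(f"Median Time to Merge: {median(eligible):.1f} hours")  # 5.5 hours
```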


Review Wait Time

A line graph showing review wait time is low and trending well, nice!
  • What it is: This shows how long people wait to get feedback on their PRs.
  • Why it matters: This is one possible bottleneck for <code-text>Time to Merge</code-text>. When people have to wait longer for feedback, it can mess up their workflow: They’re more likely to start a new piece of work while waiting for feedback. When they get that feedback, they have to context-switch, making it harder to remember what they did and often stretching out how long each task takes (for example, one study showed that it takes 10-15 minutes to get back into context).

    Moreover, there’s bias in how long different groups of people have to wait for feedback. For example, this research showed that women had to wait longer than men for feedback. This is why we do show this metric at the individual level - so that you can make sure that everyone is receiving feedback in a timely manner.
  • How we calculate it: This metric is for teams where people will ping each other directly (e.g., on Slack) when a PR is ready for feedback. Specifically, this measures the number of hours from PR creation until the PR gets feedback – a comment, review, or merge by someone other than the PR author (a rough sketch follows below). It excludes time that the PR spends in a draft state, since the draft state indicates that the PR author is still finishing the work. Like <code-text>Time to Merge</code-text>, this measure excludes non-working hours, bot comments/reviews/merges, and selfie-merges.
  • What good looks like: Ideally, this should be well under one working day – e.g., less than 6 hours – so that the overall <code-text>Time to Merge</code-text> stays within 24 hours.
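
As a rough illustration of the "first feedback from someone else" rule, here is a sketch on hypothetical data. It skips the non-working-hours and bot filtering that the real metric applies.

```python
from datetime import datetime, timedelta

# Illustrative data only; real events would come from the GitHub API.
pr = {
    "author": "aroha",
    "created_at": datetime(2022, 11, 1, 9, 0),
    "draft_duration": timedelta(hours=1),
    # (timestamp, actor) for comments, reviews, and the merge
    "feedback_events": [
        (datetime(2022, 11, 1, 10, 15), "aroha"),  # author's own comment -- ignored
        (datetime(2022, 11, 1, 13, 45), "sam"),    # first feedback from someone else
    ],
}

def review_wait_time_hours(pr):
    """Hours until the first comment, review, or merge by someone other than the author.

    Simplified: does not subtract non-working hours or filter bot accounts.
    """
    others = [ts for ts, actor in pr["feedback_events"] if actor != pr["author"]]
    if not others:
        return None  # no feedback yet
    waited = min(others) - pr["created_at"] - pr["draft_duration"]
    return waited.total_seconds() / 3600

print(review_wait_time_hours(pr))  # 3.75
```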

PR Size

👥  This metric can relate to performance. Since PRs are a team sport, we only show this at the team level, not the individual level.

A line graph at a team-level only showing PR size is low and steady, great work! There's a tab to switch to Lines of Code.
  • What it is: How large your team's PRs are. We show two representations of this - <code-text>Lines Changed</code-text> and <code-text>Files Changed</code-text>.
  • Why it matters: This is another possible bottleneck for <code-text>Time to Merge</code-text>. We know that large PRs are harder to review, test, and manage in general. It is now generally accepted that keeping PR size down is best practice for faster, safer deployments, highly collaborative development, and easier rollbacks if required.
  • How we calculate it: We show the median of the lines of code or the number of files changed per PR, depending on the option selected (a rough sketch follows below). We recognise that this is a simple measure of "PR Size" which doesn't take into account edge cases such as lock files or automated formatters (examples where PR size may be large, but the PR is still easy to review and manage). However, in the majority of cases, the number of lines or files changed is a reasonable indicator of how long a PR may take to get merged.
  • What good looks like: Many organisations like to enforce maximum limits on the lines of code (LOC) changed per PR, generally ranging from around 200 to 400. This study also found that PRs should be limited to 200-400 LOC; beyond that, the ability to effectively capture defects goes down. So we recommend keeping LOC under 300.

    Files changed is a bit trickier – it really depends (you can have a small number of LOC changed across many files, and it'd still be fairly easy to review) – but we tend to say keeping it under 5-8 files changed is a good aim.
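
Here is a minimal sketch of the median-per-period idea, using made-up PR stats; lines_changed here is just additions plus deletions.

```python
from statistics import median

# Hypothetical PR stats for one time period -- made-up numbers.
prs = [
    {"lines_changed": 120, "files_changed": 3},
    {"lines_changed": 450, "files_changed": 9},
    {"lines_changed": 80,  "files_changed": 2},
]

def pr_size(prs, mode="lines"):
    """Median PR size for the period, by lines changed or files changed."""
    key = "lines_changed" if mode == "lines" else "files_changed"
    return median(pr[key] for pr in prs)

print(pr_size(prs, mode="lines"))   # 120
print(pr_size(prs, mode="files"))   # 3
```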

Value Delivery

⭐️ Merge Frequency

👥 This metric can relate to performance. Since PRs are a team sport, we only show this at the team level, not the individual level.

A line graph at a team-level only showing merge frequency is high but trending down - keep an eye on it.
  • What it is: This is our take on DORA's Deployment Frequency. It shows the median number of PRs merged per person on a team, over time.
  • Why it matters: This is an indicator of how much work is reaching customers, because it shows the volume of work being released to production.
  • How we calculate it: We count the number of PRs merged in each time period, divided by the number of people on the team (see the sketch below). This normalisation allows benchmarking of teams against industry standards, regardless of team size. We exclude PRs authored by bots.
  • What good looks like: Google suggests that elite teams should be deploying multiple times a day. This translates to more than 5 deploys per team per week (assuming a 5-day work week). Dividing this by a rough approximation of team size, and taking into account the fact that there's often more than one PR included in a single deploy, we'd recommend aiming to keep this metric over 2 PRs merged per person per week.
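
A rough sketch of the normalisation, on made-up merge records; the bot flag and the team size are assumptions for illustration, not how our pipeline is actually structured.

```python
from datetime import date

# Hypothetical merged PRs: (merge date, author, authored_by_bot) -- made-up data.
merged_prs = [
    (date(2022, 11, 7), "aroha", False),
    (date(2022, 11, 8), "sam", False),
    (date(2022, 11, 9), "dependabot[bot]", True),   # bot-authored, excluded
    (date(2022, 11, 10), "mei", False),
]
team_size = 3  # people on the team during this period

def merge_frequency(merged_prs, team_size):
    """Merged PRs per person for the period, excluding bot-authored PRs."""
    human_merges = [pr for pr in merged_prs if not pr[2]]
    return len(human_merges) / team_size

print(f"{merge_frequency(merged_prs, team_size):.1f} PRs merged per person")  # 1.0
```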

Types of Work

👥  This metric can relate to performance. Since PRs are a team sport, we only show this at the team level, not the individual level.

A stacked bar chart at a team-level only showing the types of work a team is working on. On hover, you can see details of how many issues per category were moved to done.
  • What it is: If you have integrated with Linear, this chart will show the number of issues moved to Done per week, color-coded by Project.
  • Why it matters: This metric gives you visibility over team velocity and focus. The height of the bar shows the overall number of tasks done, while the colors help you identify how the team's time was spent. Was there good focus on one or two key priorities, or were people scattered across a range of projects? What proportion of work done was bugs, chores, or tech debt?
  • How we calculate it: We count the number of issues moved to Done in a given time period, and color-code these by their Linear Project (a rough sketch of this grouping appears after the usage examples below).
    🌱 Coming soon: summing the number of story points completed!
  • What good looks like: This really depends on what you're trying to achieve in a given cycle. See below for a list of some examples of how you might use this chart.

How you might use this chart:

  • You want to make sure that your team is working on the most important thing. You can hover over a project of interest, to highlight how much of each time period was dedicated to that project. Maybe you can set a goal to focus on only 2-3 projects at a time, to get them over the line.
  • You want to minimise the amount of unplanned work that interrupts a cycle. You could create a project called <code-text>Unplanned</code-text> and see how much of your team's work this category takes up.
  • You want to keep an eye on the number of bugs you're working on. Create a project in Linear called <code-text>Bug</code-text> (or similar), add issues to it, and watch the chart display how many were completed over time.
  • You want to make sure that you're consistently setting aside some time to chip away at tech debt. Create a project in Linear called <code-text>Tech Debt</code-text> (or similar), add issues to it, and aim for a consistent number of <code-text>Tech Debt</code-text> issues being completed each time period.
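
For reference, here is a rough sketch of the grouping behind the chart: count Done issues per week and per project. The issue fields below are hypothetical and not Linear's actual API shape.

```python
from collections import Counter
from datetime import date

# Hypothetical Linear issues that reached "Done", with completion date and project name.
done_issues = [
    {"completed_at": date(2022, 11, 7), "project": "Checkout revamp"},
    {"completed_at": date(2022, 11, 8), "project": "Checkout revamp"},
    {"completed_at": date(2022, 11, 9), "project": "Tech Debt"},
    {"completed_at": date(2022, 11, 10), "project": "Unplanned"},
]

def types_of_work(done_issues):
    """Issues moved to Done per ISO week, broken down by project (one bar segment each)."""
    counts = Counter()
    for issue in done_issues:
        week = issue["completed_at"].isocalendar()[:2]  # (year, week number)
        counts[(week, issue["project"])] += 1
    return counts

for (week, project), n in sorted(types_of_work(done_issues).items()):
    print(week, project, n)
```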

Quality of Work

⭐️ Change Failure Rate

👥  This metric can relate to performance. Since PRs are a team sport, we only show this at the team level, not the individual level.

A line graph at a team-level only showing change failure rate is high but trending down, suggesting to keep an eye on it with eye emojis in the title.
  • What it is: The percentage of PRs merged that indicate that some kind of failure was released and had to be fixed.
  • Why it matters: This is our take on DORA's Change Failure Rate, which indicates the percentage of deployments that cause a failure in production. It's a lagging indicator of the quality of software that is being deployed - how often does it contain bugs that cause failures later?
  • How we calculate it: We calculate the % of merged PRs that contain any of the words <code-text>rollback</code-text>, <code-text>hotfix</code-text>, <code-text>revert</code-text>, or <code-text>[cfr]</code-text> in the PR title, out of all merged PRs (a rough sketch follows this list). We tested and chose these keywords to catch most cases where a PR was required to fix a previously-released bug, while minimising false positives.
    We recognise that this proxies failures after the fact - because it's not actually possible to know if someone's releasing a failure into production at the time, otherwise it wouldn't have been released in the first place! Also, incidents are not always easily tied to a specific PR or deployment.
    You can include the square-bracketed keyword <code-text>[cfr]</code-text> in your PR titles if you'd like more granular control over what gets counted in this chart.
  • What good looks like: Google suggests that elite teams should aim for a change failure rate of 0%-15%.
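
Here is a minimal sketch of the keyword check, on made-up PR titles; the real pipeline runs over the full set of merged PRs pulled from GitHub.

```python
FAILURE_KEYWORDS = ("rollback", "hotfix", "revert", "[cfr]")

# Hypothetical merged PR titles for one time period.
merged_pr_titles = [
    "Add pagination to the teams endpoint",
    "Hotfix: null check on merge timestamps",
    "Revert \"Upgrade payments SDK\"",
    "Improve onboarding copy",
]

def change_failure_rate(titles):
    """Share of merged PRs whose title suggests a fix for a previously released failure."""
    failures = sum(
        any(keyword in title.lower() for keyword in FAILURE_KEYWORDS)
        for title in titles
    )
    return failures / len(titles) if titles else 0.0

print(f"Change Failure Rate: {change_failure_rate(merged_pr_titles):.0%}")  # 50%
```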

⭐️ Mean Time To Restore

🌱 Coming soon!

People Metrics

Wellbeing

In this group, we look at measures that reflect how well the people on a team are doing. Building a great product is a marathon, not a sprint, so we look at indicators of how sustainably people are working and how well the work environment supports people to be healthy and well. 

Out-of-Hours Work

A line graph at a team-level only showing out of hours work is low and trending well, great work!
  • What it is: This measure shows how often people are doing work late at night or on weekends. Given that more and more people are working flexible hours, our metric specifically focuses on work done during “hours of concern” – work that people did in the wee hours of the morning or on a non-working day. 
  • Why it matters: Working long hours is a risk factor for burnout. Moreover, the longer someone works, the harder it is for them to solve challenging problems: a study from the Wharton School of Business and University of North Carolina demonstrated that our cognitive resources deplete over time, so we need breaks to refuel. At Multitudes, we’ve seen that the faster a team’s <code-text>Time to Merge</code-text>, the higher their <code-text>Out-of-Hours work</code-text> is likely to be – so it’s important for teams and leaders to keep an eye on both metrics together.
  • How we calculate it: We look at the number of commits that people did outside of their usual working hours. By default, this is set to 8am-6pm, Monday to Friday, in each team member’s local time. This can be individually configured in Settings (a rough sketch of the default rule follows this list).
  • What good looks like: This should ideally be 0, with people doing as little work out of hours as possible. If this does rise above 0, it’s important to ensure that it doesn’t become a trend so that people aren't doing sustained days of long hours. Sustained periods of more than 10 commits made out-of-hours per week per individual might warrant some rebalancing of work or stricter prioritization!
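
As a rough illustration, the default rule looks something like the sketch below. The commit timestamps are made up and assumed to already be in the author's local timezone.

```python
from datetime import datetime

# Hypothetical commit timestamps, already converted to the author's local time.
commits = [
    datetime(2022, 11, 7, 14, 30),   # Monday afternoon -- within working hours
    datetime(2022, 11, 8, 23, 10),   # Tuesday late night -- out of hours
    datetime(2022, 11, 12, 10, 0),   # Saturday -- out of hours
]

WORK_START, WORK_END = 8, 18  # default working hours: 8am-6pm
WORKDAYS = range(0, 5)        # Monday-Friday

def out_of_hours_count(commits):
    """Number of commits made outside configured working hours or on non-working days."""
    return sum(
        ts.weekday() not in WORKDAYS or not (WORK_START <= ts.hour < WORK_END)
        for ts in commits
    )

print(out_of_hours_count(commits))  # 2
```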

Collaboration

We also look at several indicators of collaboration. In this bucket, we’re examining who gets support and who’s not getting enough support. We also show the people who are doing a lot of work to support others, since this type of “glue” work is easy to miss but is critical for team success.

Most of these metrics are currently based on comments on GitHub. You can turn on the <code-text>Show reviews only</code-text> filter to only count the number of reviews with at least 1 comment, rather than all comments.


PR Participation Gap

A dumbbell chart at a team level showing that the PR participation gap is low and trending down. It's based on the difference from most to least # of comments written.
  • What it is: This shows the gap between the loudest and quietest voice within each team.
  • Why it matters: This measure shows whether everyone on the team is participating equally, an indicator of psychological safety. Google’s Project Aristotle research showed that psychological safety is the number one determinant of team performance, and that equal share of voice is a behavioral indicator of psychological safety. Our metric looks at this in practice: Does everyone have an equal share of voice in code reviews?
  • How we calculate it: We count the number of comments that each person has written and then divide the highest count by the lowest count (a rough sketch follows this list).
    - If one person on the team didn’t write any comments, then we set the gap equal to the highest number of comments that one person wrote (essentially setting the lowest number to 1, even if it was actually zero). This allows us to still show an indication of the magnitude of difference.
    - We can only calculate this for teams with at least 2 people; for a team of one person, there is no gap to calculate.
  • What good looks like: The smaller the gaps are, the better – a smaller gap means that people are contributing more equally. Our rule of thumb is to try to get this down to 15 or below.
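
A small sketch of the ratio, on made-up comment counts:

```python
# Hypothetical comment counts per team member for one time period.
comments_written = {"aroha": 30, "sam": 12, "mei": 0}

def participation_gap(comments_written):
    """Ratio of the highest to the lowest comment count on the team.

    If someone wrote no comments, the lowest count is treated as 1 so the gap
    still reflects the magnitude of the difference. Needs at least 2 people.
    """
    if len(comments_written) < 2:
        return None
    highest = max(comments_written.values())
    lowest = max(min(comments_written.values()), 1)
    return highest / lowest

print(participation_gap(comments_written))  # 30.0
```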


PR Feedback Given

A line graph showing PR feedback given, based on the total number of PR comments team members have given.
  • What it is: The number of comments written on PRs.
  • Why it matters: This visualises share of voice (an indicator of psychological safety, which is an essential ingredient for high team performance) in a more direct way. It's also a way to see who is giving the most support, since PR reviews and comments are a way to share knowledge and to encourage growth and learning opportunities.
  • How we calculate it: The total number of comments written on PRs, including comments on one's own PR (see the sketch below). We include these because they're often in response to a reviewer, which is an important part of the conversation.
  • What good looks like: While written communication style can obviously differ a lot between individuals, we would expect that a team that does their code reviews on GitHub would have a decent number of comments per team member. From looking across a selection of organisations, we found that a benchmark of around 10 comments per person per week was a good starting point for conversations around how current code review processes are going, and whether everyone feels that they have enough time to support others on their team through code review.

    We expect senior engineers to give more feedback than juniors, since they are often in a knowledge-sharing role. However, juniors often have a lot to offer in code review too, such as their fresh perspective and clarifying questions - we recommend checking in to make sure that your juniors feel comfortable speaking their minds and asking questions during code review.
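
For illustration, counting feedback given is roughly the sketch below; the (comment author, PR author) pairs are made up, and the real metric also applies the bot and reviews-only filters described above.

```python
from collections import Counter

# Hypothetical PR comments: (comment author, PR author) -- made-up data.
comments = [
    ("sam", "aroha"),
    ("sam", "aroha"),
    ("aroha", "aroha"),  # replying on your own PR still counts as feedback given
    ("mei", "sam"),
]

def feedback_given(comments):
    """Comments written per person, including comments on their own PRs."""
    return Counter(author for author, _pr_author in comments)

print(feedback_given(comments))  # Counter({'sam': 2, 'aroha': 1, 'mei': 1})
```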

PR Feedback Received

A line graph showing PR feedback received, based on the total number of PR comments team members have received.
  • What it is: The number of comments received on PRs.
  • Why it matters: Research has shown that code review is an important place for knowledge sharing and collaborative problem-solving, so it's important to ensure that everyone in the team is receiving the amount of support and feedback that they need. While this is crucial for juniors, continual learning and growth is important for seniors too. For an example, see this success story on how one of our customers increased how much feedback seniors were getting from their peers.

    In addition, there’s also unconscious bias that results in women often getting more vague, less actionable feedback than men. This is why it's important to have data that kickstarts conversations around the feedback and support that people are receiving.
  • How we calculate it: The total number of comments written on the PRs that you've authored, excluding comments you've written on your own PR (since you don't give feedback to yourself). A rough sketch follows at the end of this section.
  • What good looks like: Similarly to <code-text>PR Feedback Given</code-text>, you would generally expect juniors to receive more feedback than seniors, and for everyone to be receiving some amount of feedback.

    When using this data, you're often looking for outliers - anything that looks very low or very high. Someone receiving much less feedback might not be learning very much from the code review process, or they could be getting siloed from the rest of the team. Someone receiving much more feedback might feel overwhelmed, or could be the target of some nitpicking.

    These are all starting points for you to delve further into by bringing in the real-world context that you know about your team, and using these data points as conversation starters in 1:1s.
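
A rough sketch of the "exclude your own comments" rule, on the same made-up shape of data as the PR Feedback Given sketch above:

```python
from collections import Counter

# Hypothetical PR comments: (comment author, PR author) -- made-up data.
comments = [
    ("sam", "aroha"),
    ("aroha", "aroha"),  # own comment on own PR -- not counted as feedback received
    ("mei", "aroha"),
    ("aroha", "sam"),
]

def feedback_received(comments):
    """Comments received on each person's PRs, excluding their own comments."""
    return Counter(pr_author for author, pr_author in comments if author != pr_author)

print(feedback_received(comments))  # Counter({'aroha': 2, 'sam': 1})
```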

Feedback Flows

  • What it is: This graph shows how much feedback each person gave on other people’s PRs, how much feedback they got on their own PRs, and how feedback flows between people. 
  • Why it matters: Microsoft research showed that feedback in code reviews is important not only for improving the quality of the code, but also for knowledge transfer, greater team awareness, and problem-solving to identify other possible solutions. In addition, there’s also unconscious bias in what kind of feedback people get – and who gets actionable feedback or not. That’s why it’s important to visualize feedback, so that teams have clear data to help them make sure that they’re supporting everyone. 
  • How we calculate it: We look at the number of comments and reviews that each person gave on others’ PRs and received on their own. We then show how that feedback moves between people on the team (a rough sketch follows this list).
  • What good looks like: In the best teams, everyone is getting feedback and everyone is giving feedback, or at least asking questions about others’ work. In these teams, seniors give plenty of feedback to juniors and intermediates – and juniors and intermediates feel comfortable asking questions to seniors. 
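
As a rough sketch, the flows are just directed giver-to-receiver counts; the event data below is hypothetical, and the real chart draws these counts as flows between people.

```python
from collections import Counter

# Hypothetical feedback events (comments or reviews): (giver, receiver = PR author).
feedback_events = [
    ("sam", "aroha"),
    ("sam", "mei"),
    ("mei", "aroha"),
    ("aroha", "sam"),
]

def feedback_flows(feedback_events):
    """Directed giver -> receiver feedback counts: the edges of a flow diagram."""
    return Counter(
        (giver, receiver)
        for giver, receiver in feedback_events
        if giver != receiver  # self-comments aren't a flow between people
    )

for (giver, receiver), n in feedback_flows(feedback_events).items():
    print(f"{giver} -> {receiver}: {n}")
```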

Contributor
Lauren Peate
Founder, CEO