Waymo Safety Impact: How Benchmarks Are Built and Reweighted

Waymo safety impact discussions keep circling back to one basic question: what counts as a fair “human benchmark.” Misryoum editorial desk noted that the human benchmark data used here are the same as reported in Scanlon et al. (2024), and extended upon in Kusano et al. (2025). The benchmarks aren’t floating in the abstract either—they’re derived from state police-reported crash records and Vehicle Miles Traveled (VMT) data in the areas where Waymo currently operates rider-only (RO) services at large scale (Phoenix, San Francisco, Los Angeles, and Austin).

What the benchmark is trying to do is mirror the kind of driving Waymo is doing, or at least get close. The human benchmarks were built to include only the crashes and VMT corresponding to passenger vehicles traveling on the types of roadways Waymo operates on (excluding freeways). Misryoum newsroom reported that the any-injury-reported benchmark also applied a 32% underreporting correction (based on NHTSA’s Blincoe et al., 2023 study) to adjust for crashes humans never report to police. The serious injury or worse benchmark (referred to as “suspected serious injury+” in the papers) and the airbag deployment benchmark used the observed crash counts without an underreporting correction.
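The arithmetic behind that correction can be sketched roughly as follows. This is an illustrative sketch, not Waymo’s or NHTSA’s actual code, and it assumes the correction works by treating the reported count as a fraction of the true count; the function name and the example numbers are hypothetical.

```python
def corrected_rate(observed_crashes, vmt_millions, underreporting=0.32):
    """Crashes per million miles, inflated for unreported crashes.

    Assumes a fraction `underreporting` of crashes never appear in
    police records, so observed = (1 - underreporting) * true count.
    (Hypothetical helper for illustration only.)
    """
    estimated_total = observed_crashes / (1.0 - underreporting)
    return estimated_total / vmt_millions

# Hypothetical numbers: 680 police-reported injury crashes over 500M miles.
# 680 / 0.68 = 1000 estimated total crashes -> 1000 / 500 = 2.0 per M miles.
rate = corrected_rate(680, 500)
```

Under this reading, the serious injury+ and airbag deployment benchmarks would simply skip the division step and use the observed counts directly.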

But then the whole “benchmark” idea hits a snag most drivers understand without thinking too hard: not all streets inside a city are equally challenging. If Waymo drives disproportionately in more challenging parts of the city that have higher crash rates, its crash rate ends up being compared against a city-wide average that also includes quieter, lower-risk areas. Misryoum analysis indicates the benchmarks reported by Scanlon et al. are at a city level, not for specific streets or areas. That mismatch is where a lot of the work shifts—from counting crashes to adjusting what the counting really represents.

So, the data hub’s human benchmarks weren’t just taken as-is. The human benchmarks shown on this data hub were adjusted using a method described by Chen et al. (2024) that models the effect of spatial distribution on crash risk. In plain terms, the methodology adjusts the city-level benchmarks to account for where within each city Waymo actually drives. The result of the reweighting method is human benchmarks that are more representative of the areas of the city Waymo drives in the most, which improves data alignment between the Waymo and human crash data.
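The core idea of such a reweighting can be sketched as a mileage-weighted average: take per-zone human crash rates and weight each zone by the share of Waymo’s miles driven there. This is a simplified illustration of the concept, not the Chen et al. (2024) model itself, and the zone names and numbers are hypothetical.

```python
def reweighted_benchmark(zone_rates, waymo_vmt_by_zone):
    """Human benchmark reweighted by Waymo's driving distribution.

    zone_rates: human crashes per million miles in each zone.
    waymo_vmt_by_zone: millions of Waymo miles driven in each zone.
    Returns the VMT-weighted average human crash rate.
    (Illustrative sketch with hypothetical inputs.)
    """
    total_vmt = sum(waymo_vmt_by_zone.values())
    return sum(zone_rates[z] * waymo_vmt_by_zone[z] / total_vmt
               for z in waymo_vmt_by_zone)

# Hypothetical: Waymo drives mostly in the denser, higher-risk zone,
# so the reweighted benchmark (3.0*0.8 + 1.0*0.2 = 2.6) sits above the
# unweighted city average of 2.0.
zone_rates = {"downtown": 3.0, "suburbs": 1.0}    # crashes per M miles
waymo_vmt  = {"downtown": 80.0, "suburbs": 20.0}  # millions of miles
benchmark = reweighted_benchmark(zone_rates, waymo_vmt)
```

The point of the design is that a fleet concentrated in harder areas gets compared against a correspondingly harder human baseline, rather than the city-wide average.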

There’s also the practical side of this, which sounds almost like logistics but really is methodology discipline. Achieving the best possible data alignment, given the limitations of the available data, is part of the newly published Retrospective Automated Vehicle Evaluation (RAVE) best practices (Scanlon et al., 2024b). Misryoum editorial team stated this spatial dynamic benchmark approach described by Chen et al. (2024) was also used in Kusano et al. (2025).

And once you start thinking about it that way, it’s hard not to get a little hung up on the “how” while you wait for the “what.” Like, you can stand outside a busy intersection and hear tires hiss, feel the brake-pulse through your chest, and still wonder whether the data is really capturing the same mix of streets. Here, at least, the evaluation tries to acknowledge that difference—just not completely. It’s a step, but the edge cases, the quiet roads, the parts that get driven less often… they’re still there, quietly shaping the baseline.
