The measurement model

What we measure

Screen measures seven engineering competencies through four composable assessment components. Each competency is scored independently with per-flaw evidence. No single number. No pass/fail.

7 Competencies

4 Components

0 Pass/fail labels

The seven competencies

01 Competency

Critique

Engaging deeply with code they didn't write.

Reading for intent, spotting assumptions, identifying failure modes — or skimming and accepting. The first thing a senior engineer does with AI-generated code is push back on it. We measure whether the candidate does the same.

Measured by

PR Review, Bug Bash

How scoring works

Scoring runs in two passes after a candidate submits a session: per-component scoring, then cross-component aggregation.

Step 1

Candidate submits

Inline PR comments, flaw-detection findings, a design document, or a debugging writeup.

Step 2

Per-component scoring

Codex-as-judge compares the submission against the ground-truth manifest — planted flaws, rubric criteria, or scenario probes.

Step 3

Cross-component aggregation

Per-component measurements are projected onto the seven competency axes, producing a weighted score with collected evidence for each.

Every score traces back to a specific planted flaw, rubric criterion, or scenario probe. You never see a number without being able to ask “where did this come from?” and get a concrete answer.
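
To make the two passes concrete, here is a minimal Python sketch under stated assumptions. The Flaw class, the axis weights, and the flaw IDs are all hypothetical, and a simple set-membership check stands in for the Codex-as-judge comparison, which this page does not specify. It shows only the shape of the idea: flaw-level results roll up into per-competency scores, and each score keeps the evidence it came from.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Flaw:
    """One planted flaw from a hypothetical ground-truth manifest."""
    flaw_id: str
    component: str          # e.g. "pr_review" or "bug_bash"
    axes: dict[str, float]  # competency axis -> weight (assumed values)


def score_component(manifest, found_ids):
    """Pass 1: mark each planted flaw as found (1.0) or missed (0.0).

    A real judge would compare free-text findings against the manifest;
    set membership is only a stand-in here.
    """
    return {f.flaw_id: (1.0 if f.flaw_id in found_ids else 0.0) for f in manifest}


def aggregate(manifest, flaw_scores):
    """Pass 2: project flaw-level scores onto competency axes, keeping evidence."""
    totals = defaultdict(float)
    weights = defaultdict(float)
    evidence = defaultdict(list)
    for flaw in manifest:
        score = flaw_scores[flaw.flaw_id]
        for axis, weight in flaw.axes.items():
            totals[axis] += weight * score
            weights[axis] += weight
            evidence[axis].append((flaw.flaw_id, flaw.component, score))
    # Weighted mean per axis, alongside the flaws that produced it.
    return {
        axis: {"score": totals[axis] / weights[axis], "evidence": evidence[axis]}
        for axis in totals
    }


# Made-up manifest and findings, for illustration only.
manifest = [
    Flaw("F1", "pr_review", {"critique": 1.0}),
    Flaw("F2", "pr_review", {"critique": 0.5, "skepticism": 0.5}),
    Flaw("F3", "bug_bash", {"critique": 0.5}),
]
flaw_scores = score_component(manifest, found_ids={"F1", "F3"})
profile = aggregate(manifest, flaw_scores)
```

With these made-up inputs, the critique score is 0.75 and its evidence list names F1, F2, and F3 with the component each came from, so asking where the number came from has a concrete answer.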

Our philosophy

Measurement, not verdict.

Screen does not produce a PASS/FAIL label. We surface raw measurements plus percentile context across sessions. You see where a candidate lands on each competency — and you decide what “good enough” means for your team, your role, and your level.
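
Percentile context can be read as "what share of prior sessions scored below this one." The sketch below is one common way to compute that, counting ties as half. The function name and the sample scores are assumptions for illustration; this page does not say which percentile definition Screen uses.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(value, population):
    """Percentage of prior sessions scoring below `value`, counting ties as half."""
    if not population:
        raise ValueError("population must not be empty")
    ordered = sorted(population)
    below = bisect_left(ordered, value)           # strictly lower scores
    ties = bisect_right(ordered, value) - below   # scores equal to `value`
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Made-up critique scores from earlier sessions, including the current one.
prior_critique_scores = [0.2, 0.4, 0.5, 0.75, 0.9]
rank = percentile_rank(0.75, prior_critique_scores)
```

Here a critique score of 0.75 lands at the 70th percentile of the sample: three sessions scored lower and one tied.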

Different teams value different things. A startup scaling fast might weight system reasoning heavily. A security-focused team might care most about skepticism. Screen gives you the data. You make the call.
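
As a sketch of what "making the call" could look like on the team's side, the snippet below applies two hypothetical weightings to the same per-competency scores. This is not something Screen outputs. The competency names, scores, and weights are invented for illustration, and a weighted average is only one of many ways a team might read the data.

```python
def team_view(profile, weights):
    """Weighted average of per-competency scores under one team's priorities."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(profile[axis] * w for axis, w in weights.items()) / total_weight


# One candidate's per-competency scores (made up).
candidate = {"critique": 0.75, "system_reasoning": 0.5, "skepticism": 0.25}

# Two teams looking at the same data with different priorities (made up).
startup = {"critique": 1, "system_reasoning": 2, "skepticism": 1}
security = {"critique": 1, "system_reasoning": 1, "skepticism": 2}

startup_view = team_view(candidate, startup)
security_view = team_view(candidate, security)
```

The same measurements read differently under each weighting (0.5 for the startup, 0.4375 for the security team), which is why the threshold is left to the team.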
