Proving it works: CloudUPDRS vs. blinded neurologists

Building a smartphone test that measures Parkinson’s motor symptoms is one thing. Proving that its scores mean the same thing as a neurologist’s is another, and it is the harder, less glamorous half of the work. This study set out to do exactly that: it put a 16-item phone assessment head-to-head against three blinded neurologists and reported the result honestly, even where the news is mixed.

The problem

The trusted yardstick for Parkinson’s motor severity is Part III of the MDS-UPDRS, the clinical motor examination. It is internationally validated, familiar, and easy to interpret, which is why it remains the favoured primary endpoint of major Parkinson’s trials. But it has two well-known weaknesses: it is time-consuming for clinicians, and its poor calibration and limited sensitivity may have contributed to the repeated failure of promising new therapies to show benefit in trials. A more sensitive, objective measure of motor severity could make trials faster and fairer — and help personalise treatment.

Smartphones and wearables are the obvious candidates. They are cheap, objective, and can measure as often as you like. Earlier studies showed that digital scores correlate with the total MDS-UPDRS III. But correlation across a group is not the same as being right about an individual, and individual-level accuracy is what clinical decisions and trials actually require. When previous models were scaled up from a handful of patients to larger cohorts, they tended not to generalise.

The authors identify three reasons digital tools have struggled, each of which is a trap for new studies to avoid. First, models trained on a single rater’s scores absorb that rater’s subjective bias rather than removing it. Second, assessments built from only five to seven digital subtests may be too blunt to capture the heterogeneity of a disease whose clinical scale has 33 items. Third, and most insidious, when a study tests a large number of candidate features or machine-learning algorithms against a limited amount of data, it runs a high risk of feature-selection bias: the model latches onto chance patterns and reports an accuracy that looks impressive but will not hold up. This is the over-optimism the paper is determined not to repeat.

A smartphone test compared against three blinded neurologists scoring the same patients.

Figure 1. The validation question: can a phone predict what expert raters score?

How we tested it

The answer was the CloudUPDRS Smartphone Software in Parkinson’s (CUSSP) study — a design built specifically to disarm those three traps. It was prospective, pre-registered, dual-site, and crossover-randomised: patients were recruited at two London hospitals (the National Hospital for Neurology and Neurosurgery, and Homerton University Hospital), and software randomised the order in which the smartphone and clinical assessments were performed, one immediately after the other. Randomising the order matters because motor signs such as tremor amplitude can shift within minutes; doing the phone test first for everyone would have quietly biased the comparison.

The smartphone assessment was the index test — the new measure being evaluated. The reference standard was deliberately strengthened against subjective bias: rather than a single examiner’s score, three neurologists, each with specialist movement-disorders training and blinded to medication status, randomisation order, clinical details, and one another’s ratings, independently scored video recordings of every MDS-UPDRS III item. Training the models on the median of three blinded ratings, rather than one person’s opinion, is what guards against over-fitting to any individual rater’s idiosyncrasies.
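As a rough illustration of the median-of-three reference, here is a minimal Python sketch. The item names and scores are invented for the example and are not study data; the only detail taken from the exam itself is that each MDS-UPDRS III item is scored from 0 to 4.

```python
from statistics import median


def reference_scores(ratings_by_rater):
    """Combine per-item scores from several raters into a median reference.

    ratings_by_rater: list of dicts, one per rater, mapping item name -> score (0-4).
    Returns a dict mapping item name -> median score across raters.
    """
    items = ratings_by_rater[0].keys()
    return {item: median(r[item] for r in ratings_by_rater) for item in items}


# Hypothetical ratings of one patient by three blinded raters (illustrative values).
rater_a = {"finger_tapping_right": 2, "rest_tremor_left_leg": 0, "pronation_supination_left": 1}
rater_b = {"finger_tapping_right": 1, "rest_tremor_left_leg": 0, "pronation_supination_left": 1}
rater_c = {"finger_tapping_right": 3, "rest_tremor_left_leg": 0, "pronation_supination_left": 2}

reference = reference_scores([rater_a, rater_b, rater_c])
print(reference)
```

With three raters, one unusually high or low score cannot move the median, which is why a single rater's habits have less influence on the training target.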

In total the analysis drew on 60 subjects, 990 smartphone tests, and 2,628 blinded video subitem ratings. To guard against the second and third traps, the team used a broader 16-item smartphone battery (to capture more of the disease’s heterogeneity) and pre-published or pre-registered its features and analysis plan in advance (so the headline result could not be the product of hindsight). Performance was measured by leave-one-subject-out cross-validation (LOSO-CV): each subject’s prediction comes from a model trained only on the other subjects — an honest test of how the tool behaves on someone it has never seen.
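The idea behind LOSO-CV can be sketched in a few lines of Python. This is a simplified illustration, not the study's pipeline: the one-feature nearest-neighbour classifier, the function names, and the toy data are all assumptions made for the example. The essential step is that every test from the held-out subject is removed from training, which matters when each subject contributes several tests.

```python
def loso_predictions(samples, fit, predict):
    """Leave-one-subject-out cross-validation.

    samples: list of (subject_id, feature, label) tuples.
    fit(train) -> model, where train is a list of (feature, label) pairs.
    predict(model, feature) -> predicted label.
    Returns a list of (subject_id, true_label, predicted_label).
    """
    subjects = sorted({subject for subject, _, _ in samples})
    results = []
    for held_out in subjects:
        # Train only on tests from the OTHER subjects.
        train = [(x, y) for s, x, y in samples if s != held_out]
        test = [(x, y) for s, x, y in samples if s == held_out]
        model = fit(train)
        for x, y in test:
            results.append((held_out, y, predict(model, x)))
    return results


# Toy classifier (an assumption for illustration): 1-nearest-neighbour on one feature.
def fit_1nn(train):
    return train


def predict_1nn(model, x):
    nearest = min(model, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


# Hypothetical data: two tests per subject, one numeric feature, one score category.
samples = [
    ("s1", 0.10, 0), ("s1", 0.12, 0),
    ("s2", 0.15, 0), ("s2", 0.20, 0),
    ("s3", 0.80, 2), ("s3", 0.85, 2),
    ("s4", 0.75, 2), ("s4", 0.90, 2),
]

results = loso_predictions(samples, fit_1nn, predict_1nn)
accuracy = sum(true == pred for _, true, pred in results) / len(results)
print(f"LOSO accuracy on toy data: {accuracy:.0%}")
```

Splitting by subject rather than by individual test prevents a model from scoring well simply because it has already seen other recordings from the same person.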

A dual-site, crossover, blinded study design analysed with leave-one-subject-out cross-validation.

Figure 2. A study designed to avoid over-optimistic results.

What we found

The pre-specified analysis — the strict, locked-in-advance one — classified 70.3% of subjects (SEM 5.9%) into a category consistent with at least one of the three blinded raters. That is well above the random baseline of 36.7% (SEM 4.3%): the phone is genuinely predictive of expert judgement at the individual level, not just across a group. A more demanding version, requiring the phone to match the median of the three raters exactly — effectively asking it to outperform any single human — scored a more modest 57.0% (SEM 8.0%), still clearly above its own random baseline of 28.5%.
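The difference between the two success criteria can be made concrete with a short Python sketch. The predictions and ratings below are invented, and treating each case as a single predicted category is a simplification of the study's subject-level scoring. The SEM shown is the ordinary standard error of the per-case values, which is an assumption about the calculation rather than the paper's stated method.

```python
from math import sqrt
from statistics import mean, median, stdev


def matches_any_rater(predicted, ratings):
    """Lenient criterion: the prediction agrees with at least one rater."""
    return predicted in ratings


def matches_median(predicted, ratings):
    """Strict criterion: the prediction equals the median of the raters."""
    return predicted == median(ratings)


def mean_and_sem(values):
    """Mean and standard error of the mean (sample stdev / sqrt(n))."""
    return mean(values), stdev(values) / sqrt(len(values))


# Hypothetical cases: (predicted category, scores from three blinded raters).
cases = [
    (1, (1, 1, 2)),
    (2, (1, 1, 2)),
    (0, (1, 2, 2)),
    (3, (3, 3, 3)),
    (2, (2, 3, 4)),
]

lenient = [float(matches_any_rater(p, r)) for p, r in cases]
strict = [float(matches_median(p, r)) for p, r in cases]

lenient_acc, lenient_sem = mean_and_sem(lenient)
strict_acc, strict_sem = mean_and_sem(strict)
print(f"any-rater accuracy: {lenient_acc:.1%} (SEM {lenient_sem:.1%})")
print(f"median accuracy:    {strict_acc:.1%} (SEM {strict_sem:.1%})")
```

Because the median of three ratings is always one of those ratings, every strict success is also a lenient success, so the strict figure can never exceed the lenient one.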

The team then ran an exploratory analysis in which they were allowed to pick the best-performing classifier and feature for each subtest. This pushed accuracy up to 78.7% (SEM 5.1%) — but the authors are explicit that this number carries a moderate risk of over-optimism, precisely the feature-selection bias they warned about. They report both figures side by side on purpose: the conservative, pre-specified result as the trustworthy benchmark for clinical translation, and the optimised result as a signpost for which features and classifiers future studies might test.
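A small simulation shows why choosing the best performer after the fact inflates accuracy. This is a hedged illustration, not the study's analysis: the 200 candidates are pure noise, the number 200 is arbitrary, and only the cohort size of 60 is borrowed from the study.

```python
import random


def accuracy(predictions, labels):
    """Fraction of predictions that equal the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


rng = random.Random(0)   # fixed seed so the run is repeatable
n_subjects = 60          # cohort size, as in the study
n_candidates = 200       # hypothetical number of feature/classifier combinations

# Binary labels and candidates that guess at random: none has any real signal.
labels = [rng.randint(0, 1) for _ in range(n_subjects)]
candidate_scores = []
for _ in range(n_candidates):
    noise_predictions = [rng.randint(0, 1) for _ in range(n_subjects)]
    candidate_scores.append(accuracy(noise_predictions, labels))

best_score = max(candidate_scores)
average_score = sum(candidate_scores) / n_candidates
print(f"best of {n_candidates} noise candidates: {best_score:.1%}")
print(f"average noise candidate: {average_score:.1%}")
```

Every candidate here has a true accuracy of 50% by construction, so any margin of the best score over the average is selection bias alone. Fixing the features and analysis in advance removes the opportunity to pick a lucky winner.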

Performance across individual subtests was variable, ranging from 53.2% to 97.0%. That spread is itself a finding, and the paper refuses to oversell it. The very highest scores came from leg-tremor tests at 97.0% — but the authors note this was achieved largely by predicting the most common category every time, because almost none of the 60 patients had leg tremor to begin with. The more informative successes were where scores genuinely varied across patients: proximal bradykinesia tasks such as pronation/supination movements (around 73–75%) carried real signal, while some finger-tapping variants were among the weakest. The honest conclusion is that smartphone measures have predictive value at the subject level — and that future work must keep mitigating both subjective and feature-selection biases, and test across a range of motor features, to avoid over-optimistic estimates.
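The leg-tremor caveat is easy to reproduce with a majority-class baseline. In the Python sketch below, the score distributions are invented for illustration and are not the study's data; only the cohort size of 60 comes from the study.

```python
from collections import Counter


def majority_class_accuracy(labels):
    """Accuracy of a 'classifier' that always predicts the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical score distributions for 60 subjects (illustrative, not study data).
leg_tremor = [0] * 58 + [1] * 2                                 # sign almost always absent
pronation_supination = [0] * 15 + [1] * 25 + [2] * 15 + [3] * 5  # scores genuinely vary

print(f"leg tremor baseline:           {majority_class_accuracy(leg_tremor):.1%}")
print(f"pronation/supination baseline: {majority_class_accuracy(pronation_supination):.1%}")
```

When nearly everyone shares the same score, always predicting that score looks highly accurate without measuring anything. An accuracy figure is only informative relative to this kind of baseline, which is why the subtests with varied scores say more about the phone's real signal.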

Subject-level accuracy: 36.7% random vs 70.3% pre-specified vs 78.7% optimised.

Figure 3. Subject-level predictive accuracy against blinded raters.

Why it matters

If digital measures can stand in for parts of the clinical motor exam, trials gain a more sensitive, frequently sampled, objective endpoint, and routine care gains a way to personalise treatment to each patient’s own motor profile. That is a large prize. But the paper’s real contribution is the word “if”. It shows how a digital assessment can be validated the way a medical measure should be: prospectively, against a blinded multi-rater reference, with a locked analysis plan and an out-of-sample test. It also reports the limitations rather than burying them, noting that the cohort skewed toward mild-to-moderate disease and that severely affected patients were under-represented.

That posture is the whole point. The most useful thing here is not a single accuracy number but a template for honesty: separate the conservative benchmark from the optimistic one, name the biases you are exposed to, and resist the temptation to quote your best result as if it were your only result.

This is exactly how stm.ai’s MedTech practice approaches AI for healthcare. We tie every model back to clinical evidence rather than letting it float free, validate it against the standards clinicians actually trust, and keep a human in the loop wherever it counts: here, three human raters define the truth the model is measured against. And we report results the way this study does: precisely, with the limitations attached. In medicine, an honestly reported 70% is worth more than an over-optimistic 90%, because only one of them survives contact with a real patient.

A. Jha, E. Menozzi, R. Oyekan, et al. — “The CloudUPDRS smartphone software in Parkinson’s study: cross-validation against blinded human raters”, npj Parkinson’s Disease (2020). Read the paper.