Good test hunting: FDA authorizations point to benchmarks for COVID-19 serology

How to size up COVID-19 tests, using FDA authorizations as guideposts

The critical role of COVID-19 antibody testing in reopening society -- combined with controversy around the regulatory flexibility these tests have received and reports that some are performing poorly -- has put all eyes on evaluation metrics. FDA’s new guidelines for serological test validation by other government agencies provide a first glimpse into how the agency is defining high performance, and some of the tests it has authorized itself fall short.

With COVID-19 testing universally seen as a cornerstone for returning to normal life, the obvious question is which tests will be informative and reliable enough to guide personal and public health decisions.

Industry, government and the general public have therefore become keen to understand the sensitivity, specificity and predictive value of tests on the market, how a test’s clinical relevance varies across populations, and the hurdles facing COVID-19 tests in particular (see “The Quarantiner’s Guide to Evaluating COVID-19 Tests”).

Figure: The quarantiner’s guide to evaluating COVID-19 tests

Tests with binary results are typically evaluated based on sensitivity and specificity, which are characteristics intrinsic to the test. Objectively defining sensitivity and specificity requires having a reference standard -- a test universally recognized as the best available method for establishing the presence or absence of a condition. When no reference standard is available, sensitivity and specificity are defined as positive percent agreement (PPA) and negative percent agreement (NPA) with another test of the developer’s choice.

The clinical relevance of a test depends on the prevalence of the condition it detects. How much one can trust a positive or negative result in a given clinical population is quantified by a test’s positive predictive value (PPV) or negative predictive value (NPV). The PPV and NPV of a test with a fixed sensitivity/PPA and specificity/NPA change as a function of case prevalence in the population.

So far, COVID-19 is diagnosed by molecular tests that detect viral genomic material by amplifying sequences specific to SARS-CoV-2. The other major type of test, the immunoassay, can be more readily geared to give rapid results, but has proven more challenging to develop and validate (see “Diagnostic Tech Tableau”).

One type of immunoassay, which detects viral protein antigens, is poised to gain momentum as a way to quickly and cheaply screen people for active infections, though viral genome tests will remain the gold standard (see “Viral Antigen Tests on Deck”).

Serological testing, a different type of immunoassay that detects patient antibodies against past infections, has risen to prominence for its potential to track how far the pandemic has spread, and to identify who and how many people have built some degree of immunity -- key questions for reopening society.

The problem is that serology tests have exploded on the market with limited regulatory oversight, and attempts to independently validate marketed serological tests have suggested that many are under-performing (see “Wild West of Antibody Tests”).

COVID-19 diagnostics must receive FDA Emergency Use Authorization (EUA) before entering the U.S. market. But in its March 16 guidance, FDA said that because serological tests are not used for clinical diagnoses, they can be marketed without regulatory review if companies collect their own validation data, notify FDA and make certain disclaimers.

In a conversation with BioCentury, FDA Commissioner Stephen Hahn said the burden will fall on users to evaluate tests based on the data developers report. But leaving individuals and institutions to differentiate between good and bad antibody tests is arguably an abdication of FDA’s responsibility, especially since decisions based on faulty serological tests could have catastrophic consequences (see “FDA is Wrong on Self-Validation of COVID-19 Tests”).

Over the last month, the U.S. government took a series of steps to bring more serological tests under review, such as limiting CMS’s coverage mandate to tests with EUA (see “Regulatory Calculus for Antibody Tests”).

The latest move came April 28, when FDA issued a Letter of Authorization outlining an “umbrella” pathway that grants EUA to COVID-19 serological tests validated by other government agencies, including a program led by NIH’s National Cancer Institute (NCI) announced April 6 (see “NCI Takes Lead on Test Validation”).

The document also marked the first time FDA outlined minimum performance criteria for COVID-19 tests; serological tests that receive EUA through this umbrella pathway must show a sensitivity of at least 90% and a specificity of at least 95% in the government-run validation studies, using a minimum of 30 antibody-positive patient samples and 80 negative control samples.

However, these black-and-white cutoffs don’t apply to tests granted EUA by FDA directly. Among the nine serological tests with FDA EUA, four have sensitivity or specificity values that fall outside of this window, indicating the agency is giving itself greater discretion than it is allowing other agencies (see “Serological Test Report Card”).

Combined with the new guidelines, these tests’ specs provide benchmarks for evaluating the next wave of serological tests, including a highly anticipated high-throughput assay from Roche (SIX:ROG; OTCQX:RHHBY). The company, which said it had internally validated other marketed tests and found several to be “a disaster,” plans to launch its test in early May (see “Roche to Scale Up Serological Tests”).

Figure: Serological test report card

FDA has granted nine COVID-19 antibody tests Emergency Use Authorization (EUA). The instructions for use (IFU) document for each test reports its sensitivity and specificity in the form of positive percent agreement (PPA) and negative percent agreement (NPA), respectively, with a reverse transcription polymerase chain reaction (RT-PCR) test, along with 95% confidence intervals (CI) for each value.

The 95% CI represents the range of values that one can be 95% certain contains the test characteristics. A key factor determining the width of the 95% CI for sensitivity/PPA is the number of COVID-19 patient samples used to validate the test, while the 95% CI for specificity/NPA is more influenced by the number of negative control samples, with larger numbers of samples supporting a tighter confidence interval and greater certainty.

Some manufacturers report PPA and NPA values for different sets of samples collected at different time intervals after symptom onset; the data shown represent results for the latest time interval reported.

The shaded area represents the sensitivity and specificity cutoffs FDA requires serological tests to meet in order to receive EUA via its “umbrella” pathway. In this pathway, tests are validated by certain government agencies outside of FDA, for example the serological test validation program led by NIH's National Cancer Institute (NCI). Four of the tests FDA has authorized itself do not make the grade.

Numbers to watch for

Every FDA-authorized test has an instructions for use (IFU) document that reports some measure of sensitivity or specificity -- performance metrics intrinsic to the test itself. The way these parameters are reported depends on the type of test, and the availability of other tests to act as reference standards.

Diagnostics based on reverse transcription polymerase chain reaction (RT-PCR) or non-thermocycling probes report test sensitivity based on limit of detection (LoD), defined as the lowest amount of the target that the test can detect at least 95% of the time; a lower LoD indicates a more sensitive test. Specificity is evaluated by whether the assays cross-react with genomic material from a wide range of other pathogens.

Because these tests are based on genomic sequence complementarity, their performance has been easier to optimize than that of tests based on antibody recognition. Performance issues for diagnostic tests have largely been attributed to the challenges of performing nasopharyngeal swabs and preserving sample integrity in transport; development of saliva-based tests and point-of-care tests that could avoid these issues is under way.

A longer-term hurdle is accumulating enough clinical evidence to determine what LoD gives enough confidence in deciding whether to send someone home from the hospital, or whether someone is capable of infecting others. The LoDs of tests granted EUA by FDA span four orders of magnitude (see “Pushing Limits of Detection”).

Defining LoDs for a viral genome-based diagnostic does not require having patient samples on hand; developers can spike nasal swabs from uninfected individuals with synthetic viral sequences.

In contrast, access to large numbers of patient samples is critical for establishing confidence in the performance of serological tests, because their target analytes -- human antibodies -- are highly variable across individuals. Yet despite soaring case numbers, the lack of uniform, routine sample banking has made it challenging to access enough samples to thoroughly validate tests; some test developers have stretched the utility of the resources available to them by screening multiple samples from the same patient.

An objective definition of sensitivity and specificity requires a universally acknowledged reference standard, defined as the best available method for establishing the presence or absence of a target condition.

Because no such standard exists yet for COVID-19, serological test developers have been reporting sensitivity and specificity as positive percent agreement (PPA) and negative percent agreement (NPA), respectively, with RT-PCR tests performed on patients’ nasal swabs.
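As a rough illustration of how PPA and NPA are derived, the sketch below computes both from the four cells of a comparison against an RT-PCR result. The function name and the counts are hypothetical, chosen only to match the 30-positive, 80-negative minimum sample sizes cited above; they do not describe any authorized test.

```python
def ppa_npa(both_positive, missed_positive, both_negative, false_positive):
    """Return (PPA, NPA) as fractions, relative to a comparator such as RT-PCR.

    both_positive:   comparator-positive samples the new test also calls positive
    missed_positive: comparator-positive samples the new test calls negative
    both_negative:   comparator-negative samples the new test also calls negative
    false_positive:  comparator-negative samples the new test calls positive
    """
    ppa = both_positive / (both_positive + missed_positive)
    npa = both_negative / (both_negative + false_positive)
    return ppa, npa


# Hypothetical counts, for illustration only: 30 comparator-positive samples
# and 80 comparator-negative samples.
ppa, npa = ppa_npa(both_positive=27, missed_positive=3,
                   both_negative=78, false_positive=2)
print(f"PPA = {ppa:.1%}, NPA = {npa:.1%}")
```

In this made-up case the test would sit right at the 90% sensitivity cutoff and clear the 95% specificity cutoff.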

The number of samples used to validate tests is a key metric for establishing confidence in their PPAs and NPAs; the more samples available, the tighter the 95% confidence interval (CI).

Since specificity/NPA reflects the ability to accurately identify negative controls, which are more widely available than patient samples, the 95% CI for these metrics tends to be tighter than for sensitivity/PPA, which captures the proportion of positive cases a test can find.
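To see how sample size drives the width of a confidence interval, consider the sketch below. It uses the Wilson score method, which is an assumption made here for illustration; the IFU documents may rely on a different method, and the counts are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Two-sided Wilson score interval for a proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Same observed agreement (90%), two hypothetical sample sizes.
for agree, total in ((27, 30), (270, 300)):
    low, high = wilson_interval(agree, total)
    print(f"{agree}/{total}: 95% CI {low:.1%} to {high:.1%}")
```

With the same observed 90% agreement, the interval based on 300 samples is markedly narrower than the one based on 30, which is why a point estimate alone says little about how well a test has been validated.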

Several serological test developers report PPA and NPA values for multiple time windows of patient sample collection, with better performance on samples collected later after symptom onset. Antibodies are thought to arise 10-14 days after infection, peak at about day 28, and remain in the circulation for months afterward.

Which antibodies an immunoassay developer chooses to measure, and which use case the developer focuses on, also influence test performance.

Because IgG antibodies are produced late in infection and have antigen specificities honed by B cell affinity maturation, they are considered a more accurate measure of seroconversion than IgM antibodies, which are produced earlier in infection and are less antigen-specific. Tests that detect only IgGs therefore tend to be more accurate, but might fail to detect exposure to infections that are still ongoing.

Rapid immunoassays, which are performed on disposable lateral flow dipsticks, are typically more challenging to make accurate than enzyme-linked immunosorbent assays (ELISAs) or other chemiluminescence tests performed in labs, in part because they lack signal amplification.

Because false positive COVID-19 antibody tests could give people a false sense of security that leads them to put themselves and others at risk, the top priority for these tests has been maximizing specificity/NPA by minimizing cross-reactivity with other viral proteins. Yet maximizing specificity often comes at the expense of sensitivity, since raising the bar for what counts as a positive result typically increases exclusion of true positives.
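The trade-off can be sketched with a toy model in which antibody-negative and antibody-positive samples each produce a bell-shaped spread of assay signal. The two distributions and the cutoffs below are invented purely for illustration and are not drawn from any real assay.

```python
from statistics import NormalDist

# Hypothetical signal distributions in arbitrary units (illustration only).
negatives = NormalDist(mu=1.0, sigma=0.5)  # samples without antibodies
positives = NormalDist(mu=3.0, sigma=1.0)  # samples with antibodies


def performance_at(cutoff):
    """Return (sensitivity, specificity) when signals above `cutoff` count as positive."""
    sensitivity = 1 - positives.cdf(cutoff)  # positives above the cutoff
    specificity = negatives.cdf(cutoff)      # negatives at or below the cutoff
    return sensitivity, specificity


for cutoff in (1.5, 2.0, 2.5):
    sens, spec = performance_at(cutoff)
    print(f"cutoff {cutoff:.1f}: sensitivity {sens:.1%}, specificity {spec:.1%}")
```

In this toy model, each step up in the cutoff buys specificity at the cost of sensitivity, the same pattern described above.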

For example, while Ortho Clinical Diagnostics Inc. reports specificities of 100% for both its total antibody and IgG-specific lab-based COVID-19 serology tests, both have sensitivities/PPAs below the 90% cutoff FDA specified for tests reviewed by other agencies, at 83.3% and 85.5%, respectively.

Other FDA-authorized tests that fall below the cutoffs are the rapid IgM and IgG test from Autobio Diagnostics Co. Ltd., with a sensitivity/PPA of 88.2%, and the rapid IgM and IgG test from Chembio Diagnostics Inc. (NASDAQ:CEMI), which meets the cutoff on sensitivity/PPA but has a specificity/NPA of 94.4%, below the 95% cutoff.

No test is an island

Even when a test’s sensitivity and specificity are high, how much its results can be trusted depends on the prevalence of cases in the population.

A commonly used metric of clinical relevance is positive predictive value (PPV), the proportion of total positives a test reports that are actually true positives. When prevalence is low, even a small false positive rate applied to the large uninfected majority can generate as many false positives as there are true positives. As a result, a test that performs well in populations with large proportions of true positive cases can be woefully inadequate in populations where prevalence is low.

The prevalence of COVID-19 exposure in the U.S. has been estimated to be in the realm of 5%. According to a Testing Blueprint published Monday by the White House, CDC and FDA, a test with 95% sensitivity and 95% specificity has a PPV of 50% in this population -- no better than a coin toss.
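The arithmetic behind that figure can be reproduced with a few lines of code. The sketch below applies the standard definitions of PPV and NPV; the function name is hypothetical, and only the 5% prevalence and the 95%/95% test come from the text, with the other prevalence values added for comparison.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied to a population with the given prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# 5% is the estimate cited in the text; the other prevalences are illustrative.
for prevalence in (0.01, 0.05, 0.20, 0.50):
    ppv, npv = predictive_values(0.95, 0.95, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

At 5% prevalence, the true positives (95% of the 5% who were exposed) and the false positives (5% of the 95% who were not) are equal in number, which is where the 50% PPV comes from.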

This hypothetical test is more sensitive and specific than seven out of the nine tests FDA has authorized.

Adding to the uncertainty, the most reliable way to measure the prevalence of COVID-19 exposure in the population is via good antibody tests, which creates a circular problem. An April medRxiv preprint estimating COVID-19 exposure in California’s Santa Clara County made three different predictions based on three assumptions of serological test performance.

But the Testing Blueprint argues that combining two serological tests dramatically improves the ability to trust their respective positive results (see “Sequential Testing as Route to Minimize False Positives”).
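One way to see why a second test helps is to treat the PPV after the first positive result as the starting prevalence for the second test. The sketch below does this under a strong assumption that the two tests err independently of each other, which may not hold if they target the same antigen; it is an illustration of the general principle, not a reproduction of the Blueprint's own calculation.

```python
def ppv(sensitivity, specificity, prior):
    """Probability of true exposure given a positive result, for a given prior probability."""
    true_pos = sensitivity * prior
    false_pos = (1 - specificity) * (1 - prior)
    return true_pos / (true_pos + false_pos)


prevalence = 0.05  # estimate cited in the text

# Assumption: both tests have 95% sensitivity and 95% specificity,
# and their errors are independent of each other.
after_first = ppv(0.95, 0.95, prevalence)
after_second = ppv(0.95, 0.95, after_first)

print(f"PPV after one positive:  {after_first:.1%}")
print(f"PPV after two positives: {after_second:.1%}")
```

Under these assumptions, a second positive result lifts the PPV from about 50% to about 95%, because the second test is applied to a group in which half of the people have truly been exposed.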

As more people are exposed to COVID-19, and as effective vaccines come online, the prevalence of anti-SARS-CoV-2 antibodies in the population will rise, making positive results from individual tests more trustworthy.

The bigger question is what titers and types of antibodies actually confer protection against re-infection.

The answer will guide how test developers should tune the sensitivity and specificity of their assays. More critically, it will reveal what kinds of questions serological tests are capable of answering, and whether they actually hold the key to normalcy.
