Multi-rater assessment is a common approach to characterizing disease states, particularly in populations at risk of impaired symptom insight. The approach brings a substantial increase in data volume along with a central need to compare results across raters. In this challenging milieu, methods for quickly visualizing and grasping comparative results offer considerable utility. This article examines methods for visualizing multi-rater data at the level of the individual patient, drawing examples from several disease populations. This is the fourth in a series of articles examining non-traditional methods of visualizing disease states (i.e., excluding plots, charts, and graphs); the other parts are available here.
A Tale of Two Raters
It is the best of data, it is the worst of data. Multi-rater assessment, or more broadly multi-source data, is the Alexandre Manette of Dickens' classic novel: often left imprisoned and under-utilized because of the analytical challenges it adds. The use of multiple raters increases confidence that the targeted outcome domains are accurately and fully captured. However, it also brings a wave of additional data and an inherent tension with its aim of increased accuracy. That is, what to do with the inevitable data discrepancies?
This article presents methods to quickly visualize and grasp comparative results, and to facilitate workflows around them, in hopes of 'freeing' the multi-rater technique from imprisonment. Note that the focus here is on the single-subject level, as opposed to group-level statistical analysis of multi-rater data. The examples come from clinical research settings.
Isn’t One Enough?
The Multi-Model Assessment Method is the gold standard for disease measurement. The main idea is that if consistent findings emerge across multiple assessment methods and sources, it is reasonable to have increased confidence in the results. The general principles for selecting measures and sources will not be covered here. Instead, merely note that the end result of this approach is a collection of measurement values assessing the targeted construct, which can be any outcome, characteristic, disease state, etc. about the subject (e.g., activity level, dyspnea, tremor).
As seen in the figure below, the measurement values of the target construct have three primary sources:
Subject
Completed by the subject. Typically, these measures capture the person's subjective experience (e.g., patient-reported outcomes [PROs]) or performance on an objective task (e.g., a cognitive test).
Other Raters
Completed by other people with direct knowledge of the targeted construct. Common examples include clinical staff (e.g., physicians, nurses), caregivers, family members and teachers.
Devices and Markers
This category incorporates the broad range of objective devices and physical markers that capture data related to the targeted construct. Examples include tracking devices (e.g., steps, heart rate), lab values (e.g., C-reactive protein) and images.
He Said, She Said
The use of multiple raters is one technique within the Multi-Model Assessment Method. In short, the perspectives of multiple raters are incorporated into the assessment of the target construct. Which perspectives to incorporate depends on the measurement goals.
At one end is a purely psychometric goal. That is, measurement accuracy is the primary concern. In this case, the raters are knowledge experts who drive toward consensus on the ratings. At the other end, the goal is accuracy PLUS full characterization of the target construct. Here, multiple perspectives are considered critical for ensuring the entire range of the target construct is captured. For example, consider the scenario where subjects are at risk of limited insight into their disease characteristics (e.g., dementia, seizure disorder).
Visualize the Revolution
The perspectives incorporated into the assessment and the measurement goals also set the stage for how best to visualize agreement and disagreement across raters. Consider first the purely psychometric case, where expert raters drive toward item consensus. That is, the sovereignty of the individual rater is subordinate to the collective authority of the group (i.e., the French Revolution).
The method involves each rater completing the same assessment form independently. The results are then fed into a "Gold Standard" form, which contains the values for the items on which the raters agree (see image below).
To facilitate the workflow, the 'visualization' is built into the Gold Standard form: discrepancies are highlighted so that they can be quickly identified, discussed amongst raters, and a final assessment value agreed upon and entered. Items with agreement can be either hidden or displayed. In the example below, the "Medical Records" gold standard form highlights in red those items with disagreement, while items with agreement simply display the consensus value.
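The consensus check behind such a form can be sketched in a few lines. This is a minimal illustration, not the actual form logic, and the item names and values below are hypothetical:

```python
# Sketch of a "Gold Standard" consensus check: each rater submits the
# same items independently, and any item where values differ is flagged
# for discussion; agreed items carry the consensus value forward.

def gold_standard(ratings_by_rater):
    """ratings_by_rater: {rater_name: {item: value}} -> per-item status."""
    items = next(iter(ratings_by_rater.values())).keys()
    result = {}
    for item in items:
        values = {ratings[item] for ratings in ratings_by_rater.values()}
        if len(values) == 1:
            result[item] = {"status": "agree", "value": values.pop()}
        else:  # highlighted in red on the form for rater discussion
            result[item] = {"status": "flag", "values": sorted(values)}
    return result

ratings = {
    "rater_a": {"onset_age": 12, "med_history": "yes"},
    "rater_b": {"onset_age": 12, "med_history": "no"},
}
summary = gold_standard(ratings)
# onset_age agrees; med_history is flagged for consensus discussion
```

The same structure extends to any number of raters, since agreement is simply the set of distinct values collapsing to one.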
Turning next to the combined goal of accuracy AND full characterization of the target construct: here, merely displaying the items and highlighting discrepancies has marginal utility, since item consensus is not part of the workflow. That is, the visualization technique must maintain the sovereignty and liberty of the individual rater (i.e., the American Revolution).
In this case, the main challenge is how to transform the items and associated values into a visual representation that can be quickly grasped and understood. As an example, consider the item below from a seizure questionnaire.
Several things to note about the challenge of visualizing this item:
This item is just one of nearly 20 items, many of which have option sets just as large.
All items are completed for each seizure type. It is not uncommon for patients to have 2 or more seizure types, and thus 2 or more completed questionnaires.
Both the patients AND caregivers answer the same questions.
Now consider the health care worker tasked with reviewing and comparing patient versus caregiver item responses. Certainly a difficult and time-consuming task. Moreover, merely aligning the patient and caregiver responses side-by-side in columns still yields a text-heavy display that is difficult to comprehend quickly.
As described in previous blog posts, humans automatically process color and spatial relationships. Using this fact, a line drawing was assigned to each item and placed in a table. The cell background color was then set by the patient's response and the text color by the caregiver's response. In both cases, red indicates endorsement.
As seen in the figure below, this approach makes it immediately obvious which symptoms were endorsed by the patient and caregiver, and where there is a reporting discrepancy. Also note that staff readily learn the symptom-to-location mapping over time, which further increases the speed of comprehension.
In the previous example, the goal was to obtain an accurate and complete understanding of each seizure type. Since the patient may have limited insight into their seizure episodes, the other rater provided a mechanism to ensure each seizure is fully characterized. Consider a slightly different clinical scenario where the goal is to accurately and fully characterize symptoms, as well as track targeted treatment behaviors and ensure all stakeholders are on the same page.
In the image below, example items from the Screen for Child Anxiety Related Disorders (SCARED) are shown in an outcome table (full SCARED questionnaire here). In this case, green was used for "Not True", yellow for "Somewhat True", and red for "Very True". In each cell, the parent's rating sets the background color and the clinician's rating sets the color of an inner circle. The clinician and parent color scales were set to be identical so that discrepancies stand out.
Notice that the table makes it easy to see each item's status and rater congruency, even when the additional complexity of time is introduced. Any number of additional raters could be added to the table (e.g., the child's ratings) by merely adding another symbol (e.g., a triangle) with the same coloring.
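The three-level mapping just described can be sketched as follows. The response labels come from the SCARED items above; the dictionary-based cell representation is a hypothetical simplification of the rendered table:

```python
# SCARED-style cell: a three-level response maps to a traffic-light
# color; the parent's rating colors the cell background and the
# clinician's rating colors an inner circle. Congruent ratings render
# as one uniform color, so discrepancies are visible at a glance.

COLORS = {"Not True": "green", "Somewhat True": "yellow", "Very True": "red"}

def scared_cell(parent_response, clinician_response):
    return {
        "background": COLORS[parent_response],   # parent's rating
        "circle": COLORS[clinician_response],    # clinician's rating
        "congruent": parent_response == clinician_response,
    }

cell = scared_cell("Somewhat True", "Very True")
# a yellow background with a red circle flags the discrepancy
```

Adding a third rater (e.g., the child) would mean adding one more keyed symbol to the returned cell, which mirrors how a new symbol would be added to the rendered table.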
In summary, including multiple raters in your study design or patient registry is a powerful mechanism for ensuring outcomes are accurately and/or fully captured. However, the additional data from multiple raters can make it challenging for consumers to quickly grasp and compare results. The visualization techniques shown here can significantly improve this situation and, ideally, free the multi-rater method from the prison of under-utilization.