Resolving 'No Shared Levels Found' Error In R Spaghetti Plots
Introduction
Guys, today we're diving into a quirky little bug we encountered while using the goshawk package for creating spaghetti plots. Specifically, we ran into a warning message: "No shared levels found between names(values) of the manual scale and the data's colour values." This happened when trying to visualize longitudinal data, and I'm here to break down what occurred, how we tried to fix it, and what the potential solutions might be. So, buckle up, and let's get started!
What Happened?
So, what exactly went down? We were using the nest_longitudinal_dev app (available at https://genentech.shinyapps.io/nest_longitudinal_dev/) to generate spaghetti plots. For those unfamiliar, spaghetti plots draw one line per subject to show how that subject's measurements change over time, making them super useful for tracking trends in clinical trials or other longitudinal studies.
The issue popped up when we tried to create a spaghetti plot using the goshawk::g_spaghettiplot function. The function is designed to help visualize data trends over time for different treatment groups. We provided the necessary data and specified color mappings for different treatment groups using the color_manual argument. However, R issued a warning indicating that there were no shared levels between the manual scale's names and the data's color values. This means that the colors we were trying to assign to specific groups weren't lining up with the group labels actually present in the data.
To make it clearer, let's walk through the code snippet that triggered the warning:
p <- goshawk::g_spaghettiplot(
  data = ANL,
  subj_id = "USUBJID",
  biomarker_var = "PARAMCD",
  biomarker_var_label = "PARAM",
  biomarker = "ALT",
  value_var = "CHG",
  trt_group = "TRT01A",
  trt_group_level = NULL,
  time = "AVISITCDN",
  time_level = NULL,
  color_manual = c(`Drug X 100mg` = "#1e90ff", `Combination 100mg` = "#bb9990", Placebo = "#ffa07a"),
  color_comb = "#39ff14",
  ylim = c(min = -48, max = 47),
  facet_ncol = 2L,
  facet_scales = "fixed",
  hline_arb = c(10, 30),
  hline_arb_label = c("default A", "default B"),
  hline_arb_color = c("grey", "red"),
  xtick = xtick,
  xlabel = xlabel,
  rotate_xlab = TRUE,
  font_size = 9L,
  dot_size = 9L,
  alpha = 0.2,
  group_stats = "NONE",
  hline_vars = NULL,
  hline_vars_colors = character(0),
  hline_vars_labels = character(0)
)
p
In this code, we're calling g_spaghettiplot with various parameters, including color_manual, which maps treatment groups to specific colors. The warning suggests that the names in color_manual (i.e., "Drug X 100mg", "Combination 100mg", "Placebo") don't match the actual treatment group levels in the data (ANL$TRT01A). This mismatch can occur due to inconsistencies in naming conventions, typos, or subtle differences in how the treatment groups are labeled in the data compared to the color mapping.
This warning matters because it means the colors in the plot might not represent the intended treatment groups. Imagine a scenario where "Drug X 100mg" appears in the data as "Drug X 100 mg": a name-based color mapping has no entry for that level, so the group won't get its intended color and the plot can be misread. Addressing this warning is therefore crucial for an accurate visual representation.
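A quick way to see whether the names line up is to compare them directly. The sketch below uses base R only and invented treatment labels standing in for ANL$TRT01A (the "A: Drug X" style values are an assumption for illustration, not taken from the app). As far as I understand ggplot2's manual scales, the warning appears when the overlap between the two sets is empty.

```r
# Hypothetical labels standing in for ANL$TRT01A (assumed, not from the app).
trt_values <- c("A: Drug X", "B: Placebo", "C: Combination", "B: Placebo")

# The palette exactly as passed to color_manual in the call above.
color_manual <- c(
  `Drug X 100mg` = "#1e90ff",
  `Combination 100mg` = "#bb9990",
  Placebo = "#ffa07a"
)

data_levels <- unique(as.character(trt_values))

# Names that appear on both sides; an empty result means no colour can be matched.
shared <- intersect(names(color_manual), data_levels)

# Levels in the data that have no colour assigned.
unmatched_levels <- setdiff(data_levels, names(color_manual))

print(shared)
print(unmatched_levels)
```

Running the same comparison against the real ANL$TRT01A values would show immediately which side needs to change.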
Diving Deeper into the Code
To truly understand the root of the issue, let's go through the R script that leads up to the warning step by step, identify potential problem areas, and see how the data is transformed before being fed into the goshawk::g_spaghettiplot function.
Data Preparation
The initial part of the script focuses on data preparation. It begins by loading necessary libraries such as DescTools, magrittr, dplyr, random.cdisc.data, and stringr. These libraries provide functions for data manipulation, string processing, and statistical analysis. The random.cdisc.data library is particularly interesting as it generates synthetic clinical trial data, which is perfect for testing and development purposes. The use of synthetic data allows developers to work on visualization tools like goshawk without being constrained by real patient data, which can often be subject to privacy restrictions and access limitations.
Next, the code generates two primary datasets: ADSL (the Subject-Level Analysis Dataset) and ADLB (the Laboratory Analysis Dataset) using the radsl and radlb functions, respectively. These datasets are crucial as they form the foundation for the spaghetti plot. The ADSL dataset typically contains demographic and baseline characteristics of the subjects, while ADLB includes laboratory measurements over time. Together, they provide a comprehensive view of patient data throughout the clinical trial. By generating these datasets, the script ensures a controlled and reproducible environment for the visualization process.
Data Transformation and Cleaning
A significant portion of the code is dedicated to data transformation and cleaning. The script starts by filtering the ADSL dataset to include only subjects with ITTFL == "Y", indicating the intent-to-treat population. It then creates a new variable, TRTORD, representing the treatment order, and maps treatment names using a predefined mapping list .arm_mapping. This step is crucial for ensuring consistent treatment group labeling, a common challenge in clinical trial data where treatment names can vary slightly across datasets. The use of the .make_label function suggests an attempt to preserve metadata (labels) during data transformations, an important practice for maintaining data provenance and interpretability.
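To make the filter-and-relabel step concrete, here is a minimal base R sketch. Everything in it is an assumption for illustration: the subjects, the raw arm labels, the column being relabeled, and the contents of the mapping (named arm_mapping here as a stand-in for .arm_mapping). The point it shows is that after relabeling, the plotted column has to carry the same labels as the color palette, and any column that skipped the mapping will not.

```r
# Hypothetical subject-level data; all values are invented for illustration.
adsl <- data.frame(
  USUBJID = c("S1", "S2", "S3", "S4"),
  ITTFL = c("Y", "Y", "N", "Y"),
  ARM = c("A: Drug X", "B: Placebo", "A: Drug X", "C: Combination"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for .arm_mapping: raw label -> display label.
arm_mapping <- c(
  "A: Drug X" = "Drug X 100mg",
  "C: Combination" = "Combination 100mg",
  "B: Placebo" = "Placebo"
)

# Keep the intent-to-treat population only.
itt <- adsl[adsl$ITTFL == "Y", ]

# Relabel by name lookup; unname() drops the lookup names from the result.
itt$ARM <- unname(arm_mapping[itt$ARM])

# A raw label missing from the mapping would surface here as NA.
stopifnot(!anyNA(itt$ARM))
print(itt)
```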
Further transformations involve converting character variables to factors using mutate_at and defining a subset of ADLB called .ADLB_SUBSET. This subsetting operation filters the data based on several criteria, including the presence of non-missing AVAL values (analysis values), ITTFL == "Y", and specific visit codes (e.g., "SCREEN%", "BASE%", "%WEEK%", "%FOLLOW%"). These filters are designed to focus the analysis on relevant data points, such as baseline and follow-up measurements. The script also standardizes visit codes using a case-when construct, mapping visit descriptions to shorter codes (e.g., "SCREENING" to "SCR", "BASELINE" to "BL"). This standardization is critical for time-series visualizations like spaghetti plots, where consistent time variable representation is essential.
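The visit standardization can be sketched in base R with pattern matching instead of the case-when construct the script uses. Only the "SCREENING" to "SCR" and "BASELINE" to "BL" mappings come from the text above; the sample visit strings and the week and follow-up short codes are assumptions.

```r
# Hypothetical visit descriptions (assumed sample values).
avisit <- c("SCREENING", "BASELINE", "WEEK 1 DAY 8", "WEEK 2 DAY 15", "FOLLOW-UP")

standardize_visit <- function(x) {
  # Anything that matches no pattern stays NA so gaps are easy to spot.
  out <- rep(NA_character_, length(x))
  out[grepl("^SCREEN", x)] <- "SCR"
  out[grepl("^BASE", x)] <- "BL"
  is_week <- grepl("WEEK", x)
  # Abbreviate "WEEK <n> ..." to "W <n>" (this short form is assumed).
  out[is_week] <- sub("^.*WEEK ([0-9]+).*$", "W \\1", x[is_week])
  # "FU" for follow-up visits is also an assumed code.
  out[grepl("FOLLOW", x)] <- "FU"
  out
}

print(standardize_visit(avisit))
```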
Key Data Manipulations
One of the key data manipulations involves the creation of AVALL2, BASEL2, and BASE2L2 variables. These variables represent the log2 transformation of AVAL, BASE, and BASE2 values, respectively. The transformations are applied conditionally based on the PARAMCD (parameter code) and the minimum AVAL value for each parameter. This suggests an effort to handle zero or negative values, which cannot be directly log-transformed, and to normalize the data for visualization purposes. Log transformations are commonly used in biological data to reduce skewness and make data more amenable to statistical analysis and visualization.
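Here is one way such a conditional log2 transform could look. The rule used below (replace a zero with half of the smallest positive value for that parameter, then take log2) is an assumption chosen for illustration; the original script's exact rule may differ, and the data are invented.

```r
# Hypothetical lab values for two parameters; "CRP" contains a zero.
adlb <- data.frame(
  PARAMCD = c("ALT", "ALT", "CRP", "CRP", "CRP"),
  AVAL = c(8, 32, 0, 4, 16)
)

# Smallest positive AVAL per parameter, aligned to the rows of adlb.
min_pos <- ave(adlb$AVAL, adlb$PARAMCD, FUN = function(v) min(v[v > 0]))

# Assumed rule: a zero becomes half the smallest positive value of its
# parameter, so that log2() never sees a non-positive input.
safe_aval <- ifelse(adlb$AVAL == 0, min_pos / 2, adlb$AVAL)
adlb$AVALL2 <- log2(safe_aval)

print(adlb)
```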
Additionally, the script introduces noise into the LBSTRESC variable (character result/finding in standard format) by randomly replacing values with strings like "<25" or ">75". This could be a simulation technique to mimic real-world data variability or to test the robustness of the visualization against missing or extreme values. Such manipulations highlight the importance of understanding the data generation process when interpreting the spaghetti plot.
Final Data Preparation Steps
The final data preparation steps involve joining .PARAM_MINS with .ADLB_SUPED1, further mutating columns, and preparing the data for the goshawk::g_spaghettiplot function. The .ADLB_LOQS object, created using goshawk:::h_identify_loq_values, identifies and flags values below the limit of quantification (LOQ), a common issue in laboratory data. This step is crucial for handling censored data, where values below a certain threshold are not precisely measured.
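To illustrate what flagging censored results means in practice, the sketch below parses strings like "<25" and ">75" with base R. It is a hand-rolled stand-in and does not call or reproduce goshawk:::h_identify_loq_values; the sample values and flag names are assumptions.

```r
# Hypothetical character results, including censored entries such as "<25".
lbstresc <- c("31.2", "<25", "48", ">75", "60.5")

# Flag values reported below or above a limit of quantification.
loq_flag <- ifelse(
  grepl("^<", lbstresc), "below",
  ifelse(grepl("^>", lbstresc), "above", "none")
)

# Strip the comparison sign to recover the numeric limit or measurement.
numeric_part <- as.numeric(sub("^[<>]", "", lbstresc))

print(data.frame(lbstresc, loq_flag, numeric_part))
```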
Finally, the script performs a critical inner join between ADLB and ADSL based on STUDYID and USUBJID. This join ensures that laboratory data is linked to subject-level characteristics, which is essential for grouping and coloring the spaghetti plot. The script also includes stopifnot statements with hash values, likely used for data integrity checks, ensuring that the datasets have not been inadvertently altered during the transformations.
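The join and the follow-up checks can be sketched with base R's merge(), which performs an inner join by default. The data below are invented, and the simple row-count and membership checks are an assumed stand-in for the hash comparisons in the script.

```r
# Hypothetical subject-level and lab data (all values invented).
adsl <- data.frame(
  STUDYID = "AB12345",
  USUBJID = c("S1", "S2", "S3"),
  TRT01A = c("Placebo", "Drug X 100mg", "Placebo"),
  stringsAsFactors = FALSE
)
adlb <- data.frame(
  STUDYID = "AB12345",
  USUBJID = c("S1", "S1", "S2", "S9"),
  PARAMCD = "ALT",
  AVAL = c(20, 22, 35, 40),
  stringsAsFactors = FALSE
)

# Inner join: lab rows without a matching subject (here S9) are dropped.
anl <- merge(adlb, adsl, by = c("STUDYID", "USUBJID"))

# Cheap integrity checks standing in for the hash comparison.
stopifnot(
  nrow(anl) == 3,
  !anyNA(anl$TRT01A),
  all(anl$USUBJID %in% adsl$USUBJID)
)
print(anl)
```

After a join like this, the treatment column used for coloring comes from the subject-level table, so its labels are the ones that must match the palette names.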
Identifying the Problem Area
After the data preparation steps, the script narrows its focus to the ANL dataset by filtering ADLB for PARAMCD == "ALT" (alanine aminotransferase). This filtering action suggests that the spaghetti plot is specifically designed to visualize ALT measurements over time. The script also defines xtick and xlabel vectors to customize the x-axis labels of the plot, mapping numeric visit codes to descriptive visit names (e.g., -2 to "Screening", 0 to "Baseline", etc.). This level of detail in axis customization demonstrates a commitment to clear and interpretable visualizations.
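A small sketch of the filter plus the axis vectors follows. Only the -2 to "Screening" and 0 to "Baseline" pairs come from the text; the other ticks, labels, and data values are placeholders. The check at the end lists visit codes that appear in the data but have no tick label.

```r
# Tick positions and labels; only the first two pairs come from the text,
# the week entries are placeholders.
xtick <- c(-2, 0, 1, 2)
xlabel <- c("Screening", "Baseline", "Week 1", "Week 2")
stopifnot(length(xtick) == length(xlabel), !anyDuplicated(xtick))

# Hypothetical lab data with two parameters.
adlb <- data.frame(
  PARAMCD = c("ALT", "CRP", "ALT"),
  AVISITCDN = c(-2, 0, 3),
  CHG = c(0, 1.5, -4)
)

# Keep the ALT rows only, as the script does for ANL.
anl <- adlb[adlb$PARAMCD == "ALT", ]

# Visit codes present in the data that would get no axis label.
unlabelled <- setdiff(unique(anl$AVISITCDN), xtick)
print(unlabelled)
```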
It is at this stage, just before the call to goshawk::g_spaghettiplot, that the potential problem area becomes most apparent. The ANL dataset, filtered for ALT measurements, is directly passed into the plotting function, along with the color_manual argument. The warning message "No shared levels found between names(values) of the manual scale and the data's colour values" strongly suggests a mismatch between the treatment group levels in ANL$TRT01A and the names specified in color_manual. A discrepancy here would prevent the plot from correctly mapping colors to treatment groups, leading to the warning.
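If the diagnosis is a name mismatch, one possible remedy is to build the palette from the levels that are actually in the data rather than typing the names by hand. This is a sketch under assumptions, not the fix the app adopted: the treatment labels are invented, and the colors are simply assigned in level order.

```r
# Hypothetical treatment column standing in for ANL$TRT01A.
trt <- factor(c("A: Drug X", "B: Placebo", "C: Combination", "B: Placebo"))

# Hex codes reused from the call above, assigned here in level order.
hex_codes <- c("#1e90ff", "#ffa07a", "#bb9990")
stopifnot(length(hex_codes) >= nlevels(trt))

# Name each colour after a real level so the names cannot drift from the data.
color_manual <- setNames(hex_codes[seq_len(nlevels(trt))], levels(trt))

# Every level now has a colour.
stopifnot(all(levels(trt) %in% names(color_manual)))
print(color_manual)
```

The alternative is to leave the palette alone and relabel the data column so its levels match the palette names; which side to change depends on which labels should appear in the legend.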
Examining the Plotting Call
The heart of the matter lies in the call to goshawk::g_spaghettiplot. This function is designed to generate the spaghetti plot, but the warning messages indicate a problem with how colors are being assigned. Let's break down the key arguments:
- `data = ANL`: Specifies the dataset to use for the plot.
- `subj_id =