1 Introduction

Pattern recognition and clustering algorithms are the methodological cornerstones of the “big data” paradigm. In biology, high-throughput genomics and detailed imaging techniques are avidly applied to learn the details of how cells work and how diseases develop, and big datasets are expanding at an exponential rate, which also means that biomedical data analysis relies more and more on computational modeling and visualization in addition to the traditional descriptive statistics. The taxonomic tradition in biomedicine to categorize phenomena into distinct easily identifiable boxes remains strong, which explains the popularity of classical algorithms such as principal component analysis and hierarchical clustering as the first and often only choices for visualization and interpretation of the multi-dimensional structure of a complex dataset. However, both methods struggle when the dataset resembles a continuum instead of distinct clusters of data.

In the vignette, we focus on biomedical applications of the Numero framework, but there is nothing in the R package that is specifically aimed at biology. The choice of diabetic kidney disease as an example reflects our experience in the field, whereas Numero itself can be applied to any analysis problem that involves complex multi-dimensional data.

The document is organized into sections and paragraphs that describe our motivation for developing the library, introduce the concept of the self-organizing map, describe the dataset we use as an example of a biomedical study, go through a complete R-script of an analysis pipeline, define metabolic subgroups, interpret the results and discuss the role of the map analyses in publications.

1.1 Limitations of conventional categorization

The conventional notion of qualitative data patterns (e.g. health vs. disease) fits well with clustering algorithms that aim to find discriminatory borders automatically within the data. However, we argue that many biomedical datasets do not have a qualitative structure of regularity, but they instead reflect a multivariable spectrum of causes and consequences where the borderline between health and disease is blurred. For instance, chronic kidney disease is defined according to an internationally accepted threshold of glomerular filtration rate (GFR < 60 mL/min 1.73 m2, (Levey & Coresh, 2012)), but there is no mathematically identifiable threshold effect in the population-based GFR distribution or any other biomarker or physical characteristic, as demonstrated by the continuous discussion on diagnostic criteria (Delanaye et al., 2012). Therefore, in most cases it is impossible to say exactly when someone develops chronic kidney disease, only that the diagnostic threshold is reached after a gradual decline, after which treatments can be initiated according to consensus guidelines.

Typical clustering analyses rely on algorithms that are tweaked for different application domains to produce classifications that are mathematically optimal, to reproduce an existing gold standard, or to predict future outcomes. We maintain that excessive reliance on mathematical criteria is not useful for datasets without intrinsic clustering structure, since the choice of the criteria will determine the output rather than the data or practical usefulness of the classification. Furthermore, the process that leads to category assignments is often too complicated to understand on a practical level, so the human observer must rely on the “black box” to produce the classification results without access to the inner workings. We propose a half-way solution, where the aim is to simplify the data presentation with statistical verification so that a human observer can determine a suitable subgrouping for a specific purpose, yet with sufficient access to the data patterns to understand the characteristics of the dataset in detail.

A traditional strict classification model will work well, if measurable qualitative differences exist. For instance, type 1 diabetes is an autoimmune form of diabetes that develops in children and adolescents. The condition is severe with a short life expectancy if untreated, so type 1 diabetes can be considered a qualitative example of health versus disease. Consequently, highly accurate diagnostic biomarkers such as glucose, insulin and C-peptide already exist. Even when treated, type 1 diabetes has a profound long-term impact on energy metabolism and it represents a distinct data cluster that is separate from the non-diabetic population.

Unlike type 1 diabetes, common age-associated diseases such as chronic kidney disease, type 2 diabetes, and atherosclerosis are challenging from a clustering perspective: they take decades to develop, they are not immediately life-threatening if left untreated, and there is a wide variation in severity across individuals. Furthermore, the affected individuals often suffer from multiple interacting chronic conditions, making it difficult to isolate specific causes and symptoms. Therefore, the simplistic notion of a qualitative threshold between health and disease becomes problematic. We aim to address these challenges by creating subgroups that are of practical value beyond mathematical criteria, and guided by a human observer with access to understandable presentations of the multivariable data patterns.

Multiple co-occurring and inter-connected phenomena are hallmarks of complex systems and the observable data that can be obtained from them. This presents a challenge to the traditional paradigms of biomedicine. For instance, differential diagnostics cannot cope well with multiple overlapping diseases, or evolving degrees of severity. This motivated us to develop the Numero framework in such a way as to enable visual comparisons of multiple overlapping diagnoses and their diagnostic criteria. We expect the Numero to be highly valuable in situations where the most important outcome or a set of outcomes is not obvious (e.g. competing risk scenarios). For instance, patients with type 1 diabetes may develop serious injuries to their vasculature over decades, but the affected organs, severity and rate of progression vary. Therefore, predictive models that focus only on a single outcome at a time may miss the big picture. The example of diabetic kidney disease we use in this vignette demonstrates how to use the Numero framework to gain insight into the overlaps and longitudinal associations between multiple morbidities.

1.2 Self-organizing map

Expressing multivariable data in visual form is a critical part of any knowledge discovery process, and an extensive number of algorithms have been developed in recent decades. In many cases, the aim is to project a set of multivariable data points into a two-dimensional presentation for human viewing (Figure 1). We built the Numero package using the self-organizing map (SOM) algorithm (Kohonen, Schroeder, & Huang, 2001), which is based on only a few simple mathematical rules, does not break down from missing data and can handle a high number of variables. We also developed a method to estimate the statistical significance of the map patterns (V.-P. Mäkinen et al., 2008b). Of note, the modular structure of the library allows users to replace the SOM with any other suitable algorithm for customized analysis pipelines.

A conceptual example. The example shows how to organize objects with multiple features into a two-dimensional layout. The images were obtained from @cardoso2014taxonomic.

Figure 1: A conceptual example
The example shows how to organize objects with multiple features into a two-dimensional layout. The images were obtained from Cardoso, Queiroz, & Lima (2014).

Conceptually, the SOM algorithm mimics a human observer who wants to make sense of a set of objects. For instance, Figure 1A depicts schematic drawings of the flowering legume genus Luetzelburgia that grows in South America (Cardoso et al., 2014). Figure 1B shows how a human observer might organize the drawings based on their visual similarities (shape, size and other morphological details). By organization, we refer to the spatial layout of the drawings on the two-dimensional canvas: drawings that look similar are close to each other, whereas drawings that look different are far apart (in most cases). This is how all people, from children to elderly, sort and classify their objects with multiple observable features (= multivariable data points) with the help of a two-dimensional surface (= data map). The same observer then decides how to split the dataset into subgroups based on his or her domain knowledge.

If there are thousands of drawings, manual organization becomes impractical. For this reason, we let the SOM algorithm to do the first organization step, and to visualize the salient patterns within the dataset in a two-dimensional data map. The spatial principle still applies: multivariable data points that have similar values are close to each other, whereas data points that are different are on the opposite sides of the map. The second step of defining subgroups remains the responsibility of the observer. We argue that this type of data-assisted subgrouping is particularly useful in situations where there is no qualitative threshold between health and disease, but a line must be drawn to initiate preventative measures or treatments.

Although there are only 18 drawings in Figure 1, the nature of the dataset resembles many epidemiological studies. Specifically, some of the drawings are very similar, but it is not obvious how they should be classified into subgroups (i.e. our version of the figure can be disputed, a single “correct” visual subgrouping may not exist). If the classification was based on the height of a drawing, the results would look different compared to using the width – some drawings are narrow while being long, whereas others are wide despite being short. This is a naïve example on how the selection of the mathematical criteria for classification has a substantial impact on the results, and illustrates the motivation for our work. We developed the Numero library as an alternative tool that helps researchers to define meaningful groupings when pure mathematics cannot provide a conclusive answer.

Previous versions of the software (written for Matlab) were successfully used on a range of metabolomics and other biomedical studies (Bernardi et al., 2010; Kumpula et al., 2010; Kuusisto et al., 2012; V.-P. Mäkinen et al., 2008a, 2008b, 2013; Tukiainen et al., 2008; Würtz et al., 2011). However, the old version used a rectangular SOM, which tends to guide observers into picking four subgroups in the corners even when not supported by data. We created the Numero package with a circular implementation of the SOM to remove the limitations from cornered border shapes. Additional technical details and supportive material are available as an online supplement within a previous publication (V.-P. Mäkinen et al., 2012).

2 Terminology

Data point -– Here, we define the term data point as a single uniquely identifiable row in a spreadsheet of data (with variables as columns). For instance, in the diabetic kidney disease dataset described in the next section, a data point refers to a patient (and vice versa) as there is only one row per patient.

Map –- A map is a general term to describe the two-dimensional canvas onto which the multivariable data points are projected. The concept is analogous to a geographic map that indicates where people live, except that the location is not based on geography (i.e. physical distances), but comes from the data (i.e. distances = data-based similarities).

Layout –- We make a distinction between what is a map, and what is the layout of data points on it. The layout is a table of data point locations as coordinates, whereas the map is a more integrated concept that also includes information that is necessary to find the locations of new previously unseen data points, and to draw and paint the map in visual form.

District –- A district refers to a pre-defined division of the map into uniformly sized areas. The districts are created mainly for technical reasons: using districts speeds up calculations and enables the estimation of map-related statistics. This is analogous to a real city being divided into districts to estimate regional demographics, for instance.

Coloring –- The Numero framework always creates a single map. However, the map districts can be painted with different colors. This enables the user to create multiple colorings of the map to visualize regional differences. These colorings can be made for each variable, which helps to identify which parts of the map are particularly important for a specific phenomenon. Again, this is similar to a real city map where the districts are colored according to the income level of the local residents, or according to the mean age, smoking rates, obesity etc.

Subgroup –- We expect that most uses of Numero will result in the subgrouping of a complex dataset. Visually, we define a subgroup via a contiguous set of adjacent districts on the map. Consequently, all the data points that are located within the set of districts are the subgroup members.

District profile –- The SOM algorithm works through the districts during the optimization of the data point layout on the map. The computational process eventually converges to a stable configuration that is stored as a set of district profiles. From a practical point of view, a district profile represents the typical average profile that captures the characteristics of the data points within the district. In technical terms, the district profile (also known as the prototype) contains the mean weighted data values across all the data points, where the weights are determined by the neighborhood function used in the SOM algorithm.

Best-matching district –- The best-matching district (BMD, also known as the best-matching unit in the literature) is the district with a profile that is the most similar to a data point when considering all variables simultaneously. The BMD is closely related to the data point layout: the assigned location for a data point is the location of the BMD for that data point.

3 Example dataset of diabetic kidney disease

Diabetic kidney disease is the leading indication for dialysis and kidney transplantation in the developed countries, and carries a substantial risk of premature death due to cardiovascular disease. About one third of individuals with type 1 diabetes will develop diabetic kidney disease during their lifetime. As the age of onset of type 1 diabetes is in childhood or adolescence, these individuals will develop complications at a relatively early age. Therefore, people with type 1 diabetes represent a particularly vulnerable group facing lower quality of life and reduced life span due to kidney damage.

Albuminuria (elevated albumin concentration in urine) is the basis for the clinical classification of diabetic kidney disease. In this example, we applied the threshold of 300mg/24h, if 24h urine collections were done and 0.2 mg/min when overnight urine data were available from the local medical centers that examined the patients. If the threshold was exceeded in at least two out of three consecutive measurements, we assigned the individual in the diabetic kidney disease group. In addition, the FinnDiane Study Group measured urinary albumin excretion rate from a single 24h urine sample in their designated central laboratory. The logarithm of the albumin excretion rate was included in the example dataset.

Our example dataset contains a subset of data from a previous publication (V.-P. Mäkinen et al., 2008b). We created the simplified dataset for educational purposes, but it contains enough information to replicate some of the findings from the original study. The dataset includes 613 individuals of whom 225 individuals had diabetic kidney disease at baseline. In addition, we included information on whether an individual had died after an eight-year follow-up to demonstrate how the study design we chose can be applied to longitudinal data. The available data are summarized in Table 1.

Table 1: Summary of the diabetic kidney disease dataset from the FinnDiane Study
The mean and standard deviation are reported for continuous variables. Abbreviations: urinary albumin excretion rate (AER), triglycerides (TG), high density lipoprotein subclass 2 (HDL2). P-values were estimated by the t-test for continuous variables and by Fisher’s test for binary traits.
Trait No kidney disease Diabetic kidney disease P-value
Men / Women 192 / 196 119 / 106 0.45
Age (years) 38.8 ± 12.2 41.7 ± 9.7 0.0012
Type 1 diabetes duration (years) 25.3 ± 10.3 28.6 ± 7.8 <0.001
Log10 of AER (mg/24h) 1.20 ± 0.51 2.72 ± 0.59 <0.001
Log10 of TG (mmol/L) 0.034 ± 0.201 0.159 ± 0.212 <0.001
Total cholesterol (mmol/L) 4.89 ± 0.77 5.35 ± 0.96 <0.001
HDL2 cholesterol (mmol/L) 0.54 ± 0.16 0.51 ± 0.18 0.027
Log10 of serum creatinine (µmol/L) 1.94 ± 0.09 2.14 ± 0.24 <0.001
Metabolic syndrome 90 (23.2%) 114 (50.7%) <0.001
Macrovascular disease 16 (4.1%) 38 (16.9%) <0.001
Diabetic retinopathy 133 (34.4%) 178 (79.1%) <0.001
Died during follow-up 13 (3.4%) 43 (19.1%) <0.001

4 Aims and study design

In the original study, we hypothesized that the metabolic profile of an individual with type 1 diabetes at baseline predicts adverse events in the future (V.-P. Mäkinen et al., 2008b). Here, we set two aims to test the same hypothesis in the example dataset:

  1. Define metabolic subgroups of type 1 diabetes based on biochemical data.
  2. Identify subgroups with high all-cause mortality.

We chose these aims to accommodate a high number of variables and to ensure statistical robustness. Please note that we included only a few variables in the example dataset for pedagogical reasons, but the SOM in the original study was created based on thousands of variables.

The strict separation of Aim 1 and 2 is an example of an unsupervised classification design where the metabolic subgroups are created without using the mortality data. Only after the subgroup modeling has been completed, the deaths during follow-up will be counted within the subgroups. An alternative would be to employ regression or other supervised methods that use all the available data simultaneously to create a predictive model of mortality. While supervised models can achieve high accuracy, they rarely work well outside the dataset they were created for, and they may fail if the outcome to be predicted is poorly defined or biased. For these reasons, we adopted the more robust unsupervised classification design.

We denote the study design as “split-by-variables” since it starts from a spreadsheet with one patient per row and the variables organized into columns, and then assigns one set of variables into the training set, and the remaining variables (e.g. deceased or alive at follow-up) into the evaluation set (Figure 2). Since the evaluation set plays no part in the training of the SOM, we can estimate the statistical significance of the mortality pattern without over-estimating the model accuracy.