Similarity-driven multi-view linear reconstruction

Introduction

Similarity-driven multi-view linear reconstruction (SiMLR) is an algorithm that exploits inter-modality relationships to transform large scientific datasets into a smaller joint space. The link between the original data $X_i$ ($n \times p$) and the reduced embedding space is a sparse set of features $v_i$ ($p \times k$). Standard statistical tools may then be applied to the embeddings $s_i = X_i v_i$.
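
To fix the notation, here is a toy sketch (not SiMLR itself) of how a sparse feature matrix maps one view into the embedding space:

# toy illustration of the notation: one view X and sparse features v
X <- matrix(rnorm(100 * 20), nrow = 100)   # a view with n=100, p=20
v <- matrix(0, nrow = 20, ncol = 3)        # sparse feature matrix, k=3
v[sample(length(v), 10)] <- rnorm(10)      # a few nonzero loadings
s <- X %*% v                               # the embedding s = X v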

SiMLR may be used in a variety of ways. We cover a basic example and then follow with different use cases. These examples are less involved than the cloud computing examples in the original publication but illustrate new functionality.

Example: Applying SiMLR to Simulated Multi-View Data

In this section, we explore how SiMLR can be applied to multi-view datasets. We generate synthetic data, apply SiMLR, compare the results to a traditional method (singular value decomposition, SVD), and visualize the relationships between the views. This example derives from the simulation-based evaluation in the original paper.

Step 1: Simulate Multi-View Data

We begin by simulating three different views of data, each representing different modalities or datasets that share some underlying structure.

library(ANTsR)
#> Warning: replacing previous import 'stats::filter' by 'dplyr::filter' when
#> loading 'ANTsR'
#> ANTsR 0.6.1
#> Environment variables set either in .Renviron or with a seed (e.g. XXX): 
#> Sys.setenv(ANTS_RANDOM_SEED = XXX)
#> Sys.setenv(ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS = 1)
#> may influence reproducibility in some methods. See
#> https://github.com/ANTsX/ANTs/wiki/antsRegistration-reproducibility-issues
#> for more information.Also see *repro methods in antsRegistration.
#> 
#> Attaching package: 'ANTsR'
#> The following objects are masked from 'package:stats':
#> 
#>     sd, var
#> The following objects are masked from 'package:base':
#> 
#>     all, any, apply, max, min, prod, range, sum
library(reshape2)
set.seed(1500)
nsub <- 100          # Number of subjects/samples
npix <- c(100, 200, 133)  # Number of features in each view
nk <- 5             # Number of latent factors

# Generating outcome matrices for each view
outcome <- matrix(rnorm(nsub * nk), ncol = nk)
outcome1 <- matrix(rnorm(nsub * nk), ncol = nk)
outcome2 <- matrix(rnorm(nsub * nk), ncol = nk)
outcome3 <- matrix(rnorm(nsub * nk), ncol = nk)

# Generating transformation matrices for each view
view1tx <- matrix(rnorm(npix[1] * nk), nrow = nk)
view2tx <- matrix(rnorm(npix[2] * nk), nrow = nk)
view3tx <- matrix(rnorm(npix[3] * nk), nrow = nk)

# Creating the multi-view data matrices
mat1 <- (outcome %*% t(outcome1) %*% (outcome1)) %*% view1tx
mat2 <- (outcome %*% t(outcome2) %*% (outcome2)) %*% view2tx
mat3 <- (outcome %*% t(outcome3) %*% (outcome3)) %*% view3tx
colnames(mat1)=paste0("m1.",1:ncol(mat1))
colnames(mat2)=paste0("m2.",1:ncol(mat2))
colnames(mat3)=paste0("m3.",1:ncol(mat3))

# Combine the matrices into a list
matlist <- list(m1 = mat1, m2 = mat2, m3 = mat3)

In this code, mat1, mat2, and mat3 represent three views of the data. Each view is constructed to share a common underlying structure but with different transformations applied.
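
As an optional sanity check (illustrative only, not part of the SiMLR workflow), all three views are driven by the same outcome matrix, so the leading singular subspaces of any two views should overlap strongly:

# canonical correlations between the leading SVD bases of two views
# should be large because both column spaces lie within col(outcome)
cancor( svd(mat1, nu = nk)$u, svd(mat2, nu = nk)$u )$cor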

Step 2: Apply SiMLR

Now we apply the SiMLR algorithm to the simulated multi-view data to find a common representation across the views.

# Applying SiMLR
prepro=c( "centerAndScale","np")
# constraint options; the trailing numbers are weights on the optimization
# term of the energy, and only the final assignment below takes effect
# constr='Stiefelx100x1'
# constr='orthox100x0.1'
constr='Grassmannx100x1'
result <- simlr(matlist,constraint=constr,scale=prepro, 
#  mixAlg='pca', energyType='cca', # an alternative pairing
  mixAlg='ica', energyType='regression', # used in this example
  sparsenessAlg = "spmp",
  iterations=100, verbose=TRUE)
#> [1] "Grassmann" "100"       "1"        
#> [1] "     <0> BUILD-V <0> BUILD-V <0> BUILD-V <0> BUILD-V <0>    "
#> [1] "initialDataTerm: 0.00853325097828405  <o> mixer: ica  <o> E:  regression"
#> [1] "Iteration: 1 bestEv: 15.1731869055101 bestIt: 1 CE: 15.1731869055101 featOrth: 0"
#> [1] "Iteration: 2 bestEv: 15.1731869055101 bestIt: 1 CE: 18.0294080669307 featOrth: 0"
#> [1] "Iteration: 3 bestEv: 15.1731869055101 bestIt: 1 CE: 18.3230344177197 featOrth: 2.57221819646702e-09"
#> [1] "Iteration: 4 bestEv: 15.1731869055101 bestIt: 1 CE: 18.506984331078 featOrth: 0"
#> [1] "Iteration: 5 bestEv: 15.1731869055101 bestIt: 1 CE: 19.5779040898486 featOrth: 0"
#> [1] "Iteration: 6 bestEv: 15.1731869055101 bestIt: 1 CE: 21.5014248185066 featOrth: 1.74800657263809e-05"
#> [1] "Iteration: 7 bestEv: 15.1731869055101 bestIt: 1 CE: 18.8535397830963 featOrth: 5.92035465227189e-06"

# Projecting data into the reduced space
p1 <- mat1 %*% (result$v[[1]])
p2 <- mat2 %*% (result$v[[2]])
p3 <- mat3 %*% (result$v[[3]])

SiMLR identifies a sparse set of features that best reconstruct each view; projecting each view onto its features yields p1, p2, and p3.
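
Because the feature matrices are sparse, most loadings are exactly zero; a quick illustrative check:

# fraction of zero loadings in each view's feature matrix
sapply( result$v, function(v) mean(v == 0) )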

Step 3: Visualization and Comparison

To understand the effectiveness of the SiMLR projections, we can compare the correlations between the projections of different views. Additionally, we’ll compare the results to those obtained via Singular Value Decomposition (SVD).

# Calculate SVD for comparison
svd1 <- svd(mat1, nu = nk, nv = 0)$u
svd2 <- svd(mat2, nu = nk, nv = 0)$u
svd3 <- svd(mat3, nu = nk, nv = 0)$u

# Calculate correlations between the SiMLR projections
cor_p1_p2 <- range(cor(p1, p2))
cor_p1_p3 <- range(cor(p1, p3))
cor_p2_p3 <- range(cor(p2, p3))

# Compare with correlations from SVD
cor_svd1_svd2 <- range(cor(svd1, svd2))

# Print the results
cat("Correlation between p1 and p2:", cor_p1_p2, "\n")
#> Correlation between p1 and p2: -0.1469598 0.8445678
cat("Correlation between p1 and p3:", cor_p1_p3, "\n")
#> Correlation between p1 and p3: -0.5011919 0.9324397
cat("Correlation between p2 and p3:", cor_p2_p3, "\n")
#> Correlation between p2 and p3: -0.692099 0.7550467
cat("Correlation between SVD1 and SVD2:", cor_svd1_svd2, "\n")
#> Correlation between SVD1 and SVD2: -0.834545 0.5642089

Visualizing Correlations

Visualizing these correlations helps us understand the similarity between the views after the SiMLR transformation.

library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.3.1

# Function to plot correlation heatmaps
plot_cor_heatmap <- function(mat1, mat2, title) {
  cor_mat <- abs(cor(mat1,mat2))
  ggplot(melt(cor_mat), aes(Var1, Var2, fill = value)) +
    geom_tile() +
    scale_fill_gradient2(midpoint = 0.5, low = "blue", high = "red", mid = "white") +
    theme_minimal() +
    labs(title = title, x = "Variables", y = "Variables") +
    coord_fixed()
}

# Plotting the correlations
plot_cor_heatmap(p1, p2, "Correlation Heatmap of p1-p2")

plot_cor_heatmap(p1, p3,"Correlation Heatmap of p1-p3")

plot_cor_heatmap(p2, p3, "Correlation Heatmap of p2-p3")

These heatmaps show how the reduced representations from SiMLR correlate across views. Comparing these with similar plots for SVD will reveal whether SiMLR captures more meaningful relationships between the views.
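
For a direct comparison, the same helper can be applied to the SVD bases computed above:

plot_cor_heatmap(svd1, svd2, "Correlation Heatmap of svd1-svd2")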

Step 4: Permutation Test

A permutation test can assess whether the observed correlations are significant.

s1 <- sample(1:nsub)
s2 <- sample(1:nsub)
permMats=list(vox = mat1, vox2 = mat2[s1, ], vox3 = mat3[s2, ])
resultp <- simlr(permMats,
  constraint=constr,scale=prepro,  sparsenessAlg = "spmp" )
p1p <- mat1 %*% (resultp$v[[1]])
p2p <- mat2[s1, ] %*% (resultp$v[[2]])
p3p <- mat3[s2, ] %*% (resultp$v[[3]])

# Compare the permuted correlations
cor_p1p_p2p <- range(cor(p1p, p2p))
cor_p1p_p3p <- range(cor(p1p, p3p))
cor_p2p_p3p <- range(cor(p2p, p3p))

# Print permuted results
cat("Permuted Correlation between p1p and p2p:", cor_p1p_p2p, "\n")
#> Permuted Correlation between p1p and p2p: -0.1602327 0.1008282
cat("Permuted Correlation between p1p and p3p:", cor_p1p_p3p, "\n")
#> Permuted Correlation between p1p and p3p: -0.1961412 0.2505636
cat("Permuted Correlation between p2p and p3p:", cor_p2p_p3p, "\n")
#> Permuted Correlation between p2p and p3p: -0.1998227 0.1645429

The permutation test will show if the observed correlations are stronger than those expected by chance.
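
A single permutation, as above, is only a spot check. A sketch of a fuller test (reusing the simlr arguments from before) repeats the shuffle to build a null distribution for a summary statistic:

# null distribution for the max absolute cross-view correlation
permstat <- rep(NA_real_, 20)
for (k in seq_along(permstat)) {
  s1k <- sample(1:nsub); s2k <- sample(1:nsub)
  rp <- simlr( list(vox = mat1, vox2 = mat2[s1k, ], vox3 = mat3[s2k, ]),
    constraint = constr, scale = prepro, sparsenessAlg = "spmp" )
  permstat[k] <- max(abs(cor(mat1 %*% rp$v[[1]], mat2[s1k, ] %*% rp$v[[2]])))
}
mean(permstat >= max(abs(cor(p1, p2))))  # approximate p-value

The simlr.perm function, used later in this vignette, automates this pattern.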

This introductory example demonstrates how SiMLR can uncover the shared structure across multiple views of data. The correlations between the projections indicate the degree to which the algorithm has captured this shared structure, and the permutation test serves as a validation step.

Integration of Multiple Data Types via Dimensionality Reduction

SiMLR is particularly useful when working with datasets from different modalities (e.g., genomics, proteomics, and imaging data). By finding a joint embedding space, SiMLR can integrate these diverse data types into a unified representation, facilitating downstream analyses such as clustering, classification, or regression.

When dealing with high-dimensional data, dimensionality reduction techniques are often required to make the data more manageable and to avoid issues like the curse of dimensionality. SiMLR provides an approach that not only reduces dimensionality but also maintains the relationships between different views of the data, making it a powerful tool for exploratory data analysis and visualization. Below, we illustrate reading, writing and exploratory integrated visualization.

library(fpc)
#> Warning: package 'fpc' was built under R version 4.3.3
library(cluster)
#> Warning: package 'cluster' was built under R version 4.3.1
library(gridExtra)
library(ggpubr)
sim2nm=tempfile()
write_simlr_data_frames( result$v, sim2nm  )
simres2=read_simlr_data_frames( sim2nm, names(matlist) )
popdf = data.frame( age = outcome, cog=outcome1, mat1, mat2, mat3 )
temp2=apply_simlr_matrices( popdf, simres2, 
    center=TRUE, scale=TRUE, absolute_value=rep(TRUE,length(matlist)) )
simnames=temp2[[2]]
zz=exploratory_visualization( temp2[[1]][, temp2[[2]]], dotsne=FALSE )
#> Warning: package 'GGally' was built under R version 4.3.1
#> Registered S3 method overwritten by 'GGally':
#>   method from   
#>   +.gg   ggplot2
#> Warning: package 'ggdendro' was built under R version 4.3.1
#> Warning: package 'patchwork' was built under R version 4.3.3
print( zz$plot )

Now we can incorporate the “modalities” to predict the simulated outcome matrix. Note that contributions exist from each modality in the regression (in some cases). This is one of the advantages of SiMLR: it provides systematic guidance for building these types of integrative models.

# multi-view regression
summary( lm( age.1 ~ m1PC1 + m2PC1 + m3PC1, data=temp2[[1]] ))
#> 
#> Call:
#> lm(formula = age.1 ~ m1PC1 + m2PC1 + m3PC1, data = temp2[[1]])
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.59476 -0.49317 -0.05545  0.49108  1.84581 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   0.06104    0.07439   0.820 0.413993    
#> m1PC1       -10.18620    1.77660  -5.734 1.14e-07 ***
#> m2PC1         5.26367    1.36903   3.845 0.000217 ***
#> m3PC1         4.69026    1.92786   2.433 0.016830 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.7439 on 96 degrees of freedom
#> Multiple R-squared:  0.494,  Adjusted R-squared:  0.4782 
#> F-statistic: 31.24 on 3 and 96 DF,  p-value: 3.521e-14
summary( lm( age.2 ~ m1PC1 + m2PC1 + m3PC1, data=temp2[[1]] ))
#> 
#> Call:
#> lm(formula = age.2 ~ m1PC1 + m2PC1 + m3PC1, data = temp2[[1]])
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.73691 -0.25370  0.01167  0.24400  0.95020 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  -0.06212    0.03628  -1.712   0.0901 .  
#> m1PC1        15.30365    0.86639  17.664   <2e-16 ***
#> m2PC1        15.34258    0.66763  22.981   <2e-16 ***
#> m3PC1       -19.56383    0.94015 -20.809   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.3628 on 96 degrees of freedom
#> Multiple R-squared:  0.8587, Adjusted R-squared:  0.8543 
#> F-statistic: 194.5 on 3 and 96 DF,  p-value: < 2.2e-16
summary( lm( age.3 ~ m1PC1 + m2PC1 + m3PC1, data=temp2[[1]] ))
#> 
#> Call:
#> lm(formula = age.3 ~ m1PC1 + m2PC1 + m3PC1, data = temp2[[1]])
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.47369 -0.33511  0.02194  0.37138  1.13810 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  0.03258    0.05510   0.591 0.555648    
#> m1PC1        5.30441    1.31582   4.031 0.000111 ***
#> m2PC1        3.95065    1.01395   3.896 0.000181 ***
#> m3PC1        2.53717    1.42784   1.777 0.078747 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.551 on 96 degrees of freedom
#> Multiple R-squared:  0.7054, Adjusted R-squared:  0.6962 
#> F-statistic: 76.61 on 3 and 96 DF,  p-value: < 2.2e-16
summary( lm( age.4 ~ m1PC1 + m2PC1 + m3PC1, data=temp2[[1]] ))
#> 
#> Call:
#> lm(formula = age.4 ~ m1PC1 + m2PC1 + m3PC1, data = temp2[[1]])
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.44941 -0.50917 -0.07413  0.54017  1.85159 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  0.02380    0.07451   0.319    0.750    
#> m1PC1       -1.90664    1.77947  -1.071    0.287    
#> m2PC1       -7.19214    1.37124  -5.245 9.31e-07 ***
#> m3PC1       10.21246    1.93097   5.289 7.74e-07 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.7451 on 96 degrees of freedom
#> Multiple R-squared:  0.4509, Adjusted R-squared:  0.4338 
#> F-statistic: 26.28 on 3 and 96 DF,  p-value: 1.702e-12
summary( lm( age.5 ~ m1PC1 + m2PC1 + m3PC1, data=temp2[[1]] ))
#> 
#> Call:
#> lm(formula = age.5 ~ m1PC1 + m2PC1 + m3PC1, data = temp2[[1]])
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.28071 -0.43685 -0.04041  0.36609  1.42936 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  -0.08199    0.05960  -1.376   0.1721    
#> m1PC1        12.12119    1.42326   8.516 2.27e-13 ***
#> m2PC1        -2.03136    1.09675  -1.852   0.0671 .  
#> m3PC1       -11.86392    1.54443  -7.682 1.33e-11 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.596 on 96 degrees of freedom
#> Multiple R-squared:  0.6149, Adjusted R-squared:  0.6029 
#> F-statistic:  51.1 on 3 and 96 DF,  p-value: < 2.2e-16

Sparse Feature Selection

SiMLR incorporates sparse feature selection, allowing users to identify a minimal subset of features that contribute most to the joint embedding space. This can be particularly advantageous in settings where interpretability is crucial, such as biomarker discovery in biological datasets.

pp=plot_features( simres2 )
grid.arrange( grobs=pp, nrow=3, top='Joint features' )
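
To inspect the selected features directly, one can rank loadings by magnitude (an illustrative sketch; it assumes the data frames read back above carry feature rownames, as in the printed output later in this vignette):

# top features for the first component of view 1, by absolute loading
v1 <- data.matrix( simres2[[1]] )
head( sort( abs(v1[, 1]), decreasing = TRUE ) )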

Improved Model Interpretability

By projecting the data into a joint space, SiMLR helps improve the interpretability of machine learning models. The sparse feature selection ensures that only the most informative features are retained, making it easier to understand the relationships between the input data and the model’s predictions.

Cross-Modal Prediction or Imputation

In scenarios where you have multiple data modalities but missing information scattered across some of the views, SiMLR can be used to impute the missing data by leveraging the relationships between the available modalities. This cross-modal prediction capability is particularly useful in multi-omics studies and other fields where complete data is often unavailable.

popdfi=popdf
# randomly blank out 10% of all entries to simulate missing data
proportion <- 0.1
num_NAs <- ceiling(proportion * nrow(popdfi) * ncol(popdfi))
indices <- arrayInd(sample(1:(nrow(popdfi) * ncol(popdfi)), num_NAs), .dim = dim(popdfi))
popdfi[indices] <- NA
nms=names(simres2)   # modality names: m1, m2, m3
nsimx=3              # number of columns per modality to check and impute
sep='.'
imputedcols=c()
# walk the first nsimx columns of each modality; wherever a column has
# missing entries, impute it from the other modalities via the joint space
for ( n in nms ) {
  for ( v in 1:nsimx ) {
    thiscol=paste0(n,sep,v)
    if ( any( is.na( popdfi[,thiscol] ) )) {
      imputedcols=c(imputedcols,thiscol)
      popdfi = simlr_impute( popdfi, nms, v, n, separator=sep )
    }
  }
}
impresult=data.frame( og=popdf[,imputedcols[1]], imp=popdfi[,imputedcols[1]])
ggscatter( impresult, 'og', 'imp' ) + ggtitle(paste("Imputed vs original data",imputedcols[1]))
#> Warning: Removed 3 rows containing missing values or values outside the scale range
#> (`geom_point()`).

Testing Significance of Joint Relationships

To assess the significance of the SiMLR result versus permuted data, we use a metric (rvcoef) that the algorithm does not directly optimize. The RV coefficient is analogous to the Pearson correlation coefficient but compares two matrices (or datasets) rather than two variables: it measures the similarity of the column spaces (subspaces) spanned by the columns of the two matrices. We compute the rvcoef across all pairs of data at each permutation and compare to the original values.
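
For intuition, here is a minimal sketch of the RV coefficient for two column-centered matrices (ANTsR supplies rvcoef; this version is only illustrative):

rv_coefficient <- function(X, Y) {
  X <- scale(X, center = TRUE, scale = FALSE)  # column-center
  Y <- scale(Y, center = TRUE, scale = FALSE)
  xx <- tcrossprod(X)  # X %*% t(X)
  yy <- tcrossprod(Y)  # Y %*% t(Y)
  # trace(xx %*% yy) / sqrt( trace(xx %*% xx) * trace(yy %*% yy) )
  sum(xx * yy) / sqrt(sum(xx * xx) * sum(yy * yy))
}
# e.g., rv_coefficient(p1, p2)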

initu = initializeSimlr( matlist, 3, jointReduction = TRUE  )
np=40
myperm = simlr.perm(  matlist,
  constraint=constr,scale=prepro, 
  initialUMatrix=initu,
  sparsenessAlg = "spmp",
  nperms=np, FUN=rvcoef, verbose=TRUE )
#> [1] "Grassmann" "100"       "1"        
#> [1] "     <0> BUILD-V <0> BUILD-V <0> BUILD-V <0> BUILD-V <0>    "
#> [1] "initialDataTerm: -0.585842979107515  <o> mixer: svd  <o> E:  cca"
#> [1] "Iteration: 1 bestEv: -0.294506815668022 bestIt: 1 CE: -0.294506815668022 featOrth: 0"
#> [1] "Iteration: 2 bestEv: -0.294506815668022 bestIt: 1 CE: -0.204286868733419 featOrth: 0"
#> [1] "Iteration: 3 bestEv: -0.294506815668022 bestIt: 1 CE: -0.228665226994726 featOrth: 0"
#> [1] "Iteration: 4 bestEv: -0.294506815668022 bestIt: 1 CE: -0.176607747765278 featOrth: 0"
#> [1] "Iteration: 5 bestEv: -0.294506815668022 bestIt: 1 CE: -0.187276561577006 featOrth: 0"
#> [1] "Iteration: 6 bestEv: -0.294506815668022 bestIt: 1 CE: -0.0446094539775891 featOrth: 0"
#> [1] "Iteration: 7 bestEv: -0.294506815668022 bestIt: 1 CE: -0.0880621654628728 featOrth: 0"
print( tail( myperm$significance, 2 )  )
#>    n   perm        m1_m2        m1_m3        m2_m3
#> 42 3  ttest 3.786937e-62 6.649524e-57 8.428902e-53
#> 43 3 pvalue 0.000000e+00 0.000000e+00 0.000000e+00
gglist = list()
simlrmaps=colnames(myperm$significance)[-c(1:2)]
for ( xxx in simlrmaps ) {
    pvec=myperm$significance[ myperm$significance$perm %in% as.character(1:np),xxx]
    original_tstat <- myperm$significance[1,xxx] # the unpermuted statistic
    gglist[[length(gglist)+1]]=( visualize_permutation_test( pvec, original_tstat, xxx ) )
}
grid.arrange( grobs=gglist, top='Cross-modality relationships: permutation tested')

# display simlr results
gglist2=list()
for ( i in 1:(length(matlist)-1)) {
    for ( j in ((i+1):length(matlist))) {
        toviz1=data.matrix(simres2[[i]])
        toviz2=data.matrix(simres2[[j]])
        temp=visualize_lowrank_relationships( matlist[[i]], matlist[[j]], 
            toviz1, toviz2,
            nm1=names(matlist)[i], nm2=names(matlist)[j] )$plot
        gglist2[[length(gglist2)+1]]=temp
    }
}
grid.arrange(grobs=(gglist2),nrow=1, top='Low-rank correlations: SiMLR')

Evaluating Machine Learning Objective Functions

This section of the vignette provides an example of how to set up and run a parameter search with the simlr.search function. Below, csearch holds a list of constraint options that includes “none” as well as different ways of enforcing orthogonality in the feature space; this list is explored during the parameter search.

In this step, the simlr.parameters function is called to create a parameter list (simparms) that controls the search. Each parameter category (nsimlr_options, prescaling_options, and so on) is supplied as a list of options. These parameters specify the settings and constraints that will be explored during the optimization process.

  • nsimlr_options : The number of latent factors to consider.

  • prescaling_options : Methods for data scaling/preprocessing (robust, whiten, np).

  • objectiver_options : The objective function used (e.g., regression or cca).

  • mixer_options : The mixing technique (ica or pca).

  • sparval_options : Sparseness settings.

  • expBeta_options : Exponential smoothing parameter beta values (e.g., 0.0, 0.9, 0.99, 0.999).

  • positivities_options : Positivity constraints.

  • optimus_options : The optimization strategy (lineSearch, mixed, greedy).

  • constraint_options : Feature space orthogonality constraints to apply.

  • search_type : The type of search to perform (full indicates a full grid search; random, used below, samples num_samples parameter combinations).

  • num_samples : The number of samples to evaluate during the search.

paster <- function(vec1, vec2) {
  paste0(rep(vec1, each = length(vec2)), vec2)
}
csearch = as.list( c("none", 
  paster( c("Grassmannx","Stiefelx","orthox"), 
    c("10x10","100x1","100x1","1x1"))))
# the 10x10 are weights on the optimization term of the energy
simparms = simlr.parameters( 
    nsimlr_options = list( 2 ),
    prescaling_options = list( 
            c( "robust", "centerAndScale",  "np") ),
    objectiver_options = list("regression",'cca'),
    mixer_options=list("ica",'pca'), 
    sparval_options=list(  c(0.5,0.5,0.5) ),
    expBeta_options = list( c(0.99) ),
    positivities_options = list(
            c("positive","positive","positive")),
    optimus_options=list( 'lineSearch'),
    constraint_options=csearch,
    sparsenessAlg = "spmp",
    search_type="random",
    num_samples=5
    )

The regularizeSimlr function is used to apply regularization to the input matrices matlist. Regularization can help prevent overfitting by penalizing complexity. The fraction and sigma parameters control the extent and type of regularization.

regs = regularizeSimlr(matlist, fraction=0.05, 
  sigma=rep(1.5,length(matlist)))
plot( image( regs[[1]] ) )

plot( image( regs[[2]] ) )

The simlr.search function performs the search over the parameter space defined in simparms. It takes the data matrices matlist and the regularization matrices regs as input. nperms specifies the number of permutations used to assess robustness, and the verbose parameter controls the level of detail printed during the search.

simlrXs=simlr.search(
        matlist,
        regs=regs,
        simparms,
        nperms=12,
        verbose=4
        )
#> Will search: 5 parameter sets[1] "robust"         "centerAndScale" "np"            
#> [1] "cca"
#> [1] "pca"
#> [1] "Stiefelx10x10"
#> [1] 0.5 0.5 0.5
#> [1] 0.99
#> [1] "positive" "positive" "positive"
#> [1] "lineSearch"
#> [1] "Stiefel" "10"      "10"     
#> [1] "     <0> BUILD-V <0> BUILD-V <0> BUILD-V <0> BUILD-V <0>    "
#> [1] "initialDataTerm: -0.464340360769418  <o> mixer: pca  <o> E:  cca"
#> [1] "Iteration: 1 bestEv: -0.438182541354338 bestIt: 1 CE: -0.438182541354338 featOrth: 0"
#> [1] "Iteration: 2 bestEv: -0.601628557287905 bestIt: 2 CE: -0.601628557287905 featOrth: 0"
#> [1] "Iteration: 3 bestEv: -0.674005218977676 bestIt: 3 CE: -0.674005218977676 featOrth: 0"
#> [1] "Iteration: 4 bestEv: -0.682480976923719 bestIt: 4 CE: -0.68248097692372 featOrth: 0"
#> [1] "Iteration: 5 bestEv: -0.682480976923719 bestIt: 4 CE: -0.68248097692372 featOrth: 0"
#> [1] "Iteration: 6 bestEv: -0.706018196521312 bestIt: 6 CE: -0.706018196521312 featOrth: 0"
#> [1] "Iteration: 7 bestEv: -0.709594575141912 bestIt: 7 CE: -0.709594575141912 featOrth: 0"
#> [1] "Iteration: 8 bestEv: -0.709594575141912 bestIt: 7 CE: -0.709594575141912 featOrth: 0"
#> [1] "Iteration: 9 bestEv: -0.709594575141912 bestIt: 7 CE: -0.709594575141912 featOrth: 0"
#> [1] "Iteration: 10 bestEv: -0.709594575141912 bestIt: 7 CE: -0.709594575141912 featOrth: 0"
#> [1] "Iteration: 11 bestEv: -0.709594575141912 bestIt: 7 CE: -0.709594575141912 featOrth: 0"
#> [1] "Iteration: 12 bestEv: -0.709594575141912 bestIt: 7 CE: -0.709594575141912 featOrth: 0"
#> [1] "Iteration: 13 bestEv: -0.709594575141912 bestIt: 7 CE: -0.709594575141912 featOrth: 0"
#> [1] "robust"         "centerAndScale" "np"            
#> [1] "regression"
#> [1] "ica"
#> [1] "Grassmannx1x1"
#> [1] 0.5 0.5 0.5
#> [1] 0.99
#> [1] "positive" "positive" "positive"
#> [1] "lineSearch"
#> [1] "Grassmann" "1"         "1"        
#> [1] "     <0> BUILD-V <0> BUILD-V <0> BUILD-V <0> BUILD-V <0>    "
#> [1] "initialDataTerm: 0.126808680550222  <o> mixer: ica  <o> E:  regression"
#> [1] "Iteration: 1 bestEv: 0.171625403530076 bestIt: 1 CE: 0.171625403530076 featOrth: 0"
#> [1] "Iteration: 2 bestEv: 0.171625403530076 bestIt: 1 CE: 0.182394596894939 featOrth: 0"
#> [1] "Iteration: 3 bestEv: 0.171625403530076 bestIt: 1 CE: 0.190171497886056 featOrth: 0"
#> [1] "Iteration: 4 bestEv: 0.171625403530076 bestIt: 1 CE: 0.202462156166499 featOrth: 0"
#> [1] "Iteration: 5 bestEv: 0.171625403530076 bestIt: 1 CE: 0.184660134693779 featOrth: 0"
#> [1] "Iteration: 6 bestEv: 0.171625403530076 bestIt: 1 CE: 0.224349921295047 featOrth: 0"
#> [1] "Iteration: 7 bestEv: 0.171625403530076 bestIt: 1 CE: 0.204913595796791 featOrth: 0"
#> [1] "improvement"
#>   nsimlr objectiver mixer ebber    optimus    constraint final_energy
#> 1      2 regression   ica  0.99 lineSearch Grassmannx1x1     18.75593
#>   prescaling1    prescaling2 prescaling3 sparval1 sparval2 sparval3 positivity1
#> 1      robust centerAndScale          np      0.5      0.5      0.5    positive
#>   positivity2 positivity3 perm     m1_m2     m1_m3     m2_m3
#> 1    positive    positive    0 0.1649625 0.2645936 0.3598185
#>              PC1        PC2
#> m3.1 0.000000000 0.03612574
#> m3.2 0.000000000 0.01702356
#> m3.3 0.000000000 0.03471395
#> m3.4 0.000000000 0.02619322
#> m3.5 0.009039381 0.00000000
#> m3.6 0.000000000 0.03516955
#> [1] "robust"         "centerAndScale" "np"            
#> [1] "cca"
#> [1] "pca"
#> [1] "Stiefelx1x1"
#> [1] 0.5 0.5 0.5
#> [1] 0.99
#> [1] "positive" "positive" "positive"
#> [1] "lineSearch"
#> [1] "Stiefel" "1"       "1"      
#> [1] "     <0> BUILD-V <0> BUILD-V <0> BUILD-V <0> BUILD-V <0>    "
#> [1] "initialDataTerm: -0.464340360769418  <o> mixer: pca  <o> E:  cca"
#> [1] "Iteration: 1 bestEv: -0.438182541354338 bestIt: 1 CE: -0.438182541354338 featOrth: 0"
#> [1] "Iteration: 2 bestEv: -0.601628557287905 bestIt: 2 CE: -0.601628557287905 featOrth: 0"
#> [1] "Iteration: 3 bestEv: -0.674005218977676 bestIt: 3 CE: -0.674005218977676 featOrth: 0"
#> [1] "Iteration: 4 bestEv: -0.682480976923719 bestIt: 4 CE: -0.68248097692372 featOrth: 0"
#> [1] "Iteration: 5 bestEv: -0.682480976923719 bestIt: 4 CE: -0.68248097692372 featOrth: 0"
#> [1] "Iteration: 6 bestEv: -0.706018196521312 bestIt: 6 CE: -0.706018196521312 featOrth: 0"
#> [1] "Iteration: 7 bestEv: -0.709594575141912 bestIt: 7 CE: -0.709594575141912 featOrth: 0"
#> [1] "Iteration: 8 bestEv: -0.709594575141912 bestIt: 7 CE: -0.709594575141912 featOrth: 0"
#> [1] "Iteration: 9 bestEv: -0.709594575141912 bestIt: 7 CE: -0.709594575141912 featOrth: 0"
#> [1] "Iteration: 10 bestEv: -0.709594575141912 bestIt: 7 CE: -0.709594575141912 featOrth: 0"
#> [1] "Iteration: 11 bestEv: -0.709594575141912 bestIt: 7 CE: -0.709594575141912 featOrth: 0"
#> [1] "Iteration: 12 bestEv: -0.709594575141912 bestIt: 7 CE: -0.709594575141912 featOrth: 0"
#> [1] "Iteration: 13 bestEv: -0.709594575141912 bestIt: 7 CE: -0.709594575141912 featOrth: 0"
#> [1] "robust"         "centerAndScale" "np"            
#> [1] "cca"
#> [1] "ica"
#> [1] "none"
#> [1] 0.5 0.5 0.5
#> [1] 0.99
#> [1] "positive" "positive" "positive"
#> [1] "lineSearch"
#> [1] "none" NA     NA    
#> [1] "     <0> BUILD-V <0> BUILD-V <0> BUILD-V <0> BUILD-V <0>    "
#> [1] "initialDataTerm: -0.464340360769418  <o> mixer: ica  <o> E:  cca"
#> [1] "Iteration: 1 bestEv: -0.494812989573843 bestIt: 1 CE: -0.494812989573843 featOrth: 0"
#> [1] "Iteration: 2 bestEv: -0.494812989573843 bestIt: 1 CE: -0.494812989573843 featOrth: 0"
#> [1] "Iteration: 3 bestEv: -0.494812989573843 bestIt: 1 CE: -0.221795951739896 featOrth: 0"
#> [1] "Iteration: 4 bestEv: -0.494812989573843 bestIt: 1 CE: -0.427229030777649 featOrth: 0"
#> [1] "Iteration: 5 bestEv: -0.494812989573843 bestIt: 1 CE: -0.427229030777649 featOrth: 0"
#> [1] "Iteration: 6 bestEv: -0.494812989573843 bestIt: 1 CE: -0.367095898803618 featOrth: 0"
#> [1] "Iteration: 7 bestEv: -0.494812989573843 bestIt: 1 CE: -0.476318599472209 featOrth: 0"
#> [1] "robust"         "centerAndScale" "np"            
#> [1] "regression"
#> [1] "ica"
#> [1] "orthox1x1"
#> [1] 0.5 0.5 0.5
#> [1] 0.99
#> [1] "positive" "positive" "positive"
#> [1] "lineSearch"
#> [1] "ortho" "1"     "1"    
#> [1] "     <0> BUILD-V <0> BUILD-V <0> BUILD-V <0> BUILD-V <0>    "
#> [1] "initialDataTerm: 0.126808680550222  <o> mixer: ica  <o> E:  regression"
#> [1] "Iteration: 1 bestEv: 0.18000033728193 bestIt: 1 CE: 0.18000033728193 featOrth: 0"
#> [1] "Iteration: 2 bestEv: 0.18000033728193 bestIt: 1 CE: 0.187678302906652 featOrth: 0"
#> [1] "Iteration: 3 bestEv: 0.18000033728193 bestIt: 1 CE: 0.18817971777985 featOrth: 0"
#> [1] "Iteration: 4 bestEv: 0.18000033728193 bestIt: 1 CE: 0.190986836460099 featOrth: 0"
#> [1] "Iteration: 5 bestEv: 0.18000033728193 bestIt: 1 CE: 0.212805019313083 featOrth: 0"
#> [1] "Iteration: 6 bestEv: 0.18000033728193 bestIt: 1 CE: 0.201284261815092 featOrth: 0"
#> [1] "Iteration: 7 bestEv: 0.18000033728193 bestIt: 1 CE: 0.221347012429854 featOrth: 0"
#>   nsimlr objectiver mixer ebber    optimus    constraint final_energy
#> 2      2 regression   ica  0.99 lineSearch Grassmannx1x1     18.75593
#>   prescaling1    prescaling2 prescaling3 sparval1 sparval2 sparval3 positivity1
#> 2      robust centerAndScale          np      0.5      0.5      0.5    positive
#>   positivity2 positivity3 perm     m1_m2     m1_m3     m2_m3
#> 2    positive    positive    0 0.1649625 0.2645936 0.3598185
#> el finito
simlrXs$paramsearch
#>   nsimlr objectiver mixer ebber    optimus    constraint final_energy
#> 1      2        cca   pca  0.99 lineSearch Stiefelx10x10     17.46030
#> 2      2 regression   ica  0.99 lineSearch Grassmannx1x1     18.75593
#> 3      2        cca   pca  0.99 lineSearch   Stiefelx1x1     17.46030
#> 4      2        cca   ica  0.99 lineSearch          none     12.18170
#> 5      2 regression   ica  0.99 lineSearch     orthox1x1     17.19227
#>   prescaling1    prescaling2 prescaling3 sparval1 sparval2 sparval3 positivity1
#> 1      robust centerAndScale          np      0.5      0.5      0.5    positive
#> 2      robust centerAndScale          np      0.5      0.5      0.5    positive
#> 3      robust centerAndScale          np      0.5      0.5      0.5    positive
#> 4      robust centerAndScale          np      0.5      0.5      0.5    positive
#> 5      robust centerAndScale          np      0.5      0.5      0.5    positive
#>   positivity2 positivity3 perm     m1_m2     m1_m3      m2_m3
#> 1    positive    positive    0 0.4243037 0.1969152 0.42325627
#> 2    positive    positive    0 0.1649625 0.2645936 0.35981851
#> 3    positive    positive    0 0.4243037 0.1969152 0.42325627
#> 4    positive    positive    0 0.2918459 0.2703616 0.04986623
#> 5    positive    positive    0 0.2304250 0.2483959 0.24320984

Finally, simlrXs$paramsearch retrieves the results of the parameter search, showing the performance of the different parameter combinations and helping to identify optimal settings for the SiMLR algorithm. The “best” result here has the highest value of final_energy; its parameters are stored in the parameters slot.
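
A quick way to pull out the winning configuration from the data frame printed above:

# row of the parameter search with the highest final_energy
ps <- simlrXs$paramsearch
ps[ which.max(ps$final_energy), ]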

This section of the vignette demonstrates how to set up a comprehensive parameter search for the SiMLR algorithm. It shows how to create custom constraints, define a parameter grid, regularize the model, and execute the search. The result is an optimized set of parameters that can be used to enhance the performance of the SiMLR model on your data.

Path modeling

SiMLR has a path-modeling variant that lets you build the graph of connections however you like. Call the matrices A, B and C. If you build A=>A, B=>B, C=>C, then you will learn within-modality feature sets. If you build A=>(B,C), B=>A, C=>A, then you will “focus” on A. The default graph is A=>(B,C), B=>(A,C), C=>(A,B).
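
A hedged sketch of the “focus on A” graph follows. It assumes the connectors argument of simlr takes a list with one integer vector per modality, naming the modalities each one should predict; check the documentation of your installed version before relying on this.

# focus on A = matlist[[1]]: A is paired with B and C,
# while B and C are each paired only with A
focusGraph <- list( c(2, 3), c(1), c(1) )
resultPath <- simlr( matlist, constraint = constr, scale = prepro,
  mixAlg = 'ica', energyType = 'regression', sparsenessAlg = "spmp",
  connectors = focusGraph )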

Summary

We provided a quick overview of potential uses of the SiMLR framework, along with visualization hints and standards that may be of use in scientific applications of these methods.