diff --git a/docs/articles/iceqream.html b/docs/articles/iceqream.html index fb4d0f5..37d5df8 100644 --- a/docs/articles/iceqream.html +++ b/docs/articles/iceqream.html @@ -6,7 +6,7 @@ -
See below for minimal instructions on how to use IceQream on your own data.
+For this tutorial, we'll use pre-prepared data from a mouse gastrulation trajectory. Let's download and load this data:
download.file("https://iceqream.s3.eu-west-1.amazonaws.com/gastrulation-example.tar.gz", "gastrulation-example.tar.gz")
untar("gastrulation-example.tar.gz")
We loaded the following objects:
-peak_intervals
: A data frame with genomic positions of peaks.Let's examine the structure of our input data:
+
+# Peak intervals
+print("Peak Intervals:")
+#> [1] "Peak Intervals:"
head(peak_intervals)
#> # A tibble: 6 × 6
@@ -129,12 +131,13 @@ Download the example files#> 5 chr1 3191760 3192060 1_Xkr4_3192 FALSE 23571
#> 6 chr1 3264000 3264300 1_Xkr4_3264 FALSE -48368
-nrow(peak_intervals)
-#> [1] 99291
atac_scores
: A matrix with ATAC scores for each peak at the different bins of the trajectory. We will regress the last bin (bin4) vs the first bin (bin1).print(paste("Number of peaks:", nrow(peak_intervals)))
+#> [1] "Number of peaks: 99291"
+
+
+# ATAC scores
+print("\nATAC Scores:")
+#> [1] "\nATAC Scores:"
head(atac_scores)
#> bin1 bin2 bin3 bin4
@@ -145,12 +148,16 @@ Download the example files#> 1_Xkr4_3192 0.234917748 0.159842916 0.24823286 0.40749248
#> 1_Xkr4_3264 0.057764122 0.079473969 0.05351386 0.05018866
-nrow(atac_scores)
-#> [1] 99291
additional_features
: A data frame with additional features for each peak. This is optional.print(paste("Number of peaks:", nrow(atac_scores)))
+#> [1] "Number of peaks: 99291"
+
+
+
+# Additional features
+print("\nAdditional Features:")
+#> [1] "\nAdditional Features:"
head(additional_features)
#> cg_cont k4me3 k27me3 k27ac prox_bin1_punc_all spatial_ratio
@@ -182,39 +189,36 @@ Download the example files#> 1_Xkr4_3192 4.054054 6.2500
#> 1_Xkr4_3264 3.783784 3.4375
-nrow(additional_features)
-#> [1] 99291
In addition, we loaded a set of genomic intervals that will be used for motif energy normalization. These intervals are all the peaks including the ones that are close to TSSs.
+print(paste("Number of peaks:", nrow(additional_features)))
+#> [1] "Number of peaks: 99291"
+
+The peak_intervals
dataframe contains the genomic positions of accessibility peaks. The atac_scores
matrix contains ATAC-seq signal intensities for each peak across different stages of the trajectory. additional_features
includes extra genomic features for each peak.
The first step in the IceQream pipeline is to compute motif energies for each motif model and each peak. This process is computationally intensive as we are computing the energy for 21862 motifs collected from HOMER, JASPAR, Jolma et al., HOCOMOCO and SCENIC databases.
- -Calculation is done by:
-
-# motif_energies <- compute_motif_energies(peak_intervals, motif_db, normalization_intervals = normalization_intervals)
If you want to start with a less computationally intensive example, you can use the the scenic clusters (1615) instead of the full motif database. This will still take ~10-15 minutes, and the performance will be significantly lower.
+The first step in the IceQream pipeline is to compute motif energies for each transcription factor model and each peak. This process is computationally intensive, as it calculates energies for over 20,000 motifs from various databases.
-# motif_energies <- compute_motif_energies(peak_intervals, motif_db_scenic_clusters, normalization_intervals = normalization_intervals)
In any case, we will load the precomputed motif energies for this example. Note that the full matrix requires ~23GB of RAM.
+motif_energies <- compute_motif_energies(peak_intervals, motif_db, normalization_intervals = normalization_intervals)
For this tutorial, we'll use pre-computed motif energies:
download.file("https://iceqream.s3.eu-west-1.amazonaws.com/gastrulation_energies.tar.gz", "gastrulation_energies.tar.gz")
untar("gastrulation_energies.tar.gz")
Load the motif energies:
motif_energies <- readr::read_rds("gastrulation_energies.rds")
motif_energies <- motif_energies[peak_intervals$peak_name, ]
-dim(motif_energies)
-#> [1] 99291 21862
For a less memory and computationally intensive analysis, you can use the SCENIC motif clusters instead of all motifs (See Performance Considerations section).
-We will now run IceQream on the gastrulation trajectory:
+Now we're ready to run the IceQream regression:
+This takes ~25 minutes on a 40 core machine with 256GB of RAM.
traj_model <- iq_regression(
peak_intervals = peak_intervals,
@@ -386,9 +390,9 @@ Run IceQream#>
#> Run `plot_traj_model_report(traj_model)` to visualize the model features
#> Run `plot_prediction_scatter(traj_model)` to visualize the model predictions
The result is a TrajectoryModel
object that contains the regression results and the model report:
Let's examine the output:
-traj_model
+print(traj_model)
#> <TrajectoryModel> with 21 motifs and 22 additional features
#>
#> Slots include:
@@ -432,141 +436,378 @@ Run IceQream#>
#> R^2 train: 0.304
#> R^2 test: 0.305
The TrajectoryModel
object contains components such as the regression model, motif models, and predicted accessibility scores.
We can plot the performance of the model:
+Let's start with a scatter plot of observed vs. predicted accessibility changes:
plot_prediction_scatter(traj_model)
And the model report:
-
-plot_traj_model_report(traj_model)
-#> ℹ Plotting 21 motifs
-#> Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
-#> of ggplot2 3.3.4.
-#> ℹ The deprecated feature was likely used in the ggseqlogo package.
-#> Please report the issue at <https://github.com/omarwagih/ggseqlogo/issues>.
-#> This warning is displayed once every 8 hours.
-#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
-#> generated.
-#> ℹ Plotting motif "SCENIC.cisbp__M00835"
-#> ℹ Plotting motif "JASPAR.SOX14"
-#> ℹ Plotting motif "SCENIC.taipale_cyt_meth__ESX1_YYAATTAN_eDBD"
-#> ℹ Plotting motif "JASPAR.nub"
-#> ℹ Plotting motif "SCENIC.taipale_cyt_meth__TCF7L1_ASATCAAAS_eDBD_meth_repr"
-#> ℹ Plotting motif "SCENIC.predrem__nrMotif1342"
-#> ℹ Plotting motif "SCENIC.swissregulon__mm__Gata1"
-#> ℹ Plotting motif "SCENIC.hocomoco__ZIC2_MOUSE.H11MO.0.C"
-#> ℹ Plotting motif "SCENIC.swissregulon__hs__ID4"
-#> ℹ Plotting motif "SCENIC.hocomoco__SMAD4_MOUSE.H11MO.0.A"
-#> ℹ Plotting motif "SCENIC.dbtfbs__mm_MYOG_myocyte_ENCSR000AIW_merged_N1"
-#> ℹ Plotting motif "SCENIC.jaspar__MA0869.2"
-#> ℹ Plotting motif "HOMER.PQM_1"
-#> ℹ Plotting motif "SCENIC.cisbp__M02320"
-#> ℹ Plotting motif "HOCOMOCO.SUH_MOUSE.H11MO.0.A"
-#> ℹ Plotting motif "SCENIC.jaspar__MA1621.1"
-#> ℹ Plotting motif "SCENIC.taipale_cyt_meth__TFAP4_ANCATATGNT_eDBD_meth"
-#> ℹ Plotting motif "SCENIC.taipale_cyt_meth__TGIF2_NTGACAGN_FL"
-#> ℹ Plotting motif "JOLMA.FIGLA_di_DBD"
-#> ℹ Plotting motif "SCENIC.taipale_tf_pairs__HOXB2_PITX1_TAATKRNGGATTA_CAP_repr"
-#> ℹ Plotting motif "HOMER.TATA_box_1"
-#> Warning in ks.test.default(plot_df$observed[plot_df$energy == "0-3"],
-#> plot_df$observed[plot_df$energy == : p-value will be approximate in the
-#> presence of ties
-#> Warning in ks.test.default(plot_df$observed[plot_df$energy == "0-3"],
-#> plot_df$observed[plot_df$energy == : p-value will be approximate in the
-#> presence of ties
-#> Warning in ks.test.default(plot_df$observed[plot_df$energy == "0-3"],
-#> plot_df$observed[plot_df$energy == : p-value will be approximate in the
-#> presence of ties
-#> Warning in ks.test.default(plot_df$observed[plot_df$energy == "0-3"],
-#> plot_df$observed[plot_df$energy == : p-value will be approximate in the
-#> presence of ties
-#> Warning in ks.test.default(plot_df$observed[plot_df$energy == "0-3"],
-#> plot_df$observed[plot_df$energy == : p-value will be approximate in the
-#> presence of ties
-#> Warning in ks.test.default(plot_df$observed[plot_df$energy == "0-3"],
-#> plot_df$observed[plot_df$energy == : p-value will be approximate in the
-#> presence of ties
-#> Warning in ks.test.default(plot_df$observed[plot_df$energy == "0-3"],
-#> plot_df$observed[plot_df$energy == : p-value will be approximate in the
-#> presence of ties
-#> Warning in ks.test.default(plot_df$observed[plot_df$energy == "0-3"],
-#> plot_df$observed[plot_df$energy == : p-value will be approximate in the
-#> presence of ties
-#> Warning in ks.test.default(plot_df$observed[plot_df$energy == "0-3"],
-#> plot_df$observed[plot_df$energy == : p-value will be approximate in the
-#> presence of ties
-#> Warning in ks.test.default(plot_df$observed[plot_df$energy == "0-3"],
-#> plot_df$observed[plot_df$energy == : p-value will be approximate in the
-#> presence of ties
-#> Warning in ks.test.default(plot_df$observed[plot_df$energy == "0-3"],
-#> plot_df$observed[plot_df$energy == : p-value will be approximate in the
-#> presence of ties
-#> Warning in ks.test.default(plot_df$observed[plot_df$energy == "0-3"],
-#> plot_df$observed[plot_df$energy == : p-value will be approximate in the
-#> presence of ties
-#> Warning in ks.test.default(plot_df$observed[plot_df$energy == "0-3"],
-#> plot_df$observed[plot_df$energy == : p-value will be approximate in the
-#> presence of ties
-#> Warning in ks.test.default(plot_df$observed[plot_df$energy == "0-3"],
-#> plot_df$observed[plot_df$energy == : p-value will be approximate in the
-#> presence of ties
-#> Warning in ks.test.default(plot_df$observed[plot_df$energy == "0-3"],
-#> plot_df$observed[plot_df$energy == : p-value will be approximate in the
-#> presence of ties
-#> Warning in ks.test.default(plot_df$observed[plot_df$energy == "0-3"],
-#> plot_df$observed[plot_df$energy == : p-value will be approximate in the
-#> presence of ties
-#> Warning in ks.test.default(plot_df$observed[plot_df$energy == "0-3"],
-#> plot_df$observed[plot_df$energy == : p-value will be approximate in the
-#> presence of ties
-#> Warning: Removed 46 rows containing missing values or values outside the scale range
-#> (`geom_line()`).
-#> Removed 46 rows containing missing values or values outside the scale range
-#> (`geom_line()`).
-#> Removed 46 rows containing missing values or values outside the scale range
-#> (`geom_line()`).
-#> Removed 46 rows containing missing values or values outside the scale range
-#> (`geom_line()`).
-#> Removed 46 rows containing missing values or values outside the scale range
-#> (`geom_line()`).
-#> Removed 46 rows containing missing values or values outside the scale range
-#> (`geom_line()`).
-#> Removed 46 rows containing missing values or values outside the scale range
-#> (`geom_line()`).
-#> Removed 46 rows containing missing values or values outside the scale range
-#> (`geom_line()`).
-#> Removed 46 rows containing missing values or values outside the scale range
-#> (`geom_line()`).
-#> Removed 46 rows containing missing values or values outside the scale range
-#> (`geom_line()`).
-#> Removed 46 rows containing missing values or values outside the scale range
-#> (`geom_line()`).
-#> Removed 46 rows containing missing values or values outside the scale range
-#> (`geom_line()`).
-#> Removed 46 rows containing missing values or values outside the scale range
-#> (`geom_line()`).
-#> Removed 46 rows containing missing values or values outside the scale range
-#> (`geom_line()`).
-#> Removed 46 rows containing missing values or values outside the scale range
-#> (`geom_line()`).
-#> Removed 46 rows containing missing values or values outside the scale range
-#> (`geom_line()`).
-#> Removed 46 rows containing missing values or values outside the scale range
-#> (`geom_line()`).
-#> Removed 46 rows containing missing values or values outside the scale range
-#> (`geom_line()`).
-#> Removed 46 rows containing missing values or values outside the scale range
-#> (`geom_line()`).
-#> Warning: Removed 42 rows containing missing values or values outside the scale range
-#> (`geom_line()`).
-#> Warning: Removed 46 rows containing missing values or values outside the scale range
-#> (`geom_line()`).
This plot shows how well our model predicts accessibility changes. Points closer to the diagonal line indicate better predictions. We measure the accuracy of the model using the coefficient of determination (R^2).
+Next, let's look at the model report, which provides detailed information about the motifs and their contributions:
+
+plot_traj_model_report(traj_model, filename = "model_report.pdf")
+knitr::include_graphics("model_report.pdf")
The model report provides several key pieces of information (from left to right):
+You can rename the motif models to more informative names, either manually using rename_motif_models
or automatically using match_traj_model_motif_names
:
+names_map <- match_traj_model_motif_names(traj_model)
+#> ℹ Matching motif names, note that this might take a while.
+#> → Matching JASPAR.nub
+#> → Matched with "HOMER.Oct4", PSSM correlation = 0.898696170118534
+#> → Matching SCENIC.predrem__nrMotif1342
+#> → Matched with "HOMER.Atoh1", PSSM correlation = 0.74898042552435
+#> → Matching SCENIC.cisbp__M00835
+#> → Matched with "HOMER.Tbx5", PSSM correlation = 0.904736231501015
+#> → Matching SCENIC.swissregulon__hs__ID4
+#> → Matched with "HOMER.E2A_1", PSSM correlation = 0.867727837996119
+#> → Matching SCENIC.taipale_cyt_meth__ESX1_YYAATTAN_eDBD
+#> → Matched with "HOMER.Isl1", PSSM correlation = 0.869760207503598
+#> → Matching SCENIC.jaspar__MA1621.1
+#> → Matched with "HOMER.E2A_1", PSSM correlation = 0.918354413557226
+#> → Matching JOLMA.FIGLA_di_DBD
+#> → Matched with "HOMER.MyoG", PSSM correlation = 0.721047792856512
+#> → Matching SCENIC.taipale_cyt_meth__TFAP4_ANCATATGNT_eDBD_meth
+#> → Matched with "HOMER.Atoh1", PSSM correlation = 0.669100601091648
+#> → Matching SCENIC.swissregulon__mm__Gata1
+#> → Matched with "HOMER.GATA3_2", PSSM correlation = 0.95177409403423
+#> → Matching SCENIC.taipale_cyt_meth__TGIF2_NTGACAGN_FL
+#> → Matched with "HOMER.Meis1", PSSM correlation = 0.733098848365572
+#> → Matching HOCOMOCO.SUH_MOUSE.H11MO.0.A
+#> → Matched with "HOMER.E2F6", PSSM correlation = 0.657723158470179
+#> → Matching SCENIC.jaspar__MA0869.2
+#> → Matched with "HOMER.Pit1", PSSM correlation = 0.532601141276059
+#> → Matching HOMER.PQM_1
+#> → Matched with "HOMER.PQM_1", PSSM correlation = 0.94889261728701
+#> → Matching SCENIC.hocomoco__SMAD4_MOUSE.H11MO.0.A
+#> → Matched with "HOMER.TboxSmad", PSSM correlation = 0.971715050385645
+#> → Matching SCENIC.cisbp__M02320
+#> → Matched with "HOMER.GATA3_2", PSSM correlation = 0.802823387218573
+#> → Matching SCENIC.hocomoco__ZIC2_MOUSE.H11MO.0.C
+#> → Matched with "HOMER.E_box_2", PSSM correlation = 0.710572263191446
+#> → Matching JASPAR.SOX14
+#> → Matched with "HOMER.Tcf3", PSSM correlation = 0.67164203806063
+#> → Matching SCENIC.taipale_cyt_meth__TCF7L1_ASATCAAAS_eDBD_meth_repr
+#> → Matched with "HOMER.Tcf4", PSSM correlation = 0.950211584477428
+#> → Matching SCENIC.dbtfbs__mm_MYOG_myocyte_ENCSR000AIW_merged_N1
+#> → Matched with "HOMER.Atoh1", PSSM correlation = 0.886109683951447
+#> → Matching HOMER.TATA_box_1
+#> → Matched with "HOMER.TATA_box", PSSM correlation = 0.867728627783548
+#> → Matching SCENIC.taipale_tf_pairs__HOXB2_PITX1_TAATKRNGGATTA_CAP_repr
+#> → Matched with "HOMER.Lhx2", PSSM correlation = 0.525391618049386
+names_map
+#> HOCOMOCO.SUH_MOUSE.H11MO.0.A
+#> "E2F6"
+#> HOMER.PQM_1
+#> "PQM_1"
+#> HOMER.TATA_box_1
+#> "TATA_box"
+#> JASPAR.SOX14
+#> "Tcf3"
+#> JASPAR.nub
+#> "Oct4"
+#> JOLMA.FIGLA_di_DBD
+#> "MyoG"
+#> SCENIC.cisbp__M00835
+#> "Tbx5"
+#> SCENIC.cisbp__M02320
+#> "GATA3_2"
+#> SCENIC.dbtfbs__mm_MYOG_myocyte_ENCSR000AIW_merged_N1
+#> "Atoh1"
+#> SCENIC.hocomoco__SMAD4_MOUSE.H11MO.0.A
+#> "TboxSmad"
+#> SCENIC.hocomoco__ZIC2_MOUSE.H11MO.0.C
+#> "E_box_2"
+#> SCENIC.jaspar__MA0869.2
+#> "Pit1"
+#> SCENIC.jaspar__MA1621.1
+#> "E2A_1"
+#> SCENIC.predrem__nrMotif1342
+#> "Atoh1.2"
+#> SCENIC.swissregulon__hs__ID4
+#> "E2A_1.2"
+#> SCENIC.swissregulon__mm__Gata1
+#> "GATA3_2.2"
+#> SCENIC.taipale_cyt_meth__ESX1_YYAATTAN_eDBD
+#> "Isl1"
+#> SCENIC.taipale_cyt_meth__TCF7L1_ASATCAAAS_eDBD_meth_repr
+#> "Tcf4"
+#> SCENIC.taipale_cyt_meth__TFAP4_ANCATATGNT_eDBD_meth
+#> "Atoh1.3"
+#> SCENIC.taipale_cyt_meth__TGIF2_NTGACAGN_FL
+#> "Meis1"
+#> SCENIC.taipale_tf_pairs__HOXB2_PITX1_TAATKRNGGATTA_CAP_repr
+#> "Lhx2"
+traj_model <- rename_motif_models(traj_model, names_map)
+#> Warning in eval(family$initialize): non-integer #successes in a binomial glm!
+#> Warning: glmnet.fit: algorithm did not converge
You can export the minimal model representation to a list of PBM in order to use infer it on new data:
+
+pbm_list <- traj_model_to_pbm_list(traj_model)
+#> ℹ Computing motif energies for 21 motifs on 129963 normalization intervals
+pbm_list
+#> $Oct4
+#> a <PBM> object named "Oct4" with 15 positions (`@pssm`)
+#> Energy normalization max = -22.8997500001511 (`@max_energy`)
+#> Spatial distribution with 149 spatial factors, from position 1 to 299 (298 bp)
+#> (`@spat`)
+#> Includes a model with 4 coefficients ("high-energy", "higher-energy",
+#> "low-energy", and "sigmoid") (`@coefs`)
+#>
+#> $Atoh1.2
+#> a <PBM> object named "Atoh1.2" with 15 positions (`@pssm`)
+#> Energy normalization max = -22.7175598349943 (`@max_energy`)
+#> Spatial distribution with 149 spatial factors, from position 1 to 299 (298 bp)
+#> (`@spat`)
+#> Includes a model with 4 coefficients ("high-energy", "higher-energy",
+#> "low-energy", and "sigmoid") (`@coefs`)
+#>
+#> $Tbx5
+#> a <PBM> object named "Tbx5" with 15 positions (`@pssm`)
+#> Energy normalization max = -23.7724491326609 (`@max_energy`)
+#> Spatial distribution with 149 spatial factors, from position 1 to 299 (298 bp)
+#> (`@spat`)
+#> Includes a model with 4 coefficients ("high-energy", "higher-energy",
+#> "low-energy", and "sigmoid") (`@coefs`)
+#>
+#> $E2A_1.2
+#> a <PBM> object named "E2A_1.2" with 15 positions (`@pssm`)
+#> Energy normalization max = -21.2037809248507 (`@max_energy`)
+#> Spatial distribution with 149 spatial factors, from position 1 to 299 (298 bp)
+#> (`@spat`)
+#> Includes a model with 4 coefficients ("high-energy", "higher-energy",
+#> "low-energy", and "sigmoid") (`@coefs`)
+#>
+#> $Isl1
+#> a <PBM> object named "Isl1" with 15 positions (`@pssm`)
+#> Energy normalization max = -25.5471422344978 (`@max_energy`)
+#> Spatial distribution with 149 spatial factors, from position 1 to 299 (298 bp)
+#> (`@spat`)
+#> Includes a model with 4 coefficients ("high-energy", "higher-energy",
+#> "low-energy", and "sigmoid") (`@coefs`)
+#>
+#> $E2A_1
+#> a <PBM> object named "E2A_1" with 15 positions (`@pssm`)
+#> Energy normalization max = -21.4483677716017 (`@max_energy`)
+#> Spatial distribution with 149 spatial factors, from position 1 to 299 (298 bp)
+#> (`@spat`)
+#> Includes a model with 4 coefficients ("high-energy", "higher-energy",
+#> "low-energy", and "sigmoid") (`@coefs`)
+#>
+#> $MyoG
+#> a <PBM> object named "MyoG" with 15 positions (`@pssm`)
+#> Energy normalization max = -23.3601778254574 (`@max_energy`)
+#> Spatial distribution with 149 spatial factors, from position 1 to 299 (298 bp)
+#> (`@spat`)
+#> Includes a model with 4 coefficients ("high-energy", "higher-energy",
+#> "low-energy", and "sigmoid") (`@coefs`)
+#>
+#> $Atoh1.3
+#> a <PBM> object named "Atoh1.3" with 15 positions (`@pssm`)
+#> Energy normalization max = -22.1394656075513 (`@max_energy`)
+#> Spatial distribution with 149 spatial factors, from position 1 to 299 (298 bp)
+#> (`@spat`)
+#> Includes a model with 4 coefficients ("high-energy", "higher-energy",
+#> "low-energy", and "sigmoid") (`@coefs`)
+#>
+#> $GATA3_2.2
+#> a <PBM> object named "GATA3_2.2" with 15 positions (`@pssm`)
+#> Energy normalization max = -25.6915278612726 (`@max_energy`)
+#> Spatial distribution with 149 spatial factors, from position 1 to 299 (298 bp)
+#> (`@spat`)
+#> Includes a model with 4 coefficients ("high-energy", "higher-energy",
+#> "low-energy", and "sigmoid") (`@coefs`)
+#>
+#> $Meis1
+#> a <PBM> object named "Meis1" with 15 positions (`@pssm`)
+#> Energy normalization max = -24.7345338401995 (`@max_energy`)
+#> Spatial distribution with 149 spatial factors, from position 1 to 299 (298 bp)
+#> (`@spat`)
+#> Includes a model with 4 coefficients ("high-energy", "higher-energy",
+#> "low-energy", and "sigmoid") (`@coefs`)
+#>
+#> $E2F6
+#> a <PBM> object named "E2F6" with 15 positions (`@pssm`)
+#> Energy normalization max = -25.9229999989994 (`@max_energy`)
+#> Spatial distribution with 149 spatial factors, from position 1 to 299 (298 bp)
+#> (`@spat`)
+#> Includes a model with 4 coefficients ("high-energy", "higher-energy",
+#> "low-energy", and "sigmoid") (`@coefs`)
+#>
+#> $Pit1
+#> a <PBM> object named "Pit1" with 15 positions (`@pssm`)
+#> Energy normalization max = -26.0310794002464 (`@max_energy`)
+#> Spatial distribution with 149 spatial factors, from position 1 to 299 (298 bp)
+#> (`@spat`)
+#> Includes a model with 4 coefficients ("high-energy", "higher-energy",
+#> "low-energy", and "sigmoid") (`@coefs`)
+#>
+#> $PQM_1
+#> a <PBM> object named "PQM_1" with 15 positions (`@pssm`)
+#> Energy normalization max = -24.6383171142784 (`@max_energy`)
+#> Spatial distribution with 149 spatial factors, from position 1 to 299 (298 bp)
+#> (`@spat`)
+#> Includes a model with 4 coefficients ("high-energy", "higher-energy",
+#> "low-energy", and "sigmoid") (`@coefs`)
+#>
+#> $TboxSmad
+#> a <PBM> object named "TboxSmad" with 15 positions (`@pssm`)
+#> Energy normalization max = -22.3027239225011 (`@max_energy`)
+#> Spatial distribution with 149 spatial factors, from position 1 to 299 (298 bp)
+#> (`@spat`)
+#> Includes a model with 4 coefficients ("high-energy", "higher-energy",
+#> "low-energy", and "sigmoid") (`@coefs`)
+#>
+#> $GATA3_2
+#> a <PBM> object named "GATA3_2" with 15 positions (`@pssm`)
+#> Energy normalization max = -26.5521923358056 (`@max_energy`)
+#> Spatial distribution with 149 spatial factors, from position 1 to 299 (298 bp)
+#> (`@spat`)
+#> Includes a model with 4 coefficients ("high-energy", "higher-energy",
+#> "low-energy", and "sigmoid") (`@coefs`)
+#>
+#> $E_box_2
+#> a <PBM> object named "E_box_2" with 15 positions (`@pssm`)
+#> Energy normalization max = -22.6841209042217 (`@max_energy`)
+#> Spatial distribution with 149 spatial factors, from position 1 to 299 (298 bp)
+#> (`@spat`)
+#> Includes a model with 4 coefficients ("high-energy", "higher-energy",
+#> "low-energy", and "sigmoid") (`@coefs`)
+#>
+#> $Tcf3
+#> a <PBM> object named "Tcf3" with 15 positions (`@pssm`)
+#> Energy normalization max = -23.5664496893186 (`@max_energy`)
+#> Spatial distribution with 149 spatial factors, from position 1 to 299 (298 bp)
+#> (`@spat`)
+#> Includes a model with 4 coefficients ("high-energy", "higher-energy",
+#> "low-energy", and "sigmoid") (`@coefs`)
+#>
+#> $Tcf4
+#> a <PBM> object named "Tcf4" with 15 positions (`@pssm`)
+#> Energy normalization max = -23.5260544042839 (`@max_energy`)
+#> Spatial distribution with 149 spatial factors, from position 1 to 299 (298 bp)
+#> (`@spat`)
+#> Includes a model with 4 coefficients ("high-energy", "higher-energy",
+#> "low-energy", and "sigmoid") (`@coefs`)
+#>
+#> $Atoh1
+#> a <PBM> object named "Atoh1" with 15 positions (`@pssm`)
+#> Energy normalization max = -22.55867400696 (`@max_energy`)
+#> Spatial distribution with 149 spatial factors, from position 1 to 299 (298 bp)
+#> (`@spat`)
+#> Includes a model with 4 coefficients ("high-energy", "higher-energy",
+#> "low-energy", and "sigmoid") (`@coefs`)
+#>
+#> $TATA_box
+#> a <PBM> object named "TATA_box" with 15 positions (`@pssm`)
+#> Energy normalization max = -23.5020346193338 (`@max_energy`)
+#> Spatial distribution with 149 spatial factors, from position 1 to 299 (298 bp)
+#> (`@spat`)
+#> Includes a model with 4 coefficients ("high-energy", "higher-energy",
+#> "low-energy", and "sigmoid") (`@coefs`)
+#>
+#> $Lhx2
+#> a <PBM> object named "Lhx2" with 13 positions (`@pssm`)
+#> Energy normalization max = -14.2673450054441 (`@max_energy`)
+#> Spatial distribution with 149 spatial factors, from position 1 to 299 (298 bp)
+#> (`@spat`)
+#> Includes a model with 4 coefficients ("high-energy", "higher-energy",
+#> "low-energy", and "sigmoid") (`@coefs`)
You can now use pbm_list.compute
or pbm_list.gextract
to infer the model on new data:
+new_intervals <- data.frame(
+ chrom = rep("chr1", 3),
+ start = c(3671720, 4412460, 4493400),
+ end = c(3672020, 4412760, 4493700)
+)
+pbm_list.gextract(pbm_list, new_intervals)
+#> ℹ Computing energies for 21 PBMs on 3 sequences
+#> chrom start end Oct4 Atoh1.2 Tbx5 E2A_1.2 Isl1 E2A_1
+#> 1 chr1 3671720 3672020 0.000000 0.000000 3.360820 0 3.729907 0
+#> 2 chr1 4412460 4412760 2.438457 2.731101 3.491086 0 4.280369 0
+#> 3 chr1 4493400 4493700 0.000000 0.000000 2.669436 0 3.894363 0
+#> MyoG Atoh1.3 GATA3_2.2 Meis1 E2F6 Pit1 PQM_1 TboxSmad
+#> 1 2.0862625 0 3.806423 3.154064 4.018626 2.305464 1.271002 0.1399182
+#> 2 2.3392323 0 5.520942 6.555252 5.438935 4.640641 5.010974 4.3378987
+#> 3 0.5483327 0 4.763735 4.778387 3.007386 4.004477 1.685282 0.4491135
+#> GATA3_2 E_box_2 Tcf3 Tcf4 Atoh1 TATA_box Lhx2
+#> 1 6.282286 1.2634071 0.000000 0.4460041 4.839001 0.000000 0
+#> 2 4.587564 0.4109196 2.038721 6.6088638 0.000000 1.872493 0
+#> 3 5.887933 0.9321862 2.091852 0.0000000 0.000000 2.445689 0
+
+# directly compute on sequences
+seqs <- prego::intervals_to_seq(new_intervals)
+seqs
+#> [1] "TTCCTCTCCTCCCCTCGCGCGCGCTCCCTCCTCCCGCAGCCTCTCCTCCACCAGCTGACTCCGAGGGAGAGGATGACCTCATCCCTTCCCTTCCAGCTGCCGCCGCTCCCACCCCGGCTGGGGAGGGGCGAGAAGGAGGGCCCGGAGGAGGGGCTGGGATTGAGGGGAGCGGCGGGTGGGGGTGCCTGGCTGGCCAGTGCTGGACGCGGAGGGCAACAGCACGGCAATCGGAGGCCCAGTCCAGGCTCGTGGGATAGCGAAGAGCGTTGAGTGGATTTCCTCGAAGCTGGGGGGATGGGA"
+#> [2] "ATCTCTGGAAAGACTTGTGCTGATCTCTCTCTGCCCCTTCCTTGATTCACATCTCAAGGGACCGAGAAGGGAGGGAAAACACCAGTCCAGTATTTCCTATCAGTTCAGCGGGGCAGGAACCGGGAGCTTTTCCACAGGGCTGAGCCTGGCCCTCCACTGAGCAGTGTCTGCATTCCAAGGCTCCAGCCTGTCACCACCCTTCCAATCCCTTTGAAGCTGGGCAAAAGGCCTGCCAACAAGCACCAAACTTGAGAGCTCCTCTGCCAGCCCTGGGAGGGGCTGTTTCCTGCCTGCTTTTCG"
+#> [3] "GCTCATGGCTCTCCAGACCGACCCCGAGCTCTGCTATGGCCACGGGACACGCCGCTTCCCCCGACCCTGAGGCAGGGATCGGAAGCTAGCCTGGAGATGCCCAGAGGAACTCGTAAAGCTGAGCGGGTGCTACCCTCCCGCTGCTCTCCTGGTAGACCTAACCCTTCGCCTAATCCGCGCTGGAGATCTACCCAGTGACACTGCGGGTGTCCCCCCGGGCCGCGGGGCCCTTTTCTTTATGGACGCGGCCAATGGCGAGGCGGGGGCGGACCGGACCCTAGTCCTTAGGCCCCCGCCCAG"
+pbm_list.compute(pbm_list, seqs)
+#> ℹ Computing energies for 21 PBMs on 3 sequences
+#> Oct4 Atoh1.2 Tbx5 E2A_1.2 Isl1 E2A_1 MyoG Atoh1.3
+#> [1,] 0.000000 0.000000 3.360820 0 3.729907 0 2.0862625 0
+#> [2,] 2.438457 2.731101 3.491086 0 4.280369 0 2.3392323 0
+#> [3,] 0.000000 0.000000 2.669436 0 3.894363 0 0.5483327 0
+#> GATA3_2.2 Meis1 E2F6 Pit1 PQM_1 TboxSmad GATA3_2 E_box_2
+#> [1,] 3.806423 3.154064 4.018626 2.305464 1.271002 0.1399182 6.282286 1.2634071
+#> [2,] 5.520942 6.555252 5.438935 4.640641 5.010974 4.3378987 4.587564 0.4109196
+#> [3,] 4.763735 4.778387 3.007386 4.004477 1.685282 0.4491135 5.887933 0.9321862
+#> Tcf3 Tcf4 Atoh1 TATA_box Lhx2
+#> [1,] 0.000000 0.4460041 4.839001 0.000000 0
+#> [2,] 2.038721 6.6088638 0.000000 1.872493 0
+#> [3,] 2.091852 0.0000000 0.000000 2.445689 0
The computational requirements for IceQream depend on the size of your dataset. For reference, the example dataset used here (with ~100,000 peaks and ~20,000 motifs) requires about 23GB of RAM for the motif energy matrix, and will take several hours to create, depending on your hardware.
+For a less memory and computationally intensive analysis, you can reduce the number of motifs used in the regression by taking a representative from the SCENIC clusters (1615) instead of all motifs (20,000+). This can be done by:
+
+# motif_energies <- compute_motif_energies(peak_intervals, motif_db_scenic_clusters, normalization_intervals = normalization_intervals)
Note that the performance of the model will be lower with fewer motifs, but it can still provide valuable insights.
To run IceQream on your own data, you will need to provide the following inputs:
+chrom
, start
, end
, peak_name
), optionally it can have a const
column indicating constitutive loci.You can then follow the steps outlined in this vignette to compute motif energies and run the regression.