\documentclass[a4paper,12pt]{article}
% \usepackage{ucs}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{caption}
\usepackage{mathpazo}
\usepackage{makeidx}
\usepackage{multicol}
\usepackage[T1]{fontenc}
\usepackage{graphicx}
\usepackage{listings}
\usepackage{float}
\usepackage{multirow}
\usepackage{color}
\usepackage{fancyvrb}
\lstset{breaklines=true}
\lstset{basicstyle=\ttfamily}
\lstset{framesep=10pt}
\usepackage{lscape}
\usepackage[top=1in,bottom=1in,right=1in,left=1in]{geometry}
\usepackage[pdftex]{hyperref}
\title{\textbf{MaLTE: Machine Learning of Transcript Expression}\\\textit{An \textsf{R} Package Implementing the MaLTE Framework}}
\author{Paul K Korir$^{1}$ and Cathal Seoighe$^2$\\
\\
$^{1}$School of Biochemistry and Cell Biology,\\
University College Cork, Cork \\
$^{2}$School of Mathematics, Statistics and Applied Mathematics,\\
National University of Ireland, Galway
}
\date{}
\makeindex
\begin{document}
\maketitle
\newpage
\tableofcontents
\newpage
% \begin{abstract}
% MaLTE is a novel approach to gene and transcript expression prediction that uses a supervised learning approach to achieve performance superior to conventional microarray summarisation. It learns from a gold standard how best to use fluorescence probe intensity values. This leads to more accurate and absolute expression estimates as well as an simple facility to obtain expression estimates for individual transcript isoforms.
% \end{abstract}
\section{Introduction}
\label{introduction}
Quantification of oligonucleotide expression microarrays involves assembling probe fluorescence intensities from sets of probes into a single gene or transcript expression measure, a process referred to as \textit{summarisation} \cite{irizarry2003summaries}. There are many summarisation algorithms \cite{irizarry2006comparison}, most of which suffer from two main limitations: they produce \textit{relative}, as opposed to \textit{absolute}, estimates \cite{irizarry2005multiple, fu2009estimating} and they are oblivious to the abundances of individual transcript isoforms \cite{malone2011microarrays}.
The \textsf{MaLTE} framework supplements conventional algorithms with a supervised learning approach that relates a gold standard to probe fluorescence intensities. As a framework, the various components may be modified or entirely overhauled; it is not restricted to the learning algorithms used in the \textsf{R} \textsf{MaLTE} package (conditional random forest (CRF) and quantile regression random forest (QRF)). Currently, we use RNA-Seq as the gold-standard HTS quantification technique. Figure \ref{fig:schematic} below shows a schematic of the \textsf{MaLTE} framework. Learning occurs for each gene independently, though only one gene is illustrated. For gene expression, \textsf{MaLTE} incorporates simple feature selection by picking the 15 probes that correlate best with the expression in the gold standard.
\begin{figure}[H]
\centering
\includegraphics[width=\columnwidth]{schematic.pdf}
\caption{\textbf{Schematic of the MaLTE framework.} A supervised learning algorithm is used to learn the relationship between a gold-standard (RNA-Seq) and probe fluorescence intensities. The learned model may then be applied to a new set of probes.}
\label{fig:schematic}
\end{figure}
This approach overcomes the two main setbacks of conventional algorithms: it transforms expression estimates onto an absolute scale, thus improving within-sample correlations, and it extends naturally to predicting the expression of individual transcript isoforms by training on the multiple responses of individual transcript isoforms. Moreover, \textsf{MaLTE} leads to substantial improvements in cross-sample correlations when used on data from the same batch, which increases statistical power. Finally, tree-based learning introduces an easy way to filter out poorly predicted genes through the use of out-of-bag estimates.
This document describes how to use the \textsf{R} \textsf{MaLTE} package. It begins with a description of how to obtain and install the package. It then outlines the two main ways in which the package may be used: \textit{gene expression prediction (GEP)} and \textit{transcript isoform expression prediction (TIEP)}. It concludes with a description of how to filter and collate expression predictions for downstream analyses. It also provides a tentative road map for future development, a detailed description of how the \textsf{MaLTE} package is built, and a detailed use case employing data from the Genotype-Tissue Expression (GTEx) project \cite{lonsdale2013genotype} as training data, together with an example of array transformation (e.g. from exon array to gene array). Bug reports, comments and suggestions are welcome through \href{mailto:[email protected]}{[email protected]}.
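The per-gene feature selection mentioned above can be sketched in Python. This is an illustrative reimplementation, not code from the \textsf{MaLTE} package; the function names are invented, and whether the package ranks probes by signed or absolute correlation is not specified here, so the absolute value below is an assumption.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_top_probes(probes, gold_standard, n_probes=15):
    """Rank probes by correlation with the gold-standard (RNA-Seq)
    expression across training samples and keep the best n_probes.

    probes:        dict probe_id -> list of per-sample intensities
    gold_standard: list of per-sample RNA-Seq expression estimates
    """
    ranked = sorted(probes,
                    key=lambda p: abs(pearson(probes[p], gold_standard)),
                    reverse=True)
    return ranked[:n_probes]
```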
\section{Conventions Used}
\label{conventions}
\begin{itemize}
\item Array data used here corresponds to that from \textit{Affymetrix GeneChip$^{\text{\textregistered}}$ Human Exon 1.0 ST} arrays. At present only Affymetrix GeneChip$^{\text{\textregistered}}$ Human Exon and Human Gene arrays have been tested using \textsf{MaLTE}.
\item All gene and transcript identifiers are from Ensembl (\url{http://www.ensembl.org}).
\item All filenames are written in italics (\textit{filename.txt}), variables in monospace (\texttt{my.var}), and functions in monospace terminated with parentheses (\texttt{my.function()}). Classes are in monospace beginning with a capital letter (\texttt{My.Class}) while corresponding constructors additionally terminate with parentheses (\texttt{My.Class()}). Names of software packages are in sans (\textsf{R}, \textsf{MaLTE}, \textsf{Cufflinks}).
\item \textit{High-throughput sequencing (HTS)} refers to \textit{RNA-Seq}.
\item A \textit{map} is a tab-delimited text file with two columns with each column consisting of identifiers/names.
\item The set of samples used for training are called \textit{training samples}. \textit{Test samples} refer to the samples that need to have their gene/transcript expression quantified.
\end{itemize}
\section{System Requirements}
\label{system}
\begin{itemize}
\item \textsf{R} (\texttt{2.14.0} or greater) installed on GNU/Linux: \textsf{MaLTE} has been tested on Scientific Linux version 5, Ubuntu 12.04, and Mac OS X.
\item \textsf{R} packages: \texttt{party}, \texttt{multicore}, \texttt{quantregForest}, \texttt{limma}
\item \textsf{Python} 2.7
\item Affymetrix Power Tools (APT; \url{http://www.affymetrix.com})
\item Affymetrix GeneChip$^{\text{\textregistered}}$ library files (\url{http://www.affymetrix.com})
\item \textsf{git} (optional)
\end{itemize}
\section{Installation}
\label{installation}
\textsf{MaLTE} may be downloaded from \url{https://github.com/polarise/MaLTE-package}. There are two ways to install \textsf{MaLTE}: via the GNU/Linux shell or via the \textsf{R} shell.
\begin{enumerate}
\item Via GNU/Linux shell
\begin{verbatim}
R CMD INSTALL MaLTE_<release>.tar.gz
\end{verbatim}
\item Via \textsf{R} shell
\begin{verbatim}
> install.packages( "/path/to/MaLTE_<release>.tar.gz" )
\end{verbatim}
\end{enumerate}
\noindent\\
Once installed, \textsf{MaLTE} should be loaded in the \textsf{R} shell as follows:
\begin{verbatim}
> library( MaLTE )
\end{verbatim}
or, quietly
\begin{verbatim}
> suppressMessages( library( MaLTE ))
\end{verbatim}
\section{Example Dataset}
\label{datasets}
Example data files, based on data from the HapMap project \cite{gibbs2003international}, are provided with \textsf{MaLTE}. These are all contained in the \texttt{data} directory in the package. They will be used in Section~\ref{gep} and Section~\ref{tiep} and are formatted as outlined in Section~\ref{gep:detailed} and Section~\ref{tiep:detailed}.
\begin{enumerate}
\item Files with sample names (several examples provided)
\item HTS data
\item Transcript HTS data
\item Raw microarray data (direct output from APT \texttt{apt-cel-extract})
\item Truncated microarray data (non-essential rows and columns removed)
\item Gene-to-probeset maps for exon array
\item Gene-to-transcript maps
\end{enumerate}
The complete set of files may be displayed on the \textsf{R} shell using the following line:
\begin{verbatim}
> dir( paste( system.file( package="MaLTE" ), "data", sep="/" ))
\end{verbatim}
\section{Gene Expression Prediction}
\label{gep}
\subsection{Quick Start Guide}
\label{gep:quick}
This section provides a quick introduction to using \textsf{MaLTE}. Detailed instructions incorporating descriptions of the various files and their respective formats are provided in Section \ref{gep:detailed}. It is assumed that the user is in the \textsf{R} shell, the \textsf{MaLTE} library is loaded and the following data files are available: \textit{samples.txt}, \textit{hts\_data.txt}, \textit{ma\_data.txt}, \textit{gene\_probesets.txt}.
\noindent\\
\textbf{Step I:} Provide the location of the file containing a map between sample names on both platforms (RNA-Seq and microarray)
\begin{verbatim}
> samples.fn = paste( system.file( package="MaLTE" ), "data",
"samples.txt.gz", sep="/" )
\end{verbatim}
\noindent\\
\textbf{Step II:} Provide the location of the file containing the high-throughput sequencing (RNA-Seq) data
\begin{verbatim}
> hts.fn = paste( system.file( package="MaLTE" ), "data",
"hts_data.txt.gz", sep="/" )
\end{verbatim}
\noindent\\
\textbf{Step III:} Provide the location of the file containing quantile-normalized and background-corrected fluorescence probe intensities
\begin{verbatim}
> ma.fn = paste( system.file( package="MaLTE" ), "data",
"ma_data.txt.gz", sep="/" )
\end{verbatim}
\noindent\\
\textbf{Step IV:} Provide the location of a map showing the probe sets associated with each gene
\begin{verbatim}
> g2p.fn = paste( system.file( package="MaLTE" ), "data",
"gene_probesets.txt.gz", sep="/" )
\end{verbatim}
\noindent\\
\textbf{Step V:} Prepare the data into training and test sets
\begin{verbatim}
> prepare.data( samples.fn=samples.fn, ma.fn=ma.fn, hts.fn=hts.fn,
g2p.fn=g2p.fn )
\end{verbatim}
\noindent\\
\textbf{Step VI:} Read the data in preparation for the training and test phase
\begin{verbatim}
> tt.ready = read.data( train.fn="train_data.txt.gz",
test.fn="test_data.txt.gz" )
\end{verbatim}
\noindent\\
\textbf{Step VII:} Initialize training parameters
\begin{verbatim}
# conditional random forest
> tt.params = TT.Params()
# quantile regression random forest
> tt.params = TT.Params( quantreg=TRUE )
\end{verbatim}
\noindent\\
\textbf{Step VIII:} Train and predict
\begin{verbatim}
> tt.seq = array2seq( tt.ready, tt.params )
\end{verbatim}
\noindent\\
\textbf{Step IX:} Perform out-of-bag (OOB) predictions
\begin{verbatim}
> tt.seq.oob = array2seq.oob( tt.ready, tt.params )
\end{verbatim}
\noindent\\
\textbf{Step X:} Filter based on OOB correlations
\begin{verbatim}
> tt.filtered = oob.filter( tt.seq, tt.seq.oob, thresh=0 )
\end{verbatim}
\noindent\\
\textbf{Step XI:} Get the names of test samples
\begin{verbatim}
> test.names = get.names( samples.fn, test=TRUE )
\end{verbatim}
or
\begin{verbatim}
> test.names = get.test( samples.fn ) # get test sample names
\end{verbatim}
\noindent\\
\textbf{Step XII:} Aggregate predictions and write output to a text file for downstream analyses.
\begin{verbatim}
> df = get.predictions( tt.filtered, test.names )
> write.table( df, file="filt_preds.txt", col.names=TRUE, row.names=FALSE,
quote=FALSE, sep="\t" )
\end{verbatim}
\subsection{Detailed Instructions}
\label{gep:detailed}
Gene expression prediction (GEP) depends on having four input files:
\begin{itemize}
\item \textbf{Sample names.} A map of sample names between both platforms (HTS and array; possibly zipped). An example file is provided with the package (Step I).
\item \textbf{HTS data.} The HTS (RNA-Seq) data in text file (possibly zipped)
\item \textbf{Microarray data.} The microarray probe data (possibly zipped)
\item \textbf{Gene-to-probeset map.} A map between gene identifiers and probe set identifiers. Probe set identifiers are provided by the array manufacturer as part of the array description.
\end{itemize}
Each file is now described in detail under the following sub-headings: \textit{purpose}, \textit{generic designation}, \textit{header} and \textit{structure}.
\subsubsection{Sample Names File}
\label{gep:sample}
\begin{tabular}{rp{12cm}}
\textbf{Purpose} & This file provides a one-to-one map between sample identifiers on both platforms. \\
\textbf{Generic designation} & \textit{samples.txt} OR \textit{samples.txt.gz}; Any suitable name will do but the file name must end either with \textit{*.txt} or \textit{*.txt.gz}. \\
\textbf{Header} & \texttt{hts\textless tab\textgreater ma} \\
\textbf{Structure} & Two-columns separated by a single tab (tab-delimited) \\
& All sample identifiers must be unique and must exactly correspond to sample names present in the headers of the HTS data and microarray data (to be described shortly). For example, microarray probe files will usually have headers with sample names terminated by \textit{*.CEL}; this must be retained in the sample names file. \\
& Test samples are marked by having an asterisk (\texttt{`*'}) as the first character. Any other row is assumed to be a training sample. \\
& Test samples may consist of having both HTS and array data. If only test array data is present then the first column must be \texttt{`*NA'}. \\
& Comments must begin with a pound/hash (\texttt{`\#'}) symbol. \\
\end{tabular}
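The parsing rules in the table above can be sketched as follows. This is an illustrative Python reading of the format, not code from the \textsf{MaLTE} package, and \texttt{parse\_sample\_map} is an invented name.

```python
def parse_sample_map(lines):
    """Split a sample-names map into training and test pairs.

    Rules from the format description: the header row is 'hts<tab>ma',
    '#' starts a comment, and a leading '*' on the HTS column marks a
    test sample ('*NA' when only test array data is present).
    """
    train, test = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue                      # skip comments and blank lines
        hts, ma = line.split("\t")
        if (hts, ma) == ("hts", "ma"):
            continue                      # skip the header row
        if hts.startswith("*"):
            test.append((hts[1:], ma))    # '*NA' -> HTS name 'NA'
        else:
            train.append((hts, ma))
    return train, test
```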
\subsubsection{HTS Data File}
\label{gep:hts}
\begin{tabular}{rp{12cm}}
\textbf{Purpose} & This file provides the HTS expression estimates.\\
\textbf{Generic designation} & \textit{hts\_data.txt} OR \textit{hts\_data.txt.gz} \\
& Any suitable name will do but the file name must end either with \textit{*.txt} or \textit{*.txt.gz}. \\
\textbf{Header} & \texttt{gene\_id\textless tab\textgreater Sample01\textless tab\textgreater...\textless tab\textgreater SampleN} \\
\textbf{Structure} & Tab-delimited \\
& All rows must be unique \\
& No comments are allowed \\
\end{tabular}
\subsubsection{Microarray Data File}
\label{gep:microarray}
\begin{tabular}{rp{12cm}}
\textbf{Purpose} & This file provides the microarray probe intensities. \\
\textbf{Generic designation} & \textit{ma\_data.txt} OR \textit{ma\_data.txt.gz} if inessential columns have been removed (see Section \ref{gep:prepare} on \textit{Preparing Input Files}); \textit{raw\_ma\_data.txt} OR \textit{raw\_ma\_data.txt.gz} if inessential columns are still present (see Section \ref{gep:prepare} on \textit{Preparing Input Files}) \\
& Any suitable name will do but its name must terminate with \textit{*.txt} or \textit{*.txt.gz}. \\
\textbf{Header} & \texttt{probe\_id\textless tab\textgreater Sample01\textless tab\textgreater...\textless tab\textgreater SampleN} \\
\textbf{Structure} & Tab-delimited \\
& Comments must begin with a pound/hash (\texttt{`\#'}) symbol. \\
\end{tabular}
\subsubsection{Gene-to-Probe Set Map}
\label{gep:gene}
\begin{tabular}{rp{12cm}}
\textbf{Purpose} & This file provides a one-to-many map of gene identifiers to probe set identifiers. \\
\textbf{Generic designation} & \textit{gene\_probesets.txt} OR \textit{gene\_probesets.txt.gz} \\
& Any suitable name will do but the file name must end either with \textit{*.txt} or \textit{*.txt.gz}. \\
\textbf{Header} & \texttt{gene\_id\textless tab\textgreater probeset\_id} \\
\textbf{Structure} & Tab-delimited \\
& Probe set identifiers may be missing for some genes. \\
& No comments are allowed. \\
\end{tabular}
\subsection{Preparing Input Files}
\label{gep:prepare}
\begin{enumerate}
\item \textbf{Sample names.} This file can be prepared using a text editor or preferably using a spreadsheet application such as Microsoft Excel or LibreOffice/OpenOffice Calc. The file must be saved as a tab-delimited text file or a comma-separated values (CSV) file with the field-delimiter set to TABS and the quote-character set to NONE. The file extension must be as described above.
\item \textbf{HTS data.} This file may be constructed using customized scripts that collate the HTS expression estimates output from programs such as \textsf{Cufflinks} or \textsf{DESeq}. The order of sample columns is unimportant. There are several online datasets\footnote{\url{http://bowtie-bio.sourceforge.net/recount/}} that are provided in this format making it easy to proceed with using \textsf{MaLTE}.
\item \textbf{Microarray data.} Raw microarray data is provided as CEL files. The contents of CEL files need to be extracted and additional pre-processing steps may be applied to the raw data. Two pre-processing steps we recommend are quantile-normalization (QN) and background correction (BC). The Affymetrix Power Tools (APT) suite is recommended for this and other analytical steps though several \textsf{R} packages have been developed to supplement APT. Here we describe how to extract fluorescence probe intensities and how to remove unnecessary columns.
\begin{enumerate}
\item[(i)] \textbf{Extracting QN and BC microarray probe data.} We assume that all CEL files are contained in a single folder. APT requires a set of library files that are available from the Affymetrix website (\url{http://www.affymetrix.com}). An account will have to be created in order to download library files. The library files consist of array description files used by APT to carry out analyses. More information on these can be found in the manuals provided for each array type.
To extract probes with QN and BC:
\begin{verbatim}
apt-cel-extract -o raw_ma_data.txt
-c /path/to/HuEx-1_0-st-v2.2/HuEx-1_0-st-v2.r2.clf
-p /path/to/HuEx-1_0-st-v2.2/HuEx-1_0-st-v2.r2.pgf
-b /path/to/HuEx-1_0-st-v2.2/HuEx-1_0-st-v2.r2.antigenomic.bgp
-a quant-norm,pm-gcbg *.CEL
\end{verbatim}
This provides `raw' data that can directly be used with \textsf{MaLTE}. To do so, the \texttt{raw} argument in \texttt{prepare.data()} must be set to \texttt{TRUE} (as it is \texttt{FALSE} by default). However, unnecessary columns can be excluded as shown below.
\item[(ii)] \textbf{Excluding unnecessary columns.} Unnecessary columns can be easily excluded using the bash utility \texttt{cut} as follows:
\begin{verbatim}
cut -f1,5,8- raw_ma_data.txt > ma_data.txt
\end{verbatim}
\end{enumerate}
\item \textbf{Zip the files to save space.}
\begin{verbatim}
gzip raw_ma_data.txt
gzip ma_data.txt
\end{verbatim}
\item \textbf{Gene-to-probeset map.} This file may be downloaded directly from \textsf{BioMart}, particularly for popular arrays. Alternatively, users may prepare it themselves using \textsf{BEDTools}. \textsf{BEDTools} takes as input two BED files containing the coordinates of genes and probe sets, respectively. The \texttt{intersect} \textsf{BEDTools} utility then finds all overlaps between the two files and writes them to an extended BED file. The appropriate columns can then be combined to produce the required file.
To use the BED approach, the gene annotation must be provided as a BED file. Similarly, the array's annotation files (available from the Affymetrix website) must be converted to BED format. Both tasks may be performed using custom scripts.
Please consult the \textsf{BEDTools} website (\url{http://bedtools.readthedocs.org/en/latest/}) on how to intersect two BED files.
The gene and probe set columns can then be isolated using \texttt{cut} in a manner similar to sub-step (ii) above.
\end{enumerate}
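The custom collation scripts mentioned for the HTS data above (combining per-sample \textsf{Cufflinks}- or \textsf{DESeq}-style estimates into the \textit{hts\_data.txt} layout) might look like the following sketch. The function name and the zero fill for genes missing from a sample are assumptions for illustration, not \textsf{MaLTE} conventions.

```python
import csv

def collate_estimates(per_sample, out_path):
    """Collate per-sample {gene_id: estimate} dicts into one
    tab-delimited table whose header matches the required format:
    'gene_id<tab>Sample01<tab>...<tab>SampleN'.

    per_sample: dict mapping sample name -> {gene_id: expression}
    """
    samples = sorted(per_sample)
    genes = sorted(set().union(*(per_sample[s] for s in samples)))
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(["gene_id"] + samples)
        for g in genes:
            # genes absent from a sample are filled with 0.0 here
            writer.writerow([g] + [per_sample[s].get(g, 0.0)
                                   for s in samples])
```

The order of sample columns is unimportant, as noted above, so the sorted order used here is only for reproducibility.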
\section{Transcript Isoform Expression Prediction}
\label{tiep}
\subsection{Quick Start Guide}
\label{tiep:quick}
This section describes how to perform transcript isoform expression prediction in quick steps. It assumes that the user has logged into an \textsf{R} shell, the \textsf{MaLTE} package is loaded and that the following files are available: \textit{samples.txt}, \textit{train\_data.txt.gz}, \textit{test\_data.txt.gz}, \textit{hts\_txs\_data.txt}, and \textit{gene\_transcripts.txt}.
The files \textit{train\_data.txt.gz} and \textit{test\_data.txt.gz} are produced by running Steps I--V in Section \ref{gep:quick}, \textit{Gene Expression Prediction: Quick Start Guide}.
\noindent\\
\textbf{Step I:} Provide the location of the file containing a map between sample names on both platforms (RNA-Seq and microarray)
\begin{verbatim}
> samples.fn = paste( system.file( package="MaLTE" ), "data",
"samples.txt.gz", sep="/" )
\end{verbatim}
\noindent\\
\textbf{Step II:} Provide the location of the file containing the high-throughput sequencing (RNA-Seq) transcript isoform expression estimates
\begin{verbatim}
> hts.txs.fn = paste( system.file( package="MaLTE" ), "data",
"hts_txs_data.txt.gz", sep="/" )
\end{verbatim}
\noindent\\
\textbf{Step III:} Provide the location of the map of gene-to-transcript identifiers
\begin{verbatim}
> g2tx.fn = paste( system.file( package="MaLTE" ), "data",
"gene_transcripts.txt.gz", sep="/" )
\end{verbatim}
\noindent\\
\textbf{Step IV:} Prepare the data into training and test sets
\begin{verbatim}
> prepare.txs.data( samples.fn=samples.fn, train.fn="train_data.txt.gz",
test.fn="test_data.txt.gz", hts.txs.fn=hts.txs.fn, g2tx.fn=g2tx.fn )
\end{verbatim}
\noindent\\
\textbf{Step V:} Read in the data in preparation for training and prediction
\begin{verbatim}
> tt.ready.txs = read.txs.data( train.fn="train_txs_data.txt.gz",
test.fn="test_txs_data.txt.gz" )
\end{verbatim}
\textbf{Step VI:} Train and predict (\texttt{tt.params} is initialized as in Step VII of Section \ref{gep:quick})
\begin{verbatim}
> tt.seq.txs = array2seq( tt.ready.txs, tt.params )
\end{verbatim}
\noindent\\
\textbf{Step VII:} Train and predict for OOB estimates
\begin{verbatim}
> tt.seq.oob.txs = array2seq.oob( tt.ready.txs, tt.params )
\end{verbatim}
\textbf{Step VIII:} Filter based on OOB correlations
\begin{verbatim}
> tt.filtered.txs = oob.filter( tt.seq.txs, tt.seq.oob.txs, thresh=0 )
\end{verbatim}
\noindent\\
\textbf{Step IX:} Get test sample names
\begin{verbatim}
> test.names = get.names( samples.fn, test=TRUE )
\end{verbatim}
or
\begin{verbatim}
> test.names = get.test( samples.fn ) # get test sample names
\end{verbatim}
\noindent\\
\textbf{Step X:} Collate predicted transcript isoform predictions and write them to a text file for downstream analyses
\begin{verbatim}
> df.txs = get.predictions( tt.filtered.txs, test.names )
> write.table( df.txs, file="filt_txs_preds.txt", col.names=TRUE,
    row.names=FALSE, quote=FALSE, sep="\t" )
\end{verbatim}
\subsection{Detailed Instructions}
\label{tiep:detailed}
Transcript isoform expression prediction (TIEP) depends on having five input files:
\begin{itemize}
\item \textbf{Sample names.} A map of sample names between both platforms (HTS and array; possibly zipped). This is the exact same file used in GEP above.
\item \textbf{GEP Training data.} The name of this file is \textit{train\_data.txt.gz}. It is the first zipped output produced by running \texttt{prepare.data()} prior to carrying out GEP.
\item \textbf{GEP Test data.} The name of this file is \textit{test\_data.txt.gz}. It is the second zipped output produced by running \texttt{prepare.data()} prior to carrying out GEP.
\item \textbf{Transcript HTS data.} The HTS (RNA-Seq) transcript isoform data in text file (possibly zipped)
\item \textbf{Gene-to-transcript map.} A map between gene identifiers and transcript identifiers.
\end{itemize}
\subsubsection{Sample Names File}
\label{tiep:sample}
\begin{tabular}{rp{12cm}}
\textbf{Purpose} & This file provides a one-to-one map between sample identifiers on both platforms. \\
\textbf{Generic designation} & \textit{samples.txt} OR \textit{samples.txt.gz} \\
& Any suitable name will do but the file name must end either with \textit{*.txt} or \textit{*.txt.gz}. \\
\textbf{Header} & \texttt{hts\textless tab\textgreater ma} \\
\textbf{Structure} & Two-columns separated by a single tab (tab-delimited) \\
& All sample identifiers must be unique and must exactly correspond to sample names present in the headers of the HTS data and microarray data. For example, microarray probe files will usually have headers with sample names terminated by \textit{*.CEL}; this must be retained in the sample names file. \\
& Test samples are marked by having an asterisk (\texttt{`*'}) as the first character. Any other row is assumed to be a training sample. \\
& Test samples may consist of having both HTS and array data. If only test array data is present then the first column must be \texttt{`*NA'}. \\
& Comments must begin with a pound/hash (\texttt{`\#'}) symbol. \\
\end{tabular}
\subsubsection{GEP Training Data File}
\label{tiep:gep_train}
\begin{tabular}{rp{12cm}}
\textbf{Purpose} & This file contains the gene-to-probe training data that will be used to create new transcript-to-probe training data. \\
\textbf{Generic designation} & \textit{train\_data.txt.gz} \\
& This file is produced after running \texttt{prepare.data()}. \\
\textbf{Header} & None \\
\textbf{Structure} & This file has six columns. The data are automatically generated by the \texttt{prepare.data()} function. \\
& Gene identifier \\
& Number of training samples \\
& Number of probes associated with this gene \\
& Probe (not probe set) identifiers associated with this gene \\
& HTS expression estimates \\
& Vectorized matrix\footnotemark[1] of fluorescence probe intensities \\
\end{tabular}
\footnotetext[1]{A \textit{vectorized matrix} is a stack of the columns of a matrix into a single column vector.}
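The column-stacking described in the footnote can be illustrated with a short sketch (an invented helper for illustration, not package code):

```python
def vectorize(matrix):
    """Stack the columns of a matrix (given as a list of rows) into a
    single vector, as in the 'vectorized matrix' column of the file."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [matrix[i][j] for j in range(n_cols) for i in range(n_rows)]
```

For example, the 2x2 matrix with rows (1, 2) and (3, 4) vectorizes column-by-column to (1, 3, 2, 4).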
\subsubsection{GEP Test Data File}
\label{tiep:gep_test}
\begin{tabular}{rp{12cm}}
\textbf{Purpose} & This file contains the gene-to-probe test data that will be used to create new transcript-to-probe test data. \\
\textbf{Generic designation} & \textit{test\_data.txt.gz} \\
\textbf{Header} & None \\
\textbf{Structure} & This file has six columns. The data are automatically generated by the \texttt{prepare.data()} function. \\
& Gene identifier \\
& Number of test samples \\
& Number of probes associated with this gene \\
& Probe (not probe set) identifiers associated with this gene \\
& HTS expression estimates \\
& Vectorized matrix of fluorescence probe intensities
\end{tabular}
\subsubsection{Transcript HTS Data File}
\label{tiep:transcript}
\begin{tabular}{rp{12cm}}
\textbf{Purpose} & This file provides the HTS transcript isoform expression estimates. \\
\textbf{Generic designation} & \textit{hts\_txs\_data.txt} OR \textit{hts\_txs\_data.txt.gz} \\
& Any suitable name will do but the file name must end either with \textit{*.txt} or \textit{*.txt.gz}. \\
\textbf{Header} & \texttt{tx\_id\textless tab\textgreater Sample01\textless tab\textgreater...\textless tab\textgreater SampleN} \\
\textbf{Structure} & Tab-delimited \\
& All rows must be unique \\
& No comments are allowed \\
\end{tabular}
\subsubsection{Gene-to-Transcript Map}
\label{tiep:gene}
\begin{tabular}{rp{12cm}}
\textbf{Purpose} & This file provides a one-to-many map of gene identifiers to transcript identifiers. \\
\textbf{Generic designation} & \textit{gene\_transcripts.txt} OR \textit{gene\_transcripts.txt.gz} \\
& Any suitable name will do but the file name must end either with \textit{*.txt} or \textit{*.txt.gz}. \\
\textbf{Header} & \texttt{gene\_id\textless tab\textgreater tx\_id} \\
\textbf{Structure} & Tab-delimited \\
& No comments are allowed. \\
\end{tabular}
\subsection{Preparing Input Files}
\label{tiep:prepare}
\begin{enumerate}
\item \textbf{Sample names.} Please see Section~\ref{gep:prepare}, \textit{GEP: Preparing Input Files}.
\item \textbf{GEP training data.} This file is automatically generated by running \texttt{prepare.data()}. Please see Steps I--V of Section~\ref{gep:quick}, \textit{GEP: Quick Start Guide}.
\item \textbf{GEP test data.} This file is automatically generated by running \texttt{prepare.data()}. Please see Steps I--V of Section~\ref{gep:quick}, \textit{GEP: Quick Start Guide}.
\item \textbf{Transcript HTS data file.} Several programs are available that perform transcript isoform expression quantification from HTS data. Widely used examples include \textsf{Cufflinks}, \textsf{IsoEM} and \textsf{RSEM}. As suggested in the HTS data description in Section \ref{gep:prepare}, \textit{GEP: Preparing Input Files}, custom scripts should be used to combine expression estimates. Sample names in the header must exactly correspond to those in the sample names file. The order of samples is not important.
\item \textbf{Gene-to-transcript map.} A man of gene to transcript identifiers is most easily obtained from \textsf{BioMart}.
\end{enumerate}
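Since the sample names in the transcript HTS header must correspond exactly to those in the sample names file, a quick consistency check before running \textsf{MaLTE} can save a failed run. The sketch below uses mock stand-ins for both files and assumes the HTS sample name occupies the first column of \textit{samples.txt}:

```shell
# Mock files standing in for the real ones, only to make the sketch runnable.
printf 'tx_id\tSample01\tSample02\n' > hts_txs_data.txt
printf 'Sample01\tArray01.CEL\nSample02\tArray02.CEL\n*NA\tTest01.CEL\n' \
  > samples.txt
# List samples named in the HTS header but absent from samples.txt.
head -n 1 hts_txs_data.txt | tr '\t' '\n' | tail -n +2 | sort > hts_names.tmp
cut -f1 samples.txt | sort > known_names.tmp
comm -23 hts_names.tmp known_names.tmp  # empty output: all names match
rm hts_names.tmp known_names.tmp
```

Any sample printed by \texttt{comm} would need to be added to the sample names file (or removed from the HTS data) before proceeding.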
\section{Filtering and Collating Predictions}
\label{filtering}
The training and prediction step produces data in an internal format that is not suitable for further bioinformatic analyses. The prediction estimates need to be filtered to exclude poor predictions and then converted into a data frame that can be saved as a tab-delimited text file. Steps IX and X show GEP OOB filtering, and Steps VII and VIII show TIEP OOB filtering. In order for expression predictions to be collated, the sample names must first be obtained (GEP: Step XI; TIEP: Step X) from the sample names file (\textit{samples.txt}). Finally, the extracted data may be written to a tab-delimited text file using base \textsf{R} functions.
Filtering takes advantage of the tree-based learning algorithm. Conditional random forests are used in the current implementation of \textsf{MaLTE}. A forest is an ensemble of trees, with each tree constructed by bootstrapping the training data and then using a recursive partitioning approach to define the branches. The forest consists of hundreds to thousands of such trees, and predictions are aggregated either by a simple voting scheme (classification) or by averaging (regression). Each training observation is used to train a subset of all the trees; therefore, the trees from which it is absent (for which it is \textit{out-of-bag}) can be used to predict its value. The resulting predicted estimates are therefore called the \textit{out-of-bag (OOB) estimates}, from which we can estimate a correlation with the training response. It is this correlation that acts as a heuristic for prediction performance and underlies \textit{OOB filtering}. An OOB Pearson correlation threshold of zero ($r_{\mathrm{OOB}} > 0$) is currently set as the default filter.
Alternatively, filtering can be based on the training RPKM/FPKM values. For example, we could include all genes with at least ten samples with an FPKM above one.
\begin{verbatim}
# tt.seq - predictions from training data
# tt.seq.oob - OOB predictions (contains training RNA-Seq)
# only include genes with at least 10 samples having
# RNA-Seq expression of at least 1
> filt.indexes = .filter.trues( tt.seq.oob, filter.fpkm=1, filter.count=10 )
> tt.filt = tt.seq[ filt.indexes ]
# extract a data frame of filtered predictions
> test.names = get.test( samples.fn ) # get names of test samples
> df.malte = get.predictions( tt.filt, test.names )
\end{verbatim}
\section{Training Parameters}
\label{params}
There are eight main parameters that can be specified when running \textsf{MaLTE}:
\begin{verbatim}
> tt.params = TT.Params()
> tt.params
mtry = 2
ntree = 1000
feature.select = TRUE
min.probes = 15
cor.thresh = 0
OOB = FALSE
QuantReg = FALSE
Tune (OOB cor.P) = NA
\end{verbatim}
\begin{enumerate}
\item \textbf{\texttt{mtry.}} Specifies the number of predictors (probes) to be used in creating splits in the tree. The default and optimized parameter value is \texttt{mtry=2}.
\item \textbf{\texttt{ntree.}} The number of trees in the forest. The default and optimized value is \texttt{ntree=1000}.
\item \textbf{\texttt{feature.select.}} A boolean that sets whether feature selection is carried out. The default value is \texttt{feature.select=TRUE}.
\item \textbf{\texttt{min.probes.}} If \texttt{feature.select=TRUE} then what is the minimum number of probes below which no feature selection is carried out? Defaults to \texttt{min.probes=15} (i.e. feature selection is only carried out for 16 or more probes).
\item \textbf{\texttt{cor.thresh.}} Feature selection is implemented using a simple method: the top \texttt{min.probes} probes that correlate with the response (RNA-Seq) are used if they have a Pearson correlation of at least \texttt{cor.thresh}.
\item \textbf{\texttt{OOB.}} A boolean that specifies whether or not OOB predictions are performed. Defaults to \texttt{OOB=FALSE}.
\item \textbf{\texttt{quantreg.}} A boolean that specifies whether or not to use quantile regression forest as the learner. Defaults to \texttt{quantreg=FALSE}.
\item \textbf{\texttt{tune.}} This is an experimental feature in which parameter values are tuned for each gene individually. Parameters are tuned in a cascade in the following order: \texttt{feature.select}, \texttt{mtry}, then \texttt{ntree}. Other parameters are retained as specified above. Defaults to \texttt{tune=FALSE}.
\end{enumerate}
\section{Future Work}
\label{future}
The following list of features may be added at a future date:
\begin{enumerate}
\item \textbf{Replace the underlying \textsf{Python} scripts with C/C++ programs.} We used \textsf{Python} because it was relatively simple to put together and it has mature data structures. \textsf{Python} scripts are slower than \textsf{C/C}\verb!++! programs, but we have applied multiprocessing to shorten the data preparation step. However, this arrangement restricts \textsf{MaLTE} to UNIX-like systems.
\item \textbf{Configure training and test data using a standardised structured file format.} Currently, training and test data is held in custom tab-delimited files. We would like to transition either to XML or HDF.
\item \textbf{Expand the use of the sample names (\textit{samples.txt}) file.} Currently, the sample names file is underutilised. It is possible to include additional columns that could be incorporated into the learning process. For example, a column on batch information could be passed to \textsf{ComBat} to minimise batch effects. Other variables such as tissue type could be important for training.
\item \textbf{Automatic parameter tuning.} We would like to incorporate a tuning utility that uses a random sample of the training data to optimise the training parameters.
\end{enumerate}
\section{Bug Reports}
\label{bugs}
Please send bug reports and feature requests to \href{mailto:[email protected]}{[email protected]} with the subject `MaLTE Bugs' or `MaLTE Features', respectively. Alternatively, you may file a report at \url{https://github.com/polarise/MaLTE/issues}.
\section{Citing \textsf{MaLTE}}
\textsf{MaLTE} has been submitted for peer-review. This section will be updated shortly.
\begin{thebibliography}{50}
\bibitem{irizarry2003summaries} Irizarry, Rafael A., et al. "Summaries of Affymetrix GeneChip probe level data." Nucleic acids research 31.4 (2003): e15-e15.
\bibitem{irizarry2005multiple} Irizarry, Rafael A., et al. "Multiple-laboratory comparison of microarray platforms." Nature Methods 2.5 (2005): 345-350.
\bibitem{fu2009estimating} Fu, Xing, et al. "Estimating accuracy of RNA-Seq and microarrays with proteomics." BMC Genomics 10.1 (2009): 161.
\bibitem{irizarry2006comparison} Irizarry, Rafael A., Zhijin Wu, and Harris A. Jaffee. "Comparison of Affymetrix GeneChip expression measures." Bioinformatics 22.7 (2006): 789-794.
\bibitem{lonsdale2013genotype}
John Lonsdale, Jeffrey Thomas, Mike Salvatore, Rebecca Phillips, Edmund Lo,
Saboor Shad, Richard Hasz, Gary Walters, Fernando Garcia, Nancy Young, et~al.
\newblock {The Genotype-Tissue Expression (GTEx) project}.
\newblock {\em Nature Genetics}, 45(6):580--585, 2013.
\bibitem{malone2011microarrays} Malone, John H., and Brian Oliver. "Microarrays, deep sequencing and the true measure of the transcriptome." BMC biology 9.1 (2011): 34.
\bibitem{gibbs2003international} Gibbs, Richard A., et al. "The international HapMap project." Nature 426.6968 (2003): 789-796.
\end{thebibliography}
\pagebreak
\appendix
\section{Classes}
\label{classes}
The \textsf{R} \textsf{MaLTE} package uses three main classes.
\begin{enumerate}
\item \texttt{TT.Ready}. This class handles data \textit{ready} for training and test. It is an abstract base class from which two other classes are derived:
\begin{enumerate}
\item[(i)] \texttt{TT.Ready.Gene}. This class handles gene expression training and prediction data.
\item[(ii)] \texttt{TT.Ready.Txs}. This is for transcript isoform expression data.
\end{enumerate}
\item \texttt{TT.Seq}. This class handles the results of training and prediction. Just like \texttt{TT.Ready}, this is an abstract base class with two derived classes:
\begin{enumerate}
\item[(i)] \texttt{TT.Seq.Gene} for gene predictions. Both test and OOB predictions are of this class.
\item[(ii)] \texttt{TT.Seq.Txs} for transcript isoform predictions similar to \texttt{TT.Seq.Gene}.
\end{enumerate}
\item \texttt{TT.Params}. This class handles training and prediction parameters passed to the \texttt{train.and.predict()} (alias \texttt{run()}) methods of \texttt{TT.Ready} objects. It has the following slots: \texttt{mtry}, \texttt{ntree}, \texttt{feature.select}, \texttt{min.probes}, \texttt{cor.thresh}, \texttt{OOB}, \texttt{quantreg}, \texttt{tune}.
\end{enumerate}
More information on these classes can be found in the \textsf{MaLTE} manual that accompanies the package.
\pagebreak
\section{Using Alternative Learning Algorithms (Experimental)}
\label{alternate_learner}
\textsf{MaLTE} is designed as a framework, which allows the user to specify alternative learning algorithms. Doing so requires two steps: importing the required library and incorporating the appropriate call to the model-building function.
For example, to use multivariate adaptive regression splines (MARS) as the learner, we would do the following:
\begin{enumerate}
\item Add the name of the library into line 8 of the \texttt{DESCRIPTION} file in the package root. In this example, we import the \texttt{mda} package (assuming it is installed).
\begin{verbatim}
8: Depends: R (>= 2.14.0), methods, party, multicore, \
quantregForest, limma, mda
\end{verbatim}
\item We then modify the model-building call, supplying the appropriate signature. Changes have to be made to either line 77 or line 80 of \texttt{train\_and\_predict.R}:
\begin{verbatim}
74: # the model
75: if ( quantreg )
76: {
77: model <- quantregForest( y=D[,1], x=D[,2:ncol( D )],\
mtry=mtry, ntree=ntree )
78: } else
79: {
80:     model <- mars( x=D[,2:ncol( D )], y=D[,1] )
81: }
\end{verbatim}
\end{enumerate}
Modifying the learning algorithm may disable certain features. For example, out-of-bag (OOB) estimates are only available using forest models. Additionally, prediction of expression of transcript isoforms depends on the configuration of the learning algorithm to provide multiple response regression.
Please use the contact information provided in Section \ref{bugs} to report any undesirable behaviour.
\pagebreak
\begin{landscape}
\section{Function and Methods Table}
\label{functions}
\begin{table}[H]
\scriptsize
\centering
\begin{tabular}{|p{2.5cm}|p{6cm}|p{4cm}|p{5cm}|}
\hline
\textbf{Function/Method} & \textbf{Input} & \textbf{Output} & \textbf{Comments} \\
\hline
\texttt{prepare.data()} & \texttt{samples.fn}, \texttt{ma.fn} (OR \texttt{raw\_ma.fn} with \texttt{raw=TRUE}), \texttt{hts.fn}, \texttt{g2p.fn} & \textit{train\_data.txt.gz} and \textit{test\_data.txt.gz} (together with log files indicating which genes are missing) & This function calls the underlying \textsf{Python} script \textit{prepare\_data.py}, which can be called directly by the user. All output is written to the current directory. \\
\hline
\texttt{read.data()} & \texttt{train.fn=`train\_data.txt.gz'}, \texttt{test.fn=`test\_data.txt.gz'} & \texttt{tt.ready} & \texttt{tt.ready} is a list of objects of class \texttt{TT.Ready.Gene} that has embedded within it the training and testing data. \\
\hline
\texttt{prepare.txs.data()} & \texttt{samples.fn}, \texttt{train.fn}, \texttt{test.fn}, \texttt{hts.txs.fn}, \texttt{g2tx.fn} & \textit{train\_txs\_data.txt.gz}, \textit{test\_txs\_data.txt.gz} (together with log files indicating which genes are missing) &
This function works like \texttt{prepare.data()} by calling the underlying \textsf{Python} script \textit{prepare\_txs\_data.py}, which can be called directly. All output is written to the current directory. \\
\hline
\texttt{read.txs.data()} & \texttt{train.fn=`train\_txs\_data.txt.gz'}, \texttt{test.fn=`test\_txs\_data.txt.gz'} & \texttt{tt.ready.txs} & \texttt{tt.ready.txs} is a list of objects of class \texttt{TT.Ready.Txs} \\
\hline
\texttt{TT.Params()} & \texttt{mtry=2}, \texttt{ntree=1000}, \texttt{feature.select=TRUE}, \texttt{min.probes=15}, \texttt{cor.thresh=0}, \texttt{OOB=FALSE}, \texttt{quantreg=FALSE}, \texttt{tune=FALSE} & \texttt{tt.params} & Constructor for objects of class \texttt{TT.Params} \\
\hline
\texttt{run()}, \texttt{oob.run()} & \texttt{TT.Ready} object, \texttt{tt.params}, \texttt{OOB=FALSE} & \texttt{TT.Seq} object & Performs prediction on a single \texttt{TT.Ready.Gene} or \texttt{TT.Ready.Txs} object. \\
\hline
\texttt{array2seq()}, \texttt{array2seq.oob()} & \texttt{tt.ready}/\texttt{tt.ready.txs}, \texttt{tt.params} & \texttt{tt.seq} OR \texttt{tt.seq.txs} & \texttt{tt.seq} is a list of \texttt{TT.Seq.Gene} or \texttt{TT.Seq.Txs} objects. \\
& & & The parallelised-list version of \texttt{run()/oob.run()} \\
% & & & The tuned parameters are obtained by running 'tune'. \\
\hline
\texttt{oob.filter()} & \texttt{tt.seq/tt.seq.txs}, \texttt{tt.seq.oob/tt.seq.oob.txs} (resp.), \texttt{thresh} & \texttt{tt.filtered} & \texttt{tt.filtered} is a list of objects of class \texttt{TT.Seq.Gene} \\
& & & \texttt{tt.filtered.txs} is a list of objects of class \texttt{TT.Seq.Txs} \\
\hline
% *\texttt{tune()} & \texttt{tt.ready} & \texttt{tt.params} & Uses the training data to find the best parameters to use. Evaluation is based on OOB estimates only. \\
% & & & \texttt{tt.params} is an object of class \texttt{TT.Params} \\
% \hline
\texttt{predictions()} & \texttt{TT.Seq.Gene} OR \texttt{TT.Seq.Txs} & \texttt{tt.predicted} OR \texttt{tt.predicted.txs} & Method to extract predictions only \\
\hline
\texttt{get.predictions()} & \texttt{tt.filtered} OR \texttt{tt.filtered.txs} & \texttt{df} & \texttt{df} is a data frame of predictions \\
\hline
\texttt{cor.P()} & \texttt{tt.seq.oob} OR \texttt{tt.seq.oob.txs} & & Method to extract Pearson correlations only when HTS is available for test data \\
\hline
\texttt{cor.S()} & \texttt{tt.seq.oob} OR \texttt{tt.seq.oob.txs} & & Method to extract Spearman correlations only when HTS is available for test data \\
\hline
\texttt{get.names()}, \texttt{get.train()}, \texttt{get.test()} & \texttt{samples.fn=`samples.txt'} & \texttt{sample.names} & Returns a list of names of train/test samples \\
\hline
\end{tabular}
\end{table}
\end{landscape}
\pagebreak
\section{Use Case: MaLTE Trained with GTEx Data Applied on Exon Array}
\label{usecase}
Using \textsf{MaLTE} with the GTEx training dataset consists of three main steps illustrated in Figure \ref{fig:usecase}.
\begin{itemize}
\item Obtaining the package and training data
\item Preparing the data for training-and-testing
\item Predicting, filtering and collating expression estimates
\end{itemize}
\begin{figure}
\centering
\includegraphics[height=21cm]{schematic1.pdf}
\caption{\textbf{Schematic representing a \textsf{MaLTE} use-case.} The four text files (dark green) are the main input to \textsf{MaLTE}. These files are obtained by processing the training data using the auxiliary scripts.}
\label{fig:usecase}
\end{figure}
In the first step, the user needs to download a set of auxiliary data and scripts before installing the \textsf{MaLTE} \textsf{R} package. The training (and test) data may then be downloaded after which several preprocessing steps may be carried out depending on the type of array involved. For example, we will need to transform the exon array to a gene array. The transformed array data is then quantile normalized to reduce batch effects.
The second and third steps are relatively straightforward and employ only \textsf{MaLTE} functions as outlined in Section~\ref{gep}.
\subsection{Obtaining Auxiliary Data and Scripts}
\label{usecase:obtaining}
The use-case presented here requires several auxiliary data files and scripts to convert the exon array to a gene array. These are all contained in the \textsf{MaLTE-aux} Github repository, which may be downloaded directly from \url{https://github.com/polarise/MaLTE-aux}. Click the `Download' button to get the whole directory as a zip file.
\begin{verbatim}
unzip MaLTE-aux-master.zip
cd MaLTE-aux
\end{verbatim}
Alternatively, it is much faster to use \textsf{git} as follows:
\begin{verbatim}
git clone [email protected]:polarise/MaLTE-aux.git
cd MaLTE-aux
\end{verbatim}
This directory contains pre-compiled resource files described in Table~\ref{tab:malte_aux}. \textbf{We strongly recommend that all analysis be carried out in this directory.}
\begin{table}
\centering
\begin{tabular}{|l|p{6cm}|}
\hline
\textbf{File} & \textbf{Purpose} \\
\hline
\textit{create\_samples\_template.py} & \multirow{2}{*}{\parbox{6cm}{This script and its input data file are used to generate \textit{samples.txt}.}} \\
\textit{training\_samples.txt} & \\
\hline
\textit{transform\_microarrays.py} & \multirow{6}{*}{\parbox{6cm}{The \textsf{Python} script is used to transform the exon array to a gene array. Some probes are omitted in the process. The \texttt{.help} file gives the explicit use of the data files below.}} \\
\textit{transform\_microarrays.help} & \\
& \\
& \\
& \\
& \\
\hline
\textit{HuEx-1\_0-st-v2.r2.dt1.hg18.full.mps.gz} & \multirow{5}{*}{\parbox{6cm}{Data files required by \textit{transform\_microarrays.py}}} \\
\textit{HuEx-1\_0-st-v2.r2.pgf.txt.gz} & \\
\textit{HuExVsHuGene\_BestMatch.txt.gz} & \\
\textit{HuGene-1\_1-st-v1.r4.mps.gz} & \\
\textit{HuGene-1\_1-st-v1.r4.pgf.txt.gz} & \\
\hline
\textit{gene\_probesets\_HuGene\_Ens72.txt.gz} & \multirow{3}{*}{\parbox{6cm}{Input data files required to run the \textsf{MaLTE} function \texttt{prepare.data()}}} \\ \textit{transcript\_probesets\_HuGene\_Ens72.txt.gz} & \\
& \\
\hline
\end{tabular}
\caption{Contents of \textsf{MaLTE-aux} and the functions they provide}
\label{tab:malte_aux}
\end{table}
The script \textit{create\_samples\_template.py} compiles the resource file \textit{samples.txt} by choosing a random subset (of a size specified by the user) of the training samples listed in \textit{training\_samples.txt}. More information on how it works can be obtained by running
\begin{verbatim}
./create_samples_template.py --help
\end{verbatim}
The script \textit{transform\_microarrays.py} converts one array to another using several resource files provided in \textsf{MaLTE-aux}. A detailed specification of how to use this script is available in the file \textit{transform\_microarrays.help}.
\subsection{Working Directory Structure}
\label{usecase:directory}
As all analysis will be carried out in the \textsf{MaLTE-aux} directory, we need to create two directories for the training and test array CEL files.
\begin{verbatim}
mkdir GTEx_CEL
mkdir RP_CEL
\end{verbatim}
\subsection{Obtaining MaLTE}
\label{usecase:malte}
\textsf{MaLTE} may be downloaded as a pre-built or source package. Pre-built versions can be found at \url{https://github.com/polarise/MaLTE-packages}; download the latest version (files are not sorted by version) and ensure that all prerequisites (see Section~\ref{system}) are installed before installing \textsf{MaLTE}.
\begin{verbatim}
R CMD INSTALL MaLTE_<version>.tar.gz
\end{verbatim}
The source package requires an installation of \textsf{git} and is obtained as follows:
\begin{verbatim}
git clone [email protected]:polarise/MaLTE.git
\end{verbatim}
then built with
\begin{verbatim}
R CMD build MaLTE
\end{verbatim}
and installed using \texttt{R CMD INSTALL} as shown above.
\subsubsection{Creating the file \textit{samples.txt}}
\label{usecase:samples.txt}
The \textit{samples.txt} file is created by running
\begin{verbatim}
./create_samples_template.py
\end{verbatim}
The names of the test samples should be appended at the bottom of this file using a text editor. For example, to add a test sample called \textit{Test01.CEL} add the following line to \textit{samples.txt}:
\begin{verbatim}
*NA<tab>Test01.CEL
\end{verbatim}
The asterisk (\texttt{`*'}) marks this sample as a test sample and \texttt{`NA'} indicates that RNA-Seq data is not available. Repeat this for as many samples as are present in the test data.
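Appending many such lines by hand is tedious; a short shell loop can do it instead. This is only a sketch: it assumes the test CEL files sit in a directory such as \texttt{RP\_CEL} and that each CEL file name doubles as the sample name.

```shell
# Demo setup so the sketch is self-contained; with real data the test
# CEL files are already in place.
mkdir -p RP_CEL
touch RP_CEL/Test01.CEL RP_CEL/Test02.CEL
# Append one '*NA<tab><CEL name>' line per test sample to samples.txt.
for cel in RP_CEL/*.CEL; do
  printf '*NA\t%s\n' "$(basename "$cel")" >> samples.txt
done
```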
\subsubsection{Obtaining the training data}
\label{usecase:training}
\textbf{\textit{GTEx RNA-Seq data.}} Gene expression levels estimated from RNA-Seq data may be downloaded from \url{http://www.broadinstitute.org/gtex}. The downloaded data needs to be processed into a form suitable for training by excluding extraneous information, converting the gene annotation from GENCODE to Ensembl, and then quantile-normalizing only those genes with probe sets. Pre-processed data may be downloaded from \url{https://drive.google.com/file/d/0BxLgaMV5aZahVDNjM29WeGh2NHM/edit?usp=sharing}.
\noindent\\
\textbf{\textit{GTEx gene array data.}} CEL files for the GTEx microarray data are available as a single zipped archive, which may be downloaded from GEO archive GSE45878 into the directory \texttt{GTEx\_CEL}. Background-corrected probe intensities should then be extracted as follows using the Affymetrix Power Tools function \texttt{apt-cel-extract}, which depends on the Affymetrix GeneChip$^{\text{\textregistered}}$ Human Gene 1.1 ST Array library files available at \url{http://www.affymetrix.com/estore/catalog/prod350003/AFFY/Human-Gene-ST-Array-Strips#1_3}. Select the `Technical Documentation' tab to get a link to the library files.
\begin{verbatim}
apt-cel-extract -o GTEx_probe_intensities.txt \
-c /path/to/HuGene-1_1-st-v1.r4.clf \
-p /path/to/HuGene-1_1-st-v1.r4.pgf \
-b /path/to/HuGene-1_1-st-v1.r4.bgp \
-a pm-gcbg GTEx_CEL/*.CEL
\end{verbatim}
It may be necessary to first generate the background probes (\texttt{.bgp}) file using \texttt{apt-dump-pgf}.
\begin{verbatim}
apt-dump-pgf \
-p /path/to/HuGene-1_1-st-v1.r4.pgf \
-c /path/to/HuGene-1_1-st-v1.r4.clf \
--probeset-type antigenomic \
-o HuGene-1_1-st-v1.r4.bgp
\end{verbatim}
\noindent\\
\textbf{\textit{Exon array data.}} We outline this procedure using a previously uploaded dataset, available under GEO accession GSE43134, which should be downloaded into the directory \texttt{RP\_CEL}. The Affymetrix GeneChip$^{\text{\textregistered}}$ Human Exon 1.0 ST Array library files are available from \url{http://www.affymetrix.com/estore/catalog/131452/AFFY/Human-Exon-ST-Array}. Similarly, background-corrected probe fluorescence intensities should be extracted as follows:
\begin{verbatim}
apt-cel-extract \
-o RP_probe_intensities.txt \
-c /path/to/HuEx-1_0-st-v2.r2.clf \
-p /path/to/HuEx-1_0-st-v2.r2.pgf \
-b /path/to/HuEx-1_0-st-v2.r2.antigenomic.bgp \
-a pm-gcbg RP_CEL/*.CEL
\end{verbatim}
\subsection{Transforming exon array probes to gene array probes}
\label{usecase:transforming}
We now run \textit{transform\_microarrays.py} using the following template:
\begin{verbatim}
./transform_microarrays.py \
-e RP_probe_intensities.txt \
-g GTEx_probe_intensities.txt \
-c HuExVsHuGene_BestMatch.txt.gz \
-i HuGene-1_1-st-v1.r4.pgf.txt.gz \
-f HuEx-1_0-st-v2.r2.pgf.txt.gz \
-p HuEx-1_0-st-v2.r2.dt1.hg18.full.mps.gz \
-q HuGene-1_1-st-v1.r4.mps.gz \
--huex-out BestMatch_HuEx_probe_intensities.txt.gz \
--huge-out BestMatch_HuGe_probe_intensities.txt.gz
\end{verbatim}
Once the modified probe intensity files have been created, they need to be quantile-normalized together with the training data as follows:
\begin{verbatim}
> library( limma )
> # BESTMATCH DATA
> huex.best <- read.table( "BestMatch_HuEx_probe_intensities.txt.gz",
header=TRUE, stringsAsFactors=FALSE, check.names=FALSE)
> huge.best <- read.table( "BestMatch_HuGe_probe_intensities.txt.gz",
header=TRUE, stringsAsFactors=FALSE, check.names=FALSE)
> # all raw data
> data <- cbind( huge.best[,8:ncol( huge.best )],
huex.best[,8:ncol( huex.best )] )
> # quantile normalisation only
> qnorm.data <- normalizeQuantiles( data, ties=FALSE )
> # save the quantile-normalised data for later
> # (extracting principal components)
> save( qnorm.data, file="qnorm.data.Rdata" )
> # re-insert probe metadata
> hugeex.qnorm.data <- cbind( huge.best[,1:7], qnorm.data )
> # save to file
> write.table( hugeex.qnorm.data,
file="BestMatch_GTEx_RP_probe_intensities_QN.txt", col.names=TRUE,
row.names=FALSE, quote=FALSE, sep="\t" )
\end{verbatim}
All the training data needed to run \textsf{MaLTE} is now available.
\subsection{Preparing the data for training-and-testing}
\label{usecase:preparing}
We can now use \textsf{MaLTE} to prepare the data for training and testing. First, we load the package after launching \textsf{R}.
\begin{verbatim}
> library( MaLTE )
\end{verbatim}
We use the \texttt{prepare.data()} function, which takes as arguments the files \textit{samples.txt}, \textit{GTEx\_subset\_gene\_rpkm\_QN.txt.gz}, \textit{BestMatch\_GTEx\_RP\_probe\_intensities\_QN.txt}, and \textit{gene\_probesets\_HuGene\_Ens72.txt.gz} (available in the \textsf{MaLTE-aux} directory).
\begin{verbatim}
> prepare.data( samples.fn=`samples.txt',
      hts.fn=`GTEx_subset_gene_rpkm_QN.txt.gz',
      ma.fn=`BestMatch_GTEx_RP_probe_intensities_QN.txt',
      g2p.fn=`gene_probesets_HuGene_Ens72.txt.gz', raw=TRUE )
\end{verbatim}
We then read in the training and test data.
\begin{verbatim}
> tt.ready = read.data( train.fn=`train_data.txt.gz',
      test.fn=`test_data.txt.gz' )
\end{verbatim}
\subsection{Prediction}
\label{usecase:prediction}
Prediction can be performed either on the test data or on the training data (out-of-bag, OOB). We outline OOB shortly. First, we need to define training-and-test parameters (additional parameters are specified in Section~\ref{params}).
\begin{verbatim}
> tt.params = TT.Params() # default parameters
\end{verbatim}
This sets the following values:
\begin{verbatim}
> tt.params
mtry = 2
ntree = 1000
feature.select = TRUE
min.probes = 15
cor.thresh = 0
OOB = FALSE
QuantReg = FALSE
Tune (OOB cor.P) = NA
\end{verbatim}
Predictions are performed using the \texttt{array2seq()} function.
\begin{verbatim}
> tt.seq = array2seq( tt.ready, tt.params )
\end{verbatim}
OOB predictions are performed using the \texttt{array2seq.oob()} function.
\begin{verbatim}
> tt.seq.oob = array2seq.oob( tt.ready, tt.params )
\end{verbatim}
\subsection{Filtering by OOB}
\label{usecase:filtering}
OOB filtering is performed using the \texttt{oob.filter()} function, which takes a \texttt{tt.seq} pair (\texttt{tt.seq} and \texttt{tt.seq.oob}), a filtering threshold and the correlation to use for filtering (Pearson, the default, or Spearman):
\begin{verbatim}
> tt.filtered = oob.filter( tt.seq, tt.seq.oob, thresh=0.5, method=`spearman' )
\end{verbatim}
\subsection{Collating predicted gene expression values}
\label{usecase:collating}
Predicted values should be collated as a data frame using the \texttt{get.predictions()} function. First, we get the names of test samples using the \texttt{get.names()} or \texttt{get.test()} functions.
\begin{verbatim}
> test.names = get.names( samples.fn=`samples.txt', test=TRUE )
> test.names = get.test( samples.fn=`samples.txt' )