Merge in short motif code (#161)
* Merged in PositionalMotifFrequencies report

* added todos for PositionalMotifFrequencies report

* small (formatting) corrections. updated todos

* added precision/recall to feature annotations

* Added SignificantMotifPrecisionTP report

* - rename SignificantMotifEncoder to MotifEncoder
- added MotifGeneralizationAnalysis
- do not store learned motifs as parameter of MotifEncoder, instead read from file
- added random seed option to get_train_val_indices
- MotifPrecisionTP may soon be deprecated or must be refactored to share code with MotifGeneralizationAnalysis

* attempt at making MotifEncoder faster by initializing (long) growing lists with None values

* allow label to be str or dict

* parallelisation of MotifEncoder encoded data matrix construction for speed

* more parallelisation in MotifEncoder

* add weight_thresholds, split_classes via YAML

* minor updates

* add WeightsDistribution report

* add weight_thresholds, split_classes via YAML

* add docs, add unit test file (not completed)

* minor updates

* added todos for Eric in WeightsDistribution report

* fixed todos

* minor correction

* bugfix: DataWeighter should return a clone of the dataset instead of modifying the dataset

* test print statements

* debugging print statements

* attempted bugfix

* debugging prints

* debugging

* debugging

* bugfix

* removed debugging prints

* Bugfixes in MotifGeneralizationAnalysis:
- do not ignore sequences with TP=0 in test set
- include FP scores when computing combined precision

* bugfixes & added smoothing option

* Bugfix: remove sorting from ElementDataset & add assert statement in ElementGenerator for when files are not ordered correctly

* extending importance weighting to restrict mutagenesis to only one class

* finished implementation of class-specific ImportanceWeighting

* - Updated line smoothing code for MotifGeneralizationAnalysis
- Updated _safe_plot in Report: can now specify the name of the callable (default="_plot"). This means _safe_plot can be used multiple times in one report, ensuring that when one of the plots fails, the other plots are still generated.

* added more todos for WeightsDistribution report

* updated MotifGeneralizationAnalysis:
- plotting style
- cleaned up code, moved computation out of plotting functions, export all plotting-relevant details prior to plotting

* - Added AminoAcidFrequencyDistribution report: plots a barplot of each amino acid in each position of all sequences in a dataset (any dataset type)
- small fix in SequenceLengthDistribution (set output_written=False to ensure correct error message, since this report writes no data file)

* Updated color palette

* updated AminoAcidFrequencyDistribution to include splitting by label values

* Updated docs

* update style

* sorted categories AminoAcidFrequencyDistribution

* made range of figures up to 1.01 to not cut off points

* temporarily add sequence hover data to WeightsDistribution report

* added option to predefine training set for MotifGeneralizationAnalysis by file

* automatically determine the optimal TP/recall cutoff and show in plot

* moved get_numpy_sequence_representation to PositionalMotifHelper

* update: write training set ids to files instead of printing in log (too long)

* updated MotifGeneralizationAnalysis: choose last point of exceeding precision threshold for TP cutoff

* plot highlighted motifs on top

* minor refactoring

* allow generalization plot for multiple motif sizes

* bugfix: dynamically change min_total_points_in_window

* Bugfix

* bugfix

* bugfix

* plot fix

* separate recall cutoff for different motif sizes

* updated the way the recall threshold is determined

* export confusion matrix

* theme white

* minor fix

* added keep_all param to MotifClassifier

* improved error message for Metric

* bugfix

* bugfix matches report: get subject ids

* bugfix: class mapping

* added selected features as export value

* move selected feature writing to fit

* bugfix

* updated the way tp thresholds are determined

* added MotifTestSetPerformance report, refactored to share code with MotifGeneralizationAnalysis with helper class MotifPerformancePlotHelper

* New report: NonMotifSimilarity
minor changes made to other reports

* rename report

* removed deprecated report, added requirements specific for tensorflow

* updated format of example id files for compatibility

* bugfix manual splitter: it didn't work for non-string classes, now everything is cast to string

* bugfix

* shorten log text - becomes extremely long and unreadable

* bugfix identifiers

* refactored out col_names stuff for simplicity

* refactoring, more shared code, splitting per motif size of motiftestsetperformance

* Add MotifOverlapReport

* prettier plots

* all tp cutoffs in one file

* started implementation, abandoned idea for now

* export simple stats from MotifEncoder

* updated plot

* Initial version

* backup, installing new OS

* small edits

* comment out some experimental code

* bugfix test

* added SimilarToPositiveSequenceEncoder: a full sequence hamming dist-based encoder
renamed MotifClassifier to BinaryFeatureClassifier as it's generally applicable

* add facet

* different sizes

* clean up

* slight speed improvement: allow lower size limit on motifs and don't check the motifs that are too small

* more helpful error message

* minor updates to plot styling

* minor updates

* minor bugfix

* change dataframe structure

* all in one plot, change table

* add help method

* update test bench

* add duplicate max values

* added option for negative amino acids to Motif encoder

* added option for negative amino acids to Motif encoder

* add top/bottom n and filtering to FeatureValueBarplot

* added option for negative amino acids to Motif encoder

* Add max_gap_size_only functionality

* Label:
- ensure label classes always follow the same order: positive class first
- ensure a default label positive class is always set (in LabelConfig), utility function for retrieving negative class in binary case
- enforce label classes to always be set
predictions proba:
- when predicting probabilities: explicitly keep track of the label classes (dict instead of multi-dimensional np array) -> previously the ordering of labels (positive class last) was not consistently enforced across MLMethods. These bugs are difficult to catch as the 'predictions' were correct, but the 'predictions_proba' were not. I believe the current solution is less error-prone in the future when other developers may work on immuneML

These updates resolve previously observed bugs that: predictions & predictions_proba did not match (resulting in inverse ROC curves), and that the wrong positive class was sometimes assumed for asymmetric performance metrics (e.g., precision, recall)

* fixes after new update

* cleaner way of getting label desc for storing ML models

* improved tests

* minor fix

* little refactoring, cleaned up some shared code between GroundTruthMotifOverlap & PositionalMotifHelper
minor changes in GroundTruthMotifOverlap

* made gap plot a lineplot

* default param

* check params

* added BinaryFeaturePrecisionRecall: a precision-recall plot for BinaryFeatureClassifier, plus the option to force learn all motifs

* added precision-recall plot for BinaryFeatureClassifier, plus the option to force learn all motifs

* added precision-recall plot for BinaryFeatureClassifier, plus the option to force learn all motifs

* minor update error message

* bugfix

* bugfix

* improved test

* bugfix, got stuck in an infinite loop

* bugfix

* bugfix

* bugfixes

* temporarily set higher recursion depth to prevent crashing

* update report to show training-validation-test set performance independently

* Made CompAIRR-powered version of SimilarToPositiveSequenceEncoder

* minor fixes GroundTruthMotifOverlap plot & make it possible for BinaryFeatureClassifier to set max motifs to all motifs on the fly

* remove print statement

* minor update

* rename highlight_motifs_path to groundtruth_motifs_path

* bugfixes compairr-version of SimilarToPositiveSequenceEncoder

* bugfixes compairr-version of SimilarToPositiveSequenceEncoder

* separate output folder for learning model

* added option to automatically remove test dataset (can be large)

* Update AminoAcidFrequencyDistribution report to show log-fold change

* implemented get_attribute for Receptor. All receptors have identifier and metadata dict.

* bugfix

* switch from logfold change to difference in relative frequency

* .

* 1-based counting of positions

* functionality to export non-optimal ML models in addition to the optimal ones

* undo partial commit

* improved efficiency of BinaryFeatureClassifier

* added lots of log statements to find out where the running time bottleneck is

* keep track of val predictions instead of recomputing them every time

* added multiprocessing option for BinaryFeatureClassifier

* remove default cores for training to test

* bugfix: pass cores_for_training in recursive function

* possible speed improvement: dont recompute scoring fn when array is equal to previously tested

* remove log statement

* - in BinaryFeatureClassifier, keep track of indices that show improvement to reduce the total number of comparisons made during training
- remove learn_all option from BinaryFeatureClassifier
- updated BinaryFeaturePrecisionRecall to display only 1 data point if keep_all = True

* updated log statement

* remove log statements

* minor fix docs

* fixes for Label in MLApplication instruction: explicitly pass on the positive class to make sure the same positive class is applied during MLApplication

* Allow metrics to be computed during MLApplication if the same label is provided

* small fix to make tests pass

* fix: html was overwritten

* bugfixes

* restored example weights

* bugfix: test if proba available

* bugfix: dont access _proba columns when not defined

* bugfix: convert everything to string

* small fixes

* bugfixes

* fix bug

* big bug fix

* added test for GroundTruthMotifOverlap + small fixes
highlight motifs: highlight sub motifs also

* small fix for faster test

* small updates to motif reports

* axis title updates

* minor aesthetic update

* undo change in test

* visual updates to plots

* bugfix to gaps report

* fixed warning

* minor fix gaps figure

* minor fix gaps figure

* minor fix gaps figure

* Add new _get_max_overlap

* remove obsolete title

* minor updates

* plot update: show line on left side of test plots for motif generalization

* remove obsolete report

* remove internal cv in outer assessment loop for sklearn

* merge in sklearn cv bugfix

* final bugfixes merging in master

* added parameter checking when using manual splittype

* Keras sequence CNN documentation updates + minor fixes

* updated installation docs

* Updated SimilarToPositiveSequenceEncoder, MotifEncoder and BinaryFeatureClassifier docs,
removed generalize_motifs option as it is currently not used in practice, and disabled allow_negative_aas option as it requires a few more fixes.

* fixes regarding disabling allow_negative_aas option

* updated MotifGeneralizationAnalysis docs

* added motif recovery tutorial to documentation

* updated docs

* updated docs

* remove deprecated pseudocount parameter

* removed importanceweighting strategy and updated docs for predefinedweighting

* removed importanceweighting tests

* removed importanceweighting tests

* fixing tests

* corrected docs (and variable names): percentage-wise frequency change is plotted, not logfold

* Merge latest master into short motif, resolve merge conflicts.
- Bugfix in AIRRExporter which read sequences as 'productive'=False when 'productive' was missing
- some updated variable names
- updated docs

* Bugfixes related to sequence frame type and 'productive' status for file import. Add explicit option to import sequences with unknown productivity where relevant (true by default, option not made available for immunoseq import types as their documentation reveals that productivity type for those file formats is never 'unknown')

* workaround bionumpy+pickle error: not using pool but for loop

* Update setup.py

* Update Constants.py

---------

Co-authored-by: Eric Reber <[email protected]>
Co-authored-by: pavlovicmilena <[email protected]>
3 people authored Dec 1, 2023
1 parent d218ad9 commit 751ae6b
Showing 146 changed files with 6,024 additions and 241 deletions.
4 changes: 2 additions & 2 deletions docs/source/developer_docs/how_to_add_new_encoding.rst
@@ -50,7 +50,7 @@ An example of the implementation of :code:`NewKmerFrequencyEncoder` for the :py:
"""
Encodes the repertoires of the dataset by k-mer frequencies and normalizes the frequencies to zero mean and unit variance.
Arguments:
Specification arguments:
k (int): k-mer length
@@ -324,7 +324,7 @@ This is the example of documentation for :py:obj:`~immuneML.encodings.filtered_s
Nature Genetics 49, no. 5 (May 2017): 659–65. `doi.org/10.1038/ng.3822 <https://doi.org/10.1038/ng.3822>`_.
Arguments:
Specification arguments:
comparison_attributes (list): The attributes to be considered to group receptors into clonotypes. Only the fields specified in
comparison_attributes will be considered, all other fields are ignored. Valid comparison value can be any repertoire field name.
4 changes: 2 additions & 2 deletions docs/source/developer_docs/how_to_add_new_preprocessing.rst
@@ -35,7 +35,7 @@ It includes implementations of the abstract methods and class documentation at t
lower_limit, or more clonotypes than specified by the upper_limit.
Note that this filter filters out repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets.
Arguments:
Specification arguments:
lower_limit (int): The minimal inclusive lower limit for the number of clonotypes allowed in a repertoire.
@@ -260,7 +260,7 @@ This is the example of documentation for :py:obj:`~immuneML.preprocessing.filter
lower_limit, or more clonotypes than specified by the upper_limit.
Note that this filter filters out repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets.
Arguments:
Specification arguments:
lower_limit (int): The minimal inclusive lower limit for the number of clonotypes allowed in a repertoire.
59 changes: 50 additions & 9 deletions docs/source/installation/install_with_package_manager.rst
@@ -50,14 +50,6 @@ Note: when creating a python virtual environment, it will automatically use the
pip install immuneML
Alternatively, if you want to use the :ref:`TCRdistClassifier` ML method and corresponding :ref:`TCRdistMotifDiscovery` report, include the optional extra :code:`TCRdist`:

.. code-block:: console
pip install immuneML[TCRdist]
See also this question under 'Troubleshooting': :ref:`I get an error when installing PyTorch (could not find a version that satisfies the requirement torch)`

Install immuneML with conda
@@ -95,6 +87,25 @@ Install immuneML with conda
Installing optional dependencies
----------------------------------

TCRdist
*******

If you want to use the :ref:`TCRdistClassifier` ML method and corresponding :ref:`TCRdistMotifDiscovery` report, you can include the optional extra :code:`TCRdist`:

.. code-block:: console
pip install immuneML[TCRdist]
The TCRdist dependencies can also be installed manually using the :download:`requirements_TCRdist.txt <https://raw.githubusercontent.com/uio-bmi/immuneML/master/requirements_TCRdist.txt>` file:

.. code-block:: console
pip install -r requirements_TCRdist.txt
DeepRC
******

Optionally, if you want to use the :ref:`DeepRC` ML method and corresponding :ref:`DeepRCMotifDiscovery` report, you also
have to install DeepRC dependencies using the :download:`requirements_DeepRC.txt <https://raw.githubusercontent.com/uio-bmi/immuneML/master/requirements_DeepRC.txt>` file.
Important note: DeepRC uses PyTorch functionalities that depend on GPU. Therefore, DeepRC does not work on a CPU.
@@ -104,8 +115,38 @@ To install the DeepRC dependencies, run:
pip install -r requirements_DeepRC.txt --no-dependencies
See also this question under 'Troubleshooting': :ref:`I get an error when installing PyTorch (could not find a version that satisfies the requirement torch)`


Keras-based sequence CNN
************************

In order to use the :ref:`KerasSequenceCNN`, optional dependencies :code:`keras` and :code:`tensorflow` need to be installed.
By default, version 2.11.0 of both dependencies is used.
Other versions may work as well, as long as the used versions of :code:`keras` and :code:`tensorflow` are compatible with each other.

To install the default versions of these packages, you can include the optional extra :code:`KerasSequenceCNN`:

.. code-block:: console
pip install immuneML[KerasSequenceCNN]
Or install the dependencies manually using the :download:`requirements_KerasSequenceCNN.txt <https://raw.githubusercontent.com/uio-bmi/immuneML/master/requirements_KerasSequenceCNN.txt>` file:

.. code-block:: console
pip install -r requirements_KerasSequenceCNN.txt
The :ref:`KerasSequenceCNN` runs on CPU; it does *not* rely on GPU.

CompAIRR
********

If you want to use the :ref:`CompAIRRDistance` or :ref:`CompAIRRSequenceAbundance` encoder, you have to install the C++ tool `CompAIRR <https://github.com/uio-bmi/compairr>`_.
The easiest way to do this is by cloning CompAIRR from GitHub and installing it using :code:`make` in the main folder:
Furthermore, the :ref:`SimilarToPositiveSequence` encoder can be run both with and without CompAIRR, but the CompAIRR-based version is faster.

The easiest way to install CompAIRR is by cloning CompAIRR from GitHub and installing it using :code:`make` in the main folder:

.. code-block:: console
7 changes: 5 additions & 2 deletions docs/source/tutorials/how_to_apply_to_new_data.rst
@@ -33,8 +33,11 @@ For a tutorial on importing datasets to immuneML (for training or applying an ML
YAML specification example using the MLApplication instruction
------------------------------------------------------------------
The :ref:`MLApplication` instruction takes in a :code:`dataset` and a :code:`config_path`. The :code:`config_path` should
point at one of the .zip files exported by the previously run :ref:`TrainMLModel` instruction. They can be found in the sub-folder
:code:`instruction_name/optimal_label_name` in the results folder.
point at one of the .zip files exported by the previously run :ref:`TrainMLModel` instruction.
The configuration of the optimal ML setting can always be found in the sub-folder :code:`<instruction_name>/optimal_<label_name>/zip` in the results folder.
Alternatively, when running the :ref:`TrainMLModel` instruction with the parameter :code:`export_all_ml_settings` set to :code:`True`,
the config file for each of the ML settings can be found inside :code:`<instruction_name>/split_<number>/<ml_setting_name>/ml_settings_config/zip`
for each ML setting in each assessment split.
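
The folder layout described above can be sketched as a small lookup helper. This is only an illustration: the function names are hypothetical and not part of immuneML, and it assumes the exported configurations are files ending in ``.zip`` that sit directly inside the ``zip`` folders named in the text.

```python
from pathlib import Path
from typing import List


def optimal_config_zips(result_path: Path, instruction_name: str, label_name: str) -> List[Path]:
    """Zip files under <instruction_name>/optimal_<label_name>/zip (hypothetical helper)."""
    zip_dir = result_path / instruction_name / f"optimal_{label_name}" / "zip"
    return sorted(zip_dir.glob("*.zip"))


def all_setting_config_zips(result_path: Path, instruction_name: str) -> List[Path]:
    """Zip files under <instruction_name>/split_<number>/<ml_setting_name>/ml_settings_config/zip.

    These folders are only expected when export_all_ml_settings was set to True.
    """
    pattern = "split_*/*/ml_settings_config/zip/*.zip"
    return sorted((result_path / instruction_name).glob(pattern))
```

Any path returned by either function would be a candidate value for :code:`config_path` in the :ref:`MLApplication` instruction.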


.. highlight:: yaml
45 changes: 45 additions & 0 deletions docs/source/tutorials/motif_recovery.rst
@@ -5,6 +5,51 @@ immuneML provides several different options for recovering motifs associated wit
Depending on the context, immuneML provides several different reports which can be used for this purpose.


Discovering positional motifs using precision and recall thresholds
----------------------------------------------------------------------

It is often assumed that the antigen binding status of an immune receptor (antibody/TCR) may be determined by the *presence*
of a short motif in the CDR3.
We developed a method (manuscript in preparation) for the discovery of antigen binding associated motifs with the following properties:

- Short position-specific motifs with possible gaps
- High precision for predicting antigen binding
- High generalisability to unseen data, i.e., retaining a relatively high precision on test data


Method description
^^^^^^^^^^^^^^^^^^^^^^^^^^^

A motif with a high precision for predicting antigen binding implies that when the motif is present,
the probability that the sequence is a binder is high. One can thus iterate through every possible motif and filter
them by applying a precision threshold. However, the more 'rare' a motif is, the more likely that the motif just had
a high precision by chance (for example: a motif that occurs in only 1 binder and 0 non-binders has a perfect precision,
but may not retain high precision on unseen data). Thus, an additional recall threshold is applied to remove
rare motifs.
Our method allows the user to define a precision threshold and learn the optimal recall threshold using a training + validation set.

The method consists of the following steps:

1. Splitting the data into training, validation and test sets.

2. Using the training set, find all motifs with a high training-precision.

3. Using the validation set, determine the recall threshold for which the validation-precision is still high (separate recall thresholds may be learned for motifs with different sizes).

4. Using the combined training + validation set, find all motifs exceeding the user-defined precision threshold and learned recall threshold(s).

5. Using the test set, report the precision and recall of these learned motifs.

6. Optional: use the set of learned motifs as input features for ML classifiers (e.g., :ref:`BinaryFeatureClassifier` or :ref:`LogisticRegression`) for antigen binding prediction.

Steps 2+3 are done by the report :ref:`MotifGeneralizationAnalysis`. This report exports the learned recall cutoff(s).
It is recommended to run this report using the :ref:`ExploratoryAnalysis` instruction.
Steps 4+5 are done by the :ref:`Motif` encoder. The learned recall cutoff(s) are used as input parameters. This encoder
can be used either in :ref:`ExploratoryAnalysis` or :ref:`TrainMLModel` instructions.
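
The precision and recall filtering described in the method can be illustrated with a minimal sketch. It is not the immuneML implementation: the motif representation (a dict from 0-based position to amino acid) and all function names are assumptions made for this example.

```python
from typing import Dict, List, Sequence, Tuple

# Assumption for this sketch: a positional motif maps 0-based positions to amino acids,
# e.g. {1: "A", 2: "S"}; positions not listed act as gaps.
Motif = Dict[int, str]


def motif_present(motif: Motif, sequence: str) -> bool:
    """True if every motif position exists in the sequence and carries the required amino acid."""
    return all(pos < len(sequence) and sequence[pos] == aa for pos, aa in motif.items())


def precision_recall(motif: Motif, sequences: Sequence[str], is_binder: Sequence[bool]) -> Tuple[float, float]:
    """Precision and recall of 'motif present' as a predictor of 'sequence is a binder'."""
    hits = [motif_present(motif, seq) for seq in sequences]
    tp = sum(1 for hit, binder in zip(hits, is_binder) if hit and binder)
    fp = sum(1 for hit, binder in zip(hits, is_binder) if hit and not binder)
    n_binders = sum(1 for binder in is_binder if binder)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / n_binders if n_binders > 0 else 0.0
    return precision, recall


def filter_motifs(motifs: List[Motif], sequences: Sequence[str], is_binder: Sequence[bool],
                  min_precision: float, min_recall: float) -> List[Motif]:
    """Keep only the motifs that reach both thresholds."""
    kept = []
    for motif in motifs:
        precision, recall = precision_recall(motif, sequences, is_binder)
        if precision >= min_precision and recall >= min_recall:
            kept.append(motif)
    return kept
```

On a toy dataset, a motif found in a single binder and no non-binders reaches a precision of 1.0 but a low recall, so a recall threshold removes it while a more frequent high-precision motif is kept. This is the "rare motif" case that the recall threshold is meant to handle.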




Discovering motifs learned by classifiers
-----------------------------------------

8 changes: 5 additions & 3 deletions immuneML/IO/dataset_export/AIRRExporter.py
@@ -207,12 +207,14 @@ def _postprocess_dataframe(df, dataset_labels: dict, omit_columns: list = None):
if "frame_type" in df.columns:
AIRRExporter._enums_to_strings(df, "frame_type")

df["productive"] = df["frame_type"] == SequenceFrameType.IN.name
df.loc[df["frame_type"].isnull(), "productive"] = ''
df["productive"] = df["frame_type"] == SequenceFrameType.IN.value
df.loc[df["frame_type"].isnull(), "productive"] = ""
df.loc[df["frame_type"] == "", "productive"] = ""
df.loc[df["frame_type"] == SequenceFrameType.UNDEFINED.value, "productive"] = ""

df["vj_in_frame"] = df["productive"]

df["stop_codon"] = df["frame_type"] == SequenceFrameType.STOP.name
df["stop_codon"] = df["frame_type"] == SequenceFrameType.STOP.value
df.loc[df["frame_type"].isnull(), "stop_codon"] = ''

df.drop(columns=["frame_type"], inplace=True)
11 changes: 7 additions & 4 deletions immuneML/IO/dataset_import/AIRRImport.py
@@ -38,6 +38,8 @@ class AIRRImport(DataImport):
- import_productive (bool): Whether productive sequences (with value 'T' in column productive) should be included in the imported sequences. By default, import_productive is True.
- import_unknown_productivity (bool): Whether sequences with unknown productivity (missing value in column productive) should be included in the imported sequences. By default, import_unknown_productivity is True.
- import_with_stop_codon (bool): Whether sequences with stop codons (with value 'T' in column stop_codon) should be included in the imported sequences. This only applies if column stop_codon is present. By default, import_with_stop_codon is False.
- import_out_of_frame (bool): Whether out of frame sequences (with value 'F' in column vj_in_frame) should be included in the imported sequences. This only applies if column vj_in_frame is present. By default, import_out_of_frame is False.
@@ -110,15 +112,16 @@ def preprocess_dataframe(df: pd.DataFrame, params: DatasetImportParams):
- the allele information is removed from the V and J genes
"""
if "productive" in df.columns:
df["frame_type"] = SequenceFrameType.OUT.name
df.loc[df["productive"], "frame_type"] = SequenceFrameType.IN.name
df["frame_type"] = SequenceFrameType.UNDEFINED.value
df.loc[df["productive"]==True, "frame_type"] = SequenceFrameType.IN.value
df.loc[df["productive"]==False, "frame_type"] = SequenceFrameType.OUT.value
else:
df["frame_type"] = None

if "vj_in_frame" in df.columns:
df.loc[df["vj_in_frame"], "frame_type"] = SequenceFrameType.IN.name
df.loc[df["vj_in_frame"]==True, "frame_type"] = SequenceFrameType.IN.value
if "stop_codon" in df.columns:
df.loc[df["stop_codon"], "frame_type"] = SequenceFrameType.STOP.name
df.loc[df["stop_codon"]==True, "frame_type"] = SequenceFrameType.STOP.value

if "productive" in df.columns:
frame_type_list = ImportHelper.prepare_frame_type_list(params)
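
The intent of the new lines in this hunk, treating a missing 'productive' value as unknown rather than as unproductive, can be sketched per row in plain Python. This is a hedged illustration only: ``FrameType`` is a hypothetical stand-in for ``SequenceFrameType`` with placeholder values, and unlike the diff it does not distinguish a missing column from a missing value.

```python
from enum import Enum
from typing import Optional


class FrameType(Enum):
    """Hypothetical stand-in for immuneML's SequenceFrameType; the values are placeholders."""
    IN = "IN"
    OUT = "OUT"
    STOP = "STOP"
    UNDEFINED = "UNDEFINED"


def frame_type_from_airr(productive: Optional[bool],
                         vj_in_frame: Optional[bool] = None,
                         stop_codon: Optional[bool] = None) -> FrameType:
    """Three-state mapping: only explicit True/False values change the frame type."""
    frame_type = FrameType.UNDEFINED      # missing productivity stays unknown
    if productive is True:
        frame_type = FrameType.IN
    elif productive is False:
        frame_type = FrameType.OUT
    if vj_in_frame is True:               # applied after 'productive', as in the diff
        frame_type = FrameType.IN
    if stop_codon is True:                # a stop codon overrides the earlier assignments
        frame_type = FrameType.STOP
    return frame_type
```

Comparing against ``True`` and ``False`` explicitly is what prevents a missing value from being silently treated as out of frame, which is the behaviour the old two-state assignment produced.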
1 change: 1 addition & 0 deletions immuneML/IO/dataset_import/DatasetImportParams.py
@@ -19,6 +19,7 @@ class DatasetImportParams:
column_mapping_synonyms: dict = None
region_type: RegionType = None
import_productive: bool = None
import_unknown_productivity: bool = None
import_unproductive: bool = None
import_with_stop_codon: bool = None
import_out_of_frame: bool = None
20 changes: 15 additions & 5 deletions immuneML/IO/dataset_import/TenxGenomicsImport.py
@@ -38,6 +38,12 @@ class TenxGenomicsImport(DataImport):
- receptor_chains (str): Required for ReceptorDatasets. Determines which pair of chains to import for each Receptor. Valid values for receptor_chains are the names of the :py:obj:`~immuneML.data_model.receptor.ChainPair.ChainPair` enum. If receptor_chains is not provided, the chain pair is automatically detected (only one chain pair type allowed per repertoire).
- import_productive (bool): Whether productive sequences (with value 'True' in column productive) should be included in the imported sequences. By default, import_productive is True.
- import_unproductive (bool): Whether unproductive sequences (with value 'False' in column productive) should be included in the imported sequences. By default, import_unproductive is False.
- import_unknown_productivity (bool): Whether sequences with unknown productivity (missing or 'NA' value in column productive) should be included in the imported sequences. By default, import_unknown_productivity is True.
- import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon '*', or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.
- import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.
@@ -105,17 +111,21 @@ def import_dataset(params: dict, dataset_name: str) -> Dataset:

@staticmethod
def preprocess_dataframe(df: pd.DataFrame, params: DatasetImportParams):
df["frame_type"] = None
df['productive'] = df['productive'] == 'True'
df.loc[df['productive'], "frame_type"] = SequenceFrameType.IN.name
df["frame_type"] = SequenceFrameType.UNDEFINED.value
df.loc[df['productive']=="True", "frame_type"] = SequenceFrameType.IN.value
df.loc[df['productive']=="False", "frame_type"] = SequenceFrameType.OUT.value

allowed_productive_values = []
if params.import_productive:
allowed_productive_values.append(True)
allowed_productive_values.append('True')
if params.import_unproductive:
allowed_productive_values.append(False)
allowed_productive_values.append('False')
if params.import_unknown_productivity:
allowed_productive_values.append('')
allowed_productive_values.append('NA')

df = df[df.productive.isin(allowed_productive_values)]
df.drop(columns=["productive"], inplace=True)

ImportHelper.junction_to_cdr3(df, params.region_type)
df.loc[:, "region_type"] = params.region_type.name
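
The row filter introduced in this hunk can be sketched without pandas. The snippet is a simplified illustration under stated assumptions: rows are plain dicts, 'productive' holds the raw strings from the file, and both function names are hypothetical.

```python
from typing import Dict, List


def allowed_productive_values(import_productive: bool,
                              import_unproductive: bool,
                              import_unknown_productivity: bool) -> List[str]:
    """Collect the raw 'productive' strings that should survive the import filter."""
    allowed = []
    if import_productive:
        allowed.append("True")
    if import_unproductive:
        allowed.append("False")
    if import_unknown_productivity:
        allowed.extend(["", "NA"])        # missing and 'NA' both count as unknown
    return allowed


def filter_rows(rows: List[Dict[str, str]], allowed: List[str]) -> List[Dict[str, str]]:
    """Keep the rows whose 'productive' value is in the allowed list."""
    return [row for row in rows if row["productive"] in allowed]
```

With the defaults from the 10x parameter file in this commit (productive and unknown imported, unproductive not), rows marked 'True', 'NA' or empty are kept and rows marked 'False' are dropped.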
2 changes: 1 addition & 1 deletion immuneML/IO/dataset_import/VDJdbImport.py
@@ -109,7 +109,7 @@ def import_dataset(params: dict, dataset_name: str) -> Dataset:

@staticmethod
def preprocess_dataframe(df: pd.DataFrame, params: DatasetImportParams):
df["frame_type"] = SequenceFrameType.IN.name
df["frame_type"] = SequenceFrameType.IN.value
ImportHelper.junction_to_cdr3(df, params.region_type)
df.loc[:, "region_type"] = params.region_type.name

1 change: 1 addition & 0 deletions immuneML/config/default_params/datasets/airr_params.yaml
@@ -2,6 +2,7 @@ is_repertoire: True
path: ./
paired: False
import_productive: True
import_unknown_productivity: True
import_with_stop_codon: False
import_out_of_frame: False
import_illegal_characters: False
@@ -2,6 +2,7 @@ is_repertoire: True
path: ./
paired: False
import_productive: True
import_unknown_productivity: True
import_with_stop_codon: False
import_out_of_frame: False
import_illegal_characters: False
@@ -2,6 +2,7 @@ is_repertoire: True
path: ./
import_productive: True # whether to only import productive sequences
import_unproductive: False # whether to only import unproductive sequences
import_unknown_productivity: True # whether to import sequences with unknown productivity (missing/NA)
import_illegal_characters: False
region_type: "IMGT_CDR3" # which region to use - IMGT_CDR3 option means removing first and last amino acid as 10xGenomics uses IMGT junction as CDR3
separator: "," # column separator
5 changes: 5 additions & 0 deletions immuneML/config/default_params/encodings/motif_params.yaml
@@ -0,0 +1,5 @@
max_positions: 4
min_positions: 1
min_precision: 0.8
min_recall: 0
min_true_positives: 10
@@ -0,0 +1,5 @@
hamming_distance: 1
ignore_genes: false
threads: 8
keep_temporary_files: false
compairr_path: null
@@ -0,0 +1 @@
separator: "\t"
@@ -10,4 +10,6 @@ assessment: # outer loop of nested CV
selection: # inner loop of nested CV
split_strategy: random # perform random split to train and validation datasets
split_count: 1 # how many fold to create
training_percentage: 0.7
training_percentage: 0.7
example_weighting: null
export_all_ml_settings: False # only export the optimal model
@@ -0,0 +1,5 @@
training_percentage: 0.7
max_features: 100
patience: 5
min_delta: 0
keep_all: false
@@ -0,0 +1,3 @@
training_percentage: 0.7
units_per_layer: [[CONV, 400, 3, 1], [DROP, 0.5], [POOL, 2, 1], [FLAT], [DENSE, 50]]
activation: relu
@@ -0,0 +1,15 @@
training_set_identifier_path: null
training_percentage: 0.7
split_by_motif_size: true
max_positions: 4
min_positions: 1
min_precision: 0.9
min_recall: 0
min_true_positives: 1
test_precision_threshold: 0.8
highlight_motifs_name: Highlighted motif
min_points_in_window: 50
smoothing_constant1: 5
smoothing_constant2: 10
training_set_name: training set
test_set_name: test set
@@ -0,0 +1,5 @@
n_splits: 5
max_positions: 4
min_precision: 0
min_recall: 0
min_true_positives: 1
@@ -0,0 +1,8 @@
highlight_motifs_name: Highlighted motif
min_points_in_window: 50
smoothing_constant1: 5
smoothing_constant2: 10
training_set_name: training set
test_set_name: test set
split_by_motif_size: true
keep_test_dataset: true