Merge in short motif code (#161)

* Merged in PositionalMotifFrequencies report
* added todos for PositionalMotifFrequencies report
* small (formatting) corrections. updated todos
* added precision/recall to feature annotations
* Added SignificantMotifPrecisionTP report
* Motif encoder and analysis updates:
  - rename SignificantMotifEncoder to MotifEncoder
  - added MotifGeneralizationAnalysis
  - do not store learned motifs as parameter of MotifEncoder, instead read from file
  - added random seed option to get_train_val_indices
  - MotifPrecisionTP may soon be deprecated or must be refactored to share code with MotifGeneralizationAnalysis
* attempt at making MotifEncoder faster by initializing (long) growing lists with None values
* allow label to be str or dict
* parallelisation of MotifEncoder encoded data matrix construction for speed
* more parallelisation in MotifEncoder
* add weight_thresholds, split_classes via YAML
* minor updates
* add WeightsDistribution report
* add weight_thresholds, split_classes via YAML
* add docs, add unit test file (not completed)
* minor updates
* added todos for Eric in WeightsDistribution report
* fixed todos
* minor correction
* bugfix: DataWeighter should return a clone of the dataset instead of modifying the dataset
* test print statements
* debugging print statements
* attempted bugfix
* debugging prints
* debugging
* debugging
* bugfix
* removed debugging prints
* Bugfixes in MotifGeneralizationAnalysis:
  - do not ignore sequences with TP=0 in test set
  - include FP scores when computing combined precision
* bugfixes & added smoothing option
* Bugfix: remove sorting from ElementDataset & add assert statement in ElementGenerator for when files are not ordered correctly
* extending importance weighting to restrict mutagenesis to only one class
* finished implementation of class-specific ImportanceWeighting
* Smoothing and plotting updates:
  - Updated line smoothing code for MotifGeneralizationAnalysis
  - Updated _safe_plot in Report: can now specify the name of the callable (default="_plot"). This means _safe_plot can be used multiple times in one report, ensuring that when one of the plots fails, the other plots are still generated.
* added more todos for WeightsDistribution report
* updated MotifGeneralizationAnalysis:
  - plotting style
  - cleaned up code, moved computation out of plotting functions, export all plotting-relevant details prior to plotting
* Report updates:
  - Added AminoAcidFrequencyDistribution report: plots a barplot of each amino acid in each position of all sequences in a dataset (any dataset type)
  - small fix in SequenceLengthDistribution (set output_written=False to ensure correct error message, since this report writes no data file)
* Updated color palette
* updated AminoAcidFrequencyDistribution to include splitting by label values
* Updated docs
* update style
* sorted categories AminoAcidFrequencyDistribution
* made range of figures up to 1.01 to not cut off points
* temporarily add sequence hover data to WeightsDistribution report
* added option to predefine training set for MotifGeneralizationAnalysis by file
* automatically determine the optimal TP/recall cutoff and show in plot
* moved get_numpy_sequence_representation to PositionalMotifHelper
* update: write training set ids to files instead of printing in log (too long)
* updated MotifGeneralizationAnalysis: choose last point of exceeding precision threshold for TP cutoff
* plot highlighted motifs on top
* minor refactoring
* allow generalization plot for multiple motif sizes
* bugfix: dynamically change min_total_points_in_window
* Bugfix
* bugfix
* bugfix
* plot fix
* separate recall cutoff for different motif sizes
* updated the way the recall threshold is determined
* export confusion matrix
* theme white
* minor fix
* added keep_all param to MotifClassifier
* improved error message for Metric
* bugfix
* bugfix matches report: get subject ids
* bugfix: class mapping
* added selected features as export value
* move selected feature writing to fit
* bugfix
* updated the way tp thresholds are determined
* added MotifTestSetPerformance report, refactored to share code with MotifGeneralizationAnalysis with helper class MotifPerformancePlotHelper
* New report: NonMotifSimilarity; minor changes made to other reports
* rename report
* removed deprecated report, added requirements specific for tensorflow
* updated format of example id files for compatibility
* bugfix manual splitter: it didn't work for non-string classes, now everything is cast to string
* bugfix
* shorten log text - becomes extremely long and unreadable
* bugfix identifiers
* refactored out col_names stuff for simplicity
* refactoring, more shared code, splitting per motif size of motiftestsetperformance
* Add MotifOverlapReport
* prettier plots
* all tp cutoffs in one file
* started implementation, abandoned idea for now
* export simple stats from MotifEncoder
* updated plot
* Initial version
* backup, installing new OS
* small edits
* comment out some experimental code
* bugfix test
* added SimilarToPositiveSequenceEncoder: a full sequence hamming dist-based encoder; renamed MotifClassifier to BinaryFeatureClassifier as it's generally applicable
* add facet
* different sizes
* clean up
* slight speed improvement: allow lower size limit on motifs and don't check the motifs that are too small
* more helpful error message
* minor updates to plot styling
* minor updates
* minor bugfix
* change dataframe structure
* all in one plot, change table
* add help method
* update test bench
* add duplicate max values
* added option for negative amino acids to Motif encoder
* added option for negative amino acids to Motif encoder
* add top/bottom n and filtering to FeatureValueBarplot
* added option for negative amino acids to Motif encoder
* Add max_gap_size_only functionality
* Label:
  - ensure label classes always follow the same order: positive class first
  - ensure a default label positive class is always set (in LabelConfig), utility function for retrieving negative class in binary case
  - enforce label classes to always be set

  predictions proba:
  - when predicting probabilities: explicitly keep track of the label classes (dict instead of multi-dimensional np array) -> previously the ordering of labels (positive class last) was not consistently enforced across MLMethods. These bugs are difficult to catch as the 'predictions' were correct, but the 'predictions_proba' not. I believe the current solution is less error-prone in the future when other developers may work on immuneML

  These updates resolve previously observed bugs that: predictions & predictions_proba did not match (resulting in inverse ROC curves), and that the wrong positive class was sometimes assumed for asymmetric performance metrics (e.g., precision, recall)
* fixes after new update
* cleaner way of getting label desc for storing ML models
* improved tests
* minor fix
* little refactoring, cleaned up some shared code between GroundTruthMotifOverlap & PositionalMotifHelper; minor changes in GroundTruthMotifOverlap
* made gap plot a lineplot
* default param
* check params
* added BinaryFeaturePrecisionRecall: a precision-recall plot for BinaryFeatureClassifier, plus the option to force learn all motifs
* added precision-recall plot for BinaryFeatureClassifier, plus the option to force learn all motifs
* added precision-recall plot for BinaryFeatureClassifier, plus the option to force learn all motifs
* minor update error message
* bugfix
* bugfix
* improved test
* bugfix, got stuck in an infinite loop
* bugfix
* bugfix
* bugfixes
* temporarily set higher recursion depth to prevent crashing
* update report to show training-validation-test set performance independently
* Made CompAIRR-powered version of SimilarToPositiveSequenceEncoder
* minor fixes GroundTruthMotifOverlap plot & make it possible for BinaryFeatureClassifier to set max motifs to all motifs on the fly
* remove print statement
* minor update
* rename highlight_motifs_path to groundtruth_motifs_path
* bugfixes compairr-version of SimilarToPositiveSequenceEncoder
* bugfixes compairr-version of SimilarToPositiveSequenceEncoder
* separate output folder for learning model
* added option to automatically remove test dataset (can be large)
* Update AminoAcidFrequencyDistribution report to show log-fold change
* implemented get_attribute for Receptor. All receptors have identifier and metadata dict.
* bugfix
* switch from logfold change to difference in relative frequency
* .
* 1-based counting of positions
* functionality to export non-optimal ML models in addition to the optimal ones
* undo partial commit
* improved efficiency of BinaryFeatureClassifier
* added lots of log statements to find out where the running time bottleneck is
* keep track of val predictions instead of recomputing them every time
* added multiprocessing option for BinaryFeatureClassifier
* remove default cores for training to test
* bugfix: pass cores_for_training in recursive function
* possible speed improvement: dont recompute scoring fn when array is equal to previously tested
* remove log statement
* BinaryFeatureClassifier updates:
  - in BinaryFeatureClassifier, keep track of indices that show improvement to reduce the total number of comparisons made during training
  - remove learn_all option from BinaryFeatureClassifier
  - updated BinaryFeaturePrecisionRecall to display only 1 data point if keep_all = True
* updated log statement
* remove log statements
* minor fix docs
* fixes for Label in MLApplication instruction: explicitly pass on the positive class to make sure the same positive class is applied during MLApplication
* Allow metrics to be computed during MLApplication if the same label is provided
* small fix to make tests pass
* fix: html was overwritten
* bugfixes
* restored example weights
* bugfix: test if proba available
* bugfix: dont access _proba columns when not defined
* bugfix: convert everything to string
* small fixes
* bugfixes
* fix bug
* big bug fix
* added test for GroundTruthMotifOverlap + small fixes; highlight motifs: highlight sub motifs also
* small fix for faster test
* small updates to motif reports
* axis title updates
* minor aesthetic update
* undo change in test
* visual updates to plots
* bugfix to gaps report
* fixed warning
* minor fix gaps figure
* minor fix gaps figure
* minor fix gaps figure
* Add new _get_max_overlap
* remove obsolete title
* minor updates
* plot update: show line on left side of test plots for motif generalization
* remove obsolete report
* remove internal cv in outer assessment loop for sklearn
* merge in sklearn cv bugfix
* final bugfixes merging in master
* added parameter checking when using manual splittype
* Keras sequence CNN documentation updates + minor fixes
* updated installation docs
* Updated SimilarToPositiveSequenceEncoder, MotifEncoder and BinaryFeatureClassifier docs, removed generalize_motifs option as it is currently not used in practice, and disabled allow_negative_aas option as it requires a few more fixes.
* fixes regarding disabling allow_negative_aas option
* updated MotifGeneralizationAnalysis docs
* added motif recovery tutorial to documentation
* updated docs
* updated docs
* remove deprecated pseudocount parameter
* removed importanceweighting strategy and updated docs for predefinedweighting
* removed importanceweighting tests
* removed importanceweighting tests
* fixing tests
* corrected docs (and variable names): percentage-wise frequency change is plotted, not logfold
* Merge latest master into short motif, resolve merge conflicts.
  - Bugfix in AIRRExporter which read sequences as 'productive'=False when 'productive' was missing
  - some updated variable names
  - updated docs
* Bugfixes related to sequence frame type and 'productive' status for file import. Add explicit option to import sequences with unknown productivity where relevant (true by default, option not made available for immunoseq import types as their documentation reveals that productivity type for those file formats is never 'unknown')
* workaround bionumpy+pickle error: not using pool but for loop
* Update setup.py
* Update Constants.py

---------

Co-authored-by: Eric Reber <[email protected]>
Co-authored-by: pavlovicmilena <[email protected]>
uio-bmi · Dec 1, 2023 · 751ae6b
1 parent d218ad9
commit 751ae6b
Showing 146 changed files with 6,024 additions and 241 deletions.
diff --git a/docs/source/developer_docs/how_to_add_new_encoding.rst b/docs/source/developer_docs/how_to_add_new_encoding.rst
@@ -50,7 +50,7 @@ An example of the implementation of :code:`NewKmerFrequencyEncoder` for the :py:
         """
         Encodes the repertoires of the dataset by k-mer frequencies and normalizes the frequencies to zero mean and unit variance.
 
-        Arguments:
+        Specification arguments:
 
             k (int): k-mer length
 
@@ -324,7 +324,7 @@ This is the example of documentation for :py:obj:`~immuneML.encodings.filtered_s
     Nature Genetics 49, no. 5 (May 2017): 659–65. `doi.org/10.1038/ng.3822 <https://doi.org/10.1038/ng.3822>`_.
 
 
-    Arguments:
+    Specification arguments:
 
         comparison_attributes (list): The attributes to be considered to group receptors into clonotypes. Only the fields specified in
         comparison_attributes will be considered, all other fields are ignored. Valid comparison value can be any repertoire field name.

diff --git a/docs/source/developer_docs/how_to_add_new_preprocessing.rst b/docs/source/developer_docs/how_to_add_new_preprocessing.rst
@@ -35,7 +35,7 @@ It includes implementations of the abstract methods and class documentation at t
         lower_limit, or more clonotypes than specified by the upper_limit.
         Note that this filter filters out repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets.
 
-        Arguments:
+        Specification arguments:
 
             lower_limit (int): The minimal inclusive lower limit for the number of clonotypes allowed in a repertoire.
 
@@ -260,7 +260,7 @@ This is the example of documentation for :py:obj:`~immuneML.preprocessing.filter
     lower_limit, or more clonotypes than specified by the upper_limit.
     Note that this filter filters out repertoires, not individual sequences, and can thus only be applied to RepertoireDatasets.
 
-    Arguments:
+    Specification arguments:
 
         lower_limit (int): The minimal inclusive lower limit for the number of clonotypes allowed in a repertoire.
 

diff --git a/docs/source/installation/install_with_package_manager.rst b/docs/source/installation/install_with_package_manager.rst
@@ -50,14 +50,6 @@ Note: when creating a python virtual environment, it will automatically use the
 
   pip install immuneML
 
-Alternatively, if you want to use the :ref:`TCRdistClassifier` ML method and corresponding :ref:`TCRdistMotifDiscovery` report, include the optional extra :code:`TCRdist`:
-
-.. code-block:: console
-
-  pip install immuneML[TCRdist]
-
-See also this question under 'Troubleshooting': :ref:`I get an error when installing PyTorch (could not find a version that satisfies the requirement torch)`
-
 
 
 Install immuneML with conda
@@ -95,6 +87,25 @@ Install immuneML with conda
 Installing optional dependencies
 ----------------------------------
 
+TCRdist
+*******
+
+If you want to use the :ref:`TCRdistClassifier` ML method and corresponding :ref:`TCRdistMotifDiscovery` report, you can include the optional extra :code:`TCRdist`:
+
+.. code-block:: console
+
+  pip install immuneML[TCRdist]
+
+The TCRdist dependencies can also be installed manually using the :download:`requirements_TCRdist.txt <https://raw.githubusercontent.com/uio-bmi/immuneML/master/requirements_TCRdist.txt>` file:
+
+.. code-block:: console
+
+  pip install -r requirements_TCRdist.txt
+
+
+DeepRC
+******
+
 Optionally, if you want to use the :ref:`DeepRC` ML method and corresponding :ref:`DeepRCMotifDiscovery` report, you also
 have to install DeepRC dependencies using the :download:`requirements_DeepRC.txt <https://raw.githubusercontent.com/uio-bmi/immuneML/master/requirements_DeepRC.txt>` file.
 Important note: DeepRC uses PyTorch functionalities that depend on GPU. Therefore, DeepRC does not work on a CPU.
@@ -104,8 +115,38 @@ To install the DeepRC dependencies, run:
 
   pip install -r requirements_DeepRC.txt --no-dependencies
 
+See also this question under 'Troubleshooting': :ref:`I get an error when installing PyTorch (could not find a version that satisfies the requirement torch)`
+
+
+Keras-based sequence CNN
+************************
+
+In order to use the :ref:`KerasSequenceCNN`, optional dependencies :code:`keras` and :code:`tensorflow` need to be installed.
+By default, version 2.11.0 of both dependencies is used.
+Other versions may work as well, as long as the versions of :code:`keras` and :code:`tensorflow` in use are compatible with each other.
+
+To install the default versions of these packages, you can include the optional extra :code:`KerasSequenceCNN`:
+
+.. code-block:: console
+
+  pip install immuneML[KerasSequenceCNN]
+
+Or install the dependencies manually using the :download:`requirements_KerasSequenceCNN.txt <https://raw.githubusercontent.com/uio-bmi/immuneML/master/requirements_KerasSequenceCNN.txt>` file:
+
+.. code-block:: console
+
+  pip install -r requirements_KerasSequenceCNN.txt
+
+
+The :ref:`KerasSequenceCNN` runs on CPU; it does *not* rely on GPU.
+
+CompAIRR
+********
+
 If you want to use the :ref:`CompAIRRDistance` or :ref:`CompAIRRSequenceAbundance` encoder, you have to install the C++ tool `CompAIRR <https://github.com/uio-bmi/compairr>`_.
-The easiest way to do this is by cloning CompAIRR from GitHub and installing it using :code:`make` in the main folder:
+Furthermore, the :ref:`SimilarToPositiveSequence` encoder can be run both with and without CompAIRR, but the CompAIRR-based version is faster.
+
+The easiest way to install CompAIRR is by cloning CompAIRR from GitHub and installing it using :code:`make` in the main folder:
 
 .. code-block:: console
 

diff --git a/docs/source/tutorials/how_to_apply_to_new_data.rst b/docs/source/tutorials/how_to_apply_to_new_data.rst
@@ -33,8 +33,11 @@ For a tutorial on importing datasets to immuneML (for training or applying an ML
 YAML specification example using the MLApplication instruction
 ------------------------------------------------------------------
 The :ref:`MLApplication` instruction takes in a :code:`dataset` and a :code:`config_path`. The :code:`config_path` should
-point at one of the .zip files exported by the previously run :ref:`TrainMLModel` instruction. They can be found in the sub-folder
-:code:`instruction_name/optimal_label_name` in the results folder.
+point at one of the .zip files exported by the previously run :ref:`TrainMLModel` instruction.
+The configuration of the optimal ML setting can always be found in the sub-folder :code:`<instruction_name>/optimal_<label_name>/zip` in the results folder.
+Alternatively, when running the :ref:`TrainMLModel` instruction with the parameter :code:`export_all_ml_settings` set to :code:`True`,
+the config file for each of the ML settings can be found inside :code:`<instruction_name>/split_<number>/<ml_setting_name>/ml_settings_config/zip`
+for each ML setting in each assessment split.
 
 
 .. highlight:: yaml

diff --git a/docs/source/tutorials/motif_recovery.rst b/docs/source/tutorials/motif_recovery.rst
@@ -5,6 +5,51 @@ immuneML provides several different options for recovering motifs associated wit
 Depending on the context, immuneML provides several different reports which can be used for this purpose.
 
 
+Discovering positional motifs using precision and recall thresholds
+----------------------------------------------------------------------
+
+It is often assumed that the antigen binding status of an immune receptor (antibody/TCR) may be determined by the *presence*
+of a short motif in the CDR3.
+We developed a method (manuscript in preparation) for the discovery of antigen binding associated motifs with the following properties:
+
+- Short position-specific motifs with possible gaps
+- High precision for predicting antigen binding
+- High generalisability to unseen data, i.e., retaining a relatively high precision on test data
+
+
+Method description
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A motif with a high precision for predicting antigen binding implies that when the motif is present,
+the probability that the sequence is a binder is high. One can thus iterate through every possible motif and filter
+them by applying a precision threshold. However, the more 'rare' a motif is, the more likely that the motif just had
+a high precision by chance (for example: a motif that occurs in only 1 binder and 0 non-binders has a perfect precision,
+but may not retain high precision on unseen data). Thus, an additional recall threshold is applied to remove
+rare motifs.
+Our method allows the user to define a precision threshold and learn the optimal recall threshold using a training + validation set.
+
+The method consists of the following steps:
+
+1. Splitting the data into training, validation and test sets.
+
+2. Using the training set, find all motifs with a high training-precision.
+
+3. Using the validation set, determine the recall threshold for which the validation-precision is still high (separate recall thresholds may be learned for motifs with different sizes).
+
+4. Using the combined training + validation set, find all motifs exceeding the user-defined precision threshold and learned recall threshold(s).
+
+5. Using the test set, report the precision and recall of these learned motifs.
+
+6. Optional: use the set of learned motifs as input features for ML classifiers (e.g., :ref:`BinaryFeatureClassifier` or :ref:`LogisticRegression`) for antigen binding prediction.
+
+Steps 2+3 are done by the report :ref:`MotifGeneralizationAnalysis`. This report exports the learned recall cutoff(s).
+It is recommended to run this report using the :ref:`ExploratoryAnalysis` instruction.
+Steps 4+5 are done by the :ref:`Motif` encoder. The learned recall cutoff(s) are used as input parameters. This encoder
+can be used either in :ref:`ExploratoryAnalysis` or :ref:`TrainMLModel` instructions.
+
+
+
+
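The precision and recall filtering described in the tutorial text above can be illustrated with a minimal sketch. This is not the MotifEncoder implementation: the function name `find_motifs`, its parameters, and the motif representation (a tuple of `(position, amino_acid)` pairs) are assumptions made for illustration, and the sketch assumes equal-length sequences so that positions are comparable.

```python
from itertools import combinations


def find_motifs(sequences, labels, max_positions=2, min_precision=0.8, min_recall=0.0):
    """Enumerate position-specific motifs and keep those passing both thresholds.

    A motif is a tuple of (position, amino_acid) pairs. Gaps arise naturally
    because the chosen positions do not have to be adjacent.
    """
    n_binders = sum(labels)
    counts = {}  # motif -> [true positives, false positives]

    for seq, is_binder in zip(sequences, labels):
        for size in range(1, max_positions + 1):
            for positions in combinations(range(len(seq)), size):
                motif = tuple((p, seq[p]) for p in positions)
                tp_fp = counts.setdefault(motif, [0, 0])
                tp_fp[0 if is_binder else 1] += 1

    kept = {}
    for motif, (tp, fp) in counts.items():
        precision = tp / (tp + fp)
        recall = tp / n_binders if n_binders else 0.0
        # The recall threshold removes rare motifs that are precise only by chance.
        if precision >= min_precision and recall >= min_recall:
            kept[motif] = (precision, recall)
    return kept


sequences = ["ACD", "ACE", "GCD", "GHE"]
labels = [1, 1, 0, 0]  # 1 = binder, 0 = non-binder
strict = find_motifs(sequences, labels, min_precision=0.8, min_recall=0.6)
loose = find_motifs(sequences, labels, min_precision=0.8, min_recall=0.0)
```

In this toy data, the motif "A at position 0 and D at position 2" occurs in one binder and no non-binders, so it has perfect precision but passes only when the recall threshold is low.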
 Discovering motifs learned by classifiers
 -----------------------------------------
 

diff --git a/immuneML/IO/dataset_export/AIRRExporter.py b/immuneML/IO/dataset_export/AIRRExporter.py
@@ -207,12 +207,14 @@ def _postprocess_dataframe(df, dataset_labels: dict, omit_columns: list = None):
         if "frame_type" in df.columns:
             AIRRExporter._enums_to_strings(df, "frame_type")
 
-            df["productive"] = df["frame_type"] == SequenceFrameType.IN.name
-            df.loc[df["frame_type"].isnull(), "productive"] = ''
+            df["productive"] = df["frame_type"] == SequenceFrameType.IN.value
+            df.loc[df["frame_type"].isnull(), "productive"] = ""
+            df.loc[df["frame_type"] == "", "productive"] = ""
+            df.loc[df["frame_type"] == SequenceFrameType.UNDEFINED.value, "productive"] = ""
 
             df["vj_in_frame"] = df["productive"]
 
-            df["stop_codon"] = df["frame_type"] == SequenceFrameType.STOP.name
+            df["stop_codon"] = df["frame_type"] == SequenceFrameType.STOP.value
             df.loc[df["frame_type"].isnull(), "stop_codon"] = ''
 
             df.drop(columns=["frame_type"], inplace=True)
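Reviewer note on the hunk above: the switch from `.name` to `.value` matters whenever an enum member's name and value differ, because a comparison against the wrong one silently evaluates to False for every row. The sketch below uses a hypothetical `FrameType` enum as a stand-in for `SequenceFrameType`; the member values are assumptions, not taken from this diff. It also shows the three-state result for `productive` (True, False, or empty for unknown).

```python
from enum import Enum


class FrameType(Enum):
    # Hypothetical stand-in for SequenceFrameType; the values are assumed.
    IN = "In"
    OUT = "Out"
    STOP = "Stop"
    UNDEFINED = ""


def productive_field(frame_type):
    """Map a stored frame type string to the exported 'productive' field.

    Returns "" when productivity is unknown, otherwise a boolean.
    """
    if frame_type is None or frame_type == FrameType.UNDEFINED.value:
        return ""
    return frame_type == FrameType.IN.value


# Comparing a stored value against .name never matches when name != value.
stored = FrameType.IN.value
matches_name = stored == FrameType.IN.name
matches_value = stored == FrameType.IN.value
```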

diff --git a/immuneML/IO/dataset_import/AIRRImport.py b/immuneML/IO/dataset_import/AIRRImport.py
@@ -38,6 +38,8 @@ class AIRRImport(DataImport):
 
     - import_productive (bool): Whether productive sequences (with value 'T' in column productive) should be included in the imported sequences. By default, import_productive is True.
 
+    - import_unknown_productivity (bool): Whether sequences with unknown productivity (missing value in column productive) should be included in the imported sequences. By default, import_unknown_productivity is True.
+
     - import_with_stop_codon (bool): Whether sequences with stop codons (with value 'T' in column stop_codon) should be included in the imported sequences. This only applies if column stop_codon is present. By default, import_with_stop_codon is False.
 
     - import_out_of_frame (bool): Whether out of frame sequences (with value 'F' in column vj_in_frame) should be included in the imported sequences. This only applies if column vj_in_frame is present. By default, import_out_of_frame is False.
@@ -110,15 +112,16 @@ def preprocess_dataframe(df: pd.DataFrame, params: DatasetImportParams):
             - the allele information is removed from the V and J genes
         """
         if "productive" in df.columns:
-            df["frame_type"] = SequenceFrameType.OUT.name
-            df.loc[df["productive"], "frame_type"] = SequenceFrameType.IN.name
+            df["frame_type"] = SequenceFrameType.UNDEFINED.value
+            df.loc[df["productive"]==True, "frame_type"] = SequenceFrameType.IN.value
+            df.loc[df["productive"]==False, "frame_type"] = SequenceFrameType.OUT.value
         else:
             df["frame_type"] = None
 
         if "vj_in_frame" in df.columns:
-            df.loc[df["vj_in_frame"], "frame_type"] = SequenceFrameType.IN.name
+            df.loc[df["vj_in_frame"]==True, "frame_type"] = SequenceFrameType.IN.value
         if "stop_codon" in df.columns:
-            df.loc[df["stop_codon"], "frame_type"] = SequenceFrameType.STOP.name
+            df.loc[df["stop_codon"]==True, "frame_type"] = SequenceFrameType.STOP.value
 
         if "productive" in df.columns:
             frame_type_list = ImportHelper.prepare_frame_type_list(params)
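Reviewer note on the hunk above: the assignments are order-dependent, so later columns override earlier ones. A per-row sketch of that precedence follows, assuming all three columns are present, that a missing value is represented as `None`, and that the frame type strings are `"In"`, `"Out"`, `"Stop"` and `""` (these strings are assumptions; the function name is hypothetical).

```python
def resolve_frame_type(productive=None, vj_in_frame=None, stop_codon=None):
    """Resolve one row's frame type in the same order as the hunk's assignments."""
    frame_type = ""  # assumed UNDEFINED value: productivity unknown
    if productive is True:
        frame_type = "In"
    elif productive is False:
        frame_type = "Out"
    # Explicit comparison with True: a missing value must not count as in-frame.
    if vj_in_frame is True:
        frame_type = "In"
    # A stop codon takes precedence over everything assigned before.
    if stop_codon is True:
        frame_type = "Stop"
    return frame_type
```

Before this change, a row with a missing `productive` value would have been treated as out of frame; in the sketch it stays undefined unless another column says otherwise.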

diff --git a/immuneML/IO/dataset_import/DatasetImportParams.py b/immuneML/IO/dataset_import/DatasetImportParams.py
@@ -19,6 +19,7 @@ class DatasetImportParams:
     column_mapping_synonyms: dict = None
     region_type: RegionType = None
     import_productive: bool = None
+    import_unknown_productivity: bool = None
     import_unproductive: bool = None
     import_with_stop_codon: bool = None
     import_out_of_frame: bool = None

diff --git a/immuneML/IO/dataset_import/TenxGenomicsImport.py b/immuneML/IO/dataset_import/TenxGenomicsImport.py
@@ -38,6 +38,12 @@ class TenxGenomicsImport(DataImport):
 
     - receptor_chains (str): Required for ReceptorDatasets. Determines which pair of chains to import for each Receptor.  Valid values for receptor_chains are the names of the :py:obj:`~immuneML.data_model.receptor.ChainPair.ChainPair` enum. If receptor_chains is not provided, the chain pair is automatically detected (only one chain pair type allowed per repertoire).
 
+    - import_productive (bool): Whether productive sequences (with value 'True' in column productive) should be included in the imported sequences. By default, import_productive is True.
+
+    - import_unproductive (bool): Whether unproductive sequences (with value 'False' in column productive) should be included in the imported sequences. By default, import_unproductive is False.
+
+    - import_unknown_productivity (bool): Whether sequences with unknown productivity (missing or 'NA' value in column productive) should be included in the imported sequences. By default, import_unknown_productivity is True.
+
     - import_illegal_characters (bool): Whether to import sequences that contain illegal characters, i.e., characters that do not appear in the sequence alphabet (amino acids including stop codon '*', or nucleotides). When set to false, filtering is only applied to the sequence type of interest (when running immuneML in amino acid mode, only entries with illegal characters in the amino acid sequence are removed). By default import_illegal_characters is False.
 
     - import_empty_nt_sequences (bool): imports sequences which have an empty nucleotide sequence field; can be True or False. By default, import_empty_nt_sequences is set to True.
@@ -105,17 +111,21 @@ def import_dataset(params: dict, dataset_name: str) -> Dataset:
 
     @staticmethod
     def preprocess_dataframe(df: pd.DataFrame, params: DatasetImportParams):
-        df["frame_type"] = None
-        df['productive'] = df['productive'] == 'True'
-        df.loc[df['productive'], "frame_type"] = SequenceFrameType.IN.name
+        df["frame_type"] = SequenceFrameType.UNDEFINED.value
+        df.loc[df['productive']=="True", "frame_type"] = SequenceFrameType.IN.value
+        df.loc[df['productive']=="False", "frame_type"] = SequenceFrameType.OUT.value
 
         allowed_productive_values = []
         if params.import_productive:
-            allowed_productive_values.append(True)
+            allowed_productive_values.append('True')
         if params.import_unproductive:
-            allowed_productive_values.append(False)
+            allowed_productive_values.append('False')
+        if params.import_unknown_productivity:
+            allowed_productive_values.append('')
+            allowed_productive_values.append('NA')
 
         df = df[df.productive.isin(allowed_productive_values)]
+        df.drop(columns=["productive"], inplace=True)
 
         ImportHelper.junction_to_cdr3(df, params.region_type)
         df.loc[:, "region_type"] = params.region_type.name
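Reviewer note on the hunk above: the filter now works on the raw string values of the `productive` column instead of booleans. A plain-Python sketch of the same selection logic is given below; rows are modelled as dicts rather than a DataFrame, and both function names are hypothetical.

```python
def allowed_productive_values(import_productive=True, import_unproductive=False,
                              import_unknown_productivity=True):
    """Build the list of raw 'productive' strings that pass the import filter."""
    allowed = []
    if import_productive:
        allowed.append("True")
    if import_unproductive:
        allowed.append("False")
    if import_unknown_productivity:
        allowed.extend(["", "NA"])  # missing or 'NA' means unknown productivity
    return allowed


def filter_rows(rows, **flags):
    """Keep only the rows whose 'productive' string is allowed."""
    allowed = set(allowed_productive_values(**flags))
    return [row for row in rows if row["productive"] in allowed]


rows = [
    {"id": "r1", "productive": "True"},
    {"id": "r2", "productive": "False"},
    {"id": "r3", "productive": ""},
    {"id": "r4", "productive": "NA"},
]
default_ids = [row["id"] for row in filter_rows(rows)]
strict_ids = [row["id"] for row in filter_rows(rows, import_unknown_productivity=False)]
```

With the default flags used in this sketch (which follow the defaults in `tenx_genomics_params.yaml`), rows with unknown productivity are kept and only the explicitly unproductive row is dropped.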

diff --git a/immuneML/IO/dataset_import/VDJdbImport.py b/immuneML/IO/dataset_import/VDJdbImport.py
@@ -109,7 +109,7 @@ def import_dataset(params: dict, dataset_name: str) -> Dataset:
 
     @staticmethod
     def preprocess_dataframe(df: pd.DataFrame, params: DatasetImportParams):
-        df["frame_type"] = SequenceFrameType.IN.name
+        df["frame_type"] = SequenceFrameType.IN.value
         ImportHelper.junction_to_cdr3(df, params.region_type)
         df.loc[:, "region_type"] = params.region_type.name
 

diff --git a/immuneML/config/default_params/datasets/airr_params.yaml b/immuneML/config/default_params/datasets/airr_params.yaml
@@ -2,6 +2,7 @@ is_repertoire: True
 path: ./
 paired: False
 import_productive: True
+import_unknown_productivity: True
 import_with_stop_codon: False
 import_out_of_frame: False
 import_illegal_characters: False

diff --git a/immuneML/config/default_params/datasets/i_receptor_params.yaml b/immuneML/config/default_params/datasets/i_receptor_params.yaml
@@ -2,6 +2,7 @@ is_repertoire: True
 path: ./
 paired: False
 import_productive: True
+import_unknown_productivity: True
 import_with_stop_codon: False
 import_out_of_frame: False
 import_illegal_characters: False

diff --git a/immuneML/config/default_params/datasets/tenx_genomics_params.yaml b/immuneML/config/default_params/datasets/tenx_genomics_params.yaml
@@ -2,6 +2,7 @@ is_repertoire: True
 path: ./
 import_productive: True # whether to only import productive sequences
 import_unproductive: False # whether to only import unproductive sequences
+import_unknown_productivity: True # whether to import sequences with unknown productivity (missing/NA)
 import_illegal_characters: False
 region_type: "IMGT_CDR3" # which region to use - IMGT_CDR3 option means removing first and last amino acid as 10xGenomics uses IMGT junction as CDR3
 separator: "," # column separator

diff --git a/immuneML/config/default_params/encodings/motif_params.yaml b/immuneML/config/default_params/encodings/motif_params.yaml
@@ -0,0 +1,5 @@
+max_positions: 4
+min_positions: 1
+min_precision: 0.8
+min_recall: 0
+min_true_positives: 10
diff --git a/immuneML/config/default_params/encodings/similar_to_positive_sequence_params.yaml b/immuneML/config/default_params/encodings/similar_to_positive_sequence_params.yaml
@@ -0,0 +1,5 @@
+hamming_distance: 1
+ignore_genes: false
+threads: 8
+keep_temporary_files: false
+compairr_path: null
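Reviewer note on the `hamming_distance` default above: the commit message describes SimilarToPositiveSequenceEncoder as a full-sequence Hamming-distance-based encoder. A minimal sketch of that idea follows, assuming a sequence counts as "similar" when it is within the given Hamming distance of at least one positive training sequence of the same length. The function names are hypothetical, and gene matching (`ignore_genes`) and the CompAIRR-backed path are not modelled.

```python
def hamming_distance(a, b):
    """Number of mismatching positions, or None if the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def similar_to_positive(sequence, positive_sequences, max_distance=1):
    """True if the sequence is within max_distance of any positive sequence."""
    for positive in positive_sequences:
        distance = hamming_distance(sequence, positive)
        if distance is not None and distance <= max_distance:
            return True
    return False


positives = ["CASSL", "CASTY"]
near = similar_to_positive("CASSV", positives)  # one mismatch with CASSL
far = similar_to_positive("CAGGV", positives)   # three mismatches with each
```

This brute-force comparison is quadratic in the number of sequences, which is consistent with the installation docs noting that the CompAIRR-based version is faster.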
diff --git a/immuneML/config/default_params/example_weighting/predefined_weighting_params.yaml b/immuneML/config/default_params/example_weighting/predefined_weighting_params.yaml
@@ -0,0 +1 @@
+separator: "\t"
diff --git a/immuneML/config/default_params/instructions/train_ml_model_params.yaml b/immuneML/config/default_params/instructions/train_ml_model_params.yaml
@@ -10,4 +10,6 @@ assessment: # outer loop of nested CV
 selection: # inner loop of nested CV
   split_strategy: random # perform random split to train and validation datasets
   split_count: 1 # how many fold to create
-  training_percentage: 0.7
+  training_percentage: 0.7
+example_weighting: null
+export_all_ml_settings: False # only export the optimal model
diff --git a/immuneML/config/default_params/ml_methods/binary_feature_classifier_params.yaml b/immuneML/config/default_params/ml_methods/binary_feature_classifier_params.yaml
@@ -0,0 +1,5 @@
+training_percentage: 0.7
+max_features: 100
+patience: 5
+min_delta: 0
+keep_all: false
diff --git a/immuneML/config/default_params/ml_methods/keras_sequence_cnn_params.yaml b/immuneML/config/default_params/ml_methods/keras_sequence_cnn_params.yaml
@@ -0,0 +1,3 @@
+training_percentage: 0.7
+units_per_layer: [[CONV, 400, 3, 1], [DROP, 0.5], [POOL, 2, 1], [FLAT], [DENSE, 50]]
+activation: relu
diff --git a/immuneML/config/default_params/reports/motif_generalization_analysis_params.yaml b/immuneML/config/default_params/reports/motif_generalization_analysis_params.yaml
@@ -0,0 +1,15 @@
+training_set_identifier_path: null
+training_percentage: 0.7
+split_by_motif_size: true
+max_positions: 4
+min_positions: 1
+min_precision: 0.9
+min_recall: 0
+min_true_positives: 1
+test_precision_threshold: 0.8
+highlight_motifs_name: Highlighted motif
+min_points_in_window: 50
+smoothing_constant1: 5
+smoothing_constant2: 10
+training_set_name: training set
+test_set_name: test set
diff --git a/immuneML/config/default_params/reports/motif_overlap_params.yaml b/immuneML/config/default_params/reports/motif_overlap_params.yaml
@@ -0,0 +1,5 @@
+n_splits: 5
+max_positions: 4
+min_precision: 0
+min_recall: 0
+min_true_positives: 1
diff --git a/immuneML/config/default_params/reports/motif_test_set_performance_params.yaml b/immuneML/config/default_params/reports/motif_test_set_performance_params.yaml
@@ -0,0 +1,8 @@
+highlight_motifs_name: Highlighted motif
+min_points_in_window: 50
+smoothing_constant1: 5
+smoothing_constant2: 10
+training_set_name: training set
+test_set_name: test set
+split_by_motif_size: true
+keep_test_dataset: true