Commit
* Merged in PositionalMotifFrequencies report
* added todos for PositionalMotifFrequencies report
* small (formatting) corrections. updated todos
* added precision/recall to feature annotations
* Added SignificantMotifPrecisionTP report
* - rename SignificantMotifEncoder to MotifEncoder
  - added MotifGeneralizationAnalysis
  - do not store learned motifs as parameter of MotifEncoder, instead read from file
  - added random seed option to get_train_val_indices
  - MotifPrecisionTP may soon be deprecated or must be refactored to share code with MotifGeneralizationAnalysis
* attempt at making MotifEncoder faster by initializing (long) growing lists with None values
* allow label to be str or dict
* parallelisation of MotifEncoder encoded data matrix construction for speed
* more parallelisation in MotifEncoder
* add weight_thresholds, split_classes via YAML
* minor updates
* add WeightsDistribution report
* add weight_thresholds, split_classes via YAML
* add docs, add unit test file (not completed)
* minor updates
* added todos for Eric in WeightsDistribution report
* fixed todos
* minor correction
* bugfix: DataWeighter should return a clone of the dataset instead of modifying the dataset
* test print statements
* debugging print statements
* attempted bugfix
* debugging prints
* debugging
* debugging
* bugfix
* removed debugging prints
* Bugfixes in MotifGeneralizationAnalysis:
  - do not ignore sequences with TP=0 in test set
  - include FP scores when computing combined precision
* bugfixes & added smoothing option
* Bugfix: remove sorting from ElementDataset & add assert statement in ElementGenerator for when files are not ordered correctly
* extending importance weighting to restrict mutagenesis to only one class
* finished implementation of class-specific ImportanceWeighting
* - Updated line smoothing code for MotifGeneralizationAnalysis
  - Updated _safe_plot in Report: can now specify the name of the callable (default="_plot").
    This means _safe_plot can be used multiple times in one report, ensuring that when one of the plots fails, the other plots are still generated.
* added more todos for WeightsDistribution report
* updated MotifGeneralizationAnalysis:
  - plotting style
  - cleaned up code, moved computation out of plotting functions, export all plotting-relevant details prior to plotting
* - Added AminoAcidFrequencyDistribution report: plots a barplot of each amino acid in each position of all sequences in a dataset (any dataset type)
  - small fix in SequenceLengthDistribution (set output_written=False to ensure correct error message, since this report writes no data file)
* Updated color palette
* updated AminoAcidFrequencyDistribution to include splitting by label values
* Updated docs
* update style
* sorted categories AminoAcidFrequencyDistribution
* made range of figures up to 1.01 to not cut off points
* temporarily add sequence hover data to WeightsDistribution report
* added option to predefine training set for MotifGeneralizationAnalysis by file
* automatically determine the optimal TP/recall cutoff and show in plot
* moved get_numpy_sequence_representation to PositionalMotifHelper
* update: write training set ids to files instead of printing in log (too long)
* updated MotifGeneralisationAnalysis: choose last point of exceeding precision threshold for TP cutoff
* plot highlighted motifs on top
* minor refactoring
* allow generalization plot for multiple motif sizes
* bugfix: dynamically change min_total_points_in_window
* Bugfix
* bugfix
* bugfix
* plot fix
* separate recall cutoff for different motif sizes
* updated the way the recall threshold is determined
* export confusion matrix
* theme white
* minor fix
* added keep_all param to MotifClassifier
* improved error message for Metric
* bugfix
* bugfix matches report: get subject ids
* bugfix: class mapping
* added selected features as export value
* move selected feature writing to fit
* bugfix
* updated the way tp
  thresholds are determined
* added MotifTestSetPeformance report, refactored to share code with MotifGeneralizationAnalysis with helper class MotifPerformancePlotHelper
* New report: NonMotifSimilarity; minor changes made to other reports
* rename report
* removed deprecated report, added requirements specific for tensorflow
* updated format of example id files for compatibility
* bugfix manual splitter: it didn't work for non-string classes, now everything is cast to string
* bugfix
* shorten log text - becomes extremely long and unreadable
* bugfix identifiers
* refactored out col_names stuff for simplicity
* refactoring, more shared code, splitting per motif size of motiftestsetperformance
* Add MotifOverlapReport
* prettier plots
* all tp cutoffs in one file
* started implementation, abandoned idea for now
* export simple stats from MotifEncoder
* updated plot
* Initial version
* backup, installing new OS
* small edits
* comment out some experimental code
* bugfix test
* added SimilarToPositiveSequenceEncoder: a full sequence hamming dist-based encoder; renamed MotifClassifier to BinaryFeatureClassifier as it's generally applicable
* add facet
* different sizes
* clean up
* slight speed improvement: allow lower size limit on motifs and don't check the motifs that are too small
* more helpful error message
* minor updates to plot styling
* minor updates
* minor bugfix
* change dataframe structure
* all in one plot, change table
* add help method
* update test bench
* add duplicate max values
* added option for negative amino acids to Motif encoder
* added option for negative amino acids to Motif encoder
* add top/bottom n and filtering to FeatureValueBarplot
* added option for negative amino acids to Motif encoder
* Add max_gap_size_only functionality
* Label:
  - ensure label classes always follow the same order: positive class first
  - ensure a default label positive class is always set (in LabelConfig), utility function for retrieving negative class in binary case
  - enforce label classes to always be set

  predictions proba:
  - when predicting probabilities: explicitly keep track of the label classes (dict instead of multi-dimensional np array) -> previously the ordering of labels (positive class last) was not consistently enforced across MLMethods. These bugs are difficult to catch as the 'predictions' were correct, but the 'predictions_proba' were not. I believe the current solution is less error-prone in the future when other developers may work on immuneML

  These updates resolve two previously observed bugs: predictions & predictions_proba did not match (resulting in inverse ROC curves), and the wrong positive class was sometimes assumed for asymmetric performance metrics (e.g., precision, recall)
* fixes after new update
* cleaner way of getting label desc for storing ML models
* improved tests
* minor fix
* little refactoring, cleaned up some shared code between GroundTruthMotifOverlap & PositionalMotifHelper; minor changes in GroundTruthMotifOverlap
* made gap plot a lineplot
* default param
* check params
* added BinaryFeaturePrecisionRecall: a precision-recall plot for BinaryFeatureClassifier, plus the option to force learn all motifs
* added precision-recall plot for BinaryFeatureClassifier, plus the option to force learn all motifs
* added precision-recall plot for BinaryFeatureClassifier, plus the option to force learn all motifs
* minor update error message
* bugfix
* bugfix
* improved test
* bugfix, got stuck in an infinite loop
* bugfix
* bugfix
* bugfixes
* temporarily set higher recursion depth to prevent crashing
* update report to show training-validation-test set performance independently
* Made CompAIRR-powered version of SimilarToPositiveSequenceEncoder
* minor fixes GroundTruthMotifOverlap plot & make it possible for BinaryFeatureClassifier to set max motifs to all motifs on the fly
* remove print statement
* minor update
* rename highlight_motifs_path to groundtruth_motifs_path
* bugfixes compairr-version
  of SimilarToPositiveSequenceEncoder
* bugfixes compairr-version of SimilarToPositiveSequenceEncoder
* separate output folder for learning model
* added option to automatically remove test dataset (can be large)
* Update AminoAcidFrequencyDistribution report to show log-fold change
* implemented get_attribute for Receptor. All receptors have identifier and metadata dict.
* bugfix
* switch from logfold change to difference in relative frequency
* .
* 1-based counting of positions
* functionality to export non-optimal ML models in addition to the optimal ones
* undo partial commit
* improved efficiency of BinaryFeatureClassifier
* added lots of log statements to find out where the running time bottleneck is
* keep track of val predictions instead of recomputing them every time
* added multiprocessing option for BinaryFeatureClassifier
* remove default cores for training to test
* bugfix: pass cores_for_training in recursive function
* possible speed improvement: don't recompute scoring fn when array is equal to previously tested
* remove log statement
* - in BinaryFeatureClassifier, keep track of indices that show improvement to reduce the total number of comparisons made during training
  - remove learn_all option from BinaryFeatureClassifier
  - updated BinaryFeaturePrecisionRecall to display only 1 data point if keep_all = True
* updated log statement
* remove log statements
* minor fix docs
* fixes for Label in MLApplication instruction: explicitly pass on the positive class to make sure the same positive class is applied during MLApplication
* Allow metrics to be computed during MLApplication if the same label is provided
* small fix to make tests pass
* fix: html was overwritten
* bugfixes
* restored example weights
* bugfix: test if proba available
* bugfix: don't access _proba columns when not defined
* bugfix: convert everything to string
* small fixes
* bugfixes
* fix bug
* big bug fix
* added test for GroundTruthMotifOverlap + small fixes
  highlight motifs:
  highlight sub motifs also
* small fix for faster test
* small updates to motif reports
* axis title updates
* minor aesthetic update
* undo change in test
* visual updates to plots
* bugfix to gaps report
* fixed warning
* minor fix gaps figure
* minor fix gaps figure
* minor fix gaps figure
* Add new _get_max_overlap
* remove obsolete title
* minor updates
* plot update: show line on left side of test plots for motif generalization
* remove obsolete report
* remove internal cv in outer assessment loop for sklearn
* merge in sklearn cv bugfix
* final bugfixes merging in master
* added parameter checking when using manual splittype
* Keras sequence CNN documentation updates + minor fixes
* updated installation docs
* Updated SimilarToPositiveSequenceEncoder, MotifEncoder and BinaryFeatureClassifier docs, removed generalize_motifs option as it is currently not used in practice, and disabled allow_negative_aas option as it requires a few more fixes.
* fixes regarding disabling allow_negative_aas option
* updated MotifGeneralizationAnalysis docs
* added motif recovery tutorial to documentation
* updated docs
* updated docs
* remove deprecated pseudocount parameter
* removed importanceweighting strategy and updated docs for predefinedweighting
* removed importanceweighting tests
* removed importanceweighting tests
* fixing tests
* corrected docs (and variable names): percentage-wise frequency change is plotted, not logfold
* Merge latest master into short motif, resolve merge conflicts.
  - Bugfix in AIRRExporter which read sequences as 'productive'=False when 'productive' was missing
  - some updated variable names
  - updated docs
* Bugfixes related to sequence frame type and 'productive' status for file import.
  Add explicit option to import sequences with unknown productivity where relevant (true by default, option not made available for immunoseq import types as their documentation reveals that productivity type for those file formats is never 'unknown')
* workaround bionumpy+pickle error: not using pool but for loop
* Update setup.py
* Update Constants.py

---------

Co-authored-by: Eric Reber <[email protected]>
Co-authored-by: pavlovicmilena <[email protected]>
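
The "predictions proba" note in the commit message above describes replacing a multi-dimensional array of class probabilities with a dict keyed by label class, so that the positive class is looked up by name and not by column position. The following is a minimal sketch of that idea only. The function names, the example classes, and the data are hypothetical and are not taken from immuneML's API.

```python
# Hypothetical illustration of "dict instead of multi-dimensional array".
# None of these names come from immuneML; they exist only for this sketch.

def proba_as_dict(proba_rows, classes):
    """Convert rows of per-class probabilities into {class: [p, ...]}.

    `classes` is the column order used by the underlying model.
    """
    return {cls: [row[i] for row in proba_rows] for i, cls in enumerate(classes)}


def positive_class_proba(proba, positive_class):
    """Look up positive-class probabilities by name, not by column index."""
    return proba[positive_class]


# Two models that happen to order their output columns differently.
rows_a = [[0.9, 0.1], [0.2, 0.8]]  # columns: ["sick", "healthy"]
rows_b = [[0.1, 0.9], [0.8, 0.2]]  # columns: ["healthy", "sick"]

proba_a = proba_as_dict(rows_a, ["sick", "healthy"])
proba_b = proba_as_dict(rows_b, ["healthy", "sick"])

# Name-based lookup agrees even though the column order differs.
same = positive_class_proba(proba_a, "sick") == positive_class_proba(proba_b, "sick")
```

With a plain array, code that assumes "positive class is the last column" would read the wrong column for one of the two models. Keying by class name removes that assumption, which is the failure mode the commit message attributes to the inverse ROC curves.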