This repository has been archived by the owner on May 31, 2023. It is now read-only.

Artifact Estimator Project

Maxwell Murphy edited this page Sep 6, 2019 · 3 revisions

Artifact Estimator Projects are used to generate artifact estimator equations: simple linear models that predict artifact peaks. Artifact peaks are false peaks that occur consistently relative to a true peak, and their heights can be described as a function of the height of a true peak. After peaks are identified, the Artifact Estimator Project clusters them based on their distance to the presumed primary true peak. Within each of these groupings, the Artifact Estimator Project then generates a linear model that describes the relationship between the artifact peaks and the primary peak.
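The grouping step described above can be sketched roughly as follows. This is only an illustration, not the project's actual implementation: the function name `group_by_distance`, the dictionary keys, and the choice to round distances to the nearest base are all assumptions made for the example.

```python
from collections import defaultdict


def group_by_distance(runs):
    """Group non-primary peaks by their rounded distance from the tallest peak.

    Hypothetical helper: each run is a list of peaks, and each peak is a dict
    with a "size" (in bases) and a "height". Rounding to the nearest base is
    an assumption; the real clustering rule is not specified here.
    """
    groups = defaultdict(list)
    for peaks in runs:
        # Treat the tallest peak in the run as the presumed primary peak.
        primary = max(peaks, key=lambda p: p["height"])
        for peak in peaks:
            if peak is primary:
                continue
            distance = round(peak["size"] - primary["size"])
            groups[distance].append({
                "primary_size": primary["size"],
                "relative_height": peak["height"] / primary["height"],
            })
    return dict(groups)


runs = [
    [{"size": 200.1, "height": 1000.0}, {"size": 198.0, "height": 100.0}],
    [{"size": 210.0, "height": 2000.0}, {"size": 208.1, "height": 300.0}],
]
groups = group_by_distance(runs)
```

In this made-up data, both secondary peaks sit about two bases below their primary peak, so they end up in the same distance class and could be modeled together.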

New Artifact Estimator Project

To create a new Artifact Estimator Project, select the Artifact Estimators tab in the left panel. Fill in the relevant information, including a unique title, the locus set to be analyzed, and a valid Bin Estimator Project that will be used to bin peaks.

Add Samples

After creating an Artifact Estimator Project, open the samples tab and choose, from the inactive samples, those you wish to include in the project. Then press Add to add them to the project.

Artifact Estimator Settings

After creating a new Artifact Estimator Project, select the project and navigate to the Analysis Settings tab. From here, each locus may be individually analyzed to create a new Artifact Estimator Equation Set. During analysis, runs are only used if they are determined to be mono-allelic, that is, if they contain only one primary peak.

| Parameter | Description |
| --- | --- |
| Max Secondary Relative Peak Height | Used to determine mono-allelic runs. A run is mono-allelic if all secondary peaks have a relative peak height less than Max Secondary Relative Peak Height. |
| Min Artifact Peak Frequency | Minimum number of peaks that must be present in a relative distance class before an Artifact Estimator Equation Set is generated. |
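As a rough illustration of how these two settings might act as filters, consider the sketch below. The function names and data layout are hypothetical, and the comparisons (strictly less than for the relative height, at least for the peak count) are a reading of the descriptions above rather than the confirmed behavior.

```python
def is_mono_allelic(peaks, max_secondary_relative_height):
    """Return True if every secondary peak is small relative to the tallest peak.

    Hypothetical helper: each peak is a dict with a "height" key.
    """
    heights = sorted((p["height"] for p in peaks), reverse=True)
    primary_height, secondary_heights = heights[0], heights[1:]
    return all(
        h / primary_height < max_secondary_relative_height
        for h in secondary_heights
    )


def classes_with_enough_peaks(distance_classes, min_artifact_peak_frequency):
    """Keep only the distance classes that contain enough peaks to model."""
    return {
        distance: peaks
        for distance, peaks in distance_classes.items()
        if len(peaks) >= min_artifact_peak_frequency
    }


clean_run = [{"height": 1000.0}, {"height": 100.0}]
mixed_run = [{"height": 1000.0}, {"height": 500.0}]
kept = classes_with_enough_peaks({-2: [1, 2, 3], 1: [1]}, 3)
```

With a threshold of 0.35, `clean_run` would be treated as mono-allelic and `mixed_run` would be skipped; with a minimum frequency of 3, only the `-2` distance class would receive an equation set.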

Artifact Estimator Equation Set

After analyzing peaks for a given locus, a set of equations is created that describes the relationship of the primary peak to the identified artifact peak classes, where each class is a discrete distance from the primary peak. Within each class, there may be one or more linear regressions that describe the relationship as a function of the base size of the primary peak. There are four main methods for generating linear regressions:

| Method | Description |
| --- | --- |
| Least Squares Regression (LSR) | Minimizes the residual sum of squares between the observed response and the predicted response. |
| Theil-Sen Regression (TSR) | Non-parametric, median-based estimator that is more robust to outliers. This is the default estimator. |
| Random Sample Consensus (RANSAC) | Iterative algorithm that is robust to outliers. Partitions peaks into inliers and outliers and uses only the inliers to generate the model. Provides very tight artifact estimates, but is erratic when little data is available. |
| No Slope | The estimate is the mean of the data. |
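To make the difference between the estimators concrete, here is a minimal pure-Python sketch of three of them (RANSAC is omitted for brevity). These are textbook formulations written for illustration, not the project's own code; the function names and the sample data are invented.

```python
from itertools import combinations
from statistics import mean, median


def least_squares(xs, ys):
    """Ordinary least squares fit; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def theil_sen(xs, ys):
    """Theil-Sen fit: median of pairwise slopes, then median intercept."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept


def no_slope(xs, ys):
    """Flat model: the estimate is the mean of the data."""
    return 0.0, mean(ys)


# Made-up primary peak sizes and relative artifact heights; the last point
# is a deliberate outlier.
sizes = [100, 110, 120, 130, 140]
rel_heights = [0.10, 0.11, 0.12, 0.13, 0.50]

ls_slope, ls_intercept = least_squares(sizes, rel_heights)
ts_slope, ts_intercept = theil_sen(sizes, rel_heights)
flat_slope, flat_intercept = no_slope(sizes, rel_heights)
```

On this data the single outlier pulls the least squares slope well above the trend of the other four points, while the Theil-Sen slope stays close to that trend, which is one way to see why a median-based estimator is a sensible default.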

Sometimes the relationship between the primary peak and the artifact peak is not strictly linear and is better represented by multiple linear segments. A possible biological explanation for this is the presence of indels within the sequence that affect amplification dynamics. To accommodate these different regions, add a breakpoint by double-clicking on the panel at the point where you want the breakpoint. If an artifact estimator equation set does not appear to be accurate, it can simply be deleted. A global artifact estimator equation set is also generated for every marker; it attempts to identify low-level artifacts that appear but have no relationship to a primary peak.
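One way to picture an equation set with breakpoints is as a list of line segments, each valid over a range of primary peak sizes. The sketch below is an assumption-laden illustration: the function name, the `(slope, intercept)` representation, and the choice of relative artifact height as the predicted quantity are not taken from the project itself.

```python
from bisect import bisect_right


def predict_relative_height(primary_size, breakpoints, segments):
    """Evaluate a piecewise linear model at a primary peak size.

    Hypothetical layout: `breakpoints` is a sorted list of base sizes and
    `segments` holds one (slope, intercept) pair per region, so
    len(segments) == len(breakpoints) + 1.
    """
    # bisect_right picks the region to the right of any breakpoint that is
    # less than or equal to the primary peak size.
    slope, intercept = segments[bisect_right(breakpoints, primary_size)]
    return slope * primary_size + intercept


breakpoints = [150.0]
segments = [(0.001, 0.0), (0.0, 0.2)]  # rising below 150 bases, flat above

low = predict_relative_height(100.0, breakpoints, segments)
high = predict_relative_height(200.0, breakpoints, segments)
```

Each added breakpoint would introduce one more segment, and each segment could be fit separately with any of the regression methods listed earlier.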