Result Interpretation #8

Taghrid-M · 2024-03-18T20:20:19Z

Dear all,

I am working with the Rye algorithm for genetic ancestry analysis and need guidance on a few points:

Configuring Reference and Query Samples: How do I designate samples as reference or query within my dataset? Is there a specific format or procedure for this?

Results Interpretation: How should I interpret the ancestry group fraction values for query individuals? Does the highest value indicate primary ancestry, or is there a better approach?

Sample Proportions and Minimums: Is there an optimal reference-to-query sample ratio for accurate results? Additionally, what is the minimum total number of samples required for effective analysis?

Your advice on these matters would greatly aid my research. Thank you for your support.

Many thanks,
Taghrid

ShivamSharma13 · 2024-03-23T15:35:08Z

Hi Taghrid,

Please check the following responses.

Configuring Reference and Query Samples
This will depend on your target population. Ideally, your reference samples will come from open-source datasets such as 1000 Genomes or HGDP. For example, if your query samples come from a South American country, you would select European (such as GBR, IBS), African (such as YRI, GWD), and American (such as Pima, PEL, Maya) populations as three reference groups (you can assign European, African, and American in the subgroup and group columns of the pop2group.txt file).
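As a rough illustration of the mapping described above, here is a minimal Python sketch that writes a tab-separated population-to-group table. The column names (`Pop`, `Subgroup`, `Group`) and the layout are assumptions made for this example, so please check the pop2group.txt files shipped with RYE for the format it actually expects.

```python
import csv
import io

# Population codes taken from the example above, mapped to the three reference groups.
# ASSUMPTION: the header names and column order are illustrative, not confirmed for RYE.
POP_TO_GROUP = [
    ("GBR", "European", "European"),
    ("IBS", "European", "European"),
    ("YRI", "African", "African"),
    ("GWD", "African", "African"),
    ("Pima", "American", "American"),
    ("PEL", "American", "American"),
    ("Maya", "American", "American"),
]


def write_pop2group(rows, handle):
    """Write (population, subgroup, group) rows as a tab-separated table."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Pop", "Subgroup", "Group"])
    writer.writerows(rows)


# Build the table in memory; pass an open file handle instead to save it to disk.
buffer = io.StringIO()
write_pop2group(POP_TO_GROUP, buffer)
pop2group_text = buffer.getvalue()
```

Any sample whose population is not listed in such a table would presumably be treated as a query sample, but that behaviour is also an assumption here.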

Results Interpretation
RYE generates continuous genetic ancestry estimates (0-100%). Determining primary ancestry is a subjective task, especially for admixed individuals. For example, a sample showing up as 100% European could theoretically indicate an individual of primarily European descent. But an admixed sample from South America could quite possibly come out as roughly 33% European, 33% African, and 33% American ancestry. In such a case, a primary ancestry label cannot be assigned to any single reference group, but you can place samples with similar genetic ancestry estimates in a category such as "Admixed South American".
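One way to make this labelling explicit is a simple threshold rule, sketched below. The function name `label_ancestry` and the 0.8 cutoff are hypothetical choices for illustration only; RYE does not prescribe a threshold, and the right value depends on your study.

```python
def label_ancestry(fractions, threshold=0.8, admixed_label="Admixed"):
    """Return the top ancestry group if its fraction meets the threshold,
    otherwise return the admixed label.

    fractions: dict mapping group name -> ancestry fraction (0.0 to 1.0).
    threshold: ASSUMED cutoff for calling a primary ancestry; choose your own.
    """
    top_group = max(fractions, key=fractions.get)
    if fractions[top_group] >= threshold:
        return top_group
    return admixed_label


# Example inputs mirroring the two cases described above.
mostly_european = {"European": 1.0, "African": 0.0, "American": 0.0}
three_way = {"European": 0.34, "African": 0.33, "American": 0.33}

label_a = label_ancestry(mostly_european)
label_b = label_ancestry(three_way, admixed_label="Admixed South American")
```

With these inputs the first sample gets the label of its dominant group and the second falls into the admixed category, since no single fraction reaches the cutoff.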

Sample Proportions and Minimums
This is important, especially in the case of reference samples. Please read more about PC bias: [Principal Component Analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated](https://www.nature.com/articles/s41598-022-14395-4). It's essential to keep your n(reference samples) balanced between groups, otherwise PCA will start finding more granular variation within groups. A typical RYE run for query samples from South America could include 300 European, 300 African, and 200 American reference samples (you can extract them from the 1000 Genomes + HGDP open-source datasets).
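If one reference group is much larger than the others, a simple option is to downsample it before running PCA. The sketch below is a generic helper, not part of RYE; the function name `balance_reference`, the cap of 300, and the toy sample IDs are all assumptions for illustration.

```python
import random
from collections import defaultdict


def balance_reference(sample_to_group, max_per_group, seed=0):
    """Randomly downsample each reference group to at most max_per_group samples.

    sample_to_group: dict mapping sample ID -> reference group name.
    Returns a dict mapping group name -> sorted list of retained sample IDs.
    """
    by_group = defaultdict(list)
    for sample, group in sample_to_group.items():
        by_group[group].append(sample)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    balanced = {}
    for group in sorted(by_group):
        samples = sorted(by_group[group])
        if len(samples) > max_per_group:
            samples = rng.sample(samples, max_per_group)
        balanced[group] = sorted(samples)
    return balanced


# Toy input with made-up sample IDs: 500 European, 300 African, 200 American.
toy = {}
toy.update({f"EUR{i}": "European" for i in range(500)})
toy.update({f"AFR{i}": "African" for i in range(300)})
toy.update({f"AMR{i}": "American" for i in range(200)})

balanced = balance_reference(toy, max_per_group=300)
counts = {group: len(ids) for group, ids in balanced.items()}
```

Here the European group is trimmed to the cap while the smaller groups are kept whole, which gives the 300 / 300 / 200 split mentioned above.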

n(query samples) can vary, but please make sure that you have appropriate reference samples that cover all extremes of your PCA sample distribution. For example, if you have query samples with high African ancestry but no reference samples designated to capture African ancestry, then RYE would find other proxy ancestries (which may or may not be accurately interpretable).
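A crude way to spot this problem is to check whether any query sample falls outside the range spanned by the reference samples on each PC. The sketch below is only a rough heuristic under that assumption, not something RYE does itself, and the helper name `outside_reference_range` and the toy coordinates are made up for the example.

```python
def outside_reference_range(ref_pcs, query_pcs):
    """Return indices of query samples lying outside the per-PC min/max
    of the reference samples on at least one PC.

    ref_pcs, query_pcs: lists of equal-length sequences of PC coordinates.
    """
    n_pcs = len(ref_pcs[0])
    lows = [min(row[i] for row in ref_pcs) for i in range(n_pcs)]
    highs = [max(row[i] for row in ref_pcs) for i in range(n_pcs)]

    flagged = []
    for idx, row in enumerate(query_pcs):
        if any(row[i] < lows[i] or row[i] > highs[i] for i in range(n_pcs)):
            flagged.append(idx)
    return flagged


# Toy coordinates on two PCs: the second query sample sits beyond the references on PC1.
reference = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
query = [(0.5, 0.5), (2.0, 0.5)]

flagged = outside_reference_range(reference, query)
```

Flagged samples are worth a closer look on a PCA plot, since they may need an additional reference group.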

Thanks,
Shivam
