Result Interpretation #8

Open
Taghrid-M opened this issue Mar 18, 2024 · 1 comment

Comments

@Taghrid-M

Dears,

I am working with the Rye algorithm for genetic ancestry analysis and need guidance on a few points:

Configuring Reference and Query Samples: How do I designate samples as reference or query within my dataset? Is there a specific format or procedure for this?

Results Interpretation: How should I interpret the ancestry group fraction values for query individuals? Does the highest value indicate primary ancestry, or is there a better approach?

Sample Proportions and Minimums: Is there an optimal reference-to-query sample ratio for accurate results? Additionally, what is the minimum total number of samples required for effective analysis?

Your advice on these matters would greatly aid my research. Thank you for your support.

Many thanks,
Taghrid

@ShivamSharma13
Collaborator

Hi Taghrid,

Please check the following responses.

Configuring Reference and Query Samples
This depends on your target population. Ideally, your reference samples come from open-source datasets such as 1000 Genomes or HGDP. For example, if your query samples come from a South American country, you would select European (such as GBR, IBS), African (such as YRI, GWD), and American (such as Pima, PEL, Maya) populations as three reference groups. You can assign European, African, and American in the subgroup and group columns of the pop2group.txt file.
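As a rough illustration of the mapping described above, here is a minimal Python sketch that writes a tab-separated population-to-group table. The population codes and group names come from the example in this reply; the header names (`Pop`, `Subgroup`, `Group`) and the helper `build_pop2group` are assumptions made for illustration, so please check them against the pop2group.txt example shipped with RYE before using this.

```python
import csv
import io

# Population codes from the example above, mapped to the three suggested
# reference groups. GBR is the 1000 Genomes code for the British samples.
POP_TO_GROUP = {
    "GBR": "European",
    "IBS": "European",
    "YRI": "African",
    "GWD": "African",
    "Pima": "American",
    "PEL": "American",
    "Maya": "American",
}


def build_pop2group(pop_to_group, header=("Pop", "Subgroup", "Group")):
    """Return tab-separated text with one row per reference population.

    The header names are an assumption; the same label is written to both
    the subgroup and group columns, as suggested in the reply.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for pop, group in pop_to_group.items():
        writer.writerow([pop, group, group])
    return buf.getvalue()


pop2group_text = build_pop2group(POP_TO_GROUP)
print(pop2group_text)
```

Query populations are simply left out of this table in the sketch; whether RYE expects them to be listed elsewhere is not covered here.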

Results Interpretation
RYE generates continuous genetic ancestry estimates (0-100%). Determining primary ancestry is subjective, especially for admixed individuals. For example, a sample estimated at 100% European most likely indicates an individual of primarily European descent. An admixed sample from South America, however, could quite possibly come out as roughly 33% European, 33% African, and 33% American. In such a case, no single reference group can be assigned as the primary ancestry label, but you can place samples with similar genetic ancestry estimates under a label such as "Admixed South American".
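One possible way to turn this reasoning into a labeling rule is sketched below. It is not part of RYE: the function `label_ancestry`, the 0.8 threshold, and the fallback label are all hypothetical choices, and the right cutoff depends on your study.

```python
def label_ancestry(fractions, threshold=0.8, admixed_label="Admixed"):
    """Assign a coarse label from per-group ancestry fractions.

    fractions: dict mapping group name to a fraction between 0 and 1.
    threshold: hypothetical cutoff above which the top group is used
               as the label; otherwise the sample is called admixed.
    """
    top_group = max(fractions, key=fractions.get)
    if fractions[top_group] >= threshold:
        return top_group
    return admixed_label


# The two cases from the reply above.
mostly_european = {"European": 1.0, "African": 0.0, "American": 0.0}
three_way = {"European": 0.33, "African": 0.33, "American": 0.34}

print(label_ancestry(mostly_european))
print(label_ancestry(three_way, admixed_label="Admixed South American"))
```

Keeping the continuous fractions alongside any label is usually safer than relying on the label alone, since the cutoff is a judgment call.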

Sample Proportions and Minimums
This is important, especially for reference samples. Please read more about PCA bias: [Principal Component Analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated](https://www.nature.com/articles/s41598-022-14395-4). Ideally, keep n(reference samples) balanced between groups; otherwise PCA will start finding more granular variation within groups. A typical RYE run for query samples from South America could include 300 European, 300 African, and 200 American reference samples (you can extract them from the 1000 Genomes and HGDP open-source datasets).
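If one reference group is much larger than the others, a simple option is to downsample it before running PCA. The sketch below caps each group at a maximum size using a seeded random draw; the helper `balance_reference`, the sample IDs, and the starting group sizes are made up for illustration, and only the cap of 300 echoes the numbers above.

```python
import random
from collections import defaultdict


def balance_reference(sample_to_group, max_per_group, seed=0):
    """Return {group: sorted sample IDs}, capped at max_per_group per group.

    Groups larger than the cap are randomly downsampled with a fixed
    seed so the selection is reproducible; smaller groups are kept whole.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for sample, group in sample_to_group.items():
        by_group[group].append(sample)

    kept = {}
    for group, samples in by_group.items():
        samples = sorted(samples)
        if len(samples) > max_per_group:
            samples = rng.sample(samples, max_per_group)
        kept[group] = sorted(samples)
    return kept


# Hypothetical, unbalanced reference panel.
panel = {}
panel.update({f"EUR_{i}": "European" for i in range(500)})
panel.update({f"AFR_{i}": "African" for i in range(300)})
panel.update({f"AMR_{i}": "American" for i in range(200)})

balanced = balance_reference(panel, max_per_group=300)
print({group: len(ids) for group, ids in balanced.items()})
```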

n(query samples) can vary, but please make sure that your reference samples cover all extremes of your PCA sample distribution. For example, if some query samples have high African ancestry but no reference samples are designated to capture African ancestry, RYE will find other proxy ancestries, which may or may not be accurately interpretable.
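A crude way to spot missing reference coverage is to check whether any query sample falls outside the range spanned by the reference samples on each principal component. The sketch below does only that bounding-box check; it is a heuristic of my own for illustration, not something RYE performs, and the function name, the `margin` parameter, and the toy coordinates are hypothetical.

```python
def outside_reference_range(reference_pcs, query_pcs, margin=0.0):
    """Return indices of query samples outside the reference PC ranges.

    reference_pcs, query_pcs: lists of equal-length coordinate tuples,
    one tuple per sample and one value per principal component.
    margin: optional slack added on both sides of each PC range.
    """
    n_pcs = len(reference_pcs[0])
    lows = [min(row[i] for row in reference_pcs) - margin for i in range(n_pcs)]
    highs = [max(row[i] for row in reference_pcs) + margin for i in range(n_pcs)]

    flagged = []
    for idx, row in enumerate(query_pcs):
        if any(v < lo or v > hi for v, lo, hi in zip(row, lows, highs)):
            flagged.append(idx)
    return flagged


# Toy two-PC example: the second query sample lies beyond the reference on PC1.
reference = [(0.0, 0.0), (1.0, 1.0), (0.5, 0.2)]
query = [(0.5, 0.5), (2.0, 0.5)]
flagged = outside_reference_range(reference, query)
print(flagged)
```

Flagged samples are a prompt to look at the PCA plot and consider adding a reference group, not proof that the estimates are wrong.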

Thanks,
Shivam
