• Don't want to see ads? Install an adblocker like uBlock Origin or use a Europe-based privacy-friendly browser like Vivaldi or Mullvad.

Best Practices - ADMIXTOOLS2 Functions: extract_f2/f2_from_precomp, qpWave, qpadm

Jovialis

Advisor
Messages
9,888
Reaction score
6,794
Points
113
Ethnic group
Italian
Y-DNA haplogroup
R1b-PF7566>Y227216
mtDNA haplogroup
H6a1b7
5bqNIem.png


I used ChatGPT’s Deep Research (o3 mini-high) and Grok 3 Think to analyze Harney et al. (2019)'s PDF supplement as well as the GitHub ADMIXTOOLS2 tutorial to determine best practices for analyzing my modern WGS30x sample alongside AADR 62.0 1240K aDNA data. Over the past few days, I’ve been running tests to develop an optimal, comprehensive script that yields the most accurate results by employing F2 pre-computation, qpWave, and qpAdm, with each function’s parameters set appropriately.

Code:
# Step 1: Define file paths
prefix <- "D:\\Bioinformatics\\01_Admixtools_Dataset\\v62.0_1240K_HO_Jovialis_Plink\\v62.0_1240K_HO_Jovialis"
my_f2_dir <- "D:\\Bioinformatics\\01_Admixtools_Dataset\\my_f2_dir_v62.0_1240K_HO_Jovialis"

# Step 2: Load required libraries
library(admixtools)
library(tidyverse)

# Step 3: Define target and population lists
target <- c('Jovialis')
left <- c('Moldova_MBA_Catacomb.AG', 'Italy_Ordona_Daunian.SG', 'Greece_Crete_HgCharalambos_EMBA.AG')
right  <- c("Ethiopia_4500BP.AG", "Russia_UstIshim_IUP.DG", "Russia_Sidelkino_HG.SG", "Russia_Kostenki14_UP.AG.BY.AA", "Russia_AfontovaGora3_UP.AG", "Russia_MA1_UP.SG", "Luxembourg_Mesolithic.AG",  "Czechia_Gravettian.AG.BY.AA", "Belgium_GoyetQ116_1_UP.AG", "Spain_Magdalenian_contam.AG", "Georgia_Kotias_Mesolithic.SG", "Iran_GanjDareh_N.AG", "Turkey_Marmara_Barcin_N.AG", "Jordan_PPNB.AG", "Israel_Natufian.AG")

# Step 4: Combine all populations for f2 statistics
mypops <- c(target, left, right)

# Step 5: Generate f2 statistics for qpWave
cat("\nExtracting f2 statistics...\n")
extract_f2(prefix, my_f2_dir, pops = mypops, overwrite = TRUE, maxmiss = 1, blgsize = 0.05)
f2_blocks <- f2_from_precomp(my_f2_dir, pops = mypops, afprod = TRUE)

# Step 6: Run qpWave to validate admixture model complexity
cat("\nRunning qpWave to check rank...\n")
qpwave_results <- qpwave(f2_blocks, left = left, right = right, verbose = TRUE)

# Save qpWave results to file
write.csv(qpwave_results$rankdrop, file = "qpwave_rankdrop_results.csv", row.names = FALSE)

cat("\nqpWave Rank Drop Results:\n")
print(qpwave_results$rankdrop)

# Step 7: Determine whether to proceed with qpAdm
min_p <- min(qpwave_results$rankdrop$p, na.rm = TRUE)

if (!is.na(min_p) && min_p > 0.05) {
    cat("\nqpWave indicates that the source populations form a clade (p > 0.05). This suggests that the sources do not exhibit distinct ancestries.\n")
    cat("\nqpAdm may not provide meaningful results in this case, but we will proceed for comparison.\n")
 
    # Run qpAdm with allsnps = TRUE using genotype files
    results <- qpadm(prefix, left, right, target, allsnps = TRUE)
 
    cat("\nqpAdm Admixture Weights (allsnps = TRUE, may not be meaningful due to clade formation):\n")
    print(results$weights)
 
    cat("\nqpAdm Population Drop Results (allsnps = TRUE):\n")
    print(results$popdrop)
 
    # Save results to file
    write.csv(results$weights, file = "qpadm_weights_clade.csv", row.names = FALSE)
    write.csv(results$popdrop, file = "qpadm_popdrop_clade.csv", row.names = FALSE)
 
} else {
    cat("\nqpWave indicates that the source populations do not form a clade (p ≤ 0.05). Proceeding with qpAdm to estimate admixture proportions.\n")
 
    # Run qpAdm with allsnps = TRUE using genotype files
    results_true <- qpadm(prefix, left, right, target, allsnps = TRUE)
 
    cat("\nqpAdm Admixture Weights (allsnps = TRUE):\n")
    print(results_true$weights)
 
    cat("\nqpAdm Population Drop Results (allsnps = TRUE):\n")
    print(results_true$popdrop)
 
    # Save results to file
    write.csv(results_true$weights, file = "qpadm_weights_allsnps_TRUE.csv", row.names = FALSE)
    write.csv(results_true$popdrop, file = "qpadm_popdrop_allsnps_TRUE.csv", row.names = FALSE)
 
    # Run qpAdm with allsnps = FALSE for comparison
    results_false <- qpadm(f2_blocks, left, right, target, allsnps = FALSE)
 
    cat("\nqpAdm Admixture Weights (allsnps = FALSE):\n")
    print(results_false$weights)
 
    cat("\nqpAdm Population Drop Results (allsnps = FALSE):\n")
    print(results_false$popdrop)
 
    # Save results to file
    write.csv(results_false$weights, file = "qpadm_weights_allsnps_FALSE.csv", row.names = FALSE)
    write.csv(results_false$popdrop, file = "qpadm_popdrop_allsnps_FALSE.csv", row.names = FALSE)
}

cat("\nAnalysis complete. Check the output CSV files for detailed results.\n")
 
Last edited:
Maxmiss = 1 is what is optimal when dealing with aDNA. I have modified the script to reflect that. The reason is because aDNA is generally high in missingness, and could affect the results. Stricter parameters could be set for high-quality dna, like modern samples. (i.e. 0.2)
 
Here's an explanation of the script pipeline:

F2 Extraction:
extract_f2() creates a precomputed f2 dataset from your raw genotype data, using a lenient setting (maxmiss = 1) to retain as many SNPs as possible. This approach is optimal for aDNA because of its typically high missingness.

qpWave Analysis
qpWave uses this precomputed f2 data to determine whether the left (source) populations form a clade relative to the right (outgroup) populations. If they do not (indicated by a low p-value), it confirms that the sources are distinct and that qpAdm analysis is warranted.

qpAdm with allsnps = TRUE:
qpAdm is then run with allsnps = TRUE, which directly computes f4-statistics from the raw genotype data for each quartet. This dynamic approach maximizes SNP usage and is generally more robust for aDNA, although it may introduce some run-to-run variation. Therefore, repeating the analysis to replicate a viable model is advisable.

qpAdm with allsnps = FALSE:
qpAdm is also run using the precomputed f2 dataset (allsnps = FALSE), resulting in a fixed SNP set across all comparisons. However, this fixed method can be less statistically robust for aDNA because the filtering process might lead to a lower SNP count, even with maxmiss set to 1.
 
Last edited:
Is it true that qpAdm is only reliable for population-wide studies and not studies of individuals?
 
Is it true that qpAdm is only reliable for population-wide studies and not studies of individuals?
No, it’s not true that qpAdm is only reliable for population-wide studies. While qpAdm is mainly built for analyzing groups, it can also work for individuals. If you create a statistically solid qpAdm model for one person, it reliably shows that person’s ancestry. The difference is that population studies average out individual quirks for a broader picture, while an individual model just reflects that one genome accurately without representing the full diversity of a group.
 
No, it’s not true that qpAdm is only reliable for population-wide studies. While qpAdm is mainly built for analyzing groups, it can also work for individuals. If you create a statistically solid qpAdm model for one person, it reliably shows that person’s ancestry. The difference is that population studies average out individual quirks for a broader picture, while an individual model just reflects that one genome accurately without representing the full diversity of a group.
Thanks for that information.

Have you had a chance to gauge the accuracy of the Insights on Ancestry website which provides qpAdm reports for $33?
It gave me a qpAdm report based on two ancient populations.
 
Thanks for that information.

Have you had a chance to gauge the accuracy of the Insights on Ancestry website which provides qpAdm reports for $33?
It gave me a qpAdm report based on two ancient populations.
I've heard, I think I need to come up with my own service that charges $32! :)
 
Wow, that's kind of messed up. Frankly, I've run these models thousands of times to come up statistically robust models. Also, what kind of outgroups are they using? That is crucial.

Frankly, you are better off just following the guides here on eupedia, and experimenting with it yourself.

1741261963695.png
 
Wow, that's kind of messed up. Frankly, I've run these models thousands of times to come up statistically robust models. Also, what kind of outgroups are they using? That is crucial.

Frankly, you are better off just following the guides here on eupedia, and experimenting with it yourself.

View attachment 17922
They gave me a P-value of 0.070439 for 81.6% Italy_TarquiniaMonterozzi_IA... & 18.4% Lebanon_Medieval.

What does this tell me about my ancestry?
 
Did they provide standard errors? Sorry to say it, but that sounds like it is not worth the money. It is just one model among virtually countless that could be yielded with more robust results. Moreover, it is not historically plausible because Italians were not pure IA Latins/Etruscans until purported medieval Near eastern enrichment. There was also Aegean/Greek, and Central European enrichment that are more than likely to have taken place over the course of time. Frankly, this seems like a rip-off.

I think it is likely they are using very broad outgroups as well for a one-size fits all. But that within an of itself should be taken with great consideration. Personally, I prefer to replicate the methodology of studies that are prioritized for specifically for Italians.
 
. Moreover, it is not historically plausible because Italians were not pure IA Latins/Etruscans until purported medieval Near eastern enrichment. There was also Aegean/Greek, and Central European enrichment that are more than likely to have taken place over the course of time.
Thanks but could you walk me through this statement.

It was 81.6pc (plus or minus 2.0 standard error) for Tarquinia_Monterozzi_IA and 18.4pc (plus or minus 5.2pc standard error) for Lebanon_Medieval.
 
Here I created this script to perfectly align with the methodology of Raveane et al. 2022, which used 600K AADR, I replicated the outgroups, as well as the parameters set to qpWave which is a p-vaule of 0.1, using the ALLSNPS= TRUE setting in qpAdm analysis. The FAM is also included. They didn't mention how to set MAXMISS for the pre-computation, but typically it is set to 1 for aDNA. I also included ALLSNPS=False for comparative results, though it typically fails at least with these specifications.


Code:
# Step 1: Define file paths
prefix <- "D:\\Bioinformatics\\01_Admixtools_Dataset\\V62.0_HO_Plink_Merged_Jovialis\\v62.0_Jovialis_Merged"
my_f2_dir <- "D:\\Bioinformatics\\01_Admixtools_Dataset\\V62.0_HO_Plink_Merged_Jovialis\\my_f2_dir_v62.0_Jovialis_Merged"

# Step 2: Load required libraries
library(admixtools)
library(tidyverse)

# Step 3: Define target and population lists
target <- c('Jovialis')
left <- c( 'Ukraine_EBA_Catacomb.SG', 'Greece_Crete_HgCharalambos_EMBA.AG')
right <- c('Mota', 'AfontovaGora3', 'EHG', 'WHG', 'CHG', 'Iran_N', 'Goyet', 'Vestonice16', 'Malta1', 'ElMiron', 'Levant_N', 'Kostenki14', 'Anatolia_N', 'Natufian', 'Ust_Ishim')

# Step 4: Combine all populations for f2 statistics
mypops <- c(target, left, right)

# Step 5: Generate f2 statistics for qpWave
cat("\nExtracting f2 statistics...\n")
extract_f2(prefix, my_f2_dir, pops = mypops, overwrite = TRUE, maxmiss = 1, blgsize = 0.05)
f2_blocks <- f2_from_precomp(my_f2_dir, pops = mypops, afprod = TRUE)

# Step 6: Run qpWave to validate admixture model complexity
cat("\nRunning qpWave to check rank...\n")
qpwave_results <- qpwave(f2_blocks, left = left, right = right, verbose = TRUE)

# Save qpWave results to file
write.csv(qpwave_results$rankdrop, file = "qpwave_rankdrop_results.csv", row.names = FALSE)

cat("\nqpWave Rank Drop Results:\n")
print(qpwave_results$rankdrop)

# Step 7: Determine whether to proceed with qpAdm
min_p <- min(qpwave_results$rankdrop$p, na.rm = TRUE)

if (!is.na(min_p) && min_p > 0.01) {
    cat("\nqpWave indicates that the source populations form a clade (p > 0.01). This suggests that the sources do not exhibit distinct ancestries.\n")
    cat("\nqpAdm may not provide meaningful results in this case, but we will proceed for comparison.\n")
   
    # Run qpAdm with allsnps = TRUE using genotype files
    results <- qpadm(prefix, left, right, target, allsnps = TRUE)
   
    cat("\nqpAdm Admixture Weights (allsnps = TRUE, may not be meaningful due to clade formation):\n")
    print(results$weights)
   
    cat("\nqpAdm Population Drop Results (allsnps = TRUE):\n")
    print(results$popdrop)
   
    # Save results to file
    write.csv(results$weights, file = "qpadm_weights_clade.csv", row.names = FALSE)
    write.csv(results$popdrop, file = "qpadm_popdrop_clade.csv", row.names = FALSE)
   
} else {
    cat("\nqpWave indicates that the source populations do not form a clade (p ≤ 0.01). Proceeding with qpAdm to estimate admixture proportions.\n")
   
    # Run qpAdm with allsnps = TRUE using genotype files
    results_true <- qpadm(prefix, left, right, target, allsnps = TRUE)
   
    cat("\nqpAdm Admixture Weights (allsnps = TRUE):\n")
    print(results_true$weights)
   
    cat("\nqpAdm Population Drop Results (allsnps = TRUE):\n")
    print(results_true$popdrop)
   
    # Save results to file
    write.csv(results_true$weights, file = "qpadm_weights_allsnps_TRUE.csv", row.names = FALSE)
    write.csv(results_true$popdrop, file = "qpadm_popdrop_allsnps_TRUE.csv", row.names = FALSE)
   
    # Run qpAdm with allsnps = FALSE for comparison
    results_false <- qpadm(f2_blocks, left, right, target, allsnps = FALSE)
   
    cat("\nqpAdm Admixture Weights (allsnps = FALSE):\n")
    print(results_false$weights)
   
    cat("\nqpAdm Population Drop Results (allsnps = FALSE):\n")
    print(results_false$popdrop)
   
    # Save results to file
    write.csv(results_false$weights, file = "qpadm_weights_allsnps_FALSE.csv", row.names = FALSE)
    write.csv(results_false$popdrop, file = "qpadm_popdrop_allsnps_FALSE.csv", row.names = FALSE)
}

cat("\nAnalysis complete. Check the output CSV files for detailed results.\n")
 

Attachments

Last edited:
Thanks but could you walk me through this statement.

It was 81.6pc (plus or minus 2.0 standard error) for Tarquinia_Monterozzi_IA and 18.4pc (plus or minus 5.2pc standard error) for Lebanon_Medieval.
not sure how they determine that, but in the raw output the SE should ideally be about 0.05 or less, with 0.1 being permissible so long as the Z value is robust (above 3.0)
 
not sure how they determine that, but in the raw output the SE should ideally be about 0.05 or less, with 0.1 being permissible so long as the Z value is robust (above 3.0)
So, a meaningful qpAdm breakdown of my genetic background should contain Aegean Greek and/or Gallic or Germanic from central Europe?
 
The SE are a tad high, so this model is not optimal, nevertheless, it is starting the narrow down to historical reality.

The so-called "Langobards_2" are not actually Northern European Langobards, but samples from the Amorim et al. 2018 that overlap with modern Northern Italians/Tuscans.

1741277836347.png

1741278033782.png
 
So, a meaningful qpAdm breakdown of my genetic background should contain Aegean Greek and/or Gallic or Germanic from central Europe?
qpAdm can produce models that are statistically robust, but do not align with historical/pre-historical reality. For example I could model myself as 30% catacomb and 70% Crete_EMBA, that's not reflective of the historical reality.
 
qpAdm can produce models that are statistically robust, but do not align with historical/pre-historical reality. For example I could model myself as 30% catacomb and 70% Crete_EMBA, that's not reflective of the historical reality.
So G25 is a better match to historical reality?
 
So G25 is a better match to historical reality?
No, G25 is basically qpAdm, without any the ability to utilize crucial metrics such as standard errors/Z value, Chisq/dof determine statistical robustness. Thus, while it can sometimes replicate the proportions that are shown in qpAdm, there's no way of knowing if it is actually valid unless you replicate it in qpAdm to observe those metrics. There's no way to know if a model is overfitting, because you cannot see those metrics to make sure it meets the 0.05-1 pval criteria in qpAdm. The reason why it is similar is because it is actually a derivative of SMARTPCA utilizing AADR data. However, even as a PCA, the reason why it looks different from Academic PCAs is due to the fact that it is including aDNA along with modern samples in the projection. When in fact the standard practice in academic PCAs is to have aDNA defer to modern DNA in a projection, to maintain consistency.

 
Last edited:
I use this model as a test to see if it is working correctly; this is using the Raveane et al. 2022 replication script:

1741284331490.png
 
Back
Top