• Don't want to see ads? Install an adblocker like uBlock Origin or use a Europe-based privacy-friendly browser like Vivaldi or Mullvad.

Admixtools admixtools2 TUTORIAL for WINDOWS.

its a p value extremely close to 1 good? seems a bit fishy to me, I got high standard errors tho so maybe thats why
Figure_2.png
 
its a p value extremely close to 1 good? seems a bit fishy to me, I got high standard errors tho so maybe thats whyView attachment 16961
You have to aim for 0.05/5% and below of SE (But sometimes is not possible to get that, still try it or at least as low as possible); Z-score (Between 3 to -3 at least for AdmixTools1) and 0.05 to 1 of p-value.

Ps: Use EEF (Germany EN LBK) instead Turkey Barcin, it improves the model.
 
Last edited:
Z-score (Between 3 to -3 at least for AdmixTools1)
Where did you read that? Z-score should be over 1.5 or 2 to be significant, the more the better, more Z-score = more robustness/certainty.
It estimates how much given ancestry proportions differ from the expected value (https://en.wikipedia.org/wiki/Expected_value).
I think its impossible to get a bad Z if your P-value and SEs are good tho, so not something I pay a lot of attention to.
 
Where did you read that? Z-score should be over 1.5 or 2 to be significant, the more the better, more Z-score = more robustness/certainty.
It estimates how much given ancestry proportions differ from the expected value (https://en.wikipedia.org/wiki/Expected_value).
I think its impossible to get a bad Z if your P-value and SEs are good tho, so not something I pay a lot of attention to.
When I first started using it, that was the advice some users gave me, this page (https://a-genetics.blogspot.com/2021/11/qpadm-101.html?m=1) and the AI told me the same thing and something that has to be taken in consideration.

And yes, it happened to me in AdmixTools1 that if I don't choose the best right hand set the individual Z-scores are high (12 or more) even when the model has a good p-value and SE.

So I try to be in the range 3 to -3 searching pops and usually my p-value and SE get better too.
 
Last edited:
When I first started using it, that was the advice some users gave me, this page (https://a-genetics.blogspot.com/2021/11/qpadm-101.html?m=1) and the AI told me the same thing and something that has to be taken in consideration.

And yes, it happened to me in AdmixTools1 that if I don't choose the best right hand set the individual Z-scores are high (12 or more) even when the model has a good p-value and SE.

So I try to be in the range 3 to -3 searching pops and usually my p-value and SE get better too.
haha you are a bit confused, the Z-score in qpAdm is calculated by dividing the weight between the standard error(calculate it yourself, you will see how it matches), so of course you want the z score to be high! imagine your model gets 0.54 or 54% of Turkey_Barcin and only 0.0180/1.8% of SE, the Z-score would be 30! you probably noticed you get higher z scores in the largest/most certain contributor to the models (logically)
The link you posted talks about Dstats and comparing samples directly to others, to check whether one is underrepresented relative to another
lol I really would like to see those models of yours
 
Hi.
So I installed and got qpAdm running without having too many problems but merging my files with the AADR dataset is proving to be impossible. First I tried merging the files directly in the EIGENSTRAT format(with -mergeit) but didnt get very far, then I read here about converting both the AADR dataset and your personal files in PLINK format to merge them and I followed the @Jovialis tutorial (post #201), which did work but I encountered another problem when converting the merged files back to EIGENSTRAT (as far as I know qpAdm only works with this format except if your are using AT2), it leaves out me out!
This is the output:
View attachment 16915
Any idea why this happens? maybe my original RAW DNA file has poor coverage(MyHeritage, 600k SNPs)
View attachment 16916
^here is the problem I believe
I've had some of the same issues merging in Plink which is why I tend to prefer merging in MergeIt which is generally more robust. However I find the v62.0 of the AADR is much more stable even when merging in Plink. BTW those warnings you've highlighted usually aren't a problem for uses discussed here.
 
I've had some of the same issues merging in Plink which is why I tend to prefer merging in MergeIt which is generally more robust. However I find the v62.0 of the AADR is much more stable even when merging in Plink. BTW those warnings you've highlighted usually aren't a problem for uses discussed here.
At the end I did the merge with both plink and eigensoft, the two work okay.
Where did you get hold of v62.0?
edit:
 
Personally, I don't know how to convert back from plink to Eigenstrat; every time I try I get a massively bloated file. I've asked around and haven't found anyone that knows either. I would love to figure it out, so I could start making PCAs in smartpca in the Eigensoft suite, which is what academic studies use.

How big is the resulting file that is produced? Mine is usually 4x bigger than the original geno file, but the other files are the same size.

However, I haven't encountered the issue of it not recognizing mine.

If you figure this out, or if someone knows, please share.
I've noticed the massive size increase with Eigenstrat files as well. A thing to note is that the AADR downloads are actually in PackedAncestryMap format– not Eigenstrat. Both Eigestrat and PAM formats use the same file suffixes – .geno, .snp and .ind – but the only difference is with the .geno file. PackedAncestryMap format uses a binary-encoded .geno file which is why it is the same size as the binary-encoded PackedPed (aka Plink 1) .bed file. The Eigenstrat .geno is not binary encoded and is about 4x the size.

I've also noticed that when merging my individual dna sample in Eigenstrat format with the AADR in PackedAncestryMap format using MergeIt that it sometimes outputs the merged file as PAM format even if I set the output in the parfile as Eigenstrat. It seems to do this with the HO versions of the dataset.
 
Last edited:
haha you are a bit confused, the Z-score in qpAdm is calculated by dividing the weight between the standard error(calculate it yourself, you will see how it matches), so of course you want the z score to be high! imagine your model gets 0.54 or 54% of Turkey_Barcin and only 0.0180/1.8% of SE, the Z-score would be 30! you probably noticed you get higher z scores in the largest/most certain contributor to the models (logically)
The link you posted talks about Dstats and comparing samples directly to others, to check whether one is underrepresented relative to another
lol I really would like to see those models of yours
We are confusing us here, I mean the dscores, gendstats, worst Z-score with right hand mix, not the Z-scores in weight/SE; yeah, I should have specified in my first post, but when I sent you the link, you knew which Z I meant.

Anyway, I have been searching the Z in weight/SE in the AdmixTools1 file but I haven't seen them, at least I know how to calculate it now lol so thanks for that.
 
Last edited:
A quick guide:

Code:
Here’s a detailed guide to subsetting a dataset and performing PCA using SmartPCA (from ADMIXTOOLS):

### **Step 1: Prepare Your Files**
Ensure you have the following files:
1. **`.geno`** file: Contains genotype data.
2. **`.snp`** file: Contains SNP information (chromosome, position, etc.).
3. **`.ind`** file: Contains individual IDs and population assignments.
4. **`Sample_List.txt`**: List of populations or individuals to include in the subset.

---

### **Step 2: Create the Subset**
To create a subset of your dataset, follow these steps:

#### **2.1. Create a `par` file for conversion**
This file specifies the input and output for subsetting. Example `par` file:

```
genotypename:    /path/to/original.geno
snpname:         /path/to/original.snp
indivname:       /path/to/original.ind
outputformat:    EIGENSTRAT
genotypeoutname: /path/to/subset.geno
snpoutname:      /path/to/subset.snp
indivoutname:    /path/to/subset.ind
poplistname:     /path/to/Sample_List.txt
```

#### **2.2. Run `convertf` to create the subset**
Use the following command to run the conversion:

```bash
convertf -p /path/to/par.file
```

---

### **Step 3: Run PCA with SmartPCA**

#### **3.1. Create the SmartPCA `par` file**
This file contains parameters for PCA analysis. Example `par` file:

```
genotypename:    /path/to/subset.geno
snpname:         /path/to/subset.snp
indivname:       /path/to/subset.ind
evecoutname:     /path/to/projected.evec.txt
evaloutname:     /path/to/projected.eval.txt
lsqproject:      YES
numoutevec:      10
poplistname:     /path/to/Sample_List.txt
```

- **`evecoutname`**: Output eigenvectors.
- **`evaloutname`**: Output eigenvalues.
- **`lsqproject`**: Projects individuals onto principal components.
- **`numoutevec`**: Number of principal components to output.

#### **3.2. Run SmartPCA**
Execute the following command:

```bash
smartpca -p /path/to/smartpca.params.txt > /path/to/smartpca.log
```

---

### **Step 4: Analyze and Visualize the PCA Results in R**

Once SmartPCA completes, use the generated `.evec` and `.eval` files for visualization.

#### **4.1. Load Required Libraries**
```r
library(ggplot2)
library(dplyr)
```

#### **4.2. Load Eigenvalues and Eigenvectors**
```r
# Set working directory
setwd("/path/to/output/files")

# Read the eigenvalues
evals <- scan("projected.eval.txt")

# Read the eigenvectors
evecs <- read.table("projected.evec.txt", header = FALSE, stringsAsFactors = FALSE)

# Extract individual IDs and population labels
individuals <- evecs$V1
populations <- evecs$V12  # Adjust if necessary
```

#### **4.3. Plot PCA in R**
```r
# Extract PCs
pc1 <- evecs$V2
pc2 <- evecs$V3

# Create data frame
pca_data <- data.frame(Individual = individuals, Population = populations, PC1 = pc1, PC2 = pc2)

# Plot
ggplot(pca_data, aes(x = PC1, y = PC2, color = Population)) +
  geom_point(size = 2) +
  labs(
    title = "PCA Projection",
    x = paste0("PC1 (", round(evals[1] / sum(evals) * 100, 2), "% variance)"),
    y = paste0("PC2 (", round(evals[2] / sum(evals) * 100, 2), "% variance)")
  ) +
  theme_minimal()
```

---

### **Summary of Commands**
1. **Subsetting**:
   ```bash
   convertf -p /path/to/par.file
   ```

2. **Run PCA**:
   ```bash
   smartpca -p /path/to/smartpca.params.txt > /path/to/smartpca.log
   ```

3. **Visualize in R**:
   Follow the provided R script to load and plot the results.
 
Last edited:
Sample_List.txt

It is literally a text file that lists the samples you want to retain in the subset:

Examples:


Code:
Jovialis
Armenian.HO
Iranian.HO
Turkish.HO
Albanian.HO
Italian_North.HO
Bulgarian.HO
Cypriot.HO
Greek.HO
Italian_South.HO
Maltese.HO
Sicilian.HO
Italian_Central.HO
English.HO
French.HO
Icelandic.HO
Norwegian.HO
Orcadian.HO
Scottish.HO
BedouinA.HO
BedouinB.HO
Jordanian.HO
Palestinian.HO
Saudi.HO
Syrian.HO
Abkhasian.HO
Adygei.HO
Balkar.HO
Chechen.HO
Georgian.HO
Kumyk.HO
Lezgin.HO
Russia_NorthOssetian.HO
Jew_Ashkenazi.HO
Jew_Georgian.HO
Jew_Iranian.HO
Jew_Iraqi.HO
Jew_Libyan.HO
Jew_Moroccan.HO
Jew_Tunisian.HO
Jew_Turkish.HO
Jew_Yemenite.HO
Basque.HO
Spanish.HO
Spanish_North.HO
Druze.HO
Lebanese.HO
Belarusian.HO
Croatian.HO
Czech.HO
Estonian.HO
Hungarian.HO
Lithuanian.HO
Ukrainian.HO
IBS_CanaryIslands.DG
Sardinian.HO
Finnish.HO
Mordovian.HO
Russian.HO
Greece_Crete_Chania_LBA.AG
Italy_PianSultano_BA.SG
 
Example of smartpca.params.txt

Those are just really long file paths, you could have a single folder if you'd like, containing the files you need.

Code:
genotypename:    /mnt/d/Bioinformatics/01_Admixtools_Dataset/V62.0_HO_Eigenstrat_Merged_Jovialis/Modern_aDNA_Subset1/subset.geno
snpname:         /mnt/d/Bioinformatics/01_Admixtools_Dataset/V62.0_HO_Eigenstrat_Merged_Jovialis/Modern_aDNA_Subset1/subset.snp
indivname:       /mnt/d/Bioinformatics/01_Admixtools_Dataset/V62.0_HO_Eigenstrat_Merged_Jovialis/Modern_aDNA_Subset1/subset.ind
evecoutname:     /mnt/d/Bioinformatics/01_Admixtools_Dataset/V62.0_HO_Eigenstrat_Merged_Jovialis/Modern_aDNA_Subset1/projected.evec.txt
evaloutname:     /mnt/d/Bioinformatics/01_Admixtools_Dataset/V62.0_HO_Eigenstrat_Merged_Jovialis/Modern_aDNA_Subset1/projected.eval.txt
lsqproject:      YES
poplistname:     /mnt/d/Bioinformatics/01_Admixtools_Dataset/V62.0_HO_Eigenstrat_Merged_Jovialis/Modern_aDNA_Subset1/Sample_List.txt
 
Example of an R Script to project the PCA in Rstudio:

Note, this is just a work in progress, I'll probably refine this and make it less clunky. But this is stuff I've experimented thus far. Good thing to remember, is that changing the familyIDs in the IND file isn't as straight forward as the FAM. Thus some of the aDNA I renamed didn't project. Which is why I am not currently presenting one at the moment.


Code:
# Load necessary libraries
library(ggplot2)
library(dplyr)

# Set working directory
setwd("D:/Bioinformatics/01_Admixtools_Dataset/V62.0_HO_Eigenstrat_Merged_Jovialis/Modern_aDNA_Subset1")

# Read the eigenvalues
evals <- scan("projected.eval.txt")

# Read the eigenvectors
evecs <- read.table("projected.evec.txt", header = FALSE, stringsAsFactors = FALSE)

# Extract individual IDs and population labels
individuals <- evecs$V1
populations <- evecs$V12  # Adjust if your population labels are in a different column

# Extract the first two principal components and flip both axes
pc1 <- -evecs$V2  # Horizontal flip
pc2 <- -evecs$V3  # Vertical flip

# Create a data frame for plotting
pca_data <- data.frame(Individual = individuals, Population = populations, PC1 = pc1, PC2 = pc2)

# Define populations to filter out
excluded_pops <- c(
  "EEHG", "W_Anatolia_N", "CHG", "ANE", "WHG", "Serbia_IronGates_Mesolithic.AG",
  "Italy_Epigravettian.AG", "Germany_BellBeaker.AG", "Italy_IA_Republic.SG",
  "Italy_Tarquinia_Etruscan.SG", "Italy_IA_Republic_oLatini",
  "Italy_IA_Republic_oPrenestini", "Italy_Broion_BA.SG",
  "Italy_ReginaMargherita_BA.SG", "Italy_PianSultano_BA.SG", "Italy_BA.SG",
  "Italy_Ordona_Daunian.SG", "Italy_Salapia_Daunian.SG", "Italy_SanGiovanni_IA.SG",
  "Italy_Sicily_IA_Sicani.AG", "Italy_Sicily_IA.AG", "Italy_Sicily_EBA.AG",
  "Italy_Sicily_LBA.AG", "Italy_Sicily_Himera_409BCE.AG", "Italy_Sicily_Himera_480BCE.AG",
  "Italy_Sicily_Himera_480BCE_Greek.AG", "Italy_Sicily_Himera_480BCE_Balkan.AG",
  "Italy_Sicily_Himera_480BCE_NEurope.AG", "Italy_Sicily_Himera_480BCE_Sarmatian.AG",
  "Italy_Sicily_Himera_480BCE_Caucasus.AG", "Italy_Sicily_Himera_GenPop.AG",
  "Italy_Sicily_Himera_Classical_oAegean.AG", "Italy_Sicily_Himera_Classical_1.AG",
  "Italy_Sardinia_N.AG", "Italy_Imperial.SG", "Italy_LateAntiquity_oCentralEuropean.SG",
  "Greece_Aidonia_LBA.AG", "Greece_GlykaNera_LBA.AG", "Greece_Crete_HgCharalambos_EMBA.AG",
  "Greece_Crete_Aposelemis_N.AG", "Greece_Koukounaries_LBA.AG",
  "Greece_Lazarides_EBA.AG", "Greece_Lazarides_LBA.AG", "Greece_Crete_Chania_LBA.AG",
  "Greece_Mygdalia_LBA.AG", "Greece_NeaStyra_EBA.AG", "Greece_Tiryns_LBA.AG",
  "Greece_Tiryns_IA.AG", "Greece_PalaceOfNestor_BA.AG", "Greece_Theopetra_EBA.SG",
  "Greece_Perachora_EBA.SG", "Greece_Sarakenos_EBA.SG", "Greece_Koufonisi_Cycladic_EBA.SG",
  "Greece_Logkas_MBA.SG", "Greece_Manika_EBA.SG", "Greece_Crete_Kephala_Petras.SG",
  "Greece_North_N.SG", "Greece_Attica_Kolikrepi_BA.AG", "Greece_Crete_Zakros_BA.AG",
  "Greece_Kastrouli_BA.AG", "Greece_Kastrouli_IA.AG", "Greece_PalaceOfNestor_EIA.AG",
  "Greece_NeaNikomedeia_EN.SG", "Greece_Crete_Krousonas_LBA.SG"
)

# Filter out excluded populations
pca_data <- pca_data %>% filter(!Population %in% excluded_pops)

# Define highlighted populations
highlighted_pops <- c(
  "Greece_Salamis_BA.AG", "Greece_Galatas_BA.AG", "Greece_Peristeria_BA.AG",
  "Greece_Crete_Armenoi_BA.AG", "Greece_Crete_Odigitria_BA.AG",
  "Turkey_Mediterranean_Isparta_BA.AG", "Jovialis",
  "Russia_Samara_EBA_Yamnaya.AG", "Greece_Peloponnese_N.AG",
  "Iran_GanjDareh_N.AG", "Israel_Natufian.AG",
  "Turkey_Marmara_Barcin_Chalcolithic.AG"
)

# Assign colors for highlighted populations and gray for others
pca_data <- pca_data %>%
  mutate(
    Color = ifelse(Population %in% highlighted_pops, Population, "Other")
  )

custom_colors <- c(
  "Greece_Salamis_BA.AG" = "blue", "Greece_Galatas_BA.AG" = "green",
  "Greece_Peristeria_BA.AG" = "purple", "Greece_Crete_Armenoi_BA.AG" = "orange",
  "Greece_Crete_Odigitria_BA.AG" = "cyan",
  "Turkey_Mediterranean_Isparta_BA.AG" = "red",
  "Jovialis" = "magenta",
  "Russia_Samara_EBA_Yamnaya.AG" = "pink",
  "Greece_Peloponnese_N.AG" = "brown",
  "Iran_GanjDareh_N.AG" = "yellow",
  "Israel_Natufian.AG" = "darkgreen",
  "Turkey_Marmara_Barcin_Chalcolithic.AG" = "darkblue",
  "Other" = "gray"
)

# Plot the PCA
ggplot(pca_data, aes(x = PC1, y = PC2, color = Color)) +
  geom_point(size = 2) +
  scale_color_manual(values = custom_colors) +
  labs(
    title = "PCA Projection of Highlighted Populations (Flipped Axes)",
    x = paste0("PC1 (", round(evals[1] / sum(evals) * 100, 2), "% variance)"),
    y = paste0("PC2 (", round(evals[2] / sum(evals) * 100, 2), "% variance)")
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    legend.title = element_blank(),
    plot.title = element_text(hjust = 0.5)
  )
 
Seems like it is best to produce PCAs in smartpca with 1240K, because of higher resolution, and is consistent with what aDNA studies use.
 
hey, got it working first try but it didnt add me
edit:
View attachment 17002
Make sure the the ind file's name for your sample is unaltered as it is originally. Otherwise it could me there's not enough snps, some samples like CHG do not project in HO, due to small amount of SNPs. I will endeavor to make a 1240k pca, because that's what is commonly used in studies. Unfortunately their modern set is less comprehensive.
 
Are you sure it isn't projecting? Typically it would not appear on the legend if it didn't. Try to change the symbol to distinguish it from others.

Edit: I see your sample in the Spanish cluster

Great job btw!
 
Sample_List.txt

It is literally a text file that lists the samples you want to retain in the subset:

Examples:


Code:
Jovialis
Armenian.HO
Iranian.HO
Turkish.HO
Albanian.HO
Italian_North.HO
Bulgarian.HO
Cypriot.HO
Greek.HO
Italian_South.HO
Maltese.HO
Sicilian.HO
Italian_Central.HO
English.HO
French.HO
Icelandic.HO
Norwegian.HO
Orcadian.HO
Scottish.HO
BedouinA.HO
BedouinB.HO
Jordanian.HO
Palestinian.HO
Saudi.HO
Syrian.HO
Abkhasian.HO
Adygei.HO
Balkar.HO
Chechen.HO
Georgian.HO
Kumyk.HO
Lezgin.HO
Russia_NorthOssetian.HO
Jew_Ashkenazi.HO
Jew_Georgian.HO
Jew_Iranian.HO
Jew_Iraqi.HO
Jew_Libyan.HO
Jew_Moroccan.HO
Jew_Tunisian.HO
Jew_Turkish.HO
Jew_Yemenite.HO
Basque.HO
Spanish.HO
Spanish_North.HO
Druze.HO
Lebanese.HO
Belarusian.HO
Croatian.HO
Czech.HO
Estonian.HO
Hungarian.HO
Lithuanian.HO
Ukrainian.HO
IBS_CanaryIslands.DG
Sardinian.HO
Finnish.HO
Mordovian.HO
Russian.HO
Greece_Crete_Chania_LBA.AG
Italy_PianSultano_BA.SG
Which pops did you use to set the eigenvectors?
 
Back
Top