Admixtools admixtools2 TUTORIAL for WINDOWS.

baeticvs · Nov 17, 2024

its a p value extremely close to 1 good? seems a bit fishy to me, I got high standard errors tho so maybe thats why

Jalisciense · Nov 17, 2024

baeticvs said:
its a p value extremely close to 1 good? seems a bit fishy to me, I got high standard errors tho so maybe thats whyView attachment 16961

You have to aim for 0.05/5% and below of SE (But sometimes is not possible to get that, still try it or at least as low as possible); Z-score (Between 3 to -3 at least for AdmixTools1) and 0.05 to 1 of p-value.

Ps: Use EEF (Germany EN LBK) instead Turkey Barcin, it improves the model.

baeticvs · Nov 18, 2024

Jalisciense said:
Z-score (Between 3 to -3 at least for AdmixTools1)

Where did you read that? Z-score should be over 1.5 or 2 to be significant, the more the better, more Z-score = more robustness/certainty.
It estimates how much given ancestry proportions differ from the expected value (https://en.wikipedia.org/wiki/Expected_value).
I think its impossible to get a bad Z if your P-value and SEs are good tho, so not something I pay a lot of attention to.

Jalisciense · Nov 18, 2024

baeticvs said:
Where did you read that? Z-score should be over 1.5 or 2 to be significant, the more the better, more Z-score = more robustness/certainty.
It estimates how much given ancestry proportions differ from the expected value (https://en.wikipedia.org/wiki/Expected_value).
I think its impossible to get a bad Z if your P-value and SEs are good tho, so not something I pay a lot of attention to.

When I first started using it, that was the advice some users gave me, this page (https://a-genetics.blogspot.com/2021/11/qpadm-101.html?m=1) and the AI told me the same thing and something that has to be taken in consideration.

And yes, it happened to me in AdmixTools1 that if I don't choose the best right hand set the individual Z-scores are high (12 or more) even when the model has a good p-value and SE.

So I try to be in the range 3 to -3 searching pops and usually my p-value and SE get better too.

baeticvs · Nov 19, 2024

Jalisciense said:
When I first started using it, that was the advice some users gave me, this page (https://a-genetics.blogspot.com/2021/11/qpadm-101.html?m=1) and the AI told me the same thing and something that has to be taken in consideration.

And yes, it happened to me in AdmixTools1 that if I don't choose the best right hand set the individual Z-scores are high (12 or more) even when the model has a good p-value and SE.

So I try to be in the range 3 to -3 searching pops and usually my p-value and SE get better too.

haha you are a bit confused, the Z-score in qpAdm is calculated by dividing the weight between the standard error(calculate it yourself, you will see how it matches), so of course you want the z score to be high! imagine your model gets 0.54 or 54% of Turkey_Barcin and only 0.0180/1.8% of SE, the Z-score would be 30! you probably noticed you get higher z scores in the largest/most certain contributor to the models (logically)
The link you posted talks about Dstats and comparing samples directly to others, to check whether one is underrepresented relative to another
lol I really would like to see those models of yours

Celtion · Nov 19, 2024

baeticvs said:
Hi.
So I installed and got qpAdm running without having too many problems but merging my files with the AADR dataset is proving to be impossible. First I tried merging the files directly in the EIGENSTRAT format(with -mergeit) but didnt get very far, then I read here about converting both the AADR dataset and your personal files in PLINK format to merge them and I followed the @Jovialis tutorial (post #201), which did work but I encountered another problem when converting the merged files back to EIGENSTRAT (as far as I know qpAdm only works with this format except if your are using AT2), it leaves out me out!
This is the output:
View attachment 16915
Any idea why this happens? maybe my original RAW DNA file has poor coverage(MyHeritage, 600k SNPs)
View attachment 16916
^here is the problem I believe

I've had some of the same issues merging in Plink which is why I tend to prefer merging in MergeIt which is generally more robust. However I find the v62.0 of the AADR is much more stable even when merging in Plink. BTW those warnings you've highlighted usually aren't a problem for uses discussed here.

baeticvs · Nov 19, 2024

Celtion said:
I've had some of the same issues merging in Plink which is why I tend to prefer merging in MergeIt which is generally more robust. However I find the v62.0 of the AADR is much more stable even when merging in Plink. BTW those warnings you've highlighted usually aren't a problem for uses discussed here.

At the end I did the merge with both plink and eigensoft, the two work okay.
Where did you get hold of v62.0?
edit:

The Allen Ancient DNA Resource (AADR): A curated compendium of ancient human genomes - David Reich Lab Dataverse

Celtion · Nov 19, 2024

Jovialis said:
Personally, I don't know how to convert back from plink to Eigenstrat; every time I try I get a massively bloated file. I've asked around and haven't found anyone that knows either. I would love to figure it out, so I could start making PCAs in smartpca in the Eigensoft suite, which is what academic studies use.

How big is the resulting file that is produced? Mine is usually 4x bigger than the original geno file, but the other files are the same size.

However, I haven't encountered the issue of it not recognizing mine.

If you figure this out, or if someone knows, please share.

I've noticed the massive size increase with Eigenstrat files as well. A thing to note is that the AADR downloads are actually in PackedAncestryMap format– not Eigenstrat. Both Eigestrat and PAM formats use the same file suffixes – .geno, .snp and .ind – but the only difference is with the .geno file. PackedAncestryMap format uses a binary-encoded .geno file which is why it is the same size as the binary-encoded PackedPed (aka Plink 1) .bed file. The Eigenstrat .geno is not binary encoded and is about 4x the size.

I've also noticed that when merging my individual dna sample in Eigenstrat format with the AADR in PackedAncestryMap format using MergeIt that it sometimes outputs the merged file as PAM format even if I set the output in the parfile as Eigenstrat. It seems to do this with the HO versions of the dataset.

Celtion · Nov 19, 2024

baeticvs said:
At the end I did the merge with both plink and eigensoft, the two work okay.
Where did you get hold of v62.0?View attachment 16988

link to v62.0 is here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/FFIDCW

Jalisciense · Nov 19, 2024

baeticvs said:
haha you are a bit confused, the Z-score in qpAdm is calculated by dividing the weight between the standard error(calculate it yourself, you will see how it matches), so of course you want the z score to be high! imagine your model gets 0.54 or 54% of Turkey_Barcin and only 0.0180/1.8% of SE, the Z-score would be 30! you probably noticed you get higher z scores in the largest/most certain contributor to the models (logically)
The link you posted talks about Dstats and comparing samples directly to others, to check whether one is underrepresented relative to another
lol I really would like to see those models of yours

We are confusing us here, I mean the dscores, gendstats, worst Z-score with right hand mix, not the Z-scores in weight/SE; yeah, I should have specified in my first post, but when I sent you the link, you knew which Z I meant.

Anyway, I have been searching the Z in weight/SE in the AdmixTools1 file but I haven't seen them, at least I know how to calculate it now lol so thanks for that.

Jovialis · Nov 19, 2024

I figured out how to produce PCAs with Smartpca. I will provide a tutorial when I get a chance:

https://twitter.com/x/status/1858618996981198968

Jovialis · Nov 19, 2024

A quick guide:

Code:

Here’s a detailed guide to subsetting a dataset and performing PCA using SmartPCA (from ADMIXTOOLS):

### **Step 1: Prepare Your Files**
Ensure you have the following files:
1. **`.geno`** file: Contains genotype data.
2. **`.snp`** file: Contains SNP information (chromosome, position, etc.).
3. **`.ind`** file: Contains individual IDs and population assignments.
4. **`Sample_List.txt`**: List of populations or individuals to include in the subset.

---

### **Step 2: Create the Subset**
To create a subset of your dataset, follow these steps:

#### **2.1. Create a `par` file for conversion**
This file specifies the input and output for subsetting. Example `par` file:

```
genotypename:    /path/to/original.geno
snpname:         /path/to/original.snp
indivname:       /path/to/original.ind
outputformat:    EIGENSTRAT
genotypeoutname: /path/to/subset.geno
snpoutname:      /path/to/subset.snp
indivoutname:    /path/to/subset.ind
poplistname:     /path/to/Sample_List.txt
```

#### **2.2. Run `convertf` to create the subset**
Use the following command to run the conversion:

```bash
convertf -p /path/to/par.file
```

---

### **Step 3: Run PCA with SmartPCA**

#### **3.1. Create the SmartPCA `par` file**
This file contains parameters for PCA analysis. Example `par` file:

```
genotypename:    /path/to/subset.geno
snpname:         /path/to/subset.snp
indivname:       /path/to/subset.ind
evecoutname:     /path/to/projected.evec.txt
evaloutname:     /path/to/projected.eval.txt
lsqproject:      YES
numoutevec:      10
poplistname:     /path/to/Sample_List.txt
```

- **`evecoutname`**: Output eigenvectors.
- **`evaloutname`**: Output eigenvalues.
- **`lsqproject`**: Projects individuals onto principal components.
- **`numoutevec`**: Number of principal components to output.

#### **3.2. Run SmartPCA**
Execute the following command:

```bash
smartpca -p /path/to/smartpca.params.txt > /path/to/smartpca.log
```

---

### **Step 4: Analyze and Visualize the PCA Results in R**

Once SmartPCA completes, use the generated `.evec` and `.eval` files for visualization.

#### **4.1. Load Required Libraries**
```r
library(ggplot2)
library(dplyr)
```

#### **4.2. Load Eigenvalues and Eigenvectors**
```r
# Set working directory
setwd("/path/to/output/files")

# Read the eigenvalues
evals <- scan("projected.eval.txt")

# Read the eigenvectors
evecs <- read.table("projected.evec.txt", header = FALSE, stringsAsFactors = FALSE)

# Extract individual IDs and population labels
individuals <- evecs$V1
populations <- evecs$V12  # Adjust if necessary
```

#### **4.3. Plot PCA in R**
```r
# Extract PCs
pc1 <- evecs$V2
pc2 <- evecs$V3

# Create data frame
pca_data <- data.frame(Individual = individuals, Population = populations, PC1 = pc1, PC2 = pc2)

# Plot
ggplot(pca_data, aes(x = PC1, y = PC2, color = Population)) +
  geom_point(size = 2) +
  labs(
    title = "PCA Projection",
    x = paste0("PC1 (", round(evals[1] / sum(evals) * 100, 2), "% variance)"),
    y = paste0("PC2 (", round(evals[2] / sum(evals) * 100, 2), "% variance)")
  ) +
  theme_minimal()
```

---

### **Summary of Commands**
1. **Subsetting**:
   ```bash
   convertf -p /path/to/par.file
   ```

2. **Run PCA**:
   ```bash
   smartpca -p /path/to/smartpca.params.txt > /path/to/smartpca.log
   ```

3. **Visualize in R**:
   Follow the provided R script to load and plot the results.

Jovialis · Nov 19, 2024

Sample_List.txt

It is literally a text file that lists the samples you want to retain in the subset:

Examples:

Code:

Jovialis
Armenian.HO
Iranian.HO
Turkish.HO
Albanian.HO
Italian_North.HO
Bulgarian.HO
Cypriot.HO
Greek.HO
Italian_South.HO
Maltese.HO
Sicilian.HO
Italian_Central.HO
English.HO
French.HO
Icelandic.HO
Norwegian.HO
Orcadian.HO
Scottish.HO
BedouinA.HO
BedouinB.HO
Jordanian.HO
Palestinian.HO
Saudi.HO
Syrian.HO
Abkhasian.HO
Adygei.HO
Balkar.HO
Chechen.HO
Georgian.HO
Kumyk.HO
Lezgin.HO
Russia_NorthOssetian.HO
Jew_Ashkenazi.HO
Jew_Georgian.HO
Jew_Iranian.HO
Jew_Iraqi.HO
Jew_Libyan.HO
Jew_Moroccan.HO
Jew_Tunisian.HO
Jew_Turkish.HO
Jew_Yemenite.HO
Basque.HO
Spanish.HO
Spanish_North.HO
Druze.HO
Lebanese.HO
Belarusian.HO
Croatian.HO
Czech.HO
Estonian.HO
Hungarian.HO
Lithuanian.HO
Ukrainian.HO
IBS_CanaryIslands.DG
Sardinian.HO
Finnish.HO
Mordovian.HO
Russian.HO
Greece_Crete_Chania_LBA.AG
Italy_PianSultano_BA.SG

Jovialis · Nov 19, 2024

Example of smartpca.params.txt

Those are just really long file paths, you could have a single folder if you'd like, containing the files you need.

Code:

genotypename:    /mnt/d/Bioinformatics/01_Admixtools_Dataset/V62.0_HO_Eigenstrat_Merged_Jovialis/Modern_aDNA_Subset1/subset.geno
snpname:         /mnt/d/Bioinformatics/01_Admixtools_Dataset/V62.0_HO_Eigenstrat_Merged_Jovialis/Modern_aDNA_Subset1/subset.snp
indivname:       /mnt/d/Bioinformatics/01_Admixtools_Dataset/V62.0_HO_Eigenstrat_Merged_Jovialis/Modern_aDNA_Subset1/subset.ind
evecoutname:     /mnt/d/Bioinformatics/01_Admixtools_Dataset/V62.0_HO_Eigenstrat_Merged_Jovialis/Modern_aDNA_Subset1/projected.evec.txt
evaloutname:     /mnt/d/Bioinformatics/01_Admixtools_Dataset/V62.0_HO_Eigenstrat_Merged_Jovialis/Modern_aDNA_Subset1/projected.eval.txt
lsqproject:      YES
poplistname:     /mnt/d/Bioinformatics/01_Admixtools_Dataset/V62.0_HO_Eigenstrat_Merged_Jovialis/Modern_aDNA_Subset1/Sample_List.txt

Jovialis · Nov 19, 2024

Example of an R Script to project the PCA in Rstudio:

Note, this is just a work in progress, I'll probably refine this and make it less clunky. But this is stuff I've experimented thus far. Good thing to remember, is that changing the familyIDs in the IND file isn't as straight forward as the FAM. Thus some of the aDNA I renamed didn't project. Which is why I am not currently presenting one at the moment.

Code:

# Load necessary libraries
library(ggplot2)
library(dplyr)

# Set working directory
setwd("D:/Bioinformatics/01_Admixtools_Dataset/V62.0_HO_Eigenstrat_Merged_Jovialis/Modern_aDNA_Subset1")

# Read the eigenvalues
evals <- scan("projected.eval.txt")

# Read the eigenvectors
evecs <- read.table("projected.evec.txt", header = FALSE, stringsAsFactors = FALSE)

# Extract individual IDs and population labels
individuals <- evecs$V1
populations <- evecs$V12  # Adjust if your population labels are in a different column

# Extract the first two principal components and flip both axes
pc1 <- -evecs$V2  # Horizontal flip
pc2 <- -evecs$V3  # Vertical flip

# Create a data frame for plotting
pca_data <- data.frame(Individual = individuals, Population = populations, PC1 = pc1, PC2 = pc2)

# Define populations to filter out
excluded_pops <- c(
  "EEHG", "W_Anatolia_N", "CHG", "ANE", "WHG", "Serbia_IronGates_Mesolithic.AG",
  "Italy_Epigravettian.AG", "Germany_BellBeaker.AG", "Italy_IA_Republic.SG",
  "Italy_Tarquinia_Etruscan.SG", "Italy_IA_Republic_oLatini",
  "Italy_IA_Republic_oPrenestini", "Italy_Broion_BA.SG",
  "Italy_ReginaMargherita_BA.SG", "Italy_PianSultano_BA.SG", "Italy_BA.SG",
  "Italy_Ordona_Daunian.SG", "Italy_Salapia_Daunian.SG", "Italy_SanGiovanni_IA.SG",
  "Italy_Sicily_IA_Sicani.AG", "Italy_Sicily_IA.AG", "Italy_Sicily_EBA.AG",
  "Italy_Sicily_LBA.AG", "Italy_Sicily_Himera_409BCE.AG", "Italy_Sicily_Himera_480BCE.AG",
  "Italy_Sicily_Himera_480BCE_Greek.AG", "Italy_Sicily_Himera_480BCE_Balkan.AG",
  "Italy_Sicily_Himera_480BCE_NEurope.AG", "Italy_Sicily_Himera_480BCE_Sarmatian.AG",
  "Italy_Sicily_Himera_480BCE_Caucasus.AG", "Italy_Sicily_Himera_GenPop.AG",
  "Italy_Sicily_Himera_Classical_oAegean.AG", "Italy_Sicily_Himera_Classical_1.AG",
  "Italy_Sardinia_N.AG", "Italy_Imperial.SG", "Italy_LateAntiquity_oCentralEuropean.SG",
  "Greece_Aidonia_LBA.AG", "Greece_GlykaNera_LBA.AG", "Greece_Crete_HgCharalambos_EMBA.AG",
  "Greece_Crete_Aposelemis_N.AG", "Greece_Koukounaries_LBA.AG",
  "Greece_Lazarides_EBA.AG", "Greece_Lazarides_LBA.AG", "Greece_Crete_Chania_LBA.AG",
  "Greece_Mygdalia_LBA.AG", "Greece_NeaStyra_EBA.AG", "Greece_Tiryns_LBA.AG",
  "Greece_Tiryns_IA.AG", "Greece_PalaceOfNestor_BA.AG", "Greece_Theopetra_EBA.SG",
  "Greece_Perachora_EBA.SG", "Greece_Sarakenos_EBA.SG", "Greece_Koufonisi_Cycladic_EBA.SG",
  "Greece_Logkas_MBA.SG", "Greece_Manika_EBA.SG", "Greece_Crete_Kephala_Petras.SG",
  "Greece_North_N.SG", "Greece_Attica_Kolikrepi_BA.AG", "Greece_Crete_Zakros_BA.AG",
  "Greece_Kastrouli_BA.AG", "Greece_Kastrouli_IA.AG", "Greece_PalaceOfNestor_EIA.AG",
  "Greece_NeaNikomedeia_EN.SG", "Greece_Crete_Krousonas_LBA.SG"
)

# Filter out excluded populations
pca_data <- pca_data %>% filter(!Population %in% excluded_pops)

# Define highlighted populations
highlighted_pops <- c(
  "Greece_Salamis_BA.AG", "Greece_Galatas_BA.AG", "Greece_Peristeria_BA.AG",
  "Greece_Crete_Armenoi_BA.AG", "Greece_Crete_Odigitria_BA.AG",
  "Turkey_Mediterranean_Isparta_BA.AG", "Jovialis",
  "Russia_Samara_EBA_Yamnaya.AG", "Greece_Peloponnese_N.AG",
  "Iran_GanjDareh_N.AG", "Israel_Natufian.AG",
  "Turkey_Marmara_Barcin_Chalcolithic.AG"
)

# Assign colors for highlighted populations and gray for others
pca_data <- pca_data %>%
  mutate(
    Color = ifelse(Population %in% highlighted_pops, Population, "Other")
  )

custom_colors <- c(
  "Greece_Salamis_BA.AG" = "blue", "Greece_Galatas_BA.AG" = "green",
  "Greece_Peristeria_BA.AG" = "purple", "Greece_Crete_Armenoi_BA.AG" = "orange",
  "Greece_Crete_Odigitria_BA.AG" = "cyan",
  "Turkey_Mediterranean_Isparta_BA.AG" = "red",
  "Jovialis" = "magenta",
  "Russia_Samara_EBA_Yamnaya.AG" = "pink",
  "Greece_Peloponnese_N.AG" = "brown",
  "Iran_GanjDareh_N.AG" = "yellow",
  "Israel_Natufian.AG" = "darkgreen",
  "Turkey_Marmara_Barcin_Chalcolithic.AG" = "darkblue",
  "Other" = "gray"
)

# Plot the PCA
ggplot(pca_data, aes(x = PC1, y = PC2, color = Color)) +
  geom_point(size = 2) +
  scale_color_manual(values = custom_colors) +
  labs(
    title = "PCA Projection of Highlighted Populations (Flipped Axes)",
    x = paste0("PC1 (", round(evals[1] / sum(evals) * 100, 2), "% variance)"),
    y = paste0("PC2 (", round(evals[2] / sum(evals) * 100, 2), "% variance)")
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    legend.title = element_blank(),
    plot.title = element_text(hjust = 0.5)
  )

Jovialis · Nov 20, 2024

Seems like it is best to produce PCAs in smartpca with 1240K, because of higher resolution, and is consistent with what aDNA studies use.

baeticvs · Nov 20, 2024

hey, got it working first try but it didnt add me
edit:

Jovialis · Nov 20, 2024

baeticvs said:
hey, got it working first try but it didnt add me
edit:
View attachment 17002

Make sure the the ind file's name for your sample is unaltered as it is originally. Otherwise it could me there's not enough snps, some samples like CHG do not project in HO, due to small amount of SNPs. I will endeavor to make a 1240k pca, because that's what is commonly used in studies. Unfortunately their modern set is less comprehensive.

Jovialis · Nov 20, 2024

Are you sure it isn't projecting? Typically it would not appear on the legend if it didn't. Try to change the symbol to distinguish it from others.

Edit: I see your sample in the Spanish cluster

Great job btw!

Celtion · Nov 20, 2024

Jovialis said:

Sample_List.txt

It is literally a text file that lists the samples you want to retain in the subset:

Examples:

Code:

Jovialis
Armenian.HO
Iranian.HO
Turkish.HO
Albanian.HO
Italian_North.HO
Bulgarian.HO
Cypriot.HO
Greek.HO
Italian_South.HO
Maltese.HO
Sicilian.HO
Italian_Central.HO
English.HO
French.HO
Icelandic.HO
Norwegian.HO
Orcadian.HO
Scottish.HO
BedouinA.HO
BedouinB.HO
Jordanian.HO
Palestinian.HO
Saudi.HO
Syrian.HO
Abkhasian.HO
Adygei.HO
Balkar.HO
Chechen.HO
Georgian.HO
Kumyk.HO
Lezgin.HO
Russia_NorthOssetian.HO
Jew_Ashkenazi.HO
Jew_Georgian.HO
Jew_Iranian.HO
Jew_Iraqi.HO
Jew_Libyan.HO
Jew_Moroccan.HO
Jew_Tunisian.HO
Jew_Turkish.HO
Jew_Yemenite.HO
Basque.HO
Spanish.HO
Spanish_North.HO
Druze.HO
Lebanese.HO
Belarusian.HO
Croatian.HO
Czech.HO
Estonian.HO
Hungarian.HO
Lithuanian.HO
Ukrainian.HO
IBS_CanaryIslands.DG
Sardinian.HO
Finnish.HO
Mordovian.HO
Russian.HO
Greece_Crete_Chania_LBA.AG
Italy_PianSultano_BA.SG

Which pops did you use to set the eigenvectors?

Admixtools admixtools2 TUTORIAL for WINDOWS.

Regular Member

Regular Member

Regular Member

Regular Member

Regular Member

Regular Member

Regular Member

Regular Member

Regular Member

Regular Member

Advisor

Advisor

Advisor

Advisor

Advisor

Advisor

Regular Member

Advisor

Advisor

Regular Member