Using Admixtools2 to model admixture

Tautalus · Jan 19, 2024

In the previous post I said that the trickiest part of all this is merging our own data with the Reich Lab data.
I was wrong. The software is easy; the trickiest part is finding the best combinations of populations for qpAdm.
It takes test after test to find them. Luckily, I had published studies to go on.
For this analysis I relied mainly on the study The genomic history of the Iberian Peninsula over the past 8000 years (Olalde et al., 2019).

So far the best values I've found for myself are these:
51% from Portugal EBA (a subset I analyzed separately)
30% from a population with France_Beaker-like ancestry (probably related to the LBA Urnfield culture)
14% from Italy_Imperial (central/eastern Mediterranean ancestry arrived during the Roman Empire)
3% Iberomaurusian

Code:

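library(admixtools)

# path prefix of the merged dataset (assumed here: merged.geno / merged.snp / merged.ind)
prefix = 'merged'

target = 'Tautalus'
# left = candidate source populations, right = reference populations (outgroups)
left = c('Portugal_EBA','France_BellBeaker','Italy_Imperial_o5.SG','Morocco_Iberomaurusian_TAF012')
right = c('Mbuti.DG','Ethiopia_4500BP.SG','Czech_Vestonice16','Belgium_UP_GoyetQ116_1','France_NouvelleAquitaine_Mesolithic.SG','Italy_North_Villabruna_HG','Karitiana.DG','Papuan.DG','Iran_GanjDareh_N','Russia_Boisman_MN','Czech_CordedWare','Netherlands_EIA','Turkey_Arslantepe_LateC','Israel_C','ONG.SG')
results = qpadm(prefix, left, right, target, allsnps = TRUE)
results$weights   # estimated admixture weights with standard errors
results$popdrop   # fit statistics for the full model and for models with sources dropped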

Portugal EBA itself models as 68% Chalcolithic Portugal and 32% Germany BellBeaker, and it should represent the main ancestry of the Lusitanian peoples. I used the same reference populations for that model.
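
As a rough illustration of what those two models imply together, the 51% Portugal EBA share can be split by Portugal EBA's own makeup. This is only a sketch: it assumes the proportions compose by simple multiplication and ignores standard errors, and the dictionary keys are descriptive names, not dataset labels.

```python
# Split the 51% Portugal EBA share by Portugal EBA's own modeled makeup.
# Assumption: proportions compose multiplicatively; standard errors are ignored.
portugal_eba_share = 0.51
portugal_eba_makeup = {
    "Chalcolithic Portugal": 0.68,
    "Germany BellBeaker": 0.32,
}

implied = {pop: portugal_eba_share * frac for pop, frac in portugal_eba_makeup.items()}

for pop, frac in implied.items():
    print(f"{pop}: {frac:.1%}")
```

The two implied shares add back up to the original 51%, which is a quick sanity check on the arithmetic.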

Who Cares? · Jan 21, 2024

Tautalus said:
The trickiest part of all this is the merging of our own data with the data from the Reich Lab.

There are several ways of doing it; here is one of the fastest and simplest.

It converts your raw data from 23andMe format to bed format, then to geno format, and then merges it with the Reich data.

All these instructions assume that the programs and packages needed, whether you run them in DOS or in WSL, are already installed.
The file names can be whatever you want.

1) Convert raw data in 23andme format to bed file (DOS session)
plink --allow-no-sex --alleleACGT --23file 23andMe.txt --make-bed --out outfile

This will produce 3 essential files: a bed, a bim and a fam file. In the fam file you can replace the -9 with 1.
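
If you'd rather not edit the fam file by hand, the replacement can be scripted. A minimal sketch, assuming the usual six-column fam layout where the last column is the phenotype; the function name and the sample row are made up for illustration.

```python
def set_phenotype(fam_text, value="1"):
    """Return .fam content with the 6th column (phenotype) set to `value`."""
    out = []
    for line in fam_text.splitlines():
        fields = line.split()
        if len(fields) == 6:
            fields[5] = value
            out.append(" ".join(fields))
        else:
            out.append(line)  # leave unexpected lines untouched
    return "\n".join(out) + "\n"


# Hypothetical one-sample row; the IDs are placeholders.
example = "FAM001 ID001 0 0 1 -9\n"
print(set_phenotype(example), end="")
```

To apply it to the real file, read outfile.fam, pass its content through the function, and write the result back.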

2) Convert the bed file to geno (Eigenstrat format) (Wsl session)
You need a parameter file, with whatever name you want. I named mine par.BED2GENO.par; its contents are:

genotypename: outfile.bed
snpname: outfile.bim
indivname: outfile.fam
outputformat: EIGENSTRAT
genotypeoutname: outfile.geno
snpoutname: outfile.snp
indivoutname: outfile.ind

Once the parameter file is done, execute the command: convertf -p par.BED2GENO.par
This will produce 3 files: a geno, an ind and a snp file. In the ind file you can replace “Control” with your own name or alias.
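
The label change can also be scripted. A small sketch, assuming each ind line has three whitespace-separated columns (sample ID, sex, label); the function name and the sample row are hypothetical.

```python
def relabel_ind(ind_text, new_label, old_label="Control"):
    """Replace the label (3rd column) in EIGENSTRAT .ind content."""
    out = []
    for line in ind_text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[2] == old_label:
            fields[2] = new_label
            out.append(" ".join(fields))
        else:
            out.append(line)  # anything else is kept as it was
    return "\n".join(out) + "\n"


# Hypothetical row; the sample ID is a placeholder.
print(relabel_ind("ID001 M Control\n", "Tautalus"), end="")
```

The label you choose here is the name you later pass as the target in qpadm.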

3) Merge your data with the Reich data (In Wsl)
You need a parameter file. I named mine par.MERGEGENO.par; its contents are:

geno1: outfile.geno
snp1: outfile.snp
ind1: outfile.ind
geno2: v54.1.p1_1240K_public.geno
snp2: v54.1.p1_1240K_public.snp
ind2: v54.1.p1_1240K_public.ind
genooutfilename: merged.geno
snpoutfilename: merged.snp
indoutfilename: merged.ind
outputformat: EIGENSTRAT
docheck: YES
hashcheck: YES
strandcheck: YES

Then execute the command : mergeit -p par.MERGEGENO.par

According to the documentation, mergeit merges two data sets into a third, which has the union of the individuals and the intersection of the SNPs in the first two. In other words, the merged file keeps only the SNPs that exist in both files and discards all the rest. The result has fewer SNPs than the original Reich data, though not necessarily a smaller file size, and it has all the info you need to model your admixture.
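
The "union of individuals, intersection of SNPs" rule can be pictured with a toy example. This is only an illustration of the rule as described, not of how mergeit works internally; it matches SNPs by ID alone and ignores the allele and strand checks. All IDs below are made up.

```python
def merge_shape(snps_a, inds_a, snps_b, inds_b):
    """Toy model of the merge: SNPs are intersected, individuals are combined."""
    in_b = set(snps_b)
    seen = set(inds_a)
    common_snps = [s for s in snps_a if s in in_b]
    all_inds = list(inds_a) + [i for i in inds_b if i not in seen]
    return common_snps, all_inds


# Hypothetical SNP and individual IDs.
my_snps = ["rs1", "rs2", "rs3"]
ref_snps = ["rs2", "rs3", "rs4", "rs5"]
snps, inds = merge_shape(my_snps, ["Tautalus"], ref_snps, ["ref_ind_1", "ref_ind_2"])
print(len(snps), "SNPs kept:", snps)
print(len(inds), "individuals:", inds)
```

Here rs1, rs4 and rs5 are dropped because each appears in only one of the two inputs, while every individual is kept.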

I compared the test results of this file with the test results of a merged file with all of Reich's data plus my data and they are identical.
This merge is the step that takes the longest, between half an hour and an hour depending on the computer.

And that's it. After this process the merged files are ready to be used by qpadm. You now have two datasets: the original Reich data, for testing all the populations, and your merged files, for testing your own admixture.

I assume you used Linux for step 2, as the only tools I could find are these:

GitHub - argriffing/eigensoft: principal components population genetics analysis on linux

and

GitHub - DReichLab/EIG: Eigen tools by Nick Patterson and Alkes Price lab

Tautalus · Jan 21, 2024

Yes, WSL is Windows Subsystem for Linux.
Just run these commands to install convertf and mergeit :
sudo apt update
sudo apt -y install eigensoft

Jovialis · Jan 21, 2024

Working in Ubuntu WSL (or any terminal for that matter, like Powershell) was extremely alien and esoteric to me. But ever since I've utilized AI to help me with it, I feel excited to use it, and it is comfortable.

Who Cares? · Jan 21, 2024

Tautalus said:
Yes, WSL is Windows Subsystem for Linux.
Just run these commands to install convertf and mergeit :
sudo apt update
sudo apt -y install eigensoft

I tried running it with both a VCF file and a 23andMe file, but it would not merge properly for me (the merged.geno file is smaller than v54.1.p1_HO_public.geno).

Ivorix · Jan 22, 2024

Someone did mine too:

59.4 AHG
23.0 EHG
13.4 CHG
3.6 WHG
0.6 Iran_N

------------------------

55.2 EEF
38.0 Yamnaya
6.8 WHG

Tautalus · Jan 22, 2024

I merged two different raw data files: one from 23andMe, which ended up smaller than the Reich file, and another from Ancestry (in 23andMe format), which ended up bigger. The Ancestry one is the file I was referring to in the post when I said that it got bigger, in size, than the Reich file.
It's bigger because it has many more SNPs in common with the Reich file: 379784. It's the one I normally use with qpAdm.
My 23andMe file had only 137289 SNPs in common with the Reich file, which is why it is smaller in size. So it's normal.
If you have a file from Ancestry, that's better. If you only have one from 23andMe, then those are the only SNPs qpAdm will work with when you're modeling your admixture.
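
A back-of-the-envelope sketch of why the file size tracks the SNP count. It assumes merged.geno is plain EIGENSTRAT text (one line per SNP, one character per individual), as set by outputformat in the parameter file, and that the Reich .geno is distributed in a packed format at roughly 2 bits per genotype. Both the individual count and the Reich SNP count below are placeholders, not real figures.

```python
import math


def eigenstrat_geno_bytes(n_snps, n_ind):
    # One text line per SNP: one character per individual plus a newline.
    return n_snps * (n_ind + 1)


def packed_geno_bytes(n_snps, n_ind):
    # Roughly 2 bits per genotype, i.e. 4 individuals per byte (header ignored).
    return n_snps * math.ceil(n_ind / 4)


N_IND = 16000          # placeholder, not the real number of individuals
REICH_SNPS = 1_240_000  # nominal figure suggested by the "1240K" name, not an exact count

for label, n_snps in [("Ancestry merge", 379784), ("23andMe merge", 137289)]:
    gb = eigenstrat_geno_bytes(n_snps, N_IND) / 1e9
    print(f"{label}: about {gb:.1f} GB as EIGENSTRAT text")

print(f"Reich file: about {packed_geno_bytes(REICH_SNPS, N_IND) / 1e9:.1f} GB if packed")
```

Under these assumptions a merged file with fewer SNPs can still come out larger than a packed Reich file, which may explain why one merge ended up bigger and the other smaller.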

Who Cares? · Jan 22, 2024

Ivorix said:
Someone did mine too:

59.4 AHG
23.0 EHG
13.4 CHG
3.6 WHG
0.6 Iran_N

------------------------

55.2 EEF
38.0 Yamnaya
6.8 WHG

I just want to do it so that we can end the stupid debate I had with some Albanian members in another thread.
Somebody there claimed that the South Slavs living in areas near Albania (the western parts of North Macedonia and southeastern Serbia) killed off the Balkan population upon their arrival, and that South Slavs today have no significant genetic heritage from the native Balkan people (let's call them that) who lived there before the Slavs appeared.

Who Cares? · Jan 22, 2024

Tautalus said:
I merged two different raw data files: one from 23andMe, which ended up smaller than the Reich file, and another from Ancestry (in 23andMe format), which ended up bigger. The Ancestry one is the file I was referring to in the post when I said that it got bigger, in size, than the Reich file.
It's bigger because it has many more SNPs in common with the Reich file: 379784. It's the one I normally use with qpAdm.
My 23andMe file had only 137289 SNPs in common with the Reich file, which is why it is smaller in size. So it's normal.
If you have a file from Ancestry, that's better. If you only have one from 23andMe, then those are the only SNPs qpAdm will work with when you're modeling your admixture.

I have a VCF file with 95% coverage, ~2 million SNPs and about 400 MB in size.

I tried running:
plink --allow-no-sex --alleleACGT --vcf input.vcf.gz --make-bed --out outfile
but it did not work well, so I added the --aec flag to allow extended chromosomes, and I also restricted it to chromosomes 1 to 22 only. This gave me working files, but when I ran mergeit I noticed the merged file is smaller.
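
For reference, a sketch of what the adjusted command might look like. The exact flags used are not shown in this post, so this is an assumption: --allow-extra-chr is the long form of --aec, and --chr 1-22 is one way to keep only chromosomes 1 to 22 in PLINK 1.9. The snippet only builds and prints the command; it does not run plink.

```python
import shlex


def plink_vcf_command(vcf_path, out_prefix):
    """Build (but do not run) a PLINK command for VCF to bed conversion."""
    return [
        "plink",
        "--vcf", vcf_path,
        "--allow-no-sex",
        "--alleleACGT",
        "--allow-extra-chr",   # long form of --aec
        "--chr", "1-22",       # assumed way of restricting to chromosomes 1-22
        "--make-bed",
        "--out", out_prefix,
    ]


cmd = plink_vcf_command("input.vcf.gz", "outfile")
print(shlex.join(cmd))
```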

I also tried the 23andMe file and the same thing happened when I ran mergeit.

Tautalus · Jan 22, 2024

Who Cares? said:
I have a VCF file with 95% coverage, ~2 million SNPs and about 400 MB in size.

I tried running:
plink --allow-no-sex --alleleACGT --vcf input.vcf.gz --make-bed --out outfile
but it did not work well, so I added the --aec flag to allow extended chromosomes, and I also restricted it to chromosomes 1 to 22 only. This gave me working files, but when I ran mergeit I noticed the merged file is smaller.

I also tried the 23andMe file and the same thing happened when I ran mergeit.

Open the snp file with a text editor and check how many SNPs it has.
Mergeit only joins the common SNPs in both files.
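
If the file is too big for a text editor, the count can be scripted. A minimal sketch, assuming the usual EIGENSTRAT snp layout where the first column is the SNP ID and the second is the chromosome; the rows below are made-up toy data.

```python
from collections import Counter


def snp_summary(snp_text):
    """Count SNPs in EIGENSTRAT .snp content, in total and per chromosome."""
    per_chrom = Counter()
    for line in snp_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            per_chrom[fields[1]] += 1
    return sum(per_chrom.values()), dict(per_chrom)


# Toy rows: ID, chromosome, genetic position, physical position, two alleles.
toy = "rs1 1 0.0 1000 A G\nrs2 1 0.0 2000 C T\nrs3 2 0.0 1500 G A\n"
total, by_chrom = snp_summary(toy)
print(total, by_chrom)
```

For the real file, read merged.snp into a string and pass it to snp_summary; the total is the number of SNPs that survived the merge.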

I haven't worked with VCF files yet, I don't know the structure.
Have you tried converting the VCF to 23andme and working with that file?
You can do this, for example, with the DNA Kit Studio from DNAGenics.

Who Cares? · Jan 22, 2024

Tautalus said:
Open the snp file with a text editor and check how many SNPs it has.
Mergeit only joins the common SNPs in both files.

I haven't worked with VCF files yet, I don't know the structure.
Have you tried converting the VCF to 23andme and working with that file?
You can do this, for example, with the DNA Kit Studio from DNAGenics.

As I said, I tried with a 23andMe file as well. I can convert a BAM file to 23andMe format with WGS Extract, and I tried using that, but I had the same problem with the merged files being smaller than Reich's file.
I'll look into it in the coming days if I can find some time, and I'll let you know the outcome.

Tautalus · Jan 22, 2024

Who Cares? said:
As I said, I tried with a 23andMe file as well. I can convert a BAM file to 23andMe format with WGS Extract, and I tried using that, but I had the same problem with the merged files being smaller than Reich's file.
I'll look into it in the coming days if I can find some time, and I'll let you know the outcome.

Ok, I assumed the VCF and 23andme were two different files from two different companies.

One thing I did to confirm that Mergeit worked well was to import all of Reich's SNPs and those in my files into an Access database and validate the common SNPs between them with SQL queries.
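
The same kind of check can be sketched with SQLite in place of Access. This is not the query used above, just one way to count the shared SNP IDs with a join; the table names, column name and IDs are made up.

```python
import sqlite3


def common_snp_count(ids_a, ids_b):
    """Count SNP IDs present in both lists using an in-memory SQL join."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE mine (snp_id TEXT PRIMARY KEY)")
    con.execute("CREATE TABLE reich (snp_id TEXT PRIMARY KEY)")
    # OR IGNORE skips duplicate IDs instead of raising an error.
    con.executemany("INSERT OR IGNORE INTO mine VALUES (?)", [(s,) for s in ids_a])
    con.executemany("INSERT OR IGNORE INTO reich VALUES (?)", [(s,) for s in ids_b])
    (count,) = con.execute(
        "SELECT COUNT(*) FROM mine JOIN reich USING (snp_id)"
    ).fetchone()
    con.close()
    return count


# Hypothetical SNP IDs.
print(common_snp_count(["rs1", "rs2", "rs3"], ["rs2", "rs3", "rs4"]))
```

If the count from the join matches the number of lines in merged.snp, the merge kept exactly the shared SNPs, at least as far as matching by ID goes.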

Jalisciense · Oct 31, 2024

Tautalus said:
qpAdm models a target population as a mixture of left (source) populations, given a set of right (reference) populations.
Choosing the reference populations can significantly impact the results of the qpAdm analysis, so it is important to make this selection well. So I made some adjustments to the model according to the following qpAdm assumptions and requirements.

The fundamental assumptions of qpAdm are:

1) “There are no gene flows between lineages unique to candidate source populations (post their divergence from the actual admixing populations) and the reference populations”, that is, no gene flow occurs between source and reference populations following the split of the source population from the true lineage that participated in the admixture event.
2) “There are no gene flows from the fully formed target lineage to reference populations.”

It is crucial to select reference populations that are genetically distinct and not directly ancestral to the target population to avoid biasing admixture estimates. Ideally, as said before, the reference populations should have no gene flow connecting them to the private lineages of the candidate source populations (after their divergence from the true admixing populations).
Likewise, there should be no gene flows from the fully formed target lineage to reference populations.

The qpAdm method also requires differential relatedness, that is, it requires that at least one population in the reference set is differentially related to those in the source set, which “means that at least some reference populations must be more closely related to some source populations than to others.”

Code:

target = 'Tautalus'
left = c('Germany_EN_LBK_Stuttgart.DG','Russia_Samara_EBA_Yamnaya','Luxembourg_Loschbour.DG')
right = c('Mbuti.DG', 'Ethiopia_4500BP.SG', 'Han.DG', 'Papuan.DG', 'Karitiana.DG', 'Georgia_Satsurblia.SG', 'Iran_GanjDareh_N', 'Jordan_PPNB','Russia_Kostenki14.SG','Russia_Ust_Ishim.DG','Armenia_LBA.SG', 'ONG.SG')
results = qpadm(prefix, left, right, target, allsnps = TRUE)
results$weights
results$popdrop

The p-value is better, but it is not a definitive model. There is a lot to learn and improve; it is a work in progress.

Once again similar values to what I get with G25.

What if you use Iberomaurusian or Morocco_EN too (4-way)?

Tautalus · Oct 31, 2024

Jalisciense said:
What if you use Iberomaurusian or Morocco_EN too (4-way)?

Taforalt does not improve the model. The p-value is almost identical. What happens is that it takes a small percentage away from EEF.
Although they are different components and not exactly equivalent, there are links between the two, for example through Dzudzuana.

Jalisciense · Nov 1, 2024

Tautalus said:
Taforalt does not improve the model. The p-value is almost identical. What happens is that it takes a small percentage away from EEF.
Although they are different components and not exactly equivalent, there are links between the two, for example through Dzudzuana.

I see bro, but what percentage did you get, and what SE and Z-score? And which outgroups (right pops) did you use when you added Iberomaurusian to the left?

Tautalus · Nov 3, 2024

Jalisciense said:
I see bro, but what percentage did you get, and what SE and Z-score? And which outgroups (right pops) did you use when you added Iberomaurusian to the left?

The outgroups are the same as in the previous example.

Jalisciense · Nov 4, 2024

Tautalus said:
The outgroups are the same as in the previous example.

Nice, did you use qpAdm rotation?

Tautalus · Nov 4, 2024

Jalisciense said:
Nice, did you use qpAdm rotation?

If you define qpAdm rotation as testing different models, rotating populations to find the best fit of sources (left pops) and outgroups/references (right pops) for the target, then yes, I used it, but not iteratively; I tested each model one at a time.

Jalisciense · Nov 4, 2024

Tautalus said:
If you define qpAdm rotation as testing different models, rotating populations to find the best fit of sources (left pops) and outgroups/references (right pops) for the target, then yes, I used it, but not iteratively; I tested each model one at a time.

Cool bro, but how did you do it? Is there code to do the rotation, or how?

Tautalus · Nov 5, 2024

Jalisciense said:
Cool bro, but how did you do it? Is there code to do the rotation, or how?

As I told you, I didn't do it iteratively, that is, through a program that rotates the populations automatically. I chose and changed the populations manually.
But there are functions that let you automate this population rotation, like qpAdm_rotation() in the admixr package.
You can also test multiple models this way:
https://www.eupedia.com/forum/threads/multi-qpadm-automation-script.45272/
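
To give an idea of what an automated rotation enumerates, here is a small sketch. It is not how qpAdm_rotation() or the linked script is implemented; it only builds the left/right lists under one common rotation scheme, where the candidates not used as sources are moved to the right set. Each pair would still have to be passed to qpadm() in R. The function name and the choice of fixed outgroups are illustrative.

```python
from itertools import combinations


def rotating_models(candidates, fixed_right, n_sources):
    """Yield (left, right): chosen sources go left, unused candidates join the right set."""
    for left in combinations(candidates, n_sources):
        right = list(fixed_right) + [p for p in candidates if p not in left]
        yield list(left), right


# Population labels taken from the models in this thread; the split is illustrative.
candidates = [
    "Germany_EN_LBK_Stuttgart.DG",
    "Russia_Samara_EBA_Yamnaya",
    "Luxembourg_Loschbour.DG",
    "Morocco_Iberomaurusian_TAF012",
]
fixed_right = ["Mbuti.DG", "Han.DG", "Papuan.DG"]

for left, right in rotating_models(candidates, fixed_right, n_sources=3):
    print("left :", left)
    print("right:", right)
```

With four candidates and three sources per model this produces four models, one for each candidate left out of the sources.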
