Colloquy re Use of STRs to Predict SNPs

In August 2014, Professor Anatole Klyosov and Jeff Wexler engaged in the following colloquy concerning the use of a limited number of STRs to predict the SNPs of R1a1a Ashkenazi Levites.

* * *

From Professor Anatole Klyosov to Meir G. Gover re Gover's chart using DYS650, DYS537, and DYS495 to predict whether R1a1a Ashkenazi Levites are Y2630+ or Y2630- and YP264/YP265+ or YP264/YP265-:

This is a good table / data. I am not familiar with YP265, but it seems that it is synonymous with YP264, with it being quite recent, as I remember (around 550 +/- 100 years before present).

As you know, there are quite a few "undertyped" individuals, particularly marked as M198, M17, M417, SRY, etc. They do not have their "terminal SNP" determined; therefore, they cannot be compared with other "relatives." Other individuals, with "deep SNPs," can be easily grouped into several "families."

I do not quite understand the significance of picking just three DYS, namely 537, 650 and 495. Why those three? They all have rather "fast" ("medium" would be a more appropriate term) mutation rates, 0.00080, 0.00580, and 0.00120 mutations per marker per 25 years (a conditional generation), respectively. The “slowest” is DYS472 (mutation rate 0.00001); a "slow" STR is DYS426 (0.00009). In other words, it is no wonder that the picked DYS have mutated among R1a1a Ashkenazi Levites in ranges such as 11-12-13 (DYS537), 17-18-19-20 (DYS650), and 16-17-18 (DYS495). Please notice that the fastest marker, DYS650, mutates over the four-step range (17-18-19-20), while the two slower markers (DYS537 and DYS495) mutate only over the three-step range (11-12-13 and 16-17-18). Again, what is the point of choosing those three markers? Why not present all of the 111-marker haplotypes?

Anyway, the table is good and useful for "grouping" listed individuals in terms of their "clans."

* * *

From Jeff Wexler to Professor Klyosov:

Meir Gover forwarded me your e-mail and suggested that I respond to you since I was the one who identified DYS650, DYS537, and DYS495 as being of particular significance in clustering R1a1a Ashkenazi Levites according to their patterns of STRs (and, as it turns out, according to their SNPs).

You are correct that YP265 is at the same level as YP264; based upon additional testing, those SNPs are now believed to date back to about 615 to 705 years ago, a slight increase in the time to an MRCA. (YSEQ decided to offer a la carte testing of YP265 rather than YP264 because of the difficulty in distinguishing between the X allele and the Y allele at the site of YP264; FTDNA prefers YP264, which it apparently can test accurately.)

I initially focused on these STRs because of my observations of patterns of marker values among R1a1a Ashkenazi Levites. (I've now compiled the STR marker values of nearly 90 R1a1a Ashkenazi Levites who have tested to 111 markers and more than 200 R1a1a Ashkenazi Levites who have tested to 67 markers.)

I saw that men with DYS650=20 and DYS537=12 generally had few deviations from the R1a1a Ashkenazi Levite modal haplotype, while men with DYS650<20 and DYS537=11 were generally far from the modal haplotype. I hypothesized that this split in marker values characterizes an early divide among R1a1a Ashkenazi Levites. (DYS537 seems to back-mutate with some frequency -- indeed, one of Meir's sons has a back mutation on DYS537 -- so I think that DYS537 is less reliable than DYS650.)

I also saw that DYS495=16 seemed to be characteristic of known descendants of the Horowitz branch.

As SNP results came in, they supported the hypothesis that these marker values are very useful in predicting the SNPs of R1a1a Ashkenazi Levites. The second chart on this page of my website evidences this correlation: [link omitted]

I think that the other STRs shown on that page (and some other SNPs not listed on that page because of formatting problems) will prove significant in identifying branches among Y2619+ Y2630- men. My initial efforts at clustering those men are posted here: [link omitted]

I think that we'll see three significant lines of R1a1a Ashkenazi Levites, sharing a common ancestor perhaps 1,350 to 1,600 years ago: (1) the cluster of Y2619+ Y2630+ men who share an MRCA about 1,050 to 1,240 years ago; and (2) two distinct lines of Y2619+ Y2630- men, identified as tentative Clusters A and B on my website. (Those SNP-based dates are consistent with your STR-based dates; because the SNP-based dates are based on a very small sample of full Y-DNA SNP testing, they're still quite tentative.)

Thanks for your time and your comments.

* * *

From Professor Klyosov to Jeff Wexler:

Thank you for your explanations. Let me consider the most important (as it seems to me) of them in the context of the study.

First, what does your study aim at? As I see it, the primary goal is to separate the R1a Jewish male population into five principal groups:

(1) Those who have the terminal SNP M582 and/or F1345 (synonyms are Z2472, FGC99, and F2997), and who are direct descendants of a common ancestor who lived around 4000 years before present. That common ancestor descended along the Y-DNA lineage Z93 --> L342.2 --> Z2124 --> Z2122 --> M582.

(2) Those who have the terminal SNP CTS6, downstream of M582 and/or F1345, and who are direct descendants of a common ancestor who lived around 2500 years before present.

(3) Those who have the terminal SNP Y2619, downstream of CTS6, and who are direct descendants of a common ancestor who lived around 1500 years before present.

(4) Those who have the terminal SNP Y2630, downstream of Y2619, and who are direct descendants of a common ancestor who lived around 1200 years before present.

(5) Those who have the terminal SNP YP264 (a synonym is YP265), downstream of Y2630, and who are direct descendants of a common ancestor who lived around 600 years before present.

The SNPs and the TMRCAs will be refined in the future (or maybe even now); however, it would not apparently change the principal approach.

Before moving to your data and (tentative) conclusions, let me make some comments on "vocabulary," or terminology, if you wish, in order to avoid confusion. You may disagree with some (or all) of them; it is your right. However, in that case we unavoidably enter an area of confusion, which has haunted population genetics for many years.

First, "modal haplotype" is a poorly defined term. It is an obsolete one. It is confusing. The "modal haplotype," as the term is commonly used (and abused), is the one which is the most frequent in a given series of haplotypes. It is a purely empirical, "visual" haplotype. We see it; we record it. If a haplotype set is "bimodal," having two (or more) common ancestors, which happens very often, the "modal" is a prevailing haplotype. A "smaller modal" is ignored. In other words, if one does not separate the haplotype dataset into branches (SNPs, haplotype branches, separate TMRCAs, etc.), the "modal" does not carry much sense except that it is a prevailing one for some reason. New York City certainly has its own "modal" haplotype. Does it make any scientific sense for any meaningful analysis? No, it does not. So, I stick to the term "base haplotype," not to be confused with "modal." The base haplotype is a deduced ancestral haplotype, having a certain TMRCA attached to it. Two branches with two different TMRCAs cannot have one base haplotype, but they can have one modal haplotype. Do you see the difference?

Second, I do not base conclusions on some selected (assorted) single loci, such as DYS537, DYS650, or DYS495, as you have suggested, unless their combination (sic!) has a certain historical, ethnical, populational, etc. meaning. In that case this combination has a sense of a "signature," and it is useful for (tentative) considerations.

For example, in R1a Jewish haplotypes, it is 12-12-15-15 (in DYS464), or, even better, 14 -- 12-12-15-15 (along with DYS458). With these four or five alleles, Jewish R1a are easily recognizable; they are unique in the R1a "universe" (and in the haplogroup universe in general). Those are not absolute criteria, though; I personally prefer them as a "signature." On the other hand, to pick three (or several) assorted markers (not making a signature) is not a good way. Each of them mutates with a certain rate. So it is no wonder that even the slowest (among the three), DYS537, mutated in one of Meir's sons. It does not make the marker "less reliable than DYS650"; in fact, DYS650 is the most reliable one among the three chosen.

Let's take a look at those rates. DYS537 is the "slowest" one of those three; its mutation rate constant equals 0.00080 per marker per conditional generation (of 25 years). It means, in practical terms, that - on average - eight boys (Jewish or not) in every 10,000 births will have their DYS537 mutated. As you see, it is not a unique event. It means, from a different angle, that after 1200 years (since the common ancestor of most of the Jewish R1a individuals) 96% of "Ashkenazi Levites" retain that ancestral DYS537=12, but 4% of them have 11 or 13 (3.5% of them will have 1 mutation, and 0.5% of them will have 2 mutations). There are formulae which are convenient for calculating those percentages.

DYS495 is 50% faster; its mutation rate is 0.00120 per marker per conditional generation. After 1200 years 95% of the descendants retain the ancestral DYS495=17, and 5% have 16 or 18.

DYS650 is the fastest marker of the three, with its mutation rate constant of 0.00580. Only 76% of descendants of their common ancestor who lived 1200 years ago retain its ancestral allele; 20% will have 1 mutation, and another 4% will have 2 or 3 mutations.
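[Editorial sketch, not part of the original e-mail. The percentages quoted for the three markers can be roughly reproduced if one assumes that mutation events on a marker follow a Poisson distribution whose mean is the rate constant times the number of 25-year generations. That model is an assumption made here for illustration; the e-mail does not say which formulae were used. The function name `mutation_count_probs` is hypothetical. The sketch counts mutation events only, so it ignores back mutations that return a marker to its ancestral value.]

```python
import math

GENERATION_YEARS = 25  # length of a "conditional generation" in the text

# Mutation rate constants quoted in the text (per marker per generation)
RATES = {"DYS537": 0.00080, "DYS495": 0.00120, "DYS650": 0.00580}


def mutation_count_probs(rate, years, max_count=3):
    """Poisson probabilities of 0..max_count mutation events on one marker.

    Assumes (for illustration) that events are independent and occur at a
    constant rate per generation.
    """
    mean = rate * years / GENERATION_YEARS  # expected events per lineage
    return [
        math.exp(-mean) * mean**n / math.factorial(n)
        for n in range(max_count + 1)
    ]


for marker, rate in RATES.items():
    p = mutation_count_probs(rate, 1200)
    two_or_more = 1 - p[0] - p[1]
    print(f"{marker}: unmutated {p[0]:.1%}, "
          f"one mutation {p[1]:.1%}, two or more {two_or_more:.1%}")
```

[Under these assumptions the unmutated shares after 1200 years come out near 96%, 94%, and 76% for DYS537, DYS495, and DYS650, close to the rounded figures quoted above.]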

Furthermore, the same mutations will be observed in non-Jewish haplotypes; the mutation rate is common across haplotypes and haplogroups. So, there is no "split" between those haplotypes among the Jewish R1a; there is rather a "cloud" of those mutations, not a "split" between them. So, I do not see those markers as "very useful in predicting the SNPs of R1a Ashkenazi Levites." There will always be some exceptions due to said mutations.

Finally, I understand that "Ashkenazi Levites" is a kind of slang; there were no "Ashkenazi Levites" 4000 years ago or 2500 years ago, but there were R1a bearers from whom the present day R1a Jews descended. Not all of them are Levites or could be even called "Levites." So, scientifically it is not a proper term. However, as slang, it might be kind of handy.

If I obtain those 90 111-marker haplotypes and 200 67-marker haplotypes, I can compose a haplotype tree (actually, two trees), which might be able to pick out certain branches, either SNP-branches or branches within those SNPs.

Having said that, I have to emphasize that despite my (sometimes) critical comments, your work is very useful and advanced. Thank you for sharing the data.

* * *

From Jeff Wexler to Professor Klyosov:

Thanks for your e-mail.

I have two principal interests: (1) finding men who are related to the Y2619+ cluster of R1a1a Ashkenazi Levites (primarily for reasons of identifying historical migratory patterns and determining the likely origins of Y2619+ men subsequent to the Aryan migrations that you discuss); and (2) identifying branches and sub-branches of men within the Y2619+ cluster (largely for genealogical reasons).

So far, our SNP work is distilled on the following tree (you've commented on an earlier version of the tree): https://sites.google.com/site/levitedna/y-dna-analysis/snp-tree-for-r1a1a-ashkenazi-levites

We know of several men who are F1345+ CTS6-. The information that we have about those men is posted here: https://sites.google.com/site/levitedna/origins-of-r1a1a-ashkenazi-levites/y-dna-relationship-between-r1a1a-ashkenazi-levites-and-their-closest-matches

Based upon the very limited SNP results that we have to date, we know of three lines that share the 22 SNPs that we currently have at the Y2619 level, below the CTS6 level (and none of the SNPs that are downstream from the Y2619 level). That break in SNPs appears to correspond to the bottleneck in the R1a1a Ashkenazi Levite population that you've written about based upon STRs.

I'm hoping that we'll find that the population isn't completely bottlenecked at that point, and that we'll be able to identify lines that share some - but not all - of the SNPs that are currently at the Y2619 level. (It would be especially useful if we were to find a line that pre-dates the bottleneck that has a Levite tradition; the distribution of M582 reported in the Rootsi & Behar paper suggests that such lines may exist, perhaps in areas that are underrepresented in the FTDNA database).

Below the Y2619 level, I'm hoping to identify SNPs that define branches at about the level of Y2630, and sub-branches at about the level of YP264. I think that about 60% of the R1a1a Ashkenazi Levite population is Y2619+ Y2630+, with all (or almost all) of the remaining R1a1a Ashkenazi Levites being Y2619+ Y2630-.

We have full Y-DNA SNPs for two lines of Y2619+ Y2630- men - the Ashkenazi Levite from the Rootsi & Behar paper (GS20424) and me (241703). If we can get test results for a second man on each of those lines, we'll have terminal SNPs for those lines.

I've been using the term "modal haplotype" (or, more commonly, the R1a1a Ashkenazi Levite mode) to refer to the most common marker values on each STR among R1a1a Ashkenazi Levites. The "modal haplotype" is characteristic of the Y2630+ cluster of men. I think that the base haplotype for Y2619* men is different than the "modal haplotype" on DYS537 and DYS650 because the two lines of Y2630- men with distinctive SNPs have DYS537=11 and DYS650<20, while the Y2630+ line has DYS537=12 and DYS650=20.
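[Editorial sketch, not part of the original e-mail. The "modal haplotype" as defined in this paragraph, the most common value on each STR, can be computed marker by marker. The function name `modal_haplotype` and the toy marker values are hypothetical and are not real test results.]

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Return the most common value on each marker across a list of dicts."""
    markers = haplotypes[0].keys()
    return {
        m: Counter(h[m] for h in haplotypes).most_common(1)[0][0]
        for m in markers
    }


# Hypothetical toy values for illustration only
sample = [
    {"DYS537": 12, "DYS650": 20, "DYS495": 17},
    {"DYS537": 12, "DYS650": 20, "DYS495": 16},
    {"DYS537": 11, "DYS650": 19, "DYS495": 17},
    {"DYS537": 12, "DYS650": 20, "DYS495": 17},
    {"DYS537": 11, "DYS650": 18, "DYS495": 17},
]

print(modal_haplotype(sample))
```

[Because each marker is tallied separately, the result describes the prevailing values in the whole set and says nothing about whether the set contains more than one branch, which is the distinction drawn earlier between a "modal" and a "base" haplotype.]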

The Y2619+ cluster is so tight and well-defined that it seems highly unlikely that the cluster is bimodal.

I've been using the term "mode" to refer to the prevailing haplotype so frequently (and on so many website pages) that it would be difficult to change that term everywhere. I'll try to phase in use of the term prevailing haplotype as I move forward. (Because of the manner in which FTDNA projects report marker values and my past analyses, it's too late in the day to start reporting marker values in terms of the hypothetical base haplotype instead of the prevailing haplotype.)

In looking for potential R1a1a Ashkenazi Levites, I start by looking for DYS464a-d = 12-12-15-15, along with a few other marker values. Unless I've confirmed that a man is an R1a1a Ashkenazi Levite, I won't include him in my clustering analysis.

Presumably because R1a1a Ashkenazi Levites are so tightly clustered within the past 1,500 years, I haven't been able to identify any slow-mutating markers that seem to characterize branches of R1a1a Ashkenazi Levites.

I've spent a lot of time looking at STR mutation rates among R1a1a Ashkenazi Levites, in relation to the general mutation rates for those STRs. [link omitted] (I haven't updated this page in eight months, so the percentages have changed somewhat.)

Because DYS650=20 and DYS537=12 seem to be highly correlated with Y2630+, it's my hypothesis that the Y2630+ progenitor's line had mutations from DYS650=19 and DYS537=11 to DYS650=20 and DYS537=12.

I recognize, however, that this division is imprecise and will sometimes be inaccurate -- for example, there is at least one distinct Y2630- line that has DYS537=12 -- and that it is far better to rely upon shared patterns in marker values.

DYS650 does mutate frequently; there is a substantial cluster of men with DYS650=17, meaning, I think, that the marker mutated downward twice in a relatively brief period of time. A marker that mutates in one direction that frequently has an equal likelihood, of course, of mutating in the other direction.

Where I have values for both DYS650 and DYS537 and those values coincide, I'm very confident in my prediction whether men are Y2630+ or Y2630-. Where there is a discrepancy between a man's values for DYS650 and DYS537 or a man has not tested DYS650, I'm far less confident because, as you note, those markers can (and sometimes do) mutate or back-mutate.
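[Editorial sketch, not part of the original e-mail. The rule of thumb described in this exchange (DYS650=20 with DYS537=12 suggesting Y2630+, DYS650<20 with DYS537=11 suggesting Y2630-, and lower confidence when the markers disagree or one is missing) might be written as below. The function names, the confidence labels, and the choice to fall back on DYS650 when the two markers disagree are assumptions made for illustration; the fallback reflects only the stated view that DYS537 is the less reliable of the two.]

```python
from typing import Optional, Tuple


def _vote_dys650(value: Optional[int]) -> Optional[str]:
    if value is None:
        return None
    if value == 20:
        return "Y2630+"
    if value < 20:
        return "Y2630-"
    return None  # values above 20 are not covered by the rule


def _vote_dys537(value: Optional[int]) -> Optional[str]:
    if value == 12:
        return "Y2630+"
    if value == 11:
        return "Y2630-"
    return None  # untested, or a value the rule does not cover


def predict_y2630(dys650: Optional[int] = None,
                  dys537: Optional[int] = None) -> Tuple[str, str]:
    """Return (predicted status, confidence) from two STR values."""
    a, b = _vote_dys650(dys650), _vote_dys537(dys537)
    if a is not None and a == b:
        return a, "high"      # both markers agree
    if a is not None:
        return a, "low"       # disagreement or DYS537 unusable: assume DYS650
    if b is not None:
        return b, "low"       # DYS650 untested or out of range
    return "unknown", "none"


print(predict_y2630(20, 12))
print(predict_y2630(20, 11))
print(predict_y2630(None, 11))
```

[A prediction of this kind is only a screening aid; as both correspondents note, mutations and back mutations produce exceptions, and only SNP testing settles the question.]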

On the website, to explain my SNP predictions I've posted marker values for 10 STRs that seem to me to be the most significant in identifying SNP clusters because (1) there are substantial numbers of R1a1a Ashkenazi Levites who have more than one value on those markers but (2) the markers don't seem to mutate too frequently to be of use in identifying shared ancestors going back 500 to 1,500 years.

I currently have no historical, ethnical, or populational basis for splitting the STRs in this manner; R1a1a Ashkenazi Levites were spread throughout the parts of Europe where Ashkenazi Jews lived, with no discernible distinctions between R1a1a Ashkenazi Levites and non-R1a1a Ashkenazi Levites.

As you note, R1a1a Ashkenazi Levite is shorthand for the men whom I'm researching -- slang that was adopted well before I started my work. There are a substantial number of R1a1a Ashkenazi men (including men with a Levite tradition) who don't fall within the cluster. [link omitted]

For that matter, I think that there are quite a few men in this cluster who are not Ashkenazi; if the Horowitz rabbinical family's progenitor had moved to Horovice, near Prague, in the 1490s rather than the 1470s, we might think of that family as Sephardic.

The term R-Y2619 Levite would be a better designation if we were writing on a blank slate. (My assumption is that the R-Y2619 men are descended from men who at one time had a Levite tradition, even if some such men no longer have that tradition; we know that R-Y2619 men of broadly varying marker values have a Levite tradition, so I think that it's safe to conclude that the MRCA who lived 1,500 years ago was a Levite.)

Thanks for the offer of generating haplotype trees from my collected marker values. I'll compile 111-marker and 67-marker sets for you.

Would it be useful for me to also provide separate sets of marker values for the men hypothesized to be Y2630+ and those hypothesized to be Y2630- (or for the hypothesized subclusters of Y2630- men)?

Thanks for your kind words, and for all of your time and help.

* * *

From Professor Klyosov to Jeff Wexler:

I agree, of course, with your reasoning, including the point that it is too late to change some terms (when you stay in the established area). That is why I (in fact) created DNA genealogy in 2007-2008, as distinct from “population genetics”, in order to be free to use more correct (as I see it) terminology and new ways of TMRCA calculations, etc. In DNA genealogy there is nothing “prevailing”; there are mutation rate constants instead, base haplotypes (which are essentially ancestral haplotypes), and bimodal, etc. datasets are easily and reliably identified. I am not restrained by “common knowledge”, “accepted terms”, FTDNA rules, “modal haplotypes”, “population rate constants”, etc.

Your interest and goals regarding “R1a Ashkenazi Levites” are understandable. They are all well defined.

A few comments:

> I think that about 60% of the R1a1a Ashkenazi Levite population is Y2619+ Y2630+, with all (or almost all) of the remaining R1a1a Ashkenazi Levites being Y2619+ Y2630-.

I think it will be noticeable on the 111-marker (and probably on the 67 marker) haplotype trees.

>The Y2619+ cluster is so tight and well-defined that it seems highly unlikely that the cluster is bimodal.

Of course, there are various cases. In some cases bi-modality is overlooked, and “modal” does not have a clear sense, even if some haplotype is prevailing. This error was made with the “CMH” (Cohen Modal Haplotype), which was at least bi-modal (in fact, multi-modal). In some cases a dataset is mono-modal. This makes the area a mess, in terms of “modal haplotypes.” This is what my comment was about. However, people are free to make a choice whether to follow science or a “common opinion” and “established practice.” It is fine with me.

>…I think, that the marker mutated downward twice in a relatively brief period of time.

It happens, of course, just as when you toss a coin, heads or tails can come up twice in a row.

>Would it be useful for me to also provide separate sets of marker values for the men hypothesized to be Y2630+ and those hypothesized to be Y2630- (or for the hypothesized subclusters of Y2630- men)?

Not really. To hypothesize in this context is not very productive, as I see it. There are two more productive ways – to test for actual SNPs, and to analyze a haplotype tree. The latter will show distinct branches, if any.

* * *

After Professor Klyosov prepared his haplotype trees and analysis, posted here, Jeff Wexler sent Professor Klyosov the following e-mail:

* * *

I'm glad to see that your analysis shows DYS650, DYS537, and DYS459b as the markers where the older branch's allele was different than the prevailing allele. As you know, I've been focusing on DYS650 and DYS537 as short-hand for identifying men who are likely to be Y2630+ and those who are likely to be Y2630-. Your analysis suggests that what I've been tracking are ancestral marker values for the Y2619* and the Y2630+ progenitors, which makes sense.

* * *

Professor Klyosov responded:

* * *

As I have explained, the three alleles you have mentioned were those which have mutated between a common ancestor of 900 ybp and a principal descendant of 575 ybp; however, it does not make them "diagnostic" ones. Each of those three shows a "cloud," more or less tight. But the borderline is still fuzzy. Nevertheless, I should praise you for the right identification of the principal alleles differentiating those SNPs.

Professor Anatole Klyosov's e-mails are presented here with his permission.