This is a guest post by Graham Coop and Peter Ralph, cross-posted from the Coop Lab website.
We’ve been addressing some of the FAQs on topics arising from our paper on the geography of recent genetic genealogy in Europe (PLOS Biology). We wanted to write one on shared genetic material in personal genomics data but it got a little long, and so we are posting it as its own blog post.
Personal genomics companies that type SNPs genome-wide can identify blocks of shared genetic material between people in their databases, offering the chance to identify distant relatives. Finding a connection to someone else who is an unknown relative is exciting, whether you do this through your family tree or through personal genomics (we’ve both pored over our 23andMe results a bunch). However, given that nearly everyone in Europe is related to nearly everyone else over the past 1000 years (see our recent paper and FAQs), and likely everyone in the world is related over the past ~3000 years, how should you interpret that genetic connection?
The answer to that question is obviously highly personal, and specific to the relationship identified. For example, Peter and Graham are likely to be related a few tens of generations back, but our connection to our siblings is obviously much closer. (Also shared genetic inheritance is only one aspect of what it means to be family, e.g. step parents are part of a family.)
Our paper offers some preliminary answers to questions concerning the distant connections found by personal genomics companies. Many of the ideas that we’ll touch on in this post are explained more thoroughly here. The short answer is that we think that these single shared blocks (especially the short ones) are from much older shared relatives than you would think, and that they often aren’t a particularly meaningful connection in a genealogical sense.
The difficulty is that the further back we go, the less sharing of genetic material due to recent ancestry there is. Individuals who share many long blocks (if those blocks are correctly identified) are likely close relatives. However, individuals who share a specific ancestor more than eight generations back are unlikely to share even a single chunk of genetic material due to that particular connection (Donnelly 1983, see also the discussion around Figure 1 in Huff et al, and Luke Jostins’ post on this). That said, you have many 8th cousins, so you will share a block with quite a few of these cousins. Conditional on sharing a block of material from that far back, this block is often quite long and highly variable in length, but frequently identifiable using SNP chips. So a more concrete question is: if you and I share a single block of a given length (say ~10cM), what is it possible to say about our relationship?
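The claim that a specific ancestor eight or more generations back usually leaves no shared block can be illustrated with a rough Poisson sketch. It assumes (this is an illustration, not a calculation from the paper) that the number of blocks two people share through a given ancestral connection is roughly Poisson with mean a(rd + c)/2^(d-1), an approximation in the spirit of the one discussed by Huff et al. Here d is the number of meioses separating the two people, a is the number of shared ancestors (2 for an ancestral couple), r is about 35.3 recombinations per generation and c is 22 autosomes. The constants and function names are assumptions chosen for illustration.

```python
import math

# Assumed constants for a rough human autosomal genome (illustrative only).
RECOMB_PER_MEIOSIS = 35.3  # expected crossovers per meiosis
NUM_AUTOSOMES = 22


def expected_shared_blocks(generations_back, num_shared_ancestors=2):
    """Approximate expected number of blocks shared through ancestors who
    lived `generations_back` generations before both individuals."""
    meioses = 2 * generations_back  # one path down to each person
    # Roughly how many pieces the ancestral genome is cut into along the path.
    pieces = RECOMB_PER_MEIOSIS * meioses + NUM_AUTOSOMES
    # Each piece survives all meioses on both paths with small probability.
    return num_shared_ancestors * pieces / 2 ** (meioses - 1)


def prob_share_any_block(generations_back, num_shared_ancestors=2):
    """Poisson probability of sharing at least one block via this connection."""
    mean = expected_shared_blocks(generations_back, num_shared_ancestors)
    return 1.0 - math.exp(-mean)


for g in (3, 5, 8, 10):
    print(g, "generations back:", round(prob_share_any_block(g), 4))
```

Under these assumptions the probability is close to one for ancestors three generations back and falls to a few percent by eight generations. It ignores the fact that very short blocks cannot be detected, so real detection rates would be lower.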
We tackle this question in the discussion of our paper. The first difficulty is that the length of the block due to a given relationship is highly variable. The other problem is that while you have many close relatives, you have a huge number of more distant relatives (explained here). This acts to seriously distort our intuition of when a block of a given length would have come from. This is further complicated as the number of distant relatives (e.g. 10th cousins) you have depends strongly on the demography of all of the myriad populations that contributed to your ancestry. For example, if your ancestry comes from a set of populations that have grown very rapidly, as many populations around the world have over the past few thousand years, you will have many fewer close relatives than if you come from a small population that was constant in size. In these two figures [1,2] we show the theoretical age distribution of blocks of three different lengths, for two different demographic scenarios (a constant population and an exponentially growing population respectively). This means that we can’t make a statement like “10cM blocks are from 20-30 generations ago” that will hold for everyone.
Consider that hypothetical block of length 10cM shared between two people. Since the mean length of a shared IBD block inherited from five generations ago is 10 cM, we might expect the corresponding common ancestor to have lived around five generations ago (10 meioses, since 10cM is 1/10th of a typical chromosome). However, a direct calculation using our inferred demographic histories says that the typical age of a 10 cM block shared by two individuals from the United Kingdom is between 32 and 52 generations (depending on the inferred distribution used). This giant discrepancy results from the fact that you are a priori much more likely to share a common genetic ancestor further in the past, and this acts to skew our answers away from the naive expectation: even though it is unlikely that a 10 cM block is inherited from a particular shared ancestor from 40 generations ago, there are a great number of such older shared ancestors. As discussed above, our estimate does depend drastically on the populations’ shared histories: for instance, the age of such a block shared by someone from the United Kingdom with someone from Italy is even older, usually from around 60 generations ago.
A corollary of this is that if we were seeing 10cM blocks from only 5 generations ago, we must be sampling from a really tiny population, since that would mean a large chance that random people were related through ancestors 5 generations ago (fourth cousins).
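The way the prior over ancestors pulls the answer away from the naive expectation can be seen in a toy calculation. The sketch below is not the inference from our paper and will not reproduce the 32-52 generation figures. It assumes a single randomly mating population of constant size `pop_size`, takes block lengths from an ancestor n meioses away to be exponential with mean 100/n cM, and weights each possible age by the chance that two people first share an ancestor at a site that many generations ago. All function names and parameters are hypothetical.

```python
import math


def naive_age(length_cm):
    """Age in generations if we simply invert 'mean block length = 100/n cM'
    for n meioses, with n = 2 * generations."""
    return (100.0 / length_cm) / 2.0


def block_age_distribution(length_cm, pop_size, max_gen=5000):
    """Toy probability that a shared block of this length dates to each
    generation 1..max_gen, in a constant population (an assumption)."""
    x = length_cm / 100.0  # block length in Morgans
    weights = []
    for g in range(1, max_gen + 1):
        meioses = 2 * g
        # Chance that two lineages first meet exactly g generations ago.
        p_coal = (1.0 / (2 * pop_size)) * (1.0 - 1.0 / (2 * pop_size)) ** (g - 1)
        # Relative number of blocks of length x from ancestors this old:
        # older ancestry means more, but shorter, blocks.
        blocks_of_length = meioses ** 2 * math.exp(-meioses * x)
        weights.append(p_coal * blocks_of_length)
    total = sum(weights)
    return [w / total for w in weights]


def mean_block_age(length_cm, pop_size):
    """Mean age in generations under the toy distribution above."""
    dist = block_age_distribution(length_cm, pop_size)
    return sum(g * p for g, p in enumerate(dist, start=1))


print("naive age:", naive_age(10))
print("toy mean age, large population:", round(mean_block_age(10, 1_000_000), 1))
print("toy mean age, small population:", round(mean_block_age(10, 50), 1))
```

Even in this simple setting the mean age of a 10 cM block in a large population comes out several times older than the naive five generations, and shrinking the population pulls it younger. Realistic demographic histories change the numbers further, which is why the estimates above depend so heavily on the inferred history.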
Numbers like the 32-52 generations above must be taken with a grain of salt, as they are highly dependent on the demographic history. However, they do imply that blocks of these lengths are likely coming from deeper in time than the time when all Europeans share all of their common ancestors. Therefore, a single example of a block of around this length is not a particularly meaningful statement about the genealogical relationship between two people, as these people share all of their ancestors that far back.
This conclusion may not apply to ancestors from the past very few (perhaps fewer than eight) generations, from whom we expect to inherit multiple long blocks. In this case, we can hope to infer a specific genealogical relationship with reasonable certainty (e.g., Huff et al., Henn et al), although even then care must be taken to exclude the possibility that these multiple blocks have been inherited from distinct common ancestors (and this will also vary across countries). It is not totally obvious to us how/whether this is currently being done in the relative-finding software that personal genomics companies use. What is really needed is a set of guidelines and tests, informed by data from Europe and elsewhere, of how long a single shared block has to be to indicate a more meaningful relationship. These efforts have begun in some populations (Henn et al, Gusev et al, Kong et al) but we likely need more of them.
What is potentially informative about these single shared blocks is the geographic pattern of who you share these blocks with. For example, if you have many shared blocks with people from Norway in a company’s database, this would suggest that some of your recent ancestors lived in Norway (although we need to know how many Norwegian people there are in the database to truly understand this result). This is the kind of information that some of these companies use to work out where your genomic ancestry derives from. However, we think that we are still a long way from understanding these tools thoroughly, and that these tools should be treated as only one (likely imperfect) aspect of family history research. For a more general discussion of how personal genomics can inform our views of family history see Sense about Science, which takes a (rightly) skeptical view of some of the more dubious claims (especially those made by companies that only test Y/mtDNA markers).
We note that even if sharing a single long block doesn’t imply a particularly close genealogical relationship, it can imply a stronger genetic relationship than usual. Both are significant, in different ways.
Peter Ralph and Graham Coop
Progress on this subject will come quite soon, with analysis of full genomic sequencing increasingly available because of the advent of NGS.
1000 USD price tag per individual genome is not rare and even claims of 100 USD by GnuBiolabs have appeared!
Clarification: 100 USD per sample genome sequencing claim is from GnuBIO Inc. in Cambridge, Mass. on 2nd of April press report.
Emanuel: Note that NGS alone will not offer a huge improvement in identifying distant relatives (say <10 generations ago), as the limiting factor here is that the long shared identical blocks are highly variable in length. Having full sequence won't sidestep that problem; we will still have large uncertainty about when these shared blocks come from.
If you are considering blocks of say 10 Mb or larger being tens of generations old, then WGS could help provide a rough indication of their age, at least in an aggregate/average sense. Given a germline mutation rate of about 1E-8/site/generation (http://www.sciencemag.org/content/328/5978/636.short), this would suggest an average of 1 mutation in the shared segment for every 5 generations back (i.e. 5 generations on each side for a total of 10 generations) for a 10 Mb block. If aggregated over a sufficient number of shared segments to account for mutation rate variability (http://www.nature.com/ng/journal/v43/n7/full/ng.862.html), stochastic effects, etc., this could be an interesting empirical test of your hypotheses.
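The arithmetic in the comment above can be checked with a short sketch. It uses the mutation rate quoted there (about 1E-8 per site per generation) and treats the number of new mutations on a shared block as Poisson, which is an added assumption. The function names are hypothetical.

```python
import math

MUTATION_RATE = 1e-8  # per site per generation, as quoted in the comment


def expected_mutations(block_bp, generations_back):
    """Expected new mutations separating two copies of a shared block whose
    common ancestor lived `generations_back` generations ago."""
    meioses = 2 * generations_back  # generations on each side of the pedigree
    return MUTATION_RATE * block_bp * meioses


def prob_no_mutations(block_bp, generations_back):
    """Poisson probability (an assumption) that the block carries no new
    mutations at all, which shows how noisy a single block is."""
    return math.exp(-expected_mutations(block_bp, generations_back))


print(expected_mutations(10_000_000, 5))  # 10 Mb block, 5 generations back
print(prob_no_mutations(10_000_000, 5))
```

A 10 Mb block from five generations back is expected to carry about one new mutation, yet a sizeable fraction of such blocks would carry none, so any age estimate would need to be aggregated over many blocks.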
I have some empirical matches, i.e. confirmed 5th cousins, a 3rd cousin twice removed, and sixth cousins, and I have the amount of cM shared among the relative matches I have found in various databases. The cM shared has largely matched the predictions made by the companies that perform these tests.
Hi Greg,
Yes, we’ve thought a bunch about this too, and talked to some folks. However, sequencing errors and polymorphism make this challenging.
Given the rate of sequencing errors a priori we expect most mutations in between IBD haplotypes (from say the past 30 generations) to be errors. We’d need to be very confident of the sequencing to call things as de novo mutations. Obviously people are now getting to that stage, as they can reasonably reliably call events in the past generation.
The polymorphism aspect is potentially more problematic. The IBD we see between individuals is for one of each individual’s haplotypes only. That means that our pair of individuals also each have haplotypes that aren’t IBD [in the past 10s of generations] to the other individual, so they will have 2 kinds of polymorphism between them in a region where we see IBD. The first is due to de novo mutations that have arisen in the past tens of generations; the second is much older alleles segregating between their non-IBD haplotypes. If we don’t have phase for polymorphisms, we can’t distinguish the two. There are ways around this even without phase information, but they make the inference problem more subtle.
We are definitely keen to pursue this issue more, as we agree that it would be a nice confirmation.
Graham
Hi Justin,
It is fun to see the genetic predictions and family tree line up. We are not saying that those results are wrong. Conditional on sharing a known, recent relationship, you will indeed share [on average] roughly the predicted amount of genetic material with these individuals. So the companies will on average get roughly the right predicted relationship.
Our point is slightly more subtle. You are likely to share blocks >=10cM due to these recent relationships, and unlikely to share such long blocks due to a particular very distant relative (say >10 generations in the past). The problem is that a typical individual will have many more distant relatives than close relatives in these companies’ databases. So the blocks >10cM that you share with random individuals are, we think, much more likely to be due to these distant relationships than to close relationships.
Graham
Hello Graham,
Although this is not my area of study, some friends of mine have done some work on this: Leon Kull (http://snpology.com/, now a member of Full Genomes) and Jim McMillian (MS, Physics, UC Berkeley), as well as a particularly gifted person in mathematics. I believe they might be interested in some brief follow-up.
Correction: Jim McMillan http://www.isogg.org/wiki/23andMe_projects
My apologies for the typo.