Nowhere Near Ithaca: Modeling our Most Recent Common Ancestor - Very First Steps in Understanding Rohde et al.'s 2004 Population Model

This post marks some of my first steps towards understanding population models used in Rohde et al's 2004 work regarding our Most Recent Common Ancestor (MRCA).

This is a work-in-progress. Please let me know if I have misinterpreted something in their work.

Rohde et al's 2004 Paper on Our Most Recent Common Ancestor

In 2004, Douglas Rohde, Steve Olson, and Joe Chang had an interesting paper appear in Letters to Nature, titled "Modelling the recent common ancestry of all living humans". The paper summarized their analyses related to how long ago our Most Recent Common Ancestor (MRCA) might have lived, and how long ago the Identical Ancestors Point (IAP) might have been.

The MRCA is person which all people alive today are directly descended - note that this does not imply that genetic material is shared. The mere existence of an IAP may be counterintuitive. As you go back to ancestors of the MRCA, "eventually we reach a point in the past where all humans can be divided into two groups: those who left no descendants today and those who are common ancestors of all living humans. This point in time is termed the identical ancestors point (IAP)" (wikipedia).

It may initially be hard to believe that any kind of "recent" MRCA or IAP could be possible. In the case of the MRCA/IAP, I think part of this is how bad we are (as a species) at appreciating the implications of exponential growth.

In the paper, Rohde et al. used a simple statistical population model, coupled with estimates of historical world population and spatial migration, to try to estimate when the MRCA might have lived.

One of the main results from their modeling was that our MRCA might have lived in the recent past - on the order of a "few thousand years ago". Further, the IAP may be just a few thousand years before that.

The Rohde et al. paper was a nice dovetail between and extension of Chang's earlier statistical work in this area in 1999 ("Recent Common Ancestors of All Present-Day Individuals"), and Rohde's earlier modeling work while he was a postdoc at MIT.

Does the Most Recent Common Ancestor or IAP Matter?

Yes.

Its impact on popular culture and zeitgeist may be just as important as its scientific relevance. Simply the realization of the existence of the MRCA and IAP can remind you of our very real connection to each other - a repeating coda on Lennon's Imagine. Just as staring into a clear star-filled sky can drive home Beethoven's "Alle Menschen werden Brueder", appreciating the densely dotted canopy of our shared ancestral past might do the same thing.

Why Am I Looking at This?

When trying to calculate the MRCA for a given simulated population, the size of the data that needs to processed in some way quickly grows to challenging levels. If you want to model any largish population, this really is a big data problem. For this reason, back in 2004 Rohde limited the total world population size to 50 million. Even then, with the small amount of data being associated with each person being modeled, Rohde estimated that over 300 terabytes of information would need to be stored. For this reason, he used an alternative (approximate? I am still coming to terms with this) approach that allowed him to find the MRCA and IAP in "five to ten hours" using a Pentium 4 with 1 to 2 GB of RAM.

One of my initial motivations for looking into the Rohde et al. work in more detail was to explore extending it to population sizes that are more reflective of actual values - certainly the improved computing and storage power of 2015 could be brought to bear. It is an intriguing and enticing challenge, with lots of opportunity to work with a large data problem associated with a topic I am interested in already. Of course, it also will require that I understand Rohde's approach.

As I started to dive into this further, I noticed that almost no one else has since come along to look at this problem again. One other related work that I have come across (after thinking there was none) is that of Joseph Lachance (Inbreeding, pedigree size, and the most recent common ancestor of humanity, in the Journal of Theoretical Biology, November 2009). Lachance looks at the impact of inbreeding using methods based on recurrence relations. The model seems much simpler than Rohde et al. Another related work is from May 2015, whereAndrew Collier implemented some aspects of Chang's earlier 1999 work in R (link) - the R code is amazingly short.

As for the models by Rohde et al, I wanted to explore opportunities for playing around with their parameterization, and I wanted to see what some of the intermediate results looked like for myself.

First Steps - Basic Population Model Assumptions in Rohde et al 2004

This brings me to the first steps I have taken towards understanding, if not actually eventually replicating, the details in Rohde et al's paper.

Summary of Model in Rohde et al.

The image below summarizes the main aspects of the model from Rohde et al. that I have been exploring implementing first. I have been using node/js for exploring the initial implementation.

Summary of Basic Population Model Used in Rohde et al (neglecting migration)

Modeling Lifespan

The probability $p(s)$ that an individual dies at age $s$ (in years), conditional on not having died before age $s$, is assumed to follow a "discrete Gompertz-Makeham form":

$p(s) = \alpha + (1-\alpha) \exp\{ (s - maxAge)/\beta \} $

where the parameters and the values used are shown in the following table.

Parameter	Definition	Rohde et al 2004
$\alpha$	Accident rate	0.01
$maxAge$	Maximum Lifespan	100
$\beta$	Death Rate	12.5

Note: I still have a lingering TODO to confirm if (and how) this should be incorporated when determining the input value for the average number of children per woman.

Modeling Births

This has seemed to be a little tricky to implement exactly as described, as there is room for interpretation.

The basic assumptions are as follows:

Women can give birth between the ages of 16 and 40, inclusive
There is an equal probability $p_{birth}$ of giving birth in each of these years
There is an equal probability of the child being male or female

The (surprisingly) tricky part is how to properly set $p_{birth}$ to obtain the desired average number of children who reach adulthood per woman.

This is tricky because one needs to also take into account how deaths are modeled; for example, simply using $$p_{birth} = {C\over 25}$$ with $C=2$ for ages 16 to 40 might seem to result in an average of 2 children per woman, but there are two factors that must also be taken into account:

that a woman might not live to age 40, and so she would not have 25 chances for birth
that some children will not reach adulthood (considered to be age 16 in Rohde's model)

Both of these factors require that the specified $C$ be larger than 2 (when writing things in terms of the 25 years), and the connection with the actual average number of children who survive to adulthood is via a sequence of other equations involving the Gompertz model for mortality.

Since $p_{birth}$ is constant, simple trial and error with model runs could provide guestimates on the "proper" constant value to use for $p_{birth}$. However, this removes a way to assess consistency with the Rohde et al. model, because they explicitly suggest that "average number of children who reach adulthood per woman" is a parameter in their model.

I still need to work through this.

Fatherhood

For my initial model implementations, the logic in the choice of the fathers does not seem to be important, but it is intended to be consistent as much as possible with the method described by Rohde.

In Rohde's model, he has addressed continents, countries, and towns, and the locality is taken into account when choosing the father of a child. I have not gotten to that point yet. The choice of the father could then be interesting in terms of tracking how the populations may disperse across the globe, and I guess may impact the MRCA somehow (TBD).

As in Rohde et al., once it has been determined that a woman will have birth in a given year, the father is chosen. The father of the first child is chosen to be at least as old as the woman. Rohde does this "primarily for computational reasons" - but I admit that I do not appreciate what those are yet.

As in Rohde et al., if the woman has already had a child before, then there is an 80% chance father will be the same as the father of her previous child (if he is still alive). Otherwise, the father is chosen at random from the other living males. However, those males who have not been a father yet are twice as likely to be selected.

Males can be a father throughout adulthood (ages 16 to 100).

Population Group

For my initial experiments, there is a single population. This population is initialized with an equal number of males and females, born in years prior to when the simulation starts.

Time

The implementation marches through with a yearly time step for a specified length of time. Each year, the births and then the deaths are determined, with the internal data structures updated. A lingering question I have is if it matters much if you calculated the deaths first and updated data structures prior to determining births (or vice versa) for a given year. I do not know the answer to that yet.

Some Initial Results - Family Trees of Simulated Populations

While I am still debugging on the implementation, I thought it would be interesting to take a look at what the "family trees" look like for some of the simulations. This was done using a family tree viz tool I have been playing with (available online here)

The first ones below are for a case where the population dies out in a few hundred years.

Sample Family Tree where Population Dies Out
(simulation run for 500 years)

Family Tree of the Last Child
This also demonstrates the polyamory aspects of the model
(when shown, vertical bars indicate lifespans)

Sample Family Tree where Population Remains Stable
(there is nothing particularly special I am pointing out here -
I just like the big swooping curves back towards the first ancestors)

Lots More to Do

This has been a fun little project to start. As simple as the basic model is, there seem to be a surprising number of little nuances that are hard to appreciate without actually implementing the model.

Here are a few note-to-self pending TODOs I am in the middle of:

The large number of links in a family tree generated from a simulation when the population is stable has revealed some needs for optimization of the family tree visualization
I need to refine how the specified average number of children per woman is determined. But, however the average number of children per woman is determined, the probability is going to be the same for each year a woman can give birth in the Rohde et al model. In early experiments, there is a surprising sensitivity of the population's fate - stability, growth, or extinction - to a "tiny" range for the average number of children per woman. A difference of 0.1 can matter! What does this mean for how lucky we may be to still be here? And I need to explore this sensitivity more.
Joseph Lachance looked at this problem in 2009, using a simpler approach based on recurrence relations (and he seems to work backwards in time). He even included the MATLAB code he used as supplemental data for the paper. I need to take a closer look at this.

Nowhere Near Ithaca

Modeling our Most Recent Common Ancestor - Very First Steps in Understanding Rohde et al.'s 2004 Population Model

No comments:

Post a Comment

Popular Posts