Documentation for DFSC version 1.1
-----------------------------------

DFSC is a program for testing Tajima's D and Fu and Li's D, D*, F, and F*
statistics under varying assumptions about population size change. The tests
performed by this program make all of the assumptions of the original D, D*,
F, and F* tests formulated by Tajima (1989) and Fu and Li (1993), except the
assumption of constant population size. The most important assumptions are
that recombination has not occurred between polymorphisms and that homoplasy
is rare. If recombination has occurred, these tests are overly conservative.
This means that they are safe to use, but they may fail to reject the null
hypothesis when it is false.

The program will automatically perform all of the tests that are possible for
the input data.

Notes:

Effective Population Size
-------------------------

The program measures effective population sizes in haploid genomes, not
individuals. For example, when working with autosomal loci in humans, it
measures effective population size as the number of orthologous sequences,
not the number of humans (each person carries two).

Mutation Rate
-------------

The program measures mutation rates under an "infinite sites" model. The
infinite-sites mutation rate is the rate at which mutations occur, across the
examined region of the genome in aggregate, per generation. Thus, this rate
depends on the size of the region. We often examine DNA sequences, in which
case the infinite-sites mutation rate is simply the substitution rate per site
per generation times the number of sites examined.

PREREQUISITES:

Most users will need to have a version of the Java Runtime Environment
installed on their computers. I installed the latest Sun (Official) JRE on my
Windows XP system by going to http://www.java.com, choosing "Free Download"
and following the instructions.

INSTALLATION:

1. Unpack the DFSC.zip file.
2. Choose "All Programs"->"Accessories"->"Command Prompt" from the "Start"
   menu.
3. Change into the DFSC directory using the "cd" command.
4. Type "java DFSC" and see if the program runs. If yes, follow the detailed
   instructions below. If no, make sure that you have installed the JRE
   correctly and that you are in the DFSC directory.
5. (Optional) If you have a recent version of the gcc compiler, type "make"
   and a native binary version of the program will be built.

USAGE:

    java DFSC -n -S -[dfiqst]

- or, if you have compiled DFSC using gcc -

    ./DFSC -n -S -[dfiqst]

REQUIRED ARGUMENTS:

-n  Sample size. Sample size measured as the number of haploid genomes
    (e.g. sequences) examined.
-S  Segregating sites. The number of variable positions in your data.

OPTIONS:

-i  Iterations. The number of simulations to run in performing the test
    (DEFAULT is 1000).
-d  Mean pairwise difference. The mean pairwise number of differences
    between sequences randomly chosen from the sample.
-f  Factor. The factor by which the population expanded from its original
    size of Theta0 at time Tau (DEFAULT is 1.0, or no growth).
-g  Print the simulated statistics. These can be pasted into a spreadsheet
    and used to make graphs of the distribution of values under various
    assumptions about population history.
-q  Theta0 (=2Nu). The ancient theta value in the population history, where
    N is the effective population size, measured in haploid genomes, and u
    is the infinite-sites mutation rate (DEFAULT is 1.0).
-t  Tau (=2ut). The time of population expansion, where u is the
    infinite-sites mutation rate and t is generations before present
    (DEFAULT is 1.0).
-s  Singletons. The number of derived variants observed once in the sample.

EXAMPLE:

Consider the data of Wooding et al. (2004), which are composed of DNA
sequences from the PTC gene (1002bp) in humans. The region is small, and
recombination within it is probably rare, so we ignored it.

* Sequences were gathered from 165 people, for a total of 330 haploid
  genomes.
      n = 330

* In these data we found five polymorphisms, two of which were found in just
  one sequence out of the 330. The mean pairwise difference among the
  sequences was 1.48bp. So:

      S = 5
      s = 2
      d = 1.48

* Evidence suggests that human populations increased at least 100-fold, from
  an original size of around 10,000 individuals (i.e., 20,000 haploid
  genomes) about 100,000 years ago (Rogers 1995). We wanted to test the
  assumption of neutrality in the PTC gene while taking human population
  history into account, and we assumed that this population history was
  correct.

* We assumed that the nucleotide substitution rate in the PTC gene is
  1 x 10^-9 per site per year, and that the human generation time is 20
  years. Then:

  The infinite-sites mutation rate was:

      u = 1e-9 substitutions/site/year * 1002 sites * 20 years/generation
        = 2e-5 substitutions/generation (rounded)

  The ancient theta value was:

      theta0 = 2Nu = 2 * 20000 * 2e-5 = 0.8

  The time of expansion (tau) was:

      t   = 100000 years ago / 20 years/generation = 5000 generations ago
      tau = 2ut = 2 * 2e-5 * 5000 = 0.2

  And we assumed that the human population grew 100-fold.
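
The conversion from years and per-site rates to the -q and -t arguments can be reproduced with a short script. This is only an illustrative sketch of the arithmetic in this example; the function and variable names are hypothetical and are not part of DFSC.

```python
# Sketch of the parameter arithmetic from the PTC example.
# All names here are illustrative; none of them come from DFSC itself.

def history_parameters(rate_per_site_per_year, sites, generation_years,
                       ancient_haploid_size, expansion_years_ago):
    """Return (u, theta0, tau) as defined in this documentation."""
    # Infinite-sites mutation rate: whole region, per generation.
    u = rate_per_site_per_year * sites * generation_years
    # Ancient theta = 2Nu, with N counted in haploid genomes.
    theta0 = 2 * ancient_haploid_size * u
    # Tau = 2ut, with t in generations before present.
    t = expansion_years_ago / generation_years
    tau = 2 * u * t
    return u, theta0, tau


u, theta0, tau = history_parameters(
    rate_per_site_per_year=1e-9,
    sites=1002,
    generation_years=20,
    ancient_haploid_size=20000,
    expansion_years_ago=100000,
)

# Values to pass as -q and -t, rounded to one decimal as in the example.
print("u      =", u)
print("theta0 =", round(theta0, 1))
print("tau    =", round(tau, 1))
```

Without rounding, u is about 2.004e-5, theta0 about 0.8016, and tau about 0.2004; the example rounds these to 2e-5, 0.8, and 0.2.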

To test the hypothesis of neutrality in our dataset using the DFSC program,
we first tested it under the assumption of constant population size (the
standard assumption) by typing:

    ./DFSC -n 330 -d 1.48 -s 2 -S 5 -i 1000 -q 0.8 -t 0.2 -f 1.0

The output looked something like this:

/============================================================================\

DFSC version 1.1
================

Test parameters:

  Data:
    n          = 330 haploid genomes
    s          = 2 singletons
    S          = 5 segregating sites
    mpd        = 1.48 nucleotides
    iterations = 1000

  History:
    Ancient theta = 0.8 (=2Nu)
    Tau           = 0.2 (=2ut)
    Growth factor = 1.0
    Modern theta  = 0.8 (=2Nu)

    - where -

    N = effective population size (in haploid genomes)
    u = infinite sites mutation rate
    t = is number of generations before present

Tajima's D     =  1.5950    p(greater) = 0.0590
Fu and Li's D  = -1.4633    p(greater) = 0.8990
Fu and Li's D* = -1.4531    p(greater) = 0.8990
Fu and Li's F  = -0.4912    p(greater) = 0.7020
Fu and Li's F* = -0.4705    p(greater) = 0.6900

p(greater) = fraction of simulated statistics that exceeded the observed
             statistic.

Note: A p(greater) of 0.05 or less indicates that the observed statistic is
      GREATER than expected, while a p(greater) of 0.95 or greater indicates
      that the observed statistic is LESS than expected.

\==========================================================================/

These tests did not show a significant departure from expectation for any of
the statistics (although the departure of Tajima's D from expectation was
very close to being significant).
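
The observed values of Tajima's D and Fu and Li's D in this output can be checked by hand from n, S, d, and s. The sketch below uses the published formulas of Tajima (1989) and Fu and Li (1993); it is not taken from the DFSC source, and the function names are hypothetical. It only computes the observed statistics, not the simulated p(greater) values.

```python
import math


def harmonic_sums(n):
    """Return a1 = sum(1/i) and a2 = sum(1/i^2) for i = 1 .. n-1."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    return a1, a2


def tajimas_d(n, S, d):
    """Tajima's D from sample size n, segregating sites S, and mean
    pairwise difference d (Tajima 1989)."""
    a1, a2 = harmonic_sums(n)
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    # Difference between the two theta estimates, scaled by its
    # estimated standard deviation.
    return (d - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))


def fu_li_d(n, S, s):
    """Fu and Li's D from sample size n, segregating sites S, and the
    number of derived singletons s (Fu and Li 1993)."""
    a1, a2 = harmonic_sums(n)
    c = 2.0 * (n * a1 - 2.0 * (n - 1)) / ((n - 1) * (n - 2))
    v = 1.0 + (a1 * a1) / (a2 + a1 * a1) * (c - (n + 1) / (n - 1.0))
    u = a1 - 1.0 - v
    return (S - a1 * s) / math.sqrt(u * S + v * S * S)


# Values from the PTC example.
print("Tajima's D    = %.4f" % tajimas_d(330, 5, 1.48))
print("Fu and Li's D = %.4f" % fu_li_d(330, 5, 2))
```

Run with the example values, the two functions should agree closely with the 1.5950 and -1.4633 reported by the program. Note that the observed statistics do not depend on -q, -t, or -f; only the p(greater) values change with the assumed population history.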

We then tested the hypothesis of neutrality under the more realistic
assumption of population growth by typing:

    ./DFSC -n 330 -d 1.48 -s 2 -S 5 -i 1000 -q 0.8 -t 0.2 -f 100.0

The result looked something like this:

/============================================================================\

DFSC version 1.1
================

Test parameters:

  Data:
    n          = 330 haploid genomes
    s          = 2 singletons
    S          = 5 segregating sites
    mpd        = 1.48 nucleotides
    iterations = 1000

  History:
    Ancient theta = 0.8 (=2Nu)
    Tau           = 0.2 (=2ut)
    Growth factor = 100.0
    Modern theta  = 80.0 (=2Nu)

    - where -

    N = effective population size (in haploid genomes)
    u = infinite sites mutation rate
    t = is number of generations before present

Tajima's D     =  1.5950    p(greater) = 0.0000
Fu and Li's D  = -1.4633    p(greater) = 0.0000
Fu and Li's D* = -1.4531    p(greater) = 0.0000
Fu and Li's F  = -0.4912    p(greater) = 0.0000
Fu and Li's F* = -0.4705    p(greater) = 0.0000

p(greater) = fraction of simulated statistics that exceeded the observed
             statistic.

Note: A p(greater) of 0.05 or less indicates that the observed statistic is
      GREATER than expected, while a p(greater) of 0.95 or greater indicates
      that the observed statistic is LESS than expected.

\==========================================================================/

These results showed that under reasonable assumptions about human population
history, the observed values of all five test statistics were significantly
greater than expected. Such a pattern is consistent with the action of
balancing selection, population subdivision, or something similar. We went on
to show that the pattern in our sample is not likely to be due to population
subdivision, and concluded that it is best explained by the action of
balancing natural selection.

REFERENCES

Fu, Y.-X., Li, W.-H. 1993. Statistical tests of neutrality of mutations.
    Genetics 133:693-709.

Rogers, A. R. 1995. Genetic evidence for a Pleistocene population explosion.
    Evolution 49:608-615.

Tajima, F. 1989. Statistical method for testing the neutral mutation
    hypothesis by DNA polymorphism. Genetics 123:585-595.

Wooding, S., Kim, U.-k., Bamshad, M., Larsen, J., Jorde, L. B., Drayna, D.
    2004. Natural selection and molecular evolution in PTC, a bitter-taste
    receptor gene. Am. J. Hum. Genet. 74:637-646.