Practical advice

Some practical advice when running Haplin


You may occasionally run into problems when using Haplin. Although I’ve tried to make Haplin respond with proper messages as often as possible, sometimes an error may occur, and the error message may be cryptic. Warnings may also appear, even if Haplin completes its run. You may also have problems with Haplin running very slowly. Below I list a few pieces of advice to avoid problems. Some of this will be built into Haplin in the future, so that it produces appropriate warnings along the way.

  • Too low threshold: The threshold parameter decides how rare a haplotype has to be to be excluded from the analysis. After a preliminary haplotype frequency estimation, Haplin removes all haplotypes with frequency below this limit. The default is 0.01. This is fairly low, so you will sometimes end up with too many haplotypes in the analysis, which may cause Haplin to run slowly or even crash. It will also give many of the double-dose estimates very wide confidence limits, since there is little data from which to estimate double doses for rare haplotypes. A sign that the threshold is too low is that Haplin has a large number of parameters to estimate in each EM step. Try running Haplin with, for instance, threshold = 0.05. This will usually reduce or eliminate the problem. You can tune the threshold parameter once you get an idea of the haplotype distribution.
    A side effect of increasing the threshold parameter is that Haplin has to remove a number of triads that only contain the rare (excluded) haplotypes. This leads to a loss of data, but not necessarily a serious one. The first part of the Haplin printout reports how many triads were actually removed due to rare haplotypes.
    An alternative, and sometimes better, solution is to choose response = "mult", which forgoes the double-dose estimation and assumes a multiplicative (dose-response) model, which is usually more stable.
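
    To get a feel for what raising the threshold does, here is a small sketch that applies the default and a raised cut-off to a made-up set of haplotype frequencies. The haplotype labels and numbers are invented for illustration; in a real analysis you would take them from the Haplin printout. The commented calls at the end use only the threshold and response arguments described above and assume a prepared data object named prepdata.

    ```r
    # Hypothetical haplotype frequencies (invented numbers, summing to 1)
    freq <- c("1-1-1" = 0.46, "1-2-1" = 0.27, "2-1-2" = 0.15,
              "2-2-1" = 0.07, "1-1-2" = 0.03, "2-2-2" = 0.02)

    # Haplotypes that survive a given threshold
    # (those with frequency below the limit are removed)
    kept_default <- names(freq)[freq >= 0.01]
    kept_raised  <- names(freq)[freq >= 0.05]

    length(kept_default)  # number of haplotypes kept at the default threshold
    length(kept_raised)   # number of haplotypes kept at threshold = 0.05

    # With Haplin loaded and a prepared data object `prepdata` (assumed here),
    # the corresponding runs would look like:
    # haplin(prepdata, threshold = 0.05)
    # haplin(prepdata, threshold = 0.05, response = "mult")
    ```

    In this toy example the raised threshold drops the two rarest haplotypes, which is the kind of reduction that usually brings the number of parameters down to a manageable level.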

  • Too much missing information: It usually works fine to include triads where, for instance, all information on the father is missing. However, you should watch out for triads that lack information on all markers for several of the family members. These contain little information but a lot of ambiguity, so Haplin has to work hard to make sense of them with little extra power in return. Haplin should detect these in the future, but for the time being it is a good idea to try to remove the “hopelessly incomplete” triads.
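
    One possible way to screen for such triads before running Haplin is sketched below in base R. The table layout (one row per triad, one column per family member and marker, NA for a missing genotype) is hypothetical and is not the Haplin file format, and the rule "two or more members with no genotypes at all" is only my reading of "several of the family members". Adapt both to your own data.

    ```r
    # Toy genotype table: one row per triad, one column per member and marker
    # (m = mother, f = father, c = child). Hypothetical layout, not the
    # Haplin file format.
    triads <- data.frame(
      m_1 = c("1;2", NA,    "1;1", NA),
      m_2 = c("2;2", NA,    "1;2", NA),
      f_1 = c("1;1", NA,    NA,    "1;2"),
      f_2 = c("1;2", NA,    NA,    NA),
      c_1 = c("1;2", "1;2", "1;1", NA),
      c_2 = c("2;2", "1;2", "1;2", NA),
      stringsAsFactors = FALSE
    )

    # TRUE where a family member has no genotype at any marker
    all_missing <- sapply(c("m", "f", "c"), function(member) {
      cols <- grep(paste0("^", member, "_"), names(triads))
      rowSums(!is.na(triads[, cols, drop = FALSE])) == 0
    })

    # Count entirely ungenotyped members per triad and keep triads with
    # fewer than two of them (the cut-off of two is an assumption)
    n_empty <- rowSums(all_missing)
    usable  <- triads[n_empty < 2, ]
    ```

    Note that the third triad, where only the father is missing, is kept, in line with the advice above that such triads usually work fine.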

  • Too many markers: Since the number of markers included decides how many possible haplotypes there are, the workload of Haplin increases steeply with the number of markers. Haplin 2.0 handles this pretty well, but you should keep in mind that if you run more than, say, 6 or 7 SNP markers at a time there will be a lot of rare haplotypes that Haplin will need to get rid of. This may lead to a data loss that sometimes becomes serious. So it is probably a good idea to try to limit the number of markers in each run. The markers you want to include can be picked from the data using the markers argument. Setting, for instance, markers = c(2, 3, 5, 6) picks markers 2, 3, 5 and 6 from the file, so that you don’t have to make a separate file for each combination you want to try.
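
    To see why the workload grows so quickly, recall that k biallelic SNP markers allow up to 2^k distinct haplotypes. The short sketch below tabulates this for 1 to 8 markers. The commented call at the end shows the markers argument as described above and assumes a prepared data object named prepdata.

    ```r
    # Upper bound on the number of distinct haplotypes for k biallelic SNPs
    k <- 1:8
    n_haplotypes <- 2^k
    data.frame(markers = k, possible_haplotypes = n_haplotypes)

    # Picking markers 2, 3, 5 and 6 in a single run; `prepdata` is assumed
    # haplin(prepdata, markers = c(2, 3, 5, 6))
    ```

    With 6 or 7 markers there are already 64 or 128 possible haplotypes, most of which will be rare or absent in a typical data set.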

  • Too much printout: If you think Haplin tends to produce too much printout during the EM process, you may be right. Although I do recommend checking convergence by looking at the parameter estimates printed during EM, you can reduce the amount of printout considerably by setting the verbose argument to F (=FALSE), as in:

    haplin(prepdata, use.missing = TRUE, verbose = FALSE)