HAPLIN data format

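HAPLIN requires the data to be in an ASCII file in a specific format. In the examples below, columns are separated by spaces, the two alleles of a genotype are separated by a semicolon, and missing data are coded as NA.
(For the convenience of the user, other separators and another missing data indicator can be specified using the arguments sep, allele.sep and na.strings, see below.)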

In addition, the file structure differs slightly from design to design:

For the case-parent triad design (triad)

Each line represents one triad. There are three columns for each locus: one for the mother (M), one for the father (F) and one for the child (C). The columns are placed in the following sequence (where the numbers indicate the marker):

M1  F1  C1  M2  F2  C2  ...etc.

Important: Make sure the sequence is correct. The column order is the only way for HAPLIN to figure out which column belongs to which family member and marker.

To illustrate, for 2 loci with 4 and 2 alleles, respectively, one would have a setup of the type

  marker 1     marker 2
 4;4  4;4  4;4  T;T  C;T  C;T    <- Triad no. 1
 2;4  2;4  2;4  T;T  T;T  T;T    <- Triad no. 2
 2;4  2;4  2;4  T;T  T;T  T;T      etc.
 2;4  2;2  2;4  T;T  T;T  T;T
 2;3  4;4  3;4  T;T  C;T  T;T
 4;4  2;4  4;4  T;T  T;T  T;T
  M    F    C    M    F    C

Assuming there could also be missing data, the first four lines of data might look like

4;4 4;4 4;4 T;T C;T C;T
2;4 2;4 2;4 T;T T;T NA
2;4 NA  2;4 T;T T;T T;T
2;4 2;4 2;4 T;T T;T T;T

(Note the NAs that indicate a missing genotype at the first marker of the father in the third triad, and at the second marker of the child in the second triad.)
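
As a rough illustration of this layout, the following R sketch reads a small triad file with base R and labels the columns in the M1 F1 C1 M2 F2 C2 order described above. It is not part of HAPLIN: the object names (triad.lines, triads, incomplete) are made up for the example, and the data are the four example rows with all six columns written out.

```r
# Hypothetical example data: four triads, two markers, six columns per row
triad.lines <- c(
  "4;4 4;4 4;4 T;T C;T C;T",
  "2;4 2;4 2;4 T;T T;T NA",
  "2;4 NA  2;4 T;T T;T T;T",
  "2;4 2;4 2;4 T;T T;T T;T"
)

# Read every column as text so that genotypes such as "4;4" stay intact;
# the string "NA" is turned into a missing value
triads <- read.table(text = triad.lines, colClasses = "character",
                     na.strings = "NA")

# Three columns (mother, father, child) per marker
n.markers <- ncol(triads) / 3
names(triads) <- paste0(rep(c("M", "F", "C"), times = n.markers),
                        rep(seq_len(n.markers), each = 3))

# Row numbers of triads with at least one missing genotype
incomplete <- which(!complete.cases(triads))
print(triads)
print(incomplete)
```

Here incomplete contains 2 and 3, the two triads that have an NA.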

For the combined case-parent triad and control-parent triad design (cc.triad)

The data format is identical to the triad format, but there must be an extra column to the left of the genetic data, specifying the case/control status of the triad.
Important: The case/control column must be numeric with exactly two different values. The larger value always denotes a case triad!

So, for example, the first four lines could look like

0 4;4 4;4 4;4 T;T C;T C;T
1 2;4 2;4 2;4 T;T T;T NA
1 2;4 NA  2;4 T;T T;T T;T
0 2;4 2;4 2;4 T;T T;T T;T

which would indicate that lines 2 and 3 are case triads and lines 1 and 4 are control triads. Note that if, for instance, only the control children have been genotyped, not their parents, one can use a file like

0 NA  NA  4;4 NA  NA  C;T
1 2;4 2;4 2;4 T;T T;T NA
1 2;4 NA  2;4 NA  NA  T;T
0 NA  NA  2;4 NA  NA  T;T

i.e. the control parents have been set to missing.
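
The rule for the case/control column can be checked before running an analysis. The sketch below is only an illustration in base R, not HAPLIN's own code; the names cc.triad.lines, status and is.case are hypothetical.

```r
# Hypothetical example data: case/control status in the first column
cc.triad.lines <- c(
  "0 4;4 4;4 4;4 T;T C;T C;T",
  "1 2;4 2;4 2;4 T;T T;T NA",
  "1 2;4 NA  2;4 T;T T;T T;T",
  "0 2;4 2;4 2;4 T;T T;T T;T"
)
cc.triad <- read.table(text = cc.triad.lines, colClasses = "character")

# The status column must be numeric with exactly two distinct values
status <- as.numeric(cc.triad[[1]])
values <- sort(unique(status))
stopifnot(!anyNA(status), length(values) == 2)

# The larger of the two values marks the case triads
is.case <- status == max(values)
print(which(is.case))
```

With these four lines, which(is.case) gives 2 and 3, in agreement with the description above.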

For the standard case-control design (cc)

The format should be as for the cc.triad data above, but columns relating to parents should be completely removed, as in

0 4;4 C;T
1 2;4 NA
1 2;4 T;T
0 2;4 T;T
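
If a cc.triad file already exists, a cc file of this form could be derived from it by keeping the status column and every third genotype column (the child). The following is a minimal base R sketch under that assumption; the object names are hypothetical and nothing here is a HAPLIN function.

```r
# Hypothetical cc.triad data: status, then M F C columns for two markers
cc.triad.lines <- c(
  "0 4;4 4;4 4;4 T;T C;T C;T",
  "1 2;4 2;4 2;4 T;T T;T NA",
  "1 2;4 NA  2;4 T;T T;T T;T",
  "0 2;4 2;4 2;4 T;T T;T T;T"
)
cc.triad <- read.table(text = cc.triad.lines, colClasses = "character")

# Column 1 is the status; the child is the third column of each marker
n.markers <- (ncol(cc.triad) - 1) / 3
child.cols <- 1 + 3 * seq_len(n.markers)
cc <- cc.triad[, c(1, child.cols)]

# Write the result as plain text lines (captured here instead of a file)
cc.lines <- capture.output(
  write.table(cc, quote = FALSE, row.names = FALSE, col.names = FALSE)
)
print(cc.lines)
```

To write to disk instead, the file argument of write.table can be given a file name.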

Note (for all designs)

Using other separators

To improve the flexibility of HAPLIN for the user, haplin has two arguments, sep and allele.sep, which set the separators between columns and within columns (that is, between the two alleles of a genotype), respectively. For instance, a space can be used for both, and the file could then look like

      marker 1         marker 2
  4  4  4  4  4  4  T  T  C  T  C  T
  2  4  2  4  2  4  T  T  T  T  T  T
  2  4  2  4  2  4  T  T  T  T  T  T
  2  4  2  2  2  4  T  T  T  T  T  T
  2  3  4  4  3  4  T  T  T  T  T  T
  4  4  2  4  4  4  T  T  T  T  T  T
  M     F     C     M     F     C

Or, for instance, with allele.sep="" (empty) and sep=" " (space) it would be

44 44 44 TT CT CT
24 24 24 TT TT TT
24 24 24 TT TT TT
24 22 24 TT TT TT
23 44 34 TT TT TT
44 24 44 TT TT TT
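
To make the role of allele.sep concrete, the base R sketch below converts lines from the semicolon format to the compact format and splits a single genotype into its two alleles under either convention. The helper split_alleles is hypothetical and only mimics the idea; it is not how HAPLIN reads files. The compact form presumably only makes sense when every allele is coded with a single character.

```r
# Two lines in the default layout (space between columns, ";" within)
default.lines <- c("4;4 4;4 4;4 T;T C;T C;T",
                   "2;4 2;4 2;4 T;T T;T T;T")

# Dropping the semicolons gives the layout for allele.sep = ""
compact.lines <- gsub(";", "", default.lines, fixed = TRUE)
print(compact.lines)

# Hypothetical helper: split one genotype into its alleles
split_alleles <- function(genotype, allele.sep = ";") {
  strsplit(genotype, allele.sep, fixed = TRUE)[[1]]
}

print(split_alleles("C;T"))                   # default separator
print(split_alleles("CT", allele.sep = ""))   # empty separator

# Going by the arguments named above, the compact file would then be
# analysed with a call of the form (not run here):
# haplin("filename", sep = " ", allele.sep = "")
```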

Marker selection

If you intend to run HAPLIN on various selections of the markers, you don’t have to create a separate file for each selection. The markers argument in haplin can be set to, for instance, c(2,3), which will use only the second and third markers in the data file, as in the command haplin("filename", markers = c(2,3)).
Note that if the argument use.missing is set to FALSE, HAPLIN will exclude all triads with any form of missing data (all non-complete triads). However, it will only look at the markers chosen by the markers argument, so triads with missing data on unused markers will not be removed as long as they are complete on the selected markers.
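
The interaction between the markers argument and the removal of incomplete triads can be illustrated with a small base R sketch. It only mimics the rule described above on the triad layout and is not HAPLIN's internal code; marker_cols and kept_triads are hypothetical helper names.

```r
# Hypothetical triad data: two markers, NA at F1 (triad 3) and C2 (triad 2)
triad.lines <- c(
  "4;4 4;4 4;4 T;T C;T C;T",
  "2;4 2;4 2;4 T;T T;T NA",
  "2;4 NA  2;4 T;T T;T T;T",
  "2;4 2;4 2;4 T;T T;T T;T"
)
triads <- read.table(text = triad.lines, colClasses = "character")

# Column numbers (M, F, C) belonging to the selected markers
marker_cols <- function(markers) {
  as.vector(outer(1:3, 3 * (markers - 1), "+"))
}

# Triads that are complete on the selected markers only
kept_triads <- function(data, markers) {
  which(complete.cases(data[, marker_cols(markers), drop = FALSE]))
}

print(kept_triads(triads, 1))        # triad 3 dropped
print(kept_triads(triads, 2))        # triad 2 dropped
print(kept_triads(triads, c(1, 2)))  # triads 2 and 3 dropped
```

Triad 2, for example, is retained when only marker 1 is selected, because its missing genotype sits on marker 2.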