HAPLIN data format

HAPLIN requires data to be in an ASCII file in a specific format.

The data can contain a number of leading columns with covariates (such as a case/control variable), followed by columns containing the genetic data.
Each line represents a case-parent triad (or, in the case-control design, a single individual).
Columns should be separated by white space.
Within each column, the two alleles for that individual at that locus are separated by a semicolon, such as 1;2, C;T, A;A etc.
Missing data is coded as NA.
There should be no row or column names in the file.

(For convenience, other separators and another missing-data indicator can be specified using the arguments sep, allele.sep and na.strings; see below.)
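
As a rough illustration of the rules above, the following R sketch writes a small triad data set in the default format (white-space separated columns, semicolon between alleles, NA for missing, no row or column names). The data frame geno and the temporary file are made up for the example, and only base R functions are used.

```r
# Hypothetical example data: two triads, one marker (columns M1, F1, C1).
geno <- data.frame(
  M1 = c("4;4", "2;4"),
  F1 = c("4;4", NA),
  C1 = c("4;4", "2;4"),
  stringsAsFactors = FALSE
)

# Write without quotes, row names or column names; missing values become NA.
outfile <- tempfile(fileext = ".dat")
write.table(geno, file = outfile, sep = " ", quote = FALSE,
            row.names = FALSE, col.names = FALSE, na = "NA")

# Inspect the result: one line per triad.
written <- readLines(outfile)
print(written)
```

The argument quote = FALSE matters here: without it, write.table would wrap each genotype in double quotes.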

In addition, the file structure differs slightly from design to design:

For the case-parent triad design (triad)

Each line represents one triad. There are three columns for each locus, one for the mother (M), one for the father (F) and one for the child (C). The columns are placed in the following sequence (where the numbers indicate marker):

M1 F1 C1 M2 F2 C2 ...etc.

Important: Make sure the sequence is correct. This is the only way for HAPLIN to figure out which is which.
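
The column order can be expressed as simple arithmetic. The sketch below uses a hypothetical helper, triad_columns, which is not part of HAPLIN. It returns the positions of mother, father and child within the genetic data for a given marker, assuming the M1 F1 C1 M2 F2 C2 ... sequence described above.

```r
# Column positions (within the genetic data) of mother, father and child
# for a given marker, assuming the order M1 F1 C1 M2 F2 C2 ...
triad_columns <- function(marker) {
  first <- 3 * (marker - 1) + 1
  c(M = first, F = first + 1, C = first + 2)
}

print(triad_columns(1))  # positions 1, 2, 3
print(triad_columns(2))  # positions 4, 5, 6
```

If covariate columns precede the genetic data, their number would have to be added to these positions to get positions in the full file.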

To illustrate, for 2 loci with 4 and 2 alleles, respectively, one would have a setup of the type

marker 1    marker 2
|---------| |---------|
4;4 4;4 4;4 T;T C;T C;T    <- Triad no. 1
2;4 2;4 2;4 T;T T;T T;T    <- Triad no. 2
2;4 2;4 2;4 T;T T;T T;T    etc.
2;4 2;2 2;4 T;T T;T T;T
2;3 4;4 3;4 T;T C;T T;T
4;4 2;4 4;4 T;T T;T T;T
|-| |-| |-| |-| |-| |-|
M   F   C   M   F   C

Assuming there could also be missing data, the first four lines of data might look like

4;4 4;4 4;4 T;T C;T C;T
2;4 2;4 2;4 T;T T;T NA
2;4 NA 2;4 T;T T;T T;T
2;4 2;4 2;4 T;T T;T T;T

(Note the NAs that indicate a missing genotype at the first marker of the father in the third triad, and at the second marker of the child in the second triad.)
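
To check where the missing genotypes sit before running an analysis, one can read the file into R and list the NA positions. This is only a sketch using base R: the data are the four example lines above, supplied as an inline string, and the column names M1 ... C2 are added purely for readability (the file itself has none).

```r
# The four example lines from the text, read from an inline string.
txt <- "4;4 4;4 4;4 T;T C;T C;T
2;4 2;4 2;4 T;T T;T NA
2;4 NA 2;4 T;T T;T T;T
2;4 2;4 2;4 T;T T;T T;T"

# colClasses = "character" keeps genotypes as text (and stops an allele
# such as T from being interpreted as a logical value).
geno <- read.table(text = txt, header = FALSE, colClasses = "character",
                   na.strings = "NA")
colnames(geno) <- c("M1", "F1", "C1", "M2", "F2", "C2")

# Row (triad) and column of each missing genotype.
missing <- which(is.na(geno), arr.ind = TRUE)
print(data.frame(triad = missing[, "row"],
                 column = colnames(geno)[missing[, "col"]]))
```

For a real file, the text = txt argument would be replaced by the file name.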

For the combined case-parent triad and control-parent triad design (cc.triad)

The data format is identical to the triad format, but there must be an extra column to the left of the genetic data, specifying the case/control status of the triad.
Important: The case/control column must be numeric with two different values. The larger value will always be used to denote a case triad!

So, for example, the first four lines could look like

0 4;4 4;4 4;4 T;T C;T C;T
1 2;4 2;4 2;4 T;T T;T NA
1 2;4 NA 2;4 T;T T;T T;T
0 2;4 2;4 2;4 T;T T;T T;T

which would indicate that lines 2 and 3 are case triads and lines 1 and 4 are control triads. Note that if, for instance, only the control child has been genotyped, not the parents, one can use a file like

0 NA NA 4;4 NA NA C;T
1 2;4 2;4 2;4 T;T T;T NA
1 2;4 NA 2;4 NA NA T;T
0 NA NA 2;4 NA NA T;T

i.e. the control parents have been set to missing.
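
The rule for the case/control column can be checked in a few lines of base R. The sketch below reads the first cc.triad example, verifies that the leading column is numeric with exactly two distinct values, and marks the triads carrying the larger value as cases. It mirrors the rule as stated above and is not HAPLIN's own code.

```r
txt <- "0 4;4 4;4 4;4 T;T C;T C;T
1 2;4 2;4 2;4 T;T T;T NA
1 2;4 NA 2;4 T;T T;T T;T
0 2;4 2;4 2;4 T;T T;T T;T"

dat <- read.table(text = txt, header = FALSE, colClasses = "character")
status <- as.numeric(dat[[1]])

# The column must be numeric with exactly two distinct values.
stopifnot(!anyNA(status), length(unique(status)) == 2)

# Following the rule in the text, the larger value marks the case triads.
is_case <- status == max(status)
print(which(is_case))  # line numbers of the case triads
```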

For the standard case-control design (cc)

The format should be as for the cc.triad data above, but columns relating to parents should be completely removed, as in

0 4;4 C;T
1 2;4 NA
1 2;4 T;T
0 2;4 T;T
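
If a cc.triad file already exists, a cc file can be derived from it by keeping the case/control column and every third genetic column (the child). The following base R sketch assumes one leading case/control column; the variable names are illustrative only.

```r
txt <- "0 4;4 4;4 4;4 T;T C;T C;T
1 2;4 2;4 2;4 T;T T;T NA
1 2;4 NA 2;4 T;T T;T T;T
0 2;4 2;4 2;4 T;T T;T T;T"

dat <- read.table(text = txt, header = FALSE, colClasses = "character")

n_vars <- 1                                    # one case/control column
n_markers <- (ncol(dat) - n_vars) / 3          # three columns per marker
child_cols <- n_vars + 3 * seq_len(n_markers)  # positions of C1, C2, ...

# Keep the covariate column and the child columns only.
cc <- dat[, c(seq_len(n_vars), child_cols)]

# Paste each row back into one white-space separated line.
cc_lines <- do.call(paste, c(unname(as.list(cc)), sep = " "))
print(cc_lines)
```

The resulting lines could then be written out with writeLines.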

Note (for all designs)

For all the designs, the file can contain an arbitrary number of columns to the left of the genetic data. The number of such columns should be specified using the argument n.vars. The default, n.vars = 0, is appropriate for the triad design when no covariate columns are present.
The design should be specified with the design argument, which takes the values "triad" (default), "cc.triad" and "cc".
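
A quick sanity check on a file is whether its column count is compatible with the chosen design and n.vars: after removing the n.vars leading columns, the triad and cc.triad designs need a multiple of three columns, while the cc design has one column per marker. The helper n_markers below is hypothetical (it is not a HAPLIN function) and only sketches this arithmetic.

```r
# Number of markers implied by the column count, or an error if the
# layout cannot match the stated design. Hypothetical helper, not HAPLIN code.
n_markers <- function(n_cols, n.vars = 0, design = "triad") {
  per_marker <- if (design == "cc") 1 else 3
  n_gen <- n_cols - n.vars
  if (n_gen <= 0 || n_gen %% per_marker != 0) {
    stop("column count does not fit design '", design,
         "' with n.vars = ", n.vars)
  }
  n_gen / per_marker
}

print(n_markers(6))                                   # triad example
print(n_markers(7, n.vars = 1, design = "cc.triad"))  # cc.triad example
print(n_markers(3, n.vars = 1, design = "cc"))        # cc example
```

All three calls correspond to the two-marker examples shown earlier.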

Using other separators

To give the user more flexibility, haplin has two arguments, sep and allele.sep, which set the separator between columns and the separator between the two alleles within a column, respectively. For instance, a space can be used for both, and the file could then look like

marker 1    marker 2
|---------| |---------|
4 4 4 4 4 4 T T C T C T
2 4 2 4 2 4 T T T T T T
2 4 2 4 2 4 T T T T T T
2 4 2 2 2 4 T T T T T T
2 3 4 4 3 4 T T T T T T
4 4 2 4 4 4 T T T T T T
|-| |-| |-| |-| |-| |-|
M   F   C   M   F   C

Or, for instance, with allele.sep="" (empty) and sep=" " (space) it would be

44 44 44 TT CT CT
24 24 24 TT TT TT
24 24 24 TT TT TT
24 22 24 TT TT TT
23 44 34 TT TT TT
44 24 44 TT TT TT
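
The relation between the three layouts can be seen by converting a single line. The sketch below takes the first triad in the default format and rewrites it for the two alternative settings using base R's gsub. It is only meant to show how the formats correspond; the empty allele separator presumably only makes sense when every allele is a single character.

```r
line <- "4;4 4;4 4;4 T;T C;T C;T"  # first triad in the default format

# allele.sep = " " and sep = " ": every allele becomes its own field.
spaced <- gsub(";", " ", line, fixed = TRUE)
print(spaced)

# allele.sep = "" and sep = " ": the two alleles are written together.
joined <- gsub(";", "", line, fixed = TRUE)
print(joined)
```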

Marker selection

If you intend to run HAPLIN on various selections of the markers, you don’t have to create a separate file for each selection. The markers argument in haplin can be set to, for instance, c(2,3), which will use only the second and third markers in the data file, as in the command haplin("filename", markers = c(2,3)).
Note that if the argument use.missing is set to FALSE, HAPLIN will exclude all triads with any form of missing data (all incomplete triads). However, it will only look at the markers chosen by the markers argument, so triads with missing data on unused markers will not be removed as long as they are complete on the selected markers.
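
To see which triads would survive such a selection, the described behaviour can be mimicked in base R. The helper complete_on below is hypothetical and not taken from HAPLIN. It flags triads with no missing genotype among the three columns of each selected marker, using the four-line missing-data example from the triad section.

```r
txt <- "4;4 4;4 4;4 T;T C;T C;T
2;4 2;4 2;4 T;T T;T NA
2;4 NA 2;4 T;T T;T T;T
2;4 2;4 2;4 T;T T;T T;T"

geno <- read.table(text = txt, header = FALSE, colClasses = "character",
                   na.strings = "NA")

# TRUE for triads with no missing genotype at the selected markers.
# Hypothetical helper mimicking the described behaviour; not HAPLIN code.
complete_on <- function(geno, markers) {
  cols <- unlist(lapply(markers, function(m) 3 * (m - 1) + 1:3))
  complete.cases(geno[, cols, drop = FALSE])
}

print(which(complete_on(geno, markers = 1)))        # triads complete on marker 1
print(which(complete_on(geno, markers = 2)))        # triads complete on marker 2
print(which(complete_on(geno, markers = c(1, 2))))  # complete on both
```

In this example the third triad is dropped only when marker 1 is selected, and the second only when marker 2 is selected.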