HAPLIN data format
HAPLIN requires data to be in an ASCII file in a specific format.
- The data can contain a number of leading columns with covariates (such
as a case/control variable), followed by columns containing the genetic
data.
- Each line represents a case-parent triad (or, in the case-control
design, a single individual).
- Columns should be separated by white space.
- Within each column, the two alleles for that individual at that locus
are separated by a semicolon, such as 1;2, C;T, A;A etc.
- Missing data is coded as NA.
- There should be no row or column names in the file.
(For the convenience of the user, other separators and missing-data
indicators can be specified using the arguments sep, allele.sep and
na.strings; see below.)
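The rules listed above can be checked before a file is handed to HAPLIN. The following is a minimal base R sketch, not part of HAPLIN itself: the helper names read_genotype_fields and split_alleles are hypothetical, and the code assumes the default separators (white space between columns, a semicolon between alleles) and NA as the missing-data code.

```r
# Hypothetical helper (not part of HAPLIN): read a file in the default
# format into a character matrix, one row per line, one column per field.
read_genotype_fields <- function(path, na.strings = "NA") {
  lines <- readLines(path)
  # Columns are separated by white space
  fields <- strsplit(trimws(lines), "[[:space:]]+")
  # Every line must have the same number of columns
  stopifnot(length(unique(lengths(fields))) == 1)
  m <- do.call(rbind, fields)
  # Recode the missing-data indicator as a real NA
  m[m %in% na.strings] <- NA
  m
}

# Hypothetical helper: split one genotype such as "C;T" into two alleles.
split_alleles <- function(genotype, allele.sep = ";") {
  if (is.na(genotype)) return(c(NA_character_, NA_character_))
  alleles <- strsplit(genotype, allele.sep, fixed = TRUE)[[1]]
  stopifnot(length(alleles) == 2)
  alleles
}

# Write a two-line example file and read it back
path <- tempfile()
writeLines(c("4;4 4;4 4;4 T;T C;T C;T",
             "2;4 2;4 2;4 T;T T;T NA"), path)
geno <- read_genotype_fields(path)
dim(geno)                   # 2 rows, 6 columns
split_alleles(geno[1, 5])   # "C" "T"
```

A check of this kind catches lines with a missing or extra column, which would otherwise shift every genotype after it.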
In addition, the file structure differs slightly from design to design:
For the case-parent triad design (triad)
Each line represents one triad. There are three columns for each locus,
one for the mother (M), one for the father (F) and one for the child (C).
The columns are placed in the following sequence (where the numbers
indicate marker):
M1 F1 C1 M2 F2 C2 ...etc.
Important: Make sure the sequence is correct; this is the only way for
HAPLIN to figure out which is which.
To illustrate, for 2 loci with 4 and 2 alleles, respectively, one would
have a setup of the type
   marker 1       marker 2
|-------------||-------------|
 4;4  4;4  4;4  T;T  C;T  C;T    <- Triad no. 1
 2;4  2;4  2;4  T;T  T;T  T;T    <- Triad no. 2
 2;4  2;4  2;4  T;T  T;T  T;T    etc.
 2;4  2;2  2;4  T;T  T;T  T;T
 2;3  4;4  3;4  T;T  C;T  T;T
 4;4  2;4  4;4  T;T  T;T  T;T
|---||---||---||---||---||---|
  M    F    C    M    F    C
Assuming there could also be missing data, the first four lines of data
might look like
4;4 4;4 4;4 T;T C;T C;T
2;4 2;4 2;4 T;T T;T NA
2;4 NA 2;4 T;T T;T T;T
2;4 2;4 2;4 T;T T;T T;T
(Note the NAs that indicate missing genotype at the first marker of the
father in the third triad, and at the second marker of the child in the
second triad.)
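Because HAPLIN relies on column order alone, it can help to label the columns and list the missing genotypes before running an analysis. The sketch below uses base R only; triad_column_labels is a hypothetical helper, and the code assumes a triad file with no covariate columns and single spaces between columns.

```r
# Hypothetical helper: label the genetic columns of a triad file in the
# order M1 F1 C1 M2 F2 C2 ...
triad_column_labels <- function(n_columns) {
  stopifnot(n_columns %% 3 == 0)   # three columns per marker
  n_markers <- n_columns / 3
  paste0(rep(c("M", "F", "C"), times = n_markers),
         rep(seq_len(n_markers), each = 3))
}

# The four example lines with missing data
rows <- c("4;4 4;4 4;4 T;T C;T C;T",
          "2;4 2;4 2;4 T;T T;T NA",
          "2;4 NA 2;4 T;T T;T T;T",
          "2;4 2;4 2;4 T;T T;T T;T")
geno <- do.call(rbind, strsplit(rows, " ", fixed = TRUE))
colnames(geno) <- triad_column_labels(ncol(geno))

# Triad number (row) and column label of every missing genotype
missing <- which(geno == "NA", arr.ind = TRUE)
data.frame(triad = missing[, "row"],
           column = colnames(geno)[missing[, "col"]])
```

For the example data this lists column F1 of triad 3 and column C2 of triad 2, matching the note above.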
For the combined case-parent triad and control-parent triad design (cc.triad)
The data format is identical to the triad format, but there must be
an extra column to the left of the genetic data, specifying the
case/control status of the triad.
Important: The case/control column must be numeric with two different
values. The largest one will always be used to denote a case triad!
So, for example, the first four lines could look like
0 4;4 4;4 4;4 T;T C;T C;T
1 2;4 2;4 2;4 T;T T;T NA
1 2;4 NA 2;4 T;T T;T T;T
0 2;4 2;4 2;4 T;T T;T T;T
which would indicate that lines 2 and 3 are case triads, lines 1 and 4
are controls. Note that, if for instance only the control child has
been genotyped, not the parents, one can use a file like
0 NA NA 4;4 NA NA C;T
1 2;4 2;4 2;4 T;T T;T NA
1 2;4 NA 2;4 NA NA T;T
0 NA NA 2;4 NA NA T;T
i.e. the control parents have been set to missing.
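The rule that the largest of the two values denotes a case can be made explicit with a short base R sketch. The helper is_case is hypothetical and only mirrors the rule as stated here; it assumes the case/control status is the first column and that columns are separated by single spaces.

```r
# Hypothetical helper: TRUE for case triads, FALSE for control triads.
is_case <- function(status) {
  status <- as.numeric(status)
  values <- sort(unique(status))
  # The column must be numeric with exactly two different values
  stopifnot(length(values) == 2)
  # The largest value denotes a case
  status == values[2]
}

rows <- c("0 NA NA 4;4 NA NA C;T",
          "1 2;4 2;4 2;4 T;T T;T NA",
          "1 2;4 NA 2;4 NA NA T;T",
          "0 NA NA 2;4 NA NA T;T")
fields <- strsplit(rows, " ", fixed = TRUE)
status <- sapply(fields, `[`, 1)   # first column of every line
is_case(status)                    # FALSE TRUE TRUE FALSE
```

The same helper gives the same answer for a 1/2 coding, since only the ordering of the two values matters.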
For the standard case-control design (cc)
The format should be as for the cc.triad
data above, but columns relating to parents should be completely
removed, as in
0 4;4 C;T
1 2;4 NA
1 2;4 T;T
0 2;4 T;T
Note (for all designs)
- For all the designs, the file can contain an arbitrary number of
columns to the left of the genetic data. The number of columns should be
specified using the argument n.vars. The default, n.vars=0, applies to
the triad design when no covariate columns are present.
- The design should be specified with the design argument, which takes the
values "triad" (default), "cc.triad" and "cc".
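The relationship between the column count, n.vars and the design can be sketched as a simple consistency check in base R. The function n_markers is hypothetical and not part of HAPLIN; it only restates the layouts described above (three genetic columns per marker for "triad" and "cc.triad", one per marker for "cc").

```r
# Hypothetical helper: number of markers implied by the number of columns
# on a line, given n.vars and the design.
n_markers <- function(n_columns, n.vars = 0,
                      design = c("triad", "cc.triad", "cc")) {
  design <- match.arg(design)
  # The cc design has no parent columns, so one column per marker
  per_marker <- if (design == "cc") 1 else 3
  n_genetic <- n_columns - n.vars
  stopifnot(n_genetic > 0, n_genetic %% per_marker == 0)
  n_genetic / per_marker
}

n_markers(6)                                    # triad file, no covariates: 2
n_markers(7, n.vars = 1, design = "cc.triad")   # 2
n_markers(3, n.vars = 1, design = "cc")         # 2
```

If the check fails, the most likely causes are a wrong n.vars or a design that does not match the file.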
Using other separators
To improve the flexibility of Haplin for the user, there are two
arguments to haplin, sep and allele.sep, which can be used to set the
separators between columns and within columns, respectively. For
instance, space can be used for both, and the file could then look like
  marker 1    marker 2
|---------| |---------|
4 4 4 4 4 4 T T C T C T
2 4 2 4 2 4 T T T T T T
2 4 2 4 2 4 T T T T T T
2 4 2 2 2 4 T T T T T T
2 3 4 4 3 4 T T T T T T
4 4 2 4 4 4 T T T T T T
|-| |-| |-| |-| |-| |-|
 M   F   C   M   F   C
Or, for instance, with allele.sep="" (empty) and sep=" " (space) it would be
44 44 44 TT CT CT
24 24 24 TT TT TT
24 24 24 TT TT TT
24 22 24 TT TT TT
23 44 34 TT TT TT
44 24 44 TT TT TT
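To see that the three layouts carry the same information, one can split a line into alleles under each choice of separators. This is a base R sketch with a hypothetical helper, split_line; it does not reproduce how HAPLIN reads files, it ignores missing data, and for allele.sep="" it assumes every allele is a single character.

```r
# Hypothetical helper: split one line into a matrix with one row per
# genotype and one column per allele.
split_line <- function(line, sep = " ", allele.sep = ";") {
  fields <- strsplit(line, sep, fixed = TRUE)[[1]]
  if (identical(sep, allele.sep)) {
    # Same separator everywhere: tokens already alternate allele 1, allele 2
    alleles <- fields
  } else if (allele.sep == "") {
    # No separator within columns: one character per allele (assumption)
    alleles <- unlist(strsplit(fields, ""))
  } else {
    alleles <- unlist(strsplit(fields, allele.sep, fixed = TRUE))
  }
  stopifnot(length(alleles) %% 2 == 0)
  matrix(alleles, ncol = 2, byrow = TRUE)
}

a <- split_line("4;4 4;4 4;4 T;T C;T C;T")
b <- split_line("4 4 4 4 4 4 T T C T C T", sep = " ", allele.sep = " ")
d <- split_line("44 44 44 TT CT CT", sep = " ", allele.sep = "")
identical(a, b) && identical(a, d)   # TRUE
```

Note that the empty allele separator only works when allele names are one character long; multi-character alleles need an explicit separator.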
Marker selection
If you intend to run HAPLIN on various selections of the markers, you
don't have to create a separate file for each selection. The markers
argument in haplin can be set to, for instance, c(2,3), which will use
only the second and third markers in the data file, as in the command
haplin("filename", markers = c(2,3))
Note that if the argument use.missing is set to FALSE, HAPLIN will
exclude all triads with any form of missing data (all non-complete
triads). However, it will only look at markers chosen by the markers
argument, so that triads with missing data on unused markers will not
be removed as long as they are complete on the selected markers.
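The effect of marker selection on which triads count as complete can be illustrated in base R. The helper complete_on_markers is hypothetical and only mimics the rule described above; it is not HAPLIN's own code, and it assumes a triad file with no covariate columns.

```r
# Hypothetical helper: TRUE for triads that are complete on the selected
# markers, regardless of missing data on the other markers.
complete_on_markers <- function(geno, markers) {
  # In the M F C layout, marker k occupies columns 3k-2, 3k-1 and 3k
  cols <- as.vector(sapply(markers, function(k) (3 * k - 2):(3 * k)))
  rowSums(is.na(geno[, cols, drop = FALSE])) == 0
}

rows <- c("4;4 4;4 4;4 T;T C;T C;T",
          "2;4 2;4 2;4 T;T T;T NA",
          "2;4 NA 2;4 T;T T;T T;T",
          "2;4 2;4 2;4 T;T T;T T;T")
geno <- do.call(rbind, strsplit(rows, " ", fixed = TRUE))
geno[geno == "NA"] <- NA

complete_on_markers(geno, markers = 1)         # TRUE TRUE FALSE TRUE
complete_on_markers(geno, markers = 2)         # TRUE FALSE TRUE TRUE
complete_on_markers(geno, markers = c(1, 2))   # TRUE FALSE FALSE TRUE
```

Triad 2 is kept when only marker 1 is selected, because its missing genotype sits on marker 2.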