Tnt Htm
Linked from TNTwiki
This "tnt.htm" file is included in the TNT download. It is a help file, but incomplete by
itself; but it can help on some issues, so it might as well be in put online for
google-tasticness. I have put the text here: tnt_htm
See the original file, inside the download, for original formatting etc.
T. N. T.
Tree Analysis Using New Technology
Version 1.1 a P. Goloboff, J. S. Farris, and K. Nixon
License Agreement
Introduction-Getting Started Saving suboptimal trees
What TNT doesn't do... Memory Management
Printing & Metafiles Tree Collapsing
Data Input Command-Line
Continuous characters Batch Menus
Merging Data Files Measures of support
Basic Character Settings Linux versions
Scopes, sets, and blocks Consensus estimation
Character and block names Search algorithms
Step-Matrices Setting search parameters
Ancestral states Implementation notes...
Nexus files Scripting(advanced users)
Implied Weighting Citations
Overview
General:
Windows versions with GUI; all versions can be interactively driven with commands
a simple but powerful scripting language allows customization and "programming" of complex
routines or simulations
Linux/Mac versions can run in parallel
Reads Nona/Hennig input data, as well as basic Nexus files.
Can export data and results easily to Nexus format
can overimpose legends on tree-branches (e.g. different support measures).
Optimality criteria:
analyses under pre-defined weights
analyses under implied-weights, either with standard or user-defined weighting functions
(the latter including definition of clique-like weighting)
analyses under self-weighted optimization (i.e. dynamic weighting of character-state
transformations)
standard (Fitch, Farris) characters, easily definable branched characters
(=character-state trees), Sankoff (=step-matrix characters), continuous characters (from 0
to 65, with up to three decimals)
searches optionally collapse zero-length branches under different criteria (including
SPR-TBR collapsing), or retain all distinct dichotmous trees
Search options:
exact ("branch-and-bound") searches applicable to relatively small data sets
random addition sequences plus TBR, applicable to medium sized data sets.
the fastest SPR-TBR swapping algorithms of any program
variety of search methods for large/complex data sets (sectorial searches, ratcheting and
tree-drifting, tree-fusing and tree-hybridization)
recursive sectorial searches, using user-defined commands to analyze sectors
searches highly customizable via simple scripts
positive and negative constraints, flexibly defined
time-bound searches
can search for suboptimal trees (based on either fit difference and/or relative fit
difference).
Tree diagnosis:
parsimonious mapping of character states
lists of synapomorphies
can summarize mappings or lists of synapomorphies based on sets of trees rather than one
tree at a time
can display all most parsimonious reconstructions
counts of specific types of transformations
through the scripting language, it allows finding best state assignments with a fixed
state at one or more nodes (for a given tree)
Tree comparisons:
strict, combinable components, majority, and frequency difference consensus trees
Semi-strict consensus supertrees, easy creation of MRP matrices (for subsequent supertree
analysis under either parsimony or cliques)
finds taxa to prune to improve strict and combinable consensus trees
tree-distances: distortion coefficient, SPR-distances, agreement subtrees. Others can be
easily calculated in conjunction with scripting language (e.g. number of "flippings",
distance under MRP, triplets).
algorithms that help find taxa to prune to improve majority rule or frequency difference
consensus trees, and group supports
easy comparisons of most similar groups between trees
comparisons of groups present in one tree but absent in another
Group supports, randomization:
generation of random trees (for testing in combination with scripts)
data permutation (alla PTP)
generation of random or simulated (jukes-cantor only) data sets
highly customizable jacknifing, bootstrapping, or symmetric resampling (searches can use
any command defined by the user, or access scripts for analyzing resampled data sets in a
special way)
Absolute and relative Bremer supports implemented natively (or easily with scripts)
Scripting
decision-making and control of loops
definition of variables (arrays with up to 5 dimensions)
easy access to internal variables to be used in decision making (number of taxa, trees,
tree-scores, degree of resolution, branch lengths, tree-distances, etc.)
In Windows version, customizable dialogs
formatted output
easy access to taxon names, character names, branch-legends, etc., for use in formatted
output
controls runs in parallel (Linux/Mac only), launching, monitoring, and controlling slave
tasks.
tree-manipulations (including easy definition of branch-swapping routines)
allows manipulation of character-state reconstructions
Introduction - Getting Started (back to index)
TNT is a program for phylogenetic analysis under parsimony. It provides fast
tree-searching algorithms, as well as extensive capabilities for tree diagnosis,
consensus, and manipulation. This documentation explains some important things that are
not immediately apparent (or cannot be done easily) from the menus themselves. For details
on how to use commands, users must refer to the on-line help of TNT (accessed typing help
<enter> at the command prompt or the command dialog opened with File/CommandLine). This
file, tnt.htm, provides a general description of the program; if the file is copied to
your Windows directory, then when selecting the Help menu option, the Explorer is called
to read this file. Throughout this help file, commands that can be typed at the command
line (or included in a file), are marked with italics and bold; menu options are
underlined.
The basic thing you need to do to run TNT is to prepare a data input file. You can either
create this from the program itself, with the File/New and Data/Edit menu options (Windows
only), or in a text editor. This file should contain the data matrix (format specified in
the next section, Data Input); it may contain also specifications of character names and
settings, like weights and additivities (this can be later specified once within the
program, but it is advisable to have this in the file, so you do not forget to change
something you need to change; see under Basic Character Settings, Step-Matrices, Ancestral
states, and Character and block names). Check the files zilla.tnt, example.tnt, and
contin.tnt, three example data files provided with the program. Note that all the input
files must be text-only files.
Once you have read the data in, you are ready to start doing searches, etc. Like most
parsimony programs (i.e. Hennig86, Paup, or Pee-Wee/NONA), TNT has an internal (memory)
buffer for trees, and many of the commands and options (optimization, consensing, etc.)
are available only when there are trees in memory, as these commands are performed on
pre-existing trees. Some search algorithms create trees de novo (others work on
preexisting trees, modifying them to try to find better trees; see under Tree-searching
algorithms). The trees in the memory buffer can be saved to disk, so they are not lost
when you exit the program; you can subsequently read the trees into the program. The text
output produced by the program (i.e. results and timings of runs, tree-diagrams, etc.) is
written onto an internal text buffer, which can be later saved to disk.
If you have a Windows version, make sure you install the font provided with the program,
tred.ttf, in your system. This font is required for TNT to draw trees properly. If the
tred font is not installed, TNT will attempt to use the font terminal (which also draws
trees, if not as nice as tred's).
What TNT doesn't do ...or does poorly... or doesn't do yet... (back to index)
There are many things the program does not do. First, the program is conceived as a tool
for standard parsimony analysis. It does not do maximum likelihood, it does not use
specific models of evolution, it does not do bayesian analyses. Although we anticipate
incorporating some likelihood algorithms, they will be quite rudimentary.
Second, the program does not do sequence alignment of any kind. Sequence
comparison is (computationally) quite a different problem from standard parsimony
analysis. If you need sequence alignment, get a program that does it properly.
Third, the program does not incorporate methods which are potentially useful but
are somewhat experimental. The "support weighting" method of Farris (implemented in
Farris' program zac) is not included in TNT. That method (which is based on mapping
character changes on a tree in order to evaluate it) requires operations quite different
from the ones used in TNT; we do not anticipate including that method in the program.
Fourth, the program uses exact algorithms for Sankoff characters. These work
fine for low-medium numbers of states, but they are not very efficient for very large
numbers of states. Thus, for step matrix characters with many states, TNT is not very
efficient. Note that this applies only to Sankoff ("step-matrix") characters; additive
and nonadditive characters with many states are treated efficiently.
Fifth, the exact solution algorithms in TNT ("implicit enumeration") are not
specially efficient -their speed is about the same as that of PAUP* or the 15-year-old,
16-bit Hennig86. Since the program is conceived mostly as a tool for analysis of large
data sets, there seemed to be no point in using energy in improving the implicit
enumeration algorithms.
Sixth, TNT does not implement any kind of multiple-cut-swapping (commands mswap,
swap, tswap, and altswap in Nona/Pee-Wee). Those swappers were less effective than
multiple addition sequences. Likewise, the swap command of Nona/Pee-Wee (i.e. report fit
differences when moving specified clade around in the tree) is not directly implemented in
TNT (it would probably be easy to implement that using the sprit or tbrit options of the
scripting language). These are probably the only options present in Nona/Pee-Wee but
absent in TNT.
Printing or saving to a Metafile, output or trees (Windows only) (back to index)
The Windows version of TNT has rudimentary printing capabilities. You can print the
display buffer as a text file, using File/Output/PrintDisplayBuffer (optionally, you can
save the text in the display buffer to a file, and print it later with your favorite word
processor; choose File/Output/OpenOutputFile, and all newly produced output will be
written to the file; optionally, with File/Output/SaveDisplayBuffer, you can save
previously produced output).
To print tree diagrams, you have to set the pre-view for trees as ON (under
Format/PreviewTrees). If you want to map (=optimize) characters in color, choose
Format/MapCharactersInColor (set the colors you like with Format/ColorsForMapping). Then,
draw the trees and (when in pre-view screen) press "P." When in pre-view screen, pressing
"M" saves the tree diagram to a metafile (which can be later processed/printed with Corel
Draw, or exported to Microsoft Word or Power Point).
Data Input (back to index)
TNT reads matrices in Hennig86 format, with some refinements. Normally, the data will be
included in a file. The basic format for data input is:
xread
'optional title, starting and ending
with quotes (ASCII 39)'
nchar ntax
Taxon0 0000000000
Taxon1 0010111000
Taxon2 1011110000
Taxon3 1111111000
.....
TaxonN 1111111000
;
the semicolon at the end is normally not required but is useful to detect errors (e.g. is
the number of taxa set too high?).
Polymorphisms can be read, by enclosing the state(s) in square brackets. By default, up
to 16 states are allowed by xread, using symbols 0-9 to indicate states 0-9, and A-F to
indicate states 10-15. This can be changed with the nstates command, which defines the
number of states (and data type) to be subsequently read; in Windows versions, this can be
changed with Format/DataType. If the data are defined as DNA (with Format/DataType under
menus, or using the command nstates dna;) the IUPAC nucleotide codes (including
polymorphisms) are recognized and used throughout. The states are internally stored as
A=0, G=1, C=2, D=3, and gap=4. If the data are instead defined as proteins (with the
menus, or using nstates prot;), the IUPAC codes are recognized and used throughout. For
32 states, non-protein data, the symbols 0-9 and A-V indicate states 0-31.
If the data are defined as smaller numbers of states, they occupy less memory. This may
be necessary for large data sets.
The data can also be read as interleaved. For this, all that is required is that each
block of data is preceded by either the & symbol or the ASCII character number 4 (the
latter is what PAUP* uses). The taxa in each block must have the same exact names
(including capitals). The taxa in each block can be arranged in a different order; the
taxon numbers correspond to the first occurence of a name. A taxon which is not included
in a given block, will have only missing entries for the characters in that block; the
same number of characters must be specified for all the taxa within a given block. Each
row of characters must end with a carriage return. The end of the data is indicated with
a semicolon; if some characters were never specified, they are left as missing entries
(and a warning is issued). If some taxon names were never defined, a warning is issued.
The interleaved format allows mixing of different data types, by simply enclosing the
corresponding string (prot, num, or dna) in square brackets right after the & symbol.
When the data are being read as interleaved, it is possible for different blocks of data
to have different formats. For this, specify between square brackets, immediately
following the '&' that indicates a new block, the format of the data (numeric, dna,
proteins, continuous). To specify how to read a state in subsequent commands that refer
to states (cost, change, or usminmax), use the '/' option of the nstates command.
Continuous characters (back to index)
While other programs require that continuous characters be discretized, TNT can deal with
continuous characters as such. This avoids the use of the rather ad hoc methods that have
been proposed to discretize continuous distributions in phylogenetic analysis (gap-coding,
Thiele's method, etc.; see Farris, 1990 for discussion of some of these methods). If there
is significant variability in one of the terminals, it will probably be best represented
by a range of one (or two) standard deviations around its mean. For normal distributions,
this automatically takes care of non-significant differences.
Continuous characters in TNT can have states 0-65, with 3 decimals (thus, when not doing
implied weighting, tree scores are reported always with 3 decimals). They are optimized
always as additive, i.e., using Farris optimization. They cannot be turned to nonadditive
or sankoff. The continuous characters must be the first ones in the matrix (i.e. the
first block or blocks, followed by blocks with other data formats). Each value must be
separated by a blank or whitespace; ranges are indicated by two values separated by a dash
(-). Missing entries are indicated only as question marks (?).
The treatment for the data when there are continuous characters is rather transparent.
The commands/options that show reconstructions (recons command, or
Optimize/Characters/Reconstructions) or count number of specific transformations (change
command, Optimize/CountSpecificChanges) cannot be applied to continuous characters.
Continuous characters can be edited with the xread= command, and with the menu option
Data/Edit/Taxon, but not with Data/Edit/Character. Continuous characters can be named
with the cnames command, but their states cannot.
Standardization of continuous characters (i.e. rescaling so that the values are on a
common scale) is left to the user. Under pre-defined weights, standardization will
normally be an issue. Under implied weighting, only the default formula can be applied
(i.e. user-weighting functions are not allowed for continuous characters). Since
continuous characters tend to have a larger s-m difference (s: steps, m: minimum
possible), implied weighting should normally penalize them rather heavily. Thus,
standardization of continuous characters is probably not so important under implied
weighting.
Merging Data Files (back to index)
The dmerge command merges data files, and saves them as blocks; the character names and
settings are also merged (in Windows version, File/MergeDataFiles, using Ctrl+Left Mouse
Button to select multiple files). The dmerge command also merges (by using * as argument)
the characters with identical character names (in Windows version, with
Data/MergeHomonymousCharacters). Some basic rules of correspondence between states and
state names must be satisfied.
Basic Character Settings (back to index)
In the Windows version, characters settings can be determined with Data/CharacterSettings.
Current settings can be displayed with Data/ShowStatus. Alternatively, settings can be
included in the data file, with the ccode command.
The ccode command is similar to Hennig86's. It uses specifiers to determine actions on
subsequently selected characters:
+ make additive
- make nonadditive
[ activate
] deactivate
/N give weight N (if no number immediately
follows the slash, set weight to 1)
( apply step-matrix costs
) make additive or nonadditive again
The ccode command always must end with a semicolon. The ccode command does not recognize
character names, or character groups (it does recognize the @ option to select from a
particular block). By default, TNT considers all characters nonadditive (this is the
oppossite of NONA and Hennig86), active, and of weight 1.
The costs of character-state transformations can be defined with the cost, smatrix, or
cstree commands (see below); note that the costs for character-state trees are stored in
the costs, so they are a type of "step-matrix" characters. Step-matrix characters for
which the transformation costs fit a character-state tree (i.e. additive relationships
between states) are (for implied weighting purposes) internally recoded as binary.
Scopes, sets, and blocks (back to index)
For many commands, it is necessary to specify series of numbers (representing sets of
trees, taxa, or characters). Numbering for everything starts from 0 (as in Hennig86 and
Pee-Wee/NONA). The characters (if they have been named, see below) and taxa can be
accessed using either their numbers or their names. For any command that takes ranges,
two numbers separated by a period (X.Y) indicate a range (from X to Y, inclusive). A
blank space before the period indicates that the range starts from the first possible
value, and a blank after indicates that the range ends in the last possible value. Note
that names cannot be separated by a period to indicate ranges. Except for the ccode
command (where they have a special meaning), the + (or - ) symbols indicate inclusion (or
exclusion) from the list. Also, up to 32 groups of trees, characters, and taxa can be
defined. After having defined a group, enclosing in curly braces ({}) the group number
(or name, if named) is equivalent to specifying each of the members of the group. For
taxon specifications, using @X Y refers to all the terminals included in node Y of tree X;
under Windows versions, the equivalent is accomplished by selecting nodes under
tree-viewing mode (Trees/View; see below). For character specifications, using @X Y Z
refers to the Yth and Zth characters in block X (note that Y and Z can themselves be
ranges; if the data actually contain blocks, the first one is numbered as 1, and block 0
refers to the entire set of characters). The blocks can be referred to by their names (if
defined, see below; "all" is always used as name for the entire matrix). Within execution
of a given command, the @X for characters remains in effect untill deactivated explicitly
with @0 or @all. If the data are read as interleaved, the blocks of data are
automatically preserved (although they can be changed with the blocks command).
Character and block names (back to index)
Character names can be defined with Data/EditData/CharacterNames, or alternatively, with
the cnames command. The syntax for cnames is as follows:
cnames
[0 First_block ;
[1 Second_block ;
{0 zero state_zero state_one state_two ;
{1 one state_zero state_one state_two ;
{2 two state_zero state_one state_two ;
{+ three state_zero state_one state_two ;
{+ four state_zero state_one state_two ;
;
An additional semicolon is needed to end the cnames statement. The names must all be
defined together; character name definitions can be changed with cnames +N charname
statename(s); (this changes character/state names, only one at a time). The names preceded
by opening square brackets, [, are names for blocks, and the names preceded by curly
braces, {, are character and state names. Not all characters (or blocks) must be named in
a cnames statement; name what you need. The + symbol means "the character (or block)
following the last one named" (if no character or block had been named before, this is the
first character or block). The characters within a block can be named according to their
in-block numbering sequence, by using the @ option:
cnames
[+ First_block ;
[+ Second_block ;
@First_block
{0 zero_of_first .... ;
{1 one_of_first .... ;
{2 two_of_first .... ;
{3 three_of_first .... ;
{4 four_of_first .... ;
@Second_block
{0 zero_of_second ... ;
{1 one_of_second ... ;
{+ two_of_second ... ;
{+ three_of_second ... ;
;
Changing the reading frame with @blockname resets to 0 the value of the + symbol for
character names.
Step-Matrices (back to index)
The step-matrices can be defined from the menus (with Data/CharacterSettings), or
via the cost or smatrix commands. The first defines transformation costs for sets of
characters in particular; the second defines and stores them, for later application to
sets of characters. The cost command has the following syntax:
cost N = ...costs... ;
where "costs" (see below) are the step-matrix costs of character(s) N; note that the cost
command does not make characters step-matrix, only defines their costs (the ccode command
is necessary for this, with "(" making the character step-matrix, and ")"
non-step-matrix). The smatrix command has the following synthax:
smatrix =S (xxxx) ...costs... ;
where "costs" are saved in internal step-matrix number S (named xxxx). Once defined, the
step-matrix can be applied to character(s) N, using
smatrix +S N ;
or
smatrix +xxxx N ;
The syntax for the costs themselves is similar for both the cost and smatrix command:
X>Y Z U/V W ;
defines as Z the cost of transformation from (any state in) X to (any state in) Y, and as
W the cost of transformation between (any states in) U or V. X, Y, U or V can be several
states, enclosed in square brackets (as in the xread command); if the data have been
defined as DNA, the IUPAC codes for polymorphisms apply; using ? indicates all possible
states. If Z or W is replaced by the letter i, the cost is set to 1000 (presumably,
"impossible," but this really depends on the other costs). When the > symbol is replaced
by a slash, /, this indicate symmetry in transformation costs.
The minimum possible number of steps for step-matrix characters is calculated
with approximate algorithms (in PAUP* the minimum cannot be calculated at all for
step-matrix characters, and therefore implied weighting cannot be applied). Beware that
for very complex transformation costs (e.g. random numbers between 1 and 30 for each
cost), the algorithms used by TNT can sometimes overestimate the minimum possible length
in a few steps. These algorithms are more prone to error when there are polymorphic
taxa. If in doubt of whether the program is calculating accurately the minimum possible
steps for a sankoff character, use the usminmax command to determine a user-defined
minimum. If the user-minimum is set to 1, all sankoff characters will be considered
"informative." Then, you either optimize all characters, or you can do a search for each
character, so as to make sure what the real minimum for each character is.
The step calculations during searches are exact for any type of transformation.
The algorithms used consider as possible assignments for optimizations all the states that
occur between 0 and the largest state (states beyond the largest one occurying in the
matrix are not considered; states between 0 and the largest state are considered even if
they are not present in any terminal taxon). This is more or less equivalent to to the
"allstates" options of PAUP*, with the difference that PAUP* tries (as I understand it)
all the states defined in the step-matrix (instead of the largest state observed in the
terminals). It is possible (although unlikely) that in the case of complex transformation
costs, if some state larger than the ones present in the terminals is intermediate between
others, it is required to obtain the minimum length. For example, supposse you have DNA
some characters where the only observed states are A, C and G, where all transformations
cost 10, except to or from T, which costs 6; whether or not T is included among the
possible states determines whether the minimum cost can be 20 or 18 (perhaps even
influencing whether the character is considered informative or not). Then, you can tell
TNT that the maximum number of states to consider for all sankoff characters is some state
(in the case of DNA data, the largest site is T = 3 and a gap is a fifth state = 4; in the
case of protein data the largest state is U = 20, and in the case of 32 alphanumeric
states the maximum state is V = 31). This must be done prior to the the definition of
costs. This obviously slows down calculations, so TNT pays attention to this only in those
cases when it does make a difference (i.e. only when length could change by including the
larger state(s)).
Note that the branch-swapping algorithms only look at trees where the currently
designated outgroup is outside everything else. If you have asymmetric costs for some
characters, and you want the program to choose the rooting that minimizes the overall
cost, you have to add an all-missing ancestor. This is different from the way PAUP* works:
when step-matrix asymmetric characters are present, PAUP* also tries rerootings of the
tree (i.e. it does moves which put the outgroup "inside"). If you do not include an
all-missing ancestor in TNT, or do not force monophyly of all but the outgroup taxon in
PAUP*, the two programs may produce different results.
Ancestral states (for step-matrix asymmetric characters) (back to index)
For characters with symmetric transformation costs, the root states will always include
the states in the outgroup taxon among the possible states. This may not be the case for
step-matrix asymmetric characters; for example, suppose that going from 0 to 1 is very
expensive and going from 1 to 0 is very cheap; if the outgroup has state 0, it may be
better to assign state 1 to the root, and then postulate several independent changes from
1 to 0 than a single change from 0 to 1. That may or may not be reasonable, depending on
what the outgroup states represent. If the outgroup represents a series of successive
sister groups with the same state (example: a wingless ancestor for hexapods, with wing
gain being more expensive than wing loss), assigning to the root a state different from
the one in the outgroup taxon would imply a number of uncounted events. In those cases,
it is possible to consider the states in the outgroup as ancestral -that is, choose for
possible state assignments for the root only from among the states in the outgroup taxon.
This can be done by choosing Data/CharacterSettings, then clicking on the "ancestral"
option, and selecting the characters you want o set. Alternatively, you can use the
ancstates command, and include this in the data file itself. Note that the designation of
a character as "ancestral" has no effect if the character is not an asymmetric step-matrix
character.
Nexus files (back to index)
The basic features of the Nexus format are supported in TNT. When parsing a file, if the
first string in the file is "#NEXUS" (and the macro language is turned off), then TNT will
parse the file as a Nexus file. So, you simply open the nexus file in the same way you
would open a native TNT file. The only blocks that will be read are those corresponding
to data, trees, and assumptions (commands ctype, usertype, include, exclude, and weights).
The data can be read in the "dna," "protein" or "standard" formats only (e.g. format
datatype = standard;). Unrecognized commands or blocks are simply ignored or skipped.
Note that within the NEXUS file, numbering for taxa or characters starts from 1 (one), not
from 0 (zero). You can include TNT commands in the nexus file, at the end of the file,
preceded by "begin tnt;" (with an end at the end, which is ignored by TNT; otherwise,
PAUP* complains that end of file was found when reading a TNT block).
TNT can also export data or trees in Nexus format. For this, choose the Data/Export menu
option (or the export command). The export command allows embedding commands (for PAUP*
or TNT) at the end of the file, using + after the file name, followed by the commands (a
semicolon terminates; to embedd a semicolon in the file, use a period followed by a comma,
as in the quote command).
Implied Weighting (back to index)
TNT implements the implied weighting method of Goloboff (1993), using floating-point
(=exact) fit calculations. The fit for characters is (by default) calculated as f = k /
( e + k ) (where e = extra steps, and k = constant of concavity, changed with piwe=k).
During the searches, the program reports the score as an increasing function of the
homoplasy (or "distortion" to be minimized), which is 1 - f = e / ( e + k ). If you
prefer the score reported as fit, it can be calculated once the search is finished.
Since floating points calculations are used, it is very important that you evaluate group
support when doing implied weighting. Since exact ties are very unlikely, this criterion
may produce overresolved trees if poorly supported groups are not collapsed.
The fit for discrete additive characters (including those with character-state-trees) is
calculated by decomposing the character into binary variables. Note that this may produce
fits different from those of Pee-Wee or PAUP* (which do not do this conversion); the fits
will always be measured as higher in TNT; it has been suggested (De Laet, 1997) that the
recoding produces a more meaningful evaluation of the relative weights. This affects only
data sets with discrete additive characters. If the same characters are defined as
continuous, the fit is calculated without decomposing into binary variables (thus
producing the same fit values that Pee-Wee or PAUP* produce for the additive coding);
treating the additive characters as continuous obviously slows calculations.
In the case of extra internal steps defined by the user, the same number of extra steps
will be added for each of the variables into which the character is decomposed. If you
want some of the variables to have different numbers of steps, you have to recode the
character as binary yourself.
The method is applied to Sankoff characters as well, in which case the fit for a character
with e extra steps and a minimum cost among states of c is calculated as k / ( ( e/c ) +
k ) (where k is the constant of concavity). When the costs are defined as all unity, the
formula produces results identical to those of the standard formula used by Goloboff
(1993). If the user has defined some characters has having extra internal steps, these
are added to e.
It is possible to define a user weighting function, by specifying the values of
weights for different numbers of extra steps, relative to no extra steps. For this, the
data must be read with implied weighting ON. The only requirement for user defined
weighting functions is that no (relative) weight is negative. Weighting functions were
the reliability increases with the first extra steps, and then decreases, are (from the
numerical point of view) perfectly valid. To define a weighting function, type (or
include in a file to be parsed) "piwe[" followed by as many numbers as extra steps you can
have in your data set. The numbers are read as the relative costs of transforming from 0
to 1, 1 to 2, etc. If the relative weights defined are less than the maximum possible
extra steps, the undefined transformations are given a weight equal to that of the last
transformation defined. If more transformations than the maximum possible extra steps are
defined, they are simply ignored. The user defined weighting functions are applicable
only to discrete, non-sankoff characters. When user-defined weights are in effect, the fit
cannot be calculated (i.e. only the score as an increasing function of the homoplasy is
reported). Note that user-defined weighting functions can be easily used for clique
analysis (i.e. defining the function as piwe[1 0 ;).
Tree-view: Selecting Nodes and Editing Tree (Windows only) (back to index)
The tree-view mode is toggled with Trees/View. Under tree-viewing, it is
possible to select nodes by clicking at them (with the left mouse button). The node
selection toggles between red/green/unselected. All the terminals included in nodes
selected with red will be considered as "selected," unless they belong to a node selected
in green which is a member of the red selected node (combinations of red/green selections
can be used to easily define paraphyletic groups). Note that a green selection serves no
purpose unless the green node itself is within a red node.
When the tree is locked to manual edition (and regardless of nodes selected with
the left button), positioning the cursor at a node and clicking on the right mouse button
produces a list of synapomorphies of the node (using character/state names, if defined and
in effect).
When the tree is unlocked, it can be edited by clicking on the right mouse button. If a
single node (red or green) is selected, and the right button is clicked while the cursor
is on an unselected node, the selected node (=source) is clipped and moved to be sister of
the unselected (=target) node. If the target node is the ancestor of source node, the
branch leading to the source node is collapsed. If the right mouse button is clicked on a
red selected node (regardless of whether other nodes are selected), all the terminals
belonging to the node are pruned from the tree (unless they in turn belong to green
selected nodes, in which case they are left in the tree). If the right mouse button is
clicked on a green selected node, and the tree is incomplete, the user can select a
terminal to add to the tree (as sister to the green selected node).
When the tree is edited, the changes are irreversible. If you are not sure of
whether you want to keep the changes, you should save the tree to file before editing it.
To avoid mistakes, the trees are by default locked to manual edittion; they can be
unlocked with Settings/Lock trees.
Saving suboptimal trees (back to index)
The program can save, during branch-swapping, suboptimal trees. The values of suboptimal
are set with Analyze/Suboptimal, or with the suboptimal command. Two values can be set:
the absolute and the relative fit difference to accept a rearrangement; the relative fit
difference (Goloboff & Farris, 2002) is calculated between the tree being swapped and the
rearrangement. These two values take effect also when collapsing trees for consensus
calculation with temporary collapsing under SPR or TBR. The relative fit difference
between the tree being swapped and the rearrangement is calculated exactly (see Goloboff &
Farris, 2002, for details on how this can be done quickly), except for TBR with asymmetric
step-matrix characters, where the relative fit difference can be either over or under
estimated by the algorithms used.
Note that the new technology algorithms (sectorial searches, tree-fusing), since they are
not designed primarily to save multiple trees, do not have a well defined behaviour in
terms of the suboptimal trees saved; some trees will be saved, others won't. Thus, if you
want suboptimal trees, do branch-swapping after having concluded a search with new tech
algorithms.
Memory management (back to index)
Except for the stuff that windows itself uses, memory management in TNT uses special
memory allocators written by Steve Farris. This memory manager insures that TNT has no
memory leaks. For example, reading a new data set is guaranteed to first set the whole
system back to its original state, and then reallocate memory. The same applies to
macros, which use a different memory segment. The only thing that can cause memory
fragmentation when using TNT is repeating some of the following operations many times: (1)
re-setting the maximum amount of "general RAM" the program can use, (2) re-setting the
size of the display buffer, (3) re-setting the total amount of RAM to be used by the macro
language, or (4) re-setting the number of nested input files. These things use normal
memory allocators. Other than this, TNT can be used for any period of time, or for any
number of operations, without causing any memory fragmentation.
Tree collapsing (back to index)
TNT implements several rules for tree collapsing (see Coddington and Scharff, 1994, for
discussion). For rule 1, which may produce collapsed trees longer than the original
resolution, the length-calculating commands (for example, bremer support or the tree
filter when suboptimal trees are to be discarded) calculate the length before collapsing;
note that the final collapsed trees may be longer. The collapsing for consensuses
(including the bremer support) is calculated only temporarily. Note that rule 4 cannot be
used during consensus calculation (because it may multiply the number of trees), but the
results for the (strict) consensus under rule 1 will by necessity be identical to those
for rule 4.
The implementation of most rules requires no special comments. The exception is rule 4,
implemented in TNT as follows:
1) take the tree to be collapsed
2) collapse all nodes for which ancestor and descendant state sets are identical
3) mark all nodes for which ancestor and descendant state sets share state(s).
4) for each combination of possible collapsings for the nodes marked in 3, if the length
of the new tree equals that for original tree, and no node in the new tree remains
unsupported under rule 1, add the new tree to memory.
5) if any of the trees added to memory at point 4 is a duplicate of a previously collapsed
tree, discard it.
The trees produced will thus not be further collapsed applying rule 1 to them; when the
trees are only condensed, existing duplicates are not eliminated automatically (they are
when the trees are "filtered"). This implementation of rule 4 is almost always more time
consuming than the other rules (except SPR or TBR, naturally), in some cases, much more
time consuming. For some input trees, this implementation of rule 4 may produce a larger
number of output trees. Coddington and Scharff (1994) did not prove that all rule 4 trees
can be found by collapsing all rule 1 trees, but it is easily proven (De Laet, pers.
comm.) that, if the trees are a set of all the trees distinct under rule 1, just deleting
the trees that become longer when collapsing produces a set identical to the rule 4 set.
However, if the trees to be collapsed are a subset of the rule 1 trees, discarding the
trees that become longer may produce spurious results -and that's why rule 4 is
implemented as it is. If you are certain that your trees are all the distinct rule 1
trees, you can get the set of rule 4 trees by simply (a) collapsing the trees under rule
1, and then (b) getting rid of the trees that are longer (filtering without collapsing).
The branch-swappers use only binary trees as input. If some non-binary trees
exist when a search is invoked, they are automatically resolved before the search
proceeds. This may cause some trees that were distinct to become identical when resolved.
Additionally, collapsing under rule 1 (i.e. collapse all ambiguously supported branches)
may make the trees longer than the initial binary trees. When any of these two things
happens, a search may deliver N optimal trees of minimum length, but if a subsequent
search is done, only a reduced number will be used as input for subsequent searches (this
happens very often in PAUP*, which always collapses the trees after a search). To prevent
this, TNT by default retains the trees as binary after the search (note that the trees are
anyway compared to make sure that they are distinct if they are collapsed under the
criterion in effect). Using these trees for consensus, bremer support calculation, etc.,
requires that they be collapsed temporarily. The temporary collapsing is set with
collapse+; (or under Settings/ConsenseOptions).
When trees are temporarily collapsed, eliminating taxa requires special
considerations. The characters of the taxa still influence the optimizations; only the
taxon positions are ignored. For SPR and TBR collapsing, the results produced are those
that would be obtained by (1) saving all the binary trees that could be found by swapping
on the tree to be collapsed (if a better tree is found, swapping continues nonetheless
from the original tree), (2) pruning all the taxa to be eliminated, and (3) consensing the
pruned trees. For optimization-based collapsing, evaluation of support for a branch
considers whether merging two branches in a singe branch when clipping a taxon produces a
supported branch (note this is not the same as first collapsing the trees under rule 1,
and then clipping taxa). Eliminating taxa when resampling, estimating consensuses, or
searching untill the consensus becomes stable, is done in the same way. If you want to
remove the taxon from the analysis (as opossed to removing it from the consensus),
deactivate the taxon (with the taxcode command, or the Data/Taxa menu option).
Beware that, although tree-collapsing involves no special theoretical
considerations, it is of practical importance. Misusing the collapsing possibilities
provided by the program may produce absurd results.
Command-Line (back to index)
With File/CommandLine, it is possible to enter commands for TNT. <Tab> switches focus
between main window and command line; <enter> always takes you to the command line. When
focus is on command line, pressing - or + moves screen one page up or down (respectively).
The keys / and \ move through the buffer that remembers the last commands executed. A
list of the commands (with their syntax) is obtained entering "help" at the command line.
You can also give TNT (essentially) any commands you want when invoking it from the DOS
prompt (in the Windows version, commands must be preceded by a semicolon; strings before a
semicolon are interpreted as names of files to open). The cost command takes the redirect
output symbol, >, as a specifier; in the command line the > symbol can be replaced by a
closing curly brace, }.
Batch Menus (Windows only) (back to index)
TNT allows instructions to be stored (in memory or in file), for later execution, under
Settings/Batch. Note that some purely interactive functions (like searching for text in
the text buffer, or selecting tree nodes under tree viewing mode) cannot be stored as
batch instructions. To store instructions, choose Set menus to batch, and then simply
click on the options you want (you can later edit them). When the instructions are
subsequently executed, they are run in order. Make sure the warnings are disconnected
before running a set of instructions (as otherwise the program may halt when it finds a
warning).
Measures of character support (back to index)
TNT implements two types of character support measures: Bremer supports and
resampling. The Bremer supports are calculated from the trees in memory: it is up to the
user to find suboptimal trees. The resampling can be of different types, some of which
are preferrable to others. Standard bootstrapping is influenced by uninformative
characters (and by characters irrelevant to monophyly of a given group). Bootstrapping
(both standard and Poisson) and jacknifing (except for a resampling probability of 50%)
are affected by character weight and transformation costs (e.g. additive characters).
Symmetric resampling is affected by none. Outputting results with the frequency may
produce alterations in the apparent support for groups with very low support (and it
cannot measure support for groups with very low support). Both GC and frequency slopes
solve this problem (for slopes, supported groups have negative slopes; the closer to 0,
the better supported the group). The supports can be measured on the groups present in
any given tree (normally, a most parsimonious tree, but using other trees provides a way
to measure how strongly contradicted some groups are). The slopes must be measured by
reference to a pre-existing tree.
If some taxon has its position in the tree very poorly defined, it may strongly
decrease the support for many groups in the tree. If you wish to find out whether the
rest of the tree is well supported, you can eliminate the taxon from the consensus when
resampling. To find out which taxon is floating around you may have to first do a more
superficial estimation, saving the trees, and trying to prune different taxa (note that in
such case you have to take the precaution of not collapsing the trees with SPR or TBR, as
otherwise you may never find the floating taxon: once you found it, you turn collapsing
back to SPR or TBR -the most effective--and do a more careful estimation of the resampling
frequencies eliminating the evil terminals).
A note on Linux-Mac OS X versions (back to index)
Linux and Mac OS X versions of TNT have a couple differences with the standard versions.
First, those versions can take a comma ( , ) instead of a semicolon ( ; ). This is
because the shells there use semicolons to separate commands, and giving arguments to TNT
from the command line becomes difficult. Second difference is that those versions can run
in the background. If you want to start TNT in the background, use bground as first
argument, then give the normal arguments you would give the program (do not activate
reports and do not change the silent status, but you can set report+ to report every
certain number of trees, replications, or seconds, so that you can monitor progress by
inspecting the output file; make sure you do have output files, otherwise the program
would run and exit without doing anything!), and end with the usual '&'. TNT will not
write to the console, or expect console input (at least if no errors occur). It will write
its results to a file and exit smoothly (exit code is 0; if the program runs out of
commands while in the background it exits with code 1). Should some error occur during the
search, the program writes a message to the ouptut file, and exits with code 255. Try
./tnt bground p zilla.tnt, echo=, log peep, rep+1, mu10=ho3, le, ne, quit, &
and do "tail -f peep" to monitor search progress.
Running TNT in parallel
If you install the PVM system in your machine, you can run TNT in parallel. If you have
difficult data sets, this may be advisable even if you have a single-processor,
single-machine, as the parallelization interface of TNT allows for a lot of control in
defining search routines (e.g. you can re-analyze sectors of the tree using any search
routine you want). Unlike other phylogeny programs which can search in parallel (e.g.
POY; Wheeler et al., 2003), TNT does not have pre-established routines for working in
parallel. Rather, TNT allows to parallelize routines, interacting with the scripting
language; although harder to use, this is much more flexible and can be tailored to
specific needs. The command that handles most of the options for parallelization is ptnt.
With the command ptnt begin TNT can launch jobs in parallel; each job can consist of one
or several tasks. Each task in the job receives a set of instructions; the instructions
are determined by the user, and can be (essentially) any set of TNT commands (possibly
including instructions to launch other jobs in parallel). After launching a job, the
master TNT process continues operation normally (i.e. asynchronously). The results from
the slaves are brought back to the master with the command ptnt get; the master does not
receive results from the slaves untill the master itself executes a ptnt get command;
slaves that finished running remain dormant (i.e. not using processor time) until the
master requests results from them (or until the master sends the slaves a new set of
instructions). The master may launch several jobs at the same time; up to 32 jobs can be
running at the same time, and the master may handle exchanges of trees/information between
them. When a job is launched, it automatically receives from its parent the data (and,
optionally, starting trees), as well as its identification number (for t tasks, numbering
goes from 0 to t-1). Since a slave task knows its own number, this can be used to have
some tasks perform differently (alternatively, the same thing can be accomplished by
launching other job(s)). A TNT process that launched some jobs can monitor (with ptnt wait
or ptnt gwait) the progress of the jobs, letting the user access to numbers of trees,
replications, scores, etc., in the slave tasks; based on this information, the parent
process can then recall jobs back (ptnt get), tell the slaves to skip some instructions
and move to a specific part of the instructions (ptnt goto and ptnt skipto), or
send new sets of instructions to the slaves (ptnt again). The parent process can get
trees from its slave tasks, at any point (with ptnt spy).
A very simple example shows how to parallelize resampling, in a job that receives
the name psample ("parallel sampling"):
ptnt begin psample
10 /* run 10 slave tasks */
+. -myself = /* run in all hosts but master node */
/* ...these are the slave instructions... */
collapse tbr ;
resample rep 10 savetrees ; /* resample */
tchoose 1. ; /* choose individual trees */
return ;
ptnt wait ( tasksleft[psample] = = 0 ); /* wait for all tasks to end */
keep 0 ;
ptnt get psample ; /* get results from psample */
collapse notemp ; /* trees are already collapsed by the slaves */!
freqdif ; /* calculate freqdifs */
If the number of tasks (10 in this case) is not specified, as many tasks as hosts minus
one are used (and no task is launched on master node); if no host selection is specified,
the default selection is used (i.e. the same one indicated in the example).
Consensus estimation (back to index)
TNT implements the consensus estimation method of Goloboff and Farris (2001). This is
based on making a certain number of independent searches, and checking whether the end
results of all of them share groups. A group which is present in all (or most) of the
searches, even if the search failed to produce optimal trees, is likely to be present in
the set of optimal trees (and thus the consensus). This can be used to have a quick idea
of the approximate results for your matrix, or to create constraints to speed up searches.
The consensus estimation is invoked with the qnelsen command, or under
Analyze/ConsensusEstimation.
Tree-searching algorithms (back to index)
The program implements several types of search algorithms; in Windows versions, all of
them are accessed with the Analyze menu option. For really small data sets (i.e. below
20-30 taxa), exact solutions can be obtained via implicit enumeration (or
"branch-and-bound"). For larger data sets, heuristic ("trial-and-error") methods are
required. The basic algorithms implemented in the program are wagner trees (with
sequential addition of the taxa, in the best available position, to form a tree as short
as possible), and different types of swappers and algorithms that make rearrangements on
preexisting trees. Branch-swaping can be of type SPR or TBR.
For relatively messy, but not very big data sets, the best algorithm consists of multiple
random addition sequences plus TBR (RAS+TBR); this is invoked with
Analyze/TraditionalSearch, using wagner trees as "starting trees," or with the mult
command. Whether or not you are likely to have found the best trees, or all the tree
islands, depends on the ease with which the algorithm converges to the minimum length
found. For this reason, the program reports the number of replications that reached the
best score found. The program also reports whether some replications had partial
overflows of the tree file, in which case, some most parsimonious trees may not have been
found during the search; further branch-swapping starting from the trees in memory ("RAM")
can be used to find the additional trees.
For more difficult data sets, the special and combined algorithms will be
required. Note that instead of wasting search effort, it is often preferrable to make one
very aggressive search to make sure what the best score for your data set is, and then
produce additional (independent) hits to that score until the strict consensus has become
stable (as discussed in Goloboff, 1999). There are four basic types of special algorithms
implemented in TNT: ratchet, drifting, sectorial searches, and fusing. The ratchet in TNT
is slightly different from the ratchet as originally described by Nixon (1999), and as
implemented in Nona/Pee-Wee. The ratchet consists of two phases, perturbation and search,
which are sequentially repeated. In either phase, the ratchet in TNT does not look for
multiple trees (in Pee-Wee/Nona it was necessary to set the number of trees to save in
each phase, but the best number is always just one); multiple trees can be found only as
the search phase finishes. The ratchet always uses TBR branch swapping, and it alternates
the search phase (S) with three types of perturbation phase: original weights (O),
upweighting (U), and deleting (D). Thus, the cycles will be: O,S,U,S,D,S,O,S,U,S,D,S,O...
etc. Any of the O, U, or D phases can be skipped (from the Ratchet settings dialog, or
with the ratchet command). During the perturbation phase, rearrangements of score equal
or better than the tree being swapped are always accepted, untill a certain number of
rearrangements have been made (or untill a certain percentage of the total swapping on the
tree has been completed). This seems to provide better results than the original ratchet.
The "autoconstrained" option calculates the (strict) consensus of the previous tree and
the tree resulting from the rearrangement phase, and during the subsequent search phase,
only rearrangements that do not violate monophyly of the shared groups are done. The
rationale for this is that improvements are likely to be found in areas that have changed
during the perturbation phase. Every certain numer of constrained cycles, an
unconstrained search phase is done (to insure global optimality). If this is used in
combination with user-defined constraints, both are combined.
Tree-drifting is quite similar to the ratchet, but the perturbation phase, instead of
reweighting the characters, accepts the moves with a probability that depends on both the
relative fit difference (Goloboff and Farris, 2001; see Goloboff, 1999) and the absolute
fit difference between the tree being swapped and the new tree. It has the advantage that
memory managing is easier, and for that reason is the only perturbator that
sectorial-searches (see below) can use. The perturbation phase also alternates between
acceptance of optimal trees only (O), and suboptimal trees (U): O,S,U,S,O,S,U... etc. Any
of the O or U phases can be skipped (the O phase, with drift:noequal; or with the skip
optimal-only drifting option in the tree-drift settings dialog box; the U becomes
effectively an O by setting the score difference to a very low number).
Sectorial-searches are of two basic types: constraint-based and random; a combination of
both is the mixed sectorial searches. Sectorial searches take a portion of the tree,
create a reduced data set, and produce a mini-analysis of that data set. If a better
solution for that portion of the tree is found, it is replaced back into the original
tree. Every certain number of (user-determined) rearrangements, a round of global TBR is
done (to insure global optimality). If the sectors are below a certain size, the
mini-analysis consists of three RAS+TBR (random addition sequence wagner trees plus TBR);
if the three sequences end up in trees of the same score, the mini-analysis is interrupted
(i.e. the reduced data is "clean" and not worth of further analysis); otherwise, three
additional RAS+TBR are done. If the sectors are above that size, a certain number of
iterations of tree-drfiting is done to the reduced sector. The three types of
sectorial-search differ in how the sectors are selected. Constraint-based searches choose
the sectors by reference to constraints: the nodes that are resolved as polytomies in the
constraint tree, connecting to no less than N branches (where N is some number defined by
the user; default is 10). A reduced data sets corresponding to each of the polytomies in
the constraint tree is analyzed, in turn; this is repeated a certain number of times
(default=3), because changing some sector may imply that other sectors should change as
well. Random sectors are chosen at random, to be no smaller and no bigger than a certain
(user defined) size. The mixed sectorial searches use both constraint-based and random
sectors; they can only be used with multiple addition sequences. For the mixed sectorial
search, a temporary constraint tree is created by consensing the tree found at some stage
of the present addition sequence (user defined: this can be the wagner tree, which is the
default, or optionally the tree produced by SPR or TBR), and this constraint tree is used
to do a constraint-based sectorial search. Once this is finished, a random sectorial
search is done (unconstrained, or using only the user-defined constraints if they are in
effect).
Tree-fusing mixes trees, producing better trees. If the trees comes from
different searches, the score is often considerably improved in very short time.
Tree-fusing produces (so to speak) a synergistic effect for all those previous searches;
this also means that the rate of score improvement is not necessarily lineal (i.e. all the
searches before fusing were not producing a score improvement beyond some bound, but were
nonetheless accumulating "momentum" for the subsequent fusing).
Determining best search parameters (back to index)
For many data sets, the time to find best scores during searches may depend drammatically
on the aggresivenes and parameters used during the search. For example, using very
superficial searches for difficult data sets will keep grinding for a very long time
before the best score is actually found; using more aggressive parameters will find best
score sooner. The parameters to change are size and number of sectors to run under random
sectorial searches, number of initial replications to submit to fusing, and rounds of
tree-drifting or ratchet.
A careful consideration of how the data set behaves to changes in parameters is
the best way to proceed, but if you have no idea of how a data set behaves, or don't know
how to set the parameters for a search, you may trust to the program the choice of
parameters. This is done with the set initial level option of the New Technology dialog
box (or the level option of the xmult command). If entered as a command, the level must
be a number between 0 and 10 (0 is very superficial; 10 is extremely exhaustive); if
entered in the dialog box for search level, it must be a number between 0 and 100 (again,
100 is the heaviest search). If using a driven search (i.e. more than a single hit of the
xmult command), you may have the program check whether best score is being found easily or
not (and then decrease or increase the search level accordingly). This is set with the
chklevel option of the xmult command. Again, letting the program try to determine the
best parameters during the search is a poor substitute for finding the best parameters and
letting them fixed during the search, and is intended for cases where you have no
time/experience to determine best parameters by yourself.
Implementation notes for tree-searching (back to index)
General
This section describes some details on how the searches are implemented. It is intended
to give the user clues on how to best use the program, or at least understand some of the
ways in which the program behaves under different circumstances.
When all the characters have equal weights, the swapper proceeds the fastest. If some
characters are weighted differently, the search algorithms proceed somewhat slower (i.e.
it takes longer to complete a wagner tree or swap completely through one tree; note that
finding a shortest tree might still be easier if the weighting increases the structure of
the data set). If the characters are weighted using very different weights (e.g. every
character is assigned a weight from 1 to 50, at random), the search algorithms reach a
floor (depending on the data sets, about 50% slower than equal weigths).
Searching under implied weights, since it uses floating-point calculations, is the
slowest, being 2 or 3 times slower than equal weights (note than in PAUP*, using implied
weighting slows down the searches by orders of magnitude). The largest speed differences
with PAUP* are in running large data sets (between 20 and 50 times, sometimes more than
100 times), and implied weighting (over 500 times, for large data sets).
The numbers of states also affect running times. The more states the characters can have,
the slower the program runs. The program internally packs characters in groups of
characters with 2 states, 3 to 4 states, 5 to 8 states, 9 to 16 states, and 17 to 32
states. Sankoff characters are packed always separately. This implies that the time
differences will come when the numbers of characters in each of those categories changes
significantly. For sankoff-style characters, the time increases rapidly with the square
of the number of states (in some portions of the code, the time increases actually with
the cube). Thus, the algorithms used react badly to increasing numbers of states; for
Sankoff characters with 16 states or more, TNT is about as slow as POOP* (or even slower).
The implementation of searches under implicit enumeration is similar to that in other
programs, and there is nothing special about it. As usual, exact searches take the
longest for the least poorly structured data, and the times increase drammatically with
the numbers of taxa. Since TNT is conceived to work mostly with large data sets, there
seemed to be no much point in using energy to optimize the implicit enumeration
algorithms.
The tree swappers can complete swapping on a tree faster if no additional equally
parsimonious trees are being found. This is because in such a case length calculations
for a rearrangement can be abandoned as soon as the program discovers that the tree is as
long as the best one found so far; otherwise, additional characters have to be checked to
make sure the tree is no longer than the best one found so far (and, to make things worse,
the tree has to be optimized for collapsing and compared to all the pre-existing trees, to
make sure it is not identical as collapsed). This is used to speed up searches under
multiple random addition sequences, where a reduced number of trees is retained for each
replication (such that most of the time the search is being done with a full memory
buffer).
Wagner trees
Except in the case of constraints, the addition sequence for wagner trees is totally
randomized, except for the outgroup taxon (which is always the first taxon added). The
insertion sequence of new taxa to the growing tree is (by default) not totally random: the
possible locations for the new taxon are tried from the root of the tree to the tips, or
from the tips to the root (which option is chosen is determined at random, for each taxon
to add; both are equiprobable). In some extreme cases, this may be undesirable. If the
data are entirely uninformative (e.g. a matrix with no characters) and no tree collapsing
is in effect, then pectinate trees result from such insertion sequence. It is possible to
randomize the insertion sequence for new taxa, so that all locations are tried in a random
order. This is set with rseed[; the alternative is rseed]; (default, up/down sequence).
For large data sets, building a wagner tree with a randomized insertion sequence may
actually take shorter than with the up/down sequence (that is probably because the best
placement is often found somewhat in the middle of the tree, not in the tips or root;
checking the tips or root first provides worse bounds to give up length calculations, for
a greater portion of the time). For data with no informative characters (and with
collapsing turned off), wagner trees with a random insertion sequence are (more or less)
equivalent to random trees. If your data are more or less structured, setting one or the
other is irrelevant.
Note that randomization of the insertion sequence has no effect when wagner trees are
built under constraints (i.e. the insertion sequence is always up/down in this case).
Swappers
The two types of swapper are designed to work as best as possible under diferent
circumstances.
The time to complete swapping on a tree (that is, on a tree which does not lead to better
trees by swapping) changes with about the square of the number of taxa for both SPR and
TBR (this is true even when TBR actually "looks" at a number of rearrangements which
increases with the cube of the taxa).
SPR is designed to work best at the initial stages of a search, when better trees are
being found frequently; it is not as fast as it could be for trees which are already
optimal or near-optimal.
TBR is designed instead to work best for large numbers of taxa, when the trees being
swapped are optimal or near-optimal, that is, when better trees are not being found
frequently, and also to work best on well-structured data (i.e. data where there are
numerous relatively well supported groups). If the data are very poorly structured,
swapping through a tree takes longer. For Kaellersjo et al.'s 2500-taxon matrix (1999),
the time to complete TBR swappping on a near-optimal tree (running on a 800 MHz PIII
machine) is about 12 secs.; PAUP* on the same machine takes over 18 minutes to complete
TBR swapping on the same tree (i.e. a 90 times speed difference; PAUP* takes about 6
minutes to complete SPR on the same tree, so that completing TBR for 2500 taxa in TNT is
30 times faster than completing SPR with PAUP*).
Neither type of swapper is highly optimized for low numbers of taxa. For small data sets,
expect no big differences ( i.e. no more than 5-10 times) in running times for TNT and
other programs. These take no time anyway, so it seemed best to preserve an
implementation focused on large data sets.
Effects of constraints on running times
Heuristic searches.- There are three types of constraints: positive tree-constraints,
positive group constraints, and negative group constraints. Tree constraints are defined
with the = option of the force command (or, from the menus, as "positive constraints" with
no floating taxa). The latter are defined with all other options of the force command
(or, from the menus, with positive constraints with floating taxa, or negative
constraints). The three types can be used simultaneously (e.g. "search for the best trees
where A+B+C from a monophyletic group, but A+B do not").
The tree constraints are implemented in such a way that the moves violating the
groups established are never attempted. This type of constraint therefore speeds up the
searches (if many groups are defined, by a significant factor). Note that the number of
trees examined to SPR-swap or TBR-swap through a single tree is the same under constraints
(i.e. the moves are counted as if effected and rejected).
The group constraints (positive and negative) are implemented internally by
creating a matrix of group membership variables (where taxa belonging to a group have
state 1, taxa outside the group have state 0, and floating taxa have a missing entry).
For group constraints, all the same moves are evaluated for length. If the tree is within
the current length bound, then the group constraints are evaluated, and if the new tree
does not pass, it is rejected (even if shorter). Thus, using group constraints does not
speed up searches at all. If the group constraints make the tree much longer (i.e. if
they forbid many groups present in the consensus of MPT's), they will actually slow down
the search (i.e. many moves will produce better or equally good trees, which have to be
subsequently rejected because they violate constraints).
For wagner trees, the program adds each taxon in a position such that no
constraint is violated at any point. For negative constraints, this means that the first
two taxa which are part of a group constrained for non-monophyly must be placed as
non-sister groups. Of course, a better tree might be found if the non-monophyly of the
group is delayed (to be broken later, for example, by adding another taxon in between
them), but then if non-monophyly has been delayed, it may now be impossible to create
non-monophyly without at the same time violating some positive constraints. This means
that, for negative constraints that increase tree length considerably, the wagner trees
may be quite far from the best length achievable under the constraints (e.g. it is
advisable not to use wagner trees to estimate consensuses or jacknife frequencies under
those circumstances). Additionally, the random addition sequence may change somewhat
when positive (tree or group) constraints are used in combination with negative
constraints. The first two members of a group to be non-monophyletic must effectively be
non-monophyletic when they are both added to the tree, but this may be impossible if those
two are in turn the only two members of a group constrained for monophly added so far.
The addition sequence is then checked when positive and negative groups are enforced,
changing it to prevent this. The addition sequence is therefore not completely random
when there are positive and negative constraints in effect.
Exact searches.- Both positive tree constraints and positive group constraints speed up
searches under implicit enumeration (the more so if they constrain groups which are anyway
present in the optimal trees). Negative tree constraints slow them down, to the extent
that they make the best possible trees longer.
Scripting (for programmers or advanced users) (back to index)
TNT has a scripting language, which allows the user to automate complex tasks, or
"program" TNT to do things it normally doesn't do. The scripting language has flow
control, loops, and internal variables that can be accessed by the user. The scripting
language is normally disabled; it is enabled by entering "macro=;" at the command prompt.
The macro command also controls the basic settings for the macro language, like total RAM
for macros, number of variables, number of nested loops, and protection level for
accessing variables/arrays (for details, see online help for macro).
The variables and flow control use numbers (actual values or internal variables)
or expressions (compound expressions, always must be enclosed in parentheses to be
properly parsed).
Examples of scripting language can be found in the files contained in
zipdruns.exe (a self-extracting excutable). Exactsol calculates trees by implicit
enumeration (using a complete down-pass optimization for each tree: it is very slow; the
addtax routine is used as a function from within exactsol). Macswap does SPR swapping on
a tree, and prints out a message every time it finds a better tree (it also uses down-pass
optimization to evaluate trees). Sank does sankoff-optimization; it is more elaborate
than the other two (it can be used only on small trees; make sure you have set up the
number of variables to a large enough number).
User variables.- User variables can be accessed in any context, by enclosing the number of
variable to be accessed in single quotes. Thus, if variable number 0 has a value of 1200,
inputtting the string hold '0' is equivalent to inputting hold 1200. Variables can be
given names, in which case their values are accessed by enclosing their name in single
quotes. Using double quotes instead of the single quotes is equivalent to using single
quotes, except in the case of parallel versions of TNT, where double quotes passed as
instructions to slaves are interpreted on slave, and single quotes are always interpreted
locally. Using the quotes or double quotes properly, the user variables can act as
pointers. User variables can be defined (when declaring their name) as arrays (maximum is
5-dimensional arrays). The indices are then enclosed in brackets, separated by a comma.
Thus, 'n[j]' is the j-th value for variable n (note that the arrays are zero-based), and
'p[x,y,z]' is the z-th value for row x, column y, of variable p. Note that 'n'[j] will be
parsed as the number corresponding to variable n followed by the expression [j] (which,
according to context, may or may not be a legal expression). The indexing works by
substituting values; each time a new "dimension" is found, the cell is considered to be
that for the base cell plus the value of the dimension; the user could set the values so
that a "matrix" is properly organized, but declaring a variable as multi-dimensional when
it is given a name automatically sets up the arrays.
It is also possible to enclose in the single or double quotes an expression
(which must be enclosed in parentheses); what is inside the parentheses may, in turn,
contain references to user variables. Thus,
set 5 '(%1)' ;
will set variable 5 as equal to the variable numbered as the first argument. Similarly,
if variable 0 equals 10, variable 1 equals 20, and variable 30 equals 100, then the
expression
set 5 '('0'+'1')' ;
will assign the value of 100 to variable 5 (i.e. assigns to variable 5 the value of the
variable whose number equals the sum of the values of variables 0 and 1).
User variables can also be used to store strings, accessed by preceding the
number (or name) of the variable by a dollar sign, $. Strings can be stored as literal
(in which case, expressions like '0' within the string are replaced by the value that
variable 0 has at the time of expanding the string; variables can be expanded only as
numbers, as well as arguments; no string expansion is allowed within strings).
The values for the user variables can be changed only with the set command. Set
followed by a variable number (or name), followed by an expression, sets the value for the
variable to the value for the expression. The number of the variable to be set can itself
be an expression (but if so, remember, it must always be enclosed in parentheses). The
expression can be an expression as in the C language; logical expressions are evaluated to
either 0 (FALSE) or 1 (TRUE). The expression can be replaced by either ++ , -- , += , -=,
*=, /= (with meaning as in C; note that only post-decrements and post-increments are
allowed). Thus
if ( length [ 0 ] > ' maxlen' )
set 1 '1' + 1 ;
else set 0 '0' + 1 ;
end
will increase in one the variable number 1 if the length of tree 0 is greater than the
value stored in the variable named by the user as maxlen, and otherwise increase in one
the variable 0. The same is achieved with the more compact expression
set ( length [ 0 ] > 'maxlen' ) ++ ;
Expressions.- Expressions can be any combination of operations. Valid operators are + - *
and /, as well as bit-operators ( & | ^ ). Valid logical comparisons are ==, > , < , >= ,
<= , ! , and !=. Since precedence is evaluated always from left to right, be careful to
delimit subexpressions as desired:
if ( 'maxlen' <= 1000 + 500 )
is actually equivalent to
if ( ( 'maxlen' <= 1000 ) + 500 )
so that it will first evaluate if maxlen is less than 1000, and then sum the result to 500
(if maxlen is less than 500, result is 501, otherwise result is 500; the expression always
evaluates "true" ... probably not what you wanted). Either remember the precedence and
invert the order of expressions:
if ( 1000 + 500 >= 'maxlen' )
or use the original order and parentheses :
if ( 'maxlen' <= ( 1000 + 500 ) )
Note that parentheses must always be balanced.
Changing values of variables.- The values of variables are changed with the set command:
set N (expression) ;
The "expression" may in this context not be enclosed in parentheses (this is the only
exception to the rule given above). The expression in this context always must end with a
semicolon. The example above will set variable number N to the value of the expression.
You can set the i-th variable (where i itself is any expression) from N, as if N was an
array, by using:
set N[i] ( expression) ;
If some private variables have been defined in previous files (see "Private variables"
below), the number N is set automatically to the number within that file (use names to
access variables from previous files).
To set variables as strings, use:
set N $text ;
text can be any number of characters, with the proviso that the string starts at variable
N, and continues to subsequent variables; every 4 letters of the string occupy one user
variable (which is a long int), so that the actual number of user variables needed to
store a string is the length of the string divided by 4. The semicolon indicates the end
of the string (which can contain blanks or spaces; a period followed by a comma is
replaced by a semicolon, as in the quote command). If "text" contains expressions to be
replaced (i.e. single quotes or symbols for arguments), it is replaced on reading. It is
possible to define the string as "literal" by including an equal (=) sign before the
number of variable:
set =N $text ;
A literal string has expressions expanded when it is itself expanded (so that it can be
expanded to different strings). No string expansion is allowed within strings (although
it may work in some cases, its results are not predictable).
A string can be copied from one variable onto another easily, with the set
command:
set 10 $$0 ;
copies string 0 onto string 10 (make sure string 0 doesn't extend to string 10 before
doing this).
Private variables.- When input files are nested, all the variables named in preceding
files are "private" -that is, an internal numbering starting from the last unnamed
variable is used. Accessing variables by their number is then relative to a file. If a
name is given for a variable, the variable names are checked from latest to earliest, so
that if variable names are duplicated in nested files, the most internal files will refer
to their own variable (for example, simple counters like "i" or "a"); names defined in a
preceding file will be recognized only if not defined later. The number of private
variables is determined by the largest unnamed variable, so if you want to refer to some
variables by their number (for simple counters), it is wise to reserve the lowest numbers
for those, and start naming from variable 10 or more (this allows to keep "memory
management" easy and clean). You can also keep more variables than the named ones as
"private," simply by issuing a private n; command (where n is the number of private
variables) at some point before calling the next input file.. At the start of each file,
you can also include a safe vars n; statement (which checks whether n user variables are
actually available to the user from that file, and calls an error message otherwise).
In this way, each input file can have its own internal numbering of the variables; this
allows to define "functions" in files, that are safely called from other files. The only
way in which a file can access a variable named in a previously opened file is by
referring to it by its name (and only if the same name did not exist in the file itself:
in that case, the variable corresponding to the file itself is accessed). When an input
file is closed, all the variables that had been named in that file are denamed; if you
want to internally rename variables within a file, you have to issue first a var-;
command, which denames all non-private variables (var-N; denames all variables above N, N
included).
Changing values for arrays.- The setarray command sets arrays. This can be used to read
in matrices (of any dimension):
setarray 5,4 matrix
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
16 17 18 19
;
this will set the values for a variable called matrix. The numbers can be any valid
expression; note that 0/1 matrices must have a blank space between the numbers.
Naming user variables.- User variables can be named, and accessed through their name:
var =
0 tree_number
1 result
2 other ;
If private variables have been declared in previous files, the numbers (which can
themselves be expressions) correspond to the internal numbering for current file. If any
number is replaced by a + this is interpreted as "the lowest unnamed variable" (if there
are private variables, the first time + is used within a file, it refers to the lowest
possible number that can be accessed within that file). If you are going to store values
starting from variable number N, you can define the number of variables to store by using
a comma and the number of cells, after the number:
var =
N, (ntax+1) Taxon_list
+, (ntrees+1) Tree_list
+, (nchar+1) Char_list
+,(ntax+1),(nchar+1) Matrix
;
After defining the name for variable 10 as "Taxon_list", it will consider that the next
variable to be named is variable number 10+ntax+1, so that the variables do not overlap.
If there are 12 taxa and 15 trees, then the name "Tree_list" is assigned to variable
number 22, and the name "Char_list" is assigned to variable 37. Since it is always
possible to refer to variables only by their names, the user does not need to worry to
which particular variable a name has been assigned.
An alternative (equivalent) format for declaration of multidimensional arrays is:
var =
N Taxon_list [ ( ntax + 1 ) ]
+ Tree_list [ ( ntrees + 1 ) ]
+ Char_list [ ( nchar + 1 ) ]
+ Matrix [ (ntax +1 ) , (nchar +1 ) ]
;
When using the square-bracket notation for declaration, the commas are optional.
Declaring the variables as multidimensional automatically sets up the arrays in a proper
way; changing the value for the arrays can "disassemble" the matrix and make further
accesses difficult or impossible. To prevent this, it is possible to have the macros
check whether all the accesses to arrays are legal (and then permitted only if the
variable has been declared as such). Without protection, the user is free to access and
change any values (e.g. to set up his own arrays). The protection level is set to N with
macro prot N.
Floating point printing.- By default, all the numbers in user variables are stored as
floating-point. When printing the values, you can use different numbers of significant
digits. Note that many operations under scripting involve (internally) expanding the
numbers and then parsing the strings, so that changing the number of significant digits
will change the precision with which many routines are executed.
The command macfloat N; sets the number of digits to be N (if 0, the printed numbers look
like integers). The option macfloat e; uses exponential notation (not readable by
commands, useful only to print out tables).
Storing the values as floating point uses 64 bits per value; if you want to use
only 32 bits (integer calculations, to save memory), issue a macfloat-; command before
turning the macros on.
Internal variables.- The user can access a number of internal variables (i.e. number of
taxa, characters, etc.). You can get a list of those entering help+;at the command
prompt. Internal variables can be accessed only in certain contexts in the macro
language. The general rule is that user variables can be accessed from any command (by
enclosing their number or name in single quotes), while internal variables can be accessed
only by the macro commands (loop, if, set, var, etc.).
Note.- The internal variables are a key component of the TNT macro language; if accessing
other internal variables, besides the ones already included, could be useful to some
users, get in touch with PAG and ask him about the possibility of including those
variables in TNT (given the design of the macro language, it is really easy to add new
internal variables).
Flow control.- The flow control is established with the if and loop commands. There are
two possible syntaxes for the if command; one is:
if ( condition )
action(s)
end
where "condition" is any valid expression, and "action(s)" is any set of valid TNT
commands. If the condition is "satisfied" (i.e. evaluates to non-zero), then action(s)
are executed; otherwise, all actions before the end are skipped (note that TNT interprets
its scripts as it reads them, so that the action(s) may themselves contain syntax errors,
which won't be discovered if the action(s) are skipped; the errors will only be discovered
when the actions are executed). The alternative is
if ( condition )
action(s) A ;
else
action(s) B ;
end
which will execute action(s) A if the condition is satisfied, action(s) B otherwise. Each
if must always be matched by else or end; each else must always be matched by end. Nested
if commands are legal, without restrictions as to the number of nestings.
The other command for flow control is loop. The syntax is,
loop i+j k action(s) stop
the numbers i, j, and k can be numbers, variables, or expressions. This will repeatedly
execute action(s), k - i + j / j times. If the symbol '#' followed by the number of loop
(see below) is included in action(s), it is replaced by the value for the iteration being
done. Thus, the symbol will the first time be replaced by i, incrementing in subsequent
iterations by j, until value exceeds value k (if i is greater than k, the loop is
decreasing, and j is taken to be negative, even if preceded by +).
Nested loops are possible (the maximum nesting is determined with macro*). To
refer to the value for each loop, use the '#' followed by the corresponding number. Each
loop must always be matched by a stop. Within the loop, the commands endloop, setloop n,
and continue, are recognized (endloop ends the loop, setloop resets the value of the loop
to n, and continue skips all the action(s) between the current point and the stop that
ends the loop).
Essentially any action can be done from within the loops, except denaming user
variables. The loops use RAM from the macros, so if some long loop cannot be executed
under certain conditions, switch the macro option to off and increase the RAM for macros.
Return values.- When a file with instructions is closed, it may be desirable for it to
have some kind of return value (having it to store the exit value in a user variable may
be cumbersome and make the file less flexible). The command return N; closes the input
file and sets the internal variable exstatus to the value N.
Recursive calls.- Input files with instructions can be called recursively. To call a
file from itself, use the recurse command (followed, if desired, by arguments, just like
used for the run command). This (together with the declaration of private variables)
allows to define quite complex routines. The number of times a file can call itself with
recurse depends on the maximum number of input files (default = 10); the number of input
files allowed is changed with mxproc (the file "exactsol.run" does this change
automatically). An example is below, in the files "exactsol.run" and "addtax.run." That
finds an exact solution for any number of taxa (it uses a complete down-pass optimization
for each partial tree, so it is rather inefficient). File "exactsol.run" must contain:
silent =all ;
set 0 ntax + 2 ;
mxproc '0' ; /* make sure you have enough input calls */
var = 0 attax
1 bound
2 wkdone ; /* attax , bound , and wkdone are private */
set bound 10000000; /* use a big initial bound */
keep 0 ;
tread ( 0 ( 1 2 ) ) ; /* set an initial network */
set attax 3 ; /* attax = next taxon to add */
report- ;
addtax ; /* this is the "recursive" function */
progress/ ;
report= ;
tc 1. ;
silent - all ;
proc/ ;
File "addtax.run" must contain:
set 0 'attax' ;
set 1 ntax + 1 ;
private 2 ;
loop 1 nnodes [ 0 ]
if ( #1 == '1' || !isintree [ 0 #1 ] )
continue ;
end
if ( 'attax' == 4 )
set wkdone ++ ;
progress (('wkdone'*100)/15) 100 Working ; /* progress */
end
edit ] 0 '1' 'attax' #1 ; /* attach new taxon */
set 2 length [ 0 ] ;
if ( '2' <= 'bound' )
if ( 'attax' < ntax )
set attax ++ ;
recurse ; /* call yourself !! */
set attax -- ;
else
if ( '2' < 'bound' ) /* update bound */
keep 1 ;
set bound '2' ;
end
if ( '2' == 'bound' && ( ( ntrees + 1 ) < maxtrees ) )
copytree 0 ; /* save multiple trees */
end
end
end
pruntax 0 / 'attax' ; /* deattach taxon */
stop
proc/;
Note that the C style comments in the two example files are perfectly valid.
Goto.- The instruction goto filename tag [arguments] parses file filename starting from
the point labeled tag (with label tag). The file can be the same file, or a different
one. This can be used to have related routines in the same file (a sort of "library").
A file can be defined as the default file to parse, with goto = filename. Subsequent
calls to goto must specify only the tag (and arguments, if any).
Branch-swapping.- In a way similar to the loop command, the commands sprit and tbrit
initiate a loop, with the difference that in each iteration, the tree given as argument N
is changed (under spr or tbr branch-swapping):
sprit N action(s) stop
Action(s) can be any valid TNT command, plus additional commands recognized only from
within sprit or tbrit:
continue skip remaining actions and do next rearrangement
endswap end the swapping (if never saved, tree N is unmodified)
resetswap save current rearrangement and restart swapping
While within a sprit or tbrit loop, the internal variable percswap equals the percentage
of rearrangements done to complete swapping on current tree. Together with the progress
command, this can be used to report partial progress of the macro instructions.
Sprit or tbrit can be nested with loops, but no sprit or tbrit can be nested within a
sprit or tbrit.
Defining dialogs (Windows only).- It is possible to define templates for dialog boxes,
with the command opendlg. Names for files can be written onto variables by using the
getfname command. See online help for syntax; the file "dialog.run" (included in the
self-extracting file zipdruns.exe) contains an example, which emulates the dialog box for
Analyze/TraditionalSearch. A more elaborate script, "bremer.run", allows calculating
bremer supports using various options.
Citations (back to index)
This section provides citations for the sources of some methods used in TNT. If you use
the method in a published paper, you should (besides citing TNT itself) cite the proper
source for the method.
Bremer, K. 1990. Combinable component consensus. Cladistics 6:369-372 (as one might
guess from the title, this paper describes the combinable component consensus).
Bremer, K. 1994. Branch support and tree stability. Cladistics 10:295-304 (a formal
description of the "bremer support").
Coddington, J., and N. Scharff. 1994. Problems with zero-length branches. Cladistics
10:415-423 (a discussion of the problems associated with tree-collapsing, and proposal of
"rule 4").
De Laet, J. 1997. A reconsideration of three-item analysis, the use of implied weights in
cladistics, and a practical application in Gentianaceae. Ph.D. Thesis, Katholieke
Universiteit Leuven, Faculteit der Wetenschappen, Leuven, Belgium, 214 pp. (this contains
discussion of implied weighting in general, and it discusses the idea of recoding additive
characters in binary form for implied weighting).
Farris, J., V. Albert, M. Kaellersjoe, D. Lipscomb, and A. Kluge. 1996. Parsimony
jackknifing outperforms neighbor joining. Cladistics 12:99-124 (this paper describes the
idea of resampling characters with independent probability of removal, to eliminate
influence of uninformative or irrelevant characters, as well as the idea of superficial
searches for consensus estimations).
Farris, 1970. Methods for computing Wagner trees. Syst. Zool. 19:83-92 (the paper
describing the first formal approach to Wagner trees and optimization for additive and
continuous characters).
Farris, J. 1990. Phenetics in camouflage. Cladistics 6:91-100 (a discussion of problems
associated with some methods to discretize continuous characters in phylogenetic
analysis).
Felsenstein, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap.
Evolution 39:783-791 (the original paper proposing to use the bootstrap on phylogenies).
Fitch, 1971. Toward defining the course of evolution: minimal change for a specific tree
topology. Syst. Zool. 20:406-416 (this paper describes optimization for non-additive
characters).
Goloboff, P. 1993. Estimating character weights during tree search. Cladistics 9:83-91
(description of the method of "implied weights").
Goloboff, P. 1994. Character optimization and calculation of tree lengths. Cladistics
9:433-436 (description of the basic algorithm for faster calculation of tree lengths
during searches; instead of making T-2 node comparisons for each tree of T taxa, the
algorithm requires making only one node comparison per tree).
Goloboff, P. 1996. Methods for faster parsimony analysis. Cladistics 12:199-220 (this
paper describes the "two pass incremental optimization" which is the basis for
calculations in TNT. It also describes the "qcollapse" algorithm for pre-collapsing of
trees during searches).
Goloboff, P. 1998. Tree searches under Sankoff parsimony. Cladistics 14:229-237 (this
paper describes calculation of tree lengths for Sankoff characters; the method used in TNT
is a combination of the exact method described in this paper, with the incremental
optimization described in Goloboff, 1996).
Goloboff, P. 1999. Analyzing large data sets in reasonable times: solutions for composite
optima. Cladistics 15:415-428 (description of tree-drifting, sectorial searches,
tree-fusing, and the idea of searching until the consensus stabilizes).
Goloboff, P. 2002. Optimization of polytomies: state set and parallel operations.
Molecular Phylogenetics and Evolution 22:269-275 (a description of the method used in TNT
to optimize polytomies for non-additive characters).
Goloboff, P. 2007. Calculating SPR-distances between trees. Cladistics (in press).
Goloboff, P., C. Mattoni, and S. Quinteros. 2005. Continuous characters analyzed as such.
Cladistics 22: 589-601. (a description of the method used to optimize continuous
characters on polytomous trees, and a description of the implementation of continous
characters in TNT; see also Farris, 1970).
Goloboff, P., and D. Pol. 2002. Semi-strict supertrees. Cladistics 18:514-525 (description
of the supertree method used in TNT, and some cautionary remarks about the use and
interpretation of supertrees).
Goloboff, P., and J. Farris. 2001. Methods for quick consensus estimation. Cladistics
17:S26-S34 (description of the methods to estimate consensuses, the relative bremer
support, and tree collapsing using branch swapping).
Goloboff, P., J. Farris, M. Kaellersjoe, B. Oxelmann, M. Ramirez, and C. Szumik. 2003.
Improvements to resampling measures of group support. Cladistics 19:324-332 (description
of symmetric resampling, frequency differences, and frequency slopes).
Maddison, W. 1989. Reconstructing character evolution on polytomous cladograms.
Cladistics 5: 365-377. (the paper describing the method used to calculate length of "soft"
polytomies).
Nixon, K. 1999. The parsimony ratchet, a new method for rapid parsimony analysis.
Cladistics 15:407-414 (the original description of the ratchet, already a citation
classic).
Hendy, M, and D. Penny. Branch and bound algorithms to determine minimal evolutionary
trees. Mathematical Biosciences 59:277-290 (description of the implicit enumeration
algorithm, used for solutions guaranteed to find all trees of best score).