Technical information related to phiGT1 protein families

Alignment by UCSC Sequence Alignment and Modeling System (SAM)

Program source: https://users.soe.ucsc.edu/~karplus/projects-compbio-html/sam2src/
Citations:

Hughey, R., Krogh, A., 1996. Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput. Appl. Biosci. 12, 95–107. DOI: 10.1093/bioinformatics/12.2.95.
Karplus, K., Barrett, C., Hughey, R., 1998. Hidden Markov models for detecting remote protein homologies. Bioinformatics 14, 846–856. DOI: 10.1093/bioinformatics/14.10.846.

The program system is still available, but hasn't been maintained and may be difficult to install.

General use:
target2k -seed {fasta file, or prior .a2m file} -homologs {fasta file of proposed homologs} -full_seq_align 1 -out {initial alignment name} >log 2>&1
target2k -seed {initial alignment.a2m} -tuneup -full_seq_align 1 -out {tuned alignment name} >log 2>&1
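
The two steps above can be scripted. The following is only a sketch in Python: it builds the same two command lines and redirects output to a log file. The function names are hypothetical, and it assumes the first step writes its alignment to {initial alignment name}.a2m, as the second command line suggests.

```python
import subprocess


def target2k_commands(seed, homologs, initial_name, tuned_name):
    """Return the argument lists for the build step and the tuneup step."""
    build = ["target2k", "-seed", seed, "-homologs", homologs,
             "-full_seq_align", "1", "-out", initial_name]
    # Assumption: the build step writes <initial_name>.a2m
    tuneup = ["target2k", "-seed", initial_name + ".a2m", "-tuneup",
              "-full_seq_align", "1", "-out", tuned_name]
    return build, tuneup


def run_logged(cmd, log_path):
    """Run cmd with stdout and stderr sent to log_path (like >log 2>&1)."""
    with open(log_path, "w") as log:
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode
```

Keeping command construction separate from execution makes it possible to inspect the arguments before anything is run.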

I generally retain alignments and extend them with sequences from homolog sets drawn from a variety of sources:

An updated psiblast set
Sequences from a proposed matching family from a protein family database
Sequences from a psiblast set of a proposed homolog selected by any other logic (synteny, analogous function, etc.)
A subset of the sequences already in the alignment labeled differently to facilitate postprocessing for presentation to the treemaking algorithm. The newly added sequences will be at the end of the .a2m file where they can be easily extracted.
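
Because newly added sequences land at the end of the .a2m file, they can be pulled out by position. This is a minimal Python sketch, assuming the .a2m file is FASTA-style (a ">" header line followed by sequence lines) and that the number of sequences in the earlier alignment is known. The function names are hypothetical.

```python
def read_a2m(text):
    """Parse FASTA-style a2m text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def tail_records(text, n_existing):
    """Return the records that follow the first n_existing records."""
    return read_a2m(text)[n_existing:]
```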

Characteristics of the output:

SAM iteratively constructs an HMM of the input alignment (or initially of an individual sequence) and screens the homolog set for statistical significance. Accepted sequences are added to the alignment, and the whole alignment is reoptimized for use in the next iteration. The alignment algorithm is not progressive, as it is in Clustal, but instead uses the Baum-Welch algorithm. This lets it improve its placement of gaps as more sequences are added, so that it more effectively sweeps gaps out of regions with conserved secondary structure.

Within the aligned sequences, segments not meeting a threshold for significant alignment across at least half the entries are demoted to "insert" status. These are marked in lower case in the .a2m file and are not passed on to the tree-making algorithm. However, they may be rescued by the addition of new sequences in a later step. There is typically a marked increase in aligned residues during the tuneup operation.
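
To see what is passed on once insert positions are dropped, the aligned columns can be extracted directly from a sequence line. This Python sketch assumes the usual a2m conventions: upper case for aligned residues, "-" for a deletion in an aligned column, lower case for inserts, and "." as padding in insert columns. The function names are hypothetical.

```python
def match_columns(seq):
    """Keep only aligned columns: upper-case residues and '-' deletions."""
    return "".join(c for c in seq if c.isupper() or c == "-")


def count_aligned(seq):
    """Count residues that sit in aligned (upper-case) positions."""
    return sum(1 for c in seq if c.isupper())
```

Comparing count_aligned for the same sequence before and after tuneup is one way to quantify the increase in aligned residues.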

Post processing:
After tuneup the product .a2m file has two copies of each sequence.
Do one of the following:

Use an editor to remove the second half of the file.
Run uniqueseq {processed alignment name} -alignfile {tuned alignment name} to remove all but one of any sequences identical in all aligned residues
Run uniqueseq {thinned alignment name} -alignfile {tuned alignment name} -percent_id {fraction} to thin to only one sequence per group defined by having "fraction" identity.
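
The first option (removing the second half of the file) can also be done without an editor. This is only a sketch in Python, assuming the file is FASTA-style and that the second copy of each sequence makes up the second half of the records, in the same order. The function names are hypothetical, and this does not reproduce what uniqueseq does.

```python
def split_records(text):
    """Split FASTA-style text into records, each a list of lines."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line])
        elif records:
            records[-1].append(line)
    return records


def drop_second_copy(text):
    """Return the text of the first half of the records only."""
    records = split_records(text)
    if len(records) % 2:
        raise ValueError("expected an even number of records")
    half = len(records) // 2
    return "\n".join("\n".join(r) for r in records[:half]) + "\n"
```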

Output conversion to nexus format

The SAM utility prettyalign converts the .a2m file to a clustal-like format. The only change needed to make it a legal .aln file is to the header.

prettyalign {.a2m file} -m0 -l100 >{prettyalign file} directs the deletion of the "insert" columns.

From there, a variety of alignment converters can be used to get to nexus format.

Creation and use of HMM models

In SAM, w0.5 {file.a2m} {file.mod} creates a SAM-formatted HMM.

hmmscore {result name} -i {file.mod} -db {multifasta file or another .a2m file} -sw 0 scores a sequence collection against the SAM HMM

To convert to a Hmmer3 HMM

hmmconvert (from the SAM package) {modelname.asc} -model_file {file.mod} converts the SAM binary file to an ascii format

S3H2convert.pl is:
#Program obtained from http://www.mrc-lmb.cam.ac.uk/genomes/julian/convert/convert.html
#Authors: Martin Madera, Julian Gough

S3H2convert.pl {SAM ascii model} produces a hmmer2 formatted file with extension .con.hmm

hmmconvert (from the hmmer package) {hmmer2 formatted HMM} >{hmmer3 formatted HMM}

The hmmer3 HMM can be used with hmmalign in the hmmer package, or with Clustal Omega, to align an arbitrary set of homologs consistently with the alignment from which the original HMM was made.

To convert to a HHpred-style HMM

addss (from the hhsuite package) {.a2m file} {.a3m file} -a3m adds secondary structure to the alignment with Psipred, using the .a2m alignment itself rather than doing a psiblast search.

hhmake -i {.a3m file} -o {hhpred model.hhm}

hhsearch -cal -i {hhpred model.hhm} -d {scope database from hhpred libraries} -o {hpred.cal.hhr} calibrates the model
hhsearch -i {one hhpred model} -d {another hhpred model} -o {model1x2.hhr} gives a detailed report of how well two models correspond.

Tree checking

When alignments are expanded to the limits of similarity detection, there is a risk that a cluster of nonhomologous sequences will be included, or that portions of the alignment are of insufficient accuracy to support quality tree production.

In these cases, I typically make an NJ tree for the full alignment and select a representative of each major clade for a MrBayes tree. Then, for the most distantly linked clades, I extract the sequences for each clade, make a new alignment, and do the HMM to HMM comparison to discover which parts of the sequence, if any, have good posterior residue alignment scores. In control experiments where known structural homology serves as the arbiter of "correct" alignment, segments with mostly 7, 8, or 9 scores tend to correspond to regions of structural homology.
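
Picking out segments with mostly 7, 8, or 9 scores can be automated once the per-column scores are in hand. This Python sketch assumes the scores have already been extracted into a string of digits 0-9, one per aligned column. The window size and the fraction that counts as "mostly" are illustrative assumptions, not values taken from the control experiments, and the function name is hypothetical.

```python
def confident_segments(scores, min_score=7, min_fraction=0.8, window=10):
    """Return half-open (start, end) column ranges covered by at least one
    window in which min_fraction of the scores are >= min_score."""
    digits = [int(c) for c in scores]  # assumes digits only
    n = len(digits)
    covered = [False] * n
    for start in range(0, n - window + 1):
        chunk = digits[start:start + window]
        good = sum(1 for d in chunk if d >= min_score)
        if good / window >= min_fraction:
            for i in range(start, start + window):
                covered[i] = True
    # Merge runs of covered columns into ranges.
    segments = []
    seg_start = None
    for i, flag in enumerate(covered):
        if flag and seg_start is None:
            seg_start = i
        elif not flag and seg_start is not None:
            segments.append((seg_start, i))
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, n))
    return segments
```

With a fraction below 1.0, a reported range can include a few low-scoring columns at its edges, which matches the "mostly" criterion rather than a strict one.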