way. SAM and CRAM files created with updated tools aware of the workaround are not expected to
contain this tag. See also the footnote in Section 4.2 of the SAM spec for details.
CP:i:pos Leftmost coordinate of the next hit.
E2:Z:bases The 2nd most likely base calls. Same encoding and same length as SEQ. See also U2 for
associated quality values.
FI:i:int The index of the segment in the template.
FS:Z:str Segment suffix.
H0:i:count Number of perfect hits.
H1:i:count Number of 1-difference hits (see also NM).
H2:i:count Number of 2-difference hits.
HI:i:i Query hit index, indicating the alignment record is the i-th one stored in SAM.
IH:i:count Number of alignments stored in the file that contain the query in the current record.
MC:Z:cigar CIGAR string for mate/next segment.
MD:Z:[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*
String encoding mismatched and deleted reference bases, used in conjunction with the CIGAR and
SEQ fields to reconstruct the bases of the reference sequence interval to which the alignment has been
mapped. This can enable variant calling without requiring access to the entire original reference.
The MD string consists of the following items, concatenated without additional delimiter characters:
• [0-9]+, indicating a run of reference bases that are identical to the corresponding SEQ bases;
• [A-Z], identifying a single reference base that differs from the SEQ base aligned at that position;
• \^[A-Z]+, identifying a run of reference bases that have been deleted in the alignment.
As shown in the complete regular expression above, numbers alternate with the other items. Thus if two
mismatches or deletions are adjacent without a run of identical bases between them, a ‘0’ (indicating
a 0-length run) must be used to separate them in the MD string.
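The item grammar above can be illustrated with a small tokenizer. This is a minimal sketch, not part of the specification: the function name parse_md and the item labels 'match', 'mismatch' and 'deletion' are assumptions introduced here for illustration.

```python
import re

# Whole-string pattern copied from the MD definition above.
MD_PATTERN = re.compile(r"[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*")
# One item at a time: a run length, a deletion, or a single mismatched base.
MD_ITEM = re.compile(r"([0-9]+)|\^([A-Z]+)|([A-Z])")


def parse_md(md):
    """Split an MD string into (kind, value) items (hypothetical helper)."""
    if MD_PATTERN.fullmatch(md) is None:
        raise ValueError(f"malformed MD string: {md!r}")
    items = []
    for m in MD_ITEM.finditer(md):
        number, deleted, mismatch = m.groups()
        if number is not None:
            items.append(("match", int(number)))    # run of identical bases
        elif deleted is not None:
            items.append(("deletion", deleted))     # reference bases after '^'
        else:
            items.append(("mismatch", mismatch))    # single reference base
    return items
```

With this sketch, two adjacent mismatches such as '3A0C2' yield a ('match', 0) item between them, while '3AC2' is rejected because the separating '0' is missing.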
Clipping, padding, reference skips, and insertions (‘H’, ‘S’, ‘P’, ‘N’, and ‘I’ CIGAR operations) are not
represented in the MD string. When reconstructing the reference sequence, inserted and soft-clipped
SEQ bases are omitted as determined by tracking ‘I’ and ‘S’ operations in the CIGAR string. (If the
CIGAR string contains ‘N’ operations, then the corresponding skipped parts of the reference sequence
cannot be reconstructed.)
For example, the string ‘10A5^AC6’ means that, starting from the leftmost reference base in the
alignment, there are 10 matches followed by an A on the reference that differs from the aligned read
base; the next 5 reference bases are matches, followed by a 2 bp deletion from the reference whose
deleted sequence is AC; the last 6 bases are matches.
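The reconstruction described above might be sketched as follows. This is an illustrative sketch under stated assumptions rather than a reference implementation: the function name reconstruct_reference, the read sequence and the CIGAR string ‘16M2D6M’ are hypothetical, and SEQ is assumed to hold real bases (not ‘*’ or ‘=’).

```python
import re

CIGAR_OP = re.compile(r"([0-9]+)([MIDNSHP=X])")
MD_ITEM = re.compile(r"([0-9]+)|\^([A-Z]+)|([A-Z])")


def reconstruct_reference(seq, cigar, md):
    """Rebuild the reference interval from SEQ, CIGAR and MD (hypothetical helper)."""
    # Keep only SEQ bases aligned to the reference (M, =, X); skip I and S.
    aligned = []
    pos = 0
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            raise ValueError("skipped reference (N) cannot be reconstructed")
        if op in "M=X":
            aligned.append(seq[pos:pos + length])
            pos += length
        elif op in "IS":
            pos += length
        # D, H and P consume no SEQ bases.
    aligned = "".join(aligned)

    # Walk the MD items in reference order.
    ref = []
    i = 0
    for number, deleted, mismatch in MD_ITEM.findall(md):
        if number:
            n = int(number)
            ref.append(aligned[i:i + n])   # bases identical to SEQ
            i += n
        elif deleted:
            ref.append(deleted)            # deleted reference bases
        else:
            ref.append(mismatch)           # reference base replaces read base
            i += 1
    if i != len(aligned):
        raise ValueError("MD does not cover the aligned SEQ bases")
    return "".join(ref)


# The MD string from the example above, with a made-up 22-base read.
seq = "ACGTACGTAC" + "G" + "TTTTT" + "GGGGGG"
print(reconstruct_reference(seq, "16M2D6M", "10A5^AC6"))
# ACGTACGTACATTTTTACGGGGGG
```

The sketch uses the CIGAR string only to decide which SEQ bases are aligned; the positions of deletions and mismatches come from the MD string alone.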
MQ:i:score Mapping quality of the mate/next segment.
NH:i:count Number of reported alignments that contain the query in the current record.
NM:i:count Number of differences (mismatches plus inserted and deleted bases) between the sequence and
reference, counting only (case-insensitive) A, C, G and T bases in sequence and reference as potential
matches, with everything else being a mismatch. Note this means that ambiguity codes in both
sequence and reference that match each other, such as ‘N’ in both, or compatible codes such as ‘A’ and
‘R’, are still counted as mismatches. The special sequence base ‘=’ will always be considered to be a
match, even if the reference is ambiguous at that point. Alignment reference skips, padding, soft and