SAM/BAM File Format Tags Explained

Updated 2024-08-05 · 338 KB PDF

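"bam文件标签含义.pdf" ("Meaning of BAM File Tags")

In bioinformatics, BAM (Binary Alignment/Map) is an efficient binary format for storing alignments of high-throughput sequencing reads. It is the compressed binary counterpart of the text-based SAM (Sequence Alignment/Map) format. The SAM specification defines how alignment records are organized and represented, including their optional fields, while BAM stores the same records more compactly. The document explains the predefined standard tags that can appear in these files and what each one means.

Optional fields are written as TAG:TYPE:VALUE, where TYPE is one of six types:

1. A (character): a single printable character.
2. B (array): a general array of numeric values.
3. f (float): a floating-point number.
4. H (hex array): a byte array written in hexadecimal.
5. i (integer): an integer.
6. Z (string): a printable string.

Examples of the predefined standard tags:

- AM (type i): the smallest template-independent mapping quality, that is, the lowest mapping quality among the segments of a template when the template structure is not taken into account.
- AS (type i): the alignment score produced by the aligner, reflecting how well the read matches the reference.

The document also covers other tags, such as:

- NM (type i): the number of differences (mismatches plus inserted and deleted bases) between the sequence and the reference.
- MD (type Z): a string encoding mismatched and deleted reference bases, used together with CIGAR and SEQ to recover the aligned reference bases without access to the reference itself.
- MC (type Z): the CIGAR string of the mate/next segment. A CIGAR (Compact Idiosyncratic Gapped Alignment Report) string describes how a read aligns to the reference in terms of matches, insertions, deletions and similar operations.

The document also describes the conventions for creating new tags, which keep custom extensions compatible and consistent when researchers or developers need to attach their own information to alignment records.

In short, "bam文件标签含义.pdf" is a reference for understanding and parsing the auxiliary information stored in SAM/BAM records. Knowing what these tags mean helps when analysing alignment data for genome analysis, variant calling and downstream biological work.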
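The TAG:TYPE:VALUE layout described above can be illustrated with a small parser. This is a minimal sketch, not the behaviour of any particular library: the function name `parse_optional_field` is hypothetical, the `XB` and `XH` tags used in the usage note are made-up user tags, and the sketch assumes the text (SAM) form of a field in which a B value starts with a one-letter subtype followed by comma-separated numbers.

```python
def parse_optional_field(field):
    """Split one SAM optional field 'TAG:TYPE:VALUE' into (tag, type, value).

    Hypothetical helper for illustration; the value is converted to a
    Python object according to the one-letter TYPE code.
    """
    # Split at most twice, because a Z value may itself contain ':'.
    tag, type_code, raw = field.split(":", 2)

    if type_code == "i":            # integer
        value = int(raw)
    elif type_code == "f":          # floating-point number
        value = float(raw)
    elif type_code == "A":          # single character
        if len(raw) != 1:
            raise ValueError(f"type A expects one character, got {raw!r}")
        value = raw
    elif type_code == "Z":          # string
        value = raw
    elif type_code == "H":          # byte array in hexadecimal
        value = bytes.fromhex(raw)
    elif type_code == "B":          # numeric array: subtype letter, then numbers
        subtype, *items = raw.split(",")
        convert = float if subtype == "f" else int
        value = [convert(item) for item in items]
    else:
        raise ValueError(f"unknown optional field type {type_code!r}")

    return tag, type_code, value
```

For instance, `parse_optional_field("NM:i:3")` returns `("NM", "i", 3)`, and `parse_optional_field("XB:B:c,1,-2,3")` returns the tag, the type `"B"` and the list `[1, -2, 3]`.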

way. SAM and CRAM files created with updated tools aware of the workaround are not expected to contain this tag. See also the footnote in Section 4.2 of the SAM spec for details.

CP:i:pos Leftmost coordinate of the next hit.

E2:Z:bases The 2nd most likely base calls. Same encoding and same length as SEQ. See also U2 for associated quality values.

FI:i:int The index of segment in the template.

FS:Z:str Segment suffix.

H0:i:count Number of perfect hits.

H1:i:count Number of 1-difference hits (see also NM).

H2:i:count Number of 2-difference hits.

HI:i:i Query hit index, indicating the alignment record is the i-th one stored in SAM.

IH:i:count Number of alignments stored in the file that contain the query in the current record.

MC:Z:cigar CIGAR string for mate/next segment.

MD:Z:[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*

String encoding mismatched and deleted reference bases, used in conjunction with the CIGAR and SEQ fields to reconstruct the bases of the reference sequence interval to which the alignment has been mapped. This can enable variant calling without requiring access to the entire original reference.

The MD string consists of the following items, concatenated without additional delimiter characters:

• [0-9]+, indicating a run of reference bases that are identical to the corresponding SEQ bases;

• [A-Z], identifying a single reference base that differs from the SEQ base aligned at that position;

• \^[A-Z]+, identifying a run of reference bases that have been deleted in the alignment.

As shown in the complete regular expression above, numbers alternate with the other items. Thus if two mismatches or deletions are adjacent without a run of identical bases between them, a ‘0’ (indicating a 0-length run) must be used to separate them in the MD string.
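The item structure and the alternation rule can be sketched as a small tokenizer. This is an illustrative sketch only: `tokenize_md` and the labels `"match"`, `"mismatch"` and `"deletion"` are names chosen here, not part of the specification. The full pattern is used to validate the string first, so a string that omits a required ‘0’ separator is rejected.

```python
import re

# The complete MD pattern quoted in the tag definition.
MD_PATTERN = re.compile(r"[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*")
# One item at a time: a number, a deletion (^BASES), or a single mismatched base.
MD_ITEM = re.compile(r"[0-9]+|\^[A-Z]+|[A-Z]")


def tokenize_md(md):
    """Split an MD string into (kind, value) items.

    Hypothetical helper: kind is "match" (value = run length),
    "mismatch" (value = reference base) or "deletion" (value = deleted
    reference bases).
    """
    if MD_PATTERN.fullmatch(md) is None:
        raise ValueError(f"not a valid MD string: {md!r}")

    items = []
    for token in MD_ITEM.findall(md):
        if token.isdigit():
            items.append(("match", int(token)))
        elif token.startswith("^"):
            items.append(("deletion", token[1:]))
        else:
            items.append(("mismatch", token))
    return items


# Two adjacent mismatches need a 0-length run between them.
print(tokenize_md("3A0C2"))
```

Because the validation uses the complete pattern, `tokenize_md("3AC2")` raises `ValueError`: the two mismatches are not separated by a number.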

Clipping, padding, reference skips, and insertions (‘H’, ‘S’, ‘P’, ‘N’, and ‘I’ CIGAR operations) are not represented in the MD string. When reconstructing the reference sequence, inserted and soft-clipped SEQ bases are omitted as determined by tracking ‘I’ and ‘S’ operations in the CIGAR string. (If the CIGAR string contains ‘N’ operations, then the corresponding skipped parts of the reference sequence cannot be reconstructed.)

For example, a string ‘10A5^AC6’ means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is different from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches.
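The reconstruction described above can be sketched in two passes: first use the CIGAR string to drop inserted and soft-clipped SEQ bases, then walk the MD items over the remaining bases. This is a minimal sketch under stated assumptions: `reconstruct_reference` is a hypothetical name, SEQ is assumed to contain literal bases (no ‘=’ placeholders), and alignments with ‘N’ operations are simply rejected rather than partially reconstructed.

```python
import re

CIGAR_OP = re.compile(r"([0-9]+)([MIDNSHP=X])")
MD_ITEM = re.compile(r"[0-9]+|\^[A-Z]+|[A-Z]")


def reconstruct_reference(seq, cigar, md):
    """Rebuild the aligned reference interval from SEQ, CIGAR and MD.

    Hypothetical helper for illustration only.
    """
    # Pass 1: keep only the SEQ bases that are aligned to a reference base.
    aligned = []
    pos = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "N":
            raise ValueError("skipped reference (N) cannot be reconstructed")
        if op in "M=X":            # aligned bases: keep them
            aligned.extend(seq[pos:pos + n])
            pos += n
        elif op in "IS":           # inserted or soft-clipped: omit from the reference
            pos += n
        # D consumes reference only (its bases come from MD);
        # H and P consume neither SEQ nor reference.

    # Pass 2: walk the MD items over the aligned bases.
    ref = []
    i = 0
    for token in MD_ITEM.findall(md):
        if token.isdigit():        # run of bases identical to SEQ
            n = int(token)
            ref.extend(aligned[i:i + n])
            i += n
        elif token.startswith("^"):  # deleted reference bases, absent from SEQ
            ref.extend(token[1:])
        else:                      # mismatch: MD carries the reference base
            ref.append(token)
            i += 1

    if i != len(aligned):
        raise ValueError("MD string is inconsistent with CIGAR and SEQ")
    return "".join(ref)


# A made-up read matching the '10A5^AC6' example (CIGAR 16M2D6M).
read = "ACGTACGTAC" + "G" + "GGGTT" + "TTTTTT"
print(reconstruct_reference(read, "16M2D6M", "10A5^AC6"))
```

Stripping insertions before walking the MD string keeps the second pass simple, because a single MD match run may span an insertion in the CIGAR string.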

MQ:i:score Mapping quality of the mate/next segment.

NH:i:count Number of reported alignments that contain the query in the current record.

NM:i:count Number of differences (mismatches plus inserted and deleted bases) between the sequence and reference, counting only (case-insensitive) A, C, G and T bases in sequence and reference as potential matches, with everything else being a mismatch. Note this means that ambiguity codes in both sequence and reference that match each other, such as ‘N’ in both, or compatible codes such as ‘A’ and ‘R’, are still counted as mismatches. The special sequence base ‘=’ will always be considered to be a match, even if the reference is ambiguous at that point. Alignment reference skips, padding, soft and
