"Essential Reading for Understanding Physical Files: A Guide to the SAM File Format Specification"


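The SAM format specification is a great help in understanding the physical file. SAM is a TAB-delimited text format made up of a header section and an alignment section. The header is optional and, when present, must come before the alignments. The header carries metadata about the alignment file, while the alignment section holds the aligned sequence segments and their positions on the reference genome.

The specification is maintained by the SAM/BAM Format Specification Working Group. The version excerpted here is 0dd3e0d, dated 24 May 2023, and is available at https://github.com/samtools/hts-specs.

Each header line begins with "@" followed by a two-letter record type code (@HD, @SQ, @RG, @PG, or @CO). Header lines are TAB-delimited and, apart from @CO comment lines, every data field has the form TAG:VALUE. These records describe the alignments: the reference sequences, the sequencing platform, the programs and versions used, and so on. The main purpose of the header is to give later alignment records their context so that they can be parsed and processed correctly.

The alignment section is the most important part of a SAM file. It describes how each sequenced segment aligns to the reference sequence. Each alignment line consists of TAB-separated fields, of which the first 11 are mandatory: QNAME (query name), FLAG (bitwise flags), RNAME (reference sequence name), POS (1-based leftmost mapping position), MAPQ (mapping quality), CIGAR (CIGAR string), RNEXT (reference name of the mate or next read), PNEXT (position of the mate or next read), TLEN (observed template length), SEQ (segment sequence), and QUAL (base qualities). Optional fields of the form TAG:TYPE:VALUE may follow. Together these fields describe where the segment lies on the reference, how reliable the alignment is, and how the bases match.

The specification also gives the precise definition and value range of every field, with explanations and examples. For example, MAPQ is -10 log10 of the probability that the mapping position is wrong, rounded to the nearest integer. It is stored as an integer from 0 to 255, and the value 255 means that the mapping quality is not available, not that confidence is highest. A MAPQ of 0 does not by itself mean that the read is unaligned; an unmapped read is indicated by the 0x4 bit in FLAG. The specification also defines special symbols and conventions, such as the CIGAR operation codes: M (alignment match, which may be a sequence match or a mismatch), I (insertion), D (deletion), N (skipped region), S (soft clipping), H (hard clipping), P (padding), = (sequence match), and X (sequence mismatch).

The specification matters both to users who need to understand the structure and meaning of alignment files and to developers of tools and algorithms built on SAM. Following it lets developers parse and process SAM files accurately and keeps tools interoperable. In short, it describes the structure, metadata, and alignment information of the file in detail, and it is the technical reference for anyone reading or writing the format.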
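The layout described above can be sketched in a few lines of Python. This is a minimal illustration, not a full parser: the function names, the "text" and "optional" dictionary keys, and the choice to return plain dictionaries are all assumptions made for this example. It splits header lines into TAG:VALUE pairs, splits an alignment line into the 11 mandatory fields, and converts a MAPQ value back into an error probability.

```python
from typing import Dict, Optional, Tuple

# The 11 mandatory alignment fields, in column order.
MANDATORY_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
                    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_header_line(line: str) -> Tuple[str, Dict[str, str]]:
    """Split a header line into its record type and its TAG:VALUE pairs."""
    parts = line.rstrip("\n").split("\t")
    record_type = parts[0][1:]          # drop the leading "@"
    if record_type == "CO":
        # @CO lines hold free text, not TAG:VALUE pairs.
        # The "text" key is a name chosen for this sketch.
        return record_type, {"text": "\t".join(parts[1:])}
    tags = {}
    for field in parts[1:]:
        # Split on the first colon only: values such as CL may contain colons.
        tag, _, value = field.partition(":")
        tags[tag] = value
    return record_type, tags


def parse_alignment_line(line: str) -> dict:
    """Split an alignment line into the mandatory fields plus optional ones."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(MANDATORY_FIELDS):
        raise ValueError("an alignment line needs at least 11 fields")
    record = dict(zip(MANDATORY_FIELDS, columns))
    for name in INTEGER_FIELDS:
        record[name] = int(record[name])
    # Optional TAG:TYPE:VALUE fields are kept as raw strings here.
    record["optional"] = columns[len(MANDATORY_FIELDS):]
    return record


def mapq_to_error_probability(mapq: int) -> Optional[float]:
    """Invert MAPQ = -10 log10(p); 255 means 'not available'."""
    if mapq == 255:
        return None
    return 10 ** (-mapq / 10)
```

For instance, parse_alignment_line applied to a line whose fifth column is 30 yields a record with MAPQ 30, which mapq_to_error_probability turns into an error probability of about 0.001.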

DT Date the run was produced (ISO8601 date or date/time).

FO Flow order. The array of nucleotide bases that correspond to the nucleotides used for each flow of each read. Multi-base flows are encoded in IUPAC format, and non-nucleotide flows by various other characters. Format: /\*|[ACMGRSVTWYHKDBN]+/

KS The array of nucleotide bases that correspond to the key sequence of each read.

LB Library.

PG Programs used for processing the read group.

PI Predicted median insert size, rounded to the nearest integer.

PL Platform/technology used to produce the reads. Valid values: CAPILLARY, DNBSEQ (MGI/BGI), ELEMENT, HELICOS, ILLUMINA, IONTORRENT, LS454, ONT (Oxford Nanopore), PACBIO (Pacific Biosciences), SINGULAR, SOLID, and ULTIMA. This field should be omitted when the technology is not in this list (though the PM field may still be present in this case) or is unknown.

PM Platform model. Free-form text providing further details of the platform/technology used.

PU Platform unit (e.g., flowcell-barcode.lane for Illumina or slide for SOLiD). Unique identifier.

SM Sample. Use pool name where a pool is being sequenced.
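As an illustration of the two constrained read-group fields above, the following sketch checks a PL value against the listed platforms and an FO value against the stated format. The function name check_read_group and the idea of returning a list of problem messages are assumptions made for this example; the comparison is case-sensitive because the values are listed in upper case, which may be stricter than some tools are in practice.

```python
import re

# Platform names exactly as listed for the PL field.
VALID_PLATFORMS = {
    "CAPILLARY", "DNBSEQ", "ELEMENT", "HELICOS", "ILLUMINA", "IONTORRENT",
    "LS454", "ONT", "PACBIO", "SINGULAR", "SOLID", "ULTIMA",
}

# The FO format given above: a lone "*" or a run of IUPAC base codes.
FLOW_ORDER_RE = re.compile(r"\*|[ACMGRSVTWYHKDBN]+")


def check_read_group(tags: dict) -> list:
    """Return a list of human-readable problems found in PL and FO."""
    problems = []
    platform = tags.get("PL")
    if platform is not None and platform not in VALID_PLATFORMS:
        problems.append("PL value %r is not a listed platform" % platform)
    flow_order = tags.get("FO")
    # fullmatch requires the whole value to match, not just a prefix.
    if flow_order is not None and FLOW_ORDER_RE.fullmatch(flow_order) is None:
        problems.append("FO value %r does not match the FO format" % flow_order)
    return problems
```

A read group with no PL or FO tag passes this check, since both tags are optional.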

@PG Program.

ID* Program record identifier. Each @PG line must have a unique ID. The value of ID is used in the alignment PG tag and PP tags of other @PG lines. PG IDs may be modified when merging SAM files in order to handle collisions.

PN Program name

CL Command line. UTF-8 encoding may be used.

PP Previous @PG-ID. Must match another @PG header’s ID tag. @PG records may be chained using PP tag, with the last record in the chain having no PP tag. This chain defines the order of programs that have been applied to the alignment. PP values may be modified when merging SAM files in order to handle collisions of PG IDs. The first PG record in a chain (i.e., the one referred to by the PG tag in a SAM record) describes the most recent program that operated on the SAM record. The next PG record in the chain describes the next most recent program that operated on the SAM record. The PG ID on a SAM record is not required to refer to the newest PG record in a chain. It may refer to any PG record in a chain, implying that the SAM record has been operated on by the program in that PG record, and the program(s) referred to via the PP tag.

DS Description. UTF-8 encoding may be used.

VN Program version
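The PP chaining rule can be made concrete with a small sketch. Assuming the @PG lines have already been parsed into dictionaries keyed by tag (the function name program_chain and the sample IDs used with it are hypothetical), the code follows PP links from the ID named in a record's PG tag and returns the programs from most recent to oldest.

```python
def program_chain(pg_records, start_id):
    """Follow PP links from start_id; return IDs, most recent program first."""
    by_id = {record["ID"]: record for record in pg_records}
    chain = []
    seen = set()
    current = start_id
    while current is not None:
        if current in seen:
            raise ValueError("PP links form a cycle at %r" % current)
        if current not in by_id:
            # PP must match the ID of another @PG line.
            raise KeyError("no @PG line with ID %r" % current)
        seen.add(current)
        chain.append(current)
        # The last record in a chain has no PP tag, which ends the loop.
        current = by_id[current].get("PP")
    return chain
```

Starting from an ID in the middle of a chain returns only that program and its predecessors, which mirrors the statement that a record's PG tag need not refer to the newest @PG record.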

@CO One-line text comment. Unordered multiple @CO lines are allowed. UTF-8 encoding may be used.

1.3.1 Defined sub-sort terms

While the SS sub-sort field allows implementation-defined keywords, some terms are predefined with specific meanings.

lexicographical sort order is defined as a character-based dictionary sort with the character order as defined by the POSIX C locale. For example “abc”, “abc17”, “abc5”, “abc59” and “abcd” are in lexicographical order.

natural sort order is similar to lexicographical order except that runs of adjacent digits are considered to be numbers embedded within the text string, ordered numerically when compared to each other and ordered as single digits when compared to the surrounding non-digit characters. Runs that differ only in the number of leading zeros (thus are numerically tied) are ordered by more-zeros coming before fewer-zeros. The characters ‘-’ and ‘.’ are considered as ordinary characters, so apparently negative or fractional values are not treated as part of an embedded number. For example, “abc”, “abc+5”, “abc-5”, “abc.d”, “abc03”, “abc5”, “abc008”, “abc08”, “abc8”, “abc17”, “abc17.+”, “abc17.2”, “abc17.d”, “abc59” and “abcd” are in natural order.

umi is a lexicographical sort by the UMI tag. The MI tag should be used for comparing UMIs. The RX tag may be used in its absence but is not guaranteed to be unique across multiple libraries.
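The difference between the lexicographical and natural orders can be sketched with a comparison function. This is one possible reading of the definition rather than a reference implementation: the name natural_compare is an assumption, the code assumes ASCII strings (so Python's code-point comparison agrees with the POSIX C locale), and it applies the leading-zero tie-break as soon as two digit runs are numerically equal, which the text above does not spell out.

```python
from functools import cmp_to_key


def _is_digit(ch):
    # ASCII digits only; str.isdigit() would also accept other Unicode digits.
    return "0" <= ch <= "9"


def _digit_run(text, start):
    """Return the run of digits beginning at start and the index after it."""
    end = start
    while end < len(text) and _is_digit(text[end]):
        end += 1
    return text[start:end], end


def natural_compare(a, b):
    """Return -1, 0 or 1 according to the natural order of a and b."""
    i = j = 0
    while i < len(a) and j < len(b):
        if _is_digit(a[i]) and _is_digit(b[j]):
            run_a, i = _digit_run(a, i)
            run_b, j = _digit_run(b, j)
            value_a, value_b = int(run_a), int(run_b)
            if value_a != value_b:
                return -1 if value_a < value_b else 1
            if len(run_a) != len(run_b):
                # Numerically tied: the run with more leading zeros sorts first.
                return -1 if len(run_a) > len(run_b) else 1
        else:
            # A digit facing a non-digit is compared as a single character.
            if a[i] != b[j]:
                return -1 if a[i] < b[j] else 1
            i += 1
            j += 1
    # One string is a prefix of the other: the shorter one sorts first.
    return (i < len(a)) - (j < len(b))


# The natural-order example from the text above.
NATURAL_EXAMPLE = ["abc", "abc+5", "abc-5", "abc.d", "abc03", "abc5",
                   "abc008", "abc08", "abc8", "abc17", "abc17.+",
                   "abc17.2", "abc17.d", "abc59", "abcd"]

natural_sorted = sorted(reversed(NATURAL_EXAMPLE),
                        key=cmp_to_key(natural_compare))
# Plain sorted() on ASCII strings gives the lexicographical order.
lexicographical_sorted = sorted(["abcd", "abc59", "abc5", "abc17", "abc"])
```

Sorting the reversed example list with natural_compare reproduces the order given in the text, while the plain sorted() call places “abc17” before “abc5”, as in the lexicographical example.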
