我有GRCh38_latest_genomic.gtf,想要一个python脚本,输出多个exon的种类RNA的bed文件。具体步骤如下: 1.提取第三列为“transcript”的行,如果gene_id相同,transcript_id有多个,将长度最大(即第五列减去第四列最大)的一个transcript_id留下到下一步,如果只有一个transcript_id,则保留。 2.提取第三列为“exon”的行,按照上一步得到的transcript_id提取信息。 3.将exon行信息,转为bed文件。第一列转换按照 "NC_000001.11": "chr1", "NC_000002.12": "chr2", "NC_000003.12": "chr3", "NC_000004.12": "chr4", "NC_000005.10": "chr5", "NC_000006.12": "chr6", "NC_000007.14": "chr7", "NC_000008.11": "chr8", "NC_000009.12": "chr9", "NC_000010.11": "chr10", "NC_000011.10": "chr11", "NC_000012.12": "chr12", "NC_000013.11": "chr13", "NC_000014.9": "chr14", "NC_000015.10": "chr15", "NC_000016.10": "chr16", "NC_000017.11": "chr17", "NC_000018.10": "chr18", "NC_000019.10": "chr19", "NC_000020.11": "chr20", "NC_000021.9": "chr21", "NC_000022.11": "chr22", "NC_000023.11": "chrX", "NC_000024.10": "chrY" 转换为bed文件的第一列,如果gtf的第一列不为NC开头,则过滤掉。gtf文件第四列为bed文件第二列,gtf文件第五列为bed文件第三列,gtf文件第九列中“gene_id”后第一列为bed文件第四列,bed第五列赋值为1000,gtf文件第七列为bed文件第六列。
时间: 2023-07-17 19:03:35 浏览: 490
以下是根据您的要求编写的Python脚本,用于将GRCh38_latest_genomic.gtf文件转换为多个exon的种类RNA的bed文件:
```python
import re
# 定义染色体转换字典
chromosome_dict = {
"NC_000001.11": "chr1",
"NC_000002.12": "chr2",
"NC_000003.12": "chr3",
"NC_000004.12": "chr4",
"NC_000005.10": "chr5",
"NC_000006.12": "chr6",
"NC_000007.14": "chr7",
"NC_000008.11": "chr8",
"NC_000009.12": "chr9",
"NC_000010.11": "chr10",
"NC_000011.10": "chr11",
"NC_000012.12": "chr12",
"NC_000013.11": "chr13",
"NC_000014.9": "chr14",
"NC_000015.10": "chr15",
"NC_000016.10": "chr16",
"NC_000017.11": "chr17",
"NC_000018.10": "chr18",
"NC_000019.10": "chr19",
"NC_000020.11": "chr20",
"NC_000021.9": "chr21",
"NC_000022.11": "chr22",
"NC_000023.11": "chrX",
"NC_000024.10": "chrY"
}
# 读取gtf文件
with open('GRCh38_latest_genomic.gtf', 'r') as gtf_file:
lines = gtf_file.readlines()
transcripts = {} # 存储符合条件的transcript
# 提取符合条件的transcript
for line in lines:
if line.startswith('#'): # 跳过注释行
continue
columns = line.strip().split('\t')
if columns[2] == 'transcript':
attributes = re.findall(r'(\w+)\s+"([^"]+)"', columns[8])
gene_id = None
transcript_id = None
for attr in attributes:
if attr[0] == 'gene_id':
gene_id = attr[1]
elif attr[0] == 'transcript_id':
transcript_id = attr[1]
if gene_id and transcript_id:
if gene_id not in transcripts:
transcripts[gene_id] = transcript_id
else:
# 比较长度并保留最大的transcript_id
max_length = int(columns[4]) - int(columns[3])
curr_length = int(lines[transcripts[gene_id]].split('\t')[4]) - int(lines[transcripts[gene_id]].split('\t')[3])
if max_length > curr_length:
transcripts[gene_id] = transcript_id
# 提取exon行信息并转换为bed格式
bed_lines = [] # 存储bed文件的内容
for line in lines:
if line.startswith('#'): # 跳过注释行
continue
columns = line.strip().split('\t')
if columns[2] == 'exon':
attributes = re.findall(r'(\w+)\s+"([^"]+)"', columns[8])
transcript_id = None
for attr in attributes:
if attr[0] == 'transcript_id':
transcript_id = attr[1]
if transcript_id and transcript_id in transcripts.values():
chromosome = columns[0]
start = columns[3]
end = columns[4]
gene_id = re.findall(r'gene_id\s+"([^"]+)"', columns[8])[0]
strand = columns[6]
# 检查染色体是否需要转换
if chromosome in chromosome_dict:
chromosome = chromosome_dict[chromosome]
else:
continue
bed_line = f"{chromosome}\t{start}\t{end}\t{gene_id}\t1000\t{strand}"
bed_lines.append(bed_line)
# 写入bed文件
with open('output.bed', 'w') as bed_file:
bed_file.write('\n'.join(bed_lines))
print("转换完成,bed文件已生成。")
```
请将您的`GRCh38_latest_genomic.gtf`文件与上述代码放在同一目录下,并执行该脚本,即可生成名为`output.bed`的bed文件。希望这能满足您的需求!如有其他问题,请随时提问。
阅读全文