我有GRCh38_latest_genomic.gtf，想要一个python脚本，输出多个exon的种类RNA的bed文件。具体步骤如下： 1.提取第三列为“transcript”的行，如果gene_id相同，transcript_id有多个，将长度最大（即第五列减去第四列最大）的一个transcript_id留下到下一步，如果只有一个transcript_id，则保留。 2.提取第三列为“exon”的行，按照上一步得到的transcript_id提取信息。 3.将exon行信息，转为bed文件。第一列转换按照 "NC_000001.11": "chr1", "NC_000002.12": "chr2", "NC_000003.12": "chr3", "NC_000004.12": "chr4", "NC_000005.10": "chr5", "NC_000006.12": "chr6", "NC_000007.14": "chr7", "NC_000008.11": "chr8", "NC_000009.12": "chr9", "NC_000010.11": "chr10", "NC_000011.10": "chr11", "NC_000012.12": "chr12", "NC_000013.11": "chr13", "NC_000014.9": "chr14", "NC_000015.10": "chr15", "NC_000016.10": "chr16", "NC_000017.11": "chr17", "NC_000018.10": "chr18", "NC_000019.10": "chr19", "NC_000020.11": "chr20", "NC_000021.9": "chr21", "NC_000022.11": "chr22", "NC_000023.11": "chrX", "NC_000024.10": "chrY" 转换为bed文件的第一列，如果gtf的第一列不为NC开头，则过滤掉。gtf文件第四列为bed文件第二列，gtf文件第五列为bed文件第三列，gtf文件第九列中“gene_id”后第一列为bed文件第四列，bed第五列赋值为1000，gtf文件第七列为bed文件第六列。

时间: 2023-07-17 12:03:35 浏览: 594

以下是根据您的要求编写的Python脚本，用于将GRCh38_latest_genomic.gtf文件转换为多个exon的种类RNA的bed文件： ```python import re # 定义染色体转换字典 chromosome_dict = { "NC_000001.11": "chr1", "NC_000002.12": "chr2", "NC_000003.12": "chr3", "NC_000004.12": "chr4", "NC_000005.10": "chr5", "NC_000006.12": "chr6", "NC_000007.14": "chr7", "NC_000008.11": "chr8", "NC_000009.12": "chr9", "NC_000010.11": "chr10", "NC_000011.10": "chr11", "NC_000012.12": "chr12", "NC_000013.11": "chr13", "NC_000014.9": "chr14", "NC_000015.10": "chr15", "NC_000016.10": "chr16", "NC_000017.11": "chr17", "NC_000018.10": "chr18", "NC_000019.10": "chr19", "NC_000020.11": "chr20", "NC_000021.9": "chr21", "NC_000022.11": "chr22", "NC_000023.11": "chrX", "NC_000024.10": "chrY" } # 读取gtf文件 with open('GRCh38_latest_genomic.gtf', 'r') as gtf_file: lines = gtf_file.readlines() transcripts = {} # 存储符合条件的transcript # 提取符合条件的transcript for line in lines: if line.startswith('#'): # 跳过注释行 continue columns = line.strip().split('\t') if columns[2] == 'transcript': attributes = re.findall(r'(\w+)\s+"([^"]+)"', columns[8]) gene_id = None transcript_id = None for attr in attributes: if attr[0] == 'gene_id': gene_id = attr[1] elif attr[0] == 'transcript_id': transcript_id = attr[1] if gene_id and transcript_id: if gene_id not in transcripts: transcripts[gene_id] = transcript_id else: # 比较长度并保留最大的transcript_id max_length = int(columns[4]) - int(columns[3]) curr_length = int(lines[transcripts[gene_id]].split('\t')[4]) - int(lines[transcripts[gene_id]].split('\t')[3]) if max_length > curr_length: transcripts[gene_id] = transcript_id # 提取exon行信息并转换为bed格式 bed_lines = [] # 存储bed文件的内容 for line in lines: if line.startswith('#'): # 跳过注释行 continue columns = line.strip().split('\t') if columns[2] == 'exon': attributes = re.findall(r'(\w+)\s+"([^"]+)"', columns[8]) transcript_id = None for attr in attributes: if attr[0] == 'transcript_id': transcript_id = attr[1] if transcript_id and transcript_id in transcripts.values(): chromosome = columns[0] start = columns[3] end = columns[4] gene_id = re.findall(r'gene_id\s+"([^"]+)"', columns[8])[0] strand = columns[6] # 检查染色体是否需要转换 if chromosome in chromosome_dict: chromosome = chromosome_dict[chromosome] else: continue bed_line = f"{chromosome}\t{start}\t{end}\t{gene_id}\t1000\t{strand}" bed_lines.append(bed_line) # 写入bed文件 with open('output.bed', 'w') as bed_file: bed_file.write('\n'.join(bed_lines)) print("转换完成，bed文件已生成。") ``` 请将您的`GRCh38_latest_genomic.gtf`文件与上述代码放在同一目录下，并执行该脚本，即可生成名为`output.bed`的bed文件。希望这能满足您的需求！如有其他问题，请随时提问。

阅读全文

相关推荐

Perl脚本实现GRCh38到GRCh37基因组映射指南

Python脚本汇总基因计数表的新方法

R语言包 SNPlocs.Hsapiens.dbSNP144.GRCh37 详细解析

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz

已知一个gtf文件：Homo_sapiens.GRCh37.75.gtf，将它转为bed文件，只需前六列，并打印出前三列。使用python

已知一个gtf文件：Homo_sapiens.GRCh37.75.gtf，将它转为bed文件，只需前六列，并打印出前三列。使用python。最后统计chr19中基因个数和每个基因中转录本的个数

用命令行下载v29的gnome.gtf文件

grch_git_sourcetree:这是练习 git 和 sourcetree 的示例存储库

EnsemblMapping：将基因组装配图GRCh38映射到GRCh37的脚本

人类参考基因组hg19和GRCh38测序盲区坐标文件

人类GRCh38第14修订版基因注释与食蟹猕猴的同源基因信息参考

解析人类基因组hg19与GRCh38测序盲区的坐标文件

GRCh38如何本地安装

根据GRCh38的，列出APOE基因的第18到20个外显子的ACGT排列顺序

深度学习工具SpliceAI：注释遗传变异的Python代码

大家在看

网络游戏中人工智能NPC.pdf

c语言编写的jpeg解码源代码

Noise-Pollution-Monitoring-Device

ggplot_Piper

海康最新视频控件_独立进程.rar

最新推荐

sblim-gather-provider-2.2.8-9.el7.x64-86.rpm.tar.gz

虚拟串口软件：实现IP信号到虚拟串口的转换

【Python进阶篇】：掌握这些高级特性，让你的编程能力飞跃提升

后端调用ragflow api

IE6下实现PNG图片背景透明的技术解决方案

【欧姆龙触摸屏故障诊断全攻略】

Educoder综合练习—C&C++选择结构

VBS简明教程：批处理之家论坛下载指南

【欧姆龙触摸屏：新手必读的10个操作技巧】

阿里云物联网平台不支持新购