构建DNA搜索引擎：借鉴谷歌技术

需积分: 3 58 浏览量更新于2024-09-25 收藏 347KB PDF 举报

"如何构建一个像谷歌一样的DNA搜索引擎？" 构建一个类似于谷歌的DNA搜索引擎是一项具有挑战性的任务，它涉及到生物信息学、计算机科学以及搜索引擎技术的交叉应用。本文由王亮提出，他来自腾讯SOSO，位于中国深圳，探讨了一种利用网络搜索引擎技术来建立大规模DNA序列搜索系统的新方法。首先，根据Zipf定律，研究发现12bp（碱基对）可能是DNA中的典型"单词"长度。Zipf定律是语言学中的一个理论，表明在自然语言中，最频繁出现的词通常是最短的。在DNA序列中，这个规律意味着12bp的DNA片段可能最常出现，成为识别和搜索的基础。接着，通过N-gram统计方法构建DNA的"词汇表"。N-gram是一种统计语言模型，用于分析连续的DNA序列片段。例如，一个2-gram（也称为bigram）会考虑相邻的两个碱基，3-gram（trigram）则会考虑三个相邻的碱基，以此类推。这种方法有助于识别DNA序列中的模式和重复。有了词汇表后，可以对DNA序列进行分词，就像搜索引擎对网页文本进行分词一样。然后，使用成熟的搜索引擎技术建立倒排索引，这使得能够快速定位到包含特定DNA序列的记录。倒排索引是一种高效的数据结构，它允许在大量数据中快速查找包含特定关键字的所有位置。此外，该研究还构建了基于此词汇表的DNA统计语言模型。这种模型有助于理解和预测DNA序列的模式，从而帮助科学家们更好地理解DNA的秘密。随着生物数据的指数级增长，传统的生物信息学方法面临着新的挑战。在后基因组时代，处理和分析海量的DNA数据变得至关重要。如果想要在当前所有的DNA数据库中搜索一个特定的DNA序列，传统方法可能会非常困难。然而，如果将搜索引擎技术应用于DNA，那么在数十亿份文档中搜索一个词序列的概念就可以转化为在同样数量级的DNA序列中寻找特定序列，极大地提高了搜索效率和准确性。构建一个DNA搜索引擎需要结合生物信息学的知识，如DNA序列分析和统计语言模型，以及计算机科学的搜索引擎技术和数据结构。这种方法不仅提供了高效的数据检索服务，也为揭示DNA的奥秘开辟了全新的途径。

How to build a DNA search engine like Google?

Wang Liang

Tencent, SOSO, Shenzhen, 430074, P.R. China

*To whom correspondence should be addressed. E-mail:wangliang.f@gmail.com

[Abstract] This paper presents a novel method to build the large scale DNA sequences search system

based on web search engine technology. Firstly, we find 12 bp may be the length of most DNA “words”

by Zipf’s laws. Then the “vocabulary” of DNA is constructed by N-grams statistical method. After

having a vocabulary, we could easily segment the DNA sequence, build the inverted index and provide

the search services by mature search engine technology. Such system could provide the ms level

search services in billions of DNA sequences. The DNA statistical language model is also built based

on this vocabulary. This research may pave a completely new avenue to discover the secret of DNA.

The exponentially increasing volumes of biological data pose new challenges for bioinformatics

in the post-genome era. Now if you want to search a DNA sequence in all of current DNA databases,

you will find it’s a very tough mission. But if you want to search a word sequence in billions of

documents in Internet, you only need several ms through Google. So could we build a DNA search

engine like Google? This paper just discusses this “simple” question.

Now most DNA search and comparing methods are designed based on BLAST/ FASTA

algorithm, which compare one sequence with all the other sequences on by one (1,2). So it’s very

difficult to improve the speed of these methods greatly, especially in this DNA information explosion

period. Many researchers agree that the biological need the completely new search algorithm.

We may find some inspirations from the history of text information retrieval. If we need search a

word in several documents, we could also scan each document and find the match words. But if there

are millions of documents, we need scan all the words of these documents. The search time will

become intolerable, especially in Internet period. So almost all current search systems evolved into the

inverted index based search systems.

The idea of inverted index is very simple (3). It uses the words as the field to index the

documents. For example, three documents:

T0= "it is what it is";

T1= "what is it";

T2 = "it is a banana";

Their inverted index:

word

Document ID

banana

T0, T1, T2

T0,T1,T2

下载后可阅读完整内容，剩余6页未读，立即下载

Augusdi

粉丝: 1w+
资源: 5737

构建DNA搜索引擎：借鉴谷歌技术

MIT How to Build a Beowulf Clusters

Quantitative Trading - How to Build Your Own Algorithmic Trading Business.

Enterprise Cybersecurity Study Guide How to Build a Successful Cyberdefense epub

藏经阁-How to Build a Successful Data Lake.pdf

ASP.NET Core 1.1 For Beginners: How to Build a MVC Website

Product-Led Growth: How to Build a Product That Sells Itself

Enterprise Cybersecurity Study Guide How to Build a Successful 无水印原版pdf

ASP.NET Core 1.1 Web API For Beginners How To Build a Web API

How to Build Blockchain App

How to Build Apktool from source

最新资源