Web Mining: Key Techniques for Discovering Knowledge from Hypertext

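Mining the Web: Discovering Knowledge from Hypertext Data is a professional book published by Morgan Kaufmann in its Series in Data Management Systems, edited by Jim Gray, a series devoted to deepening the understanding of database and data-processing technology. The book's core subject is Web mining, an important branch of information technology that aims to extract valuable information and knowledge from the vast body of hypertext data on the Internet. The author, Soumen Chakrabarti, explains the techniques of Web mining in an accessible way, including how machine learning and data mining algorithms can be used to parse page structure, follow links, and analyze user behavior. Topics likely to be covered include crawling, the PageRank algorithm, pattern recognition, and association rule mining. The book may also compare and explain fields related to Web mining, such as search engine optimization (SEO) and search engine architecture, which underpin Web mining.

On the database side, readers can learn how advanced SQL (for example, the object-relational model and complex query features) helps manage and process the large volumes of data produced during mining. Other titles named alongside the book may complement it: Advanced SQL: Understanding Object-Relational and Other Advanced Features and Database Tuning: Principles, Experiments, and Troubleshooting Techniques may offer effective data storage and query strategies for Web mining; Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery may cover the concurrency and data-consistency challenges of processing Web data; Spatial Databases: With Application to GIS concerns the handling of location-related data; Information Modeling and Relational Databases: From Conceptual Analysis to Logical Design may provide the theoretical basis for turning complex business requirements into workable data models, which matters throughout the life cycle of a Web mining project; and Component Database Systems emphasizes the systematic, holistic nature of data management. The book may also discuss the use of information visualization in data mining to help users understand and present mining results more intuitively.

Overall, Mining the Web is a comprehensive, in-depth guide that covers the core techniques of Web mining together with related aspects of database management, search engine optimization, and information processing, making it a valuable reference for anyone who wants to explore knowledge in the era of Internet-scale data.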

PREFACE

This book is about finding significant statistical patterns relating hypertext documents, topics, hyperlinks, and queries and using these patterns to connect users to information they seek. The Web has become a vast storehouse of knowledge, built in a decentralized yet collaborative manner. It is a living, growing, populist, and participatory medium of expression with no central editorship. This has positive and negative implications. On the positive side, there is widespread participation in authoring content. Compared to print or broadcast media, the ratio of content creators to the audience is more equitable. On the negative side, the heterogeneity and lack of structure make it hard to frame queries and satisfy information needs. For many queries posed with the help of words and phrases, there are thousands of apparently relevant responses, but on closer inspection these turn out to be disappointing for all but the simplest queries. Queries involving nouns and noun phrases, where the information need is to find out about the named entity, are the simplest sort of information-hunting tasks. Only sophisticated users succeed with more complex queries, for instance, those that involve articles and prepositions to relate named objects, actions, and agents. If you are a regular seeker and user of Web information, this state of affairs needs no further description.

Detecting and exploiting statistical dependencies between terms, Web pages, and hyperlinks will be the central theme in this book. Such dependencies are also called patterns, and the act of searching for such patterns is called machine learning, or data mining. Here are some examples of machine learning for Web applications. Given a crawl of a substantial portion of the Web, we may be interested in constructing a topic directory like Yahoo!, perhaps detecting the emergence and decline of prominent topics with passing time. Once a topic directory is available, we may wish to assign freshly crawled pages and sites to suitable positions in the directory.
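
As a rough illustration of the last task, assigning a freshly crawled page to a position in an existing topic directory, here is a minimal sketch of a naive Bayes text classifier. The toy directory, the function names, and the choice of add-one smoothing with equal topic priors are assumptions made for this example; they are not taken from the book.

```python
import math
from collections import Counter

# Hypothetical toy directory: topic -> a few example page texts.
DIRECTORY = {
    "sports": ["football match score goal", "tennis match player score"],
    "cooking": ["recipe oven bake flour", "recipe sauce simmer garlic"],
}

def train(directory):
    """Count term frequencies per topic and collect the overall vocabulary."""
    counts = {
        topic: Counter(" ".join(pages).split())
        for topic, pages in directory.items()
    }
    vocab = set().union(*counts.values())
    return counts, vocab

def assign(text, counts, vocab):
    """Return the topic with the highest smoothed log-likelihood for the text.

    Topic priors are assumed equal, so only term likelihoods are compared.
    """
    best_topic, best_score = None, -math.inf
    for topic, term_counts in counts.items():
        total = sum(term_counts.values())
        # Add-one (Laplace) smoothing keeps unseen terms from zeroing the score.
        score = sum(
            math.log((term_counts[term] + 1) / (total + len(vocab)))
            for term in text.split()
        )
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

counts, vocab = train(DIRECTORY)
print(assign("garlic sauce recipe", counts, vocab))
```

The classifier only exploits the statistical dependence between terms and topics; a realistic system would also need tokenization, far more training pages, and proper topic priors.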

In this book, the data that we will “mine” will be very rich, comprising text, hypertext markup, hyperlinks, sites, and topic directories. This distinguishes the area of Web mining as a new and exciting field, although it also borrows liberally from traditional data analysis. As we shall see, useful information on the Web is accompanied by incredible levels of noise, but thankfully, the law of large numbers kicks in often enough that statistical analysis can make sense of the confusion. Our goal is to provide both the technical background and tools and tricks of the trade of Web content mining, which was developed roughly between 1995 and 2002, although it continues to advance. This book is addressed to those who are, or would like to become, researchers and innovative developers in this area.

Prerequisites and Contents

The contents of this book are targeted at fresh graduate students but are also quite suitable for senior undergraduates. The book is partly based on tutorials at SIGMOD 1999 and KDD 2000, a survey article in SIGKDD Explorations, invited lectures at ACL 1999 and ICDT 2001, and teaching a graduate elective at IIT Bombay in the spring of 2001. The general style is a mix of scientific and statistical programming with system engineering and optimizations. A background in elementary undergraduate statistics, algorithms, and networking should suffice to follow the material. The exposition also assumes that the reader is a regular user of search engines, topic directories, and Web content in general, and has some appreciation for the limitations of basic Web access based on clicking on links and typing keyword queries.

The chapters fall into three major parts. For concreteness, we start with some engineering issues: crawling, indexing, and keyword search. This part also gives us some basic know-how for efficiently representing, manipulating, and analyzing hypertext documents with computer programs. In the second part, which is the bulk of the book, we focus on machine learning for hypertext: the art of creating programs that seek out statistical relations between attributes extracted from Web documents. Such relations can be used to discover topic-based clusters from a collection of Web pages, assign a Web page to a predefined topic, or match a user’s interest to Web sites. The third part is a collection of applications that draw upon the techniques discussed in the first two parts.
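
To give a feel for the indexing and keyword search mentioned for the first part, the following is a minimal sketch of an inverted index with Boolean AND queries. The three-page corpus, the example.org URLs, and the function names are hypothetical stand-ins for crawled data, and whitespace tokenization is a simplifying assumption.

```python
from collections import defaultdict

# Hypothetical mini-corpus standing in for crawled pages: URL -> text.
PAGES = {
    "http://example.org/a": "web mining finds patterns in hypertext",
    "http://example.org/b": "keyword search over an inverted index",
    "http://example.org/c": "mining hyperlinks and hypertext patterns",
}

def build_index(pages):
    """Map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every query term (Boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # index.get avoids creating empty entries for unknown terms.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

index = build_index(PAGES)
print(sorted(search(index, "hypertext patterns")))
```

Intersecting posting sets is the simplest possible query model; ranking, phrase queries, and compressed posting lists are left out of this sketch.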

To make the presentation concrete, specific URLs are indicated throughout, but there is no saying how long they will remain accessible on the Web. Luckily, the Internet Archive will let you view old versions of pages at www.archive.org/, provided this URL does not get dated.

Omissions

The field of research underlying this book is in rapid flux. A book written at this juncture is guaranteed to miss out on important areas. At some point a snapshot must be taken to complete the project. A few omissions, however, are deliberate. Beyond bare necessities, I have not engaged in a study of protocols for representing and transferring content on the Internet and the Web. Readers are assumed to be reasonably familiar with HTML. For the purposes of this book, you do not need to understand the XML (Extensible Markup Language) standard much more deeply than HTML. There is also no treatment of Web application services, dynamic site management, or associated networking and data-processing technology.

I make no attempt to cover natural language (NL) processing, natural language understanding, or knowledge representation. This is largely because I do not know enough about natural language processing. NL techniques can now parse relatively well-formed sentences in many languages, disambiguate polysemous words with high accuracy, tag words in running text with part-of-speech information, represent NL documents in a canonical machine-usable form, and perform NL translation. Web search engines have been slow to embrace NL processing except as an explicit translation service. In this book, I will make occasional references to what has been called “ankle-deep semantics”: techniques that leverage semantic databases (e.g., as a dictionary or thesaurus) in shallow, efficient ways to improve keyword search.
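
One way to picture such shallow use of a thesaurus is query expansion, where each query term is allowed to match any of its listed synonyms. The sketch below is only an illustration under that assumption; the tiny hand-made thesaurus and the function names are hypothetical and do not come from the book.

```python
# Hypothetical hand-made thesaurus; a real system might draw on a lexical database.
THESAURUS = {
    "car": {"automobile", "auto"},
    "film": {"movie"},
}

def expand_query(query, thesaurus):
    """Turn each query term into a group holding the term and its synonyms."""
    return [
        {term} | thesaurus.get(term, set())
        for term in query.lower().split()
    ]

def matches(text, expanded):
    """A text matches if every term group shares at least one word with it."""
    words = set(text.lower().split())
    return all(group & words for group in expanded)

expanded = expand_query("car film", THESAURUS)
print(matches("a movie about an automobile race", expanded))
print(matches("a movie about a horse race", expanded))
```

The lookup involves no parsing or word-sense disambiguation, which is what keeps it cheap; it also means that an ambiguous term can pull in unrelated documents.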

Another missing area is Web usage mining. Optimizing large, high-flux Web sites to be visitor-friendly is nontrivial. Monitoring and analyzing the behavior of visitors in the past may lead to valuable insights into their information needs, and help in continually adapting the design of the site. Several companies have built systems integrated with Web servers, especially the kind that hosts e-commerce sites, to monitor and analyze traffic and propose site organization strategies. The array of techniques brought to bear on usage mining has a large overlap with traditional data mining in the relational data-warehousing scenario, for which excellent texts already exist.

Acknowledgments

I am grateful to many people for making this work possible. I was fortunate to associate with Byron Dom, Inderjit Dhillon, Dharmendra Modha, David Gibson, Dimitrios Gunopulos, Jon Kleinberg, Kevin McCurley, Nimrod Megiddo, and Prabhakar Raghavan at IBM Almaden Research Center, where some of the inventions described in this book were made between 1996 and 1999. I also acknowledge the extremely stimulating discussions I have had with researchers at the then Digital System Research Center in Palo Alto, California: Krishna
