Machine Learning for Text: Points-Free Download Guide

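Machine Learning for Text is a reference work on machine learning for text processing, written by Charu C. Aggarwal of the IBM T. J. Watson Research Center in Yorktown Heights, New York, USA, and published in 2018. It examines text mining, natural language processing (NLP), and deep learning for text analysis, with the aim of helping readers understand these techniques and apply them to practical problems. The book carries the ISBNs 978-3-319-73530-6 and 978-3-319-73531-3, and the electronic edition is available through the DOI link https://doi.org/10.1007/978-3-319-73531-3. According to its copyright notice, the work is protected by Springer International Publishing AG and Springer Nature, with all rights reserved, including translation, reprinting, reproduction, recitation, broadcasting, reproduction on microfilm, electronic adaptation, computer software, and use by similar or dissimilar methodology now known or hereafter developed.

The book covers core topics such as feature engineering, text classification, sentiment analysis, topic modeling, document summarization, and named entity recognition. It shows how machine learning algorithms such as naive Bayes, support vector machines, and neural networks can be applied to text data to extract useful information. It also introduces techniques and tools that were recent at the time of publication, such as deep learning models (for example, convolutional and recurrent neural networks) applied to text processing.

For practitioners and researchers who want a deeper understanding of machine learning for text analysis, Machine Learning for Text is an important reference that combines theory with practical guidance, helping readers turn that theory into working text-processing solutions. Charu C. Aggarwal's experience and expertise make the book a valuable resource in machine learning and artificial intelligence. Readers interested in the field can find related resources and updates on the poster's Jianshu page (<https://www.jianshu.com/u/3a2d89402aca>).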

xviii CONTENTS

9.2.5 Efficiency Optimizations ........................ 276

9.2.5.1 Skip Pointers ......................... 276

9.2.5.2 Champion Lists and Tiered Indexes ............ 277

9.2.5.3 Caching Tricks ........................ 277

9.2.5.4 Compression Tricks ...................... 278

9.3 Scoring with Information Retrieval Models .................. 280

9.3.1 Vector Space Models with tf-idf .................... 280

9.3.2 The Binary Independence Model ................... 281

9.3.3 The BM25 Model with Term Frequencies ............... 283

9.3.4 Statistical Language Models in Information Retrieval ........ 285

9.3.4.1 Query Likelihood Models .................. 285

9.4 Web Crawling and Resource Discovery .................... 287

9.4.1 A Basic Crawler Algorithm ...................... 287

9.4.2 Preferential Crawlers .......................... 289

9.4.3 Multiple Threads ............................ 290

9.4.4 Combatting Spider Traps ........................ 290

9.4.5 Shingling for Near Duplicate Detection ................ 291

9.5 Query Processing in Search Engines ...................... 291

9.5.1 Distributed Index Construction .................... 292

9.5.2 Dynamic Index Updates ........................ 293

9.5.3 Query Processing ............................ 293

9.5.4 The Importance of Reputation ..................... 294

9.6 Link-Based Ranking Algorithms ........................ 295

9.6.1 PageRank ................................ 295

9.6.1.1 Topic-Sensitive PageRank .................. 298

9.6.1.2 SimRank ........................... 299

9.6.2 HITS ................................... 300

9.7 Summary ..................................... 302

9.8 Bibliographic Notes ............................... 302

9.8.1 Software Resources ........................... 303

9.9 Exercises ..................................... 304

10 Text Sequence Modeling and Deep Learning 305

10.1 Introduction ................................... 305

10.1.1 Chapter Organization .......................... 308

10.2 Statistical Language Models .......................... 308

10.2.1 Skip-Gram Models ........................... 310

10.2.2 Relationship with Embeddings ..................... 312

10.3 Kernel Methods ................................. 313

10.4 Word-Context Matrix Factorization Models ................. 314

10.4.1 Matrix Factorization with Counts ................... 314

10.4.1.1 Postprocessing Issues ..................... 316

10.4.2 The GloVe Embedding ......................... 316

10.4.3 PPMI Matrix Factorization ...................... 317

10.4.4 Shifted PPMI Matrix Factorization .................. 318

10.4.5 Incorporating Syntactic and Other Features ............. 318

10.5 Graphical Representations of Word Distances ................ 318

10.6 Neural Language Models ............................ 320

10.6.1 Neural Networks: A Gentle Introduction ............... 320

10.6.1.1 Single Computational Layer: The Perceptron ....... 321

10.6.1.2 Relationship to Support Vector Machines ......... 323

10.6.1.3 Choice of Activation Function ................ 324

10.6.1.4 Choice of Output Nodes ................... 325

10.6.1.5 Choice of Loss Function ................... 325

10.6.1.6 Multilayer Neural Networks ................. 326

10.6.2 Neural Embedding with Word2vec .................. 331

10.6.2.1 Neural Embedding with Continuous Bag of Words .... 331

10.6.2.2 Neural Embedding with Skip-Gram Model ......... 334

10.6.2.3 Practical Issues ........................ 336

10.6.2.4 Skip-Gram with Negative Sampling ............. 337

10.6.2.5 What Is the Actual Neural Architecture of SGNS? .... 338

10.6.3 Word2vec (SGNS) Is Logistic Matrix Factorization ......... 338

10.6.3.1 Gradient Descent ....................... 340

10.6.4 Beyond Words: Embedding Paragraphs with Doc2vec ........ 341

10.7 Recurrent Neural Networks ........................... 342

10.7.1 Practical Issues ............................. 345

10.7.2 Language Modeling Example of RNN ................. 345

10.7.2.1 Generating a Language Sample ............... 345

10.7.3 Application to Automatic Image Captioning ............. 347

10.7.4 Sequence-to-Sequence Learning and Machine Translation ...... 348

10.7.4.1 Question-Answering Systems ................ 350

10.7.5 Application to Sentence-Level Classification ............. 352

10.7.6 Token-Level Classification with Linguistic Features ......... 353

10.7.7 Multilayer Recurrent Networks .................... 354

10.7.7.1 Long Short-Term Memory (LSTM) ............. 355

10.8 Summary ..................................... 357

10.9 Bibliographic Notes ............................... 357

10.9.1 Software Resources ........................... 358

10.10 Exercises ..................................... 359

11 Text Summarization 361

11.1 Introduction ................................... 361

11.1.1 Extractive and Abstractive Summarization .............. 362

11.1.2 Key Steps in Extractive Summarization ............... 363

11.1.3 The Segmentation Phase in Extractive Summarization ....... 363

11.1.4 Chapter Organization .......................... 363

11.2 Topic Word Methods for Extractive Summarization ............. 364

11.2.1 Word Probabilities ........................... 364

11.2.2 Normalized Frequency Weights .................... 365

11.2.3 Topic Signatures ............................ 366

11.2.4 Sentence Selection Methods ...................... 368

11.3 Latent Methods for Extractive Summarization ................ 369

11.3.1 Latent Semantic Analysis ....................... 369

11.3.2 Lexical Chains .............................. 370

11.3.2.1 Short Description of WordNet ................ 370

11.3.2.2 Leveraging WordNet for Lexical Chains .......... 371

11.3.3 Graph-Based Methods ......................... 372

11.3.4 Centroid Summarization ........................ 373

11.4 Machine Learning for Extractive Summarization ............... 374

11.4.1 Feature Extraction ........................... 374

11.4.2 Which Classifiers to Use? ........................ 375

11.5 Multi-Document Summarization ........................ 375

11.5.1 Centroid-Based Summarization .................... 375

11.5.2 Graph-Based Methods ......................... 376

11.6 Abstractive Summarization ........................... 377

11.6.1 Sentence Compression ......................... 378

11.6.2 Information Fusion ........................... 378

11.6.3 Information Ordering .......................... 379

11.7 Summary ..................................... 379

11.8 Bibliographic Notes ............................... 379

11.8.1 Software Resources ........................... 380

11.9 Exercises ..................................... 380

12 Information Extraction 381

12.1 Introduction ................................... 381

12.1.1 Historical Evolution ........................... 383

12.1.2 The Role of Natural Language Processing .............. 384

12.1.3 Chapter Organization .......................... 385

12.2 Named Entity Recognition ........................... 386

12.2.1 Rule-Based Methods .......................... 387

12.2.1.1 Training Algorithms for Rule-Based Systems ....... 388

12.2.1.2 Top-Down Rule Generation ................. 389

12.2.1.3 Bottom-Up Rule Generation ................. 390

12.2.2 Transformation to Token-Level Classification ............. 391

12.2.3 Hidden Markov Models ......................... 391

12.2.3.1 Visible Versus Hidden Markov Models ........... 392

12.2.3.2 The Nymble System ..................... 392

12.2.3.3 Training ............................ 394

12.2.3.4 Prediction for Test Segment ................. 394

12.2.3.5 Incorporating Extracted Features .............. 395

12.2.3.6 Variations and Enhancements ................ 395

12.2.4 Maximum Entropy Markov Models .................. 396

12.2.5 Conditional Random Fields ...................... 397

12.3 Relationship Extraction ............................. 399

12.3.1 Transformation to Classification .................... 400

12.3.2 Relationship Prediction with Explicit Feature Engineering ..... 401

12.3.2.1 Feature Extraction from Sentence Sequences ........ 402

12.3.2.2 Simplifying Parse Trees with Dependency Graphs ..... 403

12.3.3 Relationship Prediction with Implicit Feature Engineering: Kernel Methods ............................. 404

12.3.3.1 Kernels from Dependency Graphs .............. 405

12.3.3.2 Subsequence-Based Kernels ................. 405

12.3.3.3 Convolution Tree-Based Kernels .............. 406

12.4 Summary ..................................... 408

12.5 Bibliographic Notes ............................... 409

12.5.1 Weakly Supervised Learning Methods ................. 410

12.5.2 Unsupervised and Open Information Extraction ........... 410

12.5.3 Software Resources ........................... 410

12.6 Exercises ..................................... 411

13 Opinion Mining and Sentiment Analysis 413

13.1 Introduction ................................... 413

13.1.1 The Opinion Lexicon .......................... 415

13.1.1.1 Dictionary-Based Approaches ................ 416

13.1.1.2 Corpus-Based Approaches .................. 416

13.1.2 Opinion Mining as a Slot Filling and Information Extraction Task . 417

13.1.3 Chapter Organization .......................... 418

13.2 Document-Level Sentiment Classification ................... 418

13.2.1 Unsupervised Approaches to Classification .............. 420

13.3 Phrase- and Sentence-Level Sentiment Classification ............. 421

13.3.1 Applications of Sentence- and Phrase-Level Analysis ........ 422

13.3.2 Reduction of Subjectivity Classification to Minimum Cut Problem 423

13.3.3 Context in Sentence- and Phrase-Level Polarity Analysis ...... 423

13.4 Aspect-Based Opinion Mining as Information Extraction .......... 424

13.4.1 Hu and Liu’s Unsupervised Approach ................. 424

13.4.2 OPINE: An Unsupervised Approach ................. 426

13.4.3 Supervised Opinion Extraction as Token-Level Classification .... 427

13.5 Opinion Spam .................................. 428

13.5.1 Supervised Methods for Spam Detection ............... 428

13.5.1.1 Labeling Deceptive Spam .................. 429

13.5.1.2 Feature Extraction ...................... 430

13.5.2 Unsupervised Methods for Spammer Detection ........... 431

13.6 Opinion Summarization ............................. 431

13.6.1 Rating Summary ............................ 432

13.6.2 Sentiment Summary .......................... 432

13.6.3 Sentiment Summary with Phrases and Sentences .......... 432

13.6.4 Extractive and Abstractive Summaries ................ 432

13.7 Summary ..................................... 433

13.8 Bibliographic Notes ............................... 433

13.8.1 Software Resources ........................... 434

13.9 Exercises ..................................... 434

14 Text Segmentation and Event Detection 435

14.1 Introduction ................................... 435

14.1.1 Relationship with Topic Detection and Tracking ........... 436

14.1.2 Chapter Organization .......................... 436

14.2 Text Segmentation ............................... 436

14.2.1 TextTiling ................................ 437

14.2.2 The C99 Approach ........................... 438

14.2.3 Supervised Segmentation with Off-the-Shelf Classifiers ....... 439

14.2.4 Supervised Segmentation with Markovian Models .......... 441
