Python Text Analysis in Practice: Building Corpora and Machine Learning Models


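Applied Text Analysis with Python, written by Benjamin Bengfort, Tony Ojeda, and Rebecca Bilbro, guides readers through in-depth text processing and analysis in Python, with the goal of building data products that understand language. It covers everything from basic operations to advanced techniques and is aimed at developers interested in natural language processing (NLP), text mining, and machine learning. The core of the book is organized around the following key topics:

1. **Python fundamentals**: The authors first introduce Python's basic syntax and libraries so that readers have a solid grasp of the language, which is essential for the text analysis that follows. Python's concise style and its data processing modules (such as NumPy, Pandas, and Matplotlib) are covered in depth.
2. **Text preprocessing**: Data cleaning and preprocessing are key steps in text analysis. The book explains how to remove noise (such as punctuation and stop words), perform tokenization, stemming, and lemmatization, and compute term frequencies and document vectors in preparation for building machine learning models.
3. **Corpus creation**: The book covers how to collect data with web crawlers, handle page structure, download and store large volumes of text, and organize and manage the resulting corpora.
4. **Model selection and application**: The book discusses commonly used text analysis models, such as TF-IDF, bag-of-words, n-grams, naive Bayes, support vector machines (SVM), and deep learning (such as RNNs and LSTMs), and uses examples to show how to apply them to tasks such as sentiment analysis, topic modeling, and named entity recognition.
5. **Hands-on projects**: To help readers consolidate what they have learned, the book walks through several practical projects, such as news classification, social media monitoring, and user review analysis.
6. **Copyright and publication information**: Finally, the book includes copyright information that confirms the authors' rights and describes O'Reilly Media's publishing process, including editing, production editing, and proofreading, along with ways to purchase the book and access the electronic edition online.

Applied Text Analysis with Python is a practical and comprehensive guide. It suits beginners who want a quick start in text analysis as well as experienced developers who want to sharpen their skills and apply Python to real-world scenarios. It is a valuable reference for both individual study and enterprise project development.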

There are also several ways to crawl and scrape websites besides the methods we’ve demonstrated here. For more advanced crawling and scraping, it may be worth looking into the following tools.

• Scrapy - an open source framework for extracting data from websites.
• Selenium - a Python library that allows you to simulate user interaction with a website.
• Apache Nutch - a highly extensible and scalable open source web crawler.

Web crawling and scraping can take us a long way in our quest to acquire text data from the web, and the tools currently available make performing these tasks easier and more efficient. However, there is still much work left to do after initial ingestion. Formatted HTML is fairly easy to parse with packages like BeautifulSoup, but after a bit of experience with scraping, one quickly realizes that although general formats are similar, different websites can lay out content very differently. Accounting for, and working with, all these different HTML layouts can be frustrating and time consuming, which can make more structured text data sources, like RSS, look much more attractive.

Ingestion using RSS Feeds and Feedparser

RSS (Really Simple Syndication) is a standardized XML format for syndicated text data that is primarily used by blogs, news sites, and other online publishers who publish multiple documents (posts, articles, etc.) using the same general layout. There are several versions of RSS, all of which originally evolved from the Resource Description Framework (RDF) data serialization model; the most common is currently RSS 2.0. Atom is a newer and more standardized approach to providing XML content updates but, at the time of this writing, a less widely used one.
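To make the format concrete, here is a minimal sketch of what an RSS 2.0 document looks like and how its items can be listed with Python's standard xml.etree.ElementTree module. The sample feed, blog name, and URLs are invented for illustration, and the list_items helper is our own name rather than part of any library.

```python
import xml.etree.ElementTree as ET

# A hand-written RSS 2.0 document; the blog name and URLs are made up.
RSS_SAMPLE = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <description>Posts about text analysis</description>
    <item>
      <title>First Post</title>
      <link>http://example.com/first-post</link>
      <description>Summary of the first post.</description>
    </item>
    <item>
      <title>Second Post</title>
      <link>http://example.com/second-post</link>
      <description>Summary of the second post.</description>
    </item>
  </channel>
</rss>
"""

def list_items(rss_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

for title, link in list_items(RSS_SAMPLE):
    print(title, link)
```

Every item carries the same small set of child elements, which is why a feed is so much easier to process than arbitrary HTML.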

Text data structured as RSS is formatted more consistently than text data on a regular web page because it is delivered as a content feed: a series of documents arranged in the order they were published. With a feed, you do not need to crawl the website in order to get other content or acquire updates, which makes RSS preferable to acquiring data through crawling and scraping. If the desired data resides in the body of blog posts or news articles and the website makes them available as an RSS feed, you can simply parse that feed.

Another feature of RSS is its ability to synchronize or retrieve the latest version of the content as articles on the source website are updated. Routine querying ensures any changes to the content are reflected in the XML. However, the RSS format also has some notable drawbacks. Most feeds give the content owner the option of displaying either the full text or just a summary of each post or article. Content owners whose revenue depends heavily on serving advertisements have an incentive to display only summary text via RSS to encourage readers to visit their website to view both the full content and the ads.

12 | Chapter 1: Text Ingestion and Wrangling
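The difference between full-text and summary-only feeds can be checked programmatically. The sketch below assumes a common convention in RSS 2.0 feeds, where the full body travels in a content:encoded element and the teaser in description; not every publisher follows it. The sample feed and the classify_items helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS "content" module, which defines <content:encoded>.
CONTENT_NS = {"content": "http://purl.org/rss/1.0/modules/content/"}

# A made-up feed with one full-text item and one summary-only item.
FEED = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>Full Post</title>
      <description>Short teaser.</description>
      <content:encoded><![CDATA[<p>The complete article body.</p>]]></content:encoded>
    </item>
    <item>
      <title>Teaser Only</title>
      <description>Read the rest on our site.</description>
    </item>
  </channel>
</rss>
"""

def classify_items(rss_text):
    """Map each item title to 'full' or 'summary', depending on what the feed ships."""
    root = ET.fromstring(rss_text)
    result = {}
    for item in root.iter("item"):
        full = item.findtext("content:encoded", namespaces=CONTENT_NS)
        result[item.findtext("title")] = "full" if full else "summary"
    return result

print(classify_items(FEED))
```

A check like this is a cheap way to decide whether a feed is enough on its own or whether the linked pages would still need to be scraped.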

In the example below, we introduce the Python feedparser library to assist in ingesting the RSS feeds of a list of blogs, parsing them, extracting the text content, and then writing that content to disk as XML files. After creating a list of feeds, the rss_parse function calls feedparser's parse function to parse the XML of a feed. From there, the entries attribute of the result holds the feed’s posts or articles. Next, we iterate through each post, taking its title, using BeautifulSoup's get_text method to extract the text from inside any of the tags from our tag list, and writing that post’s text to a file named after the title.

    import bs4
    import feedparser
    from slugify import slugify

    feeds = ['http://blog.districtdatalabs.com/feed',
             'http://feeds.feedburner.com/oreilly/radar/atom',
             'http://blog.kaggle.com/feed/',
             'http://blog.revolutionanalytics.com/atom.xml']

    # HTML elements whose text we keep (HTML defines headings h1 through h6 only)
    TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'li']

    def rss_parse(feed):
        parsed = feedparser.parse(feed)
        for post in parsed.entries:
            # Prefer the full content; fall back to the summary if the feed omits it
            if 'content' in post:
                html = post.content[0].get('value')
            else:
                html = post.get('summary', '')
            soup = bs4.BeautifulSoup(html, 'lxml')
            filename = slugify(post.title).lower() + '.xml'
            # Write the text of every matching tag to this post's file
            with open(filename, 'w', encoding='utf-8') as f:
                for tag in soup.find_all(TAGS):
                    f.write(tag.get_text() + '\n \n')

    if __name__ == '__main__':
        for feed in feeds:
            rss_parse(feed)

When the code above is run, it generates a series of .xml files, one for each blog post or article belonging to each RSS source listed in our feeds list. The files contain only the text content from each post or article.
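To show what "only the text content" means without installing anything, here is a rough standard-library analogue of the tag-filtering step, using html.parser instead of BeautifulSoup. This is a swapped-in technique, not the approach used above: the TextExtractor class and extract_text function are our own names, and unlike find_all this sketch merges nested matching elements into one chunk and assumes every matching tag is explicitly closed.

```python
from html.parser import HTMLParser

# Elements whose text we want to keep.
TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li"}

class TextExtractor(HTMLParser):
    """Collect the text found inside elements listed in TAGS."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of TAGS elements we are currently inside
        self.chunks = []    # one string per outermost matching element
        self._buffer = []   # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        if tag in TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                text = "".join(self._buffer).strip()
                if text:
                    self.chunks.append(text)
                self._buffer = []

    def handle_data(self, data):
        # Text outside the chosen elements (scripts, navigation) is ignored.
        if self.depth > 0:
            self._buffer.append(data)

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks

sample = ("<div><h1>Title</h1><p>Hello <b>world</b>.</p>"
          "<script>var x = 1;</script><ul><li>one</li><li>two</li></ul></div>")
print(extract_text(sample))
```

The markup and the script body are discarded, and only the readable text of headings, paragraphs, and list items survives.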

The Baleen Ingestion Engine

The actual implementation of ingestion can become complex; APIs and RSS feeds can change, and significant forethought is required to determine how best to put together an application that will conduct not only robust, autonomous ingestion, but also secure data management.

Data Ingestion of Text | 13

APIs: Twitter and Search

An API (Application Programming Interface) is a set of programmatic instructions for accessing a web-based software application. Organizations frequently release their APIs to the public to enable others to develop products on top of their data. Most modern web and social media services have APIs that developers can access, and they are typically accompanied by documentation with instructions on how to access and obtain the data.

As a web service evolves, both the API and the documentation are usually updated as well, and as developers and data scientists, we need to stay current on changes to the APIs we use in our data products.

A RESTful API is a type of web service API that adheres to representational state transfer (REST) architectural constraints. REST is a simple way to organize interactions between independent systems, allowing for lightweight interaction with clients such as mobile phones and other websites. REST is not exclusively tied to the web, but it is almost always implemented over HTTP, the protocol that inspired it. As a result, wherever HTTP can be used, REST can also be used.
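As a rough illustration of how REST maps onto HTTP, the sketch below builds (but does not send) requests against a hypothetical service using only the standard library. The base URL, the posts resource, and the build_request helper are all assumptions made for this example; a real API documents its own resources and parameters.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://api.example.com/v1"  # hypothetical service, not a real API

def build_request(method, resource, **params):
    """Build (but do not send) an HTTP request for a REST resource."""
    url = f"{BASE_URL}/{resource}"
    if params:
        url += "?" + urlencode(params)
    return Request(url, headers={"Accept": "application/json"}, method=method)

# The resource is identified by the URL; the HTTP verb says what to do with it.
list_posts = build_request("GET", "posts", author="alice", limit=10)
delete_post = build_request("DELETE", "posts/42")

print(list_posts.get_method(), list_posts.full_url)
print(delete_post.get_method(), delete_post.full_url)
```

Passing either object to urllib.request.urlopen would perform the call; we stop short of that here because the endpoint does not exist.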

In order to interact with APIs, you must usually register your application with the service provider, obtain authorization credentials, and agree to the web service’s terms of use. The credentials provided usually consist of an API key, an API secret, an access token, and an access token secret, all of which are long combinations of alphanumeric and special characters. Having a credentialing system in place allows the service provider to monitor and control use of their API, primarily so that they can prevent abuse of their service. Many service providers allow for registration using OAuth, which is an open authorization standard that allows a user’s information to be communicated to a third party without exposing confidential information such as their password.
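Because these four values grant access to your account, it is generally safer to keep them out of source code. The following is a minimal sketch, assuming the credentials are supplied through environment variables; the MYAPP_ prefix, the Credentials type, and the load_credentials function are hypothetical names chosen for this example.

```python
import os
from typing import Mapping, NamedTuple

class Credentials(NamedTuple):
    api_key: str
    api_secret: str
    access_token: str
    access_token_secret: str

def load_credentials(env: Mapping[str, str] = os.environ,
                     prefix: str = "MYAPP_") -> Credentials:
    """Read the four credential values from environment-style variables."""
    # Field api_key is read from MYAPP_API_KEY, and so on.
    names = [prefix + field.upper() for field in Credentials._fields]
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("missing credentials: " + ", ".join(missing))
    return Credentials(*(env[name] for name in names))

# Demonstration with a plain dict standing in for the real environment.
demo_env = {
    "MYAPP_API_KEY": "key",
    "MYAPP_API_SECRET": "secret",
    "MYAPP_ACCESS_TOKEN": "token",
    "MYAPP_ACCESS_TOKEN_SECRET": "token-secret",
}
creds = load_credentials(demo_env)
print(creds.api_key)
```

Failing loudly when a value is missing surfaces configuration mistakes before any request is made.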

APIs are popular data sources among data scientists because they provide us with a source of ingestion that is authorized, structured, and well-documented. The service provider is giving us permission and access to retrieve and use the data they have in a responsible manner. This isn’t true of crawling/scraping or RSS, and for this reason, obtaining data via API is preferable whenever it is an option.

To illustrate how we can work with an API to acquire some data, let’s take a look at an example. The following example uses the popular tweepy library to connect to Twitter’s API and then, given a list of user names, retrieves the last 100 tweets from each user and saves each tweet to disk as an individual document.
