Sphinx 2.2.5 Released: A Guide to the Open-Source Full-Text Search Engine


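Sphinx is a free, open-source full-text search engine, and this document is its 2.2.5 reference manual. The manual is credited to Andrew Aksyonoff (2001-2014) and Sphinx Technologies Inc. (2008-2014); the project site is <http://sphinxsearch.com>. It covers Sphinx's core features, installation steps, configuration options, and data processing flow.

1. **Introduction**
   - Sphinx provides full-text search suited to indexing and searching large volumes of text efficiently.
   - Features of this version include fast queries, flexible data source support (SQL databases, XML, and TSV files), and multi-valued attributes (MVA).
2. **Installation**
   - Sphinx runs on several operating systems: Linux (compiled from source, or installed through the package manager on Debian and Ubuntu or on Red Hat and CentOS) and Windows.
   - Installation notes cover the required tools, known compilation problems, and the steps for each environment.
   - The manual also describes deprecations and changes in default configuration between versions, so that users understand new features and compatibility.
3. **Indexing**
   - Building indexes is Sphinx's core task. It involves managing data sources such as SQL databases (MySQL, PostgreSQL) and the pipe-based sources xmlpipe2 and tsvpipe.
   - How full-text fields and multi-valued attributes (MVA) are defined matters for index quality and query performance.
   - Live index updates and delta index updates keep an index current as the data changes, and multiple indexes can be merged.
4. **Restrictions on Source Data**
   - Source data must follow certain rules, for example on charsets, case folding, translation tables, and replacement rules, so that text is converted and matched against queries correctly.

The Sphinx 2.2.5 reference manual is a comprehensive guide, from basic installation to advanced index settings, and gives developers and system administrators the key information needed to build and tune a Sphinx-based search engine. Readers can use it to learn how to integrate Sphinx into a project and how to use its features to improve search quality and application performance.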

support as well. (We are using version 2.2.1-beta here for the sake of example only; be sure to change this to a specific version you're using.) You can use Windows Explorer in Windows XP and up to extract the files, or a freeware package like 7Zip to open the archive.

For the remainder of this guide, we will assume that the folders are unzipped into C:\Sphinx, such that searchd.exe can be found in C:\Sphinx\bin\searchd.exe. If you decide to use any different location for the folders or configuration file, please change it accordingly.

2. Edit the contents of sphinx.conf.in - specifically entries relating to @CONFDIR@ - to paths suitable for your system.

3. Install the searchd system as a Windows service:

C:\Sphinx\bin> C:\Sphinx\bin\searchd --install --config C:\Sphinx\sphinx.conf.in --servicename SphinxSearch

4. The searchd service will now be listed in the Services panel within the Management Console, available from Administrative Tools. It will not have been started, as you will need to configure it and build your indexes with indexer before starting the service. A guide to doing this can be found under Quick tour.

During the next steps of the install (which involve running indexer pretty much as you would on Linux) you may find that you get an error relating to libmysql.dll not being found. If you have MySQL installed, you should find a copy of this library in your Windows directory, or sometimes in Windows\System32, or failing that in the MySQL core directories. If you do receive an error, please copy libmysql.dll into the bin directory.

2.6. Sphinx deprecations and changes in default configuration

In version 2.2.1-beta we decided to start removing some old features. All of them had been 'unofficially' deprecated for some time, and we're informing you about it now.

Changes are as follows:

- 32-bit document IDs are now deprecated. Our binary releases are now all built with 64-bit IDs by default. Note that they can still load older indexes with 32-bit IDs, but that support will eventually be removed. In fact, that was deprecated a while ago, but now we just want to make it clear: we don't see any sense in trying to save your server's RAM this way.

- dict=crc is now deprecated. It has a bunch of limitations, the most important ones being keyword collisions, and no (good) wildcard matching support. You can read more about those limitations in our documentation.

- charset_type=sbcs is now deprecated, we're slowly switching to UTF-only. Even if your database is SBCS (likely for legacy reasons too, eh?), this should be absolutely trivial to work around: just add a pre-query to fetch your data in UTF-8 and you're all set. Also, in fact, our current UTF-8 tokenizer is even faster than the SBCS one.

- custom sort (@custom) is now removed from Sphinx. This feature was introduced long before sort by expression became a reality and it has been deprecated for a very long time.

- enable_star is deprecated now. The previous default mode was enable_star=0, which was due to compatibility with a very old Sphinx version. Such implicit star search isn't very intuitive. So, we've decided to eventually remove it and have marked it as deprecated just recently. We plan to totally remove this configuration key in the 2.2.X branch.

- str2ordinal attributes are deprecated. This feature allows you to perform sorting by a string. But it's also possible to do this with ordinary string attributes, which are much easier to use. str2ordinal only covers a small part of this functionality and is not needed now.

- str2wordcount attributes are deprecated. index_field_lengths=1 will create an integer attribute with field length set automatically, and we recommend using this configuration key when you need to store field lengths. Also, index_field_lengths=1 allows you to use new ranking formulas like BM25F().

- hit_format is deprecated. This is a hidden configuration key - it's not mentioned in our documentation. But it's there, and it's possible that someone may use it. And now we're urging you: don't use it. The default value is 'inline' and it's the new standard. The 'plain' hit_format is obsolete and will be removed in the near future.

- docinfo=inline is deprecated. You can now use ondisk_attrs or ondisk_attrs_default instead.

- workers=threads is the new default for all OSes now. We're going to get rid of the other modes in the future.

- mem_limit=128M is the new default.

- rt_mem_limit=128M is the new default.

- ondisk_dict is deprecated. No need to save RAM this way.

- ondisk_dict_default is deprecated. No need to save RAM this way.

- compat_sphinxql_magics was removed. Now you can't use the old result format, and SphinxQL always looks more like ANSI SQL.

- Completely removed xmlpipe. This was a very old ad hoc solution for a particular customer. xmlpipe2 surpasses it in every single aspect.

None of the different querying methods are deprecated, but as of version 2.2.1-beta, SphinxQL is the most advanced method. We plan to remove SphinxAPI and Sphinx SE someday, so it would be a good idea to start using SphinxQL.

- The SetWeights() API call has been deprecated for a long time and has now been removed from official APIs.

- The default matching mode for the API is now 'extended'. Actually, all other modes are deprecated. We recommend using the extended query syntax instead.

Changes for 2.2.2-beta:

- Removed deprecated "address" and "port" directives. Use "listen" instead.

- Removed str2wordcount attributes. Use index_field_lengths=1 instead.

- Removed str2ordinal attributes. Use string attributes for sorting.

- ondisk_dict and ondisk_dict_default were removed.

- Removed charset_type and mssql_unicode - we now support only UTF-8 encoding.

- Removed deprecated enable_star. Searches now always work as with enable_star=1.

- Removed the CLI search tool, which confused people instead of helping them, and sql_query_info.

- Deprecated SetMatchMode() API call.

- Changed default thread_stack value to 1M.

- Deprecated SetOverride() API call.
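The deprecations and removals listed above lend themselves to a quick automated check of an existing configuration. The following is a minimal Python sketch, not an official Sphinx tool: the function name `find_flagged`, the sample text, and the simple line-based parsing are assumptions made for illustration, and only directive names that appear verbatim in the lists above are checked.

```python
import re

# Directives flagged regardless of value (names taken from the lists above).
FLAGGED_KEYS = {
    "charset_type": "removed in 2.2.2-beta; only UTF-8 is supported",
    "mssql_unicode": "removed in 2.2.2-beta; only UTF-8 is supported",
    "enable_star": "removed in 2.2.2-beta; works as enable_star=1",
    "ondisk_dict": "removed in 2.2.2-beta",
    "ondisk_dict_default": "removed in 2.2.2-beta",
    "address": "removed in 2.2.2-beta; use listen",
    "port": "removed in 2.2.2-beta; use listen",
    "sql_query_info": "removed in 2.2.2-beta",
    "compat_sphinxql_magics": "removed in 2.2.1-beta",
    "hit_format": "deprecated in 2.2.1-beta",
}

# Directives flagged only for a specific value.
FLAGGED_VALUES = {
    ("dict", "crc"): "deprecated in 2.2.1-beta",
    ("docinfo", "inline"): "deprecated; see ondisk_attrs",
}

LINE_RE = re.compile(r"^\s*([A-Za-z0-9_]+)\s*=\s*(.*?)\s*$")


def find_flagged(conf_text):
    """Return (line_number, directive, note) for each flagged line."""
    findings = []
    for lineno, line in enumerate(conf_text.splitlines(), start=1):
        # Naive comment stripping: does not handle an escaped '#' in a value.
        line = line.split("#", 1)[0]
        match = LINE_RE.match(line)
        if not match:
            continue
        key, value = match.group(1), match.group(2)
        if key in FLAGGED_KEYS:
            findings.append((lineno, key, FLAGGED_KEYS[key]))
        elif (key, value) in FLAGGED_VALUES:
            findings.append((lineno, f"{key}={value}", FLAGGED_VALUES[(key, value)]))
    return findings


# Hypothetical configuration fragment used only to exercise the check.
SAMPLE_CONF = """
index test1
{
    dict         = crc
    docinfo      = extern
    enable_star  = 1
    charset_type = utf-8
}
searchd
{
    listen = 9312
    port   = 3312   # old style
}
"""

for lineno, directive, note in find_flagged(SAMPLE_CONF):
    print(f"line {lineno}: {directive}: {note}")
```

The check is deliberately conservative: it reports a directive only when the name (or name and value) matches an entry from the lists, and it does not try to understand section nesting or multi-line values.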

Changes for 2.2.3-beta:

SELECT COUNT(*) c, id%3 idd FROM test1 GROUP BY idd HAVING COUNT(*)>1;
SELECT COUNT(*) FROM test1;
CALL KEYWORDS ('one two three', 'test1');
CALL KEYWORDS ('one two three', 'test1', 1);
Happy searching! 
Chapter 3. Indexing
Table of Contents
3.1. Data sources 
3.2. Full-text fields 
3.3. Attributes 
3.4. MVA (multi-valued attributes) 
3.5. Indexes 
3.6. Restrictions on the source data 
3.7. Charsets, case folding, translation tables, and replacement rules 
3.8. SQL data sources (MySQL, PostgreSQL) 
3.9. xmlpipe2 data source 
3.10. tsvpipe (Tab Separated Values) data source 
3.11. Live index updates 
3.12. Delta index updates 
3.13. Index merging 
3.1. Data sources
The data to be indexed can generally come from very different sources: SQL databases, plain text files, HTML files, mailboxes, and so on. From Sphinx's point of view, the data it indexes is a set of structured documents, each of which has the same set of fields and attributes. This is similar to SQL, where each row would correspond to a document, and each column to either a field or an attribute.

Depending on what source Sphinx should get the data from, different code is required to fetch the data and prepare it for indexing. This code is called a data source driver (or simply driver or data source for brevity).

At the time of this writing, there are built-in drivers for MySQL, PostgreSQL, MS SQL (on Windows), and ODBC. There is also a generic driver called xmlpipe2, which runs a specified command and reads the data from its stdout. See Section 3.9, “xmlpipe2 data source” for the format description. In 2.2.1-beta a tsvpipe (Tab Separated Values) data source was added. See Section 3.10, “tsvpipe (Tab Separated Values) data source” for more information.

There can be as many sources per index as necessary. They will be sequentially processed in the very same order that was specified in the index definition. All the documents coming from those sources will be merged as if they were coming from a single source.
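The idea of several drivers feeding one index can be pictured with a small Python sketch. This is a toy model and not how Sphinx is implemented: `list_driver`, `tsv_driver`, and `merged_documents` are hypothetical names, and the tab-separated layout used here is an assumption for illustration rather than the tsvpipe format.

```python
import csv
import io
from itertools import chain


def list_driver(rows):
    """Hypothetical driver: documents that are already in memory."""
    yield from rows


def tsv_driver(text, columns):
    """Hypothetical driver: parse tab-separated text into document dicts."""
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        yield dict(zip(columns, row))


def merged_documents(sources, schema):
    """Yield documents source by source, in the order the sources are listed.

    Every document must carry exactly the columns named in the schema,
    mirroring the rule that all documents share one set of fields and attributes.
    """
    for doc in chain.from_iterable(sources):
        if set(doc) != set(schema):
            raise ValueError(f"document {doc!r} does not match schema {schema!r}")
        yield doc


SCHEMA = ("id", "title", "content")
first = list_driver([{"id": "1", "title": "hello", "content": "first post"}])
second = tsv_driver("2\tworld\tsecond post\n", SCHEMA)

# The consumer sees a single stream, as if there were only one source.
DOCS = list(merged_documents([first, second], SCHEMA))
print([doc["id"] for doc in DOCS])
```

Each driver only has to know how to fetch rows; the merging step is the same regardless of where the rows came from.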
3.2. Full-text fields
Full-text fields (or just fields for brevity) are the textual document contents that get indexed by 
Sphinx, and can be (quickly) searched for keywords. 

Fields are named, and you can limit your searches to a single field (e.g. search through "title" only) or a subset of fields (e.g. "title" and "abstract" only). The Sphinx index format generally supports up to 256 fields. However, up to version 2.0.1-beta indexes were forcibly limited to 32 fields, because of certain complications in the matching engine. Full support for up to 256 fields was added in version 2.0.2-beta.

Note that the original contents of the fields are not stored in the Sphinx index. The text that you send to Sphinx gets processed, and a full-text index (a special data structure that enables quick searches for a keyword) gets built from that text. But the original text contents are then simply discarded. Sphinx assumes that you store those contents elsewhere anyway.

Moreover, it is impossible to fully reconstruct the original text, because the specific whitespace, capitalization, punctuation, etc. will all be lost during indexing. It is theoretically possible to partially reconstruct a given document from the Sphinx full-text index, but that would be a slow process (especially if the CRC dictionary is used, which does not even store the original keywords and works with their hashes instead).
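Why the original text cannot be recovered is easier to see on a toy inverted index. The Python sketch below is an illustration under simplifying assumptions, not Sphinx's index format: the tokenizer, the `build_index`, `search`, and `reconstruct` helpers, and the sample documents are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumerics; case and punctuation are lost."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map keyword -> {doc_id: [positions]}; the source text is not kept."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text), start=1):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def search(index, keyword):
    """Return the sorted IDs of documents containing the keyword."""
    return sorted(index.get(keyword.lower(), {}))


def reconstruct(index, doc_id):
    """Best-effort rebuild of one document; it has to scan every keyword."""
    hits = [
        (pos, word)
        for word, postings in index.items()
        for pos in postings.get(doc_id, [])
    ]
    return " ".join(word for _, word in sorted(hits))


INDEX = build_index({1: "Hello, World!", 2: "hello   again"})
print(search(INDEX, "HELLO"))
print(reconstruct(INDEX, 1))
```

The rebuilt text for document 1 contains the right keywords in the right order, but the comma, the exclamation mark, and the capital letters are gone, and producing it required walking the whole dictionary.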

3.3. Attributes

Attributes are additional values associated with each document that can be used to perform additional filtering and sorting during search.

It is often desired to additionally process full-text search results based not only on matching document ID and its rank, but on a number of other per-document values as well. For instance, one might need to sort news search results by date and then relevance, or search through products within a specified price range, or limit blog search to posts made by selected users, or group results by month. To do that efficiently, Sphinx allows you to attach a number of additional attributes to each document, and store their values in the full-text index. It's then possible to use stored values to filter, sort, or group full-text matches.

Attributes, unlike the fields, are not full-text indexed. They are stored in the index, but it is not possible to search them as full-text, and attempting to do so results in an error.

For example, it is impossible to use the extended matching mode expression @column 1 to match documents where column is 1, if column is an attribute, and this is still true even if the numeric digits are normally indexed.

Attributes can be used for filtering, though, to restrict returned rows, as well as sorting or result grouping; it is entirely possible to sort results purely based on attributes, and ignore the search relevance tools. Additionally, attributes are returned from the search daemon, while the indexed text is not.

A good example for attributes would be a forum posts table. Assume that only title and content fields need to be full-text searchable - but that sometimes it is also required to limit search to a certain author or a sub-forum (i.e. search only those rows that have some specific values of author_id or forum_id columns in the SQL table); or to sort matches by post_date column; or to group matching posts by month of the post_date and calculate per-group match counts.
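The forum scenario can be illustrated with a small Python sketch of what filtering, sorting, and grouping by attributes mean. It is a toy model rather than the Sphinx API: the `MATCHES` list, the `weight` key standing in for relevance, and the three helper functions are hypothetical, and only the column names come from the text.

```python
from collections import Counter
from datetime import date

# Hypothetical full-text matches: attribute values are available for each match,
# while the indexed title/content text is not.
MATCHES = [
    {"id": 1, "weight": 3, "author_id": 7, "forum_id": 2, "post_date": date(2014, 1, 15)},
    {"id": 2, "weight": 5, "author_id": 7, "forum_id": 3, "post_date": date(2014, 1, 20)},
    {"id": 3, "weight": 1, "author_id": 9, "forum_id": 2, "post_date": date(2014, 2, 2)},
    {"id": 4, "weight": 4, "author_id": 7, "forum_id": 2, "post_date": date(2014, 2, 2)},
]


def filter_matches(matches, **wanted):
    """Keep matches whose attributes equal every requested value."""
    return [m for m in matches if all(m[k] == v for k, v in wanted.items())]


def sort_by_date_then_relevance(matches):
    """Newest posts first; ties on date are broken by higher weight."""
    return sorted(matches, key=lambda m: (m["post_date"], m["weight"]), reverse=True)


def count_by_month(matches):
    """Group matches by the month of post_date and count each group."""
    return Counter(m["post_date"].strftime("%Y-%m") for m in matches)


print([m["id"] for m in filter_matches(MATCHES, author_id=7)])
print([m["id"] for m in sort_by_date_then_relevance(MATCHES)])
print(dict(count_by_month(MATCHES)))
```

None of these operations touch the text of the posts, which is why the columns involved are declared as attributes rather than full-text fields.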

This can be achieved by specifying all the mentioned columns (excluding title and content, which are full-text fields) as attributes, indexing them, and then using API calls to set up filtering, sorting, and grouping. Here is an example.

Example sphinx.conf part:

...
sql_query = SELECT id, title, content, \
    author_id, forum_id, post_date FROM my_forum_posts
sql_attr_uint = author_id
sql_attr_uint = forum_id
sql_attr_timestamp = post_date
...
