FOREWORD
That amounted to a big challenge, as it required Tika to provide a flexible and robust
set of interfaces that could be used in any programming context where metadata
analysis was needed.
Luckily, Tika got there. With this book, written by Tika’s two main creators and
maintainers, Chris and Jukka, you’ll understand the problems of document analysis
and document information extraction. They first explain to the reader why
developers have such a need for Tika. Today, content handling and analysis are basic building
blocks of all major modern services, from search engines and content management
systems to data mining.
If you’re a software developer, you’ve no doubt needed, on many occasions, to
guess the encoding, formatting, and language of a file, and then to extract its
metadata (title, author, and so on) and content. And you’ve probably noticed that this is a
pain. Tika takes care of this for you. It provides a robust toolkit to easily handle any data
format and to simplify this painful process.
Chris and Jukka explain many details and examples of the Tika Application Programming
Interface (API) and toolkit, including the Tika command-line interface and its graphical
user interface (GUI), which you can use to extract information about any type of file
handled by Tika. They show how you can use the Tika API to integrate Tika’s
capabilities directly into your own projects. You’ll discover that Tika is both simple
to use and powerful. Tika has been carefully designed by Chris and Jukka and, despite
the internal complexity of this type of library, its API and tools are simple and easy
to understand and to use.
Finally, Chris and Jukka show many real-life use cases of Tika. The most notable
are Tika powering the NASA Science Data Systems, Tika curating cancer research
data at the National Cancer Institute’s Early Detection Research Network,
and the use of Tika for content management within the Apache Jackrabbit
project. Tika is already used in many projects.
I’m proud to have helped launch Tika. I’m extremely grateful to Chris and
Jukka for bringing Tika to this level, and glad to know that the long nights I spent writing
code for automatic language identification for the MIME type repository weren’t in
vain. To now make even a small contribution, for example by assisting research in
the fight against cancer, goes straight to my heart.
Thank you both for all your work, and thank you for this book.
JÉRÔME CHARRON
CHIEF TECHNICAL OFFICER
WEBPULSE