vaites/php-apache-tika

Stars: 105

Forks: 21

Pull Requests: 9

Issues: 26

Watchers: 5

Last Updated: 2023-08-31 09:59:16

#apache #tika #text-extraction #text-recognition #ocr #php-library

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats

License: MIT License

Languages: PHP, Shell

PHP Apache Tika

This tool provides Apache Tika bindings for PHP, allowing you to extract text and metadata from documents, images and other formats.

The following modes are supported:

App mode: run app JAR via command line interface
Server mode: make HTTP requests to JSR 311 network server

Server mode is recommended because it is 5 times faster, but some shared hosts don't allow running processes in the background.

Although the library contains a list of supported versions, any version of Apache Tika should be compatible as long as the Tika team maintains backward compatibility. Therefore, it is not necessary to wait for an update of the library to work with new versions of the tool.

Features

- Simple class interface to Apache Tika features:
  - Text and HTML extraction
  - Metadata extraction
  - OCR recognition
- Standardized metadata for documents
- Support for local and remote resources
- No heavyweight library dependencies
- Compatible with Apache Tika 1.15 or greater
  - Tested up to 1.28.5 and 2.8.0
- Works on Linux, macOS, Windows and probably on FreeBSD

Requirements

- PHP 7.3 or greater
  - Multibyte String support
  - cURL extension
- Apache Tika 1.15 or greater
- Oracle Java or OpenJDK
  - Java 8 for Tika 1.19 or greater
  - Java 7 for Tika from 1.15 to 1.18
- Tesseract (optional for OCR recognition)

NOTE: the supported PHP version will remain synced with the latest versions supported by the PHP team

Installation

Install using Composer:

composer require vaites/php-apache-tika

If you want to use OCR you must install Tesseract:

Fedora/CentOS: sudo yum install tesseract (use dnf instead of yum on Fedora 22 or greater)
Debian/Ubuntu: sudo apt-get install tesseract-ocr
macOS: brew install tesseract (using Homebrew)
Windows: scoop install tesseract (using Scoop)

The library assumes the tesseract binary is in the PATH, so you can compile it yourself or install it using any other method.
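Since the library only assumes the binary is reachable, it can help to verify that before relying on OCR. The following is a minimal sketch and not part of the library: findInPath() is a hypothetical helper that scans the directories listed in the PATH environment variable (or in a list passed explicitly) for an executable file.

```php
<?php
/**
 * Hypothetical helper (not part of php-apache-tika): look for an executable
 * in the directories of PATH, or of the list given in $path.
 * Returns the full path of the first match, or null when nothing is found.
 */
function findInPath(string $binary, ?string $path = null): ?string
{
    $path = $path ?? (string) getenv('PATH');

    // On Windows the separator is ';' and executables usually end in .exe
    $extensions = PATH_SEPARATOR === ';' ? ['', '.exe'] : [''];

    foreach (explode(PATH_SEPARATOR, $path) as $dir) {
        if ($dir === '') {
            continue; // skip empty PATH entries
        }

        foreach ($extensions as $extension) {
            $candidate = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $binary . $extension;

            if (is_file($candidate) && is_executable($candidate)) {
                return $candidate;
            }
        }
    }

    return null;
}
```

If findInPath('tesseract') returns null, OCR requests are unlikely to work until Tesseract is installed or PATH is adjusted.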

Usage

Start Apache Tika server with caution:

java -jar tika-server-x.xx.jar

If you are using a JRE instead of a JDK and have Java 9 or greater, you must run the following (the java.se.ee module was removed in Java 11, so this applies to Java 9 and 10):

java --add-modules java.se.ee -jar tika-server-x.xx.jar

Instantiate the class, checking if JAR exists or server is running:

$client = \Vaites\ApacheTika\Client::make('localhost', 9998);           // server mode (default)
$client = \Vaites\ApacheTika\Client::make('/path/to/tika-app.jar');     // app mode

If you want to use dependency injection, serialize the class or just delay the check:

$client = \Vaites\ApacheTika\Client::prepare('localhost', 9998);
$client = \Vaites\ApacheTika\Client::prepare('/path/to/tika-app.jar');

You can use a URL too:

$client = \Vaites\ApacheTika\Client::make('http://localhost:9998');
$client = \Vaites\ApacheTika\Client::prepare('http://localhost:9998');
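Because make() checks that the JAR exists or the server is running, one possible pattern is to prefer server mode and fall back to app mode. The sketch below is only an illustration: firstAvailableClient() is a hypothetical helper, and it assumes that a failed check surfaces as a thrown exception. It takes closures, so it does not depend on the library itself.

```php
<?php
/**
 * Hypothetical helper: call each factory in order and return the first
 * client that can be created.
 * Assumption: a failed check inside the factory throws an exception.
 *
 * @param callable[] $factories closures that each return a client
 */
function firstAvailableClient(array $factories)
{
    $lastError = null;

    foreach ($factories as $factory) {
        try {
            return $factory();
        } catch (\Exception $e) {
            // Remember why this mode failed and try the next one
            $lastError = $e;
        }
    }

    throw new \RuntimeException('No Apache Tika client could be created', 0, $lastError);
}
```

A call might pass two closures, the first returning \Vaites\ApacheTika\Client::make('localhost', 9998) and the second returning \Vaites\ApacheTika\Client::make('/path/to/tika-app.jar'), so that app mode is only used when the server is unavailable.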

Use the class to extract text from documents:

$language = $client->getLanguage('/path/to/your/document');
$metadata = $client->getMetadata('/path/to/your/document');

$html = $client->getHTML('/path/to/your/document');
$text = $client->getText('/path/to/your/document');
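The four calls above can be grouped when every representation of a document is needed at once. This is a minimal sketch: extractDocument() is a hypothetical wrapper, and $client is left untyped so that any object exposing the methods shown above can be passed in.

```php
<?php
/**
 * Hypothetical wrapper: collect language, metadata, HTML and plain text
 * for one document into a single array.
 *
 * @param object $client any object with getLanguage(), getMetadata(),
 *                       getHTML() and getText(), such as the library client
 */
function extractDocument($client, string $file): array
{
    return [
        'language' => $client->getLanguage($file),
        'metadata' => $client->getMetadata($file),
        'html'     => $client->getHTML($file),
        'text'     => $client->getText($file),
    ];
}
```

With a client created as shown earlier, extractDocument($client, '/path/to/your/document') would return all four results under the keys language, metadata, html and text.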

Or use it to extract text from images:

$client = \Vaites\ApacheTika\Client::make($host, $port);
$metadata = $client->getMetadata('/path/to/your/image');

$text = $client->getText('/path/to/your/image');

You can use a URL instead of a file path, and the library will download the file and pass it to Apache Tika. There's no need to add the -enableUnsecureFeatures -enableFileUrl options (described in the Tika server documentation) to the command line when starting the server.

If you use Apache Tika >= 2.0.0, you can define an HttpFetcher and use the option -enableUnsecureFeatures -enableFileUrl when starting the server to make the server download remote files when passing a URL instead of a filename. In order to do so, you must set the name of the HttpFetcher using $client->setFetcherName('yourFetcherName').
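To illustrate the server-side option, the sketch below sets a fetcher name only when the resource looks like a remote URL. Both functions are hypothetical helpers: isRemoteResource() is a simple scheme check written for this example (the library's own detection may differ), and applyFetcherForRemote() only calls setFetcherName(), which is documented above.

```php
<?php
/**
 * Hypothetical helper: true when the resource has an http or https scheme.
 */
function isRemoteResource(string $resource): bool
{
    $scheme = parse_url($resource, PHP_URL_SCHEME);

    return in_array(strtolower((string) $scheme), ['http', 'https'], true);
}

/**
 * Hypothetical helper: configure the fetcher only for remote resources.
 * Returns true when the fetcher name was set on the client.
 *
 * @param object $client any object with setFetcherName()
 */
function applyFetcherForRemote($client, string $resource, string $fetcherName): bool
{
    if (!isRemoteResource($resource)) {
        return false; // local files need no fetcher
    }

    $client->setFetcherName($fetcherName);

    return true;
}
```

Here $fetcherName would be the name given to the HttpFetcher in the Tika configuration, for example 'yourFetcherName'.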

Methods

Here is the full list of available methods:

Common

Tika file related methods:

$client->getMetadata($file);
$client->getRecursiveMetadata($file, 'text');
$client->getLanguage($file);
$client->getMIME($file);
$client->getHTML($file);
$client->getXHTML($file); // only CLI mode
$client->getText($file);
$client->getMainText($file);

Other Tika related methods:

$client->getSupportedMIMETypes();
$client->getIsMIMETypeSupported('application/pdf');
$client->getAvailableDetectors();
$client->getAvailableParsers();
$client->getVersion();

Encoding methods:

$client->getEncoding();
$client->setEncoding('UTF-8');

Supported versions related methods:

$client->getSupportedVersions();
$client->isVersionSupported($version);

Set/get a callback for sequential read of response:

$client->setCallback($callback);
$client->getCallback();

Set/get the chunk size for sequential read:

$client->setChunkSize($size);
$client->getChunkSize();
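The callback and chunk size are meant to be used together when a response is too large to handle in one piece. The following sketch shows one way to combine them. streamText() is a hypothetical helper, and it assumes that the callback receives each chunk as its only argument and that the chunk size is expressed in bytes, so check both against the library before relying on it.

```php
<?php
/**
 * Hypothetical helper: read the text of a document chunk by chunk.
 * Assumptions: $onChunk is called once per chunk with the chunk as its
 * only argument, and $chunkSize is a number of bytes.
 *
 * @param object $client any object with setChunkSize(), setCallback()
 *                       and getText()
 */
function streamText($client, string $file, callable $onChunk, int $chunkSize)
{
    $client->setChunkSize($chunkSize);
    $client->setCallback($onChunk);

    // The callback is invoked while the response is being read
    return $client->getText($file);
}
```

A typical callback would append each chunk to a file or feed it to an indexer instead of keeping the whole text in memory.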

Enable/disable the internal remote file downloader:

$client->setDownloadRemote(true);
$client->getDownloadRemote();

Set the fetcher name:

$client->setFetcherName($fetcher); // one of FileSystemFetcher, HttpFetcher, S3Fetcher, GCSFetcher, or SolrFetcher
$client->getFetcherName();

Command line client

Set/get JAR/Java paths (only CLI mode):

$client->setPath($path);
$client->getPath();

$client->setJava($java);
$client->getJava();

$client->setJavaArgs('-JXmx4g');
$client->getJavaArgs();

$client->setEnvVars(['LANG' => 'es_ES.UTF-8']);
$client->getEnvVars();

Web client

Set/get host properties:

$client->setHost($host);
$client->getHost();

$client->setPort($port);
$client->getPort();

$client->setUrl($url);
$client->getUrl();

$client->setRetries($retries);
$client->getRetries();

Set/get cURL client options:

$client->setOptions($options);
$client->getOptions();
$client->setOption($option, $value);
$client->getOption($option);

Set/get timeout:

$client->setTimeout($seconds);
$client->getTimeout();

Set/get HTTP headers (see TikaServer):

$client->setHeader('Foo', 'bar');
$client->getHeader('Foo');
$client->setHeaders(['Foo' => 'bar', 'Bar' => 'baz']);
$client->getHeaders();

Set/get OCR languages (see TikaOCR):

$client->setOCRLanguage($language);
$client->setOCRLanguages($languages);
$client->getOCRLanguages();
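As an illustration of the OCR setters, the sketch below sets the languages and then extracts text from an image in one step. recognizeText() is a hypothetical helper. The language values are assumed to be Tesseract language codes such as 'eng' or 'spa', and the matching Tesseract language data is assumed to be installed.

```php
<?php
/**
 * Hypothetical helper: configure OCR languages, then extract image text.
 * Assumption: $languages holds Tesseract language codes, e.g. ['eng', 'spa'].
 *
 * @param object $client any object with setOCRLanguages() and getText()
 */
function recognizeText($client, string $image, array $languages): string
{
    $client->setOCRLanguages($languages);

    return (string) $client->getText($image);
}
```

For a single language, setOCRLanguage($language) from the list above could be called instead.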

Set HTTP fetcher name (for Tika >= 2.0.0 only, see https://cwiki.apache.org/confluence/display/TIKA/tika-pipes):

$client->setFetcherName($fetcherName);

Breaking changes

Since version 1.0 there are some breaking changes:

Apache Tika versions prior to 1.15 are not supported (use 0.x version for 1.14 and older)
PHP minimum requirement is 7.3 or greater (use 0.x version for 7.1 and older)
$client->getRecursiveMetadata() returns an array as expected
Client::getSupportedVersions() and Client::isVersionSupported() methods cannot be called statically
Values returned by Client::getAvailableDetectors() and Client::getAvailableParsers() are identical and have a new definition

See CHANGELOG.md for more details.

Troubleshooting

Empty responses or unexpected results

This library is only a proxy, so if you get an empty response or unexpected results, the most common cause is Tika itself. A simple test is to use the GUI to check the response:

Run the Tika app without arguments: java -jar tika-app-x.xx.jar
Drop your file or select it using File -> Open
Wait until the metadata appears
Get the text or HTML using View menu

If the results are the same, the problem is in Tika: take a look at Tika's Jira and open an issue if necessary.

Encoding

By default the returned text is encoded with UTF-8, and the Client::setEncoding() method allows you to set the expected encoding.

Tests

Tests are designed to cover all features for all supported versions of Apache Tika in app mode and server mode. There are a few samples to test against:

sample1: document metadata and text extraction
sample2: image metadata
sample3: text recognition
sample4: unsupported media
sample5: huge text for callbacks
sample6: remote calls
sample7: text encoding
sample8: recursive metadata

Known issues

There are some issues found during tests that are not related to this library:

Apache Tika 1.17 and lower can't extract text from OCR as described in TIKA-2509
Tesseract slows down document parsing as described in TIKA-2359

Integrations

Symfony2 Bundle

RELEASES

v1.3.1 by @vaites
v1.3.0 by @vaites
v1.2.5 by @vaites
v1.2.4 by @vaites
v1.2.3 by @vaites
v1.2.2 by @vaites
v1.2.1 by @vaites
v1.2.0 by @vaites
v1.1.1 by @vaites
v1.1.0 by @vaites
v1.0.2 by @vaites
v1.0.1 by @vaites
v1.0.0 by @vaites
v0.9.3 by @vaites
v0.9.2 by @vaites
v0.9.1 by @vaites
v0.9.0 by @vaites
v0.8.0 by @vaites
v0.7.2 by @vaites
v0.7.1 by @vaites
v0.7.0 by @vaites
v0.6.0 by @vaites
v0.5.1 by @vaites
v0.5.0 by @vaites
v0.4.6 by @vaites
v0.4.5 by @vaites
v0.4.4 by @vaites
v0.4.3 by @vaites
v0.4.2 by @vaites
v0.4.1 by @vaites
v0.4.0 by @vaites
v0.3.7 by @vaites
v0.3.6 by @vaites
v0.3.5 by @vaites
v0.3.4 by @vaites
v0.3.3 by @vaites
v0.3.2 by @vaites
v0.3.1 by @vaites
v0.3.0 by @vaites
v0.2.0 by @vaites
v0.1.0 by @vaites
