Stars: 105
Forks: 21
Pull Requests: 9
Issues: 26
Watchers: 5
Last Updated: 2023-08-31 09:59:16
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
License: MIT License
Languages: PHP, Shell
This tool provides Apache Tika bindings for PHP, letting you extract text and metadata from documents, images and other formats.
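The following modes are supported: app mode, which runs the Tika app JAR from the command line, and server mode, which sends requests to a running Tika server.
Server mode is recommended because it is 5 times faster, but some shared hosts don't allow running processes in the background.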
Although the library contains a list of supported versions, any version of Apache Tika should be compatible as long as the Tika team maintains backward compatibility. Therefore, there is no need to wait for a library update to work with new versions of the tool.
NOTE: the supported PHP version will remain in sync with the latest version supported by the PHP team.
Install using Composer:
composer require vaites/php-apache-tika
If you want to use OCR you must install Tesseract:
sudo yum install tesseract
(use dnf instead of yum on Fedora 22 or greater)
sudo apt-get install tesseract-ocr
brew install tesseract
(using Homebrew)
scoop install tesseract
(using Scoop)
The library assumes the tesseract binary is in the PATH, so you can compile it yourself or install it using any other method.
Start Apache Tika server with caution:
java -jar tika-server-x.xx.jar
If you are using a JRE instead of a JDK and have Java 9 or greater, you must run:
java --add-modules java.se.ee -jar tika-server-x.xx.jar
Instantiate the class, checking if JAR exists or server is running:
$client = \Vaites\ApacheTika\Client::make('localhost', 9998); // server mode (default)
$client = \Vaites\ApacheTika\Client::make('/path/to/tika-app.jar'); // app mode
If you want to use dependency injection, serialize the class or just delay the check:
$client = \Vaites\ApacheTika\Client::prepare('localhost', 9998);
$client = \Vaites\ApacheTika\Client::prepare('/path/to/tika-app.jar');
You can use a URL too:
$client = \Vaites\ApacheTika\Client::make('http://localhost:9998');
$client = \Vaites\ApacheTika\Client::prepare('http://localhost:9998');
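Because make() runs its check immediately, it can help to wrap the call when the client is created at application start. The following is a minimal sketch, assuming a failed check is reported by throwing (the exact exception class is not stated here, so the sketch catches any Throwable). The helper name tryMakeClient is hypothetical and not part of the library.

```php
<?php

/**
 * Run a client factory and return null instead of failing when the
 * eager check (JAR missing, server not running) raises an error.
 *
 * Assumption: a failed check is reported by throwing.
 * tryMakeClient is a hypothetical helper, not part of the library.
 */
function tryMakeClient(callable $factory, ?string &$error = null): ?object
{
    try {
        return $factory();
    } catch (\Throwable $e) {
        // Keep the message so the caller can log it or show it.
        $error = $e->getMessage();
        return null;
    }
}

// Usage (requires the library and Composer autoloading):
// $client = tryMakeClient(function () {
//     return \Vaites\ApacheTika\Client::make('localhost', 9998);
// }, $error);
// if ($client === null) { error_log("Tika unavailable: $error"); }
```

With prepare() the same wrapper is unnecessary at construction time, since the check is delayed.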
Use the class to extract text from documents:
$language = $client->getLanguage('/path/to/your/document');
$metadata = $client->getMetadata('/path/to/your/document');
$html = $client->getHTML('/path/to/your/document');
$text = $client->getText('/path/to/your/document');
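The four calls above can be grouped when an application needs every result for the same file. This is a small sketch, not part of the library: extractDocument is a hypothetical helper, and $client is left untyped so that the function also works with a test double.

```php
<?php

/**
 * Collect the main extraction results for one document in a single array.
 *
 * $client is expected to be a \Vaites\ApacheTika\Client instance.
 * extractDocument is a hypothetical helper, not part of the library.
 */
function extractDocument($client, string $path): array
{
    return [
        'path'     => $path,
        'language' => $client->getLanguage($path),
        'metadata' => $client->getMetadata($path),
        'html'     => $client->getHTML($path),
        'text'     => $client->getText($path),
    ];
}

// Usage (requires the library and Composer autoloading):
// $client = \Vaites\ApacheTika\Client::make('localhost', 9998);
// $result = extractDocument($client, '/path/to/your/document');
```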
Or use it to extract text from images:
$client = \Vaites\ApacheTika\Client::make($host, $port);
$metadata = $client->getMetadata('/path/to/your/image');
$text = $client->getText('/path/to/your/image');
You can use a URL instead of a file path and the library will download the file and pass it to Apache Tika. There's no need to add -enableUnsecureFeatures -enableFileUrl to the command line when starting the server, as described here.
If you use Apache Tika >= 2.0.0, you can define an HttpFetcher and use the option -enableUnsecureFeatures -enableFileUrl when starting the server to make the server download remote files when passing a URL instead of a filename. In order to do so, you must set the name of the HttpFetcher using $client->setFetcherName('yourFetcherName').
Here is the full list of available methods.
Tika file related methods:
$client->getMetadata($file);
$client->getRecursiveMetadata($file, 'text');
$client->getLanguage($file);
$client->getMIME($file);
$client->getHTML($file);
$client->getXHTML($file); // only CLI mode
$client->getText($file);
$client->getMainText($file);
Other Tika related methods:
$client->getSupportedMIMETypes();
$client->getIsMIMETypeSupported('application/pdf');
$client->getAvailableDetectors();
$client->getAvailableParsers();
$client->getVersion();
Encoding methods:
$client->getEncoding();
$client->setEncoding('UTF-8');
Supported versions related methods:
$client->getSupportedVersions();
$client->isVersionSupported($version);
Set/get a callback for sequential read of response:
$client->setCallback($callback);
$client->getCallback();
Set/get the chunk size for sequential read:
$client->setChunkSize($size);
$client->getChunkSize();
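A callback is useful for tracking progress on large responses. The sketch below assumes the callback receives each chunk of the response as a single string argument; that signature is an assumption and is not stated in this document. makeChunkCounter is a hypothetical helper.

```php
<?php

/**
 * Build a callback that counts the chunks and bytes it receives.
 *
 * Assumption: the library calls the callback with one string argument
 * per chunk. makeChunkCounter is hypothetical, not part of the library.
 */
function makeChunkCounter(array &$stats): callable
{
    $stats = ['chunks' => 0, 'bytes' => 0];

    return function (string $chunk) use (&$stats): void {
        $stats['chunks']++;
        $stats['bytes'] += strlen($chunk);
    };
}

// Usage (requires the library and a configured $client):
// $client->setCallback(makeChunkCounter($stats));
// $client->getText('/path/to/your/document');
// printf("%d chunks, %d bytes\n", $stats['chunks'], $stats['bytes']);
```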
Enable/disable the internal remote file downloader:
$client->setDownloadRemote(true);
$client->getDownloadRemote();
Set/get the fetcher name:
$client->setFetcherName($fetcher); // one of FileSystemFetcher, HttpFetcher, S3Fetcher, GCSFetcher, or SolrFetcher
$client->getFetcherName();
Set/get JAR/Java paths (only CLI mode):
$client->setPath($path);
$client->getPath();
$client->setJava($java);
$client->getJava();
$client->setJavaArgs('-JXmx4g');
$client->getJavaArgs();
$client->setEnvVars(['LANG' => 'es_ES.UTF-8']);
$client->getEnvVars();
Set/get host properties:
$client->setHost($host);
$client->getHost();
$client->setPort($port);
$client->getPort();
$client->setUrl($url);
$client->getUrl();
$client->setRetries($retries);
$client->getRetries();
Set/get cURL client options:
$client->setOptions($options);
$client->getOptions();
$client->setOption($option, $value);
$client->getOption($option);
Set/get timeout:
$client->setTimeout($seconds);
$client->getTimeout();
Set/get HTTP headers (see TikaServer):
$client->setHeader('Foo', 'bar');
$client->getHeader('Foo');
$client->setHeaders(['Foo' => 'bar', 'Bar' => 'baz']);
$client->getHeaders();
Set/get OCR languages (see TikaOCR):
$client->setOCRLanguage($language);
$client->setOCRLanguages($languages);
$client->getOCRLanguages();
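As an illustration of the OCR setters, the following sketch sets the languages and then extracts text from an image. It assumes the language values are Tesseract language codes such as 'eng' or 'spa', and that the matching Tesseract language data is installed. ocrImage is a hypothetical helper, not part of the library.

```php
<?php

/**
 * Extract text from an image, optionally setting the OCR languages first.
 *
 * Assumption: $languages holds Tesseract language codes ('eng', 'spa').
 * $client is expected to be a \Vaites\ApacheTika\Client instance.
 * ocrImage is a hypothetical helper, not part of the library.
 */
function ocrImage($client, string $imagePath, array $languages = []): string
{
    // Leave the client's current OCR languages untouched when none are given.
    if ($languages !== []) {
        $client->setOCRLanguages($languages);
    }

    return (string) $client->getText($imagePath);
}

// Usage (requires the library, a configured $client and Tesseract):
// $text = ocrImage($client, '/path/to/your/image', ['eng', 'spa']);
```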
Set HTTP fetcher name (for Tika >= 2.0.0 only, see https://cwiki.apache.org/confluence/display/TIKA/tika-pipes):
$client->setFetcherName($fetcherName);
Since version 1.0 there are some breaking changes:
$client->getRecursiveMetadata() returns an array as expected
Client::getSupportedVersions() and Client::isVersionSupported() methods cannot be called statically
Client::getAvailableDetectors() and Client::getAvailableParsers() are identical and have a new definition
See CHANGELOG.md for more details.
This library is only a proxy, so if you get an empty response or unexpected results, the most common cause is Tika itself. A simple test is to use the GUI to check the response:
java -jar tika-app-x.xx.jar
If the results are the same, you must take a look into Tika's Jira and open an issue if necessary.
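When comparing results or preparing an issue, it can help to record what the client reports for the problem file. This is a minimal sketch using only methods listed above; tikaIssueSummary is a hypothetical helper, and the layout of the summary is an arbitrary choice.

```php
<?php

/**
 * Build a short plain-text summary for a file that gives unexpected results.
 *
 * $client is expected to be a \Vaites\ApacheTika\Client instance.
 * tikaIssueSummary is a hypothetical helper, not part of the library.
 */
function tikaIssueSummary($client, string $file): string
{
    $text = (string) $client->getText($file);

    return implode("\n", [
        'Tika version: ' . $client->getVersion(),
        'Detected MIME type: ' . $client->getMIME($file),
        // strlen() counts bytes, not characters.
        'Extracted text length (bytes): ' . strlen($text),
    ]);
}

// Usage (requires the library and a configured $client):
// echo tikaIssueSummary($client, '/path/to/your/document'), PHP_EOL;
```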
By default the returned text is encoded in UTF-8, and the Client::setEncoding() method allows you to set the expected encoding.
Tests are designed to cover all features for all supported versions of Apache Tika in app mode and server mode. There are a few samples to test against:
Some issues unrelated to this library were found during tests: