Imangazaliev/DiDOM

Stars: 2134

Forks: 230

Pull Requests: 33

Issues: 170

Watchers: 87

Last Updated: 2023-03-05 03:49:39

#html-parser #xml-parser #html #parser #xml #dom #xpath

Simple and fast HTML and XML parser

License: MIT License

Languages: PHP

DiDOM

DiDOM - simple and fast HTML parser.

README in Russian
DiDOM 1.x documentation. To upgrade from 1.x, please check out the changelog.

Installation
Quick start
Creating new document
Search for elements
Verify if element exists
Search in element
Supported selectors
Changing content
Output
Working with elements
Working with cache
Miscellaneous
Comparison with other parsers

Installation

To install DiDOM run the command:

composer require imangazaliev/didom

Quick start

use DiDom\Document;

$document = new Document('http://www.news.com/', true);

$posts = $document->find('.post');

foreach($posts as $post) {
    echo $post->text(), "\n";
}

Creating new document

DiDOM can load HTML in several ways:

With constructor

// the first parameter is a string with HTML
$document = new Document($html);

// file path
$document = new Document('page.html', true);

// or URL
$document = new Document('http://www.example.com/', true);

The second parameter indicates whether the first parameter is a file path or URL to load rather than markup. The default is false.

Signature:

__construct($string = null, $isFile = false, $encoding = 'UTF-8', $type = Document::TYPE_HTML)

$string - an HTML or XML string or a file path.

$isFile - indicates that the first parameter is a path to a file.

$encoding - the document encoding.

$type - the document type (HTML - Document::TYPE_HTML, XML - Document::TYPE_XML).

With separate methods

$document = new Document();

$document->loadHtml($html);

$document->loadHtmlFile('page.html');

$document->loadHtmlFile('http://www.example.com/');

There are two methods available for loading XML: loadXml and loadXmlFile.

These methods accept additional options:

$document->loadHtml($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$document->loadHtmlFile($url, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$document->loadXml($xml, LIBXML_PARSEHUGE);
$document->loadXmlFile($url, LIBXML_PARSEHUGE);
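The LIBXML_* constants above come from PHP's libxml extension. As a rough illustration of what LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD change, the sketch below uses PHP's built-in DOMDocument directly rather than DiDOM, on the assumption that DiDOM forwards these options to the same underlying parser:

```php
<?php
// Sketch with PHP's built-in DOM extension (ext-dom), not DiDOM itself.
$html = '<p>Hello</p>';

// Default parsing: libxml adds a doctype and wraps the fragment in <html><body>.
$wrapped = new DOMDocument();
$wrapped->loadHTML($html);
$wrappedOutput = $wrapped->saveHTML();

// With both flags: no implied <html>/<body> wrappers and no default doctype.
$bare = new DOMDocument();
$bare->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$bareOutput = $bare->saveHTML();

echo $wrappedOutput, "\n";
echo $bareOutput, "\n";
```

These flags are mainly useful when parsing fragments that should be written back out without extra wrapper markup.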

Search for elements

DiDOM accepts a CSS selector or an XPath expression for searching. Pass the expression as the first parameter and its type as the second (the default type is Query::TYPE_CSS):

With method `find()`:

use DiDom\Document;
use DiDom\Query;

...

// CSS selector
$posts = $document->find('.post');

// XPath
$posts = $document->find("//div[contains(@class, 'post')]", Query::TYPE_XPATH);

If elements matching the expression are found, the method returns an array of DiDom\Element instances; otherwise it returns an empty array. To get an array of DOMElement objects instead, pass false as the third parameter.

With magic method `__invoke()`:

$posts = $document('.post');

Warning: using this method is undesirable because it may be removed in the future.

With method `xpath()`:

$posts = $document->xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' post ')]");
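The two XPath expressions shown in this section are not equivalent: contains(@class, 'post') is a substring test, while the concat/normalize-space form matches "post" only as a whole class token. A minimal sketch of the difference, using PHP's built-in DOMXPath instead of DiDOM so that it runs without Composer (the sample markup is made up for illustration):

```php
<?php
// Sketch with PHP's built-in DOM extension; the markup is a made-up sample.
$doc = new DOMDocument();
$doc->loadHTML(
    '<div class="post">A</div>'
    . '<div class="postscript">B</div>'
    . '<div class="featured post">C</div>'
);
$xpath = new DOMXPath($doc);

// Substring test: also matches class="postscript".
$looseCount = $xpath->query("//div[contains(@class, 'post')]")->length;

// Token test: matches "post" only as a complete class name.
$strictCount = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' post ')]"
)->length;

echo "substring match: $looseCount, token match: $strictCount\n";
```

The token form is the safer choice when class names share prefixes.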

You can do search inside an element:

echo $document->find('nav')[0]->first('ul.menu')->xpath('//li')[0]->text();

Verify if element exists

To verify that an element exists, use the has() method:

if ($document->has('.post')) {
    // code
}

If you need to check whether an element exists and then get it:

if ($document->has('.post')) {
    $elements = $document->find('.post');
    // code
}

but it would be faster like this:

if (count($elements = $document->find('.post')) > 0) {
    // code
}

because in the first case it makes two queries.

Search in element

Methods find(), first(), xpath(), has(), count() are available in Element too.

Example:

echo $document->find('nav')[0]->first('ul.menu')->xpath('//li')[0]->text();

Method `findInDocument()`

If you change, replace, or remove an element that was found inside another element, the document will not be changed. This happens because the find() method of the Element class (and, likewise, the first() and xpath() methods) creates a new document to search in.

To search for elements in the source document, you must use the methods findInDocument() and firstInDocument():

// nothing will happen
$document->first('head')->first('title')->remove();

// but this will do
$document->first('head')->firstInDocument('title')->remove();

Warning: methods findInDocument() and firstInDocument() work only for elements that belong to a document and for elements created via new Element(...). If an element does not belong to a document, a LogicException is thrown.
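To see why a change made to a search result inside an element can be lost, it may help to reproduce the effect with PHP's built-in DOM classes. The sketch below assumes that "creates a new document to search" means the element is copied into a separate document; DiDOM's actual internals may differ.

```php
<?php
// Sketch with PHP's built-in DOM extension, not DiDOM's internals.
$doc = new DOMDocument();
$doc->loadHTML('<html><head><title>Old</title></head><body></body></html>');
$head = $doc->getElementsByTagName('head')->item(0);

// Searching in a copy: import <head> into a fresh document and remove <title> there.
$copy = new DOMDocument();
$copy->appendChild($copy->importNode($head, true));
$copyTitle = $copy->getElementsByTagName('title')->item(0);
$copyTitle->parentNode->removeChild($copyTitle);

// The source document is untouched by the removal in the copy.
$titlesAfterCopyRemoval = $doc->getElementsByTagName('title')->length;

// Searching in the source document: the node belongs to $doc, so the removal sticks.
$title = $head->getElementsByTagName('title')->item(0);
$title->parentNode->removeChild($title);
$titlesAfterDirectRemoval = $doc->getElementsByTagName('title')->length;

echo $titlesAfterCopyRemoval, ' ', $titlesAfterDirectRemoval, "\n";
```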

Supported selectors

DiDom supports search by:

tag
class, ID, name and value of an attribute
pseudo-classes:
- first-, last-, nth-child
- empty and not-empty
- contains
- has

// all links
$document->find('a');

// any element with id = "foo" and "bar" class
$document->find('#foo.bar');

// any element with attribute "name"
$document->find('[name]');
// the same as
$document->find('*[name]');

// input field with the name "foo"
$document->find('input[name=foo]');
$document->find('input[name=\'bar\']');
$document->find('input[name="baz"]');

// any element that has an attribute starting with "data-" and the value "foo"
$document->find('*[^data-=foo]');

// all links starting with https
$document->find('a[href^=https]');

// all images with the extension png
$document->find('img[src$=png]');

// all links containing the string "example.com"
$document->find('a[href*=example.com]');

// text of the links with "foo" class
$document->find('a.foo::text');

// href and title of all links with "bar" class
$document->find('a.bar::attr(href|title)');

Changing content

Change inner HTML

$element->setInnerHtml('<a href="#">Foo</a>');

Change inner XML

$element->setInnerXml(' Foo <span>Bar</span><!-- Baz --><![CDATA[
    <root>Hello world!</root>
]]>');

Change value (as plain text)

$element->setValue('Foo');
// the markup will be encoded, as if by htmlentities()
$element->setValue('<a href="#">Foo</a>');
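The effect of setting a value as plain text can be illustrated with PHP's built-in DOM classes. This is a sketch under the assumption that setValue() behaves like assigning text content to the underlying DOM node: the markup is stored as literal characters and no child element is created.

```php
<?php
// Sketch with PHP's built-in DOM extension, not DiDOM itself.
$doc = new DOMDocument();
$span = $doc->createElement('span');
$doc->appendChild($span);

// Assigning text content keeps the angle brackets as text, not as markup.
$span->textContent = '<a href="#">Foo</a>';

// Serializing escapes the angle brackets as entities.
$serialized = $doc->saveHTML($span);

// No <a> element exists inside the span.
$childElements = $span->getElementsByTagName('a')->length;

echo $serialized, "\n";
```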

Output

Getting HTML

With method `html()`:

$posts = $document->find('.post');

echo $posts[0]->html();

Casting to string:

$html = (string) $posts[0];

Formatting HTML output

$html = $document->format()->html();

An element does not have a format() method, so to output the formatted HTML of an element, first convert it to a document:

$html = $element->toDocument()->format()->html();

Inner HTML

$innerHtml = $element->innerHtml();

Document does not have the method innerHtml(), therefore, if you need to get inner HTML of a document, convert it into an element first:

$innerHtml = $document->toElement()->innerHtml();

Getting XML

echo $document->xml();

echo $document->first('book')->xml();

Getting content

$posts = $document->find('.post');

echo $posts[0]->text();

Creating a new element

Creating an instance of the class

use DiDom\Element;

$element = new Element('span', 'Hello');

// Outputs "<span>Hello</span>"
echo $element->html();

The first parameter is the name of the element, the second is its value (optional), and the third is an array of element attributes (optional).

An example of creating an element with attributes:

$attributes = ['name' => 'description', 'placeholder' => 'Enter description of item'];

$element = new Element('textarea', 'Text', $attributes);

An element can be created from an instance of the class DOMElement:

use DiDom\Element;
use DOMElement;

$domElement = new DOMElement('span', 'Hello');

$element = new Element($domElement);

Using the method `createElement`

$document = new Document($html);

$element = $document->createElement('span', 'Hello');

Getting the name of an element

$element->tagName();

Getting parent element

$document = new Document($html);

$input = $document->find('input[name=email]')[0];

var_dump($input->parent());

Getting sibling elements

$document = new Document($html);

$item = $document->find('ul.menu > li')[1];

var_dump($item->previousSibling());

var_dump($item->nextSibling());

Getting the child elements

$html = '<div>Foo<span>Bar</span><!--Baz--></div>';

$document = new Document($html);

$div = $document->first('div');

// element node (DOMElement)
// string(3) "Bar"
var_dump($div->child(1)->text());

// text node (DOMText)
// string(3) "Foo"
var_dump($div->firstChild()->text());

// comment node (DOMComment)
// string(3) "Baz"
var_dump($div->lastChild()->text());

// array(3) { ... }
var_dump($div->children());
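The three children in the example above are a text node, an element node, and a comment node. The same structure can be inspected with PHP's built-in DOM classes; this sketch assumes DiDOM's child methods wrap the equivalent DOM child nodes.

```php
<?php
// Sketch with PHP's built-in DOM extension, not DiDOM itself.
$doc = new DOMDocument();
$doc->loadHTML('<div>Foo<span>Bar</span><!--Baz--></div>');
$div = $doc->getElementsByTagName('div')->item(0);

// Collect the class and text of every direct child, whatever its node type.
$kinds = [];
foreach ($div->childNodes as $node) {
    $kinds[] = get_class($node) . ': ' . $node->nodeValue;
}

print_r($kinds);
```

Because text and comment nodes count as children, index 1 (not 0) is the span in this markup.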

Getting owner document

$document = new Document($html);

$element = $document->find('input[name=email]')[0];

$document2 = $element->ownerDocument();

// bool(true)
var_dump($document->is($document2));

Working with element attributes

Creating/updating an attribute

With method `setAttribute`:

$element->setAttribute('name', 'username');

With method `attr`:

$element->attr('name', 'username');

With magic method `__set`:

$element->name = 'username';

Getting value of an attribute

With method `getAttribute`:

$username = $element->getAttribute('value');

With method `attr`:

$username = $element->attr('value');

With magic method `__get`:

$username = $element->name;

Returns null if the attribute is not found.

Verify if attribute exists

With method `hasAttribute`:

if ($element->hasAttribute('name')) {
    // code
}

With magic method `__isset`:

if (isset($element->name)) {
    // code
}

Removing attribute:

With method `removeAttribute`:

$element->removeAttribute('name');

With magic method `__unset`:

unset($element->name);

Comparing elements

$element  = new Element('span', 'hello');
$element2 = new Element('span', 'hello');

// bool(true)
var_dump($element->is($element));

// bool(false)
var_dump($element->is($element2));

Appending child elements

$list = new Element('ul');

$item = new Element('li', 'Item 1');

$list->appendChild($item);

$items = [
    new Element('li', 'Item 2'),
    new Element('li', 'Item 3'),
];

$list->appendChild($items);

Adding a child element

$list = new Element('ul');

$item = new Element('li', 'Item 1');
$items = [
    new Element('li', 'Item 2'),
    new Element('li', 'Item 3'),
];

$list->appendChild($item);
$list->appendChild($items);

Replacing element

$element = new Element('span', 'hello');

$document->find('.post')[0]->replace($element);

Warning: you can replace only those elements that were found directly in the document:

// nothing will happen
$document->first('head')->first('title')->replace($title);

// but this will do
$document->first('head title')->replace($title);

For more on this, see the section Search in element.

Removing element

$document->find('.post')[0]->remove();

Warning: you can remove only those elements that were found directly in the document:

// nothing will happen
$document->first('head')->first('title')->remove();

// but this will do
$document->first('head title')->remove();

For more on this, see the section Search in element.

Working with cache

The cache is an array of XPath expressions that were converted from CSS selectors.

Getting from cache

use DiDom\Query;

...

$xpath    = Query::compile('h2');
$compiled = Query::getCompiled();

// array('h2' => '//h2')
var_dump($compiled);

Cache setting

Query::setCompiled(['h2' => '//h2']);

Miscellaneous

`preserveWhiteSpace`

By default, whitespace preserving is disabled.

You can enable the preserveWhiteSpace option before loading the document:

$document = new Document();

$document->preserveWhiteSpace();

$document->loadXml($xml);
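What whitespace preservation changes can be seen by counting child nodes with PHP's built-in DOMDocument, whose preserveWhiteSpace property must likewise be set before loading. The sketch assumes DiDOM's option maps onto this property; note that DOMDocument itself defaults to preserving whitespace, unlike the DiDOM default described above.

```php
<?php
// Sketch with PHP's built-in DOM extension, not DiDOM itself.
$xml = "<root>\n  <item/>\n</root>";

// Whitespace-only text nodes are dropped while parsing.
$compact = new DOMDocument();
$compact->preserveWhiteSpace = false; // set before loading
$compact->loadXML($xml);
$compactCount = $compact->documentElement->childNodes->length;

// Whitespace-only text nodes around <item/> are kept.
$verbose = new DOMDocument();
$verbose->preserveWhiteSpace = true;
$verbose->loadXML($xml);
$verboseCount = $verbose->documentElement->childNodes->length;

echo "without whitespace: $compactCount, with whitespace: $verboseCount\n";
```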

`count`

The count() method counts children that match the selector:

// prints the number of links in the document
echo $document->count('a');

// prints the number of items in the list
echo $document->first('ul')->count('li');

`matches`

Returns true if the node matches the selector:

$element->matches('div#content');

// strict match
// returns true if the element is a div whose id is "content" and nothing else
// if the element has any other attributes, the method returns false
$element->matches('div#content', true);

`isElementNode`

Checks whether a node is an element node (DOMElement):

$element->isElementNode();

`isTextNode`

Checks whether a node is a text node (DOMText):

$element->isTextNode();

`isCommentNode`

Checks whether a node is a comment node (DOMComment):

$element->isCommentNode();

Comparison with other parsers

