olamedia/nokogiri

Last Updated: 2020-09-15 04:14:54

#html #nokogiri #php #html-parser #xpath #css #domdocument #attr

HTML parser for PHP

License: MIT License

Languages: PHP, HTML

http://olamedia.github.com/nokogiri/

Attention: the new version can break compatibility. If it does, use the previous version from the v1.0 branch or tag, which supports PHP 5.4+.

The \nokogiri class is kept for backward compatibility.

HTML parser

This library is a fast HTML parser that tolerates invalid markup (parse errors are ignored).
It uses LibXML under the hood.
Input can be an HTML string in UTF-8 encoding or a DOMDocument.
Elements are queried with CSS selectors, which are translated to XPath expressions internally.
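
To illustrate the general idea of translating CSS selectors to XPath, here is a minimal sketch built on PHP's DOMDocument and DOMXPath. The function name cssToXPathSketch is hypothetical, and it handles only tag, #id, .class, [attr] and [attr=value] joined by descendant (space) or child (>) combinators. It is not the library's own translator.

```php
<?php
// Hypothetical helper: translates a small subset of CSS into XPath.
// It only illustrates the idea and is not the library's translator.
function cssToXPathSketch(string $css): string
{
    // Make sure the child combinator is always surrounded by spaces.
    $css = trim(preg_replace('/\s*>\s*/', ' > ', $css));
    $tokens = preg_split('/\s+/', $css);
    $xpath = '';
    $axis = '//'; // descendant axis by default
    foreach ($tokens as $token) {
        if ($token === '>') {
            $axis = '/'; // the next step must be a direct child
            continue;
        }
        // Leading tag name, or * when only #id/.class/[attr] is given.
        preg_match('/^[a-zA-Z][a-zA-Z0-9]*/', $token, $m);
        $step = $m[0] ?? '*';
        // #id, .class, [attr] and [attr=value] become XPath predicates.
        preg_match_all(
            '/#([\w-]+)|\.([\w-]+)|\[([\w-]+)(?:=["\']?([^"\'\]]*)["\']?)?\]/',
            $token,
            $parts,
            PREG_SET_ORDER
        );
        foreach ($parts as $p) {
            if (($p[1] ?? '') !== '') {
                $step .= "[@id='{$p[1]}']";
            } elseif (($p[2] ?? '') !== '') {
                $step .= "[contains(concat(' ', normalize-space(@class), ' '), ' {$p[2]} ')]";
            } elseif (isset($p[4])) {
                $step .= "[@{$p[3]}='{$p[4]}']";
            } else {
                $step .= "[@{$p[3]}]";
            }
        }
        $xpath .= $axis . $step;
        $axis = '//'; // reset to the descendant axis for the next step
    }
    return $xpath;
}

// Usage: run the generated expression with the built-in DOMXPath class.
$dom = new DOMDocument();
$dom->loadHTML('<div><a rel="bookmark" href="/a">A</a><p><a href="/b">B</a></p></div>');
$xp = new DOMXPath($dom);
$matches = $xp->query(cssToXPathSketch('div > a[rel=bookmark]'));
echo $matches->length, "\n";
```

The sketch splits on whitespace, so attribute values containing spaces are out of scope. A real translator also needs pseudo-selectors and proper escaping.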

Usage

Loading HTML

HTML errors are ignored

From an HTML string:
$saw = new \nokogiri($html);
$saw = \nokogiri::fromHtml($html);

From DOM elements:
$saw = new \nokogiri($dom);
$saw = \nokogiri::fromDom($dom);
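
As a rough illustration of what "errors are ignored" can look like at the LibXML level, the sketch below parses malformed UTF-8 markup with PHP's built-in DOMDocument while silencing libxml warnings. The sample markup and variable names are made up for the example, and the XML-declaration prefix is one common way to hint UTF-8 to the parser, not necessarily what the library does.

```php
<?php
// Malformed input: the <a> elements are never closed.
$html = '<div id="sidebar"><a class="topic" href="/one">One<a class="topic" href="/two">Two</div>';

// Collect parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
// Assumption: prefixing an XML declaration to hint the UTF-8 encoding.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);

// Discard the collected errors and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
    echo $link->getAttribute('href'), "\n";
}
```

A DOMDocument built this way is the kind of object that the DOM loading form above accepts.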

get($cssSelector)

Each element in $cssSelector has the following format: tagName[attribute=value]#elementId.className:pseudoSelector(expression)

$saw->get('div > a[rel=bookmark]')->toArray();

toArray()

Returns the underlying DOM structure as an array.
The array contains attributes, text content under the #text key, and child elements under numeric keys.
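
To make the described array shape concrete, here is a small sketch that converts a DOMElement into a nested array. The helper elementToArraySketch is hypothetical and is not the library's toArray(). It assumes attributes are keyed by attribute name and that text is concatenated into a single string under #text, so the exact output of the library may differ.

```php
<?php
// Hypothetical helper mirroring the description above:
// attributes (assumed keyed by name), text under '#text',
// child elements under numeric keys.
function elementToArraySketch(DOMElement $element): array
{
    $result = [];
    foreach ($element->attributes as $attribute) {
        $result[$attribute->name] = $attribute->value;
    }
    foreach ($element->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Skip whitespace-only text nodes; append the rest to '#text'.
            if (trim($child->nodeValue) !== '') {
                $result['#text'] = ($result['#text'] ?? '') . $child->nodeValue;
            }
        } elseif ($child instanceof DOMElement) {
            // Appending with [] assigns the next free numeric key.
            $result[] = elementToArraySketch($child);
        }
    }
    return $result;
}

// Usage on a tiny fragment.
$dom = new DOMDocument();
$dom->loadHTML('<p class="note">Hi <b>there</b></p>');
$paragraph = $dom->getElementsByTagName('p')->item(0);
$asArray = elementToArraySketch($paragraph);
print_r($asArray);
```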

toXml()

Returns an HTML string.

getDom() toDom()

Returns a DOMDocument. When true is passed as the first argument, it can also return a DOMNodeList or a DOMElement.

Iteration over found elements

foreach ($saw->get('#sidebar a.topic') as $link){
    var_dump($link['#text']);
}

Implemented selectors

tag
.class
#id
[attr]
[attr=value]
:root
:empty
:first-child
:last-child
:first-of-type
:last-of-type
:only-of-type
:nth-child(a)
:nth-child(an+b)
:nth-child(even/odd)
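
The an+b form of :nth-child can be read as "the element's 1-based position equals a*n + b for some integer n >= 0". The sketch below checks that condition in plain PHP. The function name matchesNthSketch is hypothetical and only illustrates the arithmetic, not how the library evaluates the selector.

```php
<?php
// Returns true when the 1-based $position equals a*n + b for some integer n >= 0.
function matchesNthSketch(int $a, int $b, int $position): bool
{
    if ($a === 0) {
        // :nth-child(b) matches exactly one position.
        return $position === $b;
    }
    $diff = $position - $b;
    // n = diff / a must be a non-negative integer.
    return $diff % $a === 0 && intdiv($diff, $a) >= 0;
}

// Usage: "odd" is 2n+1 and "even" is 2n+0.
$positions = range(1, 6);
$odd = array_values(array_filter($positions, fn (int $p): bool => matchesNthSketch(2, 1, $p)));
$even = array_values(array_filter($positions, fn (int $p): bool => matchesNthSketch(2, 0, $p)));
echo implode(',', $odd), "\n";
echo implode(',', $even), "\n";
```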

Requirements

DOM
libxml >=2.9.0
PHP >= 7.3

License

MIT

What's new

2.0.0

Minimum PHP version 7.3
Minimum LibXML version 2.9.0
Complete refactoring
Behaviour partially changed, which can break compatibility
HTML loading behaviour changed
Test coverage
Fixed nth-child and other selectors
Incorrect selectors now throw exceptions
New selectors added

1.0.0

First version, 2011
Minimum PHP version 5.4

OPEN ISSUES

The script cannot correctly process a large volume of input HTML. by @KhArtNJava
Searching for elements by XPath by @FirestarterUA
Setting the encoding by @visavi
Attention: New version can break compatibility / Attention, new version by @olamedia
Empty cells in table by @sptik12

RELEASES

by @olamedia
24 May 2015 by @olamedia
