Last Updated: 2023-06-05 14:11:35
php-mf2 is a pure, generic microformats-2 parser for PHP. It makes HTML as easy to consume as JSON.
License: Creative Commons Zero v1.0 Universal
Languages: PHP, HTML, Dockerfile
php-mf2 is a pure, generic microformats-2 parser. It makes HTML as easy to consume as JSON.
Instead of having a hard-coded list of all the different microformats, it follows a set of procedures to handle different property types (e.g. p- for plaintext, u- for URL, etc). This allows for a very small and maintainable parser.
There are two ways of installing php-mf2. We highly recommend installing php-mf2 using Composer. The rest of the documentation assumes that you have done so.
To install using Composer, run
composer require mf2/mf2
If you can’t or don’t want to use Composer, then php-mf2 can be installed the old way by downloading /Mf2/Parser.php, adding it to your project and requiring it from files you want to call its functions from, like this:
<?php
require_once 'Mf2/Parser.php';
// Now all the functions documented below are available, for example:
$mf = Mf2\fetch('https://waterpigs.co.uk');
It is recommended to install the HTML5 parser for proper handling of HTML5 elements. Using composer, run
composer require masterminds/html5
If this library is added to your project, the php-mf2 parser will use it automatically instead of the built-in HTML parser.
From v0.2.9, php-mf2’s version tags are signed using GPG, allowing you to cryptographically verify that you’re using code which hasn’t been tampered with. To verify the code you will need the GPG keys for one of the people in the list of code signers:
To import the relevant keys into your GPG keychain, execute the following command:
gpg --recv-keys 1C00430B19C6B426922FE534BEF8CE58118AD524 F38412A155FB8B15B7DD8E0742252B5E65CE0ADD 0A939BA78203FCBC58A9E8B59D1E06618EE5B4D8
Then verify the installed files like this:
# in your project root
cd vendor/mf2/mf2
git tag -v v0.3.0
If nothing went wrong, you should see the tag commit message, ending something like this:
gpg: Signature made Wed 6 Aug 10:04:20 2014 GMT using RSA key ID 2B2BBB65
gpg: Good signature from "Barnaby Walters <[email protected]>"
gpg: aka "[jpeg image of size 12805]"
Possible issues:
git config --global gpg.program 'gpg2'
php-mf2 is PSR-0 autoloadable, so simply include Composer’s auto-generated autoload file (/vendor/autoload.php) and you can start using it. These two functions cover most situations:
Mf2\fetch($url), which fetches content from a URL and returns parsed microformats
Mf2\parse($html, $url), where $url is the URL from which $html was loaded, if any. This parameter is required for correct relative URL parsing and must not be left out unless parsing HTML which is not loaded from the web.
All parsing functions return a canonical microformats 2 representation of any microformats found on the page, as an array. For a general guide to safely and successfully processing parsed microformats data, see How to Consume Microformats 2 Data.
<?php
namespace YourApp;
require '/vendor/autoload.php';
use Mf2;
// (Above code (or equivalent) assumed in future examples)
$mf = Mf2\fetch('http://microformats.org');
// $mf is either a canonical mf2 array, or null on an error.
if (is_array($mf)) {
    foreach ($mf['items'] as $microformat) {
        // Note: in real code, never assume that a property exists, or that a particular property value is a string!
        echo "A {$microformat['type'][0]} called {$microformat['properties']['name'][0]}\n";
    }
}
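The note in the example above about never assuming that a property exists can be made concrete with a small defensive accessor. This is only a sketch: firstString() is a hypothetical helper, not part of php-mf2, and the sample array is hand-written in the canonical structure rather than produced by a real fetch.

```php
<?php
// Hypothetical helper (not part of php-mf2): return the first value of a
// property as a string, or a fallback when it is missing or not a string.
function firstString(array $item, string $property, string $default = ''): string
{
    $values = $item['properties'][$property] ?? [];
    if (!is_array($values) || count($values) === 0) {
        return $default;
    }
    $first = reset($values);
    // Structured values (e.g. an image with alt text) carry their plain form under 'value'.
    if (is_array($first)) {
        $first = $first['value'] ?? $default;
    }
    return is_string($first) ? $first : $default;
}

// Hand-written sample in the canonical structure, standing in for parser output.
$mf = [
    'items' => [
        ['type' => ['h-card'], 'properties' => ['name' => ['Barnaby Walters']]],
        ['type' => ['h-entry'], 'properties' => []], // no name property at all
    ],
    'rels' => [],
    'rel-urls' => [],
];

foreach ($mf['items'] as $microformat) {
    $type = $microformat['type'][0] ?? 'unknown';
    echo "A {$type} called " . firstString($microformat, 'name', '(unnamed)') . "\n";
}
```

The second item has no name property, so the loop falls back to the placeholder instead of raising a notice.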
Here we demonstrate microformats2 implied property parsing, where an entire h-card with name and URL properties is created using a single h-card class.
<?php
$html = '<a class="h-card" href="https://waterpigs.co.uk/">Barnaby Walters</a>';
$output = Mf2\parse($html, 'https://waterpigs.co.uk/');
$output is a canonical microformats2 array structure like:
{
  "items": [
    {
      "type": ["h-card"],
      "properties": {
        "name": ["Barnaby Walters"],
        "url": ["https://waterpigs.co.uk/"]
      }
    }
  ],
  "rels": {},
  "rel-urls": {}
}
If no microformats are found, items will be an empty array.
Note that, whilst the property prefixes are stripped, the prefix of the h-* classname(s) in the "type" array is retained.
Most of the time you’ll be getting your input HTML from a URL. You should pass that URL as the second parameter to Mf2\parse() so that any relative URLs in the document can be resolved. For example, say you got the following HTML from http://example.org/post/1:
<div class="h-card">
<h1 class="p-name">Mr. Example</h1>
<img class="u-photo" alt="" src="/photo.png" />
</div>
Parsing like this:
$output = Mf2\parse($html, 'http://example.org/post/1');
will result in the following output, with relative URLs made absolute:
{
  "items": [{
    "type": ["h-card"],
    "properties": {
      "name": ["Mr. Example"],
      "photo": [{
        "value": "http://example.org/photo.png",
        "alt": ""
      }]
    }
  }],
  "rels": {},
  "rel-urls": {}
}
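Because a photo can appear either as a plain URL string or as a structure with value and alt keys (as in the output above), consuming code may want to normalise both shapes. The following is a minimal sketch; normaliseUrlValue() is a hypothetical helper, not part of php-mf2, and the sample values are hand-written.

```php
<?php
// Hypothetical helper (not part of php-mf2): normalise one u-* value to
// ['url' => string, 'alt' => string|null], whether it was parsed as a plain
// URL string or as a {"value": ..., "alt": ...} structure.
function normaliseUrlValue($value): ?array
{
    if (is_string($value)) {
        return ['url' => $value, 'alt' => null];
    }
    if (is_array($value) && isset($value['value']) && is_string($value['value'])) {
        $alt = $value['alt'] ?? null;
        return ['url' => $value['value'], 'alt' => is_string($alt) ? $alt : null];
    }
    return null; // unexpected shape: let the caller decide what to do
}

// Hand-written sample values covering both shapes.
$photos = [
    'http://example.org/plain-string.png',
    ['value' => 'http://example.org/photo.png', 'alt' => ''],
];

foreach ($photos as $photo) {
    $normalised = normaliseUrlValue($photo);
    if ($normalised !== null) {
        echo $normalised['url'], "\n";
    }
}
```

Returning null for unexpected shapes keeps the decision about how to handle bad data with the caller.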
php-mf2 correctly handles relative URL resolution according to the URI and HTML specs, including correct use of the <base> element.
rel Values
php-mf2 also parses any link relations in the document, placing them into two top-level arrays. For convenience and completeness, one is indexed by each individual rel value, and the other by each URL.
For example, this HTML:
<a rel="me" href="https://twitter.com/barnabywalters">Me on twitter</a>
<link rel="alternate etc" href="http://example.com/notes.atom" />
parses to the following canonical representation:
{
  "items": [],
  "rels": {
    "me": ["https://twitter.com/barnabywalters"],
    "alternate": ["http://example.com/notes.atom"],
    "etc": ["http://example.com/notes.atom"]
  },
  "rel-urls": {
    "https://twitter.com/barnabywalters": {
      "text": "Me on twitter",
      "rels": ["me"]
    },
    "http://example.com/notes.atom": {
      "rels": ["alternate", "etc"]
    }
  }
}
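As a rough illustration of using the two indexes together, the sketch below looks up URLs by rel value and then the link text by URL. It assumes the default (non-JSON) mode, where both are plain PHP arrays; urlsForRel() and linkText() are hypothetical helpers, not part of php-mf2.

```php
<?php
// The canonical structure from the example above, written as a PHP array.
$mf = [
    'items' => [],
    'rels' => [
        'me' => ['https://twitter.com/barnabywalters'],
        'alternate' => ['http://example.com/notes.atom'],
        'etc' => ['http://example.com/notes.atom'],
    ],
    'rel-urls' => [
        'https://twitter.com/barnabywalters' => ['text' => 'Me on twitter', 'rels' => ['me']],
        'http://example.com/notes.atom' => ['rels' => ['alternate', 'etc']],
    ],
];

// Hypothetical helper: all URLs for one rel value (empty array if none).
function urlsForRel(array $mf, string $rel): array
{
    $urls = $mf['rels'][$rel] ?? [];
    return is_array($urls) ? array_values($urls) : [];
}

// Hypothetical helper: the link text recorded for a URL, or null if none.
function linkText(array $mf, string $url): ?string
{
    $text = $mf['rel-urls'][$url]['text'] ?? null;
    return is_string($text) ? $text : null;
}

foreach (urlsForRel($mf, 'me') as $url) {
    echo $url, ' (', linkText($mf, $url) ?? 'no link text', ")\n";
}
```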
If you’re not bothered about the microformats2 data and just want rels and alternates, you can (very slightly) improve performance by creating a Mf2\Parser object (see below) and calling ->parseRelsAndAlternates() instead of ->parse(), e.g.
<?php
$parser = new Mf2\Parser('<link rel="…');
$relsAndAlternates = $parser->parseRelsAndAlternates();
Mf2\fetch() will attempt to parse any response served with “HTML” in the content-type, regardless of what the status code is. If it receives a non-HTML response it will return null.
To learn what the HTTP status code for any request was, or learn more about the request, pass a variable name as the third parameter to Mf2\fetch(); this will be filled with the contents of curl_getinfo(), e.g:
<?php
$mf = Mf2\fetch('http://waterpigs.co.uk/this-page-doesnt-exist', true, $curlInfo);
if ($curlInfo['http_code'] == '404') {
    // This page doesn’t exist.
}
If it was HTML then it is still parsed, as there are cases where error pages contain microformats — for example a deleted h-entry resulting in a 410 Gone response containing a stub h-entry with an explanation for the deletion.
The Mf2\parse() function covers the most common usage patterns by internally creating an instance of Mf2\Parser and returning the output all in one step. For some advanced usage you can also create an instance of Mf2\Parser yourself.
The constructor takes two arguments, the input HTML (or a DOMDocument) and the URL to use as a base URL. Once you have a parser, there are a few other things you can do:
There are several ways to selectively parse microformats from a document. If you wish to only parse microformats from an element with a particular ID, Parser::parseFromId($id) is the easiest way.
If your needs are more complex, Parser::parse accepts an optional context DOMNode as its second parameter. Typically you’d use Parser::query to run XPath queries on the document to get the element you want to parse from under, then pass it to Parser::parse. Example usage:
$doc = 'More microformats, more microformats <div id="parse-from-here"><span class="h-card">This shows up</span></div> yet more ignored content';
$parser = new Mf2\Parser($doc);
$parser->parseFromId('parse-from-here'); // returns a document with only the h-card descended from div#parse-from-here
$elementIWant = $parser->query('an xpath query')[0];
$parser->parse(true, $elementIWant); // returns a document with only the Microformats under the selected element
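For readers unfamiliar with XPath, here is a sketch of the kind of expression that could stand in for 'an xpath query' above. It deliberately uses only PHP's built-in DOMDocument and DOMXPath classes rather than php-mf2, so it shows what the expression selects, not how Parser::query behaves internally.

```php
<?php
// Illustration with PHP's built-in DOM extension only; php-mf2 is not used here.
$html = 'ignored <div id="parse-from-here"><span class="h-card">This shows up</span></div> ignored';

$doc = new DOMDocument();
// Suppress warnings that libxml may emit for imperfect markup.
@$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// Select the element with a given id, wherever it sits in the document.
$matches = $xpath->query('//*[@id="parse-from-here"]');

$element = $matches->item(0);
if ($element !== null) {
    echo $element->nodeName, ': ', trim($element->textContent), "\n";
}
```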
There is still ongoing brainstorming around how HTML language attributes should be added to the parsed result. In order to use this feature, you will need to set a flag to opt in.
$doc = '<div class="h-entry" lang="sv" id="postfrag123">
<h1 class="p-name">En svensk titel</h1>
<div class="e-content" lang="en">With an <em>english</em> summary</div>
<div class="e-content">Och <em>svensk</em> huvudtext</div>
</div>';
$parser = new Mf2\Parser($doc);
$parser->lang = true;
$result = $parser->parse();
{
  "items": [
    {
      "type": ["h-entry"],
      "properties": {
        "name": ["En svensk titel"],
        "content": [
          {
            "html": "With an <em>english</em> summary",
            "value": "With an english summary",
            "lang": "en"
          },
          {
            "html": "Och <em>svensk</em> huvudtext",
            "value": "Och svensk huvudtext",
            "lang": "sv"
          }
        ]
      },
      "lang": "sv"
    }
  ],
  "rels": {},
  "rel-urls": {}
}
Note that this option is still considered experimental and in development, and the parsed output may change between minor releases.
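One way such output might be consumed is to pick the content value in a preferred language. This is a sketch only: contentInLanguage() is a hypothetical helper, the item is hand-copied from the output above, and since the feature is experimental the shape it relies on may change.

```php
<?php
// Hypothetical helper (not part of php-mf2): return the plain-text value of
// the first content entry whose 'lang' matches, or null if there is none.
function contentInLanguage(array $item, string $lang): ?string
{
    foreach ($item['properties']['content'] ?? [] as $content) {
        if (is_array($content) && ($content['lang'] ?? null) === $lang) {
            return is_string($content['value'] ?? null) ? $content['value'] : null;
        }
    }
    return null;
}

// Hand-copied from the parsed output shown above.
$item = [
    'type' => ['h-entry'],
    'properties' => [
        'name' => ['En svensk titel'],
        'content' => [
            ['html' => 'With an <em>english</em> summary', 'value' => 'With an english summary', 'lang' => 'en'],
            ['html' => 'Och <em>svensk</em> huvudtext', 'value' => 'Och svensk huvudtext', 'lang' => 'sv'],
        ],
    ],
    'lang' => 'sv',
];

echo contentInLanguage($item, 'en') ?? '(no English content)', "\n";
```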
Due to a quirk with the way PHP arrays work, there is an edge case (reported by Tom Morris) in which a document with no rel values, when serialised as JSON, results in an empty object as the rels value rather than an empty array. Replacing this in code with a stdClass breaks PHP iteration over the values.
As of version 0.2.6, the default behaviour is back to being PHP-friendly, so if you want to produce results specifically for serialisation as JSON (for example if you run a HTML -> JSON service, or want to run tests against JSON fixtures), enable JSON mode:
// …by passing true as the third constructor argument:
$jsonParser = new Mf2\Parser($html, $url, true);
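The quirk itself can be reproduced with plain PHP and no parser at all. The sketch below only demonstrates standard json_encode() and stdClass behaviour; the variable names are illustrative.

```php
<?php
// A document with no rel values yields an empty PHP array for "rels".
$emptyRels = [];

// PHP cannot tell an empty list from an empty map, so JSON gets [] here...
echo json_encode(['rels' => $emptyRels]), "\n";      // {"rels":[]}

// ...while an object is what a JSON consumer expecting a map wants to see.
echo json_encode(['rels' => new stdClass()]), "\n";  // {"rels":{}}

// The catch: array-style access no longer works on the object form.
$relsAsObject = new stdClass();
try {
    $urls = $relsAsObject['me'];
} catch (Error $e) {
    echo 'Error: ', $e->getMessage(), "\n";
}

// The default array form stays PHP-friendly.
$urls = $emptyRels['me'] ?? [];
echo count($urls), "\n"; // 0
```

This is why the PHP-friendly array form is the default and JSON mode is opt-in.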
php-mf2 has some support for parsing classic microformats markup. It’s enabled by default, but can be turned off by calling Mf2\parse($html, $url, false); or $parser->parse(false); if you’re instantiating a parser yourself.
If the built-in mappings don’t successfully parse some classic microformats markup, please raise an issue and we’ll fix it.
No filtering of content takes place in Mf2\Parser, so treat its output as you would any untrusted data from the source of the parsed document.
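As a minimal sketch of treating parsed values as untrusted, plaintext values can be escaped with PHP's standard htmlspecialchars() before being echoed into a page. The e() wrapper is a hypothetical convenience function, not part of php-mf2, and the sample value is hand-written.

```php
<?php
// Hypothetical helper: escape a parsed plaintext value for output in HTML.
function e(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// A parsed name that happens to contain markup-like characters.
$name = '<code>';

echo '<h1>', e($name), '</h1>', "\n"; // <h1>&lt;code&gt;</h1>
```

Escaping is only appropriate for plaintext values; raw HTML needs purifying rather than escaping if it is to be displayed as markup.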
Some tips:
All content apart from the "html" key of the structures produced by parsing an e-* property is not HTML-escaped. For example, <span class="p-name">&lt;code&gt;</span> will result in "name": ["<code>"]. At the very least, HTML-escape all properties before echoing them out in HTML.
If you output the raw HTML from e-* properties, you SHOULD purify the HTML before displaying it to prevent injection of arbitrary code. For PHP we recommend using HTML Purifier.
Issues and bug reports are very welcome. If you know how to write tests then please do so, as code always expresses problems and intent much better than English, and gives me a way of measuring whether or not fixes have actually solved your problem. If you don’t know how to write tests, don’t worry :) Just include as much useful information in the issue as you can.
Pull requests are very welcome. Please try to maintain stylistic, structural and naming consistency with the existing codebase, and don’t be too upset if we make naming changes :)
Install the development dependencies with composer install.
Run the PHPUnit tests with ./vendor/bin/phpunit
Before opening a pull request, make sure your tests pass (./vendor/bin/phpunit) and that your code is compatible with all supported versions of PHP (./vendor/bin/phpcs -p)
There are currently two separate test suites: one, in tests/Mf2, is written in PHPUnit, containing many microformats parsing examples as well as internal parser tests and regression tests for specific issues over php-mf2’s history. Run it with ./vendor/bin/phpunit. If you do not have a live internet connection, you can exclude tests that depend on it: ./vendor/bin/phpunit --exclude-group internet.
The other, in tests/test-suite, is a custom test harness which hooks up php-mf2 to the cross-platform microformats test suite. To run these tests you must first install the tests with ./composer.phar install. Each test consists of a HTML file and a corresponding JSON file, and the suite can be run with php ./tests/test-suite/test-suite.php.
Currently php-mf2 passes the majority of its own test cases, and a good percentage of the cross-platform tests. Contributors should ALWAYS test against the PHPUnit suite to ensure any changes don’t negatively impact php-mf2, and SHOULD run the cross-platform suite, especially when changing parsing behaviour.
Breaking changes:
Properties parsed from an img element with an alt attribute will now be a {'value': 'url', 'alt': 'the alt value'} structure rather than a single URL string
Renamed the master branch to main. Anyone who had been installing the latest development version with dev-master will need to change their requirements to dev-main
Other changes:
Bugfixes:
Other Updates:
2018-08-02
Bugfixes:
e- elements
Other Updates:
Added .editorconfig to the project and cleaned up whitespace across all files
2018-08-01
Bugfixes:
properties is an object {} rather than array [] (#171)
Microformats Parsing Updates:
u- properties even when not from a link element (Parsing issue #10)
Other Updates:
2018-03-29
If the masterminds/html5 HTML5 parser is available, the Mf2 parser will use that instead of the built-in HTML parser. This enables proper handling of HTML5 elements such as <article>.
To include the HTML5 parser in your project, run:
composer require masterminds/html5
2018-03-29
Fixes:
Backcompat:
rel=tag as p-category for hEntry and hReview
2018-03-15
Fixes:
2018-03-13
Breaking changes:
Adds rel-urls to parsed result. Removes alternates by default but still available behind a feature flag.
p-name. See Microformats issue #6. This means it is now possible for the parsed result to not have a name property, whereas before there was always a name property on an object. Make sure consuming code can handle an object without a name now.
Fixes:
h-* class names containing invalid characters.
dt- parsing. Issues #126 and #115.
rel=bookmark backcompat parsing.
summary property in hreview
2017-05-27
2017-05-24
img alt="" attributes
Accept: text/html header when using the fetch method
poster attribute for video tags
Many thanks to @gRegorLove for the major overhaul of the backcompat parsing!
2016-03-14
Many thanks to @aaronpk, @diplix, @dissolve, @dymcx @gRegorLove, @jeena, @veganstraightedge and @voxpelli for all your hard work opening issues and sending and merging PRs!
2015-07-12
Many thanks to @aaronpk, @gRegorLove and @kylewm for contributions, @aaronpk and @kevinmarks for PR management and @tantek for issue reporting!
2015-07-10
2015-04-29
2014-08-06
2014-07-17
<template> element by ignoring it
<img> elements
2014-06-18
Mf2\fetch() which fetches content from a URL and returns parsed microformats
dt-end discovery (thanks for all your hard work, @gRegorLove!)
blah e- blah to produce properties with numeric keys (thanks @aaronpk and @gRegorLove)
Mf2\parse() function added to simplify the most common case of just parsing some HTML
{
  "html": "The Content",
  "value": "The Content"
}
htmlSafe options as new e-* parsing rules make them redundant
php-mf2 is dedicated to the public domain using Creative Commons -- CC0 1.0 Universal.