Stars: 252
Forks: 64
Pull Requests: 16
Issues: 24
Watchers: 22
Last Updated: 2022-09-10 20:27:13
Crawl all unique internal links found on a given website and extract SEO-related information - supports JavaScript-based sites
License: MIT License
Languages: PHP, HTML
This library will crawl all unique internal links found on a given website up to a specified maximum page depth.
It uses the symfony/panther and FriendsOfPHP/Goutte libraries to scrape site pages and extract the main SEO-related information, including: title, h1 elements, h2 elements, statusCode, contentType, meta description, meta keywords, and canonicalLink.
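For illustration, the data collected per page might look like the associative array below. This is a hypothetical sketch built from the field list above; the actual structure and key names are defined by the library itself.

```php
<?php
// Hypothetical sketch of the SEO data gathered for one crawled page.
// Field names mirror the list above; the library's real keys may differ.
$page = [
    'statusCode'      => 200,
    'contentType'     => 'text/html; charset=UTF-8',
    'title'           => 'Example Domain',
    'h1'              => ['Example Domain'],
    'h2'              => [],
    'metaDescription' => 'An illustrative example page.',
    'metaKeywords'    => 'example, demo',
    'canonicalLink'   => 'http://www.example.com/',
];

// Simple SEO sanity checks over the collected fields.
$missingTitle = empty($page['title']);
$hasCanonical = !empty($page['canonicalLink']);
```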
This library is based on the original blog post by Zeid Rashwani here:
http://zrashwani.com/simple-web-spider-php-goutte
Josh Lockhart adapted the original blog post's code (with permission) for Composer and Packagist and updated the syntax to conform with the PSR-2 coding standard.
You can install this library with Composer. Drop this into your composer.json manifest file:
{
    "require": {
        "zrashwani/arachnid": "dev-master"
    }
}
Then run composer install.
Here's a quick demo to crawl a website:
<?php
require 'vendor/autoload.php';
$url = 'http://www.example.com';
$linkDepth = 3;
// Initiate the crawl; by default it will use the HTTP client (GoutteClient).
$crawler = new \Arachnid\Crawler($url, $linkDepth);
$crawler->traverse();
// Get link data
$links = $crawler->getLinksArray(); //to get links as objects use getLinks() method
print_r($links);
Headless browser mode can be enabled so that the crawler uses the Chrome engine in the background, which is useful for getting the contents of JavaScript-based sites. The enableHeadlessBrowserMode method sets the scraping adapter to PantherChromeAdapter, which is based on the Symfony Panther library:
$crawler = new \Arachnid\Crawler($url, $linkDepth);
$crawler->enableHeadlessBrowserMode()
->traverse()
->getLinksArray();
In order to use this, you need chromedriver installed on your machine. You can use dbrekelmans/browser-driver-installer to install chromedriver locally:
composer require --dev dbrekelmans/bdi
./vendor/bin/bdi driver:chromedriver drivers
You can set additional options on the underlying HTTP client, either by specifying an array of options in the constructor or by creating a scraper client with the desired options:
<?php
use \Arachnid\Adapters\CrawlingFactory;
//third parameter is the options used to configure http client
$clientOptions = ['auth_basic' => array('username', 'password')];
$crawler = new \Arachnid\Crawler('http://github.com', 2, $clientOptions);
//or by creating and setting a scraper client
$options = array(
    'verify_host' => false,
    'verify_peer' => false,
    'timeout' => 30,
);
$scrapperClient = CrawlingFactory::create(CrawlingFactory::TYPE_HTTP_CLIENT, $options);
$crawler->setScrapClient($scrapperClient);
You can inject a PSR-3 compliant logger object (such as Monolog) to monitor crawler activity:
<?php
$crawler = new \Arachnid\Crawler($url, $linkDepth); // ... initialize crawler
//set logger for crawler activity (compatible with PSR-3)
$logger = new \Monolog\Logger('crawler logger');
$logger->pushHandler(new \Monolog\Handler\StreamHandler(sys_get_temp_dir().'/crawler.log'));
$crawler->setLogger($logger);
?>
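Any PSR-3 compliant logger works here, not just Monolog. As a self-contained illustration of the call shape such a logger exposes, here is a minimal in-memory stand-in; real code would implement Psr\Log\LoggerInterface from the psr/log package instead of this hypothetical class.

```php
<?php
// Minimal stand-in for a PSR-3 style logger: collects messages in memory.
// Illustrative only; a real logger implements Psr\Log\LoggerInterface.
class ArrayLogger
{
    public array $records = [];

    public function log(string $level, string $message, array $context = []): void
    {
        $this->records[] = sprintf('%s: %s', strtoupper($level), $message);
    }

    public function info(string $message, array $context = []): void
    {
        $this->log('info', $message, $context);
    }
}

$logger = new ArrayLogger();
$logger->info('crawling http://www.example.com');
```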
You can restrict the crawler to visiting only pages that match specific criteria by passing a callback closure to the filterLinks method:
<?php
//filter links according to specific callback as closure
$links = $crawler->filterLinks(function($link) {
        //crawling only links with /blog/ prefix
        return (bool)preg_match('/.*\/blog.*$/u', $link);
    })
    ->traverse()
    ->getLinks();
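The closure itself is ordinary PHP. Here is a self-contained sketch of the same /blog/ matching predicate applied to a plain array of URLs (the URLs are made up for illustration):

```php
<?php
// Standalone illustration of the filtering predicate used above:
// keep only URLs that contain a /blog/ segment.
$urls = [
    'http://example.com/blog/first-post',
    'http://example.com/about',
    'http://example.com/blog/second-post',
    'http://example.com/contact',
];

$blogOnly = array_values(array_filter($urls, function ($link) {
    return (bool) preg_match('/.*\/blog.*$/u', $link);
}));
```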
You can use the LinksCollection class to get simple statistics about the links, as follows:
<?php
$links = $crawler->traverse()
    ->getLinks();
$collection = new LinksCollection($links);
//getting broken links
$brokenLinks = $collection->getBrokenLinks();
//getting links for specific depth
$depth2Links = $collection->getByDepth(2);
//getting external links inside site
$externalLinks = $collection->getExternalLinks();
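Conceptually, these helpers are filters over the per-link data. The sketch below reproduces the idea with plain arrays; the field names statusCode, depth, and isExternal are illustrative assumptions, not necessarily the library's exact keys.

```php
<?php
// Plain-array sketch of what the LinksCollection helpers conceptually do.
// Field names here are assumptions made for illustration.
$links = [
    '/a' => ['statusCode' => 200, 'depth' => 1, 'isExternal' => false],
    '/b' => ['statusCode' => 404, 'depth' => 2, 'isExternal' => false],
    '/c' => ['statusCode' => 200, 'depth' => 2, 'isExternal' => false],
    'http://other.example' => ['statusCode' => 200, 'depth' => 1, 'isExternal' => true],
];

// Broken links: pages that responded with an error status.
$broken = array_filter($links, fn ($l) => $l['statusCode'] >= 400);

// Links discovered at a specific crawl depth.
$depth2 = array_filter($links, fn ($l) => $l['depth'] === 2);

// Links pointing outside the crawled site.
$external = array_filter($links, fn ($l) => $l['isExternal']);
```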
It is very important to separate new features or improvements into separate feature branches, and to send a pull request for each branch. This allows me to review and pull in new features or improvements individually.
All pull requests must adhere to the PSR-2 standard.
MIT Public License