Link crawlers¶

(Check out another example on the Github repo readme.)

For this example, we will extract content from all talks on pyvideo. We will use the event listing as the base page.

To generate a skeleton configuration file, use the genconfig command. The primary arguments for the command are the project name and the URL of the base page. To generate a skeleton configuration file for a crawler, use the --type=crawler argument.

$ scrapple genconfig pyvideo http://pyvideo.org/category \
> --type=crawler

This will create pyvideo.json which will initially look like this -

{

    "scraping": {
        "url": "http://pyvideo.org/category",
        "data": [
            {
                "default": "",
                "field": "",
                "attr": "",
                "selector": ""
            }
        ],
        "next": [
            {
                "follow_link": "",
                "scraping": {
                    "data": [
                        {
                            "default": "",
                            "field": "",
                            "attr": "",
                            "selector": ""
                        }
                    ]
                }
            }
        ]
    },
    "project_name": "pyvideo",
    "selector_type": "xpath"

}

You can edit this json file to specify selectors for the various data that you would want to extract from the given page.

For example,

{

    "scraping": {
        "url": "http://pyvideo.org/category/",
        "data": [
            {
                "field": "",
                "attr": "",
                "selector": "",
                "default": ""
            }
        ],
        "next": [
            {
                "follow_link": "//table//td[1]//a",
                "scraping": {
                    "data": [
                        {
                            "field": "event",
                            "attr": "text",
                            "selector": "//h1",
                            "default": ""
                        },
                        {
                            "field": "event_url",
                            "attr": "",
                            "selector": "url",
                            "default": ""
                        }
                    ],
                    "next": [
                        {
                            "follow_link": " \
                            //div[@id='video-summary-content']/div//strong/a \
                            ",
                            "scraping": {
                                "data": [
                                    {
                                        "field": "talk_title",
                                        "attr": "text",
                                        "selector": "//h3",
                                        "default": "<unknown>"
                                    },
                                    {
                                        "field": "speaker",
                                        "attr": "text",
                                        "selector": " \
                                        //div[@id='sidebar']//dd[2] \
                                        ",
                                        "default": "<unknown>"
                                    },
                                    {
                                        "field": "talk_url",
                                        "attr": "",
                                        "selector": "url",
                                        "default": ""
                                    }
                                ]
                            }
                        }
                    ]
                }
            }
        ]
    },
    "project_name": "pyvideo",
    "selector_type": "xpath"

}

Using this configuration file, you could generate a Python script using scrapple generate or directly run the scraper using scrapple run.

The generate and run commands take two positional arguments - the project name and the output file name.

To generate the Python script -

$ scrapple generate pyvideo talk_list

This will create talk_list.py, which is the script that can be run to replicate the action of scrapple run.

from __future__ import print_function
import json
import os

from scrapple.selectors.xpath import XpathSelector


def task_pyvideo():
        """
        Script generated using
        `Scrapple <http://scrappleapp.github.io/scrapple>`_
        """
        results = dict()
        results['project'] = "pyvideo"
        results['data'] = list()
        try:
                r0 = dict()
                page0 = XpathSelector("http://pyvideo.org/category/")

                for page1 in page0.extract_links(
                "//table//td[1]//a"):
                        r1 = r0.copy()
                        r1["event"] = page1.extract_content(
                        "//h1", "text", ""
                        )
                        r1["event_url"] = page1.extract_content(
                        "url", "", ""
                        )


                for page2 in page1.extract_links(
                "//div[@class='video-summary-data']/div[1]//a"):
                        r2 = r1.copy()
                        r2["talk_title"] = page2.extract_content(
                        "//h3", "text", "<unknown>"
                        )
                        r2["speaker"] = page2.extract_content(
                        "//div[@id='sidebar']//dd[2]", "text", "<unknown>"
                        )
                        r2["talk_url"] = page2.extract_content(
                        "url", "", ""
                        )
                        results['data'].append(r2)
        except KeyboardInterrupt:
                pass
        except Exception as e:
                print(e)
        finally:
                with open(os.path.join(os.getcwd(), 'talks.json'), 'w') as f:
                        json.dump(results, f)


if __name__ == '__main__':
        task_pyvideo()

To run the scraper -

$ scrapple run pyvideo talk_list

This will create talk_list.json, which contains the extracted information.

A portion of the talk_list.json will look like this.

{

    "project": "pyvideo",
    "data": [
        {
            "talk_title": "Boston Python Meetup: ...",
            "talk_url": "http://pyvideo.org/video/591/...",
            "event_url": "http://pyvideo.org/category/15/...",
            "speaker": "Stephan Richter",
            "event": "Boston Python Meetup"
        },
        {
            "talk_title": "Boston Python Meetup: ...",
            "talk_url": "http://pyvideo.org/video/592/...",
            "event_url": "http://pyvideo.org/category/15/...",
            "speaker": "Marshall Weir",
            "event": "Boston Python Meetup"
        },
        {
            "talk_title": "November 2014 ...",
            "talk_url": "http://pyvideo.org/video/3359/...",
            "event_url": "http://pyvideo.org/category/14/...",
            "speaker": "Asma Mehjabeen Isaac Adorno",
            "event": "ChiPy"
        },


        ### talk_list.json continues


        {
            "talk_title": "Python 2.7 & Python 3: ...",
            "talk_url": "http://pyvideo.org/video/3373/...",
            "event_url": "http://pyvideo.org/category/64/...",
            "speaker": "Kenneth Reitz",
            "event": "Twitter University 2014"
        }
    ]

}