Scrapple

Link crawlers

(Another example is available in the README of the GitHub repository.)

For this example, we will extract content from all talks on pyvideo. We will use the event listing as the base page.

To generate a skeleton configuration file, use the genconfig command. Its primary arguments are the project name and the URL of the base page. To generate the skeleton for a crawler rather than a single page scraper, add the --type=crawler argument.

$ scrapple genconfig pyvideo http://pyvideo.org/category \
> --type=crawler

This will create pyvideo.json, which will initially look like this:

{

    "scraping": {
        "url": "http://pyvideo.org/category",
        "data": [
            {
                "default": "",
                "field": "",
                "attr": "",
                "selector": ""
            }
        ],
        "next": [
            {
                "follow_link": "",
                "scraping": {
                    "data": [
                        {
                            "default": "",
                            "field": "",
                            "attr": "",
                            "selector": ""
                        }
                    ]
                }
            }
        ]
    },
    "project_name": "pyvideo",
    "selector_type": "xpath"

}
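
Before filling in the skeleton, it can help to confirm that every entry still has the expected keys. The following is a minimal sketch and not part of Scrapple: the helper name missing_keys is hypothetical, and the check assumes only the "data" / "next" / "scraping" layout shown above.

```python
import json

# Keys carried by every "data" entry in the generated skeleton.
REQUIRED_KEYS = {"field", "selector", "attr", "default"}


def missing_keys(scraping, path="scraping"):
    """Return (path, missing key set) pairs for incomplete "data" entries.

    Hypothetical helper, not part of Scrapple. It walks each level of the
    crawler configuration by following the "next" lists recursively.
    """
    problems = []
    for i, entry in enumerate(scraping.get("data", [])):
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append(("%s.data[%d]" % (path, i), missing))
    for i, nxt in enumerate(scraping.get("next", [])):
        child_path = "%s.next[%d].scraping" % (path, i)
        problems.extend(missing_keys(nxt.get("scraping", {}), child_path))
    return problems


# The skeleton from above, inlined so that the sketch is self-contained.
skeleton = json.loads("""
{
    "scraping": {
        "url": "http://pyvideo.org/category",
        "data": [
            {"default": "", "field": "", "attr": "", "selector": ""}
        ],
        "next": [
            {
                "follow_link": "",
                "scraping": {
                    "data": [
                        {"default": "", "field": "", "attr": "", "selector": ""}
                    ]
                }
            }
        ]
    },
    "project_name": "pyvideo",
    "selector_type": "xpath"
}
""")

print(missing_keys(skeleton["scraping"]))  # prints [] for the untouched skeleton
```

For a file on disk, the same helper could be applied to the result of json.load on pyvideo.json.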

You can edit this JSON file to specify selectors for the data you want to extract at each level of the crawl.

For example:

{

    "scraping": {
        "url": "http://pyvideo.org/category/",
        "data": [
            {
                "field": "",
                "attr": "",
                "selector": "",
                "default": ""
            }
        ],
        "next": [
            {
                "follow_link": "//table//td[1]//a",
                "scraping": {
                    "data": [
                        {
                            "field": "event",
                            "attr": "text",
                            "selector": "//h1",
                            "default": ""
                        },
                        {
                            "field": "event_url",
                            "attr": "",
                            "selector": "url",
                            "default": ""
                        }
                    ],
                    "next": [
                        {
                            "follow_link": "//div[@id='video-summary-content']/div//strong/a",
                            "scraping": {
                                "data": [
                                    {
                                        "field": "talk_title",
                                        "attr": "text",
                                        "selector": "//h3",
                                        "default": "<unknown>"
                                    },
                                    {
                                        "field": "speaker",
                                        "attr": "text",
                                        "selector": "//div[@id='sidebar']//dd[2]",
                                        "default": "<unknown>"
                                    },
                                    {
                                        "field": "talk_url",
                                        "attr": "",
                                        "selector": "url",
                                        "default": ""
                                    }
                                ]
                            }
                        }
                    ]
                }
            }
        ]
    },
    "project_name": "pyvideo",
    "selector_type": "xpath"

}
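
Each nested "next" level adds its fields to the ones collected on the pages above it, so a record produced at the deepest level carries the event fields as well as the talk fields. The sketch below illustrates this with plain dictionaries. It is not Scrapple code: record_fields is a hypothetical helper, the configuration is trimmed to the keys the helper reads, and it assumes that placeholder entries with an empty "field" contribute nothing to the record.

```python
def record_fields(scraping, inherited=()):
    """Return one list of field names per deepest level of the crawl.

    Hypothetical helper, not part of Scrapple. Field names from outer
    levels are carried down into every level reached through "next".
    """
    # Assumption: entries whose "field" is empty are unfilled placeholders.
    fields = list(inherited) + [
        entry["field"] for entry in scraping.get("data", []) if entry.get("field")
    ]
    children = scraping.get("next", [])
    if not children:
        return [fields]
    paths = []
    for child in children:
        paths.extend(record_fields(child["scraping"], fields))
    return paths


# Trimmed copy of the configuration above (only the keys used here).
config = {
    "scraping": {
        "data": [{"field": "", "attr": "", "selector": "", "default": ""}],
        "next": [{
            "follow_link": "//table//td[1]//a",
            "scraping": {
                "data": [{"field": "event"}, {"field": "event_url"}],
                "next": [{
                    "follow_link": "//div[@id='video-summary-content']/div//strong/a",
                    "scraping": {
                        "data": [
                            {"field": "talk_title"},
                            {"field": "speaker"},
                            {"field": "talk_url"},
                        ]
                    },
                }],
            },
        }],
    }
}

print(record_fields(config["scraping"]))
# prints [['event', 'event_url', 'talk_title', 'speaker', 'talk_url']]
```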

With this configuration file, you can either generate a Python script using scrapple generate or run the scraper directly using scrapple run.

The generate and run commands take two positional arguments: the project name and the output file name.

To generate the Python script:

$ scrapple generate pyvideo talk_list

This will create talk_list.py, which is the script that can be run to replicate the action of scrapple run.

from __future__ import print_function
import json
import os

from scrapple.selectors.xpath import XpathSelector


def task_pyvideo():
        """
        Script generated using
        `Scrapple <http://scrappleapp.github.io/scrapple>`_
        """
        results = dict()
        results['project'] = "pyvideo"
        results['data'] = list()
        try:
                r0 = dict()
                page0 = XpathSelector("http://pyvideo.org/category/")

                for page1 in page0.extract_links(
                "//table//td[1]//a"):
                        r1 = r0.copy()
                        r1["event"] = page1.extract_content(
                        "//h1", "text", ""
                        )
                        r1["event_url"] = page1.extract_content(
                        "url", "", ""
                        )


                        for page2 in page1.extract_links(
                        "//div[@id='video-summary-content']/div//strong/a"):
                                r2 = r1.copy()
                                r2["talk_title"] = page2.extract_content(
                                "//h3", "text", "<unknown>"
                                )
                                r2["speaker"] = page2.extract_content(
                                "//div[@id='sidebar']//dd[2]", "text", "<unknown>"
                                )
                                r2["talk_url"] = page2.extract_content(
                                "url", "", ""
                                )
                                results['data'].append(r2)
        except KeyboardInterrupt:
                pass
        except Exception as e:
                print(e)
        finally:
                with open(os.path.join(os.getcwd(), 'talk_list.json'), 'w') as f:
                        json.dump(results, f)


if __name__ == '__main__':
        task_pyvideo()

To run the scraper:

$ scrapple run pyvideo talk_list

This will create talk_list.json, which contains the extracted information.

A portion of talk_list.json will look like this:

{

    "project": "pyvideo",
    "data": [
        {
            "talk_title": "Boston Python Meetup: ...",
            "talk_url": "http://pyvideo.org/video/591/...",
            "event_url": "http://pyvideo.org/category/15/...",
            "speaker": "Stephan Richter",
            "event": "Boston Python Meetup"
        },
        {
            "talk_title": "Boston Python Meetup: ...",
            "talk_url": "http://pyvideo.org/video/592/...",
            "event_url": "http://pyvideo.org/category/15/...",
            "speaker": "Marshall Weir",
            "event": "Boston Python Meetup"
        },
        {
            "talk_title": "November 2014 ...",
            "talk_url": "http://pyvideo.org/video/3359/...",
            "event_url": "http://pyvideo.org/category/14/...",
            "speaker": "Asma Mehjabeen Isaac Adorno",
            "event": "ChiPy"
        },


        ### talk_list.json continues


        {
            "talk_title": "Python 2.7 & Python 3: ...",
            "talk_url": "http://pyvideo.org/video/3373/...",
            "event_url": "http://pyvideo.org/category/64/...",
            "speaker": "Kenneth Reitz",
            "event": "Twitter University 2014"
        }
    ]

}
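
Because every record repeats the event fields alongside the talk fields, the output is easy to post-process. As a minimal sketch (the helper talks_by_event is hypothetical and not part of Scrapple), the records can be grouped by event using only the standard library. The sample mirrors the first three records of the excerpt above, with the titles left elided as they are there.

```python
import json
from collections import defaultdict


def talks_by_event(results):
    """Map each event name to the list of its talk titles.

    Hypothetical helper, not part of Scrapple. It relies only on the
    "data" list and the "event" / "talk_title" keys shown in the output.
    """
    grouped = defaultdict(list)
    for record in results["data"]:
        grouped[record["event"]].append(record["talk_title"])
    return dict(grouped)


# Inline sample with the same shape as talk_list.json. For a real run,
# replace json.loads(...) with json.load on the opened output file.
sample = json.loads("""
{
    "project": "pyvideo",
    "data": [
        {"talk_title": "Boston Python Meetup: ...", "event": "Boston Python Meetup"},
        {"talk_title": "Boston Python Meetup: ...", "event": "Boston Python Meetup"},
        {"talk_title": "November 2014 ...", "event": "ChiPy"}
    ]
}
""")

for event, titles in sorted(talks_by_event(sample).items()):
    print(event, len(titles))
# prints:
# Boston Python Meetup 2
# ChiPy 1
```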

© Copyright 2015, Alex Mathew, Harish Balakrishnan Revision ddf65560.
