Scrapple 0.3.0 documentation

The Internet is a vast source of information, and data from the Internet is widely used for activities such as research and analysis. Data extraction is a primary step in data mining and analysis, and extracting content from structured web pages is a vital task when the Internet is the principal source of data.

The current standards for extracting content from web pages involve the use of CSS selectors or XPath expressions to select the particular tags from which information is extracted. Web pages are structured as element trees, which can be parsed to traverse the tags. This tree structure, which represents tags through parent/child/sibling relationships, is very useful when tags must be located relative to the rest of the web page structure.
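
As a brief illustration (a minimal sketch, not part of Scrapple itself; the URL and selectors are hypothetical placeholders), both kinds of expressions can be evaluated against an element tree using the lxml library:

import requests
from lxml import html

# Fetch a page and parse it into an element tree
# (the URL and selectors below are placeholders)
page = requests.get('http://example.com/books')
tree = html.fromstring(page.content)

# XPath expression: the text of every <h2> tag with class "title"
titles = tree.xpath('//h2[@class="title"]/text()')

# The equivalent CSS selector (requires the cssselect package)
titles = [e.text for e in tree.cssselect('h2.title')]

Both expressions traverse the same element tree and select the same tags; the choice between them is largely a matter of preference.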

Scrapple is a framework for building web content extractors. It uses key-value based configuration files to define the parameters for generating the extractor: the base page URL, selectors for the data to be extracted, and the selector for the links to be crawled through. At its core, Scrapple abstracts away the implementation of the extractor, focusing instead on representing the selectors for the required tags. Scrapple can be used to generate single page content extractors or link crawlers.
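
As a hedged sketch of such a configuration (all names, URLs and selectors here are hypothetical placeholders; the authoritative schema is given in the Configuration file section), the key-value parameters can be written out as a JSON file from Python:

import json

# A hypothetical configuration sketch - the exact schema is defined in
# the "Configuration file" section of this documentation
config = {
    "project_name": "books",                # name of the extractor project
    "selector_type": "xpath",               # type of selector expression
    "scraping": {
        "url": "http://example.com/books",  # base page URL
        "data": [
            {
                "field": "title",           # output field name
                "selector": "//h2[@class='title']/text()",  # data selector
                "default": ""               # fallback value
            }
        ],
        "next": [
            {
                "follow_link": "//a[@class='next']/@href"  # links to crawl through
            }
        ]
    }
}

with open("books.json", "w") as f:
    json.dump(config, f, indent=4)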

Overview

Web content extraction is a common task in the process of collecting data for analysis, and several existing frameworks aid in this task. In this chapter, a brief introduction to Scrapple is provided, with instructions on setting up the development machine to run Scrapple.

Introducing Scrapple
An introduction to Scrapple
Review of existing systems
A review of existing systems
System requirements
Hardware and software requirements to run Scrapple
Install Scrapple
Instructions for installing Scrapple and the required dependencies

Concepts

In this chapter, a brief overview of the concepts behind Scrapple is given. Creating web content extractors requires a good understanding of the following topics:

Web page structure
The basics of web page structure and element trees
Selector expressions
An introduction to tag selector expressions
Data formats
The primary data formats involved in handling data

The Scrapple framework

This section deals with how Scrapple works: the architecture of the Scrapple framework, the commands and options provided by the framework, and the specification of the configuration file.

Scrapple architecture
The architecture of the Scrapple framework
Scrapple commands
Commands provided by the Scrapple CLI
Configuration file
The configuration file used by Scrapple to implement the required extractor/crawler

Implementation

This section deals with the implementation of the Scrapple framework. This includes an explanation of the classes involved in the framework, the interaction scenarios for each of the commands supported by Scrapple, and utility functions that form a part of the implementation of the extractor.

Scrapple implementation classes
The classes involved in the implementation of Scrapple
Interaction scenarios
Interaction scenarios in the implementation of each of the Scrapple commands
Command line interface
The Scrapple command line interface
Command classes
The implementation of the command classes
Selector classes
The implementation of the selector classes
Utility functions
Utility functions that support the implementation of the extractor

Experimentation & Results

In this section, some experiments with Scrapple are presented. There are two main types of tools that can be implemented with the Scrapple framework: single page linear extractors and link crawlers.

Once you’ve installed Scrapple, you can see the list of available commands and the related options using the following command:

$ scrapple --help

The configuration file is the backbone of Scrapple. It specifies the base page URL, the selectors for data extraction, the follow link for the link crawler, and several other parameters.
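
To make the role of the configuration concrete, the following is a minimal conceptual sketch of what a generated extractor does with such a file. This is not Scrapple's actual implementation; it assumes the hypothetical schema sketched earlier, with selectors that yield text:

import json
import requests
from lxml import html

# Conceptual sketch only - not Scrapple's actual implementation
with open("books.json") as f:
    config = json.load(f)

# Fetch the base page and parse it into an element tree
url = config["scraping"]["url"]
tree = html.fromstring(requests.get(url).content)

# Apply each data selector, falling back to the default value
record = {}
for item in config["scraping"]["data"]:
    matches = tree.xpath(item["selector"])
    record[item["field"]] = matches[0].strip() if matches else item["default"]

print(record)

A link crawler repeats this step for every page matched by the follow link selector, accumulating the extracted records into the output dataset.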

Examples for each type are given.

Single page linear scrapers
Tutorial for single page linear extractors
Link crawlers
Tutorial for link crawlers

Contributing to Scrapple

Scrapple is on GitHub!

Authors
The creators of Scrapple
History
History of Scrapple releases
Contributing
The Scrapple contribution guide

The goal of Scrapple is to provide a generalized solution to the problem of web content extraction. The framework requires a basic understanding of web page structure, which is needed to write the required selector expressions. Using these selector expressions, web content extractors can be implemented to generate the desired datasets.

Experimentation with a wide range of websites gave consistently accurate results in terms of the generated datasets. However, larger crawl jobs took a long time to complete, and each had to run in a single uninterrupted stretch. Scrapple could be improved to provide restartable crawlers, using caching mechanisms to keep track of the position in the URL frontier. Tag recommendation systems could also be implemented using complex learning algorithms, though there would be a trade-off in accuracy.