Scrapple 0.3.0 documentation
The Internet is a huge source of information. Many people use data from the Internet for activities such as research and analysis. Data extraction is a primary step in data mining and analysis, so extracting content from structured web pages is a vital task when the Internet is the principal source of data.
Current practice is to use CSS selectors or XPath expressions to select the particular tags from which information is extracted. Web pages are structured as element trees, which can be parsed to traverse the tags. This tree structure, which represents tags as parents, children and siblings, is useful when a tag has to be located relative to the rest of the web page structure.
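As a minimal sketch of selecting tags from an element tree, the snippet below parses a small, well-formed page and applies XPath-style expressions to it. It uses only Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset and needs well-formed markup; it is an illustration of the idea, not Scrapple's own code, and the page content is made up for the example.

```python
import xml.etree.ElementTree as ET

# A small, well-formed page used only for illustration.
PAGE = """
<html>
  <body>
    <div class="talk">
      <h2>Intro to scraping</h2>
      <a href="/talks/1">details</a>
    </div>
    <div class="talk">
      <h2>Parsing HTML trees</h2>
      <a href="/talks/2">details</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# XPath-style expression: every <h2> that is a child of <div class="talk">.
titles = [h2.text for h2 in root.findall(".//div[@class='talk']/h2")]

# Attribute extraction: the href of each link under the same kind of parent.
links = [a.get("href") for a in root.findall(".//div[@class='talk']/a")]

# Tree navigation: the children of the first matching <div>, in document order.
first_talk = root.find(".//div[@class='talk']")
child_tags = [child.tag for child in first_talk]

print(titles)
print(links)
print(child_tags)
```

Real-world HTML is often not well-formed, so a practical extractor would normally rely on a more forgiving HTML parser; the selection logic stays the same.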
Scrapple is a framework for building web content extractors. It uses key-value based configuration files to define the parameters used in generating the extractor: the base page URL, the selectors for the data to be extracted, and the selector for the links to be crawled. At its core, Scrapple abstracts away the implementation of the extractor, so that the focus is on specifying the selectors for the required tags. Scrapple can be used to generate single page content extractors or link crawlers.
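The idea of a configuration-driven extractor can be sketched as follows. The key names in `config` (`project_name`, `selector_type`, `scraping`, `data`, `next`, `follow_link`, and so on) are assumptions made for illustration and should be checked against the configuration file specification. The helpers `extract` and `follow_links` are hypothetical, are not part of Scrapple, and use the standard library's limited XPath support.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical configuration: key names are illustrative assumptions,
# not a confirmed copy of Scrapple's configuration schema.
config = {
    "project_name": "talks",
    "selector_type": "xpath",
    "scraping": {
        "url": "http://example.com/talks",  # base page URL (placeholder)
        "data": [
            {"field": "title", "selector": ".//h1",
             "attr": "", "default": "<no title>"},
            {"field": "speaker", "selector": ".//span[@class='speaker']",
             "attr": "", "default": "<unknown>"},
        ],
        "next": [
            {"follow_link": ".//a[@class='talk']", "scraping": {"data": []}},
        ],
    },
}


def extract(tree, data_specs):
    """Apply each data selector to the tree and build one record."""
    row = {}
    for spec in data_specs:
        node = tree.find(spec["selector"])
        if node is None:
            row[spec["field"]] = spec["default"]
        elif spec["attr"]:
            row[spec["field"]] = node.get(spec["attr"], spec["default"])
        else:
            row[spec["field"]] = (node.text or "").strip() or spec["default"]
    return row


def follow_links(tree, next_specs):
    """Collect the href of every tag matched by a link selector."""
    links = []
    for spec in next_specs:
        for node in tree.findall(spec["follow_link"]):
            href = node.get("href")
            if href:
                links.append(href)
    return links


# An in-memory page stands in for the content fetched from the base URL.
page = ET.fromstring(
    "<html><body><h1> Parsing trees </h1>"
    "<a class='talk' href='/talks/2'>next</a></body></html>"
)

record = extract(page, config["scraping"]["data"])
to_crawl = follow_links(page, config["scraping"]["next"])
config_text = json.dumps(config, indent=4)  # what would be stored on disk

print(record)
print(to_crawl)
```

A single page content extractor would use only the data selectors, while a link crawler would also visit each URL returned by the link selector and repeat the extraction there.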
Web content extraction is a common task when collecting data for analysis, and several existing frameworks aid in this task. This chapter gives a brief introduction to Scrapple, with instructions on setting up the development machine to run it.
Creating web content extractors requires a good understanding of the following topics:
In this chapter, a brief overview of the concepts behind Scrapple is given.
The Scrapple framework
This section deals with how Scrapple works - the architecture of the Scrapple framework, the commands and options provided by the framework and the specification of the configuration file.
This section deals with the implementation of the Scrapple framework. This includes an explanation of the classes involved in the framework, the interaction scenarios for each of the commands supported by Scrapple, and utility functions that form a part of the implementation of the extractor.
- Scrapple implementation classes: the classes involved in the implementation of Scrapple
- Interaction scenarios: interaction scenarios in the implementation of each of the Scrapple commands
- Command line interface: the Scrapple command line interface
- Command classes: the implementation of the command classes
- Selector classes: the implementation of the selector classes
- Utility functions: utility functions that support the implementation of the extractor
Experimentation & Results
In this section, some experiments with Scrapple are provided. Two main types of tools can be implemented with the Scrapple framework: single page content extractors and link crawlers.
$ scrapple --help
The configuration file is the backbone of Scrapple. It specifies the base page URL, selectors for the data extraction, the follow link for the link crawler and several other parameters.
Examples for each type are given.
Contributing to Scrapple
Scrapple is on GitHub!
- The creators of Scrapple
- History of Scrapple releases
- The Scrapple contribution guide
The goal of Scrapple is to provide a generalized solution to the problem of web content extraction. The framework requires a basic understanding of web page structure, which is needed to write the selector expressions. Using these selector expressions, the required web content extractors can be implemented to generate the desired datasets.
Experimentation with a wide range of websites gave consistently accurate results in terms of the generated dataset. However, larger crawl jobs took a long time to complete and had to be run in a single uninterrupted session. Scrapple could be improved to provide restartable crawlers, using caching mechanisms to keep track of the position in the URL frontier. Tag recommendation systems could also be implemented using learning algorithms, though there would be a trade-off in accuracy.
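One way a restartable crawler might keep track of its position in the URL frontier is sketched below. This is not an existing Scrapple feature: the class `CheckpointedFrontier`, its methods and the JSON checkpoint layout are all hypothetical, chosen only to show the idea of saving the frontier after each page so that an interrupted crawl can resume.

```python
import json
import os
import tempfile
from collections import deque


class CheckpointedFrontier:
    """Hypothetical URL frontier that persists its state to a JSON file."""

    def __init__(self, path, seeds=()):
        self.path = path
        if os.path.exists(path):
            # Resume: restore pending and already-seen URLs from disk.
            with open(path) as fh:
                state = json.load(fh)
            self.pending = deque(state["pending"])
            self.seen = set(state["seen"])
        else:
            # Fresh start: queue the seed URLs.
            self.pending = deque()
            self.seen = set()
            for url in seeds:
                self.add(url)

    def add(self, url):
        """Queue a URL unless it has been seen before."""
        if url not in self.seen:
            self.seen.add(url)
            self.pending.append(url)

    def next_url(self):
        """Peek at the next URL without removing it."""
        return self.pending[0] if self.pending else None

    def mark_done(self):
        """Remove the current URL and checkpoint the frontier."""
        self.pending.popleft()
        self.save()

    def save(self):
        # Write to a temporary file first so a crash cannot leave a
        # half-written checkpoint behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump({"pending": list(self.pending),
                       "seen": sorted(self.seen)}, fh)
        os.replace(tmp, self.path)


# Simulate a crawl that stops after one page and is started again.
checkpoint = os.path.join(tempfile.mkdtemp(), "frontier.json")
frontier = CheckpointedFrontier(checkpoint, seeds=["/a", "/b", "/c"])
first = frontier.next_url()
frontier.mark_done()  # "/a" has been processed; the crawl then stops

resumed = CheckpointedFrontier(checkpoint, seeds=["/a", "/b", "/c"])
remaining = list(resumed.pending)
print(first, remaining)
```

Because the URL is only removed after it has been processed, a crash in the middle of a page means that page is fetched again on restart rather than skipped.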