Selector classes

Selectors point to specific tags on a web page from which content is to be extracted. In Scrapple, selectors are implemented as selector classes, which define methods to extract content using selector expressions and to extract links from anchor tags for the crawler to follow.

Scrapple supports two selector types:

  • XPath expressions
  • CSS selector expressions

These selector types are implemented by the XpathSelector and CssSelector classes, respectively. Both classes inherit from the Selector class.

The superclass validates the URL of the web page to be loaded, ensuring that the scheme has been specified and that the URL is valid. An HTTP GET request is then made to load the web page, and the HTML content of the fetched page is used to generate the element tree. This element tree is parsed to extract the necessary content.
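The validate, fetch and parse steps described above can be sketched with the Python standard library alone. This is a minimal illustration, not Scrapple's own code: the function names validate_url, build_tree and load_tree are hypothetical, and xml.etree.ElementTree is used as a stand-in parser that only accepts well-formed markup (real-world HTML usually needs a more lenient parser).

```python
from urllib.parse import urlparse
from urllib.request import urlopen
import xml.etree.ElementTree as ET


def validate_url(url):
    """Check that the URL names an HTTP(S) scheme and a host (hypothetical helper)."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("URL must start with http:// or https://: %r" % url)
    if not parts.netloc:
        raise ValueError("URL has no host: %r" % url)
    return url


def build_tree(markup):
    """Parse markup into an element tree.

    Assumption: the markup is well-formed; ElementTree rejects sloppy HTML.
    """
    return ET.fromstring(markup)


def load_tree(url, timeout=10):
    """Validate the URL, issue an HTTP GET, and parse the response body."""
    validate_url(url)
    with urlopen(url, timeout=timeout) as response:
        return build_tree(response.read())
```

Only load_tree touches the network; validate_url and build_tree can be exercised offline, for example validate_url("example.com/page") raises ValueError because no scheme is given.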

scrapple.selectors.selector

class scrapple.selectors.selector.Selector(url)

This class defines the basic Selector object.

extract_columns(result={}, selector='', table_headers=[], attr='', connector='', default='', verbosity=0, *args, **kwargs)

Column data extraction for extract_tabular

extract_content(selector='', attr='', default='', connector='', *args, **kwargs)

Method for performing the content extraction for the particular selector type. If the selector is “url”, the URL of the current web page is returned. Otherwise, the selector expression is used to extract content. The particular attribute to be extracted (“text”, “href”, etc.) is specified in the method arguments, and this is used to extract the required content. If the content extracted is a link (from an attr value of “href” or “src”), the URL is parsed to convert the relative path into an absolute path.

If the selector does not fetch any content, the default value is returned. If no default value is specified, an exception is raised.

Parameters:
  • selector – The selector expression (XPath or CSS, depending on the selector class)
  • attr – The attribute to be extracted from the selected tag
  • default – The default value to be used if the selector does not return any data
  • connector – String connector for list of data returned for a particular selector
Returns:

The extracted content
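The behaviour described for extract_content can be illustrated with a standalone function. This is a sketch under stated assumptions rather than Scrapple's implementation: it takes the tree and page URL as explicit arguments instead of reading them from a selector object, uses the limited XPath subset of xml.etree.ElementTree, joins the values of all matching tags with the connector, and uses a hypothetical _MISSING sentinel to detect that no default was given.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

_MISSING = object()  # hypothetical sentinel meaning "no default supplied"


def extract_content(tree, page_url, selector="", attr="text",
                    default=_MISSING, connector=""):
    """Illustrative version of the extraction rules described above."""
    # The special selector "url" returns the URL of the current page.
    if selector == "url":
        return page_url

    nodes = tree.findall(selector)
    if attr == "text":
        values = ["".join(node.itertext()).strip() for node in nodes]
    else:
        values = [node.get(attr) for node in nodes]
    values = [value for value in values if value]

    # Nothing matched: fall back to the default, or fail if there is none.
    if not values:
        if default is _MISSING:
            raise ValueError("no content found for selector %r" % selector)
        return default

    # Link attributes are resolved against the page URL.
    if attr in ("href", "src"):
        values = [urljoin(page_url, value) for value in values]
    return connector.join(values)


PAGE_URL = "http://example.com/dir/page.html"
TREE = ET.fromstring(
    "<html><body><h1>Title</h1><a href='/next'>Next</a>"
    "<img src='img/a.png'/></body></html>"
)
heading = extract_content(TREE, PAGE_URL, ".//h1")
next_link = extract_content(TREE, PAGE_URL, ".//a", attr="href")
```

Here heading holds the text of the h1 tag, and next_link holds the anchor's relative href converted into an absolute URL on example.com.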

extract_links(selector='', *args, **kwargs)

Method for performing the link extraction for the crawler. The selector passed as the argument points to the anchor tags that the crawler should follow. A list of links is obtained and iterated through. Each relative path is converted into an absolute path, and an XpathSelector/CssSelector object (as appropriate) is created with the URL of the next page as the argument. This object is then yielded.

The extract_links method generates XpathSelector/CssSelector objects for all of the links to be crawled through.

Parameters:
  • selector – The selector for the anchor tags to be crawled through
Returns:

An XpathSelector/CssSelector object for every page to be crawled through

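The generator pattern behind extract_links can be sketched as follows. SketchSelector is a hypothetical stand-in for XpathSelector/CssSelector, and the PAGES dictionary replaces real HTTP requests so that the example runs offline; neither is part of Scrapple.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

# Hypothetical in-memory "site" used instead of real HTTP requests.
PAGES = {
    "http://example.com/": (
        "<html><body><a href='a.html'>A</a><a href='/b.html'>B</a>"
        "<a>no href</a></body></html>"
    ),
    "http://example.com/a.html": "<html><body><p>Page A</p></body></html>",
    "http://example.com/b.html": "<html><body><p>Page B</p></body></html>",
}


class SketchSelector:
    """Illustrative stand-in for a selector class (not Scrapple's code)."""

    def __init__(self, url, fetch=PAGES.__getitem__):
        self.url = url
        self.fetch = fetch
        self.tree = ET.fromstring(fetch(url))

    def extract_links(self, selector):
        """Yield a new selector object for every link matched by `selector`."""
        for anchor in self.tree.findall(selector):
            href = anchor.get("href")
            if not href:
                continue  # anchors without an href cannot be followed
            # Resolve the relative path against the current page URL.
            yield SketchSelector(urljoin(self.url, href), self.fetch)


start = SketchSelector("http://example.com/")
crawled = [page.url for page in start.extract_links(".//a")]
```

Because extract_links is a generator, each linked page is loaded only when the crawler asks for the next object, so a caller can stop early without fetching every link.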
extract_rows(result={}, selector='', table_headers=[], attr='', connector='', default='', verbosity=0, *args, **kwargs)

Row data extraction for extract_tabular

extract_tabular(header='', prefix='', suffix='', table_type='', *args, **kwargs)

Method for performing the tabular data extraction.

Parameters:
  • result – A dictionary containing the extracted data so far
  • table_type – Can be “rows” or “columns”. This determines the type of table to be extracted. A row extraction is when there is a single row to be extracted and mapped to a set of headers. A column extraction is when a set of rows have to be extracted, giving a list of header-value mappings.
  • header – The headers to be used for the table. This can be a list of headers, or a selector that gives the list of headers
  • prefix – A prefix to be added to each header
  • suffix – A suffix to be added to each header
  • selector – For row extraction, this is a selector that gives the row to be extracted. For column extraction, this is a list of selectors for each column.
  • attr – The attribute to be extracted from the selected tag
  • default – The default value to be used if the selector does not return any data
  • verbosity – The verbosity set as the argument for scrapple run
Returns:

A 2-tuple containing the list of all the column headers extracted and the list of dictionaries which contain (header, content) pairs
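The difference between the two table types may be easier to see in code. The following is a simplified sketch, not Scrapple's implementation: it is a standalone function that takes the tree explicitly, extracts only text, omits the result, attr and verbosity arguments, and relies on a hypothetical _texts helper built on xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET


def _texts(tree, selector):
    """Return the stripped text of every tag matched by `selector` (hypothetical helper)."""
    return ["".join(node.itertext()).strip() for node in tree.findall(selector)]


def extract_tabular(tree, header, selector, table_type="rows",
                    prefix="", suffix="", default=""):
    """Return (headers, records) for a "rows" or "columns" extraction."""
    # The header is either a ready-made list or a selector that yields it.
    names = header if isinstance(header, list) else _texts(tree, header)
    names = [prefix + name + suffix for name in names]

    if table_type == "rows":
        # One selector gives a single row, mapped onto the headers.
        values = _texts(tree, selector)
        values += [default] * (len(names) - len(values))
        records = [dict(zip(names, values))]
    elif table_type == "columns":
        # One selector per column; zipping the columns gives one record per row.
        columns = [_texts(tree, column_selector) for column_selector in selector]
        records = [dict(zip(names, row)) for row in zip(*columns)]
    else:
        raise ValueError("table_type must be 'rows' or 'columns'")
    return names, records


TABLE = ET.fromstring(
    "<table>"
    "<tr><th>Name</th><th>Age</th></tr>"
    "<tr id='first'><td class='n'>Ann</td><td class='a'>31</td></tr>"
    "<tr><td class='n'>Bob</td><td class='a'>27</td></tr>"
    "</table>"
)
row_result = extract_tabular(TABLE, ".//th", ".//tr[@id='first']/td",
                             table_type="rows", prefix="person_")
column_result = extract_tabular(TABLE, ["Name", "Age"],
                                [".//td[@class='n']", ".//td[@class='a']"],
                                table_type="columns")
```

In this sketch the row extraction produces a single mapping for the row with id "first", with prefixed headers, while the column extraction produces one mapping per table row.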

scrapple.selectors.xpath

class scrapple.selectors.xpath.XpathSelector(url)

The XpathSelector class implements selection using XPath expressions.

scrapple.selectors.css

class scrapple.selectors.css.CssSelector(url)

The CssSelector class implements selection using CSS selector expressions.