Utility functions


Functions related to dynamic dispatch of objects


Called from runCLI() to select the command class for the selected command.

Parameters: command – The command to be implemented
Returns: The command class corresponding to the selected command

The function is implemented with this short code block:

from scrapple.commands import genconfig, generate, run, web
cmdClass = getattr(eval(command), command.title() + 'Command')
return cmdClass
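
Since eval() evaluates whatever string it receives, the same lookup can be done without it. The following is a minimal sketch of that idea, not Scrapple's own code: the stand-in classes and the MODULES table are hypothetical and exist only to make the example self-contained. The naming convention (command.title() + 'Command') is the one used above.

```python
from types import SimpleNamespace


# Hypothetical stand-ins for two of the classes in scrapple.commands.
class RunCommand:
    pass


class WebCommand:
    pass


# Hypothetical table mapping each command name to its "module".
# Looking the name up here replaces eval(command).
MODULES = {
    'run': SimpleNamespace(RunCommand=RunCommand),
    'web': SimpleNamespace(WebCommand=WebCommand),
}


def get_command_class(command):
    """Return the class named <Command>Command from the matching module."""
    module = MODULES[command]  # raises KeyError for an unknown command
    return getattr(module, command.title() + 'Command')


print(get_command_class('run').__name__)
```

An unknown command fails with a KeyError at the table lookup instead of being evaluated as an expression.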


Functions related to handling exceptions in the input arguments

The function uses regular expressions to validate the CLI input.

projectname_re = re.compile(r'[^a-zA-Z0-9_]')
if args['genconfig']:
        if args['--type'] not in ['scraper', 'crawler']:
                raise Exception("--type has to be 'scraper' or 'crawler'")
        if args['--selector'] not in ['xpath', 'css']:
                raise Exception("--selector has to be 'xpath' or 'css'")
if args['generate'] or args['run']:
        if args['--output_type'] not in ['json', 'csv']:
                raise Exception("--output_type has to be 'json' or 'csv'")
if args['genconfig'] or args['generate'] or args['run']:
        if projectname_re.search(args['<projectname>']) is not None:
                raise Exception("Invalid <projectname>")
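
To make the project name rule concrete, here is a small sketch of how the regular expression behaves. The helper name is_valid_projectname is an assumption made for illustration; only the pattern itself comes from the code above.

```python
import re

# Same pattern as above: matches any character that is NOT a letter,
# a digit or an underscore.
projectname_re = re.compile(r'[^a-zA-Z0-9_]')


def is_valid_projectname(name):
    """Hypothetical helper: True when no disallowed character is found."""
    return projectname_re.search(name) is None


for name in ['my_scraper', 'scraper-1', 'news crawler']:
    print(name, is_valid_projectname(name))
```

Note that an empty name also passes this check, because it contains no disallowed character.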


Functions related to traversing the configuration file

scrapple.utils.config.traverse_next(page, nextx, results, tabular_data_headers=[], verbosity=0)

Recursive generator to traverse through the next attribute and crawl through the links to be followed.

Parameters:
  • page – The current page being parsed
  • nextx – The next attribute of the current scraping dict
  • results – The current extracted content, stored in a dict

Returns: The extracted content, through a generator

In the case of crawlers, the configuration file can be treated as a tree, with the anchor tag links extracted from the follow link selector as the child nodes. This level-wise representation of the crawler configuration file provides a clear picture of how the file should be parsed.

Tree representation of crawler

This recursive generator performs a depth-first traversal of the config file tree. It can be implemented with this code snippet:

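for link in page.extract_links(nextx['follow_link']):
        # Each followed link starts from a copy of the parent's results
        r = results.copy()
        for attribute in nextx['scraping'].get('data'):
                if attribute['field'] != "":
                        # Extract this field from the linked page
                        r[attribute['field']] = link.extract_content(
                                attribute['selector'],
                                attribute['attr'],
                                attribute['default'])
        if not nextx['scraping'].get('next'):
                # Leaf of the tree: emit the accumulated row
                yield r
        else:
                # Otherwise descend into each child 'next' entry
                for next2 in nextx['scraping'].get('next'):
                        for result in traverse_next(link, next2, r):
                                yield result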
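
The depth-first order can be observed without any network access. The sketch below is an illustration under stated assumptions: FakePage is a hypothetical in-memory stand-in for a selector object, its extract_links() simply returns its child pages, and its extract_content(selector, attr, default) signature is assumed from the keys used in the configuration.

```python
class FakePage:
    """Hypothetical in-memory page holding field values and child pages."""

    def __init__(self, content, links=()):
        self.content = content
        self.links = list(links)

    def extract_links(self, selector):
        # A real selector would query the document; here all children match.
        return iter(self.links)

    def extract_content(self, selector, attr, default):
        return self.content.get(selector, default)


def traverse_next(page, nextx, results):
    for link in page.extract_links(nextx['follow_link']):
        r = results.copy()
        for attribute in nextx['scraping'].get('data'):
            if attribute['field'] != "":
                r[attribute['field']] = link.extract_content(
                    attribute['selector'], attribute['attr'],
                    attribute['default'])
        if not nextx['scraping'].get('next'):
            yield r  # leaf: emit the accumulated row
        else:
            for next2 in nextx['scraping'].get('next'):
                for result in traverse_next(link, next2, r):
                    yield result


# Two-level tree: root -> category pages -> item pages.
cat_a = FakePage({'cat': 'A'}, [FakePage({'name': 'a1'}), FakePage({'name': 'a2'})])
cat_b = FakePage({'cat': 'B'}, [FakePage({'name': 'b1'})])
root = FakePage({}, [cat_a, cat_b])

item_level = {
    'follow_link': 'unused',
    'scraping': {
        'data': [{'field': 'item', 'selector': 'name', 'attr': '', 'default': ''}],
    },
}
category_level = {
    'follow_link': 'unused',
    'scraping': {
        'data': [{'field': 'category', 'selector': 'cat', 'attr': '', 'default': ''}],
        'next': [item_level],
    },
}

rows = list(traverse_next(root, category_level, {}))
for row in rows:
    print(row)
```

Every item under category A is yielded before category B is visited, and each row carries the fields collected on the way down.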

Recursive generator that yields the field names in the config file

Parameters: config – The configuration file that contains the specification of the extractor
Returns: The field names in the config file, through a generator

get_fields() parses the configuration file through a recursive generator, yielding the field names encountered.

for data in config['scraping']['data']:
        if data['field'] != '':
                yield data['field']
if 'next' in config['scraping']:
        for n in config['scraping']['next']:
                for f in get_fields(n):
                        yield f
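
As a usage illustration, the generator can be run on a small hand-written configuration. The sample config dict below is made up for this example; it only mimics the nesting of 'scraping', 'data' and 'next' used above.

```python
def get_fields(config):
    """Yield non-empty field names, descending into 'next' entries."""
    for data in config['scraping']['data']:
        if data['field'] != '':
            yield data['field']
    if 'next' in config['scraping']:
        for n in config['scraping']['next']:
            for f in get_fields(n):
                yield f


# Hypothetical configuration with one nested level.
config = {
    'scraping': {
        'data': [{'field': 'title'}, {'field': ''}],
        'next': [
            {
                'follow_link': '//a',
                'scraping': {
                    'data': [{'field': 'author'}, {'field': 'date'}],
                },
            },
        ],
    },
}

field_names = list(get_fields(config))
print(field_names)  # ['title', 'author', 'date']
```

The entry with an empty field name is skipped, and nested fields follow the fields of their parent level.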

Function to return a list of unique field names from the config file

Parameters: config – The configuration file that contains the specification of the extractor
Returns: A list of field names from the config file

The extract_fieldnames() function uses the get_fields() generator, and handles cases like multiple occurrences of the same field name.

fields = []
for x in get_fields(config):
        if x in fields:
                # Repeated name: append a numeric suffix
                fields.append(x + '_' + str(fields.count(x) + 1))
        else:
                fields.append(x)
return fields
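
The renaming rule can be tried on a plain list of names. In this sketch, unique_fieldnames is a hypothetical helper that applies the same rule to any iterable of names instead of a config file, and it assumes that a name seen for the first time is appended unchanged.

```python
def unique_fieldnames(names):
    """Hypothetical helper: suffix a repeated name with '_<count + 1>'."""
    fields = []
    for x in names:
        if x in fields:
            fields.append(x + '_' + str(fields.count(x) + 1))
        else:
            fields.append(x)  # assumption: first occurrence kept as-is
    return fields


print(unique_fieldnames(['title', 'link', 'title']))
# ['title', 'link', 'title_2']
```

In this sketch only exact matches are counted, so a third occurrence of the same name would again produce title_2.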


Functions related to form handling


Takes the form from the POST request in the web interface, and generates the JSON config file

Parameters: form – The form from the POST request

The web form is structured so that all the data fields are numbered sequentially. This makes the form easier to process when converting it into a JSON document.

for i in itertools.count(start=1):
        try:
                # A missing key means there is no i-th data field
                data = {
                        'field': form['field_' + str(i)],
                        'selector': form['selector_' + str(i)],
                        'attr': form['attribute_' + str(i)],
                        'default': form['default_' + str(i)]
                }
        except KeyError: