Command classes

scrapple.commands.genconfig

class scrapple.commands.genconfig.GenconfigCommand(args)

Defines the execution of genconfig

execute_command()

The genconfig command depends on predefined Jinja2 templates for the skeleton configuration files. The --type argument from the CLI input determines which template file is used.

Settings for the configuration file, such as the project name, selector type and URL, are taken from the CLI input and passed as parameters when the template is rendered. The rendered JSON document is saved as <project_name>.json.
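
As a rough illustration of this render-and-save flow, the sketch below fills a skeleton with CLI-style settings and writes <project_name>.json. It is not Scrapple's implementation: it substitutes the standard library's string.Template for Jinja2 so that it runs without extra dependencies, and the skeleton layout, the render_config() helper and the example values are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path
from string import Template

# Hypothetical skeleton; Scrapple's real skeletons are Jinja2 template files.
SKELETON = Template("""{
    "project_name": $project_name,
    "selector_type": $selector_type,
    "scraping": {
        "url": $url,
        "data": []
    }
}
""")


def render_config(project_name, selector_type, url, out_dir):
    """Render the skeleton and save it as <project_name>.json in out_dir."""
    rendered = SKELETON.substitute(
        # json.dumps quotes and escapes each value so the result is valid JSON.
        project_name=json.dumps(project_name),
        selector_type=json.dumps(selector_type),
        url=json.dumps(url),
    )
    path = Path(out_dir) / f"{project_name}.json"
    path.write_text(rendered, encoding="utf-8")
    return path


# Example values standing in for the CLI input.
out_dir = tempfile.mkdtemp()
config_path = render_config("demo", "xpath", "http://example.com/", out_dir)
config = json.loads(config_path.read_text(encoding="utf-8"))
```

Reading the file back with json.loads() is a cheap check that the rendered skeleton is well-formed JSON.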

scrapple.commands.generate

class scrapple.commands.generate.GenerateCommand(args)

Defines the execution of generate

execute_command()

The generate command uses Jinja2 templates to create Python scripts, according to the specification in the configuration file. The predefined templates use the extract_content() method of the selector classes to implement linear extractors and use recursive for loops to implement multiple levels of link crawlers. This implementation is effectively a representation of the traverse_next() utility function, using the loop depth to differentiate between levels of the crawler execution.

According to the --output_type argument in the CLI input, the results are written into a JSON document or a CSV document.

The Python script is written to <output_filename>.py. Running this file is equivalent to using the Scrapple run command.
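
To make the idea of loop depth concrete, here is a minimal sketch of a generator that emits one nested for loop per crawler level. It uses a plain recursive Python function rather than a Jinja2 template, and the configuration layout (data, next, follow_link, scraping), the emit_level() helper, the extract_links() call and the pageN and row names in the emitted source are assumptions for illustration, not Scrapple's actual templates.

```python
def emit_level(level, depth=0):
    """Return source lines for one crawler level, indented by loop depth."""
    pad = "    " * depth
    lines = []
    # Linear extractors: one extract_content() call per field in this level.
    for rule in level.get("data", []):
        lines.append(
            f"{pad}row[{rule['field']!r}] = "
            f"page{depth}.extract_content({rule['selector']!r})"
        )
    next_levels = level.get("next", [])
    if not next_levels:
        # Deepest level: emit a line that records the accumulated row.
        lines.append(f"{pad}results.append(dict(row))")
    for nxt in next_levels:
        # Link crawlers: each 'next' entry becomes a for loop one level deeper.
        lines.append(
            f"{pad}for page{depth + 1} in "
            f"page{depth}.extract_links({nxt['follow_link']!r}):"
        )
        lines.extend(emit_level(nxt["scraping"], depth + 1))
    return lines


# Hypothetical two-level configuration.
config = {
    "data": [{"field": "title", "selector": "//h1"}],
    "next": [{
        "follow_link": "//a[@class='item']",
        "scraping": {"data": [{"field": "price", "selector": "//span"}]},
    }],
}

lines = emit_level(config)
source = "\n".join(lines)
# Check that the emitted text is syntactically valid Python.
code = compile(source, "<generated>", "exec")
```

Each level of the configuration adds one level of indentation to the emitted source, so the loop depth alone identifies which crawler level a statement belongs to.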

scrapple.commands.run

class scrapple.commands.run.RunCommand(args)

Defines the execution of run

execute_command()

The run command implements the web content extractor corresponding to the given configuration file.

The execute_command() method validates the input project name and opens the JSON configuration file. The run() method then handles the execution of the extractor.

The extractor implementation follows these primary steps:

  1. Select the appropriate selector class through dynamic dispatch, using the selector_type argument from the CLI input.
  2. Iterate through the data section in level 0 of the configuration file. For each data item, call the extract_content() method of the selector class to extract the content according to the specified extractor rule.
  3. If the extractor has multiple levels, i.e., if there is a ‘next’ attribute in the configuration file, call the traverse_next() utility function and parse through successive levels of the configuration file.
  4. Save the result data in a JSON document or a CSV document, according to the --output_type argument.
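
The four steps can be sketched end to end as follows. This is a simplified model under stated assumptions, not Scrapple's code: the selector classes are stand-ins that read from plain dictionaries instead of fetched pages, extract_links(), extract_level() and the configuration layout are hypothetical, traverse_next() here is a local reimplementation of the idea, and the result is returned as text rather than saved to a file.

```python
import csv
import io
import json


class DictSelector:
    """Hypothetical selector: a 'page' is a dict instead of fetched HTML."""

    def __init__(self, page):
        self.page = page

    def extract_content(self, selector):
        return self.page["content"].get(selector, "")

    def extract_links(self, selector):
        return self.page.get("links", {}).get(selector, [])


class XpathSelector(DictSelector):
    pass


class CssSelector(DictSelector):
    pass


# Step 1: a lookup table acts as the dynamic dispatch on selector_type.
SELECTOR_CLASSES = {"xpath": XpathSelector, "css": CssSelector}


def extract_level(page, level):
    """Step 2: apply every extractor rule in one level's data section."""
    return {rule["field"]: page.extract_content(rule["selector"])
            for rule in level.get("data", [])}


def traverse_next(page, nxt, row, selector_cls):
    """Step 3: follow links and descend through successive 'next' levels."""
    for linked_page in page.extract_links(nxt["follow_link"]):
        child = selector_cls(linked_page)
        child_row = {**row, **extract_level(child, nxt["scraping"])}
        deeper = nxt["scraping"].get("next", [])
        if not deeper:
            yield child_row
        for level in deeper:
            yield from traverse_next(child, level, child_row, selector_cls)


def run(config, root_page, selector_type, output_type):
    selector_cls = SELECTOR_CLASSES[selector_type]
    page = selector_cls(root_page)
    scraping = config["scraping"]
    row = extract_level(page, scraping)
    rows = [row]
    if "next" in scraping:
        rows = [r for nxt in scraping["next"]
                for r in traverse_next(page, nxt, row, selector_cls)]
    # Step 4: serialize as JSON or CSV (returned as text here, not saved).
    if output_type == "json":
        return json.dumps(rows, indent=2)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]) if rows else [])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical two-level configuration and in-memory "site".
config = {"scraping": {
    "data": [{"field": "category", "selector": "//h1"}],
    "next": [{"follow_link": "//a",
              "scraping": {"data": [{"field": "title", "selector": "//h2"}]}}],
}}
root_page = {
    "content": {"//h1": "Books"},
    "links": {"//a": [{"content": {"//h2": "Dune"}},
                      {"content": {"//h2": "Emma"}}]},
}
json_text = run(config, root_page, "xpath", "json")
csv_text = run(config, root_page, "xpath", "csv")
```

In this model each followed link produces one output row that combines the level-0 fields with the fields extracted from the linked page.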

scrapple.commands.web

class scrapple.commands.web.WebCommand(args)

Defines the execution of web

execute_command()

The web command runs the Scrapple web interface through a simple Flask app.

When the execute_command() method is called from the runCLI() function, it starts two simultaneous processes:

  • One calls the run_flask() method to start the Flask app on port 5000 of localhost
  • The other opens the web interface in a web browser

The ‘/’ view of the Flask app opens the Scrapple web interface, which provides a basic form for filling in the required configuration. Submitting the form makes a POST request that carries the form data in the request body. The form is then passed to the form_to_json() utility function, which converts it into the resulting JSON configuration file.
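
The conversion from flat form fields to a nested configuration might look roughly like the sketch below. It is an illustration only: form_to_config() is a hypothetical helper rather than Scrapple's form_to_json(), the form is modeled as a plain dict instead of a Flask request form, and every field name is an assumption.

```python
import json


def form_to_config(form):
    """Build a nested configuration dict from flat form fields.

    All field names here are assumptions made for this sketch.
    """
    fields = form.get("field", [])
    selectors = form.get("selector", [])
    return {
        "project_name": form["project_name"],
        "selector_type": form["selector_type"],
        "scraping": {
            "url": form["url"],
            # Pair the repeated form inputs into one extractor rule each.
            "data": [
                {"field": name, "selector": selector}
                for name, selector in zip(fields, selectors)
            ],
        },
    }


# Hypothetical submitted form with two repeated field/selector inputs.
form = {
    "project_name": "demo",
    "selector_type": "css",
    "url": "http://example.com/",
    "field": ["title", "price"],
    "selector": ["h1", "span.price"],
}
config_json = json.dumps(form_to_config(form), indent=4)
```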

Currently, stopping the web command requires a keyboard interrupt on the command line after the web interface has been closed.