Web page structure
Structure of a Web page
The main elements that make up a web page are:
- DOCTYPE: This lets the browser know the type of markup language the page is written in.
- Document Tree: We can consider a page as a document tree that contains any number of branches.
- HTML: This is the root element of the document tree, and everything inside it is a descendant node. HTML has two children: HEAD and BODY.
- HEAD: It contains the title and other information about the page (metadata).
- BODY: It contains the data displayed by the page.
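As a rough illustration of the structure listed above, the sketch below parses a tiny page and inspects its document tree. The page content is a hypothetical example, and it is deliberately well-formed so that Python's standard `xml.etree.ElementTree` module can read it; real-world HTML is often not well-formed and may need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal page, written as well-formed markup
# so that an XML parser can load it.
PAGE = """<!DOCTYPE html>
<html>
  <head>
    <title>Example page</title>
  </head>
  <body>
    <p>Hello, world.</p>
  </body>
</html>"""

# The root of the document tree is the HTML element.
root = ET.fromstring(PAGE)

# Its direct children are HEAD and BODY.
children = [child.tag for child in root]

# The title lives inside HEAD; the displayed data lives inside BODY.
title = root.find("head/title").text
paragraph = root.find("body/p").text

print(root.tag, children, title, paragraph)
```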
ElementTree
The Element type [6] is a data object that can contain tree-like data structures.
The ElementTree wrapper type [6] adds code to load web pages as trees of Element objects. Each element has a tag (which identifies the element type), a set of attributes, a text string holding its textual content, and any number of child elements.
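The element properties described above can be inspected directly. This is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the link markup and its URL are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: an anchor with two attributes and one child element.
link = ET.fromstring(
    '<a href="https://example.com" class="external">Example <b>site</b></a>'
)

tag = link.tag            # identifies the element type
attributes = link.attrib  # dictionary of attribute names and values
text = link.text          # text before the first child element
child_count = len(link)   # number of direct child elements
first_child = link[0]     # children are accessed like list items

print(tag, attributes, repr(text), child_count, first_child.tag)
```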
To create a tree, we create the root element and then add child elements to it. The SubElement function creates a new element and appends it to a parent element. The following methods are provided for searching subelements:
- find(pattern) – Returns the first subelement matching the pattern
- findtext(pattern) – Returns the value of the text attribute of the first subelement matching the pattern
- findall(pattern) – Returns a list of all subelements matching the pattern
- getiterator(tag) – Returns the elements in the subtree whose tag matches the given tag (deprecated, and removed in Python 3.9, in favour of iter(tag))
- getiterator() – Returns all elements in the subtree, including the element itself (likewise replaced by iter())
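The following sketch builds a small tree with `SubElement` and then exercises the search methods listed above. It assumes Python's standard `xml.etree.ElementTree` module; the page content is hypothetical. Because `getiterator` is no longer available in recent Python versions, the example uses `iter`, which plays the same role.

```python
import xml.etree.ElementTree as ET

# Build the tree: create the root, then attach children with SubElement.
html = ET.Element("html")
head = ET.SubElement(html, "head")
title = ET.SubElement(head, "title")
title.text = "Example page"
body = ET.SubElement(html, "body")
for content in ("first", "second"):
    p = ET.SubElement(body, "p")
    p.text = content

# find / findtext / findall match direct children unless the pattern says otherwise.
first_p = body.find("p")             # first matching subelement
first_text = body.findtext("p")      # text of the first matching subelement
all_p = body.findall("p")            # list of all matching subelements
nested_p = html.findall(".//p")      # ".//" searches the whole subtree
direct_p = html.findall("p")         # empty: <p> is not a direct child of <html>

# iter() walks the element itself and all descendants in document order.
all_tags = [el.tag for el in html.iter()]
p_texts = [el.text for el in html.iter("p")]

print(first_text, len(all_p), all_tags, p_texts)
```

Note that `find`, `findtext`, and `findall` with a bare tag name only look at direct children, so a path such as `.//p` is needed to search deeper levels of the tree.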