Tech Stuff - DOM-2 Page Explorer (smasher)

Overview and Background

This was an experiment in DOM programming. We wanted a simple technique where we could look at the DOM page structure and that would work on any page (like this page), would have as few lines of code as possible (we're very lazy) and provide as much info as we could. We looked at passive techniques but they all had problems (a.k.a. needed us to read documents) so we adopted a simple 'semi-intrusive' method that works (with limitations defined below) on any page as long as you have no HTML 'id=' attributes starting with 'z-'. With that simple restriction the deconstructor (we call it smasher) should work just fine on any HTML page. The basic explorer idea is very simple (we're simple folks):

Since 'getElementById' is the most useful and simplest DOM access method we decided to use it everywhere. Downside - it needs an id attribute. Soooo... when the page is loaded we use an algorithm that says if an element has an id we'll use it, else we'll give it one using node.id = 'z-x' (where x is a simple incrementing count). Then we can get to it easily. And if the id begins with 'z-' then we also know it DID NOT PREVIOUSLY HAVE an id so we add the text (dig) after it (we don't look at this as actually cheating... just kinda recreating the truth!).
As we expand the page we test for various things and the display links (hrefs) contain direct javascript calls passing the page element id (so we know what to expand) and the display element id (so we know where to stick it or collapse from).
The pop-up window which displays the structure uses dynamically created 'thingies' (don't you love our depth of understanding - it's actually a 'div' and a paragraph) so doesn't need anything added to the page.
To make it all a lot easier we use a couple of styles which you need to add to your style sheet (you do use style sheets, don't you?). Yeah, one day we'll create them dynamically as well.
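
The id allocation idea can be sketched roughly as follows. This is a minimal illustration, not the smasher code itself: the names allocateIds, wasGenerated and the sample page object are hypothetical, and the sketch works on any object tree shaped like DOM nodes (nodeType, id, childNodes) so it can be tried outside a browser.

```javascript
var ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE in the DOM

// Walk the tree; keep any existing id, otherwise allocate 'z-' plus a count.
function allocateIds(node, state) {
  if (node.nodeType === ELEMENT_NODE && !node.id) {
    node.id = 'z-' + state.count;
    state.count += 1;
  }
  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    allocateIds(children[i], state);
  }
  return state;
}

// True when the id was generated above rather than written by the page author.
function wasGenerated(id) {
  return id.indexOf('z-') === 0;
}

// Stand-in for a page: <body id="main"><p>text</p><div></div></body>
var page = {
  nodeType: 1, id: 'main', childNodes: [
    { nodeType: 1, id: '', childNodes: [{ nodeType: 3, childNodes: [] }] },
    { nodeType: 1, id: '', childNodes: [] }
  ]
};
allocateIds(page, { count: 0 });
```

In a browser the same function could be started from document.documentElement. The sketch does not skip SCRIPT and STYLE elements, which the real code has to do.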

Limitations and Alternate Approaches

There are a number of limitations in the approach we have taken, some just temporary code omissions, some serious design problems depending on what you want to do.

Limitations

Page intrusion: Fairly modest ('onload=start()' does everything) with the exception of the style sheet elements which could be dynamically created if we could be bothered to write the code.
Hierarchy not clearly visible: This could be fixed by cloning the style sheet and incrementing the 'text-indent' property for each level. We had planned to do that...but....
White space handling: Currently not very pretty or useful.
Non-Element Nodes: Poor handling since the code relies on an 'id' and since these nodes do not take one we would need an alternate approach to fully explore these nodes.
Display width: when a BIG text node is displayed and then hidden the display does not resize. We could probably just redraw the whole structure.
Where are we?: It is hard to know where you are in the page. We love the Mozilla DOM Inspector flashing box to show what we are looking at and we think we know how it works - one day we'll add the code... real soon now(™)...
Multiple browsers: Poor support just now but just a question of time.
Anchor nodes: The code 'String(element)' does not return the same format string for anchor as other HTMLElements - no clue why at this stage. So we have yet another test condition.
STYLE and SCRIPT Elements: We found limitations with either the W3C spec or the Gecko implementation (we're not sure which) so cannot explore the attributes since we cannot allocate an 'id' to these elements (see notes under Function = smash() below).
Attribute expansions: We still have not found out if we can terminate an anchor's text scope string. Instead, the only technique is to avoid it by appending new nodes, in sequence, to a non-anchor element which is before the anchor in question. Now if you understood that explanation the rest is easy.

Alternate Approaches

The DOM-2 includes the Traversal and Range Specification which provides NodeIterator, TreeWalker and NodeFilter interfaces to help in navigating a document. A variation on the current approach (direct javascript calls with parameters) which saves and restores state information from one or more of these interfaces may offer a better approach for all nodes. We have not yet had the time to fully investigate.
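
For orientation only, a TreeWalker pass over the elements below a root might look like the sketch below. The function name listElementNames is hypothetical and this is not part of smasher. The document object is passed in as a parameter so the function can be exercised with a stand-in outside a browser. In a page you would call it as listElementNames(document, document.body).

```javascript
// Collect the nodeName of every element below root, in document order.
function listElementNames(doc, root) {
  var SHOW_ELEMENT = 1; // numeric value of NodeFilter.SHOW_ELEMENT
  // DOM-2 signature: root, whatToShow, filter, entityReferenceExpansion.
  var walker = doc.createTreeWalker(root, SHOW_ELEMENT, null, false);
  var names = [];
  var node = walker.nextNode(); // returns null when the walk is finished
  while (node) {
    names.push(node.nodeName);
    node = walker.nextNode();
  }
  return names;
}
```

Unlike the id-based method, the walker needs no attributes added to the page, but its position would have to be saved and restored between clicks.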

Code Explanation

Function = start()

This function is called when the page is loaded via 'onload=start()' in the body tag. It calls the smash() function to recurse through the document and make sure every element has an id. Page elements are given an id 'z-pX' and display elements 'z-dX' or 'z-aX' (X is a simple incrementing counter).

The function creates the DIV element that will be used as the base for the display and allocates it an id (always 'z-d0') and a style (dd) which has 'position:absolute;' and 'visibility:hidden;' as well as purely cosmetic attributes. We manipulate these attributes when we display and hide the menu of elements.

Finally we enable the event listener (clicked) for any click event in the browser's window (not document).
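
The display set-up part of start() could be sketched along these lines. The function name createDisplayBase and its parameters are hypothetical; the real start() also calls smash() first, which is left out here. The document, window and handler are passed in so the sketch can be tried with stand-ins.

```javascript
// Create the hidden base div for the display and listen for clicks.
function createDisplayBase(doc, win, onClick) {
  var div = doc.createElement('div');
  div.id = 'z-d0';       // the base display id used throughout
  div.className = 'dd';  // style sheet class: absolute position, hidden
  doc.body.appendChild(div);
  // Listen on the window (not the document), bubbling phase.
  win.addEventListener('click', onClick, false);
  return div;
}
```

In a page this would be called as createDisplayBase(document, window, clicked).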

Function = smash()

This function is called by start() (to count nodes and allocate an id to all elements without displaying anything) and by nodex() and clicked() (to format and display elements and nodes). As a consequence it takes a number of parameters which are described in the code.

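This function can be recursive (it calls itself based on the supplied parameters). While you do not have to declare variables before use in javascript, an undeclared variable is global and so is shared by every level of the recursion. GECKO's javascript engine (like any other) therefore REQUIRES explicit definition of local variables with 'var' for a recursive function to work correctly.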
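
A minimal sketch of why the declaration matters in a recursive function. The tree and the function names are hypothetical. The shared variable is declared outside the function on purpose, since that is effectively what an undeclared loop variable becomes.

```javascript
// Hypothetical four-node tree: root -> [a, b], a -> [a1].
var tree = {
  children: [
    { children: [{ children: [] }] }, // a, with one child a1
    { children: [] }                  // b
  ]
};

// One variable shared by every call, like an undeclared loop counter.
var sharedIndex;

function countShared(node) {
  var total = 1;
  for (sharedIndex = 0; sharedIndex < node.children.length; sharedIndex++) {
    total += countShared(node.children[sharedIndex]); // inner calls overwrite sharedIndex
  }
  return total;
}

function countLocal(node) {
  var total = 1;
  for (var i = 0; i < node.children.length; i++) { // i is private to each call
    total += countLocal(node.children[i]);
  }
  return total;
}

var sharedResult = countShared(tree); // 3: node b is skipped
var localResult = countLocal(tree);   // 4: every node is counted
```

With other tree shapes the shared counter can be reset by a child call on every pass, so the loop never finishes at all.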

When testing for, and allocating, ids we exclude SCRIPT and STYLE elements. This is a limitation (we think) of the W3C spec which says that all elements have ids but the HTML spec says that STYLE and SCRIPT do NOT have ids. If you allocate them the GECKO engine chokes and stops dead.

Since this function looks at every node (via the start() call) we also make it count the nodes to provide some basic stats.

We format and add display nodes via the addnode() function to keep the separation of details clean.

Finally we test for elements with 'id=z-d' and ignore them (these are all associated with the display and NOT the page's elements) to keep the display clean and relevant.
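
The two tests described for smash() (skip SCRIPT and STYLE when allocating ids, ignore the display's own elements) could be written as small predicates. The names canTakeId and isDisplayNode are hypothetical and the sketch only needs objects with nodeType, nodeName and id.

```javascript
// Only element nodes other than SCRIPT and STYLE are given an id.
function canTakeId(node) {
  if (node.nodeType !== 1) { // 1 = element node
    return false;
  }
  var name = String(node.nodeName).toUpperCase();
  return name !== 'SCRIPT' && name !== 'STYLE';
}

// Elements whose id begins 'z-d' belong to the display, not to the page.
function isDisplayNode(node) {
  return typeof node.id === 'string' && node.id.indexOf('z-d') === 0;
}
```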

Function = clicked()

This function is called via the Events interface whenever the mouse is clicked. We allow normal (left) clicks for navigation but test for the right mouse button. When this is detected we stop the normal default actions by using the preventDefault method (stops the context menu from being displayed). stopPropagation also gives the same result. MSIE uses the special 'oncontextmenu' event which is added to the body tag.
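
The right-button test can be sketched as below. The function name isExploreClick is hypothetical. In the DOM-2 Events model the 'button' property is 0 for the left button, 1 for the middle and 2 for the right.

```javascript
// Returns true (and suppresses the default action) for a right click,
// false for any other button so normal navigation carries on.
function isExploreClick(event) {
  var RIGHT_BUTTON = 2;
  if (event.button !== RIGHT_BUTTON) {
    return false;
  }
  event.preventDefault(); // stops the context menu being displayed
  return true;
}
```

A handler such as clicked() would call this first and only build the display when it returns true.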

clicked() gets the base div (id=z-d0) that we created in start(), adds an anchor element to call del() which removes the display, adds a text node for page stats and then calls smash() with the starting document element and the display start node ('z-d0' the div) and no recursion. This results in a display of the top level nodes in the page. Further expansion is by clicking the relevant nodes.

Function = addnode()

A very messy function with too many special cases because of the limitation of the DOM and the method we chose. The function creates and adds nodes to the display. The nodes may be pure text (e.g. page stats), non-element nodes or element nodes.

If the node is pure text it's relatively straightforward - we create a text node and append it to the supplied display node.

For all other nodes we use String(node) as a way of serialising the node. We then extract the relevant data from this string (it typically returns '[class: nodename]').

If it's a non-element it cannot take an id attribute so we have to immediately display the interesting bits (in the case of text or comment nodes we just display the full text) whether you want to see them or not. This is not a good solution and it should have an 'Attr' clickable link.

If it's an element node we check if it has childNodes and if so we add the anchor node with an href calling the nodex() function ('javascript:nodex(p1, p2)'), where p1 = the page element 'id' and p2 = the display element 'id' (always a paragraph element). If the element has attributes we add a further anchor node calling the attrex() function with similar parameters.
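
Building the href for such a link could look like the sketch below. The helper name expandHref is hypothetical, and quoting the two ids as string literals inside the call is an assumption about how the parameters are passed.

```javascript
// Build a 'javascript:' href that calls functionName(pageId, displayId).
function expandHref(functionName, pageId, displayId) {
  return 'javascript:' + functionName + "('" + pageId + "','" + displayId + "')";
}

var nodeLink = expandHref('nodex', 'z-p3', 'z-d7');
// nodeLink is "javascript:nodex('z-p3','z-d7')"
```

The same helper would serve for the attrex() link by passing 'attrex' as the first argument. It assumes ids contain no quote characters.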

Due to the limitations of SCRIPT and STYLE elements we have to ignore them (see above) and serializing an anchor element does not give the same result as other nodes so we have to explicitly test for and substitute the string (there must be a better way but we cannot currently find it!).
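
Extracting the class name from the serialised string, with the anchor special case, might be sketched as follows. The function name classNameFromString is hypothetical, and the sketch assumes the string has the form '[object HTMLDivElement]'. As far as we know an anchor's string form in current browsers is its href, which would explain why it needs its own test.

```javascript
// serialised is the result of String(node); nodeName is node.nodeName.
function classNameFromString(serialised, nodeName) {
  var match = /^\[object\s+([^\]]+)\]$/.exec(serialised);
  if (match) {
    return match[1]; // e.g. 'HTMLDivElement'
  }
  if (nodeName && nodeName.toUpperCase() === 'A') {
    return 'HTMLAnchorElement'; // substitute, since the string was not bracketed
  }
  return serialised; // unknown format: show it unchanged
}
```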

Function = nodex()

nodex() expands or collapses the element display and is called directly via javascript with all necessary parameters. If the number of childNodes is > 2 (an anchor and text are always present) it assumes a collapse is required so it removes nodes until it hits an 'id=z-a' when it stops. For expansion it just calls smash() to do all the work.
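
The expand-or-collapse decision can be sketched as below. The names isMarker, collapse and toggle are hypothetical, and the sketch assumes the expanded nodes were appended after the 'z-a' anchor so that removing from the end stops at it. The real ordering in smasher may differ.

```javascript
// True for the marker anchor (ids beginning 'z-a') that ends a collapse.
function isMarker(node) {
  return typeof node.id === 'string' && node.id.indexOf('z-a') === 0;
}

// Remove trailing children until the marker anchor is the last child.
function collapse(displayNode) {
  var kids = displayNode.childNodes;
  var removed = 0;
  while (kids.length > 0 && !isMarker(kids[kids.length - 1])) {
    displayNode.removeChild(kids[kids.length - 1]);
    removed += 1;
  }
  return removed;
}

// alwaysPresent is the number of children that exist before any expansion
// (2 for an element display); expand does the expansion work.
function toggle(displayNode, alwaysPresent, expand) {
  if (displayNode.childNodes.length > alwaysPresent) {
    collapse(displayNode);
    return 'collapsed';
  }
  expand(displayNode);
  return 'expanded';
}
```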

Function = attrex()

attrex() expands or collapses the attributes display and is called directly via javascript with all necessary parameters. If the number of childNodes is > 1 (an anchor is always present) it assumes a collapse is required so it removes nodes until it hits an 'id=z-a' when it stops.

For expansion it gets a list of all the element's attributes and loops (using addnode()) to display their 'name' and 'value' attributes. This function looks at each id and if it begins with 'z-' we add the explanatory text (dig) to indicate it was added by smash().
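
Formatting the attribute list, including the (dig) marker, might look like this sketch. The function names and the name="value" layout are assumptions for illustration. The input only needs a length and items with 'name' and 'value', which is what an element's attributes collection provides.

```javascript
// One display line per attribute; generated ids get the '(dig)' marker.
function describeAttribute(name, value) {
  var text = name + '="' + value + '"';
  if (name.toLowerCase() === 'id' && value.indexOf('z-') === 0) {
    text += ' (dig)';
  }
  return text;
}

function describeAttributes(attributes) {
  var lines = [];
  for (var i = 0; i < attributes.length; i++) {
    lines.push(describeAttribute(attributes[i].name, attributes[i].value));
  }
  return lines;
}
```

In a page, describeAttributes(element.attributes) would give the strings to hand to addnode().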

Function = del()

del() removes the display by referencing the base display div (z-d0), changing its visibility attribute to 'hidden' and then looping to remove nodes until there are no more childNodes on the div (this leaves only the div node).
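
A minimal sketch of that clean-up, with the div passed in as a parameter (the name clearDisplay is hypothetical):

```javascript
// Hide the base display div and strip everything that was added to it.
function clearDisplay(div) {
  div.style.visibility = 'hidden';
  while (div.childNodes.length > 0) {
    div.removeChild(div.childNodes[0]);
  }
  return div; // the empty div stays in the page for the next display
}
```

In a page this would be called as clearDisplay(document.getElementById('z-d0')).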

Installing on a page

To install the deconstructor on any page do the following:

Save the javascript to a file and use a SCRIPT tag (with a 'src' attribute) to load it into your page OR cut and paste it into an existing page.
Add 'onload="start()"' and 'oncontextmenu=stopit()' attributes to the body tag of the document.

Add the following style sheet definitions:

.at {font-family:Verdana,sans-serif;font-size:9pt;margin:0px;
     text-indent:8px;}
.d {font-family:Verdana,sans-serif;font-size:9pt;margin:0px;}
.dd {position:absolute;left:0;top:0;font-family:Verdana,sans-serif;
     font-size:9pt;visibility:hidden;background:cyan;color:black;
     margin:0px;}

Load the page and start clicking!

When it all goes horribly wrong

*!#$% happens as we north americans say (brits use the quaint expression 'when it all goes pear shaped') - not that we have any experience of such crises ourselves you understand. You have three great tools to get you out of the stuff, all available for Mozilla browsers.

Firebug - Javascript debugger and Inspector which is available as a Mozilla addon. This is a superb tool.
Venkman - the Javascript debugger which ships with every Mozilla release (use 0.9.x+) from Tools->web development->javascript debugger
DOM Inspector which also ships with every Mozilla release.

Microsoft also have a javascript debugger which we hate because every time you kill the debugger you kill MSIE as well (must be a way around it) but you can get it here.
