Extracting Links from HTML

June 5, 2001
by Will Turnage

Dear Multimedia Handyman,

I need to be able to import local html files (easy enough) and then parse the HTML and strip out email addresses into a text member separated by Carriage Returns. Can you help?

Regards,

Jim Savage

Jim,

I think there's been a point in each developer's career where he or she has had to do some really nasty text parsing. Luckily, this is one of those situations where that doesn't have to be the case. The secret here is to let Director's text members do most of the work for you by taking advantage of a text member's html property. That's as simple as:

myHTML = "<html><body>This is a quick <a href=mailto:will / at / director-online.com>hyperlink</a> test. Can you count the <a href=mailto:will / at / director-online.com>hyperlinks</a>?</body></html>"

member("testHTML").html = myHTML

All you've done here is start with a simple HTML page containing two hyperlinks. Then you set the html property of a text member equal to that HTML data, and Director automatically parses the HTML into the text member for you. Once that's done, you can instantly find out where all the hyperlinks are in the text member.

put member("testHTML").hyperlinks
-- [[17, 25], [51, 60]]

The result is a list of lists of numbers. Each pair of numbers gives the first and last character positions of one hyperlink in the text member. So in the example above, you instantly know that the first hyperlink runs from character 17 through character 25.
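
To make the ranges concrete, here is a minimal sketch that walks the list and prints the visible text of each link. The handler name showLinkText is hypothetical, and it assumes a text member whose html property has already been set as shown above.

```lingo
-- A sketch only: showLinkText is a hypothetical handler name.
on showLinkText textMemberName
  linkRanges = member(textMemberName).hyperlinks
  repeat with linkRange in linkRanges
    -- each entry is a [firstChar, lastChar] pair
    firstChar = linkRange[1]
    lastChar = linkRange[2]
    -- pull the visible link text out of the member's plain text
    put member(textMemberName).text.char[firstChar..lastChar]
  end repeat
end
```

With the sample HTML above, typing showLinkText "testHTML" in the Message window should print the words "hyperlink" and "hyperlinks", since those are the characters covered by the two ranges.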

If you want to find out what the hyperlink is for each range of characters, then you use the hyperlink property. The difference between the hyperlinks property and the hyperlink property (aside from the letter s) is that hyperlinks will always return a list of numbers, whereas hyperlink will return the actual text of a link.

put member ("testHTML").char[17].hyperlink
-- "mailto:will / at / director-online.com"

In this code, you have to specify a particular section of the text member, such as a word, character, item, or line, and then look for its hyperlink property. If Director finds that this chunk expression contains only one hyperlink, then it will return the hyperlink. If there's no hyperlink, or there are multiple hyperlinks, then Director will return an empty string.
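
A few Message window lines can be used to check this rule against the sample member. This is only a sketch that assumes the "testHTML" member from earlier; the comments restate what the description above predicts rather than recorded output.

```lingo
-- character 1 ("T") is not part of any link, so an empty string is expected
put member("testHTML").char[1].hyperlink

-- word 5 is "hyperlink", which lies entirely inside the first link,
-- so the mailto: link is expected
put member("testHTML").word[5].hyperlink

-- characters 1 to 60 span both links, so an empty string is expected
put member("testHTML").char[1..60].hyperlink
```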

The final step is to write a handler that will process these hyperlinks for you automatically:

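on parseHTML textMemberName
  -- accumulates the email addresses, one per line
  tempStr = EMPTY
  -- each entry of hyperlinks is a [firstChar, lastChar] pair
  repeat with i in member (textMemberName).hyperlinks
    -- read the link target from the first character of the range
    theHyperlink = member (textMemberName).char[i[1]].hyperlink
    if theHyperlink.char[1..7] = "mailto:" then
      -- strip the "mailto:" prefix, leaving just the address
      delete char 1 to 7 of theHyperlink
      tempStr = tempStr & theHyperlink & RETURN
    end if
  end repeat
  -- drop the trailing RETURN, if any address was found
  if tempStr <> EMPTY then delete the last char of tempStr
  return tempStr
end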

First, you need to pass this handler the name of a text member that contains the HTML you want to parse. The handler begins by initializing an empty string. Next it repeats through the list of hyperlinks in the text member. For each hyperlink, it checks to see if it starts with the string "mailto:". If it does, then it deletes that prefix from the hyperlink and adds the email address, followed by a RETURN, to the string. Finally, when the loop is done, the handler deletes the last RETURN from the string and returns the list of email addresses to whatever called the parseHTML function.

put parseHTML ("testHTML")
-- "will / at / director-online.com
will / at / director-online.com"
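
Notice that the same address appears twice, because the sample page links to it twice. If you would rather list each address once and drop the result straight into a text member, one possible variation is sketched below. The handler name listUniqueEmails and the member name "emailList" are hypothetical; the text member has to exist in your cast before you call the handler.

```lingo
-- A sketch only: listUniqueEmails and "emailList" are hypothetical names.
on listUniqueEmails textMemberName
  found = []
  repeat with linkRange in member(textMemberName).hyperlinks
    theHyperlink = member(textMemberName).char[linkRange[1]].hyperlink
    if theHyperlink.char[1..7] = "mailto:" then
      delete char 1 to 7 of theHyperlink
      -- getPos returns 0 when the address is not in the list yet
      if found.getPos(theHyperlink) = 0 then found.add(theHyperlink)
    end if
  end repeat
  -- join the unique addresses with RETURNs
  tempStr = EMPTY
  repeat with address in found
    tempStr = tempStr & address & RETURN
  end repeat
  if tempStr <> EMPTY then delete the last char of tempStr
  -- write the finished list into the destination text member
  member("emailList").text = tempStr
end
```

Calling listUniqueEmails "testHTML" should leave a single line in the "emailList" member for the sample page, since both links point to the same address.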

Here's an example that shows this handler in action. This movie allows you to enter any web page, and it will parse out the email addresses for you.

A sample Director 8 movie is available for download in Mac or Windows format.

Will Turnage is a multimedia programmer based in New York City. In addition to his weekly role as Director Online's Multimedia Handyman, he is also the Technology Director for Sony Music's Client Side Technologies Group. You can read more about his work at http://will.turnage.com/.