Bay Six Software Forum Index Bay Six Software
Beyond the Basics
 

[RC1] HTML Tamer

 
Brent
Site Admin


Joined: 01 Jul 2005
Posts: 800

Posted: Dec 26th, 2007, 10:11pm    Post subject: [RC1] HTML Tamer

This code parses HTML that has been submitted by a user. It tries to eliminate the potential to wreck the design of a web page, advertise, or even install malware on the computers of vulnerable users of your site. The routine steps through the input one character at a time, breaking the text into tokens, classifying them, and normalizing the HTML to some degree.

You can paste the Run BASIC code into your program, or create a module and use the RUN command to include it. Your program must get the user-entered code into a string variable. That variable is passed by reference to the tameHtml function, which parses and normalizes it in place.

"tameHtml" calls three auxiliary routines, which are meant to be modified to meet your site's needs. tameHtmlTag is passed an HTML tag name and returns a boolean result: true (nonzero) if the tag is "tame," false (0) if it is not acceptable, in which case that tag's element is removed. tameHtmlAttrName is passed an HTML attribute name and returns a boolean value: true if the attribute is "tame," false if it is not, in which case that attribute is removed. tameHtmlAttrValue is passed a variable containing an attribute value (the text after the "=") and tries to catch problems and correct them.
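Since these routines are meant to be edited, one possible variation is to invert the tag check: rather than listing the tags to reject, list the only tags to accept. The following is a minimal sketch under that assumption, not part of the release. The name tameHtmlTagAllow and its tag list are hypothetical; it follows the same convention as the listing's tag list, lower-case words separated by spaces.

```basic
' Hypothetical allow-list variant of tameHtmlTag (not part of RC1).
print tameHtmlTagAllow("b")       ' prints 1
print tameHtmlTagAllow("script")  ' prints 0
end

function tameHtmlTagAllow( tag$ )
' Returns 1 only when tag$ appears in the space-delimited list below.
' Anything not listed falls through with the default result of 0.
  tag$ = " " + tag$ + " "

  if instr( _
    " b blockquote br code em i li ol p pre strong ul ", _
    tag$) > 0 then tameHtmlTagAllow = 1
end function
```

An allow-list tends to fail closed: a tag nobody thought about is rejected rather than kept. Whether that suits a given site is a judgment call.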
Code:
code$ = "<span onclick='doDamage()'>span</span><a href='sample.spam'>spam</a>"
r = tameHtml(code$)
print code$
end

'===============================================================================
' HTML Tamer
' By Brent D. Thorn, 1/2008
' Released to the public domain "as is" without any warranty or guarantees.
'===============================================================================

function tameHtml( byref code$ )
' Takes "raw" HTML code that a user has entered and attempts to "tame" it,
' including removing whole elements and attributes that are considered
' inappropriate, properly escaping characters, and generally normalizing
' the code to make it safer and more portable.

  state = 0 ' plain text
  tame$ = ""
  token$ = ""
  goodElem = 1
  goodAttr = 1

  for i = 1 to len(code$)
    c$ = mid$(code$, i, 1)
    c = asc(c$)
    select case state
      case 0 ' plain text
        goodElem = 1
        select case c
          case 34 ' quote
            c$ = "&quot;"
          case 38 ' "&"
            'TODO: escape improper entities
          case 60 ' "<"
            iElem = len(tame$) + 1 ' Start of element
            state = 1
            c$ = "" ' Delay addition until we know more.
            goodElem = 1 ' Assume the element and
            goodAttr = 1 ' attributes are valid.
          case 62 ' ">"
            c$ = "&gt;"
        end select

      case 1 ' <
        if (c >= 65 and c <= 90) or (c >= 97 and c <= 122) then ' alpha
          state = 2 ' This looks like a tag.
          token$ = lower$(c$) ' Start recording.
          c$ = "<" + token$
        else
          select case c
            case 33 ' "!"
              state = 12 : c$ = "<!" ' Could be a comment.
            case 47 ' "/"
              state = 10 : c$ = "</" ' Likely is an end tag.
            case else
              state = 0 : c$ = "&lt;" + c$ ' It's just a less-than.
          end select
        end if

      case 2 ' <tag
        if (c >= 48 and c <= 57) _
        or (c >= 65 and c <= 90) or (c >= 97 and c <= 122) then ' alnum
          c$ = lower$(c$)
          token$ = token$ + c$
        else
          ' end of tag
          if not (tameHtmlTag(token$)) then
            tame$ = left$(tame$, iElem - 1) + " "
            goodElem = 0
          end if

          if c <= 32 then ' ws
            state = 3
          else
            select case c
              case 47 ' "/"
                state = 9
                c$ = ""
              case 62 ' ">"
                state = 0
              case else ' junk
                state = 18
                c$ = " "
            end select
          end if
        end if

      case 3 ' <tag ???
        if c > 32 then
          goodAttr = 1
          if (c >= 65 and c <= 90) or (c >= 97 and c <= 122) then ' alpha
            state = 4
            c$ = lower$(c$)
            token$ = c$
            iAttr = len(tame$)
          else
            select case c
              case 47 ' "/"
                state = 9
                c$ = "" ' State 9 emits "/>" itself; avoid a doubled slash.
              case 62 ' ">"
                state = 0
              case else ' junk
                state = 18
                c$ = " "
            end select
          end if
        end if

      case 4 ' <tag attr
        if (c >= 48 and c <= 57) _
        or (c >= 65 and c <= 90) or (c >= 97 and c <= 122) then ' alnum
          c$ = lower$(c$)
          token$ = token$ + c$
        else
          ' end of attr name
          if not (tameHtmlAttrName(token$)) then
            tame$ = left$(tame$, iAttr)
            goodAttr = 0
          end if

          if c > 32 then
            select case c
              case 47 ' "/"
                state = 9
                c$ = "" ' State 9 emits "/>" itself; avoid a doubled slash.
              case 61 ' "="
                state = 5
              case 62 ' ">"
                state = 0
                goodAttr = 1 ' Still emit ">" after a dropped attribute.
              case else ' junk
                state = 18
                c$ = " "
            end select
          else
            state = 3
          end if
        end if

      case 5 ' <tag attr=
        if c > 32 then
          select case c
            case 34 ' quote
              state = 8
              token$ = ""
            case 39 ' "'"
              state = 7
              token$ = ""
            case 62 ' ">"
              state = 0
              c$ = chr$(34) + chr$(34) + ">"
            case else
              state = 6
              token$ = c$
              c$ = chr$(34) ' token$ already holds the first character.
          end select
        end if

      case 6 ' <tag attr=value
        if c <= 32 or c = 62 then
          ' end of value
          if c <= 32 then ' ws
            state = 3
          else ' ">"
            state = 0
          end if
          if goodAttr then
            c$ = token$ + chr$(34) + c$
            call tameHtmlAttrValue c$
          else
            goodAttr = 1 ' Attribute dropped; still emit the delimiter.
          end if
        else
          if c = 34 then c$ = "&quot;" ' quote

          token$ = token$ + c$
          c$ = ""
        end if

      case 7 ' <tag attr='value
        if c = 39 then ' "'"
          state = 3
          if goodAttr then
            c$ = token$ + c$
            call tameHtmlAttrValue c$
          end if
        else
          token$ = token$ + c$
          c$ = ""
        end if

      case 8 ' <tag attr="value
        if c = 34 then ' end quote
          state = 3
          if goodAttr then
            c$ = token$ + c$
            call tameHtmlAttrValue c$
          end if
        else
          token$ = token$ + c$
          c$ = ""
        end if

      case 9 ' <tag /
        if c = 62 then
          c$ = "/>"
          state = 0
          goodAttr = 1 ' Always close the tag, even after a dropped attribute.
        else
          state = 3
        end if

      case 10 ' </
        if (c >= 65 and c <= 90) or (c >= 97 and c <= 122) then ' alpha
          state = 11
          c$ = lower$(c$)
          token$ = c$
        else
          tame$ = left$(tame$, len(tame$) - 2) + "&lt;/"
          state = 0
        end if

      case 11 ' </tag
        if (c >= 48 and c <= 57) _
        or (c >= 65 and c <= 90) or (c >= 97 and c <= 122) then ' alnum
          c$ = lower$(c$)
          token$ = token$ + c$
        else
          ' end of tag name
          if not (tameHtmlTag(token$)) then
            tame$ = left$(tame$, iElem - 1) + " "
            goodElem = 0
          end if

          if c = 62 then ' ">"
            state = 0
          else
            ' trash the junk
            state = 17
            c$ = ""
          end if
        end if

      case 12 ' <!
        if c = 45 then state = 13 ' "-"
        if c = 62 then state = 0 ' ">"

      case 13 ' <!-
        if c = 45 then state = 14 ' "-"
        if c = 62 then state = 0 ' ">"

      case 14 ' <!--
        if c = 45 then state = 15 ' "-"

      case 15 ' <!-- -
        if c = 45 _ ' "-"
          then state = 16 _
          else state = 14

      case 16 ' <!-- --
        if c = 62 _ ' ">"
          then state = 0 _
          else state = 14

      case 17 ' </tag ???
        if c = 62 then ' ">"
          state = 0
        else
          c$ = ""
        end if

      case 18 ' continue junking
        select case c
          case 47 : state = 9 ' "/"
          case 62 : state = 0 ' ">"
          case else
            if (c <= 32) _ ' ws
            or (c >= 65 and c <= 90) or (c >= 97 and c <= 122) _ ' alpha
              then state = 3 _
              else c$ = ""
        end select
    end select

    if goodElem and goodAttr then tame$ = tame$ + c$
  next

  code$ = tame$
end function

function tameHtmlTag( tag$ )
' Gets called for each HTML tag that <function tameHtml> parses from an element.
' Takes the HTML tag name and returns a boolean value indicating its "tameness."
' True (non-zero) means that the tag is "tame" and therefore safe to keep.
' False (zero) indicates an unsafe or inappropriate tag that should be removed.
' The list of unsafe tags below is editable, with just two rules:
'  1.  At least one space must be on either side of a word.
'  2.  All words must be in lower case.

  tag$ = " " + tag$ + " "

  if instr( _
    " a applet base bgsound blink body button custom embed event form frame" + _
    " frameset head html input label link marquee meta noframes noscript" + _
    " object option param script select style title ", _
    tag$) = 0 then tameHtmlTag = 1
end function

'-------------------------------------------------------------------------------

function tameHtmlAttrName( attr$ )
' Gets called for each HTML attribute that <function tameHtml> parses from an
' element.  Takes the attribute name (usually before an equals sign, "=") and
' returns a boolean value indicating its "tameness."
' True (non-zero) indicates a safe attribute that should be kept.
' False (zero) indicates an unsafe attribute that should be removed.

  tameHtmlAttrName = 1 ' assume it's safe

  ' all scripting events are prohibited
  if left$(attr$, 2) = "on" then tameHtmlAttrName = 0

  if instr( _
    " style ", _
    " " + attr$ + " ") then tameHtmlAttrName = 0
end function

sub tameHtmlAttrValue byref value$
' Gets called for each HTML attribute value (the string following the equals
' sign, "=") that <function tameHtml> parses from an element.  Takes the string
' and, if necessary, modifies it to eliminate unsafe constructs.

  lcval$ = lower$(value$)

  i = instr(lcval$, "script:")
  if i > 0 then
    value$ = left$(value$, i + 5) + ":" + mid$(value$, i + 6)
  else
    i = instr(value$, ":")
    if i > 1 then
      if instr( _
      ":about:activex:aim:applet:callto:chrome:file:mailto:skype:ymsgr:", _
      ":" + left$(lcval$, i)) then
        value$ = left$(value$, i - 1) + ":" + mid$(value$, i + 1)
      end if
    end if
  end if
end sub
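
To see what the "script:" branch of tameHtmlAttrValue does to a value, here is a stand-alone sketch of just that step. The function name defangScript$ is hypothetical and is not part of the listing. It doubles the colon after "script", which presumably is meant to keep the value from working as a script URL while leaving the text readable.

```basic
' Stand-alone sketch of the "script:" step (defangScript$ is a hypothetical name).
print defangScript$("javascript:alert(1)")  ' prints javascript::alert(1)
print defangScript$("page.html")            ' prints page.html
end

function defangScript$( value$ )
  defangScript$ = value$ ' unchanged unless "script:" is found
  i = instr(lower$(value$), "script:")
  ' Keep everything up to and including "script", add a colon,
  ' then append the rest starting at the original colon.
  if i > 0 then defangScript$ = left$(value$, i + 5) + ":" + mid$(value$, i + 6)
end function
```

The second branch of the real routine applies the same colon-doubling to a fixed list of schemes such as mailto: and file:.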

_________________
Brent
Brent
Site Admin


Joined: 01 Jul 2005
Posts: 800

Posted: Jan 2nd, 2008, 8:35am    Post subject: Re: [RC1] HTML Tamer

Updated to Release Candidate 1.
_________________
Brent
BASICwebmaster
Guest





Posted: Jan 3rd, 2008, 2:34pm    Post subject: Re: [RC1] HTML Tamer

Hey Brent,

This is pretty sweet. It'd be great if the tameHtmlAttrName() function also accepted the name of the tag, so stricter filtering would be easier to do. An example use case is allowing the HREF attribute only on <a> tags. Something similar for the tameHtmlAttrValue subroutine would also be useful.
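
A rough sketch of what such a tag-aware check might look like follows. The function name tameHtmlTagAttr and its two-argument signature are hypothetical and not part of RC1. It assumes both names arrive in lower case, as tameHtml produces them.

```basic
' Hypothetical tag-aware attribute check (not part of RC1).
print tameHtmlTagAttr("a", "href")     ' prints 1
print tameHtmlTagAttr("span", "href")  ' prints 0
print tameHtmlTagAttr("a", "onclick")  ' prints 0
end

function tameHtmlTagAttr( tag$, attr$ )
  tameHtmlTagAttr = 1 ' assume it's safe

  ' Same blanket rules as tameHtmlAttrName: no scripting events, no style.
  if left$(attr$, 2) = "on" then tameHtmlTagAttr = 0
  if attr$ = "style" then tameHtmlTagAttr = 0

  ' Per-tag rule from the use case above: href only on <a>.
  if (attr$ = "href") and (tag$ <> "a") then tameHtmlTagAttr = 0
end function
```

One wrinkle: tameHtml reuses token$ for both tag names and attribute names, so the tag name would need to be saved in a separate variable before the attribute states overwrite it.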

- Bill
Brent
Site Admin


Joined: 01 Jul 2005
Posts: 800

Posted: Jan 4th, 2008, 2:53am    Post subject: Re: [RC1] HTML Tamer

Hi Bill,

I had the same thought about two-thirds of the way through development. If no one finds any bugs in RC1, I'll probably add that functionality.

Thanks for checking it out.

_________________
Brent
Alyce
Full Member


Joined: 04 Jul 2005
Posts: 91

Posted: Jan 4th, 2008, 1:26pm    Post subject: Re: [RC1] HTML Tamer

This would be a good program to include on the Run BASIC wiki, if you are so inclined.

http://runbasic.wikispaces.com/

_________________
- Alyce
Brent
Site Admin


Joined: 01 Jul 2005
Posts: 800

Posted: Jan 4th, 2008, 9:31pm    Post subject: Re: [RC1] HTML Tamer

Yes, Alyce, that is a good idea. I'll seriously consider it when the code has had enough time to stew.
_________________
Brent
All times are GMT