Bay Six Software Beyond the Basics
Brent Site Admin
Joined: 01 Jul 2005 Posts: 800
Posted: Dec 26th, 2007, 10:11pm Post subject: [RC1] HTML Tamer |
This code is designed to parse HTML that a user has submitted. It tries to eliminate the potential to wreck the design of a web page, advertise, or even install malware on the computers of vulnerable users of your site. The routine goes through the text one character at a time, breaking it apart, classifying the pieces, and normalizing the HTML to some degree.
You can paste the RB code into your program or create a module and use the RUN command to include it. Your program must get the user-entered code into a string variable. That variable is passed to the tameHtml function, which does the parsing and normalizing and leaves the tamed code in the same variable.
"tameHtml" calls three auxiliary routines which are meant to be modified to meet your site's needs. tameHtmlTag is passed an HTML tag name and returns a boolean result: true (nonzero) if the tag is "tame," false (0) if it is not acceptable, the latter resulting in the removal of that tag's element. tameHtmlAttrName is passed an HTML attribute name and returns a boolean value: true if the attribute is "tame," false if it is not, the latter resulting in that attribute's removal. And tameHtmlAttrValue is passed a variable containing an attribute value (the text after the "=") and tries to catch problems and correct them.
Code:
code$ = "<span onclick='doDamage()'>span</span><a href='sample.spam'>spam</a>"
r = tameHtml(code$)
print code$
' HTML Tamer
' By Brent D. Thorn, 1/2008
' Released to the public domain "as is" without any warranty or guarantees.
function tameHtml( byref code$ )
' Takes "raw" HTML code that a user has entered and attempts to "tame" it,
' including removing whole elements and attributes that are considered
' inappropriate, properly escaping characters, and generally normalizing
' the code to make it safer and more portable.
state = 0 ' plain text
tame$ = ""
token$ = ""
goodElem = 1
goodAttr = 1
for i = 1 to len(code$)
c$ = mid$(code$, i, 1)
c = asc(c$)
select case state
case 0 ' plain text
goodElem = 1
select case c
case 34 ' quote
c$ = "&quot;"
case 38 ' "&"
'TODO: escape improper entities
case 60 ' "<"
iElem = len(tame$) + 1 ' Start of element
state = 1
c$ = "" ' Delay addition until we know more.
goodElem = 1 ' Assume the element and
goodAttr = 1 ' attributes are valid.
case 62 ' ">"
c$ = "&gt;"
end select
case 1 ' <
if (c >= 65 and c <= 90) or (c >= 97 and c <= 122) then ' alpha
state = 2 ' This looks like a tag.
token$ = lower$(c$) ' Start recording.
c$ = "<" + token$
else
select case c
case 33 ' "!"
state = 12 : c$ = "<!" ' Could be a comment.
case 47 ' "/"
state = 10 : c$ = "</" ' Likely is an end tag.
case else
state = 0 : c$ = "&lt;" + c$ ' It's just a less-than.
end select
end if
case 2 ' <tag
if (c >= 48 and c <= 57) _
or (c >= 65 and c <= 90) or (c >= 97 and c <= 122) then ' alnum
c$ = lower$(c$)
token$ = token$ + c$
else
' end of tag
if not (tameHtmlTag(token$)) then
tame$ = left$(tame$, iElem - 1) + " "
goodElem = 0
end if
if c <= 32 then ' ws
state = 3
else
select case c
case 47 ' "/"
state = 9
c$ = ""
case 62 ' ">"
state = 0
case else ' junk
state = 18
c$ = " "
end select
end if
end if
case 3 ' <tag ???
if c > 32 then
goodAttr = 1
if (c >= 65 and c <= 90) or (c >= 97 and c <= 122) then ' alpha
state = 4
c$ = lower$(c$)
token$ = c$
iAttr = len(tame$)
else
select case c
case 47 ' "/"
state = 9
case 62 ' ">"
state = 0
case else ' junk
state = 18
c$ = " "
end select
end if
end if
case 4 ' <tag attr
if (c >= 48 and c <= 57) _
or (c >= 65 and c <= 90) or (c >= 97 and c <= 122) then ' alnum
c$ = lower$(c$)
token$ = token$ + c$
else
' end of attr name
if not (tameHtmlAttrName(token$)) then
tame$ = left$(tame$, iAttr)
goodAttr = 0
end if
if c > 32 then
select case c
case 47 ' "/"
state = 9
case 61 ' "="
state = 5
case 62 ' ">"
state = 0
case else ' junk
state = 18
c$ = " "
end select
else
state = 3
end if
end if
case 5 ' <tag attr=
if c > 32 then
select case c
case 34 ' quote
state = 8
token$ = ""
case 39 ' "'"
state = 7
token$ = ""
case 62 ' ">"
state = 0
c$ = chr$(34) + chr$(34) + ">"
case else
state = 6
token$ = c$
c$ = chr$(34) + c$
end select
end if
case 6 ' <tag attr=value
if c <= 32 or c = 62 then
' end of value
if c <= 32 then ' ws
state = 3
else ' ">"
state = 0
end if
c$ = token$ + chr$(34) + c$
if goodAttr then call tameHtmlAttrValue c$
else
if c = 34 then c$ = "&quot;" ' quote
token$ = token$ + c$
c$ = ""
end if
case 7 ' <tag attr='value
if c = 39 then ' "'"
state = 3
if goodAttr then
c$ = token$ + c$
call tameHtmlAttrValue c$
end if
else
token$ = token$ + c$
c$ = ""
end if
case 8 ' <tag attr="value
if c = 34 then ' end quote
state = 3
if goodAttr then
c$ = token$ + c$
call tameHtmlAttrValue c$
end if
else
token$ = token$ + c$
c$ = ""
end if
case 9 ' <tag /
if c = 62 then
c$ = "/>"
state = 0
else
state = 3
end if
case 10 ' </
if (c >= 65 and c <= 90) or (c >= 97 and c <= 122) then ' alpha
state = 11
c$ = lower$(c$)
token$ = c$
else
tame$ = left$(tame$, len(tame$) - 2) + "&lt;/"
state = 0
end if
case 11 ' </tag
if (c >= 48 and c <= 57) _
or (c >= 65 and c <= 90) or (c >= 97 and c <= 122) then ' alnum
c$ = lower$(c$)
token$ = token$ + c$
else
' end of tag name
if not (tameHtmlTag(token$)) then
tame$ = left$(tame$, iElem - 1) + " "
goodElem = 0
end if
if c = 62 then ' ">"
state = 0
else
' trash the junk
state = 17
c$ = ""
end if
end if
case 12 ' <!
if c = 45 then state = 13 ' "-"
if c = 62 then state = 0 ' ">"
case 13 ' <!-
if c = 45 then state = 14 ' "-"
if c = 62 then state = 0 ' ">"
case 14 ' <!--
if c = 45 then state = 15 ' "-"
case 15 ' <!-- -
if c = 45 _ ' "-"
then state = 16 _
else state = 14
case 16 ' <!-- --
if c = 62 _ ' ">"
then state = 0 _
else state = 14
case 17 ' </tag ???
if c = 62 then ' ">"
state = 0
c$ = ""
end if
case 18 ' continue junking
select case c
case 47 : state = 9 ' "/"
case 62 : state = 0 ' ">"
case else
if (c <= 32) _ ' ws
or (c >= 65 and c <= 90) or (c >= 97 and c <= 122) _ ' alpha
then state = 3 _
else c$ = ""
end select
end select
if goodElem and goodAttr then tame$ = tame$ + c$
next i
code$ = tame$
end function
function tameHtmlTag( tag$ )
' Gets called for each HTML tag that <function tameHtml> parses from an element.
' Takes the HTML tag name and returns a boolean value indicating its "tameness."
' True (non-zero) means that the tag is "tame" and therefore safe to keep.
' False (zero) indicates an unsafe or inappropriate tag that should be removed.
' The list of unsafe tags below is editable, with just two rules:
' 1. At least one space must be on either side of a word.
' 2. All words must be in lower case.
tag$ = " " + tag$ + " "
if instr( _
" a applet base bgsound blink body button custom embed event form frame" + _
" frameset head html input label link marquee meta noframes noscript" + _
" object option param script select style title ", _
tag$) = 0 then tameHtmlTag = 1
end function
function tameHtmlAttrName( attr$ )
' Gets called for each HTML attribute that <function tameHtml> parses from an
' element. Takes the attribute name (usually before an equals sign, "=") and
' returns a boolean value indicating its "tameness."
' True (non-zero) indicates a safe attribute that should be kept.
' False (zero) indicates an unsafe attribute that should be removed.
tameHtmlAttrName = 1 ' assume it's safe
' all scripting events are prohibited
if left$(attr$, 2) = "on" then tameHtmlAttrName = 0
if instr( _
" style ", _
" " + attr$ + " ") then tameHtmlAttrName = 0
end function
sub tameHtmlAttrValue byref value$
' Gets called for each HTML attribute value (the string following the equals
' sign, "=") that <function tameHtml> parses from an element. Takes the string
' and, if necessary, modifies it to eliminate unsafe constructs.
lcval$ = lower$(value$)
i = instr(lcval$, "script:")
if i > 0 then
value$ = left$(value$, i + 5) + ":" + mid$(value$, i + 6)
else
i = instr(value$, ":")
if i > 1 then
if instr( _
":about:activex:aim:applet:callto:chrome:file:mailto:skype:ymsgr:", _
":" + left$(lcval$, i)) then
value$ = left$(value$, i - 1) + "&#58;" + mid$(value$, i + 1)
end if
end if
end if
end sub
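Since tameHtmlTag is meant to be edited, one possible customization is to turn its deny list into an allow list, so that only the tags you name survive. The following is a minimal sketch of that idea, not part of the posted code: the tag list is an assumption picked for the example, while the function name and return convention follow the routine above.

```basic
' Hypothetical allow-list variant of tameHtmlTag.
' The tags in allowed$ are only an example; edit them to suit your site.
print tameHtmlTag("b")       ' listed, so it is kept (nonzero)
print tameHtmlTag("script")  ' not listed, so it is removed (0)
end

function tameHtmlTag( tag$ )
    ' Same two rules as the deny list:
    ' 1. At least one space must be on either side of a word.
    ' 2. All words must be in lower case.
    allowed$ = " b blockquote br code em i li ol p pre strong u ul "
    if instr(allowed$, " " + tag$ + " ") > 0 then tameHtmlTag = 1
end function
```

An allow list tends to fail safe: a tag nobody thought about gets dropped rather than kept.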
_________________ Brent |
Brent Site Admin
Joined: 01 Jul 2005 Posts: 800
Posted: Jan 2nd, 2008, 8:35am Post subject: Re: [RC1] HTML Tamer |
Updated to Release Candidate 1. _________________ Brent |
BASICwebmaster Guest
Posted: Jan 3rd, 2008, 2:34pm Post subject: Re: [RC1] HTML Tamer |
Hey Brent,
This is pretty sweet. It'd be great if the tameHtmlAttrName() function also accepted the name of the tag, so that stricter filtering would be easier. An example use case is allowing the HREF attribute only on <a> tags. Something similar for the tameHtmlAttrValue subroutine would also be useful.
- Bill |
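A minimal sketch of what Bill describes might look like the following. The name tameHtmlTagAttr and its two-argument form are hypothetical and not part of RC1, and the sketch assumes "a" has been taken out of the unsafe-tag list in tameHtmlTag so that links survive at all.

```basic
' Hypothetical tag-aware attribute check (not part of RC1).
print tameHtmlTagAttr("a", "href")     ' href on <a>: kept (nonzero)
print tameHtmlTagAttr("span", "href")  ' href anywhere else: removed (0)
print tameHtmlTagAttr("a", "onclick")  ' scripting event: removed (0)
end

function tameHtmlTagAttr( tag$, attr$ )
    tameHtmlTagAttr = 1 ' assume it's safe
    ' all scripting events are prohibited on every tag
    if left$(attr$, 2) = "on" then tameHtmlTagAttr = 0
    if attr$ = "style" then tameHtmlTagAttr = 0
    ' href is allowed only on <a> elements
    if attr$ = "href" and tag$ <> "a" then tameHtmlTagAttr = 0
end function
```

tameHtml reuses token$ for attribute names, so the parser would also have to keep the tag name in a variable of its own before it could make this call.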
Brent Site Admin
Joined: 01 Jul 2005 Posts: 800
Posted: Jan 4th, 2008, 2:53am Post subject: Re: [RC1] HTML Tamer |
Hi Bill,
I had the same thought about two-thirds of the way through development. If no one finds any bugs in RC1, I'll probably add that functionality.
Thanks for checking it out. _________________ Brent |
Alyce Full Member
Joined: 04 Jul 2005 Posts: 91
Posted: Jan 4th, 2008, 1:26pm Post subject: Re: [RC1] HTML Tamer |
This would be a good program to include on the Run BASIC wiki, if you are so inclined.
http://runbasic.wikispaces.com/ _________________ - Alyce |
Brent Site Admin
Joined: 01 Jul 2005 Posts: 800
Posted: Jan 4th, 2008, 9:31pm Post subject: Re: [RC1] HTML Tamer |
Yes, Alyce, that is a good idea. I'll seriously consider it when the code has had enough time to stew. _________________ Brent |