.. _html:

======================================================================
The new HTML parser
======================================================================

:Maintainer: Jiri Techet <techet@gmail.com>

Introduction
---------------------------------------------------------------------

The old HTML parser was line-oriented and based on regular expression matching.
This brought several limitations, such as the inability to deal with tags
spanning multiple lines and the failure to respect HTML comments. In addition,
the speed of the parser depended on the number of regular expressions: the more
tag types were extracted, the more regular expressions were needed and the
slower the parser became. Finally, parsing of embedded JavaScript was very
limited, based on regular expressions and detecting only function declarations.

The new parser is hand-written, using separate lexical analysis (dividing
the input into tokens) and syntax analysis. The parser has been profiled and
optimized for speed, so it is one of the fastest parsers in Universal Ctags.
It handles HTML comments correctly and, in addition to the existing tags, it
also extracts <h1>, <h2> and <h3> headings. It should be reasonably simple to
add new tag types.

Finally, the parser uses the new functionality of Universal Ctags to run another
parser for other languages embedded within a host language. This is used for
parsing JavaScript within <script> tags and CSS within <style> tags. This
simplifies the parser and generates much better results than having a simplified
JavaScript or CSS parser within the HTML parser. To run the JavaScript and CSS
parsers from the HTML parser, use the `--extras=+g` option.
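
The separation of lexical and syntax analysis described in the introduction can be illustrated with a minimal sketch. This is not the code of the Universal Ctags HTML parser, which is written in C; the names ``tokenize``, ``extract_headings`` and ``HEADINGS`` are hypothetical and chosen only for this example. The sketch assumes that attribute values never contain ``>`` and ignores many details of real HTML. It shows why a tokenizer working on the whole input, rather than on single lines, copes with tags spanning multiple lines and with commented-out markup.

```python
from typing import Iterator, List, Tuple

Token = Tuple[str, str]  # (kind, value); kind is comment, open, close or text

# Hypothetical set of heading elements to report as tags.
HEADINGS = {"h1", "h2", "h3"}


def tokenize(html: str) -> Iterator[Token]:
    """Lexical analysis: split the whole input, not single lines, into tokens."""
    pos = 0
    while pos < len(html):
        if html.startswith("<!--", pos):
            # A comment is a single token, so markup inside it is never seen.
            end = html.find("-->", pos + 4)
            end = len(html) if end == -1 else end + 3
            yield ("comment", html[pos:end])
            pos = end
        elif html[pos] == "<":
            end = html.find(">", pos)
            if end == -1:
                # Unterminated tag: treat the rest of the input as text.
                yield ("text", html[pos:])
                return
            body = html[pos + 1:end].strip()
            if body.startswith("/"):
                yield ("close", body[1:].strip().lower())
            else:
                # split() treats newlines as whitespace, so a tag may span
                # several lines.
                parts = body.split()
                name = parts[0] if parts else ""
                yield ("open", name.rstrip("/").lower())
            pos = end + 1
        else:
            end = html.find("<", pos)
            end = len(html) if end == -1 else end
            yield ("text", html[pos:end])
            pos = end


def extract_headings(html: str) -> List[Tuple[str, str]]:
    """Syntax analysis: walk the token stream and collect (element, text)."""
    tags: List[Tuple[str, str]] = []
    current = None  # heading element currently open, if any
    parts: List[str] = []
    for kind, value in tokenize(html):
        if kind == "open" and value in HEADINGS:
            current, parts = value, []
        elif kind == "close" and value == current:
            # Collapse runs of whitespace, including newlines, to one space.
            tags.append((current, " ".join("".join(parts).split())))
            current = None
        elif kind == "text" and current is not None:
            parts.append(value)
        # Comments and all other elements are ignored.
    return tags
```

Calling ``extract_headings`` on input that contains a commented-out ``<h1>`` and a real ``<h1>`` whose opening tag is split across two lines reports only the real heading. In this design, supporting a further tag type would mean adding another element name to the set checked by the syntax-analysis step, while the tokenizer stays unchanged.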