.. _html:

======================================================================
The new HTML parser
======================================================================

:Maintainer: Jiri Techet <techet@gmail.com>

Introduction
---------------------------------------------------------------------

The old HTML parser was line-oriented and based on regular expression matching.
This brought several limitations, such as the inability to deal with tags
spanning multiple lines and the failure to respect HTML comments. In addition,
the speed of the parser depended on the number of regular expressions: the more
tag types were extracted, the more regular expressions were needed and the
slower the parser became. Finally, parsing of embedded JavaScript was very
limited; it was based on regular expressions and detected only function
declarations.

The new parser is hand-written, with separate lexical analysis (dividing
the input into tokens) and syntax analysis. The parser has been profiled and
optimized for speed, so it is one of the fastest parsers in Universal Ctags.
It handles HTML comments correctly and, in addition to the existing tags, it
also extracts ``<h1>``, ``<h2>`` and ``<h3>`` headings. It should be reasonably
simple to add new tag types.

Finally, the parser uses the new functionality of Universal Ctags to run another
parser on other languages embedded within a host language. This is used for
parsing JavaScript within ``<script>`` tags and CSS within ``<style>`` tags. This
simplifies the parser and generates much better results than having a simplified
JavaScript or CSS parser within the HTML parser. To run the JavaScript and CSS
parsers from the HTML parser, use the ``--extras=+g`` option.
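
To illustrate why separating lexical analysis from syntax analysis removes the
limitations of line-oriented matching, here is a minimal sketch in Python. It is
not the parser's actual implementation (which is part of Universal Ctags itself);
the names ``Token``, ``tokenize`` and ``headings`` are hypothetical, and the
tokenizer makes simplifying assumptions, for example that ``>`` never appears
inside an attribute value.

.. code-block:: python

    from typing import Iterator, NamedTuple


    class Token(NamedTuple):
        kind: str   # "comment", "open_tag", "close_tag" or "text"
        value: str  # tag name, comment body, or raw text
        line: int   # 1-based line on which the token starts


    def tokenize(source: str) -> Iterator[Token]:
        """Lexical analysis: split the input into tokens, ignoring line ends."""
        pos, line = 0, 1
        while pos < len(source):
            start = pos
            if source.startswith("<!--", pos):
                # A comment runs to the next "-->", however many lines that is.
                end = source.find("-->", pos + 4)
                end = len(source) if end == -1 else end
                token = Token("comment", source[pos + 4:end], line)
                pos = min(end + 3, len(source))
            elif source[pos] == "<":
                # A tag runs to the next ">", so it may span several lines.
                end = source.find(">", pos + 1)
                end = len(source) if end == -1 else end
                words = source[pos + 1:end].split()
                name = words[0].lower() if words else ""
                if name.startswith("/"):
                    token = Token("close_tag", name[1:], line)
                else:
                    token = Token("open_tag", name.rstrip("/"), line)
                pos = min(end + 1, len(source))
            else:
                # Everything up to the next "<" is plain text.
                end = source.find("<", pos)
                end = len(source) if end == -1 else end
                token = Token("text", source[pos:end], line)
                pos = end
            line += source.count("\n", start, pos)
            yield token


    def headings(source: str) -> list:
        """Syntax analysis: collect (tag, text, line) for h1-h3 headings."""
        found = []
        current = None  # (tag name, start line, collected text parts)
        for token in tokenize(source):
            if token.kind == "open_tag" and token.value in ("h1", "h2", "h3"):
                current = (token.value, token.line, [])
            elif token.kind == "text" and current is not None:
                current[2].append(token.value)
            elif (token.kind == "close_tag" and current is not None
                  and token.value == current[0]):
                # Collapse runs of whitespace, including newlines.
                text = " ".join("".join(current[2]).split())
                found.append((current[0], text, current[1]))
                current = None
        return found


    sample = """<!-- <h1>Hidden</h1> -->
    <h2
       class="title">Getting
    started</h2>
    """
    print(headings(sample))

Running the sketch prints ``[('h2', 'Getting started', 2)]``: the heading inside
the comment is skipped because the whole comment is a single token, and the
``<h2>`` tag is recognized even though it spans two lines.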