.. _html:

======================================================================
The new HTML parser
======================================================================

:Maintainer: Jiri Techet <techet@gmail.com>

Introduction
---------------------------------------------------------------------

The old HTML parser was line-oriented and based on regular expression matching.
This imposed several limitations: it could not handle tags spanning multiple
lines, and it did not respect HTML comments. In addition, the speed of the
parser depended on the number of regular expressions: the more tag types were
extracted, the more regular expressions were needed and the slower the parser
became. Finally, parsing of embedded JavaScript was very limited; it was also
based on regular expressions and detected only function declarations.

The new parser is hand-written, with separate lexical analysis (dividing
the input into tokens) and syntax analysis. The parser has been profiled and
optimized for speed, so it is one of the fastest parsers in Universal Ctags.
It handles HTML comments correctly and, in addition to the existing tags, it
also extracts <h1>, <h2> and <h3> headings. It should be reasonably simple to
add new tag types.

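The split between lexical analysis and syntax analysis can be illustrated with
a short Python sketch. This is not the code of the actual parser, which is part
of the Universal Ctags sources; the names ``tokenize``, ``extract_headings`` and
``HEADINGS`` are hypothetical, and the sketch assumes that attribute values
never contain a ``>`` character.

```python
from typing import Iterator, List, Tuple

Token = Tuple[str, str]  # (kind, value); kind is "open", "close" or "text"

HEADINGS = {"h1", "h2", "h3"}


def tokenize(html: str) -> Iterator[Token]:
    """Lexical analysis: split the input into tag and text tokens."""
    pos = 0
    while pos < len(html):
        if html.startswith("<!--", pos):
            # Skip the whole comment, even if it spans several lines.
            end = html.find("-->", pos + 4)
            pos = len(html) if end == -1 else end + 3
        elif html[pos] == "<":
            end = html.find(">", pos)
            if end == -1:
                break  # unterminated tag: stop tokenizing
            body = html[pos + 1:end].strip()
            kind = "close" if body.startswith("/") else "open"
            # Keep only the tag name; attributes are ignored in this sketch.
            parts = body.strip("/").split()
            yield kind, (parts[0].lower() if parts else "")
            pos = end + 1
        else:
            end = html.find("<", pos)
            end = len(html) if end == -1 else end
            yield "text", html[pos:end]
            pos = end


def extract_headings(html: str) -> List[Tuple[str, str]]:
    """Syntax analysis: collect (tag name, heading text) pairs."""
    tags: List[Tuple[str, str]] = []
    current = None           # heading that is currently open, if any
    parts: List[str] = []    # text collected inside that heading
    for kind, value in tokenize(html):
        if kind == "open" and value in HEADINGS:
            current, parts = value, []
        elif kind == "close" and value == current:
            # Collapse runs of whitespace, including newlines.
            tags.append((current, " ".join("".join(parts).split())))
            current = None
        elif kind == "text" and current is not None:
            parts.append(value)
    return tags
```

Because the tokenizer works on the whole input rather than line by line, a tag
that spans several lines and a heading hidden inside a comment are both handled
without special cases in ``extract_headings``.
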
Finally, the parser uses the new Universal Ctags facility for running another
parser on other languages embedded within a host language. This is used for
parsing JavaScript within <script> tags and CSS within <style> tags. This
simplifies the parser and generates much better results than having a simplified
JavaScript or CSS parser within the HTML parser. To run the JavaScript and CSS
parsers from the HTML parser, use the ``--extras=+g`` option.

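As a usage illustration, the following Python sketch builds and runs a command
line that passes ``--extras=+g`` for a single HTML file. The helper names
``guest_ctags_command`` and ``run_guest_ctags`` are hypothetical, and the sketch
assumes that the ``ctags`` executable found on ``PATH`` is Universal Ctags;
other ctags implementations may reject the option.

```python
import shutil
import subprocess
from typing import List, Optional


def guest_ctags_command(html_path: str, ctags: str = "ctags") -> List[str]:
    """Build an argv list that enables guest parsers for one HTML file."""
    # --extras=+g is the option named in the text above.
    # "-f -" writes the tags to standard output instead of a tags file.
    return [ctags, "--extras=+g", "-f", "-", html_path]


def run_guest_ctags(html_path: str) -> Optional[str]:
    """Run ctags if it is on PATH; return its output, or None otherwise."""
    cmd = guest_ctags_command(html_path)
    if shutil.which(cmd[0]) is None:
        return None  # ctags is not installed
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Calling ``run_guest_ctags("index.html")`` returns the generated tag lines as a
string, or ``None`` when no ``ctags`` executable is available.
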