.. _html:

======================================================================
The new HTML parser
======================================================================

:Maintainer: Jiri Techet <techet@gmail.com>

Introduction
---------------------------------------------------------------------

The old HTML parser was line-oriented and based on regular expression matching.
This imposed several limitations: the parser could not handle tags spanning
multiple lines, and it did not respect HTML comments. In addition, the speed of
the parser depended on the number of regular expressions: the more tag types
were extracted, the more regular expressions were needed and the slower the
parser became. Finally, parsing of embedded JavaScript was very limited; it was
also based on regular expressions and detected only function declarations.

The new parser is hand-written, with separate lexical analysis (dividing the
input into tokens) and syntax analysis. The parser has been profiled and
optimized for speed, so it is one of the fastest parsers in Universal Ctags.
It handles HTML comments correctly and, in addition to the existing tags, it
also extracts <h1>, <h2> and <h3> headings. It should be reasonably simple to
add new tag types.

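The split between lexical and syntax analysis can be illustrated with a
minimal sketch. It is written in Python for brevity and is not the parser's
actual C code; the names ``tokenize`` and ``extract_headings`` are hypothetical
and chosen only for this example. It shows why a token-based approach copes
with multi-line tags and HTML comments where a line-oriented one does not.

.. code-block:: python

    from typing import Iterator, List, Tuple

    Token = Tuple[str, str]  # (kind, value); kind: comment, open, close, text

    def tokenize(html: str) -> Iterator[Token]:
        """Lexical analysis: split the input into tokens, ignoring line breaks."""
        pos = 0
        while pos < len(html):
            if html.startswith("<!--", pos):
                end = html.find("-->", pos + 4)
                end = len(html) if end == -1 else end + 3
                yield ("comment", html[pos:end])
            elif html[pos] == "<" and html.find(">", pos) != -1:
                # Simplification: a ">" inside an attribute value ends the tag.
                end = html.find(">", pos) + 1
                body = html[pos + 1:end - 1].strip()
                kind = "close" if body.startswith("/") else "open"
                parts = body.strip("/").split()
                yield (kind, parts[0].lower() if parts else "")
            else:
                # Plain text up to the next "<" (or the rest of the input).
                end = html.find("<", pos + 1)
                end = len(html) if end == -1 else end
                yield ("text", html[pos:end])
            pos = end

    HEADINGS = {"h1", "h2", "h3"}

    def extract_headings(html: str) -> List[Tuple[str, str]]:
        """Syntax analysis: collect (tag, text) pairs for h1-h3 headings."""
        headings: List[Tuple[str, str]] = []
        current = None
        parts: List[str] = []
        for kind, value in tokenize(html):
            if kind == "open" and current is None and value in HEADINGS:
                current, parts = value, []
            elif kind == "close" and value == current:
                # Collapse runs of whitespace, including newlines.
                headings.append((current, " ".join("".join(parts).split())))
                current = None
            elif kind == "text" and current is not None:
                parts.append(value)
            # Comment tokens are never inspected, so headings inside are skipped.
        return headings

Because the syntax stage sees only tokens, a heading whose opening tag or text
spans several lines is handled the same way as a single-line one, and
supporting another tag type in this sketch only means adding it to ``HEADINGS``.
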
Finally, the parser uses the new Universal Ctags facility for running another
parser on other languages embedded within a host language. This is used for
parsing JavaScript within <script> tags and CSS within <style> tags. This
simplifies the parser and generates much better results than having a simplified
JavaScript or CSS parser within the HTML parser. To run the JavaScript and CSS
parsers from the HTML parser, use the ``--extras=+g`` option.