Comparison of HTML parsers

HTML parsers are software for automated Hypertext Markup Language (HTML) parsing. They have two main purposes:

  • HTML traversal: offering programmers an interface to access and modify the HTML source easily. Canonical example: DOM parsers.
  • HTML cleaning: fixing invalid HTML and improving the layout and indent style of the resulting markup. Canonical example: HTML Tidy.
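
The traversal purpose can be illustrated with a minimal sketch. It assumes Python's standard-library `html.parser` module, which is not one of the parsers compared here and is event-driven rather than DOM-based; the names `LinkCollector` and `collect_links` are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased;
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# The second <a> is never closed; the start tag is still reported.
sample = '<p>See <A HREF="https://example.org/a">one</a> and <a href="/b">two</p>'
links = collect_links(sample)
```

A DOM parser would instead build a tree that can be queried and modified after parsing; the callback style above only observes tags as they stream past.
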

| Parser | License | Implementation language(s) | Latest date* | HTML parsing | HTML5-compliant parsing | Clean HTML** | Update HTML*** |
|---|---|---|---|---|---|---|---|
| HTML Tidy | W3C license | ANSI C | 2021-07-17 | Yes | Yes | Yes | Yes |
| HtmlUnit | Apache License 2.0 | Java | 2023-10-31 | Yes | ? | No | No |
| Beautiful Soup | MIT License | Python | 2023-04-07 | Yes | Yes | ? | No |
| jsoup | MIT License | Java | 2025-04-29 | Yes | Yes | Yes | Yes |

* Date of the latest release with significant changes.
** Sanitize (generate standards-compliant markup, reduce spam, etc.) and clean (strip surplus presentational tags, remove XSS code, etc.) HTML code.
*** Update HTML 4.x to XHTML or HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
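
The CENTER-to-DIV conversion named in the last note can be sketched as a small re-serializing filter. This is a toy under stated assumptions, not how HTML Tidy or jsoup perform the update: it uses Python's standard-library `html.parser`, the names `CenterRewriter` and `rewrite_center` are hypothetical, any attributes on the original CENTER tag are dropped, and processing instructions are ignored.

```python
from html import escape
from html.parser import HTMLParser


class CenterRewriter(HTMLParser):
    """Re-emits the input markup, replacing <center> with a styled <div>."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _start(self, tag, attrs, closer=">"):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # boolean attribute such as "disabled"
                parts.append(name)
            else:  # values arrive unescaped, so escape them again
                parts.append(f'{name}="{escape(value, quote=True)}"')
        return "<" + " ".join(parts) + closer

    def handle_starttag(self, tag, attrs):
        if tag == "center":
            # Simplification: attributes on <center> are discarded.
            self.out.append('<div style="text-align:center;">')
        else:
            self.out.append(self._start(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._start(tag, attrs, closer=" />"))

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def rewrite_center(html):
    """Return html with every CENTER element turned into a centered DIV."""
    rewriter = CenterRewriter()
    rewriter.feed(html)
    rewriter.close()
    return "".join(rewriter.out)


updated = rewrite_center("<CENTER>Hi &amp; <b>bye</b></CENTER>")
```

Because the parser lowercases tag names, the output is also normalized to lowercase tags; a production tool would additionally repair nesting and preserve attributes.
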
