===Encoding detection algorithm===
The HTML5 specification defines an "encoding sniffing algorithm" that determines the character encoding of a document from several sources of input, including:

# Explicit user instruction
# An explicit meta tag within the first 1024 bytes of the document
# A [[byte order mark]] (BOM) within the first three bytes of the document
# The HTTP Content-Type or other transport layer information
# Analysis of the document bytes looking for specific sequences or ranges of byte values,<ref>{{cite web| url = http://www.w3.org/TR/html5/syntax.html#prescan-a-byte-stream-to-determine-its-encoding| title = HTML5 prescan a byte stream to determine its encoding}}</ref> and other tentative detection mechanisms.
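The way these sources can be combined is illustrated by the following minimal Python sketch. It is not the algorithm from the specification: the function name <code>sniff_encoding</code>, its parameters, the precedence chosen here (user override, BOM, HTTP header, meta tag, fallback), the simplified regular expressions, and the <code>windows-1252</code> fallback are all assumptions made for illustration, and byte-frequency analysis is omitted entirely.

```python
import re

# Byte order marks recognised by this sketch (UTF-8 and the two UTF-16 byte orders).
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# Simplified patterns; a real prescan handles comments, attributes and quoting
# far more carefully than these regular expressions do.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)
HEADER_CHARSET = re.compile(
    r"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def sniff_encoding(body, content_type=None, user_override=None,
                   fallback="windows-1252"):
    """Return a lower-cased encoding label guessed for the bytes in `body`."""
    # 1. Explicit user instruction.
    if user_override:
        return user_override.lower()
    # 2. Byte order mark at the very start of the document.
    for bom, name in BOMS:
        if body.startswith(bom):
            return name
    # 3. Transport-layer information, e.g. "text/html; charset=ISO-8859-1".
    if content_type:
        match = HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    # 4. A meta declaration within the first 1024 bytes.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 5. Nothing found: fall back to an assumed default.
    return fallback


print(sniff_encoding(b'<meta charset="Shift_JIS"><title>x</title>'))
```

A declaration placed after the first 1024 bytes is ignored by this sketch, which mirrors the limit given in the list above.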

Characters outside the printable ASCII range (32 to 126) may be displayed incorrectly if the document is served with an incorrect character encoding. This presents few problems for [[English language|English]]-speaking users, but other languages regularly (in some cases, always) require characters outside that range. In Chinese, Japanese, and Korean ([[CJK characters|CJK]]) language environments, where several different multi-byte encodings are in use, auto-detection is also often employed. Finally, browsers usually permit the user to override an incorrect charset label manually.
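The effect of an incorrect label can be reproduced with a short Python sketch. The helper name <code>misread</code> and the choice of UTF-8 bytes interpreted as windows-1252 are assumptions made for illustration; other encoding pairs produce different garbling.

```python
def misread(text, actual="utf-8", assumed="windows-1252"):
    """Encode text with its real encoding, then decode it with the assumed one."""
    return text.encode(actual).decode(assumed, errors="replace")


print(misread("Hello"))  # ASCII-only text is unaffected
print(misread("café"))   # the non-ASCII character is garbled
```

Text limited to the ASCII range survives because both encodings map those byte values to the same characters, while the two bytes that UTF-8 uses for "é" are shown as two separate characters.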

[[UTF-8]] has been the most common character encoding on the Web since 2008, in part because, as an encoding of [[Unicode]], it allows use of the same encoding for all languages. <!--Sentence and source copied from [[UTF-8#Implementations and adoption]]:-->{{As of|2026|01}}, UTF-8 is used by 98.9% of web sites surveyed by W3Techs.<ref name=W3TechsWebEncoding>{{Cite web|url=https://w3techs.com/technologies/cross/character_encoding/ranking |title=Usage Survey of Character Encodings broken down by Ranking |website=W3Techs |language=en |date=January 2026 |access-date=2026-01-03}}</ref> [[UTF-16]] and [[UTF-32]], the other encodings of Unicode, are less widely used because they can be harder to handle in programming languages that assume a [[byte-oriented]] ASCII superset encoding, and because they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.
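The efficiency difference can be seen by encoding a small fragment of markup in each form. The following Python sketch is illustrative only: the sample fragment and the helper name <code>encoded_sizes</code> are assumptions, and the little-endian variants are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(text):
    """Return the number of bytes needed for text in three Unicode encodings."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}


# 17 ASCII characters of markup around 3 Japanese characters.
fragment = '<p lang="ja">日本語</p>'
print(encoded_sizes(fragment))
# {'utf-8': 26, 'utf-16-le': 40, 'utf-32-le': 80}
```

Even though the visible text here is entirely Japanese, the ASCII markup around it makes UTF-8 the most compact of the three for this fragment.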

Successful viewing of a page is not necessarily an indication that its encoding is specified correctly. If the page's creator and reader both assume the same platform-specific character encoding, and the server does not send any identifying information, the reader will nonetheless see the page as the creator intended. Readers on different platforms or with different native languages, whose software assumes a different default, will not.