Character encodings in HTML
===Encoding detection algorithm===
An "encoding sniffing algorithm" is defined in the specification to determine the character encoding of the document based on multiple sources of input, including:
# Explicit user instruction
# An explicit meta tag within the first 1024 bytes of the document
# A [[byte order mark]] (BOM) within the first three bytes of the document
# The HTTP Content-Type or other transport layer information
# Analysis of the document bytes looking for specific sequences or ranges of byte values,<ref>{{cite web| url = http://www.w3.org/TR/html5/syntax.html#prescan-a-byte-stream-to-determine-its-encoding| title = HTML5 prescan a byte stream to determine its encoding}}</ref> and other tentative detection mechanisms.

Characters outside the printable ASCII range (32 to 126) may be displayed incorrectly if the document is served with an incorrect character encoding. This presents few problems for [[English language|English]]-speaking users, but other languages regularly (in some cases, always) require characters outside that range. In Chinese, Japanese, and Korean ([[CJK characters|CJK]]) language environments, where several different multi-byte encodings are in use, auto-detection is also often employed. Finally, browsers usually permit the user to manually override an ''incorrect'' charset label.

[[UTF-8]] has been the most common character encoding on the Web since 2008, in part because, as an encoding of [[Unicode]], it allows use of the same encoding for all languages.
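
The inputs to the sniffing algorithm listed above can be illustrated with a minimal Python sketch. It is not the specification's algorithm: the function name <code>sniff_encoding</code>, the precedence order (user instruction, BOM, transport layer, meta tag), the simplified regular expression standing in for the prescan, and the <code>windows-1252</code> fallback are all assumptions made for illustration, and byte-frequency analysis is omitted.

```python
import re
from typing import Optional, Tuple

# Byte order marks recognised by this sketch (UTF-8 uses three bytes,
# UTF-16 uses two).
BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
)

# Simplified stand-in for the meta prescan: finds charset=... inside a
# <meta> tag. The real prescan is a byte-level state machine.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def sniff_encoding(
    data: bytes,
    user_override: Optional[str] = None,
    transport_charset: Optional[str] = None,
    fallback: str = "windows-1252",  # assumed default, not from the article
) -> Tuple[str, str]:
    """Return (encoding label, source of the decision)."""
    if user_override:
        return user_override, "user override"
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, "byte order mark"
    if transport_charset:
        return transport_charset, "transport layer"
    # Only the first 1024 bytes are searched for a meta declaration.
    match = META_CHARSET.search(data[:1024])
    if match:
        return match.group(1).decode("ascii").lower(), "meta prescan"
    return fallback, "fallback"
```

For example, <code>sniff_encoding(b'<meta charset="Shift_JIS">')</code> yields the label <code>shift_jis</code> from the meta prescan, while a declaration that begins after byte 1024 is ignored and the fallback is used instead.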
<!--Sentence and source copied from [[UTF-8#Implementations and adoption]]:-->{{As of|2026|01}}, UTF-8 is used by 98.9% of web sites surveyed by W3Techs.<ref name=W3TechsWebEncoding>{{Cite web|url=https://w3techs.com/technologies/cross/character_encoding/ranking |title=Usage Survey of Character Encodings broken down by Ranking |website=W3Techs |language=en |date=January 2026 |access-date=2026-01-03}}</ref> [[UTF-16]] and [[UTF-32]], the other encodings of Unicode, are less widely used because they can be harder to handle in programming languages that assume a [[byte-oriented]] ASCII superset encoding, and because they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.

Successful viewing of a page is not necessarily an indication that its encoding is specified correctly. If the page's creator and reader both assume the same platform-specific character encoding, and the server does not send any identifying information, then the reader will nonetheless see the page as the creator intended, but readers on different platforms or with different native languages will not.
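
Both points can be reproduced in a few lines of Python. The following is an illustrative sketch only: the sample strings, the helper name <code>render</code>, and the choice of ISO-8859-1 as the "platform-specific" encoding are assumptions, not taken from any cited source.

```python
def render(raw: bytes, assumed: str) -> str:
    """Decode unlabeled page bytes using whatever encoding the reader assumes."""
    return raw.decode(assumed, errors="replace")


# An unlabeled page saved in the creator's platform encoding
# (ISO-8859-1 is assumed here purely as an example).
page = "naïve café".encode("iso-8859-1")

# A reader who assumes the same encoding sees the intended text ...
same_platform = render(page, "iso-8859-1")
# ... while a reader who assumes UTF-8 gets replacement characters.
other_platform = render(page, "utf-8")

print(same_platform)
print(other_platform)

# Storage cost of mostly-ASCII markup in the three Unicode encodings
# (little-endian variants are used so that no BOM is added).
markup = "<p>Hello</p>"
sizes = {enc: len(markup.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)
```

The first reader's output matches the original text even though nothing in the page declares its encoding, which is why a correct-looking page does not show that the encoding is specified correctly. The size comparison shows the ASCII-only markup taking two and four times as many bytes in UTF-16 and UTF-32 as in UTF-8.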