Soft hyphen

Article Id: WHEBN0000780587
Reproduction Date:

Title: Soft hyphen  
Author: World Heritage Encyclopedia
Language: English
Subject: Zero-width space, Unicode character property, Unicode, EBCDIC 500, Latin-1 Supplement (Unicode block)
Collection: Control Characters, Punctuation, Typography, Unicode Formatting Code Points, Whitespace
Publisher: World Heritage Encyclopedia

In computing and typesetting, a soft hyphen (ISO 8859: 0xAD, Unicode U+00AD soft hyphen, HTML: &#173; &shy;) or syllable hyphen (EBCDIC: 0xCA), abbreviated SHY, is a code point reserved in some coded character sets for the purpose of breaking words across lines by inserting visible hyphens. Two alternative ways of using the soft-hyphen character for this purpose have emerged, depending on whether the encoded text will be broken into lines by its recipient, or has already been preformatted by its originator.[1][2][3]


  • Text to be formatted by the recipient
  • Text preformatted by the originator
  • Encodings and definitions
  • Security issues
  • See also
  • References

Text to be formatted by the recipient

The use of SHY characters in text that will be broken into lines by the recipient is the application context considered by the post-1999 HTML and Unicode specifications, as well as some word-processing file formats. In this context, the soft hyphen may also be called a discretionary hyphen or optional hyphen. It serves as an invisible marker used to specify a place in text where a hyphenated break is allowed without forcing a line break in an inconvenient place if the text is re-flowed. It becomes visible only after word wrapping at the end of a line. The soft hyphen's Unicode semantics and HTML implementation are in many ways similar to Unicode's zero-width space.

To show the effect of a soft hyphen in HTML, the following words have been separated with soft hyphens:


On HTML browsers supporting soft hyphens, resizing the window will re-break the above text only at word boundaries, and insert a hyphen at the end of each line.
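
As a minimal sketch of how such markup might be produced, the following Python snippet joins word fragments with U+00AD and then writes each soft hyphen as the `&shy;` entity. The fragment list and the helper names `join_with_soft_hyphens` and `to_html_entities` are hypothetical, chosen only for illustration.

```python
SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def join_with_soft_hyphens(fragments):
    """Join fragments with invisible soft hyphens (permitted break points)."""
    return SHY.join(fragments)


def to_html_entities(text):
    """Write each literal soft hyphen as the &shy; entity so it is visible in markup."""
    return text.replace(SHY, "&shy;")


# Hypothetical fragments; real break points would come from a hyphenation dictionary.
fragments = ["hyphen", "ation", "oppor", "tunity"]
text = join_with_soft_hyphens(fragments)
html = "<p>" + to_html_entities(text) + "</p>"
print(html)  # <p>hyphen&shy;ation&shy;oppor&shy;tunity</p>
```

A browser that supports soft hyphens would render this paragraph as one unbroken word when it fits on a line, and would show a hyphen only at a position where it actually breaks the line.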

HTML 4 describes it as a "hyphenation hint", though it suggests that this interpretation is not universal:[4]

In HTML, there are two types of hyphens: the plain hyphen and the soft hyphen. The plain hyphen should be interpreted by a user agent as just another character. The soft hyphen tells the user agent where a line break can occur. Those browsers that interpret soft hyphens must observe the following semantics. If a line is broken at a soft hyphen, a hyphen character must be displayed at the end of the first line. If a line is not broken at a soft hyphen, the user agent must not display a hyphen character. For operations such as searching and sorting, the soft hyphen should always be ignored.
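
The last requirement, ignoring soft hyphens when searching and sorting, can be sketched in Python by stripping U+00AD before comparing. The helper names below are hypothetical, and the approach is one simple possibility rather than what any particular user agent does.

```python
SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def strip_shy(s):
    """Return s with every soft hyphen removed."""
    return s.replace(SHY, "")


def contains(haystack, needle):
    """Substring search that ignores soft hyphens on both sides."""
    return strip_shy(needle) in strip_shy(haystack)


def sort_ignoring_shy(items):
    """Sort strings as if they contained no soft hyphens."""
    return sorted(items, key=strip_shy)


# A naive search misses the match because U+00AD sits inside the word.
print("phenat" in "hy\u00adphen\u00adation")          # False
print(contains("hy\u00adphen\u00adation", "phenat"))  # True

# A naive sort puts "az" first because U+00AD (173) compares greater than "z" (122).
items = ["az", "a\u00adb"]
print(sorted(items) == ["az", "a\u00adb"])             # True
print(sort_ignoring_shy(items) == ["a\u00adb", "az"])  # True
```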

Text preformatted by the originator

The SHY character is also used in text where paragraphs have already been broken into lines, such as certain plain text files, text sent to VT100-style terminal emulators or printers, or pages represented in page description languages. This is the application context originally considered by the EBCDIC and ISO 8859-1 standards and implemented in many VT100 terminal emulators.[1][2]

Here, SHY is a visible hyphen that is usually visually indistinguishable from a regular hyphen, but has been inserted solely for the purpose of line breaking. Encoding it as a separate character distinguishes it from any regular hyphen that is part of the original spelling of the word. This distinction helps when reusing already formatted text, where the line breaks and soft hyphens inserted during word wrapping have to be removed to convert the text back into its unformatted form. For example, the copy-and-paste function of a terminal emulator can offer to replace line breaks with a space character, and to remove any soft hyphens together with any immediately following whitespace characters.
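
The unwrapping step described above can be sketched in Python as follows. The function name `unwrap` and the sample text are hypothetical. The sketch assumes that every hyphen inserted by word wrapping is encoded as U+00AD and that regular hyphens are part of the original spelling.

```python
import re

SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def unwrap(text):
    """Convert preformatted text back into a single unformatted line."""
    # Remove each soft hyphen together with any whitespace (including the
    # line break) that immediately follows it.
    text = re.sub(SHY + r"\s*", "", text)
    # Replace every remaining line break (and surrounding whitespace) with one space.
    return re.sub(r"\s*\n\s*", " ", text).strip()


wrapped = "The well-known hyphen\u00ad\nation rule\napplies.\n"
print(unwrap(wrapped))  # The well-known hyphenation rule applies.
```

The regular hyphen in "well-known" survives, while the soft hyphen that split "hyphenation" across two lines is dropped along with its line break.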

An example application that outputs soft hyphens for this reason is the groff text formatter as used on many Unix/Linux systems to display man pages.

Encodings and definitions

SHY characters in coded character sets, roughly in chronological order:

  • EBCDIC placed a SHY character (known there as a "syllable hyphen") at position 202 (0xCA hexadecimal).[1][5] IBM defined its purpose as a "hyphen used to divide a word at the end of a line [that] may be removed when a program adjusts lines."[6]
  • ISO 8859-1:1986 (Latin 1) inherited SHY from EBCDIC, but called it "soft hyphen", placed it at position 0xAD (hexadecimal), and stated its purpose as "for use when a line break has been established within a word". Other ISO 8859 parts placed it at the same position, with the exception of ISO 8859-11 (Latin/Thai), which lacks it.
  • IBM code page 850 (an MS-DOS character set covering all ISO 8859-1 characters) placed it at position 240 = 0xF0.
  • SGML's "Numeric and Special Graphic" (isonum) character entity set (ISO 8879:1986) includes "&shy;" for the ISO 8859-1 soft hyphen.
  • Unicode 1.0 (1991) and ISO 10646 (1993) took the first 256 code positions from ISO 8859-1, placing SHY at code point U+00AD.
  • HTML 2 (1995) incorporated the "&shy;" character entity from SGML, but explicitly discouraged its use.
  • HTML 4 (1999) redefined the purpose of the character as marking a hyphenation opportunity, which only becomes visible as a hyphen at the end of a line after formatting.
  • Unicode 4.0 (2003) changed the general category of its SHY character from "Pd" (punctuation, dash) to "Cf" (other, format), thereby aligning its interpretation of the character with that of HTML 4.
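
The code positions listed above can be checked with a short Python sketch. The codec names (`cp500`, `latin-1`, `cp850`, `iso8859_11`) are Python's own names for these character sets. Using `cp500` as a representative EBCDIC code page is an assumption made here for illustration, and `encode_or_none` is a hypothetical helper.

```python
import unicodedata

SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def encode_or_none(ch, codec):
    """Encode ch with the given codec, or return None if the codec lacks it."""
    try:
        return ch.encode(codec)
    except UnicodeEncodeError:
        return None


# cp500: an EBCDIC code page (0xCA); latin-1: ISO 8859-1 (0xAD);
# cp850: IBM code page 850 (0xF0); iso8859_11: Latin/Thai, which lacks SHY.
for codec in ("cp500", "latin-1", "cp850", "iso8859_11"):
    print(codec, encode_or_none(SHY, codec))

# Name and general category as recorded in Python's Unicode database.
print(unicodedata.name(SHY), unicodedata.category(SHY))
```

The category reported depends on the Unicode version bundled with the interpreter. Current Python releases ship a database newer than Unicode 4.0, so they report "Cf".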

Other commands for marking hyphenation opportunities in text formatting languages (similar to the HTML 4 and Unicode 4.0 interpretation of SHY):

  • TeX and LaTeX: \-
  • troff: \%

Security issues

Soft hyphens have been used to obscure malicious domains or URLs in e-mail spam.[8][9]
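
Because U+00AD is normally invisible, a filter has to look for the code point itself. The following Python sketch shows one possible check. The function names and the sample URL are hypothetical, and the sketch does not describe how any particular mail filter works.

```python
SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def find_soft_hyphens(url):
    """Return the index of every soft hyphen hidden in the string."""
    return [i for i, ch in enumerate(url) if ch == SHY]


def strip_soft_hyphens(url):
    """Return the string as it would read with the soft hyphens removed."""
    return url.replace(SHY, "")


# Hypothetical obfuscated URL: it displays like an ordinary address, but a
# plain string comparison against a blocklist entry fails.
suspect = "http://exam\u00adple.com/"
print(suspect == "http://example.com/")  # False
print(find_soft_hyphens(suspect))        # [11]
print(strip_soft_hyphens(suspect))       # http://example.com/
```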

See also

References

  1. Jukka Korpela (revision as of January 2011). "Soft hyphen (SHY) – a hard problem?".
  2.
  3. Eric Muller (2002-08-14). "Yes, SOFT HYPHEN is a hard problem".
  4. "9.3.3 Hyphenation". HTML 4.01 Specification.
  5. "Extended Binary-Coded Decimal Interchange Code - S/390". Retrieved 2011-04-08.
  6. "Glossary".
  7. "Commonly Confused Characters". Greg Baker.
  8. "Spammers Using Soft Hyphen To Hide Malicious URLs".
  9. "Soft Hyphen – A New URL Obfuscation Technique".