
Homograph Attacks: The Lookalike-Domain Trick

Some characters from non-Latin scripts look identical to ordinary letters. Attackers use them to register domains that read like a trusted brand but point somewhere else entirely. Here is how the trick works, and how to catch it.

In short

Domain names can include letters from many scripts, not just the English alphabet. A few of those letters are visually identical to Latin ones, so a domain like аpple.com (with a Cyrillic first letter) can look exactly like the real thing while encoding to a completely different address. Your browser converts these internationalized names into an ASCII form called punycode, which starts with xn--, and that encoded form is the honest version to check. Plenty of internationalized domains are perfectly legitimate, so a mixed-script name is a reason to look closer, not a verdict on its own.

Domains were not always limited to English letters

The original domain name system only allowed a narrow set of ASCII characters: the letters a through z, the digits 0 through 9, and the hyphen. That left out most of the world's languages. To fix this, the internet adopted internationalized domain names, or IDNs, which let people register names in scripts like Cyrillic, Greek, Arabic, Chinese, and many others. This is genuinely useful: a business in Athens or Moscow can have a web address that reads naturally in its own alphabet.

The catch is that some characters across different scripts are drawn almost identically. The Latin lowercase a and the Cyrillic lowercase а (the character at code point U+0430) look the same in most fonts, but to a computer they are entirely separate characters. The same goes for the Greek omicron and the Latin o, or the Cyrillic е and the Latin e. Characters that look alike but are not the same are called confusables or homoglyphs, and they are the raw material of a homograph attack.
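A quick way to see the difference is to ask for each character's code point and official Unicode name. The sketch below uses only Python's standard unicodedata module; the variable names are illustrative.

```python
import unicodedata

latin_a = "a"           # U+0061
cyrillic_a = "\u0430"   # renders like "a" in most fonts

for ch in (latin_a, cyrillic_a):
    # Print the code point and the official Unicode character name.
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# The two strings compare as different even though they look the same.
print(latin_a == cyrillic_a)
```

The first line reports LATIN SMALL LETTER A, the second CYRILLIC SMALL LETTER A, and the comparison prints False.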

How attackers turn lookalikes into fake domains

The idea is simple. An attacker takes the name of a real brand and swaps one or more letters for a confusable from another script, then registers that domain. To a person glancing at the address bar or an email link, the result can be indistinguishable from the genuine site. Click through, and you may land on a convincing copy of a login page that quietly harvests whatever you type.

A classic demonstration replaced every letter of a well-known brand with Cyrillic lookalikes, producing a domain that rendered identically to the original in several browsers of the day. The visible text matched, but the underlying characters, and therefore the actual destination, did not. Because the swap can involve a single character, these names are easy to miss and easy to mass-produce.

It is worth being clear about the limits here. Browsers and registries have added defenses over the years, and many lookalike registrations get caught. But the technique still surfaces in phishing campaigns, so it remains worth understanding rather than dismissing.

Punycode: the honest ASCII form

Underneath, the domain name system still speaks only ASCII. So whenever a name contains non-ASCII characters, it gets translated into a plain-ASCII representation called punycode, using an algorithm defined in RFC 3492. The encoded label always begins with the prefix xn-- and is known as the A-label. The human-readable version you see on screen is the U-label.

For example, a domain that displays as аpple.com with a Cyrillic first letter does not travel the network as those Unicode characters. It is encoded to something like xn--pple-43d.com before any lookup happens. That xn-- form is the key to spotting trouble: if a domain looks like a familiar brand but its punycode form is an unexpected xn-- string, that is a strong reason to slow down. A genuine all-ASCII brand domain never needs a punycode form at all.
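The encoding can be reproduced with Python's built-in idna codec. This is a minimal sketch: the helper name to_a_label is made up for illustration, and the built-in codec follows the older IDNA 2003 rules, so a production check may want a dedicated IDNA 2008 library instead.

```python
def to_a_label(domain: str) -> str:
    """Return the ASCII (A-label) form of a domain via Python's built-in codec."""
    return domain.encode("idna").decode("ascii")

spoof = "\u0430pple.com"   # Cyrillic U+0430 followed by Latin "pple.com"
real = "apple.com"         # plain ASCII

print(to_a_label(spoof))   # xn--pple-43d.com
print(to_a_label(real))    # apple.com (unchanged, no xn-- form needed)
```

The all-ASCII name passes through untouched, while the lookalike turns into an xn-- string that no longer resembles the brand.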

The conversion itself is not the attack. It exists because the domain name system needs ASCII, but it doubles as a safeguard: when a name looks suspicious, a browser can display the xn-- form in the address bar instead of the Unicode form. That gives you a way to see the real, unambiguous identity of a name that might otherwise be disguised.

How the standards reason about confusables

The Unicode Consortium maintains a technical standard, Unicode Technical Standard #39 (UTS 39, "Unicode Security Mechanisms"), that deals directly with this problem. One of its core ideas is the skeleton: a way of reducing a string to a canonical form by mapping each confusable character to a representative one. If two different strings collapse to the same skeleton, they are confusable with each other. UTS 39 also describes mixed-script detection, the practice of flagging a single label that draws characters from more than one script. That is unusual for a legitimate word in most languages, although some writing systems, Japanese for example, combine scripts routinely.

Security tools, including the lookalike detection built into our own email and domain checks, lean on these ideas. They normalize a candidate domain, compare its skeleton against a list of known brands, and surface anything that resolves to a suspiciously similar shape. The standard gives this a documented, repeatable basis rather than a hunch.
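The skeleton-and-compare idea can be sketched in a few lines. Everything named here is an assumption for illustration: the mapping is a tiny hand-picked subset rather than the full confusables table that ships with UTS 39, the brand list is hypothetical, and the skeleton function is a simplification of the standard's procedure, not a description of how any particular product implements it.

```python
import unicodedata

# Hypothetical, hand-picked subset for illustration only.
# The real confusables data published with UTS 39 is far larger.
CONFUSABLE_TO_LATIN = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}

KNOWN_BRANDS = {"apple", "example"}  # hypothetical watch list

def skeleton(label: str) -> str:
    """Simplified skeleton: lowercase, NFD-normalize, map confusables to Latin."""
    decomposed = unicodedata.normalize("NFD", label.lower())
    return "".join(CONFUSABLE_TO_LATIN.get(ch, ch) for ch in decomposed)

def shadowed_brand(label: str):
    """Return the brand a label imitates, or None if it is genuine or unrelated."""
    label = label.lower()
    if label in KNOWN_BRANDS:
        return None  # the real brand is not a lookalike of itself
    candidate = skeleton(label)
    return candidate if candidate in KNOWN_BRANDS else None

print(shadowed_brand("\u0430pple"))  # apple
print(shadowed_brand("apple"))       # None
print(shadowed_brand("bakery"))      # None
```

A match here is only a prompt to look closer. A label that happens to share a skeleton with a brand is not automatically malicious.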

Warning signs worth a closer look

None of the following proves a domain is malicious, but each is a reason to verify before you trust it:

Mixed scripts in one label. A single word that combines, say, Latin and Cyrillic letters is rarely how a real brand spells its name. Most legitimate words stay within one script.
An unexpected xn-- encoding. A domain that reads like a plain English brand but encodes to a punycode A-label is a contradiction worth resolving.
Invisible or zero-width characters. Some code points carry no visible glyph (zero-width joiners and spaces, for example) and can be slipped into a name to make it differ from the real one without looking different. UTS 39 treats characters like these as restricted in identifiers for exactly this reason.
A brand name where you would not expect internationalization. A multinational with a long-standing ASCII domain has little reason to suddenly appear under a non-Latin lookalike in an email link.
Context that pushes urgency. Lookalike domains tend to arrive in messages that want you to act fast, which discourages the second look that would expose them.
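The first three signs in the list above are mechanical enough to check in code. The following is a rough sketch under stated assumptions: Python's standard library has no script property, so script_of guesses the script from the first word of the Unicode character name, which is only an approximation of the mixed-script rules in UTS 39. Both function names are hypothetical.

```python
import unicodedata

def script_of(ch: str) -> str:
    """Rough script guess: the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]

def warning_signs(label: str) -> list:
    """Return reasons to look closer at one domain label (not a verdict)."""
    signs = []
    if label.lower().startswith("xn--"):
        signs.append("punycode A-label")
    # Category Cf (format) includes zero-width joiners and zero-width spaces.
    if any(unicodedata.category(ch) == "Cf" for ch in label):
        signs.append("invisible character")
    # Only letters are considered, so digits and hyphens do not count as a script.
    scripts = {script_of(ch) for ch in label if ch.isalpha()}
    if len(scripts) > 1:
        signs.append("mixed scripts: " + ", ".join(sorted(scripts)))
    return signs

print(warning_signs("apple"))         # []
print(warning_signs("\u0430pple"))    # ['mixed scripts: CYRILLIC, LATIN']
print(warning_signs("app\u200dle"))   # ['invisible character']
print(warning_signs("xn--pple-43d"))  # ['punycode A-label']
```

Because the script guess is name-based, it will also flag legitimate labels in writing systems that combine scripts, which is one more reason to treat the output as a cue to verify rather than a block decision.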

Practical defenses

You do not need special software to defend against this. A few habits go a long way:

Check the punycode form. When a domain looks important and even slightly unfamiliar, convert it to its xn-- representation and see whether that matches what you expect. An ASCII brand should have no punycode form.
Hover before you click. In email and chat, hovering over a link reveals the real destination. Read the whole domain, not just the part that looks familiar at the start.
Treat unexpected IDNs in email links with suspicion. A link claiming to be your bank or your employer should resolve to the domain you already know. If it does not, navigate there yourself by typing the address you trust.
Type the address or use a saved bookmark for sensitive sites. For anything involving credentials or money, reaching the site through your own bookmark sidesteps the lookalike entirely.
Lean on built-in protections, but do not rely on them alone. Modern browsers display the punycode form for many high-risk mixed-script names, yet coverage varies, so your own check is still the last line of defense.
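For anyone who prefers to script the check, the habits above boil down to one comparison: take the host from the link, convert it to its ASCII form, and see whether it is the domain you already trust. This is a minimal sketch using Python's standard library. The function names and the example URLs are hypothetical, and the built-in idna codec follows the older IDNA 2003 rules.

```python
from urllib.parse import urlsplit

def ascii_host(url: str) -> str:
    """Extract the host from a URL and return its ASCII (A-label) form."""
    host = urlsplit(url).hostname or ""
    return host.encode("idna").decode("ascii")

def points_to(url: str, trusted_domain: str) -> bool:
    """True only if the link's host is the trusted domain or one of its subdomains."""
    host = ascii_host(url)
    return host == trusted_domain or host.endswith("." + trusted_domain)

print(points_to("https://www.apple.com/login", "apple.com"))     # True
print(points_to("https://\u0430pple.com/login", "apple.com"))    # False
print(points_to("https://apple.com.example.net/", "apple.com"))  # False
```

The third case covers the advice to read the whole domain: a trusted name at the start of a longer host does not make the link trustworthy.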

Keep it honest: not every IDN is an attack

This is the part that is easy to get wrong. Internationalized domains exist because the web is global, and the overwhelming majority of them are completely legitimate. A bakery in Berlin, a news site in Seoul, or a shop in Cairo may all have perfectly genuine non-Latin domains. Seeing a non-ASCII character or an xn-- form does not mean fraud. Mixed script and unexpected punycode are flags that say "verify this", not conclusions that say "block this". The skill is matching the script to the context: a non-Latin domain for a local business in that language is ordinary, while the same trick aimed at a global ASCII brand is the one to question.

Inspect a domain for lookalikes

Paste any domain to see its punycode (xn--) form, spot mixed scripts and confusable characters, and check whether it shadows a known brand. Runs entirely in your browser.


This guide is educational and reflects publicly available information about internationalized domain names, the Punycode standard, and Unicode's confusable-character guidance. It is not legal advice or a recommendation about any specific domain, email, person, or decision. Security and access decisions should follow your organization's policies and applicable law.