# There is a publicly editable copy of this file at
# http://www.mediawiki.org/wiki/AntiSpoof/Equivalence_sets
# This is the input file for generateEquivset.php
# The format is:
#
# <hexadecimal codepoint> <character> => [<hexadecimal codepoint>] <character>
#
# If the codepoint is given, it must match the character, or else a warning
# will be issued and the line will be ignored.
#
# The effect of such a line is to conflate the two identified characters, i.e.
# to put them in the same set. If two sets share a member, then they will be
# merged into a single larger set.
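#
# As an illustration only (these two lines are hypothetical examples and are
# not necessarily present in the data below), consider:
#
#   41 A => 61 a
#   410 А => 41 A
#
# The first line would put Latin "A" and "a" in one set. The second would put
# Cyrillic "А" and Latin "A" in another. Because both sets contain Latin "A",
# they would be merged into the single set { A, a, А }.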
#
# We have attempted to include the following types of equivalence:
# * Case folding. Although letters of different cases are often visually
# distinct, they can easily be confused by people who are familiar with
# the alphabet. Two words with a different case may be read as the same
# word. This is a popular technique for impersonation.
#
# * Visually similar characters. Cross-script pairs are included, but these
# tend to produce false conflations within scripts, and so should be
# avoided. The software implements a blanket restriction against cross-
# script strings, which makes cross-script pairs mostly redundant.
#
# * Chinese Simplified/Traditional pairs.
#
# The list is based on one by Neil Harris, which was derived by unknown methods.
# That list also contained transliteration pairs, which we considered excessive
# and have attempted to remove. For example, the Latin E and H were considered
# equivalent, because the Latin transliteration of the Cyrillic "Н" (which
# looks like Latin H) is "E".