Twitter was set up to support 140 characters. And in the English alphabet, that's easy to
understand: a character is a letter, number, space or punctuation mark. People more or
less agree with computers there. And if it was twenty years ago, that's exactly how the
system would work. That far, no further.
But now, we have Unicode.
Mind you, it's still fairly straightforward in some languages. In East Asian languages, for
example -- Chinese, Japanese, Korean -- "one character" is a glyph, a number, a space,
or a punctuation mark. But since those languages are denser -- each of these characters encodes
more information than an English character -- you can fit almost twice as much information
into each tweet.
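To make that concrete, here is a minimal sketch of the difference between counting characters and counting bytes. It assumes the limit is counted in Unicode code points, which is how Python's len() treats a string; the function names are made up for illustration and are not Twitter's actual code.

```python
def count_code_points(text: str) -> int:
    """Count Unicode code points, which is what len() gives for a Python str."""
    return len(text)


def count_utf8_bytes(text: str) -> int:
    """Count the bytes the same text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


english = "tweet"      # five ASCII letters
japanese = "日本語"     # three characters, meaning "Japanese language"

for sample in (english, japanese):
    print(sample, count_code_points(sample), count_utf8_bytes(sample))
```

The English sample has the same count either way, while each character in the Japanese sample is one code point but three UTF-8 bytes. A limit counted in code points is what lets the denser script carry more per tweet.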
And then, it gets complicated.
Take Arabic, for example. What counts as an Arabic letter? First of all, the shapes of
Arabic letters change significantly depending on where they are in a word. Watch what happens
as I take the Arabic for "Arabic alphabet", and hit backspace. Arabic's right to left,
remember. The characters change in order to be consistent with the rules of the written
language, and the diacritics disappear separately to the letters they're next to.
In Vietnamese, on the other hand? Each letter, diacritics and all, counts as one character.
Backspace, and away they go.
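One way to see why the two behave differently is to look at the code points underneath. This is only a sketch using Python's standard unicodedata module: the sample letters are chosen for illustration, the Vietnamese letter is assumed to be stored in its precomposed form, and what backspace actually deletes is up to the text editor, not Unicode.

```python
import unicodedata


def describe(text: str) -> list[tuple[str, str, int]]:
    """List each code point with its name and combining class (0 = not a combining mark)."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "?"), unicodedata.combining(ch))
        for ch in text
    ]


# Arabic letter beh followed by a fatha: the diacritic is its own code point.
arabic_beh_fatha = "\u0628\u064E"

# Vietnamese e with circumflex and dot below, stored as one precomposed code point.
vietnamese_e = "\u1EC7"

# The same Vietnamese letter, canonically decomposed into a base letter plus marks.
decomposed = unicodedata.normalize("NFD", vietnamese_e)

for sample in (arabic_beh_fatha, vietnamese_e, decomposed):
    print(len(sample), describe(sample))
```

The Arabic pair is two code points, so a counter that works in code points sees two characters. The Vietnamese letter is one code point in precomposed form, but the decomposed spelling of the very same letter is three, which is why software usually normalises text before counting it.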
It's at this point that most British programmers, myself included, throw up their hands in defeat
and just use existing code by some other generous soul who's already worked the problem out.
Or if they're lazy, they just say, well, no-one's going to use this who doesn't speak English,
so we don't need to worry about it.
(MOUTHS) Yes you do.
Hmm. Unicode has a single character for some English ligatures, like "ffi" - notice how
the letters there are smushed together to make them look better to the eye. Some programs
will automatically add those in for you. So you copy and paste your text from that into
Twitter, and suddenly you're saving characters.
People would count that as three characters. Unicode, and therefore Twitter, and pretty
much every computer program? Just one. The greatest example of this I could find is the
Arabic for "peace be upon him". Unicode has a single character for this, and Twitter will
treat it as counting for just 1 of your 140. Which is handy, if you're a devout Muslim
and want to talk about the prophets on Twitter.
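Both of those ligatures can be checked with Python's standard unicodedata module. This is a sketch that assumes characters are counted as code points with no compatibility expansion; the constant and function names are invented for the example.

```python
import unicodedata

FFI_LIGATURE = "\uFB03"   # LATIN SMALL LIGATURE FFI
PBUH_LIGATURE = "\uFDFA"  # ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM


def expanded(text: str) -> str:
    """Apply compatibility normalisation (NFKC), which spells ligatures out in full."""
    return unicodedata.normalize("NFKC", text)


for ligature in (FFI_LIGATURE, PBUH_LIGATURE):
    print(
        unicodedata.name(ligature),
        "| code points as stored:", len(ligature),
        "| after NFKC:", len(expanded(ligature)),
    )
```

Each ligature is a single code point as stored. NFKC turns the first back into the three letters "ffi" and the second into the full Arabic phrase, so a counter that applied that normalisation first would give a very different total.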
So. What counts as a character? Well, it's complicated. Computers see things differently
to people. And let's be honest: unless you have a professor who's setting their essays
by character count instead of word count, the only time it'll really matter for most people...
is when they're trying to tweet.