Language selection

Page history last edited by Carl Morris 15 years, 1 month ago



Carl Morris


In 140 characters:


Language selection is microsyntax to indicate language used in a tweet. I suggest underscore _ followed by language code.




Many users switch between two or more languages. A user could indicate the language used in a tweet so that any follower can filter on it. In this sense, a language a follower does not understand is potentially "noise", and the aim is to boost the signal-to-noise ratio of a tweetstream.


If someone's use of Twitter is monolingual, it's no problem - language is already filtered in the choice to follow or not follow. For example, I do not follow someone who uses 100% Japanese because I do not understand Japanese right now (maybe someday!).


Bilingual users have a Venn diagram of followers. One category is those who want to read language A but filter out language B. The second category is the reverse. The overlapping category is bilingual followers who want to read both A and B. The same holds for trilingual users, with a three-circle Venn diagram. Users switch between languages frequently.


So there is a need at the client level to filter tweets - for the client software to display only the tweets which are understood by a follower.


The proposal is an escape character followed by a language code. Underscore would be a good escape character. It's available on most if not all platforms. It works in search and doesn't appear to be used as microsyntax for anything else. It's also language-independent.
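As a rough sketch of how a client might recognise the short form of the tag, the Python below pulls two- or three-letter codes that follow an underscore at the start of a word. The name find_language_codes and the exact regular expression are assumptions made for illustration; the proposal itself does not fix a grammar.

```python
import re
from typing import List

# Assumed pattern: an underscore that starts a word, then 2 or 3 ASCII letters,
# not followed by another ASCII letter. This is one possible reading of the
# proposal, not a specification.
LANG_TAG = re.compile(r"(?<!\S)_([A-Za-z]{2,3})(?![A-Za-z])")


def find_language_codes(tweet: str) -> List[str]:
    """Return the lower-cased language codes tagged in a tweet, in order."""
    return [code.lower() for code in LANG_TAG.findall(tweet)]
```

Requiring the underscore to start a word keeps identifiers such as snake_case from being read as tags.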


ISO 639-1 and ISO 639-2 are code conventions for representing languages, already widely used on the web. For example, en represents English and ja represents Japanese. ISO 639-1 codes are two letters long and ISO 639-2 codes are three. They are also language-independent. For example, the code es means Spanish because Spanish speakers call it Español, yet es signifies it in any language.


(Please note that I have chosen language code deliberately here, not country code. A common mistake in localization is to assume a language-related mapping based on country or on geolocation. There isn't a one-to-one mapping from country code onto language. For example, the official languages of Switzerland are French, German, Italian and Romansh, not to mention the many Swiss people who use English.)


The filtering could be achieved in third-party applications, with a "filter out this language" option applied to tweets that contain underscore language codes.
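A "filter out this language" option could look something like the following minimal sketch. The function name filter_tweets, the muted set and the tag pattern are all hypothetical. It also assumes that untagged tweets are always shown, and that a tweet carrying several tags is hidden only when every one of them is muted; the proposal does not settle either point.

```python
import re
from typing import Iterable, List, Set

# Assumed tag pattern: underscore at the start of a word plus a 2-3 letter code.
TAG = re.compile(r"(?<!\S)_([A-Za-z]{2,3})(?![A-Za-z])")


def filter_tweets(tweets: Iterable[str], muted: Set[str]) -> List[str]:
    """Drop tweets whose language tags are all muted; keep everything else."""
    kept = []
    for tweet in tweets:
        codes = {code.lower() for code in TAG.findall(tweet)}
        # Hide only when the tweet is tagged and every tag is muted.
        if codes and codes <= muted:
            continue
        kept.append(tweet)
    return kept
```

For example, filter_tweets(timeline, {"cy"}) would hide tweets tagged only _cy and leave both English and untagged tweets in place.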


The short form of the tag is mainly intended to be machine-readable. Human-readable versions are possible, such as _English or _Cymraeg, where the application could ignore the trailing letters, which are redundant albeit human-friendly.
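One way an application might ignore the trailing letters is to test the first two, then the first three, letters of the tag against a table of known codes. This is only a sketch: KNOWN_CODES is a tiny illustrative subset, normalise_tag is a hypothetical name, and preferring the two-letter match is an assumption.

```python
import re
from typing import Optional

# Tiny illustrative subset; a real client would load the full ISO 639 tables.
KNOWN_CODES = {"en", "cy", "ja", "es", "fr", "de"}

# An underscore at the start of a word, followed by a run of letters.
WORD_TAG = re.compile(r"(?<!\S)_([^\W\d_]+)")


def normalise_tag(tweet: str) -> Optional[str]:
    """Return the code for the first underscore tag, ignoring trailing letters."""
    match = WORD_TAG.search(tweet)
    if match is None:
        return None
    word = match.group(1).lower()
    # Try the two-letter prefix first, then the three-letter one (assumption).
    for length in (2, 3):
        if word[:length] in KNOWN_CODES:
            return word[:length]
    return None
```

A prefix rule like this is permissive: with the table above, a word such as _essay would be read as es. A client might want to restrict the long form to a list of recognised language names.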


The way an underscore tag adds meaning in relation to other tags is interesting. Certain assumptions COULD be made by an application:

  • some hashtags could be associated with underscore tags, e.g. #eisteddfod09 refers to an event conducted in Cymraeg, and a recent search for #eisteddfod09 gives results entirely in Cymraeg http://www.flickr.com/photos/carlmorris/3641148477/ (screen grab of search). Just as #swineflu = #aporkalypse was suggested in tag synonyms, the tweet "#eisteddfod09 _cy" would indicate that there is a link between the hashtag and the language.


  • an @ reply to a tweet with an underscore tag is probably in the same language - if you didn't understand the original, you won't understand the @ reply
  • an @ reply involves two people who would have their own established convention for language use which persists (e.g. I always reply to my mother in English - but I would reply to my colleague in Cymraeg) so any given underscore tag for an @ "relationship" could persist
  • in addition to the previous two points, however, if someone @ replied you directly then @ should override any language preference(s)
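The @ reply assumptions above could be combined into a single display decision, sketched below. The Tweet class, its fields and should_display are hypothetical names. The sketch covers only two of the points: a reply inherits the language of the tweet it answers, and a direct @ reply to the reader overrides the filter. It does not model per-relationship conventions, and it assumes untagged tweets are shown.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Tweet:
    text: str
    lang: Optional[str] = None         # code from the tweet's own underscore tag
    parent_lang: Optional[str] = None  # code of the tweet being replied to


def should_display(tweet: Tweet, reader: str, understood: Set[str]) -> bool:
    """Decide whether a tweet is shown to a reader who understands some languages."""
    words = tweet.text.split()
    # A direct @ reply to the reader overrides any language preference.
    if words and words[0].rstrip(":,").lower() == "@" + reader.lower():
        return True
    # Otherwise use the tweet's own tag, falling back to the parent's language.
    lang = tweet.lang or tweet.parent_lang
    # Assumption: tweets with no known language are shown.
    return lang is None or lang in understood
```

Here should_display(Tweet("@carl sut wyt ti?", lang="cy"), "carl", {"en"}) is True because the tweet addresses the reader directly, while a reply to someone else in a _cy thread would be hidden for the same reader.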


Existing Implementation:


On Twitter, a bilingual user can use different accounts for different languages. But this adds potential confusion and effort. Followers who want to read languages A and B (as defined above) have to follow both accounts (assuming they know about them).


Language recognition algorithms are another option for client-side filtering. But these algorithms are only reliable for majority languages such as English, French, German, Spanish and Japanese. There is a "long tail" of many languages for which language recognition is under-developed or non-existent.


Alternative Suggestions:


As an alternative, hashtags could be used for this, but the hashtag name space is already crowded for two-letter and three-letter codes. A distinct tag convention would be useful.


Examples:


This message is in English but I'm on vacation in Switzerland with my Japanese friend. _en


This message is in _En but I'm on vacation in Switzerland with my Japanese friend.


This message is in _English but I'm on vacation in Switzerland with my Japanese friend.


This tweet isn't about language at all, it's actually about antelopes. Mmm, antelopes, aren't they sleek. _en


Dw i'n sgwennu fy neges efo'r peth arbennig a dw i'n edrych ymlaen i #eisteddfod09! _cy


_ja 最近流動食気味だなー、固形物食べられるけど食べる気が起こらない。よくないね。
