How many languages are enough?

2011.04.10

For an international project like this, I’d like to attract as wide an audience as possible so that I can investigate the course of vocabulary acquisition and compare that course across different native languages (L1). Certainly the course would be different for language learners whose L1s are cognate with the target language (L2) they are studying. And for learners whose L1s are not cognate with it, there are probably further differences depending on factors such as the degree of similarity between the two languages’ phonemic inventories, morphological typologies, and even writing systems. Even the prevalence of L1 loan words in active use in the target L2 may affect the course of vocabulary acquisition.

Unfortunately, there are thousands of languages, and asking each user to scroll through a list that long just to tell us their L1 would be unreasonable. I have seen such lists in active use, though, and I laugh when I see entries such as Middle English (1100–1500), which were probably created by blindly copying every entry from some online list of all languages. There must be a better balance between complete coverage of the languages likely to be spoken by people who want to test their vocabulary and a list short enough for users to scroll through to find their L1.

An Incomplete Compromise

What I’ve decided to do, as with most decisions in this project, is accept the Pareto principle and try to get the biggest bang for my buck. So I’ve found estimates of the total number of speakers of each language on Ethnologue and included any language with at least 20 million speakers. That covers 75% of the world, so it’s a reasonable cut-off point in the spirit of the Pareto principle even though it ignores more than one billion people in the world. Here’s a list of the most commonly spoken languages:

Table Of Most Widely–Spoken Languages By Population

Columns: Ethnologue rank of L1 population, language in native script(s), language in English, ISO 639-3 code
1 普通话 / 國語 / 华语 Chinese — Mandarin cmn
1 吴语 Chinese — Wu wuu
1 粵語 / 广东话 Chinese — Yue yue
1 閩南語 / 闽南语 Chinese — Min Nan nan
1 晋语 Chinese — Jin cjy
1 湘语 / 湖南话 Chinese — Xiang hsn
1 客家話 / 客家话 Chinese — Hakka hak
1 赣语 / 江西话 Chinese — Gan gan
2 Español / castellano Spanish spa
3 English English eng
4 العربية Arabic ara
5 हिन्दी Hindi hin
6 বাংলা Bengali ben
7 Português Portuguese por
8 Русский язык Russian rus
9 日本語 Japanese jpn
10 Deutsch German deu
11 basa Jawa Javanese jav
13 తెలుగు Telugu tel
14 tiếng Việt Vietnamese vie
15 मराठी Marathi mar
16 Français French fra
17 한국어 Korean kor
18 தமிழ் Tamil tam
19 Italiano Italian ita
20 اردو Urdu urd
21 Türkçe Turkish tur
22 ગુજરાતી Gujarati guj
23 język polski Polish pol
24 Bahasa Melayu Malay msa
25 भोजपुरी Bhojpuri bho
26 अवधी Awadhi awa
27 українська мова Ukrainian ukr
28 മലയാളം Malayalam mal
29 ಕನ್ನಡ Kannada kan
30 मैथिली / মৈথিলী Maithili mai
31 Bahasa Sunda / ᮘᮞ ᮞᮥᮔ᮪ᮓ Sunda sun
32 မြန်မာဘာသာ Burmese mya
33 ଓଡ଼ିଆ Oriya ori
34 دربار‎ / فارسی / پارسی / ‎тоҷикӣ Persian fas
35 मारवाड़ी Marwari mwr
36 پنجابی / ਪੰਜਾਬੀ / पंजाबी Punjabi pan
37 Filipino Filipino fil
38 هَوْسَ Hausa hau
39 Tagalog Tagalog tgl
40 română Romanian ron
41 Bahasa Indonesia Indonesian ind
42 Nederlands Dutch nld
43 سنڌي / سندھی / सिन्धी Sindhi snd
44 ภาษาไทย Thai tha
45 پښتو Pushto pus
46 أۇزبېك / O‘zbek / Ўзбек Uzbek uzb
47 راجستھانی / राजस्थानी Rajasthani raj
48 èdèe Yorùbá Yoruba yor
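
The 20-million-speaker cut-off described above can be sketched in a few lines of Python. This is only an illustration: the language names and speaker counts below are invented placeholders, not Ethnologue figures, and `select_languages` is a hypothetical helper rather than code used on this site.

```python
from typing import Dict, List, Tuple

# Placeholder speaker counts in millions, invented for illustration only.
SPEAKERS_MILLIONS: Dict[str, float] = {
    "lang_a": 800.0,
    "lang_b": 300.0,
    "lang_c": 45.0,
    "lang_d": 19.0,
    "lang_e": 0.3,
}


def select_languages(
    speakers: Dict[str, float], cutoff: float = 20.0
) -> Tuple[List[str], float]:
    """Return the languages at or above the cut-off (largest first)
    and the share of all listed speakers that they cover."""
    kept = sorted(
        (name for name, count in speakers.items() if count >= cutoff),
        key=lambda name: speakers[name],
        reverse=True,
    )
    total = sum(speakers.values())
    covered = sum(speakers[name] for name in kept)
    return kept, (covered / total if total else 0.0)


kept, coverage = select_languages(SPEAKERS_MILLIONS)
print(kept, f"{coverage:.1%}")
```

Raising or lowering `cutoff` shows how quickly the list grows relative to the extra coverage gained, which is the trade-off the Pareto principle is meant to capture.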

Missing Some Online Communities

The problem is that some languages are disproportionately represented on the internet for various historical reasons. Many languages with a relatively small number of native speakers, such as Icelandic and Norwegian, have a high percentage of their speakers online, creating and consuming content. So, do I include them because practically all of their speakers are online, or exclude them because, in the big picture, their particular course of vocabulary acquisition is likely to be drowned out by much larger language groups? Including them is probably the safer choice, even if their speakers are still statistically less likely to test their vocabulary here.

So, where can I find a list of languages that would give better coverage of internet users’ L1s? Perhaps LibreOffice’s list of available localisations? It’s an impressive list, and since LibreOffice is an open-source project, a language’s presence on it implies that there are at least a few speakers of that language online. But is the list too long? How long would it take to find your L1 in it?

Maybe a shorter list would be more appropriate. Unfortunately, there are no reliable estimates of language use on the internet. Global Reach and Internet World Stats both provide some estimates, but they neither explain their methodologies in detail nor update their estimates regularly. Nevertheless, for want of alternatives, their estimates are widely cited.

An alternative measure that might prove fruitful is the number of pages in Wikipedia written in each language. Of course, there’s more to any such measure than just the total number of pages in Wikipedia.

Look at Waray–Waray, for example. There are fewer than 10,000 Waray–Waray–speaking Wikipedia users but more than 100,000 articles; more articles than Thai, Greek, Hindi, and Cantonese. But few would argue that the size of the Waray–Waray–speaking population would justify inclusion in the list if we really are accepting the Pareto principle. So these raw numbers must be tempered by the total number of users, number of edits, number of images, number of admins, and perhaps even externally sourced statistics if possible. This doesn’t directly reflect the total amount of online content available in these languages, but because Wikipedia aims to be the world’s foremost repository of encyclopædic information, it might represent at least the degree to which each language group perceives the utility of the internet and, by extension, their likelihood of using it to determine something about themselves, such as testing their own vocabulary sizes at a website like this.
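
As a rough sketch of what such a weighting might look like, the snippet below combines several per-wiki counts into a geometric mean, so that one inflated figure such as the article count cannot dominate. The geometric mean is just one possible choice, assumed here for illustration; the function name `activity_score` and all of the figures are hypothetical placeholders, not real Wikipedia statistics.

```python
import math
from typing import Dict


def activity_score(stats: Dict[str, int]) -> float:
    """Geometric mean of the counts in `stats`.

    A single very large count raises the score far less than it would
    raise an arithmetic mean or a plain article total.
    """
    if not stats:
        raise ValueError("stats must contain at least one count")
    values = [max(value, 1) for value in stats.values()]  # avoid log(0)
    return math.exp(sum(math.log(value) for value in values) / len(values))


# Placeholder figures, invented for illustration only.
wikis = {
    "many_articles_few_users": {
        "articles": 100_000, "users": 5_000, "edits": 500_000, "admins": 3,
    },
    "fewer_articles_many_users": {
        "articles": 60_000, "users": 150_000, "edits": 4_000_000, "admins": 20,
    },
}

ranked = sorted(wikis, key=lambda name: activity_score(wikis[name]), reverse=True)
print(ranked)
```

With these placeholder numbers, the wiki with fewer articles but far more users and edits ranks first, which is the behaviour one would want from any formula of this kind.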

Even if a suitable formula could be found to evaluate the Wikipedia statistics, the list may still not be acceptable. The use of Wikipedia is dependent on culture. 百度百科 (Baidu Baike) and 互动在线 (Hudong), for example, are Chinese–language collaborative encyclopædias which are both much bigger than the Chinese version of Wikipedia, so the quantity of Wikipedia articles would not be a fair measurement in that case. I’m sure there are other examples.

A Better Compromise

For lack of a better alternative, I’ve settled for the results of two relatively old studies which are mostly in agreement. Mas Hernàndez’s (2003) list of online content ordered by content language overlaps considerably with Guinovart’s (2003) list. These languages can be used to supplement the data from Ethnologue. As learners use the site, I’ll pay particular attention to the numbers of people coming from countries where the following languages are spoken. I’ll also take into account any comments or requests for these languages. Here’s what I’ve come up with so far.

Table Of Languages Added Based On User Requests And Site Usage

Columns: Ethnologue rank of L1 population, language in native script(s), language in English, ISO 639-3 code
81 čeština Czech ces
124 suomi Finnish fin
eesti keel Estonian est

Table Of Languages Under Consideration For Addition

Columns: Ethnologue rank of L1 population, language in native script(s), language in English, ISO 639-3 code
55 hrvatski Croatian hrv
68 ελληνικά Greek ell
73 magyar Hungarian hun
75 català Catalan cat
83 Български език Bulgarian bul
86 беларуская мова Belarusian bel
88 svenska Swedish swe
123 slovenčina / slovenský jazyk Slovak slk
118 dansk Danish dan
121 עִבְרִית Hebrew heb
126 Afrikaans Afrikaans afr
132 norsk Norwegian nor
160 Galego Galician glg
162 lietuvių kalba Lithuanian lit
íslenska Icelandic isl
latviešu valoda Latvian lav
slovenski jezik / slovenščina Slovene slv

Table Of Borderline Languages
(Possibly Too Few Speakers Online To Justify Inclusion)

Columns: Ethnologue rank of L1 population, language in native script(s), language in English, ISO 639-3 code
113 shqip Albanian sqi
Euskara Basque eus
føroyskt Faroese fao
Frysk Frisian
Cymraeg / y Gymraeg Welsh cym

So, that should cover most speakers and most internet users. Is your L1 not mentioned here? Did you feel disappointed when you couldn’t find your L1 after testing your vocab size on this site? If so, please let us know!

Update — One Year Later

Out of 15,938 results, only 964 users selected “Other” as their L1, which represents about 6% of all completed tests. That’s not too bad. Of course, it’s also possible that some learners whose true L1 was missing from the list chose a closely related language instead, so we can’t be absolutely sure of these results. Looking at the L1s reported by users who measured their vocabulary size on this site, it’s interesting to note that the Ethnologue ranking of languages by number of speakers isn’t tightly correlated with the ranking of users taking the vocabulary tests. This supports my initial suspicion that some other measures of internet usage must be combined with language population sizes to ensure that most users’ L1s will be listed. No list will ever be perfect though, so if you want to test your vocab on this site and your L1 is not listed as an option at the end, please let us know so that we can consider adding it.

Here’s what the results look like so far:

Table Of Users’ Reported L1s

Columns: completed tests, Ethnologue rank of L1 population, language in native script(s), language in English, ISO 639-3 code
4835 3 English English eng
3001 9 日本語 Japanese jpn
1126 1 普通话 / 國語 / 华语 Chinese — Mandarin cmn
964 Other—not listed
803 8 Русский язык Russian rus
720 2 Español / castellano Spanish spa
463 4 العربية Arabic ara
419 41 Bahasa Indonesia Indonesian ind
407 14 tiếng Việt Vietnamese vie
286 44 ภาษาไทย Thai tha
283 10 Deutsch German deu
218 21 Türkçe Turkish tur
164 1 粵語 / 广东话 Chinese — Yue yue
153 16 Français French fra
152 23 język polski Polish pol
148 37 Filipino Filipino fil
138 34 دربار‎ / فارسی / پارسی / ‎тоҷикӣ Persian fas
136 5 हिन्दी Hindi hin
129 42 Nederlands Dutch nld
124 7 Português Portuguese por
122 81 čeština Czech ces
121 20 اردو Urdu urd
98 39 Tagalog Tagalog tgl
88 17 한국어 Korean kor
80 19 Italiano Italian ita
74 18 தமிழ் Tamil tam
64 124 suomi Finnish fin
60 40 română Romanian ron
60 27 українська мова Ukrainian ukr
48 13 తెలుగు Telugu tel
36 24 Bahasa Melayu Malay msa
34 6 বাংলা Bengali ben
25 1 閩南語 / 闽南语 Chinese — Min Nan nan
24 1 客家話 / 客家话 Chinese — Hakka hak
24 eesti keel Estonian est
24 1 吴语 Chinese — Wu wuu
23 32 မြန်မာဘာသာ Burmese mya
22 28 മലയാളം Malayalam mal
20 36 پنجابی / ਪੰਜਾਬੀ / पंजाबी Punjabi pan
19 1 赣语 / 江西话 Chinese — Gan gan
18 1 湘语 / 湖南话 Chinese — Xiang hsn
16 15 मराठी Marathi mar
12 11 basa Jawa Javanese jav
11 29 ಕನ್ನಡ Kannada kan
11 1 晋语 Chinese — Jin cjy
11 22 ગુજરાતી Gujarati guj
9 43 سنڌي / سندھی / सिन्धी Sindhi snd
8 48 èdèe Yorùbá Yoruba yor
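
To put a number on how loosely the two rankings agree, here is a minimal sketch that computes the share of “Other” responses and a Spearman rank correlation for the ten listed languages with the most completed tests. The figures are copied from the table above; the helper names `positions` and `spearman_rho` are hypothetical, and the simple formula used assumes there are no tied values, which holds for these ten rows.

```python
from typing import List, Sequence, Tuple

# (language, Ethnologue rank, completed tests), copied from the table above.
ROWS: List[Tuple[str, int, int]] = [
    ("English", 3, 4835),
    ("Japanese", 9, 3001),
    ("Chinese (Mandarin)", 1, 1126),
    ("Russian", 8, 803),
    ("Spanish", 2, 720),
    ("Arabic", 4, 463),
    ("Indonesian", 41, 419),
    ("Vietnamese", 14, 407),
    ("Thai", 44, 286),
    ("German", 10, 283),
]


def positions(values: Sequence[float], descending: bool = False) -> List[int]:
    """1-based rank position of each value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x_pos: Sequence[int], y_pos: Sequence[int]) -> float:
    """Spearman's rho from two lists of rank positions (no ties)."""
    n = len(x_pos)
    d_squared = sum((a - b) ** 2 for a, b in zip(x_pos, y_pos))
    return 1 - 6 * d_squared / (n * (n * n - 1))


other_share = 964 / 15938  # users who selected "Other"

# Position 1 = most speakers, and position 1 = most completed tests.
by_speakers = positions([row[1] for row in ROWS])
by_tests = positions([row[2] for row in ROWS], descending=True)
rho = spearman_rho(by_speakers, by_tests)

print(f"Other: {other_share:.1%}, Spearman rho: {rho:.2f}")
```

For these ten rows the correlation comes out at roughly 0.68: the two rankings are related, but far from identical. Extending the calculation to every row would need a tie-aware ranking, since the Chinese varieties all share Ethnologue rank 1 and several languages have the same number of completed tests.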
If you have any comments, questions, or ideas, please contact us.