How many languages are enough?
For an international project like this, I’d like to attract as wide an audience as possible to investigate the course of vocabulary acquisition and compare that course across different native languages (L1). Certainly the course would be different for language learners whose L1s are cognate with the target language (L2) they are studying. And for those that are not cognate, there are probably differences among them as well depending on factors such as the degree of similarity between the two languages’ phonemic inventories, morphological typologies, and even writing systems. Even the prevalence of L1 loan words in active use in the target L2 may affect the course of vocabulary acquisition.
Unfortunately, there are thousands of languages, and asking each user to scroll through a list that long when telling us about their L1 would be unconscionable. I have seen such lists in active use, though, and I laugh when I see entries such as Middle English (1100–1500), which were probably created by blindly copying some list of all languages found on the internet. There must be a better balance between complete coverage of the languages likely to be spoken by people who want to test their vocabulary and a list short enough for users to scroll through to find their L1.
An Incomplete Compromise
What I’ve decided to do, as with most decisions in this project, is accept the Pareto principle and try to get the biggest bang for my buck. So I’ve found estimates of the total number of speakers of each language on Ethnologue and included any language with at least 20 million speakers. That covers 75% of the world, so it’s a reasonable cut-off point in the spirit of the Pareto principle even though it ignores more than one billion people in the world. Here’s a list of the most commonly spoken languages:
Table Of Most Widely–Spoken Languages By Population
|Rank by L1 population||Native name||English name||ISO 639-3 code|
|1||普通话 / 國語 / 华语||Chinese — Mandarin||cmn|
|1||吴语||Chinese — Wu||wuu|
|1||粵語 / 广东话||Chinese — Yue||yue|
|1||閩南語 / 闽南语||Chinese — Min Nan||nan|
|1||晋语||Chinese — Jin||cjy|
|1||湘语 / 湖南话||Chinese — Xiang||hsn|
|1||客家話 / 客家话||Chinese — Hakka||hak|
|1||赣语 / 江西话||Chinese — Gan||gan|
|2||Español / castellano||Spanish||spa|
|30||मैथिली / মৈথিলী||Maithili||mai|
|31||Bahasa Sunda / ᮘᮞ ᮞᮥᮔ᮪ᮓ||Sunda||sun|
|34||دربار / فارسی / پارسی / тоҷикӣ||Persian||fas|
|36||پنجابی / ਪੰਜਾਬੀ / पंजाबी||Punjabi||pan|
|43||سنڌي / سندھی / सिन्धी||Sindhi||snd|
|46||أۇزبېك / O‘zbek / Ўзбек||Uzbek||uzb|
|47||راجستھانی / राजस्थानी||Rajasthani||raj|
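The cut-off described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the script used for this site: the function name `select_languages`, the language labels, and the speaker counts are all made up for the example, and real figures would have to come from Ethnologue.

```python
def select_languages(speakers_by_language, min_speakers=20_000_000):
    """Keep languages at or above the cut-off and report the share of speakers covered."""
    kept = {lang: n for lang, n in speakers_by_language.items() if n >= min_speakers}
    total = sum(speakers_by_language.values())
    coverage = sum(kept.values()) / total if total else 0.0
    # Largest language first, so the most likely choices sit at the top of the list.
    ordered = sorted(kept, key=kept.get, reverse=True)
    return ordered, coverage


# Placeholder figures for illustration only, NOT Ethnologue estimates.
sample = {
    "lang_a": 900_000_000,
    "lang_b": 60_000_000,
    "lang_c": 25_000_000,
    "lang_d": 10_000_000,
    "lang_e": 5_000_000,
}
selected, coverage = select_languages(sample)
print(selected, f"{coverage:.1%}")
```

Raising or lowering `min_speakers` shows directly how list length trades off against coverage.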
Missing Some Online Communities
The problem is that some languages are disproportionately represented on the internet for various historical reasons. Many languages with a relatively small number of native speakers, such as Icelandic and Norwegian, have a high percentage of their speakers online, creating and consuming content. So, do I include them because practically all of them are online, or exclude them because, in the big picture, their particular course of vocabulary acquisition is likely to be drowned out by much larger language groups? Including them is probably the safer choice, even if their speakers are still statistically less likely to test their vocabulary here.
So, where can I find a better list of languages that would give a higher coverage of internet users’ L1s? Perhaps use this list of LibreOffice’s available localisations? It’s an impressive list, and since LibreOffice is an open-source project, a language’s existence on this list implies that there are at least a few speakers of that language online. But is the list too long? How long does it take to find your L1?
Maybe a shorter list would be more appropriate. Unfortunately, there are no reliable estimates of language use on the internet. Global Reach and Internet World Stats both provide some estimates, but they neither detail their methodologies well nor update their estimates regularly. Nevertheless, due to the dearth of alternatives, their estimates are widely cited.
An alternative measure that might prove fruitful is the number of pages in Wikipedia written in each language. Of course, a useful measure needs more than just the total number of pages in Wikipedia.
Look at Waray–Waray, for example. There are fewer than 10,000 Waray–Waray–speaking Wikipedia users but more than 100,000 articles; more articles than Thai, Greek, Hindi, and Cantonese. But few would argue that the size of the Waray–Waray–speaking population would justify inclusion in the list if we really are accepting the Pareto principle. So these raw numbers must be weighed against the total number of users, number of edits, number of images, number of admins, and perhaps even externally sourced statistics if possible. This doesn’t directly reflect the total amount of online content available in these languages, but because Wikipedia aims to be the world’s foremost repository of encyclopædic information, it might represent at least the degree to which each language group perceives the utility of the internet and, by extension, their likelihood of using it to determine something about themselves, such as testing their own vocabulary sizes at a website like this.
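To make the idea of weighing article counts against other statistics concrete, here is one possible sketch. It is not a formula I have settled on: the choice of a geometric mean, the function name `composite_scores`, the metric names, and all the numbers are assumptions made up for illustration, and none of the figures are real Wikipedia statistics.

```python
import math


def composite_scores(stats, metrics=("articles", "users", "edits", "admins")):
    """Score each language by the geometric mean of its metrics, each scaled to the largest value seen.

    Assumes every metric is positive for at least one language.
    """
    maxima = {m: max(row[m] for row in stats.values()) for m in metrics}
    scores = {}
    for lang, row in stats.items():
        scaled = [row[m] / maxima[m] for m in metrics]
        # A geometric mean keeps one inflated metric (e.g. articles)
        # from compensating for several very small ones.
        scores[lang] = math.prod(scaled) ** (1 / len(metrics))
    return scores


# Invented numbers for illustration only, NOT real Wikipedia statistics.
stats = {
    "many_articles_few_users": {"articles": 100_000, "users": 1_000, "edits": 50_000, "admins": 2},
    "balanced": {"articles": 60_000, "users": 50_000, "edits": 2_000_000, "admins": 20},
}
scores = composite_scores(stats)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With a score like this, a wiki with a huge article count but very few users, edits, and admins ranks below a smaller but more active one, which is the behaviour the Waray–Waray example calls for.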
Even if a suitable formula could be found to evaluate the Wikipedia statistics, the list may still not be acceptable. The use of Wikipedia is dependent on culture. 百度百科 (Baidu Baike) and 互动在线 (Hudong), for example, are Chinese–language collaborative encyclopædias which are both much bigger than the Chinese version of Wikipedia, so the quantity of Wikipedia articles would not be a fair measurement in that case. I’m sure there are other examples.
A Better Compromise
For lack of a better alternative, I’ve settled for the results of two relatively old studies which are mostly in agreement. Mas Hernàndez’s (2003) list of online content ordered by language overlaps considerably with Guinovart’s (2003) list. These languages can be used to supplement the data from Ethnologue. As learners use the site, I’ll pay particular attention to the numbers of people coming from countries where the following languages are spoken. I’ll also take into account any comments or requests for these languages. Here’s what I’ve come up with so far.
Table Of Languages Added Based On User Requests And Site Usage
|Rank by L1 population||Native name||English name||ISO 639-3 code|
Table Of Languages Under Consideration For Addition
|Rank by L1 population||Native name||English name||ISO 639-3 code|
|123||slovenčina / slovenský jazyk||Slovak||slk|
|—||slovenski jezik / slovenščina||Slovene||slv|
Table Of Borderline Languages
(Possibly Too Few Speakers Online To Justify Inclusion)
|Rank by L1 population||Native name||English name||ISO 639-3 code|
|—||Cymraeg / y Gymraeg||Welsh||cym|
So, that should cover most speakers and most internet users. Is your L1 not mentioned here? Did you feel disappointed when you couldn’t find your L1 after testing your vocab size on this site? If so, please let us know!
Update — One Year Later
Out of 15,938 results, only 964 users selected “other” as their L1, which represents about 6% of all completed tests. That’s not too bad. Of course, it’s also possible that some language learners did not report their true L1 and instead chose a closely related language that was on the list, so we can’t be absolutely sure of these results. Anyway, here are the L1s reported by users who measured their vocabulary size on this site. It’s interesting to note that the Ethnologue ranking of languages by number of speakers isn’t tightly correlated with the ranking of users taking the vocabulary tests. This confirms my initial suspicion that some other measures of internet usage must be combined with language population sizes to ensure that most users’ L1s will be listed. No list will ever be perfect though, so if you want to test your vocab on this site and your L1 is not listed as an option at the end, please let us know so that we can consider adding it.
Here’s what the results look like so far:
Table Of Users’ Reported L1s
|Users||Rank by L1 population||Native name||English name||ISO 639-3 code|
|1126||1||普通话 / 國語 / 华语||Chinese — Mandarin||cmn|
|720||2||Español / castellano||Spanish||spa|
|164||1||粵語 / 广东话||Chinese — Yue||yue|
|138||34||دربار / فارسی / پارسی / тоҷикӣ||Persian||fas|
|25||1||閩南語 / 闽南语||Chinese — Min Nan||nan|
|24||1||客家話 / 客家话||Chinese — Hakka||hak|
|24||1||吴语||Chinese — Wu||wuu|
|20||36||پنجابی / ਪੰਜਾਬੀ / पंजाबी||Punjabi||pan|
|19||1||赣语 / 江西话||Chinese — Gan||gan|
|18||1||湘语 / 湖南话||Chinese — Xiang||hsn|
|11||1||晋语||Chinese — Jin||cjy|
|9||43||سنڌي / سندھی / सिन्धी||Sindhi||snd|
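For anyone who wants to check the figures in this update, here is a rough sketch. It recomputes the share of users who chose “other” and then measures how closely population rank tracks user counts with a Spearman rank correlation. The choice of Spearman and the helper names `average_ranks` and `spearman` are my own assumptions for illustration. The pairs are only the rows shown in the table above, and the Chinese varieties all share rank 1, so the coefficient is indicative at best.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Share of completed tests where the user picked "other" (figures from the text).
other_share = 964 / 15_938
print(f"{other_share:.1%}")  # about 6.0%

# (users, population rank) pairs copied from the rows listed in the table above.
rows = [(1126, 1), (720, 2), (164, 1), (138, 34), (25, 1), (24, 1),
        (24, 1), (20, 36), (19, 1), (18, 1), (11, 1), (9, 43)]
users = [u for u, _ in rows]
pop_rank = [r for _, r in rows]
# A value near -1 would mean a better (lower) population rank reliably goes with more users.
rho = spearman(pop_rank, users)
print(round(rho, 2))
```

A coefficient far from -1 would be consistent with the observation that speaker population alone is a poor predictor of who actually takes the test.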