Tuesday, February 17, 2009

Among all social applications, the one "social" feature I miss is "social dictionaries".

I hope they will arrive sooner or later. The way I see it, for every word you could see how often it is used, not only on average but also within particular social groups (for example by age, education level, profession, or the locations and countries where the speaker has lived). This is a lot of statistical data that would have been impossible to collect before and that, in this time of general obsession with statistics, seems more or less feasible to collect. It could help a lot in learning to use the words of a foreign language correctly. For example, if I know that two words have roughly the same meaning, but one of them is mostly used by teenagers and the other often surfaces in official documents, then I have less chance of making a mistake with these two words.
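To make the idea a bit more concrete, here is a minimal sketch in Python of how per-group usage could be counted. Everything in it is an assumption for illustration: the function name `group_frequencies`, the group labels, and the two sample sentences are invented, and a real system would need far more data and a much better tokenizer.

```python
import re
from collections import Counter, defaultdict


def group_frequencies(posts):
    """Return {group: {word: relative frequency}} for (group, text) pairs."""
    counts = defaultdict(Counter)
    for group, text in posts:
        # Very crude tokenizer: lowercase runs of letters and apostrophes.
        counts[group].update(re.findall(r"[a-z']+", text.lower()))

    freqs = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        freqs[group] = {word: n / total for word, n in counter.items()}
    return freqs


# Hypothetical sample data, invented for this example.
sample_posts = [
    ("teen", "that movie was awesome, like totally awesome"),
    ("official", "the committee found the proposal commendable"),
]

freqs = group_frequencies(sample_posts)
for group in freqs:
    print(group, freqs[group].get("awesome", 0.0), freqs[group].get("commendable", 0.0))
```

With enough text per group, comparing the two numbers for a pair of synonyms would give exactly the hint described above: which of the two words belongs to which kind of speaker.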

Of course, this is a very rough example... the definition of the social groups is a separate topic in itself. I could imagine having a number of parameters and the possibility of combining them to define a group.
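One possible way to picture "a number of parameters combined into a group" is a set of named conditions that a profile must satisfy all at once. The sketch below is only an illustration in Python; the attribute names (`age`, `country`, `profession`), the helper `in_group`, and the sample profiles are hypothetical.

```python
def in_group(profile, criteria):
    """True if the profile has every listed attribute and passes every test."""
    return all(
        key in profile and test(profile[key])
        for key, test in criteria.items()
    )


# A group is defined by combining conditions on several parameters.
teen_students = {
    "age": lambda age: 13 <= age <= 19,
    "country": lambda country: country in {"NL", "DE"},
    "profession": lambda profession: profession == "student",
}

# Hypothetical profiles, invented for this example.
anna = {"age": 16, "country": "NL", "profession": "student"}
bert = {"age": 45, "country": "NL", "profession": "lawyer"}
anonymous = {"country": "DE"}  # missing attributes never match

print(in_group(anna, teen_students))
print(in_group(bert, teen_students))
print(in_group(anonymous, teen_students))
```

A profile with missing attributes simply does not match, which fits the situation where people reveal only part of the information about themselves.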

The big question is: how to collect this information? OK, more and more people are now present on the Internet. But how do we know which group a person belongs to? We don't want to have tags on ourselves, do we?

On the other hand, there are already a lot of social networks, and their users do volunteer some information about themselves while also contributing content to the public domain. Could this content somehow be used for such a "social dictionary" system?

Of course, this leads directly to questions of privacy. I wonder whether there has already been some thinking about how the concept of privacy evolves in the digital world. There is some anxiety about whether we will have much left to ourselves in the end. People tend to have secrets, and people are submitting more and more data to the network. There are some mechanisms in development aimed at protecting privacy, though. That means some part of the content will always be hidden from the public domain. One might wonder how much difference that can really make for the sort of "social tagging" proposed above.

In the meantime, I miss the "social dictionaries", because I am not content with having a list of synonyms and no clear clue as to how to distinguish one from another.

An even more precise approach would be to describe each meaningful word or phrase using some formal language (as each word or phrase actually describes some point in the space of ideas which we all share). Then we could see how closely two words from two different languages match. Maybe the same space is covered more densely by one language (or dialect / argot) than by another, and these subtleties are chipped off when the text is translated. (There is a well-known example of the many varieties of the word "snow" in some Northern languages, but that is already an exotic case; there could be more interesting ones, for example words describing people's feelings or actions in one area or another.)

If we could find a way of mapping any text into this multidimensional space of lexical invariants, then many things could be possible. Comparison of the translations of well-known texts (books etc.) into different languages. Comparison of different texts and finding similarities between them. Finding associations, too (we can see which words/phrases a particular word/phrase is often combined with, and we can see whether some words/phrases sound alike and could therefore be associated with one another... finding words which sound alike is already feasible with the Soundex algorithm). Finding out whether the texts of a particular author share some specific traits (specific vector families in this lexical multispace).
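As a small illustration of the "sounds alike" part, here is a sketch of the classic American Soundex coding in Python, written from the commonly described rules (keep the first letter, map the following consonants to digits, collapse repeated codes, pad to four characters). The function name `soundex` and the sample names are my own choices; other Soundex variants exist and differ in details, and the algorithm was designed for English surnames, so it is only a rough tool.

```python
import string

# Digit codes for consonants; vowels, y, h and w have no code.
_CODES = {}
for _digit, _letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
    "4": "l", "5": "mn", "6": "r",
}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character American Soundex code, or "" for no letters."""
    letters = [c for c in word.lower() if c in string.ascii_lowercase]
    if not letters:
        return ""

    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
        if len(result) == 4:
            break
    return result.ljust(4, "0")


for name in ("Robert", "Rupert", "Ashcraft", "Tymczak"):
    print(name, soundex(name))
```

Two words that get the same code, such as "Robert" and "Rupert", are candidates for a sound-based association.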

Then, of course, any text is just a more or less successful attempt to express a thought. Words cannot be trusted, as they are just labels for the reality which we chose to share between ourselves. But it is a fascinating area nevertheless.