Solving Challenges of Scale in Data and Language

Burt Kaliski | Jul 29, 2014

It would not be too much of an exaggeration to say that the early Internet operated on the scale of kilobytes, with text represented in a single character encoding – ASCII.  Today's global Internet, so fundamental to society and the world's economy, now enables access to orders of magnitude more information, connecting speakers of a full spectrum of languages.

The research challenges continue to scale along with data volumes and user diversity.

Two reports at the recent Verisign Labs Distinguished Speaker Series event held at Verisign's offices in Fribourg, Switzerland -- the first such event in Europe -- underscored the ongoing activity in this area.

The event's first speaker, Prof. Philippe Cudré-Mauroux, is the director of the eXascale Infolab at the University of Fribourg.  Exascale is of course the next in the series starting with the kilobyte measure and continuing with mega-, giga-, tera-, peta- and then exa-:  on the order of 10^18 bytes.

Prof. Cudré-Mauroux described his research group's work on Hadaps, a new system for distributing and load-balancing data across servers by taking into account differences in server performance.  He also presented one of the real-world applications of the kind that drive demand for exascale data analysis, an intelligent system for detecting leaks in municipal water systems based on pressure variations reported by sensors.

The remainder of his talk covered a new data publishing platform, the Entity Registry System (ERS).  Designed for semi-connected environments, ERS provides scalability in the broader world where Internet connectivity is not always so reliable.  (ERS was one of the projects funded in the Verisign Labs Infrastructure Grant program, and was previously reported at the December installment of the series.)

The second speaker, Prof. Jean Hennebert, co-leads iCoSys, an institute for complex systems research at the University of Applied Sciences of Fribourg.

Prof. Hennebert's research group has been working on the application of phonetic analysis to the selection of domain names.  He described an efficient algorithm for mapping a word written in one language -- a sequence of graphemes -- into a language-independent sequence of sounds, or phonemes.  Another algorithm maps back into written form, making it possible to construct two written words that sound the same within one language (homophones), as well as across different languages.
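The round trip described above can be sketched in a few lines.  This is only a toy illustration, not the research group's algorithm: the rule table and the greedy longest-match transcription are invented for the example.

```python
# Toy sketch (not the authors' algorithm): transcribe written words into
# rough phoneme strings with a small hand-made rule table, then compare
# the results to detect spellings that sound alike.

# Hypothetical rule table: longest-match grapheme -> phoneme substitutions.
RULES = {
    "zh": "Z", "sh": "S", "ch": "C", "kh": "X", "ph": "f",
    "oo": "u", "ee": "i",
    "a": "a", "e": "e", "i": "i", "o": "o", "u": "u",
    "b": "b", "c": "k", "d": "d", "f": "f", "g": "g", "h": "h",
    "j": "Z", "k": "k", "l": "l", "m": "m", "n": "n", "p": "p",
    "q": "k", "r": "r", "s": "s", "t": "t", "v": "v", "w": "v",
    "x": "ks", "y": "i", "z": "z",
}

def to_phonemes(word: str) -> str:
    """Greedy longest-match transcription of a written word into phonemes."""
    word = word.lower()
    out, i = [], 0
    while i < len(word):
        for length in (2, 1):  # try two-letter graphemes first
            chunk = word[i:i + length]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += length
                break
        else:
            i += 1  # skip characters the table does not cover
    return "".join(out)

def sound_alike(a: str, b: str) -> bool:
    """Two spellings are (toy-)homophones if their phoneme strings match."""
    return to_phonemes(a) == to_phonemes(b)

print(sound_alike("phone", "fone"))  # both map to "fone" -> True
```

A production system would of course learn its grapheme-to-phoneme rules from pronunciation data per language rather than hard-coding them, but the structure -- written form in, language-independent sound sequence out -- is the same.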

Phonetic equivalence within a language can help domain name registrants find names that sound like a word or phrase that they or other users who speak the same language find interesting.  Phonetic equivalence across languages can do the same for users who interact with multiple languages.  A user may hear a word or phrase in one language, but type it in another.  The similar sound of the domain name when spelled in the second language helps the user navigate to what he or she heard in the first.

As one example from the presentation:  A Russian speaker might pronounce the word Лужниках the same way that an English speaker pronounces "Luznika".  (The difference is that only one of these is a real word.)  The research group has developed a tool in collaboration with Verisign that suggests and rates multiple such mappings based on sound similarity.
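The rating step might work along the following lines.  This is a hypothetical sketch, not the Verisign tool: the phoneme strings and the use of a generic sequence-similarity score are assumptions for illustration.

```python
# Hypothetical sketch of the rating step: given a phoneme string for the
# word as heard, rank candidate spellings by how closely their own
# phoneme strings match it.  All names and data here are illustrative.
from difflib import SequenceMatcher

def similarity(phonemes_a: str, phonemes_b: str) -> float:
    """Score in [0, 1]; 1.0 means the phoneme sequences are identical."""
    return SequenceMatcher(None, phonemes_a, phonemes_b).ratio()

def rank_candidates(heard: str, candidates: dict) -> list:
    """Sort candidate spellings by phonetic similarity to the heard word.

    `heard` is the phoneme string of the word in the source language;
    `candidates` maps each candidate spelling to its phoneme string.
    """
    return sorted(candidates,
                  key=lambda c: similarity(heard, candidates[c]),
                  reverse=True)

# Toy phoneme strings for spellings of a word heard as "luZnika":
heard = "luZnika"
spellings = {"lugnika": "lugnika",
             "luzhnika": "luZnika",
             "looznika": "luznika"}
print(rank_candidates(heard, spellings))  # "luzhnika" ranks first
```

Scoring on phoneme strings rather than on the spellings themselves is what makes the ranking work across languages: two very different written forms can still be near-identical in sound.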

Thanks to these two experts for sharing their work as they continue to solve the challenges of scale in the growing global Internet.

In what other ways do you see the impact of scale in today's Internet?