Saturday, 30 June 2012

Random names generator (part 1)

I have always loved games with randomly generated content: NetHack, Dwarf Fortress, ... One of the things these games generate is names: names of locations, people, weapons, etc. In a few of my projects I have also needed to generate random names, and this series of blog posts covers my research and development.


The first approach that comes to mind is to write down a list of names (of people, for example) and choose from it at random. But this would result in a world full of Peters and Johns. Not what I'm looking for.

Another approach could be to download an existing list of names and pick from it in the same way. This is a lot better: with a large database of first and last names, I could generate a list of real-looking names. Ideal for test data. But again, this is not what I'm looking for. My aim is something more fantasy-like.

I want to connect words like pieces of a puzzle. Take an adjective, put it in front of a noun and leave it to the reader's imagination to deal with the result. And this is what a part-of-speech (POS) dictionary gives me. It is basically a list of words tagged with their lexical category: verbs, particles, nouns, etc.

Basic version

The first database I used was compiled by Kevin Atkinson: wordlist.sourceforge.net. I created a rather simple Java application that parses the text database and converts it into an SQL database (Derby). This way I can search the database efficiently. Generating names boils down to retrieving random words of a given type and putting them together. Here are some nice examples obtained using a "ProperNoun the Adjective CommonNoun" template:
Lancey the Norse petunia
Lehet the crunchier cleanser
Maleki the enlarged lampworker
Zilla the extrametrical horsemint
Weisburgh the bioecologic linksman
Cimmerian the savvy tidehead
Nahuatlan the half-drunken by-passer (this one is my favourite)
Just a note: the original database contains no information about proper vs. common nouns. I derive this additional attribute with the following rule: Noun & matches(/^[A-Z][a-z].*/) => ProperNoun
As you can see, it works quite nicely.
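
To make the template step a bit more concrete, here is a minimal sketch in Java. It is not the original application: it skips the Derby database and works on small in-memory lists, and the class and method names (TemplateNameGenerator, isProperNoun, generate) are made up for illustration. The only parts taken from the text above are the proper-noun rule and the "ProperNoun the Adjective CommonNoun" template.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.regex.Pattern;

/** Sketch of the "ProperNoun the Adjective CommonNoun" template (hypothetical names). */
final class TemplateNameGenerator {
    // The rule from the note above: a capital letter followed by a lowercase one.
    private static final Pattern PROPER = Pattern.compile("^[A-Z][a-z].*");

    private final List<String> properNouns = new ArrayList<>();
    private final List<String> commonNouns = new ArrayList<>();
    private final List<String> adjectives = new ArrayList<>();
    private final Random random;

    TemplateNameGenerator(List<String> nouns, List<String> adjectives, Random random) {
        // Split the nouns into proper and common ones using the heuristic.
        for (String noun : nouns) {
            if (isProperNoun(noun)) {
                properNouns.add(noun);
            } else {
                commonNouns.add(noun);
            }
        }
        this.adjectives.addAll(adjectives);
        this.random = random;
    }

    static boolean isProperNoun(String noun) {
        return PROPER.matcher(noun).matches();
    }

    // Uniform random choice; assumes the list is not empty.
    private String pick(List<String> words) {
        return words.get(random.nextInt(words.size()));
    }

    String generate() {
        return pick(properNouns) + " the " + pick(adjectives) + " " + pick(commonNouns);
    }

    public static void main(String[] args) {
        // Words borrowed from the examples above; a real run would read them from the POS database.
        List<String> nouns = Arrays.asList("Lancey", "Zilla", "Maleki", "petunia", "cleanser", "horsemint");
        List<String> adjectives = Arrays.asList("Norse", "enlarged", "savvy");
        TemplateNameGenerator generator = new TemplateNameGenerator(nouns, adjectives, new Random());
        for (int i = 0; i < 5; i++) {
            System.out.println(generator.generate());
        }
    }
}
```

Every word within a category is equally likely to be picked here, which is exactly the property that the next section runs into.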

Better version?

The results are not always meaningful, but this is to be expected. The dictionary doesn't contain enough information to make them meaningful. But I think there is a bigger problem. The list of available words is vast and most of them are rarely used. The outcome? Some generated names look like a heap of symbols to us. For example:
Paff the trans-Jovian heterodoxy
Wutsin the thickening basidiomycotina
Ramiah the zingiberaceous supersensitisation 
(try to read this aloud, I dare you!)
Sometimes I would have to look all the words up just to know how to pronounce them, not to mention to understand their meaning. The solution? Use these rare words rarely (oh my!).

Thanks to years of work and a huge amount of CPU time, there are now databases of n-gram frequencies, i.e. the number of occurrences of N-word strings (in books, for example). One giant (and free) dataset I managed to find is available from Google: books.google.com/ngrams/datasets. Another one comes from our beloved Wiki: en.wiktionary.org/wiki/Wiktionary:Frequency_lists.

If I joined the information from a POS database with an n-gram frequency database, I could make the generated names more "normal" and believable. Common words would show up more frequently and the names would be easier to read.
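
One possible way to "use rare words rarely" is to pick each word with a probability proportional to its frequency count. The sketch below is only an illustration of that idea under my own assumptions, not something already implemented: the class name FrequencyWeightedPicker is made up, and the counts in main are invented numbers, not values from the Google or Wiktionary data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Sketch: picks a word with probability proportional to its count (hypothetical class). */
final class FrequencyWeightedPicker {
    private final List<String> words = new ArrayList<>();
    private final List<Long> cumulative = new ArrayList<>(); // running totals of the counts
    private long total = 0;

    FrequencyWeightedPicker(Map<String, Long> counts) {
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            if (entry.getValue() <= 0) {
                continue; // words without a positive count can never be picked
            }
            total += entry.getValue();
            words.add(entry.getKey());
            cumulative.add(total);
        }
        if (words.isEmpty()) {
            throw new IllegalArgumentException("at least one word needs a positive count");
        }
    }

    String pick(Random random) {
        // A target in [0, total); the word whose running total first exceeds it wins.
        long target = (long) (random.nextDouble() * total);
        int lo = 0;
        int hi = cumulative.size() - 1;
        while (lo < hi) { // binary search over the running totals
            int mid = (lo + hi) >>> 1;
            if (cumulative.get(mid) > target) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return words.get(lo);
    }

    public static void main(String[] args) {
        // Invented counts, for illustration only.
        Map<String, Long> adjectiveCounts = new LinkedHashMap<>();
        adjectiveCounts.put("savvy", 900L);
        adjectiveCounts.put("enlarged", 600L);
        adjectiveCounts.put("zingiberaceous", 1L);
        FrequencyWeightedPicker picker = new FrequencyWeightedPicker(adjectiveCounts);
        Random random = new Random();
        for (int i = 0; i < 5; i++) {
            System.out.println(picker.pick(random));
        }
    }
}
```

With purely proportional weights the rarest words would almost never appear, so some damping of the counts (a logarithm, for instance) might be needed to keep an occasional exotic word in the mix. That is a tuning question I leave open here.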