(Ludum Linguarum is an open source project that I recently started, and whose creation I’ve been documenting in a series of posts. Its purpose is to let you pull localized content from games, and make flash cards for learning another language. It can be found on GitHub.)
In this post, I’ll talk a little bit about Ludum Linguarum’s support for some of the Aurora engine-based games that are out there. The Aurora engine was BioWare’s evolution of its earlier Infinity Engine, and was used in quite a few games overall.
There are quite a few games with large amounts of text that were produced with the Aurora engine (including one that I worked on), so it seems quite natural to try and target it for extraction. The text in these games can also be categorized in some ways that I think are interesting, in the context of language learning – there are really short snippets or words (item names, spell names, skill names, etc.), as well as really lengthy bits of dialogue that might be good translation exercises. Additionally, there’s quite a bit of official and unofficial documentation out there around its file formats.
Goals for Extraction
The raw strings for the game are (mostly) located inside the talk table files. However, just extracting the talk tables would lose all context around how the strings are actually used in the game. For example, the spell names, feat names, creature names, dialogues, and so on, are all jumbled together in the talk table. It sounds like a small thing, but I feel that creating a taxonomy (in the form of “lessons”) would make a big difference in the usefulness of the end product. Unfortunately, it also makes a huge difference in the amount of effort needed to extract all of this data!
How it all went
I spent quite a bit of time writing file-format-specific code, for things like the TLK table format, the BIF and KEY packed resource formats, the ERF archive format, and the generic GFF format. On top of that, there was code to deal with the dialogue format that gets serialized into a GFF.
I started with the original Neverwinter Nights, and then moved on to Jade Empire. The console-based Aurora engine games used some variant file formats (binary 2DAs, RIM files, etc.) that needed a little extra work to deal with, but there was enough information about these available on the Internet that I was able to support them without too much hassle.
Once I had the basic file parsing code in place, it was just a matter of constructing the “recipe” of how to extract the game strings. This mostly involved sifting through all of the 2DA files for each game, looking for columns that represented “string refs” (i.e. keys into the talk table database). Extracting dialogues was much simpler, since they were already in their own files and their contents were unambiguous.
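To make the “recipe” idea a bit more concrete, here is a minimal sketch in Python (not the project's actual F# code). Everything in it is an assumption for illustration: the `Card` type, the `extract_cards` function, the sample table contents, and the particular table and column names. It only shows the general shape: a list of (2DA, column, lesson) entries, where each matching cell is treated as a string ref and looked up in the talk table.

```python
from typing import Dict, List, NamedTuple, Tuple


class Card(NamedTuple):
    """One extracted string, tagged with the lesson it belongs to (hypothetical type)."""
    lesson: str
    strref: int
    text: str


# Hypothetical stand-in for a parsed talk table: string ref -> text.
TALK_TABLE: Dict[int, str] = {101: "Fireball", 102: "Magic Missile", 200: "Power Attack"}

# Hypothetical stand-ins for parsed 2DA files: one dict per row.
# "****" marks an empty cell in the text 2DA format.
TWO_DAS: Dict[str, List[Dict[str, str]]] = {
    "spells": [
        {"Label": "FIREBALL", "Name": "101"},
        {"Label": "MAGIC_MISSILE", "Name": "102"},
        {"Label": "UNUSED", "Name": "****"},
    ],
    "feat": [{"LABEL": "PowerAttack", "FEAT": "200"}],
}

# The "recipe": which column of which 2DA holds string refs, and the lesson to file them under.
RECIPE: List[Tuple[str, str, str]] = [
    ("spells", "Name", "Spell names"),
    ("feat", "FEAT", "Feat names"),
]


def extract_cards(recipe, two_das, talk_table) -> List[Card]:
    """Walk the recipe and resolve every numeric cell through the talk table."""
    cards = []
    for table_name, column, lesson in recipe:
        for row in two_das[table_name]:
            cell = row.get(column, "****")
            if not cell.isdigit():
                continue  # empty or non-numeric cell: not a string ref
            strref = int(cell)
            text = talk_table.get(strref)
            if text is not None:
                cards.append(Card(lesson, strref, text))
    return cards


cards = extract_cards(RECIPE, TWO_DAS, TALK_TABLE)
```

The point of keeping the recipe as plain data is that supporting another game is mostly a matter of writing a new list of entries, rather than new extraction logic.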
Comparison between C# and F# implementations
I had basically written all of this file parsing code before (in the C# 2.0 era, so without LINQ), but this time around I was writing it in F#. I found it very interesting to compare the process of writing the new implementation with what I remember from working on Neverwinter Nights 2 more than 10 years ago.
The F# code is a lot more concise – I would estimate on the order of 5-7x. It isn’t quite an apples-to-apples comparison with what I did earlier (for example, serialization is not supported, only deserialization), but it’s still much, much smaller. I suspect that adding serialization support wouldn’t be a huge amount of additional code, for what it’s worth.
Record types and list comprehensions really help condense a lot of the boilerplate code involved in supporting a new file format, and match expressions are both more compact and safer when dealing with enumerated types and other sets of conditional expressions. I also got lots of good usage out of Option types, particularly within the 2DA handling, where they very neatly encapsulated default cell functionality.
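As a rough illustration of the Option-style 2DA lookup, here is a small Python sketch that uses `Optional` in place of F#'s `Option`. The `TwoDA` class, its fields, and the exact fallback rule (use the table default for empty or missing cells) are assumptions made for the example, not a description of the real implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

EMPTY = "****"  # marker for an empty cell in the text 2DA format


@dataclass
class TwoDA:
    """Hypothetical in-memory 2DA: column names, rows of cells, optional table default."""
    columns: List[str]
    rows: List[List[str]]
    default: Optional[str] = None

    def cell(self, row: int, column: str) -> Optional[str]:
        """Return the cell's value, else the table default, else None."""
        if column not in self.columns or not 0 <= row < len(self.rows):
            return self.default
        cells = self.rows[row]
        index = self.columns.index(column)
        if index >= len(cells) or cells[index] == EMPTY:
            return self.default
        return cells[index]


table = TwoDA(["Label", "Name"], [["A", "5"], ["B", "****"]], default="0")
no_default = TwoDA(["Label", "Name"], [["A", "5"], ["B", "****"]])
```

Callers then have a single case to handle for "no value" (`None`), instead of checking separately for empty markers, short rows, and missing columns.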
But I think the thing that accounts for the biggest difference between my old C# implementation and the new F# implementation is the range of functional abstract data types available – or, to put it another way, the lack of LINQ in my C# implementation. If LINQ had been available at the time I was working on Neverwinter Nights 2, I think my code would have looked a lot more like the F# version, with liberal use of Select()/map and Where()/filter. These operations replace very verbose blocks of object construction and selective copying, often with a single line, which is an enormous saving in code size and an improvement in clarity.
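The contrast is easy to show in miniature. The sketch below uses Python's built-in `filter` and `map` as stand-ins for Where()/filter and Select()/map; the `Entry` type and both function names are made up for the example.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    """Hypothetical talk table entry."""
    strref: int
    text: str


def texts_with_loop(entries: List[Entry]) -> List[str]:
    """Pre-LINQ style: build the result by hand, copying selectively."""
    result = []
    for entry in entries:
        trimmed = entry.text.strip()
        if trimmed:
            result.append(trimmed)
    return result


def texts_with_pipeline(entries: List[Entry]) -> List[str]:
    """Where/filter then Select/map: the same logic as one expression."""
    return list(map(lambda e: e.text.strip(),
                    filter(lambda e: e.text.strip(), entries)))


entries = [Entry(1, " Hello "), Entry(2, "   "), Entry(3, "World")]
```

Both functions return the same list; the second just states the "keep these, transform them" intent directly instead of spelling out the bookkeeping.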
I feel like there is still a lot of bespoke logic involved, for extracting the individual bits and pieces of each format, but that doesn’t seem to be avoidable – the formats are not self-describing, and it seemed like it would be overkill to try and construct a meta-definition of the GFF-based formats.
Overall, I was pretty pleased with how this went. While it was a decent amount of work to support each file format, once that code was all written, the process of creating the game-specific recipe to extract strings was pretty straightforward. There weren’t really any surprises in the implementation process, which was definitely not the case for the game that I’ll talk about in my next set of posts.