About This Dictionary
The TKG Japanese-English Learner’s Dictionary (TKGJE) is for learners of Japanese who are actively trying to build their vocabulary.
I began this project in early January 2026 in the hope of creating the sort of dictionary that I desperately wanted when I was studying Japanese more than forty years ago. Nothing like this dictionary existed then. Some years later, I proposed creating a similar dictionary to several Japanese publishers, but they all rejected the idea, on the (very reasonable) grounds that it would have taken a team of lexicographers years to complete and that the dictionary would never sell enough copies to recoup the investment. I gave up the idea until very recently, when AI writing and agentic coding tools became good enough to do almost all of the work.
All of the entry writing and coding, and most of the conceptualization, are being done by Claude Opus through Claude Code for the Web, with occasional debugging and other help from ChatGPT and Gemini. I serve mainly as project supervisor. The content is a bit choppy, and many improvements remain to be made. But it should be usable by learners even in its present form.
All of the entry files and code are available at the dictionary’s GitHub repository. This dictionary and all of its files are released into the public domain under CC0 1.0 Universal. Anyone may use and adapt the dictionary data for any purpose. While credit to me and to Claude would be welcome, I, at least, do not insist on it.
Tom Gally
Yokohama, Japan
gally.net
Blog
March 4, 2026
I thought of adding illustrations to entries that refer to traditional Japanese objects. I had Claude prepare prompts for twenty such headwords—asking for black-and-white line drawings, as is traditional in paper dictionaries—and run them through three text-to-image models at OpenRouter. At first glance, the results look pretty good, but on closer examination there are problems with almost all of them: some obvious, like too many strings on a shamisen, and others more subtle. Some are probably too subtle for me to identify: what, in fact, would be a typical arrangement of a tokonoma, for example? As with the audio readings of the example sentences, I decided to hold off on adding illustrations until I am reasonably confident that I can create a workflow to have AI models check them reliably.
February 27, 2026
A couple of days ago, I used Claude Code to create and run a workflow in which a relatively cheap version of Gemini thinks of words semantically related to the headwords and candidates already collected, evaluates those words for suitability for the dictionary, and adds those that pass to the Pending list. The process quickly turned up thousands of words that belong in the dictionary, and I am now having Claude create entries for them in batches of thirty. After the Pending list gets whittled down some more, I will run this “think of related words” task a few more times. I had noticed before the last run that only one planet (Jupiter) was in the dictionary; this workflow yielded a couple more, but several are still missing.
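The screening step of that workflow can be sketched as follows. This is a minimal reconstruction of the deduplication logic only, assuming the word lists are plain Python sequences; the actual pipeline, its file formats, and the Gemini prompts are not shown in this post.

```python
def filter_candidates(suggested, headwords, pending):
    """Keep only suggested words that are not already dictionary
    headwords, not already on the Pending list, and not duplicated
    within the suggestions themselves."""
    known = set(headwords) | set(pending)
    out = []
    for word in suggested:
        if word not in known:
            known.add(word)  # also dedupes repeats within `suggested`
            out.append(word)
    return out

# e.g. with Mars suggested twice, Jupiter already a headword,
# and Saturn already pending, only Mars survives:
# filter_candidates(["火星", "土星", "木星", "火星"],
#                   headwords=["木星"], pending=["土星"])  # → ["火星"]
```

The semantic work (thinking of related words and judging their suitability) stays with the model; only this mechanical filtering is deterministic.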
Today, I also added a new kanji index, ordered by how many headwords each kanji appears in. I hope that this unconventional arrangement might be useful for serendipitous explorations.
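The ordering behind that index can be sketched like this. This is my own minimal reconstruction, not the repository's actual code; it assumes headwords are plain strings and treats any character in the CJK Unified Ideographs block as a kanji.

```python
from collections import Counter

def kanji_by_headword_count(headwords):
    """Return (kanji, count) pairs ordered by how many headwords
    each kanji appears in, counting each headword at most once."""
    counts = Counter()
    for word in headwords:
        # set() so a kanji repeated within one headword counts once
        counts.update({ch for ch in word if "\u4e00" <= ch <= "\u9fff"})
    return counts.most_common()
```

Kana-only headwords contribute nothing, since the range check skips everything outside the kanji block.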
February 21, 2026
I am continuing to have Claude add new entries to the dictionary based on the corpus I had it compile a few weeks ago. It is more than halfway through the list of word forms extracted from that corpus, and I expect it should be finished in a week or so. The words it is choosing from the list all look suitable for this dictionary and, in aggregate, should fill nearly all gaps in essential vocabulary.
After that word hunt is done, I plan to try the “think of related words” strategy again. I will have Claude go through the full list of headwords one by one. For each word, it will come up with a list of synonyms, antonyms, and related words and create entries for any of those words that are reasonably common and not yet in the dictionary. I hope that will fill any remaining gaps.
Other tasks for the weeks and months ahead: more complete information on verb conjugations and i-adjective declensions; better coverage of kanji and other orthographic variations; better cross-referencing between entries; addition of inline links to all example sentences (fewer than one-tenth of the entries have the links now); improvements to the interface; and addition of audio readings to example sentences.
At some point, I plan to create a comprehensive proofreading and evaluation agent. It will read through the entries one at a time, critically examine every element (glosses, definitions, example sentences and their translations, notes, furigana, cross-references, and so on), and suggest corrections and improvements. I might run that agent with either GPT or Gemini instead of Claude, to get a second opinion, so to speak.
February 7, 2026
In addition to just having Claude think up new headwords, as I described on January 27, I am also adding words that Claude has used in example sentences but that did not yet have entries. This process is catching many common words that had been overlooked.
Also, the corpus I am compiling has reached around 250,000 characters, about two-thirds from the web and about one-third texts generated by Claude, ChatGPT, and Gemini. A couple of days ago, I had Claude use a natural language processing program it found on the web to extract the canonical forms (lemmas) of all words in that corpus. I then let Claude decide which of those words should be candidates for inclusion in this dictionary, based on a few criteria I gave it: avoid proper names, overly technical or specialized terms, etc. It is making very good choices.
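The frequency-tallying part of that extraction can be sketched as below. The post does not name the NLP program Claude found (in practice it would presumably wrap a Japanese morphological analyzer such as MeCab), so the sketch takes the lemmatizer as a parameter rather than inventing an API for it.

```python
from collections import Counter

def lemma_frequencies(corpus_lines, lemmatize):
    """Tally how often each canonical form (lemma) occurs in the corpus.
    `lemmatize` maps one line of text to a list of lemmas; the actual
    tool used is not identified in the post, so it is left abstract here."""
    counts = Counter()
    for line in corpus_lines:
        counts.update(lemmatize(line))
    return counts
```

The resulting counts would then be handed to the model for the judgment calls (filtering out proper names, overly technical terms, and so on), which are not mechanical.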
For the time being, I will be adding words mainly using these two methods, as the words found in this way are, to my eye, appropriate for inclusion in this dictionary. The earlier method—having Claude just think of words—is increasingly yielding low-frequency vocabulary not so important to learners.
The headword-addition process goes slowly, about thirty entries per batch and at most four or five batches per day. The dictionary did reach my initial target of ten thousand headwords a couple of days ago, but I can see from the words still being added that that target was too low.
After the entry count reaches fifteen or twenty thousand, I will start checking for entries that can be removed or merged. I have noticed cases in which minor okurigana or kanji variations led Claude to create separate entries for what are usually regarded as the same word. There are probably many more such cases, and I will have Claude search for and deal with them systematically at some point.
February 3, 2026
Yesterday, I started adding a new feature to the dictionary: inline links from words that appear in example sentences to the entries for those words (if the entries exist). This feature required the creation of a new tagging system for the examples in the JSON entry files.
Claude cannot create these links with a deterministic program, because the same string of characters can, depending on the context, represent different words. Instead, it must examine each sentence semantically, a time- and token-consuming process. The links won’t appear in all entries for quite some time. Some entries that do have the links are those for 広場, ゴム, and 普段. Both the Examples and Links toggles must be turned on to view the links.
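The division of labor can be illustrated with a hypothetical tagged example. The repository's actual JSON schema is not shown in this post, so the field names below are illustrative only; the point is that once the model has done the semantic work of deciding which span links to which entry file, rendering the links is purely mechanical.

```python
import json

# Hypothetical schema -- field names are illustrative,
# not the repository's actual entry format.
example = json.loads("""
{
  "jp": "駅前の広場で待ち合わせた。",
  "links": [{"text": "広場", "entry": "hiroba"}]
}
""")

def render_links(ex):
    """Wrap each model-tagged span in a link to its entry file.
    Disambiguation happened upstream, so plain replacement suffices."""
    html = ex["jp"]
    for link in ex["links"]:
        html = html.replace(
            link["text"],
            f'<a href="{link["entry"]}.html">{link["text"]}</a>',
            1,
        )
    return html
```

A sentence with no tags passes through unchanged, which matches the behavior described: links appear only where entries exist.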
January 27, 2026
A few notes about how headwords are being collected for this dictionary.
When I started the project early this month, I had Claude decide for itself what words to use as headwords. It reported that it started with a wordlist it found on GitHub for people studying for the Japanese Language Proficiency Test. As it started writing entries, on its own initiative it marked each entry with a JLPT level (N5, N4, N3, N2, or N1).
A few days into the project, I had it start compiling a list of candidate words. I had it write a prompt that I could give it to have it add words to the list. The prompt basically asks Claude to think of 100 words that would be suitable for this dictionary and add them to the list. I then run another prompt to have it create entries from words on the candidates list, currently 30 at a time. (I tried larger batches, but it would start cutting corners and ignore key parts of the prompt as it ran out of context.)
This method seems to be getting pretty good coverage of basic and core vocabulary, though there might be gaps. I haven’t run a comparison with existing dictionaries or wordlists, and I would like to avoid doing so if possible. This “think of words that belong in the dictionary” approach would certainly not be viable for a human-produced dictionary, nearly all of which are compiled based on previous dictionaries, but it might be okay when a language model trained on vast amounts of text is doing the “thinking.” We’ll see.
Today, I had it add to the candidates list most of the words that it has used so far in the notes sections of entries. This increased the candidates list from around 500 words to over 10,000, though there were a lot of spurious entries that I have started to have it cull.
The JLPT itself does not specify vocabulary for its levels; it seems to have adopted a CEFR-like approach that avoids associating particular words or skills with assessment levels. So, a week or two ago, I had Claude delete all of the JLPT levels and create instead an original three-tier vocabulary system. The initial candidates for the basic and core tiers were voted on by Claude, ChatGPT, Gemini, and Grok. I then had Claude adjust the tier membership so that each of those two tiers would be reasonably complete; that is, if a word in a distinct and limited semantic category—such as days of the week—is in the tier, then the other words in that category should be in the same tier. This seems to have worked reasonably well, though I will need to have Claude revisit and revise this tier system later.
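The completeness adjustment described above can be sketched as a closure operation over known semantic categories. The category lists themselves are assumptions for illustration; the actual adjustment was done by Claude, not by a program like this.

```python
def complete_tier(tier, categories):
    """If any member of a closed semantic category (days of the week,
    months, planets, ...) is in the tier, add the whole category so
    the tier has no arbitrary gaps within that category."""
    tier = set(tier)
    for category in categories:
        if tier & set(category):
            tier |= set(category)
    return tier
```

For example, a tier containing only 月曜日 would, after closure over a days-of-the-week category, contain all seven day names.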
I have also started a separate project in which I have Claude Code compile a corpus of texts in various genres from the web. When the corpus is reasonably large, I will have Claude scan it to find essential words that have not yet been added to the dictionary or the candidates list. I am hopeful that this will fill in any remaining gaps in the vocabulary that belongs in any dictionary of Japanese.
January 22, 2026
Today I had Claude add a kanji index function. Click on any kanji in the headword of an entry and you will be taken to a list of all entries in the dictionary that contain that kanji.
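A minimal sketch of the data structure behind such an index follows, assuming each entry is identified by a file ID and a headword string. The site's actual implementation (presumably client-side JavaScript over the JSON entry files) is not shown here; this just illustrates the kanji-to-entries mapping.

```python
def build_kanji_index(entries):
    """Map each kanji to the IDs of all entries whose headword
    contains it. `entries` maps entry ID -> headword string."""
    index = {}
    for entry_id, headword in entries.items():
        for ch in sorted(set(headword)):  # each kanji once per headword
            if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs
                index.setdefault(ch, []).append(entry_id)
    return index
```

Clicking a kanji then amounts to one dictionary lookup: `index["学"]` yields every entry whose headword contains 学.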
I have also been having Claude run the continue_polishing.md prompt on batches of entries to adjust the vocabulary level in example sentences for the base and core tiers, add example sentences, and fix various problems. This is a token-heavy task, and it can only do about 20 at a time before running out of context. It might take a week or longer to go through all of the entries in this way. After it’s done, I plan to set up a similar procedure for Claude to improve the notes sections of the entries. They vary widely in length and format now, and I want them to be more consistent.
After that is done, I plan to enhance the cross-referencing between entries and add information on verb conjugations.
I want eventually to add audio readings of all of the example sentences. A couple of weeks ago, I did produce about a thousand example readings in mp3 format using OpenAI’s text-to-speech model (at an API cost of about one yen per sentence). The intonation was quite natural, but there were occasional mistakes, such as punctuation marks read aloud or the particle は read as ha instead of wa. I decided that such misreadings would be annoying to users of the dictionary, so I am currently planning to add audio files only after either TTS models improve for Japanese or I find a way for LLMs to check audio readings for accuracy automatically.
While doing these other tasks, I will continue to add new entries.