Jimbocho, a small neighborhood between office towers and a city thoroughfare, is also called “Hon-no-machi”, the “city of books”. Nowhere else in the world are there as many bookstores in such a small area, most of them second-hand bookshops. Together they have more than ten million old titles in stock. Tokyo has been a book metropolis since the 17th century.
Today, 99 percent of Japanese people are cut off from the literary heritage that resides in these stores and archives. They buy old books but cannot read them. The texts are written and printed in Kuzushiji, a script that was abolished by the Ministry of Education in 1900. It arose from vigorous writing with a brush, as a cursive hand, a “Kurrent” or “running script”, as it used to be called in German. Kuzushiji translates as “collapsed characters”. About 3.5 million books and documents in Kuzushiji have never been transcribed into modern Japanese. Even most historians cannot read these texts, or can do so only with difficulty. Thai literary scholar Tarin Clanuwat calculated that it would take more than a century to decipher and transcribe them all, even if the few remaining Kuzushiji experts joined forces.
At some point, she says, she had an idea: “How much faster would it be if a computer transcribed the texts?”
The writing system imported from China is not really suitable for Japanese
Clanuwat had also been programming from a young age and wanted to try “artificial intelligence”: a computer should learn to read Kuzushiji on its own. Together with a team from Japan’s National Institute of Informatics, she developed an algorithm, “KuroNet”. Even in early versions, the program correctly recognized around 90 percent of the characters in simpler texts. That is a solid result, since until recently teaching a computer to read Kuzushiji was considered impossible. Artificial intelligence looks for patterns and regularities. Kuzushiji writers, however, each abbreviated their characters in their own way. In addition, the texts contain hundreds or even thousands of different characters.
The root of the problem is that the writing system imported from China is not really suitable for Japanese. In the first centuries after Japan adopted Chinese characters, this did not matter. The few people who wrote did so in Chinese. It was, so to speak, the Japanese written language. It became more difficult when the Japanese started writing Japanese using the Chinese script in the 8th century.
Basic Chinese words consist of a single syllable, and Chinese neither conjugates nor inflects. That is why it lends itself to being written with ideographic characters, graphic symbols that represent an object or concept. Japanese words, on the other hand, are mostly polysyllabic, and Japanese not only conjugates verbs but even puts adjectives in the past tense. For this, its writing needs grammatical elements. The Chinese writing system offers no solution for this.
So the Japanese began to reduce some Chinese characters to a purely phonetic value. At the same time, they radically simplified these characters. This is how the Japanese syllabaries came about, although they were not standardized until 1900. Everyone wrote their syllabic characters as they wished. Some forms prevailed, others disappeared. Documents written in Kuzushiji bear witness to this variety.
For centuries, only men wrote kanji, the characters taken from China, to demonstrate their classical education and superiority. Women were said to be uneducated and to use characters reduced to phonetic syllables and “collapsed” into a sweeping script because they had not learned any kanji. In some cases that was not true. But kanji were reserved for men.
“In the 14th century, a court poet defined three categories,” Clanuwat says. “He compared properly written kanji to a standing person and loosely written kanji to a walking person. Kuzushiji kanji, he said, run.”
AI programmers don’t know Japanese, but they translate ancient Japanese texts
KuroNet is now freely accessible online, but only in Japanese. The system needs about two seconds per character. To improve it, Clanuwat’s institute launched a competition three years ago for self-learning programs that read Kuzushiji. It ran on the Kaggle platform, where companies and institutes post software problems. A total of $15,000 in prize money was up for grabs, and 293 teams participated.
The second prize went to Konstantin Lopuhin from Moscow, a software developer specializing in deep learning. He had previously used artificial intelligence to evaluate satellite images and to categorize sea lions. In the Skype interview, Lopuhin does not talk about kanji or characters, but about “objects” and “classes”. Does it make a difference if the computer has to recognize characters instead of sea lions? “Yeah,” says Lopuhin, “there were five classes for the sea lions and about twenty for the satellite images. But for Kuzushiji there were over 4,000. I also don’t speak Japanese, so I couldn’t spot even obvious mistakes.”
To this day, Lopuhin does not know what is in the texts that his program deciphered. Cookbooks made up a third of the 45 texts to be deciphered for the competition, alongside a 1639 book on Christianity, one on silkworm breeding and a great deal of literature. Among them was a copy of the first chapter of “Genji”, which is considered the Japanese book of books and the first novel in literary history. The court lady Murasaki Shikibu wrote the story of the prince’s amorous adventures in the first decade of the last millennium, a time when people met to read their poems about cherry blossoms, autumn leaves, fog and the moon, about love and its impermanence. They drank rice wine and had affairs. Love was free, at least for men. Women, for their part, had their own thicket of signs. They wrote Kuzushiji: sweeping, “running”.
Can we expect unknown masterpieces to be discovered once computers transcribe Kuzushiji texts en masse? “I don’t think so,” says Clanuwat. “But you’ll find explanations of old concepts that we didn’t fully understand until now. Little pieces of the puzzle that suddenly make sense.”