Case Study: Machine Translation Goes Prime-Time

Chris Manning
Mar 1, 2021 · 5 min read

FamilySearch is a non-profit organization dedicated to helping people discover their family tree. Because of diasporas spanning the globe throughout human history, a person's family makeup is complex and likely touches several countries speaking different languages. FamilySearch's goal is to make all historical records searchable regardless of the language spoken. For FamilySearch's first ten languages, this was challenging but straightforward. Expanding the site into an additional 30+ languages in a single year was challenging enough on its own, but there also had to be content in those languages.

Summary

  • Company: FamilySearch
  • Objective: Get more content to international users faster by using machine translation wherever acceptable
  • Process: Evaluate Machine Translation (MT) providers; gather quantitative data on MT performance and qualitative feedback from human translators; identify the teams, systems, and processes needed to incorporate machine-translated content into the site; design the solution, prototype, and iterate
  • Deliverables: Analysis on existing players in MT, process map for MT-led translation, user/volunteer journey map, wireframe of feedback experience
  • Team: Rob Thomas, Chris Manning, Dan Call, Bryan Austed, Bryan Robinson and Randy Hoffman
  • My Role: Interviewing translators, analyzing data and visualizations from the quantitative evaluations, creating user journey maps, leading design thinking sessions, running pilots with sample data, and designing the wireframes for the feedback screens

Recognizing and Overcoming Objections

I was tasked with selecting several machine translation engines to evaluate. The evaluation process included rigorous blind comparisons, feedback gathered from human translators, post-edit distance scores provided by the translation management system, and a competitive analysis of the different MT providers.
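Post-edit distance is defined differently from one TMS to another; a common framing is a character-level edit distance between the raw MT output and the translator's post-edited version, normalized by segment length. The snippet below is a minimal sketch of that idea, not the exact scoring our TMS used.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def post_edit_distance(mt_output: str, post_edited: str) -> float:
    """Edit distance normalized by the longer segment, in [0, 1].
    0.0 means the translator changed nothing; 1.0 means a full rewrite."""
    longest = max(len(mt_output), len(post_edited)) or 1
    return levenshtein(mt_output, post_edited) / longest
```

A low average post-edit distance across a sample of segments is a rough signal that an engine's output needs little human correction.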

Translators often have harsh biases against MT, partially due to the perceived threat to their continued livelihood. Getting them on board and comfortable with these new MT engines was of the utmost importance. Since one of the main goals was to increase translation output, I was able to frame discussions with translators as evaluating potential aids for increasing their efficiency. This not only alleviated potential distrust, but also made translators excited to share their opinions with us.

Evaluation Results

By comparing MT providers A, B, and C, including runs where a limited training data set was fed to each engine, we were able to determine that segments from engine C were preferred 54% of the time over the other engines.
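The preference rate came from tallying reviewer votes across blind, side-by-side comparisons. The snippet below is a hypothetical sketch of that tally; the vote data and labels are made up for illustration.

```python
from collections import Counter

# Each blind comparison records which engine's segment the reviewer preferred;
# engine identities were hidden from reviewers during the evaluation itself.
preferences = ["C", "A", "C", "C", "B", "C", "A", "C", "B", "C"]  # illustrative only

totals = Counter(preferences)
for engine, wins in totals.most_common():
    print(f"Engine {engine}: preferred {wins / len(preferences):.0%} of votes")
```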

More importantly, the comparisons showed that providing training data to an engine improved its output by 2–4x, depending on the language. With this data, we were able to get executive buy-in to start a pilot of MT-led translation of content.

Analogies from Software Development

FamilySearch is an engineering-led organization. Like most modern software companies, it divides its work into sprints. At FamilySearch, strings are already translated this way: they are fed to the Translation Management System (TMS) from the GitHub repository whenever there is an update. I thought about taking this approach further and presented the idea to management.

There is an adage in the industry: "Shipped is better than perfect." That has never applied to translation, where every phrase and word is highly scrutinized. We identified low-risk content (geographic and historical facts based on country) that could be published immediately after MT. Content would be labeled as machine translated, with an option to provide feedback on the translation. That way, content would go up quickly and could be iterated on as feedback came in.

I explored two different methods of providing feedback, mapping both out in a user flow and architectural diagram.

The first was to simply give users a thumbs-up/thumbs-down option on the content. Pros: very simple for users to engage with, more users would inform translation quality, and content reviews could be triggered when a page reached a threshold of negative feedback. Cons: difficult to pinpoint exactly what users were referring to, and reviewers would have to review the entire page.
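A review trigger like that can be a simple per-page counter. The sketch below is an illustration only; the threshold values and storage are assumptions, not what the CMS actually did.

```python
# Hypothetical sketch: flag a machine-translated page for human review once
# negative votes cross a threshold. Values below are illustrative assumptions.
REVIEW_THRESHOLD = 0.3   # assumed: 30% thumbs-down triggers a review
MIN_VOTES = 20           # assumed: ignore pages with only a handful of votes

def needs_review(thumbs_up: int, thumbs_down: int) -> bool:
    total = thumbs_up + thumbs_down
    if total < MIN_VOTES:
        return False
    return thumbs_down / total >= REVIEW_THRESHOLD
```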

The second method involved designing a more robust feedback module. It allowed users to offer an alternate translation, replace words, or even contribute their own content about their ancestors. This would create a brand-new point of contact between users and FamilySearch. Pros: user feedback would be more specific, and users who engaged might develop a sense of ownership and community because of their involvement. Cons: fewer users might contribute feedback, and the TMS would need more detailed tagging of content so that segments identified by users could be mapped back to the same segments for review, which would require dev work from our TMS vendor.
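For that mapping to work, every published segment needs a stable identifier that travels from the TMS to the page and back with the user's suggestion. The data shapes below are a hypothetical sketch of that round trip, not our vendor's actual schema.

```python
from dataclasses import dataclass

@dataclass
class PublishedSegment:
    segment_id: str      # stable ID assigned by the TMS at export time
    source_text: str     # original-language text
    mt_text: str         # machine translation currently shown on the page

@dataclass
class UserSuggestion:
    segment_id: str      # lets a reviewer find the exact TMS segment
    suggested_text: str  # the user's alternate translation
    page_url: str
    locale: str

def to_tms_review_item(s: UserSuggestion) -> dict:
    """Shape a suggestion for hand-off to a TMS review queue (illustrative)."""
    return {"segment": s.segment_id,
            "proposal": s.suggested_text,
            "context": {"url": s.page_url, "locale": s.locale}}
```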

Flowchart with wireframes of feedback module.

Prototype Pilots

We decided to set up small prototype projects to test both options with test groups. The scope of each was a small subset of content (10,000 words) in two of our largest established languages plus our new top-priority language. I ran the content through machine translation and published it to two different pages.

The first option was a simple thumbs-up/thumbs-down feedback module. Results were tallied in the CMS as an object, borrowed from the Help Center's feedback module. Users liked the simplicity, but some commented that there wasn't any way to elaborate, a concern we had had going into the prototype.

The second option was presented as a wireframe so that we could release it at the same time as the simple feedback module. It gave users the opportunity to edit the translation themselves. In the prototype, it didn't link back to the TMS; instead, all edits were saved to a spreadsheet where they could be reviewed and pasted into the TMS later. It also gave users a way to provide general feedback about the content. At the time, we were concerned that some of the content that had been created reflected too much of a Western point of view, and we thought the ability to give general feedback would be valuable. Lastly, we gave users the option to contribute their own content, upload media, and more.
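Because the prototype wasn't wired to the TMS, the edits could be captured flat. Something like the snippet below is all it takes; the file name and columns are assumptions for illustration, not the prototype's exact format.

```python
import csv
from datetime import datetime, timezone

def log_user_edit(path, segment_id, original_mt, user_edit, locale):
    """Append one user-suggested edit to a spreadsheet for later TMS review."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([
            datetime.now(timezone.utc).isoformat(),
            locale, segment_id, original_mt, user_edit,
        ])

# Example usage with made-up values:
log_user_edit("mt_feedback.csv", "seg-0042",
              "Ayuda a encontrar tu árbol familiar.",
              "Te ayuda a descubrir tu árbol genealógico.", "es")
```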

Learnings and Takeaways

The feedback we received from our volunteer users, as well as from product owners, showed us that a combination of the two options would be best. We have shelved building it until the dev work needed to send user translation changes directly to the TMS can be done. And though the feature hasn't been implemented, it gave FamilySearch the confidence to use machine translation more aggressively. For RootsTech, the annual conference FamilySearch puts on, all of the presenter bios went through machine translation and were then reviewed by volunteers, many of whom spoke only the target language and were reviewing only to make sure the translation made sense and didn't contain anything strange or offensive.

More importantly, it led to highly innovative new ways to train MT and to utilize locally sourced content to improve future MT. There is a lot of potential in these learnings that could be a huge boon for the field of MT and neural systems.

