How to deal with different file types

 


Text Files (.doc/.docx/.txt)

Regular text files without special formatting or case sensitive text can be analysed using default settings with no extra steps required.

Comments and Hidden Text

Depending on the client’s wishes, some texts may contain client comments or hidden text that also require translation. By default, Memsource will not pick up either during analysis. To include them in the translation, tick the corresponding checkboxes during file upload.

Special formatting

Clients often do not need a full translation of a file: some parts may be there only for context, or may already have been translated in a past order. The text might be color coded or formatted in specific ways to show us which sections require translation and which do not (e.g. only text highlighted in yellow requires translation). Without preparation, Memsource cannot tell these sections apart, so the translator would have to work around the unneeded text, which is counterproductive. Leaving unneeded parts in the file can also cause issues if a translator misinterprets the instructions or makes a mistake. Preparing such a file by hand can be extremely time consuming and labor intensive. Luckily, if the formatting and instructions are clear, we can quickly and easily “rework” the file using Word’s “Advanced Find & Replace” function.

In this example we only need to translate the text in yellow, but keep the un-highlighted text intact. To make the file as simple and accessible as possible to translate, our end goal is to have only the highlighted text remain and to be able to easily restore the “not-to-translate” text afterwards.

To achieve this we need to do the following:

  1. Select all of the un-highlighted text via “Advanced Find & Replace”. We can open the tool by pressing “CTRL + H” and selecting it from the cogwheel drop-down menu. There we tick the checkbox for “Highlight all items found in: Main Document” and select “Highlight” from the “Format” drop-down menu twice. If we select it only once, it will select the highlighted text instead (in this example we want the opposite, but this can be adapted depending on the situation).
  2. Hide all of the selected text. This is done by right-clicking anywhere on the selected text and picking the “Font” option. In the new screen we simply tick the checkbox for “Hidden”.

The end result leaves us with a sleek file with only the needed text remaining. We save it, upload it to Memsource and let the translator do their work.

By having the text hidden this way we can easily process it in Memsource and then revert it to its original shape whilst adhering to the client’s formatting. 

Once we receive the translated file, we can unhide the text: select the entire document with the hotkey “CTRL + A” (“CMD + A” on a Mac), right-click the selected text, go back to the “Font” option and uncheck “Hidden”. The result is a perfectly processed file.

Selective .txt file translation

A unique quality of .txt file processing in Memsource is the ability to use regular expressions to define translatable text. Whilst using regular expressions to convert text into tags within Memsource is possible on most file types, it's rarely the case that we can define the exact text within a file without relying on editing the file itself. 

For examples of how to work with regular expressions, please see the regular expressions section further on.

 


Excel files (.xls/.xlsx)

Regular uploading

If we use default settings when analyzing an Excel file, Memsource will process it the same way it would a Word or text file: the source text is overwritten with the translation in all sheets and cells of the Excel file.

Standard features such as importing hidden cells and comments are available. We can also choose to translate the name of each sheet, if needed.

One specific setting to make note of is "HTML Processing". Unless the file contains character entities in place of normal letters and symbols (e.g. &lt; instead of <; &Ouml; instead of Ö), this setting should be deactivated. If it is turned on for files that don't require it, content might not be imported into Memsource or, even worse, the file might not upload at all.
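To judge whether a file really contains character entities before enabling "HTML Processing", a quick check outside Memsource can help. The following is a minimal Python sketch, not a Memsource feature; the function name and the sample strings are made up for illustration.

```python
import html
import re

# Named (&Ouml;), decimal (&#214;) and hexadecimal (&#xD6;) character entities.
ENTITY_PATTERN = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")

def contains_entities(cell_text: str) -> bool:
    """Return True if the text holds at least one entity that decodes to a character."""
    for match in ENTITY_PATTERN.finditer(cell_text):
        if html.unescape(match.group()) != match.group():
            return True
    return False

# Hypothetical cell contents: the first uses entities, the second does not.
samples = ["Gr&ouml;&szlig;e &lt; 10", "Tom & Jerry; the sequel"]
for text in samples:
    print(contains_entities(text), "->", html.unescape(text))
```

If no cell in the file triggers the check, the file most likely does not need "HTML Processing".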


Multilingual file setup

We also have the option to create a multilingual Excel file. This is useful for creating a side by side comparison of the source text and the translation. Another available feature is setting character limits for each segment or column.

Recently Memsource has released a feature that automatically detects and sets up multilingual files. This requires the first row of each column to contain the exact ISO code of that column's language. However, this does not allow character limits to be embedded automatically; if that is required, we must set up the file manually as described below.

In this example we will create a simple 2 column task with character limits on individual cells.

Our end goal is to keep the source text in column A, have the translation in column B and adhere to each cell’s character limit as stated in column C.

Please note: each cell that has a limit needs its own character count. If a cell does not have a number associated with it, Memsource will assume there is no character limit for it.

To achieve this we need to describe the conditions within Memsource, while uploading the file.

 

 

We need to specify 3 criteria: where the source text is located, where the target text should go and where the character limit is specified. If required, we can designate multiple columns by separating the column letters with a comma (,).

E.g. Source text in columns A, C, D would be written as “a, c, d”. For each source column, there needs to be a target column. 

 

Setting up character limits in Memsource creates a visual indicator that shows the linguist when a limit is exceeded, reducing possible errors during translation.
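The same rule can also be checked outside Memsource, for example on an exported file before delivery. The sketch below is only an illustration of the rule described above (a row without a number has no limit); the row layout, function name and sample texts are assumptions.

```python
from typing import List, Optional, Tuple

# Each row mirrors the example layout: (source in A, target in B, limit in C).
# A limit of None stands for an empty cell in column C, i.e. no limit.
Row = Tuple[str, str, Optional[int]]

def find_exceeded_limits(rows: List[Row]) -> List[Tuple[int, int, int]]:
    """Return (row number, target length, limit) for every target that is too long."""
    exceeded = []
    for row_number, (_source, target, limit) in enumerate(rows, start=1):
        if limit is not None and len(target) > limit:
            exceeded.append((row_number, len(target), limit))
    return exceeded

# Made-up sample rows.
rows = [
    ("Add to cart", "In den Warenkorb", 12),
    ("Checkout", "Zur Kasse", 10),
    ("Free shipping on all orders", "Kostenloser Versand", None),
]
for row_number, length, limit in find_exceeded_limits(rows):
    print(f"Row {row_number}: {length} characters, limit is {limit}")
```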

 

 


CSV

 

The key difference between .xlsx/.xls files and .csv files is how much and what kind of data they contain. A CSV file is just a text file: it stores data but does not contain formatting, formulas, macros, etc. Such files are also known as flat files. They are often used to export product text or website data without much effort. This usually means that only certain parts of the CSV can be changed without making the file unusable.

 

Normally, if a CSV file contains clean, short text segments that use only Latin letters, are separated by commas and can be changed in their entirety, Memsource will have no issue processing it using default settings.

This never happens.

File preparation 

To process a CSV file correctly we need to make sure that these 3 criteria are met.

  1. The chosen delimiter must not clash with symbols that occur naturally in a sentence (if the delimiter is a comma, it may incorrectly split a long sentence that needs a comma to be grammatically correct).
  2. The output encoding needs to accommodate the symbols used in the language, otherwise it may show a “?” instead of the correct letter.
  3. Only the columns that need translation are processed.
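The first criterion is easy to see in a small experiment. The following Python sketch uses the standard csv module on a made-up two-column export; it only illustrates how the delimiter choice changes the columns and says nothing about how Memsource parses files internally.

```python
import csv
import io

# Made-up export: two columns (ID and text) separated by semicolons.
raw = "id;text\n1;Fast, reliable and easy to use\n2;Free delivery\n"

def read_rows(data: str, delimiter: str) -> list:
    """Split CSV text into rows using the given column delimiter."""
    return list(csv.reader(io.StringIO(data), delimiter=delimiter))

correct = read_rows(raw, ";")
wrong = read_rows(raw, ",")

print(correct[1])  # two cells: the ID and the whole sentence
print(wrong[1])    # the sentence is cut at the comma
```

With the wrong delimiter, the sentence ends up spread over two columns, and only part of it would be imported for translation.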

 

In the “Multilingual CSV” tab we need to choose what column delimiter to use and where the source text will be once the text is delimited. 

Next we need to make sure that the output encoding matches the language character set we need. The default encoding is “UTF-8” and for most cases it will be sufficient. This needs to be examined on a per language basis. 
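Whether an encoding can hold a given text can be tested directly. This is a minimal Python sketch with a made-up sample string; the encodings listed are just examples, not a recommendation for any particular language.

```python
text = "Größe: 10 zł"  # sample text with German and Polish letters

def survives(text: str, encoding: str) -> bool:
    """Return True if every character in the text can be stored in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for encoding in ("utf-8", "cp1252", "ascii"):
    print(encoding, survives(text, encoding))

# With errors="replace", unsupported letters are silently turned into "?".
print(text.encode("ascii", errors="replace").decode("ascii"))
```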

Lastly, we tell Memsource to treat this as a “Multilingual CSV” file type. By default it will assume it is a regular CSV format and will ignore the setup we did earlier.

Once we finish uploading the file, we should always check the file in the CAT tool to make sure it is correctly formatted.

Exporting and final checks

Once the translation is finished and is ready for delivery, we must do a final check on the file to make sure that the character set we selected originally is correct and does not corrupt the translated text. 

Unlike Excel files, CSV files should not be opened directly in Excel. If the file's encoding does not match the default encoding that Excel uses, Excel may transform the text, and saving the file would then overwrite and ruin it.

We can circumvent this by importing the final file into Google Sheets. This way we will not jeopardise the file’s integrity.
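As an additional check that does not involve opening the file in any spreadsheet tool, we can test whether the delivered bytes are valid in the encoding we expect. The sketch below is an assumption-based illustration: the sample bytes stand in for a real exported file, and the function name is hypothetical.

```python
def decodes_cleanly(data: bytes, encoding: str = "utf-8") -> bool:
    """Return True if the bytes are valid in the expected encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# Stands in for the exported file; for a real file the bytes could be read
# with pathlib.Path(...).read_bytes().
delivered = "id;text\n1;Größe\n".encode("utf-8")

print(decodes_cleanly(delivered))
# Reading the same bytes with a different encoding garbles the letters,
# which is the kind of damage a mismatched default encoding can cause.
print(delivered.decode("cp1252"))
```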

 


PDF

The PDF format was created as a format used in printing, where it was useful for the DTP person to lock down information about colour profiles, layout, fonts etc. in a packaged format that would show in exactly the same way on all computers and operating systems, so they could just send it to the printer and get accurate results. The format has since become a universal packaging format for many different file types exchanged online, and because of this many clients think that converting their original source document to PDF before sending it is the best way to do it. For this reason, we always need to ask the client for the original, editable source document before we quote.

When we’ve established that the client does not have the original source file and we need to work with the PDF, we should consider the quality of the PDF along with the requirements of the client. Most of the time, the client doesn't need a perfect file back, just a readable text, so it is important to clarify this when we talk with the client.

PDF files, while technically supported by Memsource, should never be used without prior preparation. PDF files need to be first converted to a format more accessible and editable by Memsource such as a Word Doc file. 

The initial step when presented with a PDF file to translate by the client is to simply inquire if a different format is available. If it is not a scanned PDF, it would have to have been created from a MS Word, InDesign or Photoshop file as a source. All of the latter formats are far more easily readable, and layout can be preserved, resulting in less time and resources spent in pre and post-DTP. 

Without going into extensive DTP procedures, we have 2 options in regard to preparing PDF files for Memsource.

  1. Converting them via the Adobe document cloud if the file is mostly text with very minimal visual elements.
    https://documentcloud.adobe.com/link/home/
  2. Using ABBYY FineReader to selectively recognize and convert parts of the PDF, leaving out visual images that may be interfering. To a certain degree, layout and even image assets can be preserved using this method.

 

Quick checklist of things to keep in mind while dealing with PDF files:

  • Never hit the Analyze button in our Platform when the uploaded source files are PDFs. It will create a messy, useless project in Memsource.
  • When the client uploads PDF files as source files, always ask for the original source files.
  • When quoting a PDF project, bear in mind the additional cost and time involved when dealing with this format if it cannot be cleanly converted to a .docx file.

InDesign & Photoshop (.indd/.idml/.psd)

All 3 of these file types can be handled by Memsource with little to no issue. The key point when preparing InDesign files is selecting the correct files to translate. Often, when a client requests translation of an InDesign file, they will provide an entire package of assets: font files, images, linking files, example .pdfs and, more often than not, both .indd and .idml files. If the client needs the visual integrity to be retained, pre-DTP and post-DTP are required.

For the translation portion of the request we only need to upload .idml files (we can also use .indd, but it’s recommended to use .idml if available). Memsource will have no issue recognizing the file type and identifying text fields within the entirety of the file, as long as it is not an image. It’s important to communicate with the client on what exactly needs to be translated within the InDesign file and if pre-DTP is required. 

Once we send the files for post-DTP, we need to provide all of the files associated with the request, images, fonts and all. 


Programming files (.html/.php/.xml/.json/.xliff)

These programming code files usually have an extension on the file name such as “.html”, “.php” or “.json”. Many of them also have an identifier at the very beginning of the file that denotes which language is used (for example, <?xml or <?php).

 

Memsource will use the file extension to detect what type of file it is and process it accordingly. This can cause a problem if the source code within the file is in a different programming language than the extension implies (it would be like asking a Spanish speaker to read Portuguese).
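A mismatch between extension and content can often be spotted before upload by looking at how the file starts. The following Python sketch is a rough heuristic under stated assumptions: the markers it checks and both function names are illustrative, it does not reflect how Memsource detects file types, and variants such as .htm or .xlf would need to be added.

```python
import json

def sniff_code_format(text: str) -> str:
    """Guess the format from how the content starts; returns "unknown" if unsure."""
    head = text.lstrip("\ufeff \t\r\n")  # skip a BOM and leading whitespace
    lowered = head[:200].lower()
    if lowered.startswith("<?php"):
        return "php"
    if lowered.startswith("<?xml"):
        return "xliff" if "<xliff" in head[:2000].lower() else "xml"
    if lowered.startswith("<!doctype html") or lowered.startswith("<html"):
        return "html"
    if head[:1] in ("{", "["):
        try:
            json.loads(head)
            return "json"
        except ValueError:
            return "unknown"
    return "unknown"

def extension_matches(filename: str, text: str) -> bool:
    """Compare the file extension with the sniffed format."""
    extension = filename.rsplit(".", 1)[-1].lower()
    return extension == sniff_code_format(text)

# A file named .json that actually contains XML is flagged as a mismatch.
print(extension_matches("strings.json", '<?xml version="1.0"?><root/>'))
```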

In essence, raw programming code files are not designed to be readily translated. To the untrained eye, it can be very difficult to tell what is translatable text and what is untranslatable code that needs to be converted into tags. Without such a conversion, the chance of human error is very high, and one misplaced comma or letter can completely invalidate the file. Depending on the type of file, different measures need to be taken.

WordPress site exports

Clients often send over WordPress RSS files under the pretense that they are website exports. These files are generated with the .xml file extension. It is important to note that this is not the kind of export that allows us to translate a client's website: it is designed to carry over structure, but not content. We can easily identify these files by opening them in any basic text editor and looking at the preamble that is written at the top of these exports as comments.

By no means should these files be used for production or for quoting purposes.
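The manual check in a text editor can also be automated when many files arrive at once. The sketch below is a rough Python illustration; the marker strings are assumptions about what the preamble of such an export contains and should be compared against a real export before relying on them.

```python
# Assumed markers; verify them against an actual WordPress export first.
WXR_MARKERS = ("WordPress eXtended RSS", "wordpress.org/export/")

def looks_like_wordpress_export(text: str, scan_chars: int = 5000) -> bool:
    """Return True if the start of the file contains one of the assumed markers."""
    head = text[:scan_chars]
    return any(marker in head for marker in WXR_MARKERS)

# Made-up file beginnings for illustration.
export_like = '<?xml version="1.0"?>\n<!-- This is a WordPress eXtended RSS file -->\n<rss></rss>'
ordinary_xml = '<?xml version="1.0"?>\n<strings><string id="1">Hello</string></strings>'

print(looks_like_wordpress_export(export_like))
print(looks_like_wordpress_export(ordinary_xml))
```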

Without going into API integrations, the 2 most common methods used in WordPress website translation are:

  1. Using phpMyAdmin to export and then reimport a .csv file

  2. Using the WPML WordPress plugin to export XLIFF, PO and other file types that are specifically designed for localization (the plugin is called WordPress Multilingual, shocking I know).

For more detailed information on the exportation methods mentioned, you can read up here.

Unfortunately, this is as far as we can help the client on the matter of exporting. The nitty gritty details on a plausible way to export/import their website will need to be figured out by their tech team as different setups require different levels of designation and exportation.

Regular expressions (Regexp)

A regular expression (abbreviated regex or regexp) is a sequence of characters that forms a search pattern, mainly used for matching text within strings. In that respect, it works similarly to "find and replace" operations.

 

When analysing XML, PHP, XLIFF, etc. files that have the option to "Convert to Memsource Tags", you can use the following regular expressions to tag the strings that you want.

 

 

Important: Always double check if the tags were applied as intended.

Regex expression          Converts to Memsource Tags
<[^>]+>                   HTML tags
\[[^\]]+\]                [variable]
\{[^\}]+\}                {variable}
\[\[[^\]]+\]\]            [[ variable]]
\{\{[^\}]+\}\}            {{ variable}}
\"[^"]*\"                 text inside “”
\"[^\"]+\"                text inside “” (alternative)
\&.*\;                    text between & and ;
\&nbsp\;                  &nbsp;
\{[^\}]+\}|\[[^\]]+\]     for {variable} and [variable]
(\S*%\S*)                 any string that contains %
\#[^s]+s                  Everything between # and the first s
\$[^=]+=                  Everything between a $ and the first = sign
(\[[^\]]++\])++           Convert XML WordPress specific "language"
http[^ ]*\.jpg            ".jpg" links that begin with "http"
(?<=\: ").*(?=")          Everything between : " and the last " on the line
|                         Separator for the case when you are using multiple regex expressions - https://en.wikipedia.org/wiki/Vertical_bar (vertical bar, not capital i)
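To preview what a pattern would turn into tags, it can be tried on a sample segment first. The sketch below uses Python's re module as a stand-in; Memsource's own regex engine may behave differently in details, and the numbered placeholders only imitate how tags look to the linguist. The function name and the sample segment are made up.

```python
import re

def to_tags(segment: str, pattern: str) -> str:
    """Replace each match with a numbered placeholder such as {1}, {2}, ..."""
    counter = 0

    def number_match(_match):
        nonlocal counter
        counter += 1
        return "{" + str(counter) + "}"

    return re.sub(pattern, number_match, segment)

# Two patterns from the table, joined with the vertical bar separator.
pattern = r"<[^>]+>|\{[^\}]+\}"
segment = "<p>Hello {name}, your order is <b>ready</b>.</p>"
print(to_tags(segment, pattern))
```

Everything that is replaced by a placeholder would be protected as a tag; whatever remains is what the linguist gets to translate.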

 

To understand Regexp a bit better - see the descriptions of basic special characters below:

Character   Description
\           Escapes a metacharacter so that it is matched literally instead of having its special meaning. That is, \[ will search for the [ bracket.
.           Matches any single character.
[ ]         A bracket expression; matches a single character that is contained within the brackets.
[^ ]        Matches a single character that is not contained within the brackets.
*           Matches the preceding element zero or more times.
?           Matches the preceding element zero or one time.
+           Matches the preceding element one or more times.
|           The choice operator OR. Matches the first or second condition. Is used for combining several regexps together. For example, <[^>]+>|\{[^\}]+\} will match all <html> code and all {variables}
&           The AND operator; matches the expression before and the expression after the operator. Not every regex flavour supports it, so test it before relying on it.
{ }         Range quantifiers.
( )         Grouping of expressions.
< >         Anchors that specify a left or right word boundary in some regex flavours. In the patterns above they are used as literal angle brackets.
-           Range in a character class (for example [A-Z])
$           End of a line.
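The difference between . with * and a negated bracket expression matters in practice, because .* is greedy and matches as much as it can. The following Python sketch compares the "text between & and ;" pattern with a bounded alternative; the alternative pattern and the sample text are illustrative assumptions, and Memsource's engine may differ from Python's in details.

```python
import re

text = "Tom &amp; Jerry &nbsp; &copy; 2020; all rights reserved"

# Greedy: runs from the first & to the last ; in the text.
greedy = re.findall(r"\&.*\;", text)

# Bounded: stops at the first ; and refuses to cross whitespace.
bounded = re.findall(r"&[^;\s]+;", text)

print(greedy)
print(bounded)
```

The greedy pattern swallows translatable words between the first & and the last ;, which is why tags should always be double checked after applying a pattern.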

 

HTML

HTML formatting can appear in various file types, such as Excel, XML or even TXT.

In these cases, if the HTML tags are not processed, the translation job will have segments filled with HTML tags. HTML tags can be identified by the angle brackets they use:

Usually, when you see a tag enclosed in angle brackets, such as <sometag>, it is an HTML tag.

Memsource can handle these tags if we check "Process HTML" in the settings box of the relevant file type. Whenever this is selected, the HTML settings box also becomes active for this file. In the HTML settings box, choosing "Preserve Whitespace" helps preserve the text layout in the delivered output file.
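To see how much of a cell is markup and how much is translatable text, the content can be run through an HTML parser. This is a minimal Python sketch using the standard html.parser module; the class and function names are made up, and it only illustrates the split between tags and text, not Memsource's own processing.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects text pieces and counts tags while parsing."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self.tag_count = 0

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1

    def handle_endtag(self, tag):
        self.tag_count += 1

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data.strip())

def split_cell(cell: str):
    """Return (list of text pieces, number of tags) for one cell."""
    collector = TextCollector()
    collector.feed(cell)
    collector.close()
    return collector.texts, collector.tag_count

texts, tag_count = split_cell("<p>Free <b>shipping</b> today</p>")
print(texts, tag_count)
```

A cell with a tag count above zero is a sign that "Process HTML" is worth enabling for that file.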

 

WordPress XLIFF

When importing WordPress XLIFF files generated by the WPML plugin, make sure to manually select the WordPress XLIFF file type in the New Job screen.