The background
InfoJobs.net is Adevinta’s job board in Spain and Italy. A place where companies can hire the right talent, and where people can find their next job.
InfoJobs, much like any other marketplace in Adevinta, makes use of YAMS (Your Adevinta Media Service) to store and retrieve media assets. But, unlike other marketplaces which mainly store images, InfoJobs stores, transforms and retrieves mostly documents (résumés, CVs, cover letters, etc.).
YAMS lets users upload a document (in basically any format) and transform it into a PDF. To do so, YAMS parses these objects by means of two key projects:
1. LibreOffice (open source)
2. UniPDF (commercial)
The first is used to transform any document (that is not already a PDF) into a PDF, and the second is used to perform extra transformations on the output PDF, for example, applying a watermark. After the input document has been transformed, it is delivered to the requesting users via a CDN.
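For the curious, here is a minimal sketch of what that first step can look like, using LibreOffice in headless mode (the file names are made up, and YAMS's actual pipeline is internal and considerably more involved):

package main

import (
	"fmt"
	"os/exec"
)

// convertToPDF shells out to LibreOffice in headless mode to turn an
// arbitrary office document into a PDF written to outDir.
func convertToPDF(inputPath, outDir string) error {
	cmd := exec.Command("soffice",
		"--headless", // run without a GUI
		"--convert-to", "pdf", // target format
		"--outdir", outDir, // where the resulting PDF is written
		inputPath,
	)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("libreoffice conversion failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	if err := convertToPDF("cover-letter.docx", "/tmp"); err != nil {
		fmt.Println(err)
	}
}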
We are the team responsible for creating and maintaining the YAMS service, the Edge team, and this is our story of customer obsession, collaboration and solidarity.
The issue
One Monday morning, we received a message from one of InfoJobs’ engineers about YAMS returning errors (HTTP 500) while trying to fetch certain PDF documents.
We examined our log files and found that a bunch of PDF files were causing transformation failures with a never-seen-before error from UniPDF.
The team dug deeper and spotted the issue: the files uploaded by some InfoJobs users looked like corrupted PDFs. Because they were not formally valid, UniPDF failed to parse them and threw an error. Our routines caught this error but, instead of classifying it as a file format error (HTTP 4xx), treated it as if something had broken inside YAMS itself (hence the HTTP 500).
We have seen plenty of examples of corrupted and invalid objects being uploaded to YAMS, so we did what we usually do in these cases and instructed YAMS to:
1. Flag these files as corrupted.
2. Return the correct error code.
Then we retrieved the offending documents and added them to the YAMS test suite. This is part of our standard way of working, as it helps to make sure that we will keep returning the expected error code when similarly broken files are encountered in the future.
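To give an idea of what the second point means in practice, here is a minimal sketch of that kind of error classification in Go (the sentinel error and function are hypothetical; YAMS's real code is internal and differs):

package main

import (
	"errors"
	"fmt"
	"net/http"
)

// errInvalidDocument marks inputs that the PDF library refused to parse.
// Hypothetical sentinel: the real YAMS error types are internal.
var errInvalidDocument = errors.New("invalid input document")

// statusFor maps a transformation error to an HTTP status code:
// broken client uploads get a 4xx, anything unexpected stays a 5xx.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, errInvalidDocument):
		return http.StatusUnprocessableEntity // 422: the uploaded file itself is broken
	default:
		return http.StatusInternalServerError // 500: something really broke in the service
	}
}

func main() {
	parseErr := fmt.Errorf("parsing cv-1.pdf: %w", errInvalidDocument)
	fmt.Println(statusFor(parseErr)) // prints 422 instead of a misleading 500
}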
The doubt
Normally that would have been the end of it, had we not noticed that all the offending files were generated by the “Servicio Canario de Empleo” (Canary Islands Employment Service – SCE) and that they were all résumés of people looking for employment. This realisation prompted us to investigate a little further.
We asked ourselves: is it a coincidence that all the files causing this new error have been generated by the same organisation? Could it be that these files are not actually corrupted, but instead conflicting with the way UniPDF does its parsing? Is there something we can do for these files to be processed correctly? Can we lend a hand to the less tech-savvy of InfoJobs’ user base?
So we tried a few things: we updated the UniPDF library to the latest version and reprocessed the broken files, but we still got the same error. Then we tested those files with other tools (e.g. qpdf and pdfinfo), but the validity check failed each time.
In a last-ditch attempt, we tried to open the files with different viewers (e.g. Adobe Acrobat Reader, macOS Preview, LibreOffice Draw, Google Chrome, Mozilla Firefox) and, to our great surprise, they were all able to display the PDFs.
At this point, we had to get to the bottom of the issue.

The communication
The next morning, during our daily sync meeting, it was clear that we had two options: talk to the SCE or ask the UniPDF people. So we decided to do… both.
While some of us contacted the technical office of the Canary Islands local government, others wrote a support request to UniPDF.
The people from the Canary Islands Employment Service were kind enough to provide us with a PDF generated by their systems that didn’t contain any personal information (as opposed to the files uploaded by the users via InfoJobs that contained all sorts of sensitive data). We attached this file to our UniPDF support request, hoping we could leverage their expertise to make sense of this inexplicable behaviour.
At the same time, we started our own internal investigation.
The rabbit hole
We obtained the latest version of the PDF standard (ISO 32000-2, which defines PDF 2.0) and started learning about it. At the same time, we started taking apart the “broken” PDFs produced by the SCE. It did not take long before we realised that none of the SCE documents complied with the PDF standard.
Specifically, we found out that they were the result of two completely independent PDFs being appended one after the other inside the same file.
The first PDF is the original document (a person’s résumé), while the second is the same document again but with an additional signature footer explaining how to verify its authenticity.
This finding would explain why UniPDF was unable to parse the files. Why other readers were able to visualise the files successfully remained a mystery.
In the meantime, we received a reply from the UniPDF technical support team, which basically highlighted the same issues we had already found.
The analysis
We analysed the test PDF generated by the SCE to understand why our system was unable to handle it correctly. Our investigation revealed that the file was composed of two PDFs, joined one after the other.
As this is incompatible with the PDF standard, the tools we used to check the document’s integrity reported errors when attempting to validate it.
Validity check with qpdf
$ qpdf -check cv-1.pdf
WARNING: cv-1.pdf: file is damaged
WARNING: cv-1.pdf (offset 44364): xref not found
WARNING: cv-1.pdf: Attempting to reconstruct cross-reference table
qpdf: cv-1.pdf: unable to find /Root dictionary
Validity check with pdfinfo
$ pdfinfo cv-1.pdf
Syntax Error (44372): Illegal character '{'
…
Content of the test file
As described earlier, the test file contains two complete PDFs. We were able to extract both PDFs, which are shown in the table below.
The one on the left represents the PDF that runs from the beginning of the file to the first %%EOF marker (i.e., from offset 0 to offset 35424).
The PDF on the right represents the one that starts at the second %PDF- header and continues to the end of the file (i.e., from offset 35425 to offset 80599).
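Once the layout is known, splitting such a file is straightforward. A rough sketch in Go, assuming exactly two concatenated documents as in our test file:

package main

import (
	"bytes"
	"log"
	"os"
)

func main() {
	data, err := os.ReadFile("cv-1.pdf")
	if err != nil {
		log.Fatal(err)
	}

	// Find the second %PDF- header; skipping the first byte makes sure
	// the header at offset 0 does not match.
	i := bytes.Index(data[1:], []byte("%PDF-"))
	if i < 0 {
		log.Fatal("only one %PDF- header: file is not a concatenation")
	}
	split := i + 1 // 35425 in our test file

	// Write the two embedded documents out separately. Each half is then
	// self-contained: the second document's xref offsets are relative to
	// its own header, so once it stands alone they point to the right
	// places again.
	if err := os.WriteFile("cv-1-first.pdf", data[:split], 0o644); err != nil {
		log.Fatal(err)
	}
	if err := os.WriteFile("cv-1-second.pdf", data[split:], 0o644); err != nil {
		log.Fatal(err)
	}
}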


The details
Our (limited) knowledge of the PDF standard allowed us to spot a couple of reasons why the way these files are built makes them non-compliant.
Double %PDF- header
According to the PDF standard, there should only be one %PDF- header in the entire file. However, the file we tested contained two such headers: one at the start of the file (where it should be) and another at offset 35425, immediately following the %%EOF (End Of File) marker of the first document.
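Schematically, the junction between the two embedded documents looks like this (header version numbers elided):

%PDF-…       ← first header (offset 0)
…
%%EOF        ← end of the first PDF (its last byte is at offset 35424)
%PDF-…       ← second header (offset 35425): not allowed by the standard
…
%%EOF        ← end of the file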


Cross-reference (xref) table
The standard dictates that at the end of a PDF file, there must be a table called the xref table (cross-reference table). This table should contain a reference to the byte offset of every object in the PDF file.
The xref table in the PDF generated by the SCE has not been updated to reflect the new positions of the objects in the combined file. Instead, it still points to offsets as if the two PDFs had not been concatenated.
The first 10 digits of each entry in the table represent a byte offset relative to the beginning of the PDF file. The usual assumption is that the PDF starts at the beginning of the file, but in our test document the second PDF actually begins at offset 35425 (i.e. the offset of the second %PDF- header). All offsets in the second xref table should therefore have been shifted to reflect their new positions in the combined file, but they have not. As a result, some parsers cannot locate the objects, because they are not where the xref table says they should be.
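To make this concrete, take a hypothetical object stored 17 bytes after the second %PDF- header. Its entry in the second xref table reads as if that document started at byte 0 of the file, whereas in the combined file the object really sits 35425 bytes further in:

0000000017 00000 n     ← what the table says (relative to the second %PDF- header)
0000035442 00000 n     ← what it should say in the combined file (17 + 35425)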
Comparison between the current xref table and what it should look like

The solutions
There are at least two possible solutions to improve the way the SCE produces and authenticates these documents:
1. Update all offsets in the second document.
2. Use incremental updates instead of appending two PDFs one after the other.
Solution #1
We noticed that correcting the offsets in the second xref table makes the entire file significantly more usable (and much closer to a proper PDF), even though the result still does not fully comply with the PDF standard.
The Adevinta Edge team has developed a small tool to correct the offsets in this type of file.
The tool, instructions and related source code can be found in this public GitHub repository: SCEfix.
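SCEfix is the place to look for the complete implementation, but as a simplified illustration of the idea (not the actual SCEfix code), one could shift every in-use xref entry and the startxref pointer of the second document by the offset of its header:

package main

import (
	"bytes"
	"fmt"
	"log"
	"os"
	"regexp"
	"strconv"
)

// An in-use xref entry: 10-digit offset, 5-digit generation number, flag 'n'.
// Free entries ('f') are left alone, since their first field is not a file offset.
var entryRe = regexp.MustCompile(`(?m)^(\d{10}) (\d{5}) n`)

// The startxref keyword is followed by the byte offset of the xref table.
var startxrefRe = regexp.MustCompile(`startxref\s+(\d+)`)

// shift adds base to the first capture group of every match, padding the
// result to width digits (width 0 means no padding).
func shift(re *regexp.Regexp, data []byte, base, width int) []byte {
	return re.ReplaceAllFunc(data, func(m []byte) []byte {
		sub := re.FindSubmatch(m)
		n, _ := strconv.Atoi(string(sub[1]))
		return bytes.Replace(m, sub[1], []byte(fmt.Sprintf("%0*d", width, n+base)), 1)
	})
}

func main() {
	data, err := os.ReadFile("cv-1.pdf")
	if err != nil {
		log.Fatal(err)
	}

	// Everything after the second %PDF- header belongs to the appended
	// document, whose xref offsets are relative to that header.
	i := bytes.Index(data[1:], []byte("%PDF-"))
	if i < 0 {
		log.Fatal("no second %PDF- header: nothing to fix")
	}
	base := i + 1 // 35425 in our test file

	head, tail := data[:base], data[base:]
	tail = shift(entryRe, tail, base, 10)    // fix the xref entries
	tail = shift(startxrefRe, tail, base, 0) // fix the startxref pointer

	// Caveat: a naive regex pass can in principle also match digits inside
	// content streams; a real tool parses the file structure instead.
	if err := os.WriteFile("cv-1-fixed.pdf", append(head, tail...), 0o644); err != nil {
		log.Fatal(err)
	}
}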
Solution #2
The PDF standard seems to address this exact use case.
If you have a PDF document whose content you do not want to modify (for instance, because it has been digitally signed, and any changes would invalidate the signature), you can apply what is known as an incremental update.
In essence: you take the original PDF as it is, without making any changes, and then append only the updated parts at the bottom. In this case, the updated parts simply include the table with authenticity checks, the barcode, the QR code, etc.
More formally, after the trailer of the original PDF, you only append the new body, the xref table (with correct offsets), and the corresponding trailer section (note: no new header is added).
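Schematically (and slightly simplified), an incrementally updated file looks like this; the /Prev key in the new trailer points back to the original xref table, so a reader can find both:

%PDF-1.x                                    ← single header
… original objects …
xref
… original table …
trailer << … >>
startxref
(offset of the original xref table)
%%EOF
… new objects only (signature footer, QR code, etc.) …
xref
… entries for the new objects, with correct absolute offsets …
trailer << … /Prev (offset of the original xref table) >>
startxref
(offset of the new xref table)
%%EOF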

This new PDF is composed of two sections, the original and the update, and it fully complies with the PDF standard while retaining all the characteristics of the current SCE method.
The future
The team reported our detailed findings to the Canary Islands local government, with the recommendation to implement at least one of the two proposed solutions. We took the liberty of sending our recommendations because we care for the InfoJobs user base and for our brothers and sisters in the Canary Islands.
Four weeks later we received a reply to our email.
The technical support office told us that they had created a low-priority ticket to address the incompatibility we reported, and that they would let us know as soon as they had an estimated delivery date. They also said they would be in touch, should they need any technical help from Adevinta.
We are confident that this whole ordeal will end up improving the way these CVs are generated, which will be a good thing for all the parties involved: the Canary Islands’ local government, InfoJobs, YAMS and – most importantly – the end users.
The lesson
We, as members of the Edge team, learned a lot during this journey. Yes, we acquired new technical know-how (which is never a bad thing), but we also learned something about ourselves, as a team and as individuals.
The team didn’t just stop at fixing the technical issue for our service; we tried to understand and address the root cause. Moreover, by engaging multiple stakeholders, including the Canary Islands Employment Service and the UniPDF developers, the team demonstrated that complex problems often require cooperation and knowledge-sharing across organisations. We committed to deeply understanding the issue, down to the internals of the PDF standard, which highlights our persistence and technical curiosity in resolving unusual challenges.
More importantly, because the affected files belonged to job-seekers, the team showed solidarity with less tech-savvy users by actively seeking a solution that improved their experience, rather than dismissing the problem as external.
As far as we are concerned, when technology serves people, especially in sensitive areas like employment, it’s not just about fixing errors: it’s about understanding, empathy and striving to make a difference. And this is the Edge team’s trademark.