Comparing a PDF to a golden master

·

3 min read

To create a report, I had to combine the contents of several PDFs into one. Thanks to iTextSharp, it's really easy. Then I had the problem, how do I test this?

When I started this feature, I got an example how the PDF should look like. I knew how to get to the information to recreate the example PDF. So I had a golden master to test against and I could recreate it. Now I had to be sure that the test compared the generated PDF correctly to the original.

The first and most easy solution was to load both files with System.IO.File.ReadAllBytes and check if the byte arrays are the same. They never were. On several test runs, each time a different byte caused the comparison to fail.

After a bit of searching (and no small amount of swearing), I discovered that when you load a PDF with ReadAllBytes, you also load the metadata. Since two files can never be created at the same time, this option was no good.

After a bit of thinking (and a few more swears), I remembered that I can load a page via iTextSharp. So I wrote a bit of code that loads all pages from a PDF into a byte array and I compared those arrays. Lo and behold, the test passed.

Here is the code to load the content of a PDF:

private byte[] GetPdfContent(string filename)
{
  if (!File.Exists(filename)) return new byte[0];
  var content = new List<byte>();
  using (var reader = new PdfReader(filename))
  {
    for (var page = 1; page <= reader.NumberOfPages; page++)
    {
      var pageContent = reader.GetPageContent(page);
      content.AddRange(pageContent);
    }
  }
  return content.ToArray();
}

Small note, the pages start from position 1, not 0.

The test itself then becomes very easy:

public void VerifyGeneratedPdf()
{
  var goldenMasterContent = GetPdfContent("goldenMasterLoction");
  var generatedPdfContent = GetPdfContent("generatedPdfLocation");
  Assert.AreEqual(goldenMasterContent, generatedPdfContent);
}

It would also be possible to get the byte array from the code that generates the PDF. In this case, I want to store the PDF for later use, so I load the stored PDF.

So if you want to check if a PDF matches another PDF, just load every page one by one and compare those byte arrays.

EDIT:

Unfortunately, this broke after I changed some code to make the generation of the PDF more efficient. The goldenMasterContent contains approximately 4000 bytes and after optimizing the PDF generation, the generatedPdfContent contains only about 300 bytes. So there is a huge mismatch, yet the two PDFs contain the same data and look the same. Unfortunately they are now very different and I was back to square one.

To solve this little conundrum, I decided to do a less accurate check and load the text from the PDFs and compare that. The new GetPdfContent method now looks like this:

private static string GetPdfContent(string filename)
{
  if (!File.Exists(filename)) return string.Empty;
  var pdf = new StringBuilder();
  using (var reader = new PdfReader(filename))
  {
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    for (var page = 1; page <= reader.NumberOfPages; page++)
    {
      var currentText = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
      currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default,
                             Encoding.UTF8,
                             Encoding.Default.GetBytes(currentText)));
      pdf.Append(currentText);
    }
  }
  return pdf.ToString();
}

Now it works more consistently, but the check is less accurate. I'm not happy with it because it doesn't check layout, but this is what I'll use until I find a better solution.