# Comparing a PDF to a golden master

To create a report, I had to combine the contents of several PDFs into one. Thanks to [iTextSharp](https://www.nuget.org/packages/iTextSharp/), that part is really easy. Then came the problem: how do I test this?

When I started on this feature, I was given an example of what the PDF should look like, and I knew where to get the information needed to recreate it. So I had a golden master to test against, and I could recreate it. Now I had to make sure the test compared the generated PDF correctly to the original.

The first and easiest solution was to load both files with `System.IO.File.ReadAllBytes` and check whether the `byte` arrays are the same. They never were. On each test run, a different byte caused the comparison to fail. After a bit of searching (and no small amount of swearing), I discovered that when you load a PDF with `ReadAllBytes`, you also load the metadata. That metadata includes things like the creation time, and since two files are never created at exactly the same time, this option was no good.

After a bit of thinking (and a few more swears), I remembered that I can load a single page via iTextSharp. So I wrote a bit of code that loads all pages from a PDF into a `byte` array, and I compared those arrays. Lo and behold, the test passed.

Here is the code to load the content of a PDF:

```
// Requires: using System.Collections.Generic; using System.IO; using iTextSharp.text.pdf;
private byte[] GetPdfContent(string filename)
{
    if (!File.Exists(filename))
        return new byte[0];

    var content = new List<byte>();

    using (var reader = new PdfReader(filename))
    {
        // Page numbers are 1-based.
        for (var page = 1; page <= reader.NumberOfPages; page++)
        {
            // GetPageContent returns the page's content stream as raw bytes.
            var pageContent = reader.GetPageContent(page);
            content.AddRange(pageContent);
        }
    }

    return content.ToArray();
}
```

> Small note: page numbers start at 1, not 0.

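
A byte comparison that fails only says that something differs, not where. As a debugging aid, here is a minimal sketch of a helper that reports the first differing offset. `ByteDiff` and `FirstMismatch` are hypothetical names chosen for illustration; they are not part of iTextSharp or of any test framework.

```csharp
using System;

public static class ByteDiff
{
    // Hypothetical helper: returns the index of the first differing byte,
    // or -1 when both arrays have the same length and content.
    public static int FirstMismatch(byte[] expected, byte[] actual)
    {
        var shared = Math.Min(expected.Length, actual.Length);
        for (var i = 0; i < shared; i++)
        {
            if (expected[i] != actual[i])
                return i;
        }

        // Same prefix: equal if the lengths match, otherwise the shorter array ends here.
        return expected.Length == actual.Length ? -1 : shared;
    }
}
```

Printing that offset in the failure message makes it easier to tell whether the mismatch always sits in the same region of the file (which would point at metadata) or moves around.
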
The test itself then becomes very easy:

```
public void VerifyGeneratedPdf()
{
    var goldenMasterContent = GetPdfContent("goldenMasterLocation");
    var generatedPdfContent = GetPdfContent("generatedPdfLocation");

    // NUnit's Assert.AreEqual compares arrays element by element;
    // with MSTest, use CollectionAssert.AreEqual for arrays instead.
    Assert.AreEqual(goldenMasterContent, generatedPdfContent);
}
```

It would also be possible to get the `byte` array from the code that generates the PDF. In this case, I want to store the PDF for later use, so I load the stored PDF.

So if you want to check whether a PDF matches another PDF, just load every page one by one and compare those `byte` arrays.

**EDIT**: Unfortunately, this broke after I changed some code to make the generation of the PDF more efficient. The `goldenMasterContent` contains approximately 4000 bytes, and after optimizing the PDF generation, the `generatedPdfContent` contains only about 300 bytes. So the bytes no longer match at all, even though the two PDFs contain the same data and look the same. I was back to square one.

To solve this little conundrum, I decided to do a less accurate check: load the text from the PDFs and compare that. The new `GetPdfContent` method now looks like this:

```
// Requires: using System.IO; using System.Text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
private static string GetPdfContent(string filename)
{
    if (!File.Exists(filename))
        return string.Empty;

    var pdf = new StringBuilder();

    using (var reader = new PdfReader(filename))
    {
        for (var page = 1; page <= reader.NumberOfPages; page++)
        {
            // Use a fresh strategy per page: a reused SimpleTextExtractionStrategy
            // keeps the text of earlier pages and would repeat it in later results.
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            var currentText = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
            currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
            pdf.Append(currentText);
        }
    }

    return pdf.ToString();
}
```

Now it works more consistently, but the check is less accurate. I'm not happy with it because it doesn't check layout, but this is what I'll use until I find a better solution.
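
One possible refinement, which the test above does not do, is to keep the extracted text per page and report the first page that differs. The following is only a sketch under that assumption: `PageDiff` and `FirstDifferingPage` are hypothetical names, and the page texts are assumed to have been extracted already (for example, one `GetTextFromPage` result per page).

```csharp
using System;
using System.Collections.Generic;

public static class PageDiff
{
    // Hypothetical helper: returns the 1-based number of the first page whose
    // text differs, or 0 when both documents have the same pages and text.
    public static int FirstDifferingPage(IReadOnlyList<string> expected, IReadOnlyList<string> actual)
    {
        var shared = Math.Min(expected.Count, actual.Count);
        for (var i = 0; i < shared; i++)
        {
            if (!string.Equals(expected[i], actual[i], StringComparison.Ordinal))
                return i + 1; // pages are numbered from 1, as in PdfReader
        }

        // All shared pages match: a different page count counts as a mismatch on the first extra page.
        return expected.Count == actual.Count ? 0 : shared + 1;
    }
}
```

This still says nothing about layout, but a failing test can at least name the page to look at.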