Down the Rabbit Hole of Using SelectPdf on Microsoft Azure
Initially I thought it would be trivial to use SelectPdf to convert an HTML page into a PDF document as a service on Microsoft Azure, since I was easily able to do this as a .NET console app.
Everything was going well to begin with. I'd set up everything in the Resource Group: the Logic Apps that write to and read from the Service Bus, the Storage Account for the template files and generated documents, and the Function App that's supposed to handle the document generation itself.
The first problem I ran into was the following console message:
2023-11-07T18:06:45.299 [Error] Executed 'Functions.ConvertDataToPDF' (Failed, Duration=5911ms) Could not get conversion result header. Data transfer error. Data transmission error 109
I suspected the error had something to do with the object being returned from the PdfDocument class in the DLL, but I can only guess that the Function App was unable to read the first 16 bytes of that object (the PDF file header). It might be that no bytes were being returned at all, for whatever reason.
At the time of writing, there is no documentation for this error. I did find a couple of Stack Overflow posts suggesting it has something to do with the resource limitations of Function Apps on the standard Azure subscription. I still needed to rule out a bug in the version of SelectPdf I was using, and a mistake in my implementation of it.
The first thing I checked was that all the dependencies for SelectPdf were present in the Function App's filesystem, and that they were referenced correctly in the code (we probably only need the main assembly reference):
#r "..\Select.Pdf.dll"
#r "..\Select.Tools.dep"
#r "..\Select.Html.dep"
When that didn't work, I downloaded a NuGet package for a recent optimised version and used 7-Zip to extract the files before copying them over to the Function App's storage. I also tried the standard v23.1.0 assembly files. That didn't work either.
After several more hours' debugging, I pretty much ruled out everything on my end:
Adding a load of custom log messages to narrow down the problem.
Making sure I had the right version of SelectPdf for Azure deployments.
Adding the 'GlobalProperties.EnableRestrictedRenderingEngine=true' string to the Function App's configuration. I don't think that actually does anything.
Swapping the HTML string variable with a URL. This resulted in the Function App generating a corrupted PDF instead of the previous error.
Setting a rendering timeout window, using MinPageLoadTime and MaxPageLoadTime.
Removing all references to external images, CSS and JavaScript files in the HTML.
Replacing the original HTML file with a very basic one.
Replicating the Function App and running it locally, which produced a watermarked PDF document.
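For the custom log messages, one check that can help is to log whether the bytes coming back look like a PDF at all. The following is only a sketch: PdfSanityCheck, LooksLikePdf and DescribeHeader are hypothetical names, not part of SelectPdf, and it assumes the output is already available as a byte array. It relies on the fact that a PDF file begins with the ASCII signature %PDF-.

```csharp
using System;
using System.Text;

// Hypothetical diagnostic helpers; not part of SelectPdf or the Azure SDK.
public static class PdfSanityCheck
{
    // Returns true when the buffer starts with the ASCII signature "%PDF-".
    public static bool LooksLikePdf(byte[] data)
    {
        byte[] signature = Encoding.ASCII.GetBytes("%PDF-");
        if (data == null || data.Length < signature.Length)
        {
            return false;
        }
        for (int i = 0; i < signature.Length; i++)
        {
            if (data[i] != signature[i])
            {
                return false;
            }
        }
        return true;
    }

    // Formats the leading bytes as hex so they can be written to the function log.
    public static string DescribeHeader(byte[] data, int count = 16)
    {
        if (data == null || data.Length == 0)
        {
            return "(no bytes)";
        }
        int n = Math.Min(count, data.Length);
        return BitConverter.ToString(data, 0, n);
    }
}
```

Logging both values together shows whether nothing came back, something non-PDF came back, or a PDF header arrived with a damaged body.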
Option 1: Deploy a .NET console app to Azure
This is the more complex (and consequently less ideal) workaround, which involves moving the document generator part of the code from the Function App to an Azure service that hosts a .NET console application (a WebJob). I already knew that SelectPdf works in a .NET console app.
Conceptually it appears easy, because all the .NET console app would need to do is read a message from the Service Bus, extract the HTML markup string, convert that string using SelectPdf, then send the output to another Service Bus queue.
Unlike a Function App, a WebJob isn't started by an HTTP trigger. It needs to run continually and check a Service Bus queue for messages. Also, it's not executed once per message, unlike a Function App, so it needs to be coded to handle events asynchronously.
The other problems are that two additional Service Bus queues are required for the WebJob itself, the Logic App would need to selectively read messages for different document types, and a substantial part of the design would need changing.
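The continuous polling described above can be sketched roughly as follows. This is not the actual WebJob code: QueueWorker and RunAsync are hypothetical names, and the three delegates are stand-ins for the real Service Bus receive call, the SelectPdf conversion and the Service Bus send call, so that the shape of the loop can be shown without any Azure dependencies. It also processes messages one at a time, which a real implementation might not want.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical worker loop; the delegates are placeholders for Azure and SelectPdf calls.
public static class QueueWorker
{
    // receive: returns the next HTML message, or null when the queue is empty.
    // convert: turns the HTML markup into PDF bytes.
    // send:    posts the generated bytes to the outgoing queue.
    // Returns the number of messages processed before cancellation.
    public static async Task<int> RunAsync(
        Func<CancellationToken, Task<string>> receive,
        Func<string, byte[]> convert,
        Func<byte[], Task> send,
        TimeSpan idleDelay,
        CancellationToken token)
    {
        int processed = 0;
        while (!token.IsCancellationRequested)
        {
            string html = await receive(token);
            if (html == null)
            {
                // Nothing waiting: pause before polling again, and stop if cancelled.
                try
                {
                    await Task.Delay(idleDelay, token);
                }
                catch (OperationCanceledException)
                {
                    break;
                }
                continue;
            }

            byte[] pdf = convert(html);
            await send(pdf);
            processed++;
        }
        return processed;
    }
}
```

The point of the sketch is the difference from a Function App: nothing invokes this code per message, so the loop itself has to keep running, wait when the queue is empty, and shut down cleanly when asked.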
Security
The WebJob should be able to use a SAS key and a connection string to access the Service Bus. The connection strings and parameters for this can be found in the Shared access policies view in the Azure Portal. The SAS key method is less secure, so we want to use Managed Identity if possible. The permissions in the Service Bus's Access control view might need to be modified for this.
Option 2: Change the Azure subscription
Upgrading the Azure subscription for the Function App solved the first problem, but the Logic App kept writing a blank PDF document to the Storage Account. It took me a while to figure out what the problem was. It appears the Function App returned an object with the first 16 bytes and the data structure of a PDF document, but the encoding of the data being injected into that structure was wrong.
To fix this, it's important to be using the correct DLLs and assembly references, as there are multiple versions of SelectPdf that do slightly different things:
#r "..\Select.HtmlToPdf.dll"
using System.IO;
using System.Text;
using System.Drawing;
using SelectPdf;
There are only three essential lines of code in the PDF generator method:
PdfDocument doc = new PdfDocument();
doc = converter.ConvertHtmlString(strHtml);
return doc;
In the main or calling method, the following is important for getting a Function App to return the output of SelectPdf as a usable object.
I recommend instantiating HtmlToPdf (not SelectPdf) as follows:
SelectPdf.HtmlToPdf converter = new SelectPdf.HtmlToPdf();
As with most other implementations, my code reads the input from the Logic App as a string and passes it to the document generator method:
string requestBody = await new StreamReader(req.Body).ReadToEndAsync();
var outputDocument = GenerateExportDocument(pageMarkup.ToString(), converter, log);
But the Function App can't return the SelectPdf document object directly. It must be returned as file data, and one way to produce that is to save the document to a MemoryStream object:
MemoryStream pdfStream = new MemoryStream();
outputDocument.Save(pdfStream);
pdfStream.Position = 0;
The compiler didn't like me reading back out from the memory stream and returning that, so I converted it into a byte array first:
byte[] fileBytes;
fileBytes = pdfStream.ToArray();
The byte array is returned to the Logic App:
return new OkObjectResult(new FileContentResult(fileBytes, "application/pdf"))
{
StatusCode = (int)HttpStatusCode.OK
};
The next couple of problems: the Function App returns JSON, with the file content held in the fileContents property as a Base64 string, so we must use a custom expression in the Logic App to extract that value and convert it back to binary.
Use this in the Logic App step:
"body": "@base64ToBinary(body('ConvertDataToPDF')?['fileContents'])",
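To make it clearer what that expression is undoing, here is a rough C# equivalent of the same extraction. It is a sketch under assumptions: FileContentsDecoder and Decode are hypothetical names, the JSON is assumed to have a top-level fileContents property holding Base64 text, and it uses System.Text.Json rather than anything specific to Logic Apps. As a debugging aid, a Base64-encoded PDF starts with the characters JVBERi0, because that is how the %PDF- signature encodes.

```csharp
using System;
using System.Text;
using System.Text.Json;

// Hypothetical helper mirroring base64ToBinary(body(...)?['fileContents']).
public static class FileContentsDecoder
{
    // Pulls the Base64 text out of the "fileContents" property and decodes it to raw bytes.
    public static byte[] Decode(string responseJson)
    {
        using JsonDocument doc = JsonDocument.Parse(responseJson);
        string base64 = doc.RootElement.GetProperty("fileContents").GetString();
        return Convert.FromBase64String(base64);
    }
}
```

If the decoded bytes do not begin with %PDF-, the problem is upstream of the Logic App expression rather than in it.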