Converting untrusted PDFs into trusted ones: The Qubes Way
Arguably one of the biggest challenges for desktop security is how to handle those overly complex PDFs, DOCs, and similar files, that are so often exchanged by people, or downloaded from the Web, and that often provide a way for the attacker to compromise the user's desktop system.
Today I would like to discuss a recent innovation we created for Qubes OS that allows to securely convert those pesky PDFs (as well as essentially any graphics files) into trusted PDFs. Here by a “trusted PDF” I mean a file that should be harmlessto the user's system, so, a non-malicious PDF.
Approaches to converting PDFs
The problem of converting a potentially untrusted PDF into a harmless one is certainly not a new one. Some tools have already be created for this task.
The typical approach is to parse the original PDF, look for “potentially dangerous things” there, and remove them. As simple as that! This is, of course, a typical AntiVirus approach to the problem. And, typically as it is for most AV approaches, it's completely useless against any more skilled and determined attacker (and these are the ones we fear the most, don't we?).
The typical approach is to parse the original PDF, look for “potentially dangerous things” there, and remove them. As simple as that! This is, of course, a typical AntiVirus approach to the problem. And, typically as it is for most AV approaches, it's completely useless against any more skilled and determined attacker (and these are the ones we fear the most, don't we?).
A somehow better approach is to parse the original PDF, disassemble it into pieces, and then reassemble them into a new PDF only using the “trusted” pieces – this, I think, could be called a whitelisting approach.
Anyway, the fundamental problem with the approaches mentioned above, is that all of them require parsingof the original PDF file. And parsing is where the “big bang” usually happens. Parsing is where our, normally pretty decent, code, comes in close, intimate contact with some unknown complex input data, which often leads to a successful abuse or exploitation.
Parsing PDFs safely
So, how to perform parsing safely? Of course, that's simple! Let's run the parser in an isolated container – in case of Qubes we already have an ideal such container: it's the Disposable VMs.
But, before we get too excited, let's think more about it – say we run the parser safely isolated in a Disposable VM, meaning it couldn't harm any of the rest of the system, except for the Disposable VM itself, which we however won't worry much about, because it is disposable... But then what?
We want our PDF back in our original VM, to actually use it, right? But we cannot just copy the result from the Disposable VM, because if it got compromised, as a result of parsing of the malicious PDF, then we would like get... a compromised converted PDF. So, this approach gives us nothing!
(Even though our “solution” incorporates all the obligatory buzzwords: “Disposable VMs” (“Micro Disposable VMs”?), “VMs isolated using hardware Intel vPRO technology” and, of course, the “hypervisor”! Sometime just the mere fact we use “hardware virtualization” buys us nothing... People seem to forget about this sometimes.)
So, the trick to make this approach meaningful is to introduce what I will call a “Simple Representation” of the input file. More on this, straightforward concept, below. The idea is that our parser (that runs in a Disposable VM) will be expectedto return the Simple Representation of the original PDF. Of course, it might very well go wild (as a result of exploitation by the PDF it parses), and don't obey our expectations, and instead return something totally different and potentially malicious. But that doesn't matter! The whole point of the Simple Representation is that it should be, well... simple to parse it safely and discard in case what we're getting doesn't look like the Simple Representation.
Ok, so what's the simplest possible representation of an arbitrary PDF file? Yes, it's the RGB format, which is essentially just a raw array of RGB values for each pixel. In fact, I'm not sure there could be anything simpler in the Known Universe to represent a PDF file...
Now this is all becoming simple: we would expect our parser to send us just two things: the dimensions (W x H) of the bitmap representation of each of the page of the PDF in question, and each of the PDF page itself converted into a raw RGB format. If the parser didn't obey, we would still interpret whatever stream of bytes we get as a RGB bitmap – in the worst case the PDF we create would look like un-tuned analog TV screen.
The diagram below summaries this idea:
Implementing this all on Qubes
Now I would like to show how easy it is to implement such PDF converter service using the Qubes advanced infrastructure that we call qrexec, and which is part of Qubes core for quite some time now.
First, let's choose the PDF and image conversion tools. The choice of PDF converter is not security critical, because it will run in an isolated Disposable VM. Here I decided to use pdftocairo converter, which is part of the poppler-utils package on Fedora. We will also use ImageMagick's “convert” command to convert the PNG files (produced by pdftocairo, one for each PDF page) to the raw RGB format. Incidentally ImageMagick supports RGB format natively. As mentioned above, in addition to sending the raw RGB file, we would also need to send the width and height of the pixmap – those can easily be obtained using ImageMagick's “identify” command. Again, all those programs discussed so far are not security critical – they might get exploited during the processing of the untrusted input PDF file, and we don't worry about that at all.
On the receiving side, however, we need to use a foolproof parser for the RGB format. Again, this is what we gain in this whole process – instead of requiring a foolproof-and-also-being-able-to-produce-non-malicious-PDFs parser, we only require a foolproof RGB parses, and that's quite a gain! The ImageMagick's convert comes to mind again here, and one might want to use it like this:
convert page.rgb page.pdf
Unfortunately this would be wrong, because the convert program would still try to detect the “real” format of the page.rgb file, and, if it looked more like, say, JPEG or PDF, it would parse it accordingly, compromising all our careful plan! What we really need is to tell our convert program to always treat the input as raw RGB file, instead of trying to be (too) smart and trying to guess the format by itself. This can be achieved by adding the “rgb:” prefix in front of the input argument, which provides explicit input format specification:
convert -size ${IMG_WIDTH}x${IMG_HEIGHT} -depth ${IMG_DEPTH} rgb:$RGB_FILE pdf:$PDF_FILE
Now also needed to add size and depth explicitly, because the raw RGB format doesn't convey such information (well, it has no header of any sort at all!). Of course we need to obtain the width and height from the parser, but we can validate such input rather easily. In addition we make sure that the received RGB file has exactly the size as indicated by width and height. With those precautions in place, there would have to be really a gappinghole in the ImageMagic's RGB parsing code for the attacker to exploit this. Perhaps instead of using the ImageMagick's convert I should have written a small script in python that would parse the received RGB file (and save it into a... RGB file, for later processing by ImageMagick), but I sincerely think this would be an overkill here.
Finally we can write the following two simple bash scripts, one for client: qpdf-convert-client, and the other one, qpdf-convert-server, for the server (which runs in a Disposable VM).
Additionally we also need to create a policy file in Dom0 in /etc/qubes_rpc/policy/to allow to use this service. The policy file content for this service should look like this:
... which is pretty self explanatory. When I do development I also add another line to the policy file like this:
... to allow me to run the server inside my 'devel-vm' VM, instead of running it in Disposable VM every time, which would be very inconvenient for development, as it would require me to update the Disposable VM template each time I wanted to test a new version of qpdf-convert-server.
$anyvm $dispvm allow
... which is pretty self explanatory. When I do development I also add another line to the policy file like this:
$anyvm devel-vm ask
... to allow me to run the server inside my 'devel-vm' VM, instead of running it in Disposable VM every time, which would be very inconvenient for development, as it would require me to update the Disposable VM template each time I wanted to test a new version of qpdf-convert-server.
The policy file should be placed in Dom0 in /etc/qubes_rpc/policy/qubes.PdfConvertfile – here the name of the file must be the same as the name of the service, as invoked via qrexec_client_vmcommand, discussed below.
And, one last thing, in the destination VM we must also create a file that will map the service name (so, the qubes.PdfConvert in our example) to the actual binary that should be called in the VM when the service is invoked. So, the file should be named: /etc/qubes_rpc/qubes.PdfConvert(again, this is now in a VM, not in Dom0, also note the lack of policy/ subdir), and it is another one-liner with the following content:
/usr/lib/qubes/qpdf-convert-server
The full source code of qpdf-converter can be seen and downloaded from this git repo.
We're ready now to test our qubes.PdfConvert service: in the requesting VM, i.e. the one from which we want to initiate the conversion process we do:
[user@work Downloads]$ /usr/lib/qubes/qrexec_client_vm '$dispvm' qubes.PdfConvert /usr/lib/qubes/qpdf-convert-client ITLquote.pdf-> Sending file to remote VM...-> Waiting for converted samples...-> Receving page 2 out of 2...-> Merging pages into a single PDF document...-> Converted PDF saved as: ITLquote.trusted.pdf-> Original file saved as .ITLquote.pdf
Again, for development process I would replace '$dispvm' with something like 'devel-vm'.
The qrexec_client_vmcommand, used above, is not actually intended to be used by user directly (that's why it's installed in /usr/lib/qubes instead of /usr/bin/), and so when one creates a Qubes qrexec service, it's customary to create also a small wrapper around qrexec, like this one, that makes using the service simple.
The presented converter saves the original file as .${original_pdf}making it a hidden file to help the user avoid accidental opening. The new, converted file gets .trusted.pdfsuffix appended to the base name of the original file. I discuss more issues regarding the human factor and avoiding accidental opening in one of the next paragraphs below. The converter can also be used to convert essentially any image file, such as JPEG, PNG, etc, into a PDF, using the same method.
As you can see creating client-server services in Qubes is very simple – in fact it took me just one afternoon to get the inital working version of the converter (with subsequent "polishing" over the next 2 days).
The qrexec infrastructure takes care about all the under-the-hood tasks, such as starting the necessary VMs, e.g. creating Disposable VM to handle the service request,establishing communication channels between VMs (which are ultimately implemented on top of Xen's shared memory), redirecting client and server's stdin and stdout to each other, so that writing services is very simple, even in shell, and, of course, obeying policies defined centrally in Dom0.
Most “inter-VM” features in Qubes, such as secure file copy between domains, opening files in Disposable VMs, time synchronization, appmenus synchronization, etc, are all implemented on top of qrexec. A notable exception is clipboard exchange, which is implemented as part of the GUI protocol, but still uses the same common qrexec code for policy processing (e.g. I use this policy to block clipboard and file exchanges between my work and personal domains).
Limitations, other Simple Representations
The obvious disadvantage of converting a PDF to an RGB representation is that one looses text search, as well as copy and edit capabilities (e.g. in case of PDF forms). So, converting Intel's IA32 Software Developer's Manual this way would certainly not be a good idea... But, hey, such large PDFs can always be opened in a Disposable VM – they would be fully functional then, only that you would need to wait a few seconds for the PDF window to pop up. Or, better yet, why not keep all such PDFs in a dedicated domain? E.g. I have a VM called “work-pub” where I keep tons of various, publicly available PDFs, such as the mentioned Intel's SDM, as well as various chipset docs, conferences papers and slides, and generally lots of stuff. The key point is that all in this VM is public material (and also all is related to my work), so that I don't really care if any of those PDFs compromises my work-pub domain. In the worst case, I will revert the VM from backup and download any missing PDFs again from the web. They are public after all.
But the PDF conversion described above comes extremely useful in case of all the various invoices, Purchase Orders, NDAs, contracts, and god-knows-what-else PDF documents, which I'm forced to deal with in my “work” domain (where my email client runs). Most of those are one pagers, or maximum a few pages long documents, so the fact that they got converted to a bitmap provides me with very little discomfort. At the same time I gain incredible freedom in opening all those documents natively in my work VM, without fearing that one of those invoices will comrpomise my work domain (which would be a rather sad thing for me, although the really sensitive stuff is still in some other domains ;)
An interesting question is, however, can we come up with another form of Simple Representation that would allow e.g. to preserve the text searching ability of the converted PDFs (and DOCs, PPTs)? Probably... yes. The choice of the Simple Representation should be thought of as of a trade-off between security and document's features preservation. I'm not an expert on PDF and DOC formats (and I'm not sure I want to be) but it seems plausible that one could disassemble PDF into simple pieces, select the really simple ones, send those pieces as a Simple Representation back to client, and have them reassembled back into a almost-fully-functional PDF. Here, again, the point is that the PDF parsing is done in isolated Disposable VM, while the reassembly in the trusted VM. Anyway, let me leave it as a exercise for the reader :)
Preventing user mistakes
Being able to right-click on a PDF file and have it converted into a trusted PDF is one thing. Having this mechanism adopted by users and actually making their daily computing safer, is another story.
Users will likely have hundreds of PDF spread over their home directories, and the real challenge is how to make sure that the user never accidentally opens the unconverted, untrusted PDF. We can think of several approaches to this problem:
- We modify the Thunderbird, Firefox, etc, e.g. by providing specific plugins, to always perform PDF conversion on each file that we got via email or downloaded from the Web. Additionally we convert all the already present PDFs in the user's home directory (file system?). And, additionally, we modify Qubes file copy operation to also always do automatic PDF conversion whenever one transfers files from other domains (if Qubes qrexec policy allows for such transfer in the first place, of course).
This approach would not be optimal, because some PDFs, as we discussed above, might not be well suited for conversion-through-bitmap process – they might be large PDFs where text search is crucial, some conference papers for review, where text copy is crucial, or some editable forms. That's why it seems better to take a slightly different approach:
- We modify mime handlers for PDF files (as well as any other files that our converter supports) and then upon every opening of the file (e.g. via mouse click in a file manager) our program gets to run and its job is to determine whether the file should be opened natively, converted to a trusted PDF, or perhaps opened in a Disposable VM. Of course, upon “first” opening we should probably ask the user about the decision, if this cannot be determined automatically. E.g. if we can reliably determine the file is already converted, we can safely open it without prompting the user, but if it's not, we should ask – perhaps the user would like to open it in a Disposable VM instead of converting, or perhaps the file should be considered trusted anyway, because it was created by the user herself.
This second approach seems like a way to go, and we will likely implement it sooner or later (probably sooner, but after the upcoming R2 Beta2). It should also be noted, that typically user would need such mechanism only in some domains – e.g. I really feel the need for such protection in my “work” domain, but not in any other. But that, of course, depends on how one partitions their digital life into security domains.
One important detail worth mentioning here, is that we should unconditionally disable “Thumbnail View” in whatever file manager we use (which itself is really a stupid feature – can people not read filenames anymore or something?).
The mechanism introduced today, in addition to the Disposable VMs mechanism introduced earlier, represents a trend in Qubes development of “stepping down” into AppVMs in order to also make the VMs themselves somehow more secure (in addition to the isolation between the VMs).
Originally Qubes aimed at containers isolation only. This included protecting the system TCB where techniques such as deprivileged networking stacks (and optionally also deprivileged USB stacks) have been deployed, as well as custom GUI virtualization, and generally somehow “hardened” Xen configuration. This also included protecting the VMs from each other, where techniques such as secure clipboard, secure file copy and generally secure qrexec infrastructure have been introduced, as well as trusted GUI subsystem with explicit domain decorations.
But now, Qubes is stepping down into the AppVMs in order to make the VMs themselves also less prone to compromise. We surely will be working on more such mechanisms in the future. We still are only at the beginning of the quest to create a Reasonably Secure Desktop OS!
PS. The presenetd converter will be part of the Qubes R2 Beta 2, that is expected to be released... in the comming days. Experienced users of Qubes R1 and R2 Beta 1 can install the converter immediately by building the rpms from the git repo.
PS. WTF is happening with the Blogger web interface? Seriously, I don't remember being so frustrated using any software in the recent years that I am right now, when editing this post (as well as the last several ones). It sometimes honours the line breaks, sometimes do not, sometimes inserts a couple of new lines, sometime removes them, sometime mysterious spaces appear at the end of lines, sometime those cannot be removed... It doesn't allow to paste pre-formatted code-listing (at least I couldn't figure out how to make it honour tabs). And yes, I'm using the "Compose mode", because when I try to switch to the HTML mode, not only I'm overwhelmed with tons of HTML markups, nobody knows what for, but also when I switch back to the Compose mode, my article tends to get even more fucked up! Really, a shame. I wish I could go away to some other blogging service, but I'm afraid that converting all my posts would be even a bigger PITA... Sigh.