Many invoice automation solutions allow a wide range of file formats to be imported. Restricting document formats for invoice automation is not new; however, solutions are becoming more flexible in allowing additional formats. This may not be the correct approach, for several reasons.
Firstly, the file format affects the ability to extract the data, or to OCR the document to read its characters. Secondly, you must consider legal admissibility: whether you should be receiving documents that can be edited at all. Editable formats have technical implications too, but the primary concern should be admissibility.
With more suppliers now able to send PDF invoices directly to your mailbox, a growing share of the PDFs being processed have an embedded text layer. Most OCR solutions will use this text layer for indexing purposes. However, don’t confuse OCR with the ability to auto-index, which means understanding the document’s context, its layout and the information being validated. With an embedded text layer all the characters are known, so it becomes easier for the indexing process to use this information logically.
Common image file formats, such as image-only PDF and TIFF files, are usually the result of a document being scanned. Image quality is critical to the OCR process whenever a document is scanned, and the process benefits from a scanner configured for black and white at 300 dpi. Anything above this is still usable; however, the file size grows quickly and the extra resolution is unnecessary. Anything below this will be of too poor quality for character recognition to have a high success rate.
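To illustrate how quickly file size grows with resolution, here is a minimal sketch that estimates the raw, uncompressed size of one scanned A4 page. The function name and the A4 assumption are illustrative only, and real scans are normally compressed, so treat the numbers as an upper bound rather than what your scanner will produce.

```python
# Approximate A4 page size in inches (210 mm x 297 mm). Assumed page size.
A4_WIDTH_IN = 8.27
A4_HEIGHT_IN = 11.69


def uncompressed_scan_bytes(dpi: int, bits_per_pixel: int = 1) -> int:
    """Estimate the raw (uncompressed) size in bytes of one scanned A4 page.

    bits_per_pixel=1 models black and white; 24 models full colour.
    """
    width_px = round(A4_WIDTH_IN * dpi)
    height_px = round(A4_HEIGHT_IN * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    for dpi in (150, 300, 600):
        mono = uncompressed_scan_bytes(dpi, bits_per_pixel=1)
        colour = uncompressed_scan_bytes(dpi, bits_per_pixel=24)
        print(f"{dpi} dpi: mono {mono / 1e6:.1f} MB, colour {colour / 1e6:.1f} MB")
```

Doubling the resolution roughly quadruples the pixel count, which is why going well above 300 dpi costs storage without helping recognition.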
Word and Excel files both have text layers; however, they can cause technical issues. For example, if a file is created with a US date format and then opened on a server configured for UK dates, the day and month can be swapped when the file is processed.
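The date problem can be reproduced in a few lines. This is a sketch only: the sample value and the helper names are hypothetical, and it models the two regional settings with explicit format strings rather than a real server locale.

```python
from datetime import date, datetime
from typing import Optional

US_FORMAT = "%m/%d/%Y"  # month first
UK_FORMAT = "%d/%m/%Y"  # day first


def try_parse(raw: str, fmt: str) -> Optional[date]:
    """Parse raw with the given format, or return None if it does not fit."""
    try:
        return datetime.strptime(raw, fmt).date()
    except ValueError:
        return None


def is_ambiguous(raw: str) -> bool:
    """True when both readings are valid dates but disagree with each other."""
    us = try_parse(raw, US_FORMAT)
    uk = try_parse(raw, UK_FORMAT)
    return us is not None and uk is not None and us != uk


if __name__ == "__main__":
    raw = "03/04/2024"  # hypothetical invoice date
    print(try_parse(raw, US_FORMAT))  # read as 4 March
    print(try_parse(raw, UK_FORMAT))  # read as 3 April
    print(is_ambiguous(raw))
```

Any date whose day is 12 or lower can be read both ways, so a check like `is_ambiguous` could be used to send such values for manual review instead of trusting either interpretation.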
Critically, the main implication of Word and Excel files is their editability, which matters for admissibility and governance.
For PNG and JPG image file formats, it comes down solely to the quality of the image they provide. They are normally far smaller in file size, and their dots per inch are typically lower than TIFF. Additionally, they are commonly used in email signatures, so once you allow that file type, signature images will be imported into your invoice processing solution. This will lead to a higher number of documents needing to be manually reviewed and rejected.
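One way to keep signature images and editable files out of the queue is an allow-list applied to attachments at intake. The sketch below is an assumption about how such a filter might look, not a feature of any particular product; the extension set and the function name are hypothetical and should reflect your own policy.

```python
from pathlib import PurePath

# Assumed policy: accept only formats that are not trivially editable
# and are unlikely to be email-signature images.
ALLOWED_EXTENSIONS = {".pdf", ".tif", ".tiff", ".xml"}


def accept_attachment(filename: str) -> bool:
    """Return True if the attachment's extension is on the allow-list."""
    return PurePath(filename).suffix.lower() in ALLOWED_EXTENSIONS


if __name__ == "__main__":
    for name in ("INV-001.PDF", "logo.png", "invoice.docx", "scan.tiff"):
        print(name, accept_attachment(name))
```

An extension check is only a first filter, since a file can be renamed; a stricter intake step would also inspect the file content.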
The Extensible Markup Language (XML) is a structured data format. It doesn’t require OCR, as all the characters are known, and it is typically used as part of an EDI process.
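Because the values are already structured, reading an XML invoice is a matter of parsing rather than recognition. The sketch below uses a made-up, simplified invoice layout; the element names are hypothetical and real EDI schemas are considerably richer.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical, simplified invoice document for illustration only.
SAMPLE_XML = """<Invoice>
  <InvoiceNumber>INV-001</InvoiceNumber>
  <IssueDate>2024-04-03</IssueDate>
  <Supplier>Example Supplier Ltd</Supplier>
  <Total currency="GBP">120.00</Total>
</Invoice>"""


def read_invoice(xml_text: str) -> dict:
    """Extract the header fields from the simplified invoice layout above."""
    root = ET.fromstring(xml_text)
    total = root.find("Total")
    if total is None or total.text is None:
        raise ValueError("invoice has no Total element")
    return {
        "number": root.findtext("InvoiceNumber"),
        "issue_date": root.findtext("IssueDate"),
        "supplier": root.findtext("Supplier"),
        "total": Decimal(total.text.strip()),
        "currency": total.get("currency"),
    }


if __name__ == "__main__":
    print(read_invoice(SAMPLE_XML))
```

Every field arrives as exact text, so there is no character-recognition confidence to manage, only validation of the values.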
When undertaking Digital Transformation, one of the fundamental requirements of compliance and audit is that the document captured must not be edited. Additionally, you must be able to prove it has not been tampered with. With MS Word and Excel documents, this is not possible.
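For the captured file itself, a common way to show that nothing has changed since capture is to record a cryptographic hash at the point of import and compare it later. This is a minimal sketch with hypothetical function names; it assumes the recorded digest is stored somewhere that cannot itself be altered. It only covers changes made after capture and says nothing about edits made before the file arrived, which is the concern with editable formats.

```python
import hashlib
import hmac


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of the document bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, recorded_digest: str) -> bool:
    """Compare the current digest with the one recorded at capture."""
    return hmac.compare_digest(fingerprint(data), recorded_digest)


if __name__ == "__main__":
    original = b"%PDF-1.7 example invoice bytes"  # placeholder content
    recorded = fingerprint(original)              # stored at capture time
    print(is_unchanged(original, recorded))
    print(is_unchanged(original + b" edited", recorded))
```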
From a tax reporting perspective, the danger of an editable document is that, once edited, the information on the invoice may no longer reflect what the supplier is submitting to the tax authority. The document could be edited with good intentions, but when it comes to managing invoice documents, they should be final.
There is also a concern that someone could edit the document internally. For example, if someone noticed that an invoice had arrived as a Word document, they could replace the supplier’s bank details with their own. You should have certainty that the document you received is the document you are paying.