Garreth Hanley:
This is Excel Tips, a podcast for accounting and finance professionals brought to you by CPA Australia.
Neale Blackwood CPA:
Welcome to the Excel Tips podcast. My name is Neale Blackwood. Would you like to extract data from PDF files, things like statements, delivery dockets, invoices? PDF files hold a lot of information, but in the past, they've been difficult to get into, if you like, and hard to extract the data from.
There are third-party products out there that can bring data into Excel, but they tend to require a manual intervention to get that to happen. You can use the built-in Power Query feature in Excel to import PDFs and to automate the process. This could be really useful for things like statement reconciliations. Maybe you need to capture the list of products that are on a particular invoice and manipulate them in some way, and being able to automate the process of extracting the information from the PDF can be really useful.
Now one word of warning here, the PDFs that I'm talking about here are the system-generated ones. Now you can use a photocopier and create a PDF of a document. That doesn't tend to work with this situation. So you do need system-generated PDFs, which is typically what you'll get via email from suppliers, et cetera.
Now a word of warning with PDFs as well. As a data source, they're less reliable than, let's say, CSV files. In the data world, CSV files are the simplest files and they're the easiest to import to Power Query. PDFs, on the other hand, are a lot more difficult, mainly because there is a lot more flexibility in how a PDF is laid out. So the structure of the PDF usually determines how easy it is to import the data from the PDF. So this is really useful for those PDFs that might have high-volume transactions. So you might have hundreds of rows of data on a PDF, maybe as a statement or as an invoice.
So let's take a statement as an example. You might want to reconcile a statement against your system accounts. You could import all of the entries from the statement via Power Query into Excel, and then, at the same time, maybe bring in the transactions from your own system into Excel and then compare them. In some cases, you can partially automate the reconciliation process.
You might be able to use the old 80-20 rule, and hopefully you can reconcile 80% of the transactions. Then you might have another 20% of the transactions that you might require some human intervention.
Okay, so let's go through some of the steps that you need to take when you're importing a PDF. All of the Power Query features are accessed via the Data ribbon. They're all on the left-hand side of the Data ribbon. Basically you click the Get Data dropdown. There is a From File dropdown and there is also a From PDF dropdown. So basically you choose that, then you navigate to wherever the PDF file is.
Now in the example video and in the article that I wrote, I've actually used an ATO PDF. So this is a publicly available PDF. It's also a little bit complex because it's got four separate tables on each page, and there's 11 pages with those four tables on them. Once you get the PDF into Power Query, the first thing you've got to think about, it's called the Kind column, and it's either going to have a table or a page in that column.
Now, in general, PDFs are split up between tables and pages, but they are actually the same thing. If a table can be recognised in the PDF, then Power Query will recognise it. It also lists the pages separately.
So it's actually a duplication. So you'll see the tables listed first, and then you'll see the pages listed after. Now they are the same thing.
So generally you start off trying to import the tables, because the tables should be laid out a little bit better than the pages. Sometimes, depending on the PDF, it might be easier to import the pages. It just depends on the PDF.
The companion video only looks at the table importation for this ATO PDF, which is fairly well laid out. So there is only one glitch in the companion video that I'll explain.
So be careful you don't use tables and pages because that does duplicate the data. So it's one or the other. So, as I mentioned, typically start with a table. If that works, great. If it doesn't, then switch to try with pages.
That's something that you've got to get used to with PDFs, so you might need to do a bit of trial and error to get the right solution. So PDFs, as I mentioned, are a little bit more complex than CSV files. And so, there is this trial and error required.
The ATO document had four tables across the page and 11 pages. Power Query is going to combine all of those tables into a single table that's going to go all the way down the page. It's going to have over 2,900 rows in the final table. So it's a table of monthly salaries and how much tax you should deduct from that using the tax-free threshold. Then the other column is the non-tax-free threshold. So they're the three columns in the table that we want to import.
So one thing to pick up in this structure is that the headings on each table are repeated throughout the data. And so, in the companion video, I show you how you can remove those headers just using a simple filter in the search box. On the filter, you can just type a letter and that will then limit all of the entries in the filter. In our case, the heading started with M. So I just typed an M and that enabled me to remove all of those extra header rows from the table.
That's really all I had to do. After that, I changed the data type to be a currency, and then that was it. So I downloaded the data, and then I saw that there were some errors. That's because the last page of the PDF had a different structure to the first 11 pages, and that's not unusual for PDFs. But in our case, we needed to automate the process of removing those excess rows, the ones we don't need, because those tables were related to something else. They weren't related to what we were importing.
So when errors are picked up in a Power Query, it'll give you a little hyperlink warning on the right-hand side in the queries and connections task pane. You can click on that hyperlink and that will open a query which actually shows you all of the errors.
In this case, we could see that all of the row numbers for the errors were at the end of the table, which was where page 12 is going to be. And so, basically we just went back into the query and we needed to get to the end of the query.
Now, unfortunately, Power Query is really slow for scrolling. So to scroll down 2,900 rows would take a while. But there is a feature in the Transform tab in Power Query. It's called Reverse Rows. And so, that's a really quick and easy way to get the bottom rows up the top of the screen. And so, that's what I used.
When we looked at the data there, it was pretty obvious that the table that we'd brought in right at the end was a different type of table, and there was a rule that we could use to eliminate those rows.
Also, there were some errors in those rows. There were seven of them, because that's how many were listed. They related to the fact that in that column, when we said that it was a data type for a currency, there must have been some text or something in that column. And so, all of those text items would've generated an error.
You can right-click a column and remove errors. That's pretty simple. But to get rid of those extra rows that we didn't need, column three turned out to have a higher value than column one.
Now that wasn't the case for our tax table, because column one was the salary for the month. So we could create an extra column that shows false on the rows that we wanted to get rid of and true on the rows that we want to keep. So once you put that column in, you just filter out the falses, so all of the rows we don't need are then gone, and then it was pretty much ready to go.
So you can delete that column because that was like a temporary column. Then once that's done, then you just use that reverse rows feature again. So we put the table back in the correct sequence, and that was it. So that pretty much completed the importation of that PDF, which, as I mentioned, ended up being over 2,900 rows of data.
Now something to think about when you are importing PDFs, when you import the file name, that is captured in the query. And so, if that file name changes, you're going to need to adapt the query. It's on the source steps. So the very first step in the query is called source, and that's the one that captures the whole path and the file name.
If you are importing something like a statement, you might need to use a generic name like statement and just save the file, save the PDF, as statement to the same folder. That way when you refresh the query, you'll actually bring in the latest statement.
What we just did with the ATO file was a one-off. So we only wanted to bring the data in once because the ATO file doesn't get updated very often. It only gets updated when the tax rates change.
So if you're doing a PDF that you're going to do every month, the idea is just to use a generic name for that PDF file, and each month just save on top of the one you had. Then when you click refresh in the table, it'll bring in the latest data for that particular statement.
Just a heads up, you'll probably need to do a separate Power Query for each statement. But you might get lucky because coming out of different systems, the PDFs do tend to be the same. So depending on how many suppliers you might work with for statements, for example, then you might get lucky and some of the statements might be in the same structure. So you can reuse a Power Query.
So as I mentioned, PDFs, they hold a lot of data, but in the past, it's been difficult to get to that data. With Power Query in Excel now, you can automate some tasks in terms of importing the PDF data, and possibly even creating some automation in terms of your reconciliations and maybe some of your other processes. Hope you found that useful. Thanks for listening.
Garreth Hanley:
Thanks for listening to Excel Tips, a CPA Australia podcast. If you've enjoyed this episode, help others discover the show by leaving us a review or sharing this episode with colleagues, clients, or anyone else looking for the best Excel tips and tricks. To find out more about our other podcasts and CPA Australia, check out the show notes for this episode. And we hope you can join us next time for another episode of Excel Tips.