Extracting Data From PDF Annual Reports

Here's a tutorial on how you can use Python to scrape annual report data from publicly published PDFs. I'm using DBS's 2019 Annual Report as the reference.

It is important to note that this process cannot be applied as-is to every annual report, since each company may present its reports differently: on different pages, in landscape or in portrait format. I would love to have a cookie-cutter method where I simply plug each PDF into the code and walk off for a coffee, coming back to all the data nicely extracted, but the irregularity rules that out.

In the code I mainly use pandas to manipulate the table data and camelot-py to extract the tables from the PDFs. Camelot requires a little set-up before use, so do read the installation section of its documentation if you don't have it installed yet.
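As a rough sketch of the extraction step, assuming camelot-py is installed: camelot.read_pdf takes the page numbers as a comma-separated string, and each table it returns exposes a pandas dataframe through its .df attribute. The file name, the page numbers and the helper names format_pages and extract_tables are placeholders made up for illustration, not the ones used for the DBS report.

```python
def format_pages(pages):
    """Turn a list of page numbers into the comma-separated string camelot expects."""
    return ",".join(str(p) for p in pages)


def extract_tables(pdf_path, pages, flavor="lattice"):
    """Return the tables found on the given pages as a list of pandas dataframes.

    flavor="lattice" suits tables drawn with ruling lines; flavor="stream"
    suits tables laid out with whitespace only. Which one works is something
    to try per report.
    """
    # Imported here so the page helper above works without camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=format_pages(pages), flavor=flavor)
    return [table.df for table in tables]


# Hypothetical usage (file name and page numbers are placeholders):
# frames = extract_tables("annual_report.pdf", [10, 11], flavor="stream")
# print(frames[0].head())
```

It is worth printing each dataframe before going further, since how many tables come back per page depends on the layout of the report.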

In our scenario, the company chose to present two reports on the same page in landscape format, so Camelot picked them out as one combined table, which needed splitting.
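One way to do the split, sketched under the assumption that the two reports sit next to each other so that the combined table can be cut at a column position. The helper name split_combined, the cut position and the sample values are all made up for illustration; the real position has to be read off the extracted dataframe.

```python
import pandas as pd


def split_combined(combined, split_at):
    """Cut a combined table into two at column position `split_at`."""
    left = combined.iloc[:, :split_at].copy()
    right = combined.iloc[:, split_at:].copy()
    # Renumber the right-hand columns from 0, as camelot would have done.
    right.columns = range(right.shape[1])
    # The shorter report leaves rows of empty strings behind; drop them.
    left = left[(left != "").any(axis=1)].reset_index(drop=True)
    right = right[(right != "").any(axis=1)].reset_index(drop=True)
    return left, right


# Placeholder data in the shape camelot returns: string cells, integer columns.
combined = pd.DataFrame(
    [
        ["Item A", "10", "9", "Item X", "5", "4"],
        ["Item B", "20", "18", "", "", ""],
    ]
)
left, right = split_combined(combined, split_at=3)
```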

The next little road bump came with the balance sheet: the extracted tables did not come with the report headers. This was solved by simply recreating the headers and joining them to the dataframe.
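A minimal sketch of recreating the headers, assuming they are typed out by hand as a list and placed on top of the extracted rows. The helper name add_header_row, the header labels and the sample rows are placeholders for illustration, not the actual balance sheet headers.

```python
import pandas as pd


def add_header_row(table, header):
    """Return a copy of `table` with `header` added as its first row."""
    if len(header) != table.shape[1]:
        raise ValueError("header must have one label per column")
    header_df = pd.DataFrame([header], columns=table.columns)
    return pd.concat([header_df, table], ignore_index=True)


# Placeholder rows standing in for a table extracted without its headers.
body = pd.DataFrame([["Item A", "10", "9"], ["Item B", "20", "18"]])
with_header = add_header_row(body, ["Item", "2019", "2018"])
```

If you would rather have the labels as real column names than as a first row, assigning the list to body.columns does that instead.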

Following this, the remaining cash flow statement table posed no real difficulty. The two tables on the page belonged to one report and merely needed to be joined. With all the reports extracted and put into dataframes, we just need to save them to Excel.
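The join and the save could look roughly like this, assuming the two parts share the same columns and that each report goes to its own sheet. The helper names join_parts and save_reports, the sheet name and the sample values are made up for illustration. Writing .xlsx files with pandas also needs an Excel engine such as openpyxl to be installed.

```python
import pandas as pd


def join_parts(parts):
    """Stack the parts of one report on top of each other, in order."""
    return pd.concat(parts, ignore_index=True)


def save_reports(reports, path):
    """Write each dataframe in the `reports` dict to its own sheet of one workbook."""
    with pd.ExcelWriter(path) as writer:
        for sheet_name, df in reports.items():
            # Skip the row index and the integer column names; the tables
            # carry their own labels as data.
            df.to_excel(writer, sheet_name=sheet_name, index=False, header=False)


# Placeholder data standing in for the two tables found on the page.
top = pd.DataFrame([["Item A", "10", "9"], ["Item B", "20", "18"]])
bottom = pd.DataFrame([["Item C", "30", "27"]])
cash_flow = join_parts([top, bottom])

# Hypothetical usage (file and sheet names are placeholders):
# save_reports({"Cash Flow": cash_flow}, "annual_report_2019.xlsx")
```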
