I'm making a large effort to be as transparent about the inner workings as I possibly can, with the view that the progression can be tracked over time as a kind of scaffolding showing how I reached my conclusions. Without the fundamental backstory, independent viewers may get the wrong impression - like how folk who think the Giza pyramids were built by aliens fail to consider the (inferior) earlier pyramids scattered around, which demonstrate nicely how the builders learnt quite a lot from their early attempts.
Comparison with the Pyramids is of course hyperbolic. The system is actually reasonably straightforward from an information technology perspective. The bar list documents, which I use as the source for all the data, are readily available from the internet and most can be downloaded publicly. For example, at the time of writing, the SLV PDF is available here, and the URL does not really change.
The variety of file formats presents the first challenge for the software. For many years, the SLV PDF file was encrypted, which means that it was not easily exportable - I've got just one of those early files here. There is software which can decrypt this file and allow export, but my budget hasn't stretched to that purchase yet, and for a single file it's not worth my while.
Some of the files are behind a password, like BullionVault, and this makes the software a bit more complicated because I need to have an account with them and emulate the login process as additional steps prior to the download. This can be a pain in the butt, especially when the URLs change.
Some source files (like Julius Baer) are only available by email, and like some of the others, these are zipped using a proprietary zip format, which means that to automate them I first need a library that can unzip the file before it gets processed.
Anyway, the database keeps a listing of the download locations we have discovered for each fund. I have some custom-built software which pulls a copy of each file to the computer and places it into a folder called '_Rename_Queue'. This is a picture of the file system:
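Separate from that folder snapshot, the download step itself boils down to something like the little Python sketch below. To be clear, this isn't the production code - the paths, the hard-coded URL dictionary and the use of the requests library are purely for illustration (the real download locations live in the database, as mentioned above).

```python
# Simplified sketch of the scheduled download step (illustrative only).
# The real system reads the per-fund download locations from the database;
# here they are just a hard-coded dict.
from pathlib import Path

import requests

RENAME_QUEUE = Path(r"D:\BarLists\_Rename_Queue")   # illustrative path

DOWNLOAD_LOCATIONS = {
    "SLV": "https://example.com/slv/bar_list.pdf",   # placeholder URL
    # ... one entry per fund in the real system
}

def pull_latest_files() -> None:
    RENAME_QUEUE.mkdir(parents=True, exist_ok=True)
    for code, url in DOWNLOAD_LOCATIONS.items():
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        # Land the raw file in the rename queue; the rename stage
        # works out the proper normalized name later.
        target = RENAME_QUEUE / f"{code}.download.pdf"
        target.write_bytes(response.content)

if __name__ == "__main__":
    pull_latest_files()
```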
From here the file progresses from stage to stage, but the first stage is the most important. It helps the system to have a normalized file name, so it constructs a new name that looks like this:
SLV.20120907.071740.JasperReports.(BullionWeightListReport).pdf
The pattern is very basic - you can see it is constructed as {CODE}.{Date}.{Time}.{Name}.pdf, where the date and time are the published date of the document, taken from the properties of the PDF file. This is sort of important - we download SLV's file every day to see whether it has changed. If it hasn't, the system detects it as a duplicate and puts it in the 'duplicates' folder during the rename process. Sometimes the size of the file (in bytes) and its date-time stamp will change, meaning it has been modified, even though the total ounces and total bar count stay the same; in our system we still log this as a unique document, since it still represents a change. A chart showing the update frequency of SLV can be seen back in the SLV Database 3 article.
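Conceptually, the rename and duplicate check amount to something like the following sketch. In the real system the published date comes out of the PDF properties; here it's just passed in as a parameter, and the duplicate test is shown as a simple content hash - the folder paths and function names are illustrative.

```python
# Sketch of the rename stage: build the normalized name and detect duplicates.
# Paths, the 'seen hashes' store and the report name are simplified for illustration.
import hashlib
import shutil
from datetime import datetime
from pathlib import Path

DUPLICATES = Path(r"D:\BarLists\duplicates")
EXTRACTION_QUEUE = Path(r"D:\BarLists\_Extraction_Queue")

def normalized_name(code: str, published: datetime, report_name: str) -> str:
    # {CODE}.{Date}.{Time}.{Name}.pdf, e.g.
    # SLV.20120907.071740.JasperReports.(BullionWeightListReport).pdf
    return f"{code}.{published:%Y%m%d}.{published:%H%M%S}.{report_name}.pdf"

def process_download(pdf: Path, code: str, published: datetime,
                     report_name: str, seen_hashes: set[str]) -> None:
    digest = hashlib.sha256(pdf.read_bytes()).hexdigest()
    if digest in seen_hashes:
        # Identical bytes to a file we already hold: a true duplicate.
        shutil.move(str(pdf), str(DUPLICATES / pdf.name))
        return
    # Any change at all (even with the same totals) counts as a new document.
    seen_hashes.add(digest)
    new_name = normalized_name(code, published, report_name)
    shutil.move(str(pdf), str(EXTRACTION_QUEUE / new_name))
```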
If the file is new then it moves to the '_Extraction_Queue', where another process takes the PDF file and converts it to a text file automatically, in preparation for the next stages. The extraction process places the TEXT file in the '_Database_Queue' and moves the original PDF to the '_Cloud_Queue', where yet another thread picks the file up and automatically archives it to cloud storage.
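The flow through this stage is simple enough to sketch. The fragment below uses the pdftotext command-line tool (from the Poppler utilities) as a stand-in converter - that's an assumption for the example, not necessarily what the real process uses - but the folder shuffle is the same idea.

```python
# Sketch of the extraction stage: PDF -> text, then shuffle files onward.
# pdftotext (from the Poppler utilities) is used here purely as an example converter.
import shutil
import subprocess
from pathlib import Path

EXTRACTION_QUEUE = Path(r"D:\BarLists\_Extraction_Queue")
DATABASE_QUEUE = Path(r"D:\BarLists\_Database_Queue")
CLOUD_QUEUE = Path(r"D:\BarLists\_Cloud_Queue")

def extract_pending() -> None:
    for pdf in sorted(EXTRACTION_QUEUE.glob("*.pdf")):
        txt = DATABASE_QUEUE / (pdf.stem + ".txt")
        # Convert the PDF to plain text for the filter stage.
        subprocess.run(["pdftotext", "-layout", str(pdf), str(txt)], check=True)
        # The original PDF heads off to be archived in cloud storage.
        shutil.move(str(pdf), str(CLOUD_QUEUE / pdf.name))
```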
The '_Database_Queue' is the biggest bottleneck - currently there are quite a few files in there which need processing. For each of these I need to write a 'filter', which is a translation routine that recognises the columns and data and can read the file. This is one of the hardest parts of the software, due to the massive variety of formats across all the documents. However, economies of scale have started to play a part - the very first SLV filter took about two weeks to write, the second took a week, and these days I can write a new file filter in about an hour. These filters often need maintenance because the source documents are often messy. The worst case is when the software which wrote the original report spills a record over onto two lines, causing an 'overflow'. It can be tricky to tell whether the next line is a serial number, or a serial number plus a refiner code, because some refiner codes are short and serial numbers sometimes contain letters too - basically you have to tell the code the conditions under which the next line could be part of the first. In the example below, the highlighted yellow part is part of the serial number.
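In code, that 'is this line really a new record?' decision ends up looking roughly like the fragment below. The regular expressions are invented for the example - every real bar list needs its own column layout and its own rules - but the shape of the logic is representative.

```python
# Illustrative overflow handling inside a filter. The patterns here are made up;
# each real bar list needs its own column layout and its own rules.
import re

# A "complete" row: serial, refiner, gross ounces, fineness (hypothetical layout).
ROW_PATTERN = re.compile(
    r"^(?P<serial>\S+)\s+(?P<refiner>.+?)\s+(?P<ounces>[\d,.]+)\s+(?P<fineness>0\.\d{3,4})\s*$"
)
# A bare fragment of letters/digits with no weight or fineness on it.
FRAGMENT_PATTERN = re.compile(r"^[A-Z0-9\-]{1,12}$")

def merge_overflow_lines(lines: list[str]) -> list[str]:
    merged: list[str] = []
    for line in lines:
        line = line.rstrip()
        if not line:
            continue
        if ROW_PATTERN.match(line):
            merged.append(line)
        elif merged and FRAGMENT_PATTERN.match(line.strip()):
            # Looks like the tail of the previous record (a spilled serial number,
            # or serial plus refiner code), so glue it back onto that record.
            merged[-1] = merged[-1] + line.strip()
        else:
            merged.append(line)   # let the filter flag it for manual review
    return merged
```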
Fixing issues like this takes up about 20-30% of the time spent building all this. It matters because the source document processing is critical - any errors here carry through to everything later in the chain.
The next biggest time soak is new brand listings - every fund has its own method of indicating brand, and we've got a list of over 600 different variants of refiner name mapping onto about 100 refiner records. When a new brand shows up, I have to figure out which refiner it belongs to, and make sure it's not just a guess but backed up by research - normally using my book. Here are all the different entries which map to the same refiner, 'Henan Yuguang Gold and Lead Co Ltd':
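Behind that list, the matching is just a lookup from a normalized brand string to a canonical refiner record - conceptually something like this sketch. The variant spellings and the refiner id in it are invented examples of the kind of thing that turns up, not entries from the real table.

```python
# Sketch of brand -> refiner normalization. The variants and id shown are
# illustrative examples of the spelling differences that turn up, not the real list.
import re

def normalize_brand(raw: str) -> str:
    # Upper-case, expand '&', drop punctuation and collapse whitespace so that
    # 'Henan Yuguang Gold & Lead Co., Ltd' and 'HENAN YUGUANG GOLD AND LEAD CO LTD'
    # reduce to the same key.
    cleaned = re.sub(r"[^A-Z0-9 ]+", " ", raw.upper().replace("&", " AND "))
    return re.sub(r"\s+", " ", cleaned).strip()

# Maps a normalized variant to the canonical refiner record's id.
BRAND_TO_REFINER: dict[str, int] = {
    normalize_brand("Henan Yuguang Gold & Lead Co., Ltd"): 57,   # illustrative id
    normalize_brand("HENAN YUGUANG"): 57,
    # ... 600+ variants mapping onto ~100 refiner records
}

def refiner_for(raw_brand: str) -> int | None:
    # Unknown variants return None and get queued for manual research.
    return BRAND_TO_REFINER.get(normalize_brand(raw_brand))
```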
And it's from there that the entries get processed into a big table - either WAREHOUSE_GOLD or WAREHOUSE_SILVER, depending on the metal. By the time it reaches these tables, the data is highly normalized, and multiple funds' data can be processed using the same routines.
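The final load is trivial once everything is normalized - roughly along these lines. SQLite and the column list here are stand-ins for the sketch; only the table names match what I described above.

```python
# Sketch of loading normalized bar records into the warehouse tables.
# SQLite and the columns are stand-ins; only the table names come from the text above.
import sqlite3

METAL_TABLES = {"GOLD": "WAREHOUSE_GOLD", "SILVER": "WAREHOUSE_SILVER"}

def load_bars(conn: sqlite3.Connection, metal: str, rows: list[tuple]) -> None:
    # Every fund's data arrives in the same normalized shape by this point,
    # so one load routine serves them all; the only branch is the target table.
    table = METAL_TABLES[metal.upper()]
    conn.executemany(
        f"INSERT INTO {table} (fund_code, document_date, serial, refiner_id, ounces) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
```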
Anyway, I think that's enough for today. Remember, this page is here to demonstrate the evolution of the software and increase overall transparency - I want to show that the process sitting behind this is not ad-hoc: it's a proper data warehousing operation with a specific flow that picks up many different files, as well as Platinum and Palladium records.
< next snapshot: storage and processing details >