Big data Chain Ladder analysis in Python
From now there will be fundamental changes to this blog. Active Analytics Ltd is now not just an R-based company, we embrace any open source statistical/software development tools that we feel suits our aims. This blog will now reflect that, there will be much less emphasis on R and more on big data, and more focus on other programming languages and tools. I hope that this will be more stimulating and useful for those that read this blog, and that it will help readers generate alternative interesting ideas.
This blog entry follows on from the previous one on Big data Chain Ladder analysis in R where I used the rhdf5 and ChainLadder packages, and some quickly implemented code to carry out chain ladder analysis. This blog entry does the same thing in Python.
I first start by comparing R and Python for statistical analysis in the R vs Python section, feel free to skip over this section - it is quite long.
-The R Side
For many months now I have been itching for an excuse to do a blog posting on data analysis in Python. The brilliantly written Python h5py module for HDF5 file I/O is a good excuse for me to start doing this.
Before I begin I would like to say that nothing I say here is a 'dis' on R. I love R, it is the program I started coding in, but there are many tools and programming languages out there, and a good coder and developer will select the right tool for the right job - if given the choice.
The important feature of R is its philosophy which can be read in my introduction to R course notes that are freely available here. This philosophy is to 'turn ideas into software quickly and faithfully', and R does this in data analysis. For trying out an idea and getting something to work quickly in statistical data analysis the S language has had a big advantage over Python but this has changed!
These features have been baked into the R language from the very beginning:
- The data frame (either data.frame/data.table). The pandas package is available in Python and it is brilliant, a lot of the features available in R's data.frame/data.table objects have been ported into this package, it is very feature rich and well worth checking out. I would say that the main difference in the way pandas works is that the functions applied to the DataFrame object are explicitly method based, so if you are coming from an R background it may take a little while to get used to this but it is well worth the time investment!
- The formula class and model.matrix. Formula objects are extremely important in creating models. Lets face it, data comes in tables, and has complex types. The formula class allows data to be reshaped and prepared for models using functions such as model.matrix(). Formula also plays important roles in plotting and many other features and quite frankly deserves its own blog entry. The Python analogue is in the patsy package.
- The factor class. This is very interesting both as a form of compression, and as a data class. Categorical data is very important in many modelling applications and the integration between the
factor class, formula and model matrices is seamless in R both for regression and hypothesis testing. It is also used in summary statistics and split-apply type jobs using
reshape2packages and many many more. The pandas package has categorical data coding and this is seamlessly interpreted by the statsmodel package.
- The apply functions,
reshape2. While some of these functions are relatively easy to implement, they are extremely useful. Some of this functionality is already implemented in one way or another in the pandas package.
A lot of very useful stuff has been built on these 'killer' features and this is why R programmers put up with so many less desirable challenges of coding in R - because the language is so darn useful for data analysis. However Python is already catching up and is already claiming ground from R as the scripting language of choice in statistical analysis and data mining applications.
-The Python Side
I think of Python as being R's big brother, and when young siblings fight big brothers often win. Regardless of what happens to R, I think that its philosophy will carry on. In terms of philosophy and language, there is some overlap in R and Python. They are both scripting languages and Python is also designed to allow programmers to implement things quickly, but while R comes from statistics Python comes from physics and engineering. In Python for example the numpy module, you will find much more emphasis on multidimensional arrays (maybe for those 4th rank stiffness tensors physicists play with). But because Python is built by physicists and engineers, the quality is better, it is faster, it is more reliable, the procedures are more consistent (few nasty surprises), the documentation is impressive in its thoroughness (it make you feel safe, warm and cuddly inside), the modules are also more mature, there are many more features for cross platform (in terms of OS, and connections to other programming languages) than R. It is a serious full featured, grown-up, kitchen sink included scripting programming language and it is easy to use, and when I say easy, I mean EASY! If you want to prototype or for serious development, you can do it quickly in Python. In fact, I don't believe that there is a better language that embraces the philosophy of the S language than Python. More and more features in R are being implemented in Python e.g. pandas, statsmodels, scikits, scipy, and numpy.
I will be using three Python modules today, h5py for HDF5 file I/O, numpy for numerical data crunching, and the time package for timing processes. I am using Python 2.7.5+ on Ubuntu 13.10. I am using the same computer as the equivalent analysis in R with no multi-threading/parallel processing tricks.
We now load the packages
import h5py as h5py import numpy as np import time as time
In the spirit of porting R-features into Python (or other languages), I hope to slowing develop a theme in my blogs called littleR. These will be exploring how to implement 'killer' R features in other languages (probably just Python for now). But really if you are an R programmer seriously considering using Python you should really check out the pandas and statsmodels packages they are awesome.
In R, there is a function called
jitter(), it adds a little noise to each point that you submit to the function. Below is my version of jitter, which I call
because I like beginning function names with capitals (I also camel-case). As you can see, it is a very simple function, three lines long, each number is taken to be normally distributed around itself and the standard
deviation is assumed to be
sigmalScale times smaller (which defaults to 100). Could it be simpler?
def Jitter(array, sigmaScale = 100): array = np.random.normal(loc = array, scale = array/sigmaScale) return array
Next is the
Lapply() function which stands in for the
lapply()/sapply() function in R. It will return a list objects if given a list or a numpy array if given
a numpy object.
# Function to find if item is numpy-type def isnumpy(item): return 'numpy' in str(type(item)) def Lapply(_list, fun, **kwargs): iSize = _list.__len__() out = list() for index in range(0, iSize): out.append(fun(_list[index], **kwargs)) if isnumpy(_list): out = np.array(out) return out
Next is the
LowerTri() function with acts a bit like
lower.tri() in R but returns indices rather than logical.
# Okay I don't like typing, so I will declare LowerTri def LowerTri(mMatrix): return np.tril_indices_from(mMatrix, -1)
Again this is pretty straight forward. These functions will be used extensively in the Chain Ladder demonstration below.
Chain Ladder analysis in Python
Okay lets get down to business. In this section we will implement in Python the analysis done in the previous posting in R. Consider you have a lot of claims triangles that will not fit into memory. I claim that HDF5 (I like to use H5 to mean the same thing) is a very good option for fast I/O and data analysis. The procedure is to simulate the data and write it to H5 format, and then read back that data, analyse it using the Chain Ladder method, and then write the results to another H5 file.
Again I will say that the h5py package is quite simply awesome, if you work with big data, you should take a look at it, the documentation is amazing. In this blog I make no attempt to optimize performance at all, there is no multi-threading or parallel processing, though these are straightforward to include since the API is so darn good. I could also have written the data set into a three-dimensional array instead of separate data items but this is simply a quick hack to get an idea of the comparison with R.
Simulating the data
First I create the Age-to-Age factors:
# Age to Age factors ageFact = ageFact = np.linspace(1.9, 1, 10) ageFact = ageFact.cumprod()
As before we next create a function to generate a triangle row (equivalent to a development year).
# This is a function to create a row of the triangle def GenerateTriangleRow(iDev, dFactors = ageFact, dInflationRate = 1.02, initClaim = 154): dFactors = np.concatenate((, dFactors)) dClaims = Jitter(initClaim)*Jitter(dFactors)*(dInflationRate**iDev) return dClaims
Now we create a function to simulate the full triangle
def GenerateTriangle(iSize, **kwargs): indices = np.arange(1, iSize + 1) mClaimsMatrix = Lapply(indices, GenerateTriangleRow, **kwargs) # Reverse the columns mClaimsMatrix = mClaimsMatrix[:, ::-1] # Assign nan to lower triangle mClaimsMatrix[LowerTri(mClaimsMatrix)] = nan # Reverse columns to get the claims triangle mClaimsMatrix = mClaimsMatrix[:, ::-1] return mClaimsMatrix
Next we create a function to write the triangle to H5 format. See how easy it is to use the h5py API?
def WriteToH5File(sName, h5File): h5File[sName] = GenerateTriangle(11) return None
Next we create the H5 file and the matrix names
# H5 file sClaimsH5 = "data/ClaimsTri.h5" claimsH5File = h5py.File(sClaimsH5) # Matrix names sMatrixNames = Lapply(range(2000), str)
Finally we execute the process to simulate the data and write to file 2000 times as in the R case
# Executing the process to simulate data & write to file np.random.seed(0) xA = time.time() null = Lapply(sMatrixNames, WriteToH5File, h5File = claimsH5File) claimsH5File.flush() xB = time.time() print 'The elapsed time is ' + str(xB - xA)
This is the time taken to run # The elapsed time is 1.108 s
Say what? Yes more than an order of magnitude faster than the R case with no tweaking at all. Be careful however, all I may have proved is that h5py is faster than rhdf5.
Chain Ladder analysis
To my knowledge there is no python Chain Ladder package sitting out there, so lets quickly hack our own function.
First we create function to extract the age-to-age factor for a particular column pair
def GetFactor(index, mTri): colSum = mTri[:np.negative(index + 1), index:(index + 2)].sum(0) return colSum/colSum
We then use this function to apply Chain Ladder to the triangle and return the completed claims development 'square' - because it is a matrix.
def GetChainSquare(mClaimTri): iSize = mClaimTri.shape dFactors = Lapply(np.arange(iSize - 1), GetFactor, mTri = mClaimTri) dAntiDiag = mClaimTri[:,::-1].diagonal()[1:] for index in range(dAntiDiag.size): mClaimTri[index + 1,np.negative(index + 1):] = dAntiDiag[index]*dFactors[np.negative(index + 1):].cumprod() return mClaimTri
Next we create a function to write this completed square to file
def WriteSquareToFile(sName, destH5, sourceH5): destH5[sName] = GetChainSquare(sourceH5[sName][...]) return None
Then we create the H5 file to store the processed files (there is no reason why we cannot simply store this in the same file as the source data)
# Creating the H5 files for the processed triangles sProcessedH5 = "data/ClaimsSquare.h5" processedH5 = h5py.File(sProcessedH5) # Executing the process to do chainladder and write to file xA = time.time() null = Lapply(sMatrixNames, WriteSquareToFile, sourceH5 = claimsH5File, destH5 = processedH5) processedH5.flush() xB = time.time() print 'The elapsed time is ' + str(xB - xA)
This is the time taken to run # The elapsed time is 1.224 s
No you haven't gone blind, its just another scintillating performance from the h5py package
I enjoyed writing the code for this blog - everything I attempted worked mostly on the first try. Python is robust no doubt. All programming languages have their disadvantages and frustrations, but R programmers put up with a lot for the immediacy and convenience of their language. Python is a richer more stable and better performing language, and it embraces what is best about S, to create ideas quickly and faithfully, it is an easy language to code with many similarities to R. I think the performance numbers speak for themselves but as I have said, they may be benchmarking h5py against rhdf5 rather than R and Python.
Stay tuned to this blog and expect to see a lot more Python related activities here.
Data Science Consulting & Software Training
Dr. Chibisi Chima-Okereke, R Training, Statistics and Data Analysis.