Another way to speed up convolution when the filter is large is to do it in the DFT domain. Computing the DFT with the FFT takes O(P log P), where P is the total number of pixels.
For example, suppose the filter and the image are both NxN, and we want the convolution of the zero-padded image. Multiplying in the DFT domain corresponds to circular convolution (it's like tiling the image horizontally and vertically and then convolving), so we need to zero-pad both the image and the filter to 2Nx2N. The full operation then takes about O(N^2 log N) (computing the DFTs of the image and the filter with the FFT) + O(N^2) (multiplying the two DFTs pointwise) + O(N^2 log N) (getting the image back with the inverse FFT). That is O(N^2 log N) overall, compared with O(N^4) for direct convolution at these sizes.
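The steps above can be sketched in NumPy as follows. This is a minimal illustration rather than a tuned implementation: the function names `fft_convolve2d` and `direct_convolve2d` are made up for this example, and the padding uses the minimum size (image size + filter size - 1 per axis) that avoids wrap-around, which 2Nx2N also satisfies.

```python
import numpy as np

def fft_convolve2d(image, kernel):
    """Full linear 2D convolution computed in the DFT domain."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    # Pad so that the circular convolution implied by the DFT
    # does not wrap around into the valid output.
    out_shape = (ih + kh - 1, iw + kw - 1)
    # rfft2 zero-pads its input up to the shape given by `s`.
    F_image = np.fft.rfft2(image, s=out_shape)
    F_kernel = np.fft.rfft2(kernel, s=out_shape)
    # Pointwise product in the DFT domain, then inverse transform.
    return np.fft.irfft2(F_image * F_kernel, s=out_shape)

def direct_convolve2d(image, kernel):
    """Reference: full linear 2D convolution by shifted accumulation."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih + kh - 1, iw + kw - 1))
    for i in range(kh):
        for j in range(kw):
            # Each kernel tap adds a scaled, shifted copy of the image.
            out[i:i + ih, j:j + iw] += kernel[i, j] * image
    return out
```

For small random inputs, `np.allclose(fft_convolve2d(img, k), direct_convolve2d(img, k))` should hold up to floating-point error. Crop the result if you want an output the same size as the image.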
Depending on the implementation, this method may be slower for small filters. If the filter is not separable, it can be a good choice. Even if the filter is separable, doing each of the 1D convolutions in the DFT domain may still save some computation.
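For the separable case, one possible sketch is to run a 1D FFT convolution along each axis in turn. The name `fft_convolve_separable` and its arguments are hypothetical, and it assumes the 2D filter is the outer product of a column vector and a row vector.

```python
import numpy as np

def fft_convolve_separable(image, col_kernel, row_kernel):
    """Full 2D convolution with the filter outer(col_kernel, row_kernel),
    done as two passes of 1D FFT convolution."""
    h, w = image.shape
    out_h = h + len(col_kernel) - 1
    out_w = w + len(row_kernel) - 1

    # Pass 1: convolve every column with col_kernel (along axis 0).
    F_cols = np.fft.rfft(image, n=out_h, axis=0)
    F_ck = np.fft.rfft(col_kernel, n=out_h)[:, None]
    tmp = np.fft.irfft(F_cols * F_ck, n=out_h, axis=0)

    # Pass 2: convolve every row with row_kernel (along axis 1).
    F_rows = np.fft.rfft(tmp, n=out_w, axis=1)
    F_rk = np.fft.rfft(row_kernel, n=out_w)[None, :]
    return np.fft.irfft(F_rows * F_rk, n=out_w, axis=1)
```

Whether this beats direct 1D convolution depends on the filter length and the FFT implementation, so it is worth timing both on your own sizes.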