The police tests don't apply directly, because according to the wording, the witness, given any mix of cabs, would get the right answer 80% of the time. Thus given a mix of 85% green and 15% blue cabs, he will say 20% of the green cabs and 80% of the blue cabs are blue. That's 20% of 85% plus 80% of 15%, or 17%+12% = 29% of all the cabs that the witness will say are blue. Of those, only 12/29 are actually blue. Thus P(cab is blue | witness claims blue) = 12/29. That's just a little over 40%.
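If you want to check the arithmetic, here is a minimal Python sketch of the same calculation. The function name `posterior_blue` and its parameters are made up for illustration, and it assumes (as the wording implies) that the witness is equally accurate on green and blue cabs.

```python
from fractions import Fraction

def posterior_blue(prior_blue, accuracy):
    """P(cab is blue | witness claims blue), assuming the same accuracy for both colors."""
    claimed_blue_is_blue = accuracy * prior_blue               # blue cab, correctly called blue
    claimed_blue_is_green = (1 - accuracy) * (1 - prior_blue)  # green cab, wrongly called blue
    return claimed_blue_is_blue / (claimed_blue_is_blue + claimed_blue_is_green)

# 15% blue cabs, witness right 80% of the time
p = posterior_blue(Fraction(15, 100), Fraction(80, 100))
print(p)  # 12/29
print(float(p))
```

Using `Fraction` keeps the result exact, so the 12/29 shows up directly instead of a rounded decimal.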
Think of it this way... suppose you had a robot watching parts on a conveyor belt to spot defective parts, and suppose the robot made a correct determination only 50% of the time (I know, you should probably get rid of the robot...). If one out of a billion parts is defective, then to a very good approximation you'd expect half your parts to be rejected by the robot. That's 500 million per billion. But you wouldn't expect more than one of those to be genuinely defective. So given the mix of parts, a lot more than 50% of the REJECTED parts (nearly all of them, in fact) will be rejected by mistake, even though 50% of ALL the parts are correctly identified, and in particular, 50% of the defective parts are rejected.
When the biases get so enormous, things start getting quite a bit more in line with intuition.
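The robot numbers can be checked the same way. This is only a rough sketch: `rejected_breakdown` is a hypothetical helper, and it assumes the robot is right 50% of the time on good and defective parts alike.

```python
from fractions import Fraction

def rejected_breakdown(defect_rate, accuracy):
    """Return (share of all parts rejected, share of rejected parts that were actually fine)."""
    rejected_defective = accuracy * defect_rate         # defective and correctly rejected
    rejected_good = (1 - accuracy) * (1 - defect_rate)  # good but wrongly rejected
    rejected = rejected_defective + rejected_good
    return rejected, rejected_good / rejected

# One defective part per billion, robot correct 50% of the time
rejected, mistaken = rejected_breakdown(Fraction(1, 10**9), Fraction(1, 2))
print(rejected)      # share of all parts that get rejected
print(1 - mistaken)  # share of rejected parts that really are defective
```

With these inputs, half of all parts are rejected and only one in a billion of the rejected parts is genuinely defective, which matches the back-of-the-envelope reasoning.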