To prioritize quality assurance efforts, various fault prediction models have been proposed. However, the best performing fault prediction model is unknown due to three major drawbacks: (1) comparison of few fault prediction models considering small number of data sets, (2) use of evaluation measures that ignore testing efforts and (3) use of n-fold cross-validation instead of the more practical cross-release validation. To address these concerns, we conducted cross-release evaluation of 11 fault density prediction models using data sets collected from 2 releases of 25 open source software projects with an effort-Aware performance measure known as Norm(Popt). Our result shows that, whilst M5 and K∗ had the best performances, they were greatly influenced by the percentage of faulty modules present and size of data set. Using Norm(Popt) produced an overall average performance of more than 50% across all the selected models clearly indicating the importance of considering testing efforts in building fault-prone prediction models.