Regression-based predictive modelling of software size of fintech projects using technical specifications
DOI: https://doi.org/10.22581/muet1982.3289
Keywords: K-fold cross validation, Lines of code, Multiple linear regression, Size prediction model, Software size prediction, Technical specifications
Abstract
This research develops a predictive model that estimates the lines of code (LOC) of software projects from their technical requirements specifications. It addresses the recurring problem of inaccurate effort and cost estimation in software development, which often results in budget overruns and delays. The study analyses a dataset of past real-life software projects and extracts candidate predictors from project requirements written in technical, easily comprehensible natural language. A pilot study is first conducted to assess feasibility. Simple Linear Regression (SLR) is then used to determine the relative predictive strength of eight potential predictors identified earlier. The number of API calls is found to be the strongest independent predictor of LOC (R² = 0.670). The next phase constructs a software size prediction model using Forward Stepwise Multiple Linear Regression (FSMLR). The adjusted R² of the final model indicates that two factors, the number of API calls and the number of GUI fields, account for more than 80% of the variation in code size measured in LOC. The model is validated using k-fold cross-validation, with promising results. The mean magnitude of relative error (MMRE), averaged over all folds, is 0.203, indicating that the model's predictions deviate from the actual values by approximately 20% on average. The average PRED(25) is 0.708, meaning that nearly 71% of predicted size values fall within 25% of the actual size values. The model can help project managers make better decisions about project management, budgeting, and scheduling.
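The two accuracy statistics reported above can be illustrated with a minimal sketch. It assumes the conventional definitions: MMRE as the mean of |actual - predicted| / actual, and PRED(25) as the fraction of predictions whose relative error is at most 25%. The function names and the LOC values are hypothetical and are not taken from the study's dataset or tooling.

```python
def mmre(actual, predicted):
    """Mean magnitude of relative error: mean of |actual - predicted| / actual."""
    errors = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


def pred(actual, predicted, level=25):
    """PRED(level): fraction of predictions with relative error <= level percent."""
    threshold = level / 100
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) / a <= threshold)
    return hits / len(actual)


# Hypothetical LOC values for illustration only (assumption, not study data).
actual_loc = [1000, 2000, 4000, 500]
predicted_loc = [1100, 1600, 4400, 800]

print("MMRE:", mmre(actual_loc, predicted_loc))
print("PRED(25):", pred(actual_loc, predicted_loc, 25))
```

For these made-up values the relative errors are 0.1, 0.2, 0.1 and 0.6, so MMRE is 0.25 and three of the four predictions fall within 25% of the actual size, giving PRED(25) = 0.75. In a k-fold setting, both statistics would be computed on each held-out fold and then averaged.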
License
Copyright (c) 2025 Mehran University Research Journal of Engineering and Technology

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.