Abstract
Crowdsourced software development (CSSD) has been receiving considerable attention from the software and research communities in recent times. One of the key challenges faced by CSSD platforms is the task selection mechanism, which in practice follows no intelligent scheme; instead, rule-of-thumb or intuition-based strategies are employed, leading to bias and subjectivity. Effort considerations on crowdsourced tasks can offer a good foundation for task selection criteria but remain largely uninvestigated. Software development effort estimation (SDEE) is a well-established domain in software engineering, but it has mostly been investigated for in-house development; for open-source or crowdsourced platforms, it is rarely explored. Moreover, machine learning (ML) techniques are increasingly dominating SDEE with the claim of providing more accurate estimation results. This work applies ML-based SDEE to analyze development effort measures on a CSSD platform. The purpose is to discover development-oriented features for crowdsourced tasks and to analyze the performance of ML techniques in order to find the best estimation model for a CSSD dataset. TopCoder is selected as the target CSSD platform for the study. TopCoder development task data with development-centric features are extracted, followed by statistical, regression, and correlation analysis to justify feature significance. For effort estimation, 10 ML families with 2 techniques each are applied to obtain a broader view of estimation. Five performance metrics (MSE, RMSE, MMRE, MdMRE, Pred(25)) and Welch's statistical test are used to judge the performance of each effort estimation model. Data analysis results show that the selected TopCoder features exhibit reasonable model significance, regression, and correlation measures. The ML effort estimation findings show that the best results for the TopCoder dataset are obtained by linear regression, non-linear regression, and SVM family models. To conclude, the study identified the most relevant development features for the CSSD platform, confirmed by in-depth data analysis, reflecting that careful selection of effort estimation features offers a good basis for accurate ML estimates.
Introduction
Software development effort estimation (SDEE) is considered one of the most crucial software project management activities by researchers and practitioners [12, 17, 46]. SDEE is inevitable, since underestimation can lead to insufficient resource allocation and poor-quality projects and, consequently, to incomplete project requirements [28, 36]. On the other hand, overestimation induces an unnecessarily high budget and resource allotment for a project; as a result, a rise in project costs and bidding disadvantages may be encountered. Thus, accurate effort assessment at the early stages of the software lifecycle is essential. Development effort can be calculated based on task completion [78], measured in person-months (PM) or person-hours (PH) [13]. Various SDEE techniques have been proposed over the past four decades. Jorgensen and Shepperd [38] conducted a systematic literature review (SLR) based on 304 studies. As reported in the SLR, estimation is done mostly using expert judgment [37], statistical analysis of historical project data [8], or machine learning (ML) techniques [4, 19, 23, 52]. There has been an increasing trend of applying ML methods to SDEE, as they are better at modeling the complex relationship between effort and software features [56], especially when dependent and independent features follow a non-linear relationship with no predetermined form. ML techniques are also known for overcoming bias [25]. Decision trees (DT), neural networks, support vector regression (SVR), and ensemble methods are the techniques most explored in the SDEE domain by recent researchers.
Crowdsourcing is an evolving problem-solving paradigm, which works in a distributed environment and combines human-intensive work with machine computation [6]. Software crowdsourcing (SWCS), or crowdsourced software development (CSSD), is a growing branch of crowdsourcing that integrates traditional software engineering tasks into a competition-based crowdsourced environment. Typically, CSSD enables software development through an open call format. Modern CSSD platforms build on the concept first described by Jeff Howe in 2006 [32] and share similarities with earlier development models established by the Free Software Movement.
For any CSSD project, a "task" is the starting point: one unit of work made available on the CSSD platform. A CSSD task represents a need or problem at hand (i.e., software development, design, or quality assurance). The general architecture of CSSD involves three types of actors or stakeholders: the Client, also known as the task requester (TR); the Developer, also known as the crowd worker (CW); and the CSSD platform connecting TR and CW. On a competitive CSSD platform, a TR posts a task and offers an award for its completion, while a CW (software developer/designer/tester) participates in the given software task. The crowdsourcing platform provides an online marketplace within which TR and CW can collaborate (also reflected in Fig. 1).
An overview of the proposed work.
Multiple CSSD platforms, including Amazon Mechanical Turk, TopCoder, TaskCity, eLance, vWorker, Guru, oDesk, and Taskcn, provide crowdsourced services to the larger software community. Among all CSSD platforms, TopCoder hosts the largest software developer community and is known to deliver projects for well-recognized organizations such as Google, Facebook, Microsoft, and AOL [68]. The TopCoder platform is geared towards highly skilled CWs, who undertake time-consuming, complex, and quality-demanding software development tasks. Projects requested on TopCoder are larger and more complex than those on microtask platforms (e.g., Amazon Mechanical Turk). Thus, arriving at a justified price with respect to the amount of work a task requires is an imperative decision problem for the stakeholders involved [72].
[43] identified key challenges faced on CSSD platforms, such as ineffective decision making in task selection, task completion uncertainty, costing issues, and task quality assurance [35]. From the CW's perspective, adopting a task without considering effort may lead to low capital consumption efficiency or task starvation [26]. The award money is established by the TR at the time of task posting and should be sufficiently attractive to the crowd. From the TR's perspective, an inappropriate price can lead to the loss of potential crowd workers; hence, the price must be in accordance with the effort demanded by the task at hand. Valid pricing is a well-known motivational factor behind CWs' task selection [43, 59] and one of the top five key challenges for crowdsourced software development [42]. Further, task completion uncertainty prevails if task selection is attempted blindly, i.e., without considering the effort the task will consume. An effort-based pricing strategy is crucial, since the TR's mere intuition about the price does not provide a justified costing mechanism for a CSSD task. Moreover, estimation needs to be supported by a good set of features to formulate an accurate effort estimation model. Since a crowdsourced task requires an in-time decision by the CW on whether to opt for the task, the features included in estimation must be those already available with the task, rather than relying on previous data. Considering the need for effort estimation in the field of CSSD with highly relevant, development-centric features, this study incorporates ML-based effort estimation for CSSD tasks. TopCoder is selected as the CSSD platform for the study, due to its increasing utilization by recent researchers [1, 48, 80]. The basic workflow of the study is shown in Fig. 1, and its main contributions are as follows:
(1) Defining a dataset of software development tasks posted on the TopCoder platform, with features readily available and not dependent on previous-phase data (i.e., software requirements, specifications, or design). (2) Performing a detailed statistical, regression, and correlation analysis on the selected TopCoder features, along with model significance and normality considerations, to justify the relevance of the dataset features. (3) Performing ML-based effort estimation with an empirical analysis using 20 ML techniques to identify the best performing estimation algorithm for the TopCoder dataset.
The rest of the paper is organized as follows: Section 2 presents related work covering the SDEE and CSSD literature, and Section 3 lists the problem formulation. Section 4 contains the methodology used for this study, with a complete framework description in its subsections; Section 4.3 lists the effort estimation algorithms used for the study. Section 5 covers the experimental setup and performance metrics. Results and discussion are elaborated in Sections 6 and 7, while Section 8 defines threats to the study's validity. Section 9 describes future directions for the study.
SDEE is a well-recognized field in traditional and agile software project management and has been worked on extensively in the past. CSSD, on the other hand, is a comparatively newer field, but owing to its underlying potential, researchers have made wide use of it over the past 15 years [68]. This section describes previous work done in the fields of SDEE and CSSD.
Related work on SDEE techniques
As mentioned earlier, researchers have contributed to SDEE and proposed many techniques over the last four decades. Overall, these techniques are grouped into three main categories [20]: expert judgment, which takes the effort opinion of one or more experts to determine the effort required by a project [33]; parametric techniques, utilizing statistical and/or numerical analysis of historical project data [10]; and machine learning (ML) techniques, based on a set of AI algorithms including artificial neural networks (ANN), genetic algorithms (GA), analogy-based or case-based reasoning (CBR), decision trees, and genetic programming [34, 78].
Random Forest (RF) was added to effort estimation by [53], with its performance compared against a regression tree model. Experiments were performed on the benchmark datasets ISBSG R8, Tukutuku, and COCOMO. The study concluded that RF provides better estimation than a simple regression tree in terms of the Pred(0.25), MMRE, and MdMRE regression metrics. Another contribution to traditional SDEE was made using support vector regression (SVR) by [18]. That study discussed how SVR provides good estimation for problems involving high dimensionality and outliers. Further SDEE work by Zare et al. [81] used a three-level Bayesian network; it also utilized optimization algorithms (GA and PSO) to extract optimal effort coefficients on the COCOMO NASA benchmark dataset.
The work of [47] in the SDEE domain analyzed the estimation power of different neural network variants. A comparative study was conducted on the multilayer perceptron (MLP) and radial basis function (RBF) networks against a multiple linear regression (MLR) model, concluding that neural network-based regression models perform statistically better than linear regression models. The effort metrics used were MAR and MdAR, with appropriate normalization and cross-validation on the ISBSG benchmark dataset. An empirical study conducted by [69] analyzed various algorithms including Support Vector Machine (SVM), Multi-Layer Perceptron Neural Network (MLPNN), Linear Regression (LR), and K-Nearest Neighbor (KNN), with SDEE model performance evaluated in terms of the coefficient of determination (R²).
Related work on CSSD
CSSD is a relatively newer domain yet has gained a great deal of attention in recent software practices. Research works contributing to the CSSD domain are listed below.
Summary of previous CSSD work
Work done by Mao et al. [49] was the first study to address the pricing issue for the TopCoder CSSD platform using machine learning models. The study constructed structural and non-structural pricing models on 490 TopCoder projects having 16 price drivers. Four categories of price drivers (input features) were formed: (1) Development Type (DEV), (2) Quality of Input (QLY), (3) Input Complexity (CPX), and (4) Previous Phase Decision (PRE). The pricing models were evaluated against three traditional baselines (COCOMO'81, random guessing, and the Naïve model), and estimation was performed using regression models (multiple linear regression, logistic regression), analogy-based schemes (KNN-1, KNN-k), and ML models (C4.5, CART, QUEST, neural network (NNet), SVM). The study concluded that price can be relatively predictable for this new paradigm; in particular, C4.5 gives good predictions, with more than 80% (84.3%) of estimates having an error lower than 30%. The study provides grounds that high predictive quality is achievable and can outperform existing pricing models, giving the TR actionable insight. Another CSSD cost estimation work, by [1], proposed the Context-Centric Pricing (CCP) method using 6 cost drivers extracted from TopCoder task descriptions. Seven cost models were created using ML techniques, including Linear Discriminant Analysis (LDA), Linear Regression (LR), kNN, CART, Naive Bayes, SVM, and NNet. Compared to the cost models of [49], the CCP method can work with the limited information available in a task description, reducing dependence on previous task data; however, it does not provide improved results with ML models. Moreover, cost drivers related to the CW's working ability and competitors' status were not included, which limits the method's decision support ability for constructing a costing model.
Another cost estimation study was conducted by [48]. Pricing models using LR, logR, and kNN were established on a TopCoder dataset and empirically evaluated. Initially, 13 features were considered, of which 9 were dropped due to their weak relationship with pricing. The results concluded that logistic regression performed best with 90% accuracy, while kNN achieved comparatively lower accuracy.
CWs' dependability in task selection for CSSD was examined by [2]. The study investigated crowd workers' behavior through empirical analysis; the behavioral aspects included were: (a) workers' behavior at the time of registration and task announcement; (b) the relationship between worker performance and award money; (c) the group-wise effect of development type; and (d) the evolution of workers' behavior due to skill development. The study's findings conclude that the most reliable group of CWs responds to a task within 10% of the registration time and finishes the task within 5% of the allotted time.
The research presented in [79] analyzed CW behavior and proposed dynamic decision systems for CWs. The study aims to recommend to a CW the best-matching tasks with a high winning rate. Influencing factors of CW behavior were investigated in a competitive CSSD environment using RF to predict who would be most likely to complete, quit, or win a task. The results confirmed that a particular CW can be categorized as a "potential winner" with an average of 78% precision and 88% recall when all decisive features are incorporated. Further, the models showed high recall for the "submitter" class.
Another work, by [45], identified factors influencing defects that can occur in the resulting software. The work used "project rating" to characterize the "in-process quality of a project", "amount of work in the process" to indicate "project effort", and "amount of work required in the maintenance process" to indicate "quality assurance effort". The features included for estimation were project size, total prize investment, and number of contests, with a proposed multiple linear model (MLM).
An empirical study was conducted by [77] to determine the extent of task pricing strategies implemented on a CSSD platform. The study worked on data from TopCoder, and an algorithm was devised to analyze the impact of CW behavior on pricing strategies. Tasks were labeled as overpriced, underpriced, and nominal; to represent the "nominal" price for each task, ML algorithms such as LR, kNN, and SVM were used. The work was done from the TR's perspective, and the conclusion was that overpriced tasks gained the attention of more workers: they tend to register and submit for those tasks and show higher task completion velocity. Similarly, underpriced tasks gained comparatively fewer registrants and submitters, along with lower task completion velocity. However, when it comes to delivering the actual task outcome, i.e., the task score, a worker may exert less effort due to the higher competition in an overpriced task. The study further concluded that overpricing does not guarantee the quality of the final deliverable merely by attracting more submitters. Another work in the CSSD domain, by [80], evaluated strategies for overcoming the onboarding barriers or challenges faced by CWs. Barriers include finding a task matched to the CW's ability, setting up the environment for performing the task, and managing one's personal time. Six strategies were proposed and evaluated through a web-based questionnaire administered to CSSD experts. The results concluded that CSSD platforms should incorporate a task matching system tied to the CW's profile, a virtual environment, and a stable communication channel for requirement clarification.
Table 1 shows a summarized version of previous CSSD work, including the outcome each study produced, the actor mainly facilitated by the work, and whether any ML scheme was used to facilitate estimation. Table 1 also establishes that TopCoder is the most commonly utilized platform in CSSD research, which motivates this work to consider TopCoder as the exploratory CSSD platform. However, from the past SDEE and CSSD work analyzed above, the following limitations arise:
Most SDEE work centers on the benchmark datasets of the PROMISE and ISBSG repositories, which mostly contain in-house or traditional software projects, while crowdsourced software projects have not been considered for any effort estimation work. In the CSSD literature, cost estimation and pricing models were the major focus, as almost 50% of past work is about pricing estimation techniques. For the CSSD actors (TR and CW), it is crucial to have an estimate of the total effort needed for task completion, which has not been explored much in the past. The work of [45] considered effort prediction, but only as one of the factors for estimating task quality; that is, the focus was mainly on task quality assessment, not effort requirements. Also, the estimation used only three project features for building the ML model, which is a relatively low feature count for achieving a well-performing ML model [66, 73]. [49] considered features of the previous software development phase, i.e., software design, for cost estimation; however, it is often challenging to get a previous record of a task on a CSSD platform, and a task may require just-in-time development with no record of previous phases. ML techniques are rarely explored for CSSD estimation, and where they are used, very few techniques are considered: mostly parametric and linear regression approaches are followed [1, 45, 49], while other families of ML techniques are not examined to verify whether estimation gives stable results in all contexts.
This work attempts to address the following issues, based on the limitations identified in the literature (Sections 2.1 and 2.2):
No joint framework exists on CSSD platforms for facilitating both the task requester (TR) and the crowd worker (CW)/developer in the decision process around task commencement. For the TR, it is necessary to know the effort involved in a task before posting it, so the winner's compensation can be adjusted accordingly. For the CW, an intuition of whether to accept a task based on the required effort is likewise necessary. In this study, a framework for predicting effort is defined, facilitating both TR and CW in estimating effort based on task features.
Past studies mainly focused on cost analysis of TopCoder tasks [1, 49]. However, a core focus on the effort of a CSSD task is still unaccounted for, and consequently no proper criteria for cost justification exist. Effort consideration matters because the TR wants to know whether the winner's cost is justified by the effort the task requires. From the CW's (developer's) perspective, task completion can also be better ensured if the CW can assess the effort of a posted task beforehand. In this work, core effort estimation for TopCoder development tasks is undertaken, facilitating TR and CW in estimating the effort of the task at hand and providing criteria for cost justification.
Extraction of appropriate features and data collection play a vital part in crowdsourced environment analysis [22]. The TopCoder dataset features used by earlier research are limited in number as well as dependent on previous software development phases. Dependency on previous-phase features is problematic for estimation, since it is difficult for developers to access any previous task data. To mitigate this issue, all features available at the time of task publishing are incorporated in this study to estimate effort. The extracted features span multiple categories, such as characteristics of the application being developed, the developer's working abilities, task outcomes, and size.
Descriptive statistics of all features and dataset normality are other important considerations, especially when ML-based estimation systems are developed, because estimation results may vary between normal and non-normal distributions of the dataset. Data normality measures are missing from most studies that incorporated ML algorithms, leaving the ML model difficult to interpret in terms of the statistical distribution of values. Likewise, no CSSD study reports which distribution its dataset follows and what impact that has on model performance. In this work, a detailed descriptive statistical analysis of the input features as well as the output variable is carried out alongside normality distribution checks, and we investigate whether the model remains significant after the dataset is transformed toward a normal distribution.
Analyzing the impact of each independent (input) variable on the dependent (output) variable, and examining the worth of keeping a given feature in the dataset, is necessary for building an estimation model. For that reason, model significance and detailed multiple regression analysis are performed to justify the importance of each extracted feature and to scrutinize which features need to be controlled to minimize effort. Along with this, multicollinearity among input features, as well as correlation between inputs and output, is determined to confirm that the dataset is well suited to ML analysis for the forthcoming effort predictions.
To verify estimation stability and to avoid the biases associated with any individual approach, ML techniques from all major families are explored. The ML families included in the study are the linear/non-linear regression family, decision tree family, SVM family, neural network family, and heterogeneous and homogeneous ensemble families. Two techniques from each family are taken to establish effort prediction models.
This section presents the methodology used to build TopCoder Effort Estimation Model for predicting effort consumed on a CSSD task. Figure 2 shows the proposed framework, which consists of three main stages:
Basic structure of proposed framework.
Data Acquisition Stage (Section 4.1): describes the details of the extracted data; the features that form the dataset and their relevance to effort are elaborated.
Data Preparation Stage (Section 4.2): contains the detailed analysis performed on the dataset to shape it into a form suitable for feeding to ML algorithms. The data is investigated in terms of statistical analysis (Section 4.2.1), normality checking (Section 4.2.2), regression analysis (Section 4.2.4), and correlation analysis (Section 4.2.5).
Effort Prediction Stage (Section 4.3): presents effort estimation modeling with ML techniques, to identify the best performing ML algorithm for the TopCoder dataset.
Data on 1500 TopCoder development tasks are extracted using the Octoparse crawler.
The following criteria are applied for task selection: the dataset contains TopCoder development tasks completed within the past year, i.e., 2020 to 2021. All tasks are completed, with a submission score above 60. For the sake of simplicity, only tasks with one winner are selected (i.e., the award money is given to the first winning submission). Tasks with source code information (a repository or Git patch) are selected. After applying the selection criteria, 1200 tasks remain in the dataset.
The dataset contains a total of 15 input features (also referred to as "effort drivers") and 1 output feature, i.e., Effort. The input features are grouped into 5 categories, depending on the data they represent. Each feature holds numerical or ordinal values. The ordinal levels assigned to features are "very high", "high", "nominal", "low", and "very low", and each level is given a numerical value from 1 (very low) to 5 (very high) to support the ML estimation models established later in the study. Table 2 lists all input features along with their names, categories, abbreviations, and data types, described further as follows:
The Development Characteristics (DVCH) category shows what type of development a task requires.
The first feature of this category, APPT, shows the type of application under development. The application types found for most tasks in the extracted dataset are desktop, mobile, and web applications, APIs, frontend (UI/UX/CX), and frontend-backend integration patches. Higher ordinal levels are assigned to more complex applications, i.e., desktop and mobile applications (dashboards, management information systems), since these involve large backend integration requirements [58]. Nominal to low ordinal levels are assigned to small-scale websites, APIs, and frontend applications (UI/UX/CX), since they require fewer lines of code and mostly involve the design side of a software application [16, 54]. The next feature, DEVT, defines whether the task involves developing an entirely new application or adjusting/updating a previous task. Higher ordinal values are assigned to "from-scratch" application development, while update tasks are given lower ordinal values.
The Personnel Metrics (PERS) category illustrates aspects of a developer's skill, reflecting their development ability.
The first feature of this category, win percentage (WINP), determines a developer's winning tendency, i.e., the trend of the developer winning submitted tasks. A higher WINP reflects good analytical capability, which in turn affects the required effort, i.e., less effort is required from a developer with a higher WINP.
Descriptive statistics of TopCoder dataset effort drivers
SBRT shows the tendency of a developer to submit tasks for which he/she has registered. The numerical values of both WINP and SBRT are extracted directly from the developer's profile. A higher SBRT shows the developer's continuity in completing tasks and enhanced development capabilities.
EXPR defines the developer's experience in performing tasks of the TopCoder development category. Numerical values are assigned as the duration (in years) between the first and the latest task the developer submitted in the development category.
RATN shows the quality of work produced by the developer in previous tasks. The developer's rating (RATN) is extracted directly from the profile.
TSCR is the technology/language/tool score of the developer. The developer's skills (listed in the profile and verified by the TopCoder platform) are extracted and matched with the skills required for the task, and a score is assigned. For instance, if a task needs 3 languages/tools/technologies and the developer has previously worked with 2 of them, then TSCR will be 2/3 (0.67), as sketched below.
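A small R sketch of this ratio computation follows; the skill vectors are hypothetical illustrations, not actual TopCoder profile data.

```r
# Hypothetical TSCR computation: share of required skills the developer
# has already worked with (skill names are illustrative).
tscr <- function(required_skills, developer_skills) {
  matched <- sum(required_skills %in% developer_skills)
  round(matched / length(required_skills), 2)
}

tscr(c("Java", "Angular", "PostgreSQL"),
     c("Java", "PostgreSQL", "Python"))
# 0.67 -- the 2-out-of-3 case described in the text
```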
NSUB is the total number of submissions made by the developer, which reflects the developer's ability to analyze the current task and the thoroughness of the implemented code.
The Task Complexity (TCPX) category shows the complexity imposed on the task by programming languages, tools, and technology.
NoLN is the number of languages required to develop a particular task.
PRLN is assessed by the type/complexity of the programming languages required for development. Complexity is based on the language generation, built-in libraries, and automated function calls the language offers. Newer-generation languages generally provide off-the-shelf libraries, which improves development efficiency and subsequently decreases effort.
TOOL represents the tool/technology needed for task development. Like PRLN, modern tools provide built-in functionality with service invocation via interfaces and are less code-oriented, hence reducing development effort. The values of both features are assigned based on the complexity and ease of development they offer, i.e., recent-generation programming languages (PRLN) and tools/technologies (TOOL) are given higher ordinal levels.
CNST refers to the constraints imposed on the development or deployment process, such as design, platform, and environment constraints. Extensive measures are needed if code is built in a constrained environment, which results in increased effort.
The Artifact Complexity (ACPX) category defines what kinds of provided and required artifacts are involved in a task and their complexity level.
ASST, the first feature of this category, reflects the level of facilitation provided by the documents/artifacts posted with the task. In the TopCoder data, commonly found assets are design files, prototype links, branch codes, micro frames, and repos (containing code for update or bug-fixing tasks), which the client (TR) provides with the task. Higher ordinal levels are assigned to artifacts with a greater helping level.
DLVB evaluates the overhead created by the deliverables required at the end of the task. Beyond source code, additional deliverables demand extra effort. In the extracted data, the deliverables found are git patches, bug reports, verification videos/documents, and database schemas. Higher values are assigned when a larger number of deliverables must be produced by the developer.
The Size category determines task size. Software lines of code (LOC) are taken as the size measure for effort calculation, for which numerical values of task size are obtained. A larger size reflects greater task effort.
Effort is the dependent feature of the dataset, measured based on the size and completion time of a development task. Since a challenge/task posted on the crowdsourced platform is open around the clock until the deadline, and developers work across different time zones, the unit selected for Effort in this study is person-hours (PH).
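As a small illustration of the encoding described in this section, the following R sketch maps the ordinal driver levels onto the 1-5 numeric scale used for modeling; the example values are hypothetical, not actual TopCoder records.

```r
# Map the five ordinal levels to the numeric codes used for modeling.
ordinal_levels <- c("very low" = 1, "low" = 2, "nominal" = 3,
                    "high" = 4, "very high" = 5)
encode_ordinal <- function(x) unname(ordinal_levels[tolower(x)])

# Hypothetical APPT values for three tasks
encode_ordinal(c("High", "Nominal", "Very Low"))  # 4 3 1
```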
After the data is extracted, statistical and multiple regression analysis is performed next to shape the data for the effort estimation model. Details are presented in the remainder of this section.
Statistical analysis of TopCoder dataset
This stage starts with descriptive statistical analysis of the input effort drivers. Table 3 presents the statistical measures for all independent attributes (effort drivers) in the dataset. For ordinal features, the statistics indicate which values are most prevalent in the dataset. For instance, we can conclude from Table 3 that the average task belongs to the "nominal" level of application type (APPT), i.e., most tasks require developing a WebApp (backend) or an application programming interface (API). For the DEVT feature, most tasks belong to the "nominal" level, as represented by the mean, since most published tasks required an update to a previous project. Coming to the Personnel Metrics (PERS) category, the average developer in the dataset has a 39% win percentage (WINP), while the highest WINP is 100, showing that some developers have won every task they registered for. Likewise for SBRT, a maximum 100% submission rate is found, reflecting that, at the maximum, developers submitted all tasks they registered for, while a 50% submission rate is found on average. For the RATN feature, the maximum rating achieved by a developer is 1972, while the average rating is 1467. Developers have an average EXPR of almost 2 years, and the most experienced developers have 9 years of experience. The maximum technology expertise (TSCR) of a developer is 1, indicating the developer has worked in the past with all the skills required by the task. For the TCPX measures, the average task needs development using only 1 language, but at most, some tasks may require 4 languages. The programming languages used (PRLN) in tasks are mostly at the "nominal" complexity level, indicating that most tasks require third-generation (3G) languages.
Descriptive statistics of TopCoder output variable (effort)
After statistically analyzing the input effort drivers, the statistical details of the output variable, i.e., Effort, are reported in Table 3. Effort is measured in person-hours (PH). The recorded effort lies between 8.2 and 5333.7 PH (with an average of 801.5 PH). The skewness and kurtosis measures (Table 3) verify that the collected data is positively (right) skewed (i.e., skewness > 0).
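As a sketch of how such distribution measures can be reproduced in R (the paper reports them in Table 3), the snippet below uses the e1071 package on a placeholder right-skewed effort vector; the simulated values are not the study's data.

```r
library(e1071)  # provides skewness() and kurtosis()

set.seed(42)
effort <- rlnorm(1200, meanlog = 5, sdlog = 1)  # placeholder PH values

summary(effort)    # min / mean / max, analogous to Table 3
skewness(effort)   # > 0 indicates a right-skewed distribution
kurtosis(effort)   # tail heaviness relative to a normal distribution
```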
Effort distribution in TopCoder dataset
Effort distribution for TopCoder dataset.
The normality of the effort values is verified using the Shapiro-Wilk test [50] at the chosen significance level. Since the test rejects normality, the effort variable is transformed toward a normal distribution; the resulting distribution is shown in Fig. 4.
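A minimal R sketch of this check follows, reusing the placeholder `effort` vector from the previous snippet; the paper does not state which transformation it applied, so the log transform here is only an assumed illustration.

```r
sw <- shapiro.test(effort)      # Shapiro-Wilk normality test
if (sw$p.value < 0.05) {        # normality rejected at the 0.05 level
  effort_t <- log(effort)       # assumed transformation toward normality
  print(shapiro.test(effort_t)) # re-check on the transformed values
}
```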
Effort distribution after transformation.
Statistical inference on the collected TopCoder dataset starts with the overall model significance test (the regression F-test), under the following hypotheses:
Null hypothesis (H0): there is no predictive relationship between the effort drivers and "Effort" in the population, i.e., the model with no independent variables (intercept-only) fits the data as well as the model with all effort drivers.
Alternative hypothesis (H1): at least one effort driver fits the data better than the intercept-only model.
In our model, the test rejects H0 at the chosen significance level, confirming that the regression model with the effort drivers is statistically significant (see the model significance analysis reported below).
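A condensed R sketch of this model-significance check is given below; the data frame is synthetic, with only a handful of stand-in drivers rather than the study's 15, so the numbers it produces are illustrative only.

```r
set.seed(1)
n <- 200
topcoder <- data.frame(APPT = sample(1:5, n, replace = TRUE),
                       DEVT = sample(1:5, n, replace = TRUE),
                       WINP = runif(n, 0, 100),
                       SIZE = rpois(n, 400))
topcoder$effort <- 50 + 70 * topcoder$APPT + 130 * topcoder$DEVT -
  2 * topcoder$WINP + 0.5 * topcoder$SIZE + rnorm(n, sd = 100)

fit <- lm(effort ~ ., data = topcoder)
summary(fit)  # overall F-statistic and p-value answer H0/H1 above;
              # the per-coefficient t-tests match the hypotheses below
```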
Model significance analysis
Diagnostic plot of residuals with transformed effort.
This section explains the regression details between the dependent and independent variables. After confirming the overall statistical significance of the model, we proceeded with statistical inference using t-tests on the individual regression coefficients.
Regression analysis details between effort and effort drivers
Table 6 presents complete regression details for all input features, where the standard error (with 95% CI) shows the coefficient's variation under repeated sampling. Next, each coefficient is tested under the following hypotheses:
Null hypothesis (H0): the effort driver (with regression coefficient βi) has no effect on Effort, i.e., βi = 0.
Alternative hypothesis (H1): the effort driver (with regression coefficient βi) has a statistically significant effect on Effort, i.e., βi ≠ 0.
As is clear from Table 6, for all effort drivers the p-values fall below the significance level; hence H0 is rejected and each effort driver contributes significantly to the model.
In the DVCH category, the regression coefficient for the DEVT feature (129.02) is higher than that of the other feature in the same category, APPT (70.61), indicating that development type has more impact on increasing effort. Further, all else being equal, a task developing a new application adds 129 PH on average compared to a task requiring just an update of a previous application (as shown by the DEVT feature). For APPT, tasks involving Android/iOS development, for instance, add an average of 71 PH compared to a task requiring frontend development (or any relatively simpler application type). For the PERS category, the most impactful regression coefficient is that of WINP, consistent with the expectation (Section 4.1) that developers with higher win percentages require less effort.
In addition to the descriptive, statistical, and regression analysis of the dataset, this stage of the proposed framework applies Pearson's correlation [55], referred to as linear correlation. Multicollinearity problems arise when input features are too similar to each other and hold a degree of mutual relationship; this raises the challenge of explanatory (input) variables having similar effects, so the model cannot identify the impact of an individual input variable on the dependent variable. A correlation coefficient close to zero implies no relationship between two features, while values closer to 1 or −1 imply a strong positive or negative relationship, respectively.
Pearson coefficient multicollinearity on TopCoder features.
The extent of the multicollinearity is examined in Fig. 6, which presents the Pearson correlation coefficients among all input features. Small to moderate amounts of multicollinearity are seen among the TopCoder dataset features, which is usually not a problem [71]. Both the negative and the positive Pearson correlation values in the analysis (Figs. 7 and 8) remain within this small-to-moderate range, so no feature pair exhibits problematic collinearity.
Positive Pearson correlation in TopCoder feature.
Negative Pearson correlation in TopCoder features.
The correlation between the dependent and independent features is analyzed next, again using Pearson correlation at the chosen significance level; the input features show sufficient correlation with Effort to justify their use in the estimation model.
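Both correlation checks can be sketched in R as below, reusing the synthetic `topcoder` frame from the model-significance snippet; with real data the input would be the 15 drivers of Table 2.

```r
drivers <- subset(topcoder, select = -effort)

cor(drivers, method = "pearson")                   # feature-vs-feature,
                                                   # cf. Figs. 6-8
cor(drivers, topcoder$effort, method = "pearson")  # feature-vs-Effort
```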
Effort estimation techniques used in study
After analyzing and preparing the TopCoder dataset, model training to estimate development effort comes next. For effort prediction, 10 ML families are considered, with two algorithms per family taking part in the experimentation. Table 7 shows each ML family, its two respective algorithms, and the parameter settings used to train each model in this study. The parameters of the selected models are tuned using grid search (GS) [51]. The selected ML algorithms have also produced considerable results in the SDEE literature, and previous CSSD work incorporated some of them for estimation (Table 1). ML techniques shown in bold (Table 7, Column 3) were used by previous CSSD work, while the non-highlighted techniques come from the traditional SDEE literature. For the HT ensemble, the best performing algorithm from each family is selected and combined into an ensemble. The idea behind choosing these algorithms is to incorporate estimation techniques from different flavors of the ML domain and to broaden the spectrum of experimental evaluation. Moreover, it is established in the previous literature (Section 2.1) [3, 78] that NNet, SVM, decision trees (RF and CART), and linear regression are frequently used for SDEE; NNet and SVM also outperformed in most studies. Within the NNet family, MLPNN and RBFNN are widely utilized versions of neural networks. Ensemble models, on the other hand, have also been shown to generate better estimation accuracy by avoiding the bias created by individual models [66, 74]. Besides these, we also incorporated comparatively less renowned ML models (ZeroR, OneR, Naive Bayes) to give a better judgement about the working of ML techniques on our dataset.
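As an illustration of the grid search tuning step, the sketch below tunes one of the listed models (SVM with an RBF kernel via caret's "svmRadial") on the synthetic frame defined earlier; the grid values are assumptions, not the actual settings of Table 7.

```r
library(caret)    # train() with grid search over tuneGrid
library(kernlab)  # backend for method = "svmRadial"

grid <- expand.grid(C = c(0.1, 1, 10), sigma = c(0.01, 0.1))
svm_fit <- train(effort ~ ., data = topcoder,
                 method = "svmRadial",
                 tuneGrid = grid,
                 trControl = trainControl(method = "cv", number = 10))
svm_fit$bestTune  # parameter pair selected by the grid search
```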
Experimental setup
In order to assess the effectiveness of each estimation model, we applied 10-fold cross-validation (10-fold CV) after randomly dividing the dataset into a 70-30 train-test ratio. All experiments and statistical evaluation are performed in RStudio using the R programming language. The performance evaluation metrics (Section 5) are applied to each model. All experiments are executed for 500 repetitions to get unbiased results from each algorithm, and the mean and standard deviation (SD) of the error are calculated. Besides this, we performed a statistical test, Welch's t-test, to compare the models.
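A compact R sketch of this evaluation protocol is shown below, using a plain linear model on the synthetic frame and 50 repetitions instead of 500 to keep the run short; everything else mirrors the described split-and-evaluate loop.

```r
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))

errs <- replicate(50, {                                  # paper: 500 reps
  idx <- sample(nrow(topcoder), 0.7 * nrow(topcoder))    # 70-30 split
  fit <- lm(effort ~ ., data = topcoder[idx, ])
  rmse(topcoder$effort[-idx],
       predict(fit, newdata = topcoder[-idx, ]))
})
c(mean = mean(errs), sd = sd(errs))                      # reported per model
```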
The Welch's t-statistic used for these comparisons is

t = (X̄₁ − X̄₂) / √(s₁²/n₁ + s₂²/n₂),

where X̄ᵢ, sᵢ², and nᵢ denote the sample mean, variance, and size of group i, respectively.
Although the evaluation metrics populate performance results for each ML model, to ensure that the models' results are statistically significant [21] we used Welch's t-test, under the following hypotheses:
Null hypothesis (H0): there is no significant difference between the error means of the two models being compared.
Alternative hypothesis (H1): there is a significant difference between the error means of the two models being compared.
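A minimal R sketch of this comparison follows; `errs_a` and `errs_b` stand in for the per-repetition error vectors of two models (the simulated values are hypothetical).

```r
set.seed(7)
errs_a <- rnorm(500, mean = 0.10, sd = 0.02)  # e.g., model A RMSEs
errs_b <- rnorm(500, mean = 0.12, sd = 0.03)  # e.g., model B RMSEs

# t.test() performs Welch's t-test by default (var.equal = FALSE)
t.test(errs_a, errs_b, var.equal = FALSE)
# p-value < 0.05 -> reject H0: the two error means differ significantly
```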
This section discusses the experimental results after applying the ML algorithms mentioned in Section 4.3 to the collected TopCoder data.
Results of performance metrics evaluation
For all experiments, the mean and standard deviation (SD) of each evaluation metric (Section 5) are calculated over the 500 iterations of each experiment. Table 8 shows the resulting means and SDs of all performance metrics for the 20 trained models.
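For reference, hedged R helpers for the MRE-based metrics follow; the definitions are the standard SDEE formulations, which the paper's Section 5 is assumed to use, and the sample vectors are illustrative.

```r
mre    <- function(a, p) abs(a - p) / a          # magnitude of rel. error
mmre   <- function(a, p) mean(mre(a, p))         # mean MRE
mdmre  <- function(a, p) median(mre(a, p))       # median MRE
pred25 <- function(a, p) mean(mre(a, p) <= 0.25) # share within 25% error

actual <- c(100, 200, 300); predicted <- c(110, 150, 290)
c(MMRE   = mmre(actual, predicted),
  MdMRE  = mdmre(actual, predicted),
  Pred25 = pred25(actual, predicted))
```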
Results of ML algorithms applied to TopCoder dataset
RMSE boxplot of ML models.
Analyzing the results obtained in Table 8, it is possible to conclude that the estimates obtained from the regression family (both linear and non-linear), the NNet family, and the SVM family yield the smallest errors on most metrics. Figure 9 shows the boxplot of each model's RMSE, from which it can be seen that models such as LASSO, BayesLR, SVM, HT, the DT families, and En_wtAvg do not exhibit outliers. It can also be observed that the rule-based models overestimate, while LASSO, logReg, kNN5, MLP, and the SVM family show relatively better performance. Besides, the boxplots of LASSO and logReg are less distorted than those of the other models, while MultiNB shows a larger number of estimation outliers, pushing its RMSE mean toward the higher end.
As mentioned earlier, Welch's t-test is applied to verify whether the best performing models differ from one another in a statistically significant way.
Models are considered significantly different if the p-value for the difference between their error means is below 0.05. The plot clearly depicts that LASSO and logReg give significantly better performance compared to MLPNN, since the corresponding p-values fall below this threshold.
Results of Welch's t-test for model significance
Besides the four best performing models, Welch's t-test comparisons are also extended to the remaining models.
This study builds on the premise that the effort consumed on a crowdsourced task is a more appropriate criterion for task selection than mere intuition. The work established a wide range of TopCoder dataset features covering the major crowdsourced development aspects needed to estimate effort. Most datasets extracted in CSSD work have not undergone any analysis to ensure that the attribute selection is justified in terms of statistical, regression, and correlation measures. Section 4.2 clearly shows that the attributes selected for the dataset possess model significance and regression and correlation relationships with the output feature, i.e., "Effort", making the dataset more reliable for ML experimentation. Performing regression and correlation analysis further revealed which features to focus on for manipulating effort: in our analysis, development type, the assets available for the task, and the developer's enhanced skill set are the factors that influence task effort the most.
The features identified for this study are independent of previous phases, i.e., features related to software requirement specification and design (as included in previous studies [49]), for example, the winner's submission score attained in the design phase, the number of registrants and submissions in the design phase, or the number of pages in the requirement/component specification. Instead, the focus is on readily available features relevant to task characteristics, the developer's skill set, and task artifacts.
Catering to the matter of estimation accuracy, intelligent methods are known for their accurate and unbiased results, and taking the opinion of multiple ML models is another way to improve effort estimation accuracy. There is a trend of utilizing ML for estimation in CSSD work (Table 1), but the number of models used for estimation has not been sufficient. This work incorporated 10 ML families with 2 algorithms each, hence a total of 20 models are trained on the extracted TopCoder dataset (as listed in Table 7). Further, most past ML-based CSSD studies relied only on linear, tree-based, or neural network-based models. This work shows that some algorithms that were not part of any estimation in previous CSSD work also provide considerable performance, such as Bayesian linear regression, polynomial regression, Bernoulli Naïve Bayes, bagging, and gradient boosting machines, with RMSE values of 0.186, 0.436, 0.360, 0.099, and 0.091, respectively.
Comparing our work with the initial study on CSSD cost estimation by Mao et al. [49], the ML models employed there achieved the following MMRE values: SVMR (0.46), CART (0.216), linear regression (0.279), logistic regression (0.350), NNet (0.324), and kNN (0.397). The corresponding models in our study achieved MMRE values of: svmRadial (0.045), svmPoly (0.050), CART (0.074), logReg (0.048), MLPNN (0.052), RBFNN (0.085), kNN5 (0.060), and kNN10 (0.063). It is evident from these results that all models in this study achieved over 60% better MMRE, which clearly reflects the careful and relevant feature selection in the dataset.
Threats to validity
This section presents threats to the validity of the proposed study. For internal validity, several aspects need to be discussed: the ML techniques used to construct the estimation model, the choice of data transformation technique, and the parameter setting methods are common threats to the internal validity of ML-based effort estimation.
For this study, a large variety of ML estimation techniques are applied to identify a relatively unbiased technique that works for the TopCoder CSSD platform. The ML techniques from previous effort estimation work (Table 7) are also part of this study; however, there remains a margin for testing other ML learners to verify whether performance improves. Further, the ML learners are implemented under grid search-based parameter settings rather than default values, providing more context-oriented results. However, grid search involves heavier computation than other automated parameter tuning techniques, e.g., swarm intelligence or evolutionary methods, so computation-effective solutions could be incorporated. Another internal validity threat arises when effort drivers are only indirectly related to software effort. In this work, the input features included in the dataset possess sufficient Pearson correlation with the dependent feature (effort), so the dataset's suitability for supervised ML-based effort prediction is ensured.
External validity is determined by the extent to which the study's conclusions can be generalized. The ML algorithms and ensembles used in the study are implemented with parameter values coming from grid search tuning; for other datasets, additional parameter tuning may be required to obtain an optimum prediction model, after which similar conclusions can be drawn. In our work, the effort estimation scheme is primarily designed for crowdsourced tasks structured in the format followed by the TopCoder platform. The dataset obtained for implementing the proposed scheme is platform-dependent, i.e., the features selected for each crowdsourced task can be found in tasks published on TopCoder. Execution of the proposed scheme can be generalized to any other CSSD platform that uses a task posting mechanism similar to TopCoder's. Moreover, this study mainly focuses on tasks from the TopCoder development category, so the features selected for estimating effort relate to development/coding perspectives; for other TopCoder task categories (design, testing), a different feature set is needed.
Conclusion validity is achieved by measuring the degree to which the study's conclusions are reasonable. We used the 10-fold CV method to confirm the ability of the ML models to predict new data. Furthermore, all experiments are executed with 500 repetitions to have enough iterations to draw conclusions. This procedure is adequate to prevent biased results and minimizes training sample dependence. Similarly, these repeated tests are carried out for all performance metrics, ensuring that no single evaluation metric can bias the conclusions.
Conclusion and future work
Effort estimation measures are crucial for crowdsourced tasks on a well-established CSSD platform like TopCoder, which has thousands of clients and developers. Effort estimation can serve as the basis for the award money decision associated with each task. In this work, an effort estimation framework is proposed for the TopCoder CSSD platform, facilitating both client and developer by analyzing a task in terms of effort.
The proposed framework includes three stages. Data on TopCoder development tasks are extracted, including highly development-oriented features (data acquisition stage). The data is prepared in terms of normality, statistical, regression, and correlation analysis (data preparation stage), and 20 ML techniques from 10 families are applied to identify the best performing effort estimation model (effort prediction stage).
As future work, we aim to analyze other task categories on the TopCoder platform, i.e., software design and QA, to identify appropriate features relevant to software design and testing aspects. Furthermore, CSSD platforms apart from TopCoder will be explored for dataset preparation and for establishing effort estimation models, to verify the generality of the proposed scheme.
