How to Find Relevant Data for Effort Estimation?
毛可 (Mao Ke), 2012-03-28

Author
• Ekrem Kocaguneli ([email protected])
• Tim Menzies
• Specialties: data mining, effort estimation
• 11' TSE: Exploiting the Essential Assumptions of Analogy-Based Effort Estimation (TEAK)
• 11' TSE: On the Value of Ensemble Effort Estimation
• 11' ESEM: –
• 10' ASE: When to Use Data from Other Projects for Effort Estimation (short)
• Pre: Relevancy Filtering for Defect Estimation

Motivation (Why) - The Locality(1) Assumption
• Data divides best on one attribute:
  1. project type (e.g., embedded);
  2. development centers of the developers;
  3. development language;
  4. application type (MIS, GNC, etc.);
  5. targeted hardware platform;
  6. in-house vs. outsourced projects.
• If Locality(1) holds:
  – it is hard to use data across these boundaries;
  – models stay confined, so each company must collect its own local data.

Motivation (Why) - The Locality(N) Assumption
• Data divides best on a combination of attributes.
• If Locality(N) holds, it is easier to use data across these boundaries.

Work
• Cross-vs-within comparison, plus "relevancy filtering", for effort estimation:
  – cross data is as good as within data;
  – companies can use each other's data for their estimates,
  – provided they first apply "relevancy filtering".
• "Cross" performs the same as "local".

Technology (How)
• How to find relevant training data?
• Variance pruning.

Technology (How) - TEAK
• TEAK = ABE0 + instance selection
  – 11' TSE: Exploiting the Essential Assumptions of Analogy-Based Effort Estimation
• ABE0 = ABE (analogy-based estimation), version 0; the most commonly used setup:
  – numerics normalized to the 0-1 range;
  – Euclidean distance;
  – equal weight on all attributes;
  – return the median effort of the k nearest neighbors.
  (See the ABE0 code sketch below.)
• Instance selection: a smart way to adjust the training data.
• TEAK is a variance-based instance selector.
• It is built via GAC (greedy agglomerative clustering) trees, which are binary (exactly balanced only when the number of instances is even).
• TEAK is a two-pass system:
  – the first pass selects low-variance, relevant projects (instance selection);
  – the second pass retrieves the projects to estimate from (instance retrieval).
  (See the first-pass sketch below.)
• Variance pruning threshold: is it > 10% × max(σ²), or > (100% + 10%) × max(σ²)?
• TEAK finds the local regions that matter for estimating particular cases.
• It finds those regions via Locality(N), not Locality(1).
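The ABE0 definition in the Technology slides above is a complete recipe, so a minimal MATLAB sketch of it may help. This is my own illustration of the slide's definition, not the authors' code; the function name abe0 and all variable names are assumptions.

    % ABE0 sketch: normalize numerics to [0,1], use unweighted Euclidean
    % distance, and return the median effort of the k nearest projects.
    % Xtrain: n-by-d project features; ytrain: n-by-1 efforts; xtest: 1-by-d.
    function est = abe0(Xtrain, ytrain, xtest, k)
        lo = min(Xtrain, [], 1);
        hi = max(Xtrain, [], 1);
        span = max(hi - lo, eps);                     % guard constant columns
        Xn = bsxfun(@rdivide, bsxfun(@minus, Xtrain, lo), span);
        xn = (xtest - lo) ./ span;                    % same scaling for the test case
        d = sqrt(sum(bsxfun(@minus, Xn, xn).^2, 2));  % equal-weight Euclidean
        [~, idx] = sort(d);
        est = median(ytrain(idx(1:k)));               % median of the k neighbors
    end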
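And here is a sketch of TEAK's first pass as I read it from the slides: grow a GAC tree by greedily merging the nearest clusters, record the effort variance of every subtree, then keep only projects from low-variance subtrees. This is a simplified, O(n³) reconstruction, not the published implementation; teakPass1, alpha, and the "10% of max variance" reading of the pruning rule are my assumptions.

    % Grow a GAC tree bottom-up, recording the effort variance of every
    % internal node (subtree). Xn: normalized features; y: efforts.
    function keepIdx = teakPass1(Xn, y, alpha)
        clusters = num2cell((1:size(Xn,1))');   % leaves: one project index each
        nodes = {};  nodeVar = [];
        while numel(clusters) > 1
            bestD = inf;  best = [1 2];
            for i = 1:numel(clusters)-1         % find the closest centroid pair
                for j = i+1:numel(clusters)
                    d = norm(mean(Xn(clusters{i},:), 1) - mean(Xn(clusters{j},:), 1));
                    if d < bestD, bestD = d; best = [i j]; end
                end
            end
            merged = [clusters{best(1)}; clusters{best(2)}];
            nodes{end+1} = merged;              % new internal node of the tree
            nodeVar(end+1) = var(y(merged));    % its effort variance
            clusters(best) = [];                % replace the pair by their parent
            clusters{end+1} = merged;
        end
        % Variance pruning: keep projects that appear in at least one subtree
        % whose effort variance is at most alpha (e.g. 0.1) times the maximum.
        lowVar = nodes(nodeVar <= alpha * max(nodeVar));
        keepIdx = unique(vertcat(lowVar{:}));
    end

The second pass would then build a fresh GAC tree from keepIdx and retrieve analogies from its lowest-variance region, matching the slide's "instance retrieval" step.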
Experiments - Datasets
• Public availability, for reproducibility.
• Cross-within divisibility.
• 6 out of the 20+ datasets from PROMISE.

Experiments - Design
For dataset X with subsets X1, X2, X3:
• Within: run TEAK on X1, X2, X3 separately, with leave-one-out cross-validation (LOOCV).
• Cross: X1 as test, X2+X3 as training; … N-fold CV.
• Repeat 20 times: since TEAK is greedy, results vary with the input data order.
(A code sketch of this setup appears in the backup notes after the Q&A slide.)

Experiments - Evaluation
• Win-loss-tie counts (sketched in the backup notes).
• Mann-Whitney test (95%): tests whether the distributions of two populations differ significantly.

Experiment 1 - Performance Comparison
• MAR: Mean Absolute Residual.
• MdMRE: Median MRE (Magnitude of Relative Error).
• Analogy by 1 neighbor (PRED(25) > 0.3 on the C81 subsets):

    % Start from the nearest training case's effort, scaled by the size ratio,
    % then adjust by the ratio of each cost-driver value (test vs. train).
    for i = 1:numTestCases
        estimates(i) = effortTrain(nearestCase(i)) * sizeTest(i) / sizeTrain(nearestCase(i));
        for k = 1:numTestFactors
            estimates(i) = estimates(i) * cdTestReady(i,k) / cdTrainReady(nearestCase(i),k);
        end
    end

• Analogy by k neighbors: return the median effort of the k nearest neighbors, as in ABE0.

Experiment 2 - Retrieval Tendency
• Diagonal (WC, within-company) vs. off-diagonal (CC, cross-company) selection.
• Percentages sorted; percentiles of the diagonals and off-diagonals.

Conclusion
1. Cross performance is no worse than within performance.
2. The probability that the estimator retrieves a training instance from cross data is the same as from within data.
Implications:
• Companies can learn from each other's data.
• Locality(N) holds.
• Maybe there are general effects in SE:
  – effects that transcend the boundaries of one company;
  – local vs. global models…

Future Work
• Check external validity: after instance selection, does cross == within?
• Build more repositories: they are more useful than previously thought for effort estimation.
• Synonym discovery:
  – cross data can only be used if it has the same ontology;
  – auto-generate lexicons to map terms between data sets ("LOC" – "size", "product complexity").

Thanks! Q&A?
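Backup - Experimental Design Sketch
A minimal MATLAB sketch of the within-vs-cross setup from the Experiments slides. The estimator handle teak and the variable names (X1, eff1, X23, eff23) are placeholders of mine; the 20 shuffled repeats follow the slide's note that greedy TEAK depends on input order.

    % Within: leave-one-out cross-validation over subset X1.
    n1 = size(X1, 1);
    withinEst = zeros(n1, 1);
    for i = 1:n1
        tr = true(n1, 1);  tr(i) = false;        % hold out project i
        withinEst(i) = teak(X1(tr,:), eff1(tr), X1(i,:));
    end

    % Cross: test on X1, train on X2+X3 (rows stacked as X23); repeat 20
    % times with shuffled training order, since greedy TEAK is order-sensitive.
    crossEst = zeros(20, n1);
    for r = 1:20
        p = randperm(size(X23, 1));
        for i = 1:n1
            crossEst(r, i) = teak(X23(p,:), eff23(p), X1(i,:));
        end
    end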
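Backup - Win-Loss-Tie Sketch
A sketch of the win-loss-tie bookkeeping behind the evaluation slide: two error samples tie unless a Mann-Whitney U test (MATLAB's ranksum, from the Statistics Toolbox) finds a significant difference at the 95% level; otherwise the lower-error sample wins. The tie-unless-significant rule and the median comparison are my reading, not a quote of the paper's procedure.

    % errsA, errsB: vectors of errors (e.g., absolute residuals) from two methods.
    function [win, loss, tie] = winLossTie(errsA, errsB)
        p = ranksum(errsA, errsB);       % Mann-Whitney U / Wilcoxon rank-sum
        win = 0;  loss = 0;  tie = 0;
        if p >= 0.05
            tie = 1;                     % distributions not significantly different
        elseif median(errsA) < median(errsB)
            win = 1;                     % A has significantly lower error
        else
            loss = 1;
        end
    end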