Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based Greg Madey Computer Science & Engineering University of Notre Dame UIUC - NSF Workshop on Continuous (Re)Design of Open Source Software University of Illinois, Urbana-Champaign October 8-9, 2003 This research was partially supported by the US National Science Foundation, CISE/IISDigital Society & Technology, under Grant No. 0222829 Contributors • Vincent Freeh, Computer Science, North Carolina State University (Principal Investigator) • Yongqin Gao, Computer Science and Engineering, University of Notre Dame (Graduate Student) • Jeff Goett, University of Notre Dame (REU Student) • Chris Hoffman, University of Notre Dame (REU Student) • Nadir Kiyanclar, University of Notre Dame (REU Student) • Greg Madey, Computer Science & Engineering, University of Notre Dame (Principal Investigator) • Patrick McGovern, Director SourceForge.net, VA Software (Industrial Collaborator) • Carlos Siu, University of Notre Dame (REU Student) • Renee Tynan, Department of Management, College of Business, University of Notre Dame (Principal Investigator) • Jin Xu, Computer Science & Engineering, University of Notre Dame (Graduate Student) Outline • Research approach • Tools and definitions: Agents, models, simulations, collaborative social networks, computer experiments • Data collection and analysis • Example research question • Simulation • Computer experiments • Results One Approach to Researching F/OSSD • Online data – Screen scraping – Database dumps • Modeling – Social network theory – Evolutionary assumptions • Simulation – Verification and validation – Computer experiments • Variation of Classical Scientific Method Classical Scientific Method 1. Observe the world a) Identify a puzzling phenomenon 2. Generate a falsifiable hypothesis (K. Popper) 3. Design and conduct an experiment with the goal of disproving the hypothesis a) If the experiment “fails”, then the hypothesis is accepted (until replaced) b) If the experiment “succeeds”, then reject hypothesis, but additional insight into the phenomenon may be obtained and steps 2-3 repeated The Computer Experiment Agent-Based Simulation as a Component of the Scientific Method Modeling (Hypothesis) Observation Agent -Based Simulation (Experiment) Agent-Based Simulation as a Component of the Scientific Method Modeling (Hypothesis) Social Network Model of F/OSS Observation Analysis of SourceForge Data Agent -Based Simulation (Experiment) Grow Artificial SourceForge Agent-Based Modeling and Simulation • Conceptual models of a phenomenon • Simulations are computer implementations of the conceptual models • Agents in models and simulations are distinct entities (instantiated objects) – Tend to be simple, but with large numbers of them (thousands, or more) - i.e., swarm intelligence – Contrasted with higher level AI “intelligent agents” • Foundations in complexity theory – Self-organization – Emergence Collaborative Social Networks • Research-paper co-authorship, small world phenomenon, e.g., Erdos number (Barabasi 2001, Newman 2001) • Movie actors, small world phenomenon, e.g., Kevin Bacon number (Watts 1999, 2003) • • • • Interlocking corporate directorships Terrorist Networks Open-source software developers (Madey et al, AMCIS 2002) Collaborators are nodes in a graph, and collaborative relationship are the edges of the graph => a framework to model data/phenomenon SourceForge • VA Software • Part of OSDN • Started 12/1999 • Collaboration tools • 70,000 Projects • 90,000 Developers • 700,00 Registered Users Savannah • SourceForge Software? • Free Software Foundation •1,600 Projects •16,000 Registered Users Observations • Web mining • Web crawler (scripts) – – – – • • • • • • Python Perl AWK Sed Monthly Since Jan 2001 ProjectID DeveloperID Almost 2 million records Relational database PROJ|DEVELOPER 8001|dev378 8001|dev8975 8001|dev9972 8002|dev27650 8005|dev31351 8006|dev12509 8007|dev19395 8007|dev4622 8007|dev35611 8008|dev8975 Collaboration Networks Adapted from Newman, Strogatz and Watts, 2001 F/OSS Developers - Collaboration Social Network Developers are nodes / Projects are links 24 Developers 5 Projects 2 Linchpin Developers 1 Cluster Project 7597 dev[64] dev[72] dev[67] Project 6882 Project 7028 dev[52] dev[65] dev[70] dev[57] 6882 dev[47] dev[52] 6882 dev[47] dev[55] dev[47] dev[99] dev[55] dev[51] 6882 dev[47]6882 dev[58] dev[79] dev[47] 7597 dev[46] 7597 dev[46]dev[64] 7597 dev[46] dev[72] dev[67] 7597 dev[46] dev[55] 7597 dev[46] 7028 dev[46] dev[70] 7597 dev[46] 7028 dev[46] dev[57] dev[45] dev[99] 7597 dev[46] 7028 dev[46] dev[61] dev[51] 7597 dev[46] dev[58] dev[46] 15850 dev[46] dev[58] dev[79] dev[58] 9859 dev[46] dev[54] dev[45] dev[61] dev[58] dev[54] 9859 dev[46] 9859 dev[46] dev[49] dev[53] 9859 dev[46] 15850 dev[46] dev[59] dev[56] 15850 dev[46] dev[83] 15850 dev[46] dev[48] dev[49] dev[53] dev[56] dev[83] dev[59] Project 9859 Project 15850 dev[48] Topological Analysis of the Data • Statistics inspected – – – – – – – Diameter Average degree Clustering coefficient Degree distribution Cluster size distribution Relative size of major cluster Fitness and life cycle • Evolution of these statistics • Dual networks – developer network and project network Terminology • • • • • • • Diameter – Average length of shortest paths between all pairs of vertices Degree – The count of edges connected to given vertex Average degree – Average of the degrees of all vertices in the network Cluster – The connected components of the network Clustering coefficient (CC) – CCi: Fraction representing the number of links actually present relative to the total possible number of links among the vertices in its neighborhood. – CC: average of all CCi in a network Degree distribution – The distribution of degrees throughout a network Major cluster – The largest cluster in the network Degree Distribution: Developers Degree Distribution: Projects Diameter of Developer Network vs. Time • Network size increased from 30,000 to 70,000 Diameter of Project Network vs. Time • Network size increased from 20,000 to 50,000. • Diameter decreasing with time both for developer network and project network Clustering Coefficient of Developer Network vs. Time Clustering Coefficient of Project Network vs. Time Cluster Size Distribution • R2 with major cluster is 0.7426 • R2 without major cluster is 0.9799 Relative Size of Major Cluster vs. Time • Increase of the relative size of the major cluster • Approaching steady-state? An Example Research Question • What processes can explain the evolution of the project and developer social networks? – Randomly growing network (Erdos-Reyni, 1960)? – Evolving network with preferential attachment (Barabasi-Albert, 1999)? – Evolving network with preferential attachment and fitness (Barabasi-Albert, 2001)? – Others? Computer Experiments • Agent-based simulations • Java programs using Swarm class library – Validation (docking) exercises using Java/Repast • Grow artificial SourceForge’s (Epstein & Axtell, 1996) – Parameterized with observed data, e.g., developer behaviors • Join rates • New project additions • Leave projects – Evaluation of multiple models (hypotheses) – Verification/validation Cycles of Modeling & Simulation Modeling (Hypothesis) Social Network Models ER => BA => BA+Fitness => BA+Dynamic Fitness Agent -Based Simulation Observation Analysis of SourceForge Data Degree Distribution Average Degree Diameter Clustering Coefficient Cluster Size Distribution (Experiment) Grow Artificial SourceForge Model for SourceForge • ABM based on bipartite graph • Model description – – – – Agent: developer Behaviors: Create, join, abandon and idle Preference: developer’s and project’s Fitness • Four models in iterations – ER, BA, BA with constant fitness and BA with dynamic fitness • Comparison of empirical and simulated data ER Model – Degree Distribution • Degree distribution is normal distribution while it is power law in empirical data • Fit Fails! ER Model - Diameter • Average degree is decreasing while it is increasing in empirical data • Diameter is increasing while it is decreasing in empirical data • Fit Fails! ER Model – Clustering Coefficient • Clustering coefficient is relatively low under 0.3 while it is around 0.7 in empirical data. • Fit fails! ER Model – Cluster Size Distribution • Power law distribution with R2 as 0.6667 (0.9653 without the major cluster) while R2 in empirical data is 0.7426 (0.9799 without the major cluster) • The actual distribution is different from empirical data • Fit Fails! BA Model – Degree Distribution • Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data). • For developer distribution: simulated data has R2 as 0.9798 and empirical data has R2 as 0.9714. • For project distribution: simulated data has R2 as 0.6650 and empirical data has R2 as 0.9838. • Partial Fit! BA Model – Diameter and Clustering Coefficient • Small diameter and high clustering coefficient like empirical data • Diameter and clustering coefficient are both decreasing like empirical data • Good Fit! BA Model with Constant Fitness • Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data). • For developer distribution: simulated data has R2 as 0.9742 and empirical data has R2 as 0.9714. • For project distribution: simulated data has R2 as 0.7253 and empirical data has R2 as 0.9838. • Improved fit! Discovery: Project Life Cycle BA Model with Dynamic Fitness • Power laws in degree distribution, similar to empirical data (o for simulated data and x for empirical data). • For developer distribution: simulated data has R2 as 0.9695 and empirical data has R2 as 0.9714. • For project distribution: simulated data has R2 as 0.8051 and empirical data has R2 as 0.9838. • Somewhat better fit! Models of the F/OSS Social Network (Alternative Hypotheses) • General model features – Agents are nodes on a graph (developers or projects) – Behaviors: Create, join, abandon and idle – Edges are relationships (joint project participation) – Growth of network: random or types of preferential attachment, formation of clusters – Fitness – Network attributes: diameter, average degree, degree distribution, clustering coefficient • Four specific models – – – – ER (random graph) - (1960) BA (preferential attachment) - (1999) BA ( + constant fitness) - (2001) BA ( + dynamic fitness) - (2003) Summary Summary • Why Agent-Based Modeling and Simulation? – Can be used as components of the Scientific Method – A research approach for studying socio-technical systems • Case study: F/OSS - Collaboration Social Networks – SourceForge conceptual models: ER, BA, BA with constant fitness and BA with dynamic fitness. – Simulations • Computer experiments that tested conceptual models • Provided insight into the phenomenon under study and guided data mining of collected observations Questions • Validity of approaches – Social networks – Simulation • Value/Utility of approachs • Applicability to other areas of F/OSS research – Project sites, e.g., Mozilla.org – Individual projects, e.g., Linux kernel Thank you