Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Research Issues for Large Scale Digital Library Search Engines in the Cloud: CiteSeerX or Why consider CiteSeerX as a Cloud Testbed C. Lee Giles, Pradeep Teregowda, Bhuvan Urgaonkar Pennsylvania State University University Park, PA Data Varies with Discipline or Small vs Big Science Small vs Big science “Data from Big Science is … easier to handle, understand and archive. Small Science is horribly heterogeneous and far more vast. In time Small Science will generate 2-3 times more data than Big Science.” ‘Lost in a Sea of Science Data’ S.Carlson, The Chronicle of Higher Education (23/06/2006) Data is local Data will not be shared At some point there will be needed “local” clouds If you can’t move the data around, •Bandwidth of a van loaded with disks take the analysis/cloud to the data! Do all/most data manipulations locally clouds for digital libraries/search engines Several features attractive for information retrieval systems such as digital libraries and search engines (storage and fast access) Flexibility/growth • Components such as crawlers, web interfaces, etc. can utilize resources on demand. Management • Utilizing cloud services potentially requires less investment in hardware and maintenance. • By deploying across sites (or adopting solutions distribution services provided by vendors), systems are potentially more stable. Reliability What about CiteSeerX? SeerSuite - x CiteSeer SeerSuite Framework for digital libraries Flexible, scalable, robust, portable, state of the art machine learning extractors, open source. Easy to create instances of SeerSuite both production and research grade: CiteSeerx: computer science ChemXSeer: chemistry ArchSeer: archaeology CollabSeer EnronSeer YouSeer Facilitates research http://citeseerx.ist.psu.edu How CiteSeerX is like a specialty search engine CiteSeerX shares several components with digital libraries and search engines Web Interface • Both digital libraries and search engines provide interfaces to users to interact with the application • Focused crawling of the web for scholarly/academic documents • Both digital libraries and search engines utilize an inverted index to provide efficient and fast access to users through search. • Digital libraries maintain extensive metadata usually in relational databases. Crawlers Index Databases CiteSeerX as a testbed Digital Libraries continue to grow and be widely used Cyberinfrastructure for scientists and academics Google Scholar is very popular & to some invaluable Publisher collections: ACM portal, Scopus, etc.; Library of Congress (NDLP) DLs are usually poorly supported and have few monetization models CiteSeerX is a digital library and a search engine Features of CiteSeerX Automatic acquisition of new documents by focused web crawling (1.5M documents, 20M citations – 2TB, 1-4 M authors) (data regularly shared by rsync), 24/7 service Interface for search widely used (2 M hits/day, 200K queries/day) Full text indexing Autonomous citation indexing, linking documents through citations. Automatic metadata extraction for each document. MyCiteSeer for personalization. New features in development, e.g. Table extraction and search Algorithm extraction and search Commercial grade open source code and data shared 4 systems: • Production • Crawling • Staging • Research All or some can be cloudized Collection of Research Issues Hosting cloud CiteSeerX instances Economic issues Cost of hosting Cost of refactoring the source to be hosted in the cloud. Computational/technical issues What workflow to cloudize Component modification for efficient operation VM size: storage, memory and CPU sizing as a function of needs Establishing computational needs and availability clusters Appropriate load balancing across multiple sites. Security of data stored including metadata and user data. Policy issues Privacy of user data Copyright issues. CiteSeerX Architecture USENIX ‘10 Web Application Focused Crawler Document Conversion and Extraction Document Ingestion Data Storage Maintenance Services Federated Services CiteSeerX data transfer Nodes for hosting IEEE Cloud ‘10 Hosting models for DLs Component hosting USENIX Hotcloud ‘10, IEEE Cloud ‘10 SeerSuite is modular by design andHot architecture; host individual components across available infrastructure. Content hosting CiteSeerx provides access to document metadata, copies and application content Host parts or complete set. Peak load loading Support the application during peak loads Support growth of traffic. Focus on actual public cloud costs Google APP, EC2 estimates Component Hosting Expense of hosting the whole of CiteSeerx maybe prohibitive. Solution: Host a component or service i.e., Component/service code Data on which the component acts Interfaces, etc. associated with the component Goal: Identify optimal subset/components based on: Service growth Service usage New services Component Hosting - Costs Component Amazon EC2 Initial Google App Engine Monthly Initial Costs Monthly Costs Web Services 0 1448.18 0 942.53 Repository 0 1000 163.8 593.21 Database 0 858.89 12 348.05 Index 0 527.08 3.1 83.48 Extraction 0 499.02 0 90.6 Crawler 0 513.4 0 105 Least expensive option - host the index for cases. Most expensive - host web services. Component Hosting – Lessons Learned Hosting components is reasonable Having a service oriented architecture helps Amazon EC2 Computation costs dominate. Google App Engine Refactoring costs ? Refactoring required not just for components, but other services. Storage and transfer costs maybe optimized A study of data transfer in the application gives insights to costs. Approach suitable for meeting fixed budgets How many components of an application can be hosted for a fixed budget. Content Hosting – Lessons Learned Hosting specific content relevant to peak load scenarios Easy to do – minimal refactoring required, affects a minimal set of components (presentation layer). More complex scenarios need to be examined Hosting papers from the repository Hosting shards of the index Database Peak Load – Lessons Learned Hosting only during peak load conditions is economically feasible. Growth potential Can be used to handle growth in traffic, instead of procuring new hardware. Hosting a specific component under stress; such as a database In such a case it will cost 400$ to host the database in Amazon EC2. Research Directions • • Similar to many discussed at this work shop applied to DLs Explore policies for hosting based on Privacy/Security Integrity/reliability QoS; $ Costs Local vs public Architecture redesign utilizing cloud primitives and systems spanning multiple sites Queues, Key Stores, Clusters Optimization of existing features for automated VM-ing Conclusions Advantages of cloudizing CiteSeerx Reliability, maintenance, potential costs savings Different costs of hosting for all or parts of Components Content Peak load CiteSeerX working system – testbed? Data, storage, access, databases Growth, evolving features, users $ savings depend on continued support; working local system may still be needed Archival issues/ Google Scholar NSF cloud focus? Besides what has been proposed at this workshop Clouds for science – what industry will not or can not support Both for big and small science Clouds for the “new” sciences – social, political, historical,… that have growing amounts of data Also focus on data: Data rules, Without data, there is no science. Future Work Cost of refactoring – particularly for Google App Engine. Cost comparisons for other cloud offerings – Azure, Eucalyptus. Privacy and user issues – myCiteSeer and private clouds. Technical issues with cross hosting – load balancing, latency needed to be addressed. Virtualization in SeerSuite, components built with cloud hosting in mind (Federated Services). References GROSSMAN , R., AND G U , Y. Data mining using high performance data clouds: Experimental studies using sector and sphere. In Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (2008), ACM,pp. 920–927. MIKA , P., AND T UMMARELLO , G. Web semantics in the clouds. IEEE Intelligent Systems 23, 5 (2008), 82–87. NURMI , D., WOLSKI , R., G RZEGORCZYK , C., O BERTELLI , G., S OMAN , S., YOUSEFF , L., AND ZAGORODNOV, D. The eucalyptus open-source cloud-computing system. In Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid-Volume 00 (2009), IEEE Computer Society, pp. 124–131. SINGH , A., S RIVATSA , M., AND L IU , L. Search-as-a-service: Outsourced search over outsourced storage. ACM Trans. Web 3, 4 (2009), 1–33. TEREGOWDA , P. URGAONKAR , B., AND GILES , C. Cloud computing: A digital libraries perspective. In 3rd IEEE 2010 International Conference on Cloud Computing (2010). TEREGOWDA , P. B., COUNCILL , I. G., FERNANDEZ , J. P. R., KASBHA , M., ZHENG , S., AND GILES , L. C. Seersuite: Developing a scalable and reliable application framework for building digital libraries by crawling the web. In USENIX Conference on Web Application Development (2010). P.B Teregowda, B. Urgaonkar, C.L. Giles, "Cost Implications Of Moving To The Cloud: A Digital Libraries Perspective," 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '10), 2010. VAN DE SOMPEL , H., NELSON , M., LAGOZE , C., AND WARNER , S. Resource harvesting within the OAI-PMH framework. D-Lib Magazine 10, 12 (2004), 1082–9873. WALKER , E., BRISKEN , W., AND ROMNEY, J. To lease or not to lease from storage clouds. Computer 43 (2010), 44– 50. WEIGEL , F., PANDA , B., RIEDEWALD , M., GEHRKE , J., AND CALIMLIM , M. Large-scale collaborative analysis and extraction of web data. Proc. VLDB Endow. 1, 2 (2008), 1476–1479. WOOD , T., CECCHET, E., RAMAKRISHNANY, K., SHENOY, P., VAN DER MERWEY, J., AND V ENKATARAMANI , A. Disaster recovery as a cloud service: Economic benefits & deployment challenges. In 2nd USENIX Workshop on Hot Topics in Cloud Computing (2010).