Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Clustering of Scientific Information for Optoelectronics Material GaN V.GOUNTSIDOU, H.M.POLATOGLOU Department of Physics Aristotle University of Thessaloniki 54006 Thessaloniki GREECE Abstract: We examined the feasibility of applying data mining tools to process the huge amount of data concerning materials, their properties and processes. The need for such analysis is even greater today since the variety and possible combinations of materials increase very fast. As an example in the present work we focused on an important optoelectronics material, namely GaN, and by performing clustering of the data we found that the cases do cluster and have also pinpointed some interesting exceptions. Key-Words: ABSTRACT, GAN, ARTICLE, JOURNAL, DATA MINING, TEMPERATURE, SUBSTRATE, MEASUREMENT, MATERIAL. 1 Introduction Materials are an integral part of civilization. Today the systematic and intense study of their properties and the exploitation of novel preparation and processing methods have revolutionized almost every aspect of our lives [1-3]. Optoelectronics, for example, which enable fast and reliable communications and data storage devices, is a branch in which materials and their preparation play a significant role. A key material for optoelectronics today is GaN [4-5]. GaN and its alloys has become a key compound in the development of high density CD devices, and blue and consequently white Light Emitting Diodes. An enormous amount of data, have been published related to this compound (1747 papers from 1975 until now) and handling it is not an easy task. In addition a lot of important information may be hidden. Data mining is a discipline that provides the tools to analyze and extract hidden information in large amounts of data [6-8]. Nowadays research and industry have invested a lot in developing the science of materials. It is very important to study the properties and applications of materials, discover their interactions, map unknown pathways, etc. While a lot of work has been done applying data mining tools in biochemistry for example, very little is done concerning the materials. One notable case is the building of an electronic data base of materials properties and the appropriate software to select materials for a given engineering task [9-10]. The purpose of the present work is, focusing on recent and current scientific data from electronic sources, to examine the feasibility, the best ways to present the information concerning materials and produce examples for the particular case of GaN. 2 Method Scientific information is normally coded in the form of published articles which are widely distributed and can be easily accessed especially in recent years where the web is available to almost everyone. Parts of the published paper such as the title, authors and abstract are as a rule freely available in electronic form. Difficulties arise since there is no standard way in which the scientific knowledge contained in a published paper is summarized in an abstract, as different authors use different ways. Therefore we have selected a large number of published papers and studied thoroughly the body of the paper and the abstract to establish their relation. Scientific data contained in published works are not always suitable coded for machine processing and not every piece of it is of equal importance or relevance. Therefore the fist step is to find which pieces of information are suitable and afterwards to use an appropriate coding suitable for machine processing. These two steps are not independent and some iterative optimization is required in order to fix both. From our involvement in research in this area, and the selected published papers, we have 1 produced a large set of properties, data, and important scientific information. We used this big set of keys on a larger set of abstracts in order to check the consistency and any redundancies. Furthermore, we have reduced the set to a smaller set. The next step is to choose the digital presentation of the data. We decided to use the . plain text format which is standard and almost all data mining programs support it. We have tried some of the available programs in order to access their utility and check that the different programs gave compatible results, thus assuring that the coding/selection of properties was independent from the program. 3. Results. The preparation of GaN is done using techniques like Molecular Beam Epitaxy or MOCVD and the important properties are electrical, optical, structural etc. Therefore we Categories Preparation Structure chose the following categories of information: preparation, chemical composition, measured physical properties. Those categories have the subcategories that appear in Table I. Table I Categories and Subcategories for GaN Subcategories MBE, MOCVD, NCPP, PLD, CHVPE, Thermal Evaporation, LP-MOVPE Hexagonal, Quantum wells Physical Measurements TEM, CCM, Nuclear Microscopy, X-ray Diffraction, Admittance Spectroscopy, XRXRD, AFM ,SEM, PL, DCRC, CL, HRTEM, XRD, SAED, EDS ,XPS, Secondary Ion Mass Spectroscopy Substrate sapphire, Si(111), GaAs, Si(100), porous GaAs(001), glass We worked in a group which has included papers, downloaded from the world wide web. We studied 13 features in every paper. These features were: temperature, substrate (sapphire or Si), methods of preparation (molecular beam epitaxy, metallorganic chemical vapor deposition, vapor phase epitaxy ) measurements (atomic force microscopy, transmission electron microscopy, scanning electron microscopy, X-ray diffraction, photoluminescence) and structure (hexagonal and cubic). The method which the program has used is the method of Repeated Bisections. In this method the desired 10-way clustering solution is computed by performing 9 repeated bisections. This means that the matrix is first clustered into two groups, . then one of these groups is selected and bisected further. This is continuoued until the desired number of clusters is found. During each step, the cluster is bisected so that the resulting 2-way clustering solution optimizes a particular clustering criterion function which computes the similarity between objects using the cosine function. This is the default setting. The set of the descriptive features is determined by selecting the columns that contribute the most to the average similarity between the objects of each cluster. On the other hand the set of discriminating features is determined by selecting the columns that are more prevelant in the cluster compared to the rest of the objects 2 Figure 1 As can be seen from figure 1 the mountain visualization represents these 10 clusters as ten peaks labeled by their cluster id. The height of each peak of the Gausian curve is propotional to the cluster’s internal similarity. The volume of a peak . A/A 1 2 3 4 5 6 7 8 9 10 CLUSTER NUMBER 0 1 2 3 4 5 6 7 8 9 is proportional to the number of elements contained within the cluster. The color of a peak is proportional to the cluster’s internal deviation whereas blue indicates high deviation Table II NUMBER OF PAPERS 5 7 7 7 10 9 15 10 19 11 We can conclude from the above Table II that the cluster with number 0 may be taken as the FEATURES TEM, Hexagonal, Si Sapphire, SEM,VPE Hexagonal, Temperature, AFM Si, MBE, AFM Sapphire, mocvd, TEM Temperature, mocvd, TEM TEM, XRD, PL Sapphire, AFM, PL Temperature, TEM, XRD Sapphire, MBE, VPE exception since the combination of the features provided here is not often found in the papers in the 3 web. On the other hand cluster number 8 is the one which includes the most papers. That means the combination of methods TEM and XRD in a given temperature is the most commonly found. 5. Conclusion We examined the feasibility of applying data mining tools to process the huge amount of data concerning materials, their properties and processes. The need for such analysis is even greater today since the variety and possible combinations of the materials increase very fast. As an example in the present work we focused on an important optoelectronics material, namely GaN, and by performing clustering of the data we found that the cases do cluster and have also pinpointed some interesting exceptions. . As it has already happened in Biology there is a great need of creating online public databases in material science where all the given information will be selected. Studying these data thoroughly correlations will be discovered and with the correct use of data mining tools clusters of large data sets will be formed. Since there is a great need of new materials, predictions will also be made based on the given data. 6. References [1].Wu HM, Song YL, Wen YQ, Zhao T, Gao HJ, Jiang L. Progress in materials and technologies for ultrahigh density data storage. Progress in natural science, 13 (4):247-253 APR 2003 [2] Pearton SJ, Abernathy CR, Norton DP, Hebard AF, Park YD, Boatner LA, Budai JD, Advances in wide bandgap materials for semiconductor spintronics. Materials Science &Engineering R-Reports, 40(4):137-168 FEB 28 2003 [3] Murty BS, Datta MK, Pabi SK, Structure and thermal stability of nanocrystalline materials. SadhanaAcademy Proceedings in Engineering Sciences, 28:23-45 Part 1-2FEB-APR 2003 [4] Ponce F, Nakamura S, Chichibou S.F., Introduction to Nitride Semiconductor Blue Lasers and Light emitting Diodes, Taylor &Francis, 2000,p105 [5] Pearton SJ, Abernathy CR, Norton DP, Hebard AF, Park YD, Boatner LA, Budai JD, Advances in wide bandgap materials for semiconductor spintronics. Materials Science &Engineering R-Reports, 40(4):137-168 FEB 28 2003 [6].Fayyad UM, Smyth P. Cataloging and mining massive datasets for science data analysis, Journal of computational and graphical statistics, 8 (3):589-610 SEP 1999[ [7].Wang X,Wang JTL, Shasha D, Shapiro BA, Rigoutsos I, Zhang KZ., Finding patterns in three-dimensional graphs: Algorithms and applications to scientific data mining, IEEE Transactions on knowledge and data engineering, 14 (4): 731-749 JUL-AUG 2002 [8] Bader GD, Donaldson I, Wolting C, Ouellete BFF, Pawson T, Hogue CWV, BIND-The Biomolecular Interaction Network Database, Nucleic acids research, 29 (1):242-245 JAN 1 2001 [9] Ashby MF, Materials Selection and Process in Mechanical Design ,Butterworth Heinemann,Oxford,1999 [10] Ashby MF, Cebon D, Case studies in Materials Selection.First Edition, Granta Design,Cambridge,1996,Second Edition,Butterworth-Heinmann,Oxford,1999 [11]. S.Curtarolo, D.Morgan, K. Persson, J.Rodgers, and G. Ceder, “Predicting Crystal Structures with Data Mining of Quantum Calculations”, Phys. Rev. Lett. 91,135503 (2003) 4