Clustering of Scientific Information for Optoelectronics Material GaN
V.GOUNTSIDOU, H.M.POLATOGLOU
Department of Physics
Aristotle University of Thessaloniki
54006 Thessaloniki
GREECE
Abstract: We examined the feasibility of applying data mining tools to process the huge amount of data
concerning materials, their properties and processes. The need for such analysis is even greater today since the
variety and possible combinations of materials increase very fast. As an example in the present work we
focused on an important optoelectronics material, namely GaN, and by performing clustering of the data we
found that the cases do cluster and have also pinpointed some interesting exceptions.
Key-Words: ABSTRACT, GAN, ARTICLE, JOURNAL, DATA MINING, TEMPERATURE, SUBSTRATE,
MEASUREMENT, MATERIAL.
1 Introduction
Materials are an integral part of civilization.
Today the systematic and intense study of their
properties and the exploitation of novel preparation
and processing methods have revolutionized almost
every aspect of our lives [1-3]. Optoelectronics, for
example, which enables fast and reliable
communications and data storage devices, is a
branch in which materials and their preparation play
a significant role. A key material for optoelectronics
today is GaN [4-5].
GaN and its alloys have become key
compounds in the development of high density CD
devices, and of blue and consequently white Light
Emitting Diodes. An enormous amount of data
has been published on this compound (1747
papers from 1975 until now), and handling it is not
an easy task. In addition, a lot of important
information may be hidden.
Data mining is a discipline that provides the
tools to analyze and extract hidden information in
large amounts of data [6-8]. Nowadays research and
industry have invested a lot in developing the
science of materials. It is very important to study the
properties and applications of materials, discover
their interactions, map unknown pathways, etc.
While a lot of work has been done applying data
mining tools in biochemistry, for example, very little
has been done concerning materials. One notable case
is the building of an electronic database of materials
properties and the appropriate software to select
materials for a given engineering task [9-10].
The purpose of the present work is to examine,
using recent scientific data from electronic
sources, the feasibility of this approach and the best
ways to present information concerning materials,
and to produce examples for the particular case of GaN.
2 Method
Scientific information is normally coded in
the form of published articles, which are widely
distributed and can be easily accessed, especially in
recent years now that the web is available to almost
everyone. Parts of the published paper such as the
title, authors and abstract are as a rule freely
available in electronic form. Difficulties arise since
there is no standard way in which the scientific
knowledge contained in a published paper is
summarized in an abstract, as different authors use
different ways. Therefore we have selected a large
number of published papers and studied thoroughly
the body of the paper and the abstract to establish
their relation.
Scientific data contained in published
works are not always suitably coded for machine
processing, and not every piece of it is of equal
importance or relevance. Therefore the first step is
to find which pieces of information are suitable and
afterwards to choose an appropriate coding for
machine processing. These two steps are not
independent and some iterative optimization is
required in order to fix both.
From our involvement in research in this
area, and from the selected published papers, we have
produced a large set of properties, data, and
important scientific information. We applied this large
set of keys to a larger set of abstracts in order to
check for consistency and redundancies, and then
reduced it to a smaller set.
The next step is to choose the digital
representation of the data. We decided to use the
plain text format, which is standard and supported by
almost all data mining programs. We have tried
some of the available programs in order to assess
their utility and to check that the different programs
gave compatible results, thus assuring that the
coding/selection of properties was independent
of the program.
3 Results
The preparation of GaN is done using
techniques like Molecular Beam Epitaxy or
MOCVD and the important properties are
electrical, optical, structural etc. Therefore we
chose the following categories of information:
preparation, chemical composition, measured
physical properties. Those categories have the
subcategories that appear in Table I.
Table I
Categories and Subcategories for GaN
Categories             Subcategories
Preparation            MBE, MOCVD, NCPP, PLD, CHVPE, Thermal Evaporation, LP-MOVPE
Structure              Hexagonal, Quantum wells
Physical Measurements  TEM, CCM, Nuclear Microscopy, X-ray Diffraction, Admittance Spectroscopy,
                       XRXRD, AFM, SEM, PL, DCRC, CL, HRTEM, XRD, SAED, EDS, XPS,
                       Secondary Ion Mass Spectroscopy
Substrate              sapphire, Si(111), GaAs, Si(100), porous GaAs(001), glass
We worked with a group of papers
downloaded from the World Wide Web. We
studied 13 features in every paper. These features
were: temperature, substrate (sapphire or Si),
method of preparation (molecular beam epitaxy,
metalorganic chemical vapor deposition, vapor
phase epitaxy), measurements (atomic force
microscopy, transmission electron microscopy,
scanning electron microscopy, X-ray diffraction,
photoluminescence) and structure (hexagonal and
cubic).
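One plausible way to code these 13 features for machine processing is a binary row per paper, with one column per feature. The sketch below assumes such a presence/absence coding; the column order, the function name encode and the two example papers are hypothetical and are not taken from the data set used in this work.

```python
# The 13 features named in the text, in a fixed (assumed) column order.
FEATURES = [
    "temperature",
    "sapphire", "Si",
    "MBE", "MOCVD", "VPE",
    "AFM", "TEM", "SEM", "XRD", "PL",
    "hexagonal", "cubic",
]


def encode(paper_features):
    """Return a 0/1 row with one entry per feature in FEATURES."""
    unknown = set(paper_features) - set(FEATURES)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    return [1 if name in paper_features else 0 for name in FEATURES]


# Hypothetical papers, invented for illustration only.
papers = {
    "paper_a": {"sapphire", "MOCVD", "TEM"},
    "paper_b": {"Si", "MBE", "AFM", "hexagonal"},
}

# Print the matrix as plain text, one paper per line.
for name, feats in papers.items():
    print(name, " ".join(str(x) for x in encode(feats)))
```

Each printed line is one row of a paper-by-feature matrix in plain text, which is the kind of input a clustering program can read.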
The method used by the program is
the method of Repeated Bisections. In this method
the desired 10-way clustering solution is computed
by performing 9 repeated bisections. This means
that the matrix is first clustered into two groups,
then one of these groups is selected and bisected
further. This is continued until the desired
number of clusters is reached. During each step, the
cluster is bisected so that the resulting 2-way
clustering solution optimizes a particular clustering
criterion function, which computes the similarity
between objects using the cosine function. This is
the default setting. The set of descriptive
features is determined by selecting the columns that
contribute the most to the average similarity
between the objects of each cluster. The set of
discriminating features, on the other hand, is
determined by selecting the columns that are more
prevalent in the cluster compared to the rest of the
objects.
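The repeated bisection procedure described above can be illustrated with a minimal sketch. It is not the implementation of the program used in this work. Several choices are assumptions made only for illustration: each bisection is done with a simple two-centroid refinement on cosine similarity, the two seeds are the least similar pair of objects, and the largest remaining cluster is the one selected for the next bisection. All function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals the cosine."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def centroid(rows):
    """Unit-length direction of the summed vectors."""
    return normalize([sum(col) for col in zip(*rows)])


def bisect(rows, idx, iters=20):
    """Split the objects listed in idx into two groups by cosine similarity."""
    # Assumed seeding: start from the least similar pair of objects.
    a, b = min(((i, j) for i in idx for j in idx if i < j),
               key=lambda p: dot(rows[p[0]], rows[p[1]]))
    c1, c2 = rows[a], rows[b]
    left, right, previous = [], [], None
    for _ in range(iters):
        left = [i for i in idx if dot(rows[i], c1) >= dot(rows[i], c2)]
        right = [i for i in idx if i not in left]
        if not left or not right:
            # Degenerate case (e.g. identical objects): split arbitrarily.
            half = len(idx) // 2
            return idx[:half], idx[half:]
        if left == previous:
            break  # assignments no longer change
        previous = left
        c1 = centroid([rows[i] for i in left])
        c2 = centroid([rows[i] for i in right])
    return left, right


def repeated_bisections(data, k):
    """Compute a k-way clustering with at most k - 1 bisections."""
    rows = [normalize(r) for r in data]
    clusters = [list(range(len(rows)))]
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) > 1]
        if not candidates:
            break
        # Assumption: the largest cluster is selected for the next split.
        target = max(candidates, key=len)
        clusters.remove(target)
        clusters.extend(bisect(rows, target))
    return clusters


# Toy matrix invented for illustration: three obvious pairs of objects.
data = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
print(sorted(repeated_bisections(data, 3)))
```

With k = 3 the sketch performs two bisections, in the same way that the 10-way solution in the text requires 9.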
Figure 1
As can be seen from Figure 1, the mountain
visualization represents these 10 clusters as ten
peaks labeled by their cluster id. The height of each
peak of the Gaussian curve is proportional to the
cluster’s internal similarity. The volume of a peak
is proportional to the number of elements contained
within the cluster. The color of a peak reflects
the cluster’s internal deviation, with blue
indicating high deviation.
Table II
A/A  CLUSTER NUMBER  NUMBER OF PAPERS  FEATURES
1    0               5                 TEM, Hexagonal, Si
2    1               7                 Sapphire, SEM, VPE
3    2               7                 Hexagonal, Temperature, AFM
4    3               7                 Si, MBE, AFM
5    4               10                Sapphire, MOCVD, TEM
6    5               9                 Temperature, MOCVD, TEM
7    6               15                TEM, XRD, PL
8    7               10                Sapphire, AFM, PL
9    8               19                Temperature, TEM, XRD
10   9               11                Sapphire, MBE, VPE
We can conclude from Table II that
cluster number 0 may be taken as the
exception, since the combination of features
it contains is not often found in the papers on the
web. On the other hand, cluster number 8 is the one
which includes the most papers. This means that the
combination of the methods TEM and XRD at a given
temperature is the most commonly found.
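The reading of Table II given above can be checked with a short tally. The sketch below simply transcribes the table and reports the total number of papers together with the smallest and largest clusters; the variable names are hypothetical.

```python
# (cluster number, number of papers, features) transcribed from Table II.
table_ii = [
    (0, 5, ["TEM", "Hexagonal", "Si"]),
    (1, 7, ["Sapphire", "SEM", "VPE"]),
    (2, 7, ["Hexagonal", "Temperature", "AFM"]),
    (3, 7, ["Si", "MBE", "AFM"]),
    (4, 10, ["Sapphire", "MOCVD", "TEM"]),
    (5, 9, ["Temperature", "MOCVD", "TEM"]),
    (6, 15, ["TEM", "XRD", "PL"]),
    (7, 10, ["Sapphire", "AFM", "PL"]),
    (8, 19, ["Temperature", "TEM", "XRD"]),
    (9, 11, ["Sapphire", "MBE", "VPE"]),
]

total = sum(count for _, count, _ in table_ii)
smallest = min(table_ii, key=lambda row: row[1])
largest = max(table_ii, key=lambda row: row[1])

print("papers in all clusters:", total)
print("smallest cluster:", smallest[0], smallest[2])
print("largest cluster:", largest[0], largest[2])
```

The smallest cluster is number 0 and the largest is number 8, in agreement with the discussion of Table II.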
5 Conclusion
We examined the feasibility of applying
data mining tools to process the huge amount of
data concerning materials, their properties and
processes. The need for such analysis is even
greater today since the variety and possible
combinations of the materials increase very fast. As
an example in the present work we focused on an
important optoelectronics material, namely GaN,
and by performing clustering of the data we found
that the cases do cluster and have also pinpointed
some interesting exceptions.
As has already happened in biology, there
is a great need to create online public databases
in materials science where all the available information
will be collected. By studying these data thoroughly,
correlations can be discovered, and with the correct
use of data mining tools large data sets can be
clustered. Since there is a great need for new
materials, predictions can also be made based on
the available data.
6 References
[1] Wu HM, Song YL, Wen YQ, Zhao T, Gao HJ, Jiang L, Progress in materials and technologies for ultrahigh
density data storage, Progress in Natural Science, 13(4):247-253, APR 2003
[2] Pearton SJ, Abernathy CR, Norton DP, Hebard AF, Park YD, Boatner LA, Budai JD, Advances in wide
bandgap materials for semiconductor spintronics, Materials Science & Engineering R-Reports, 40(4):137-168,
FEB 28 2003
[3] Murty BS, Datta MK, Pabi SK, Structure and thermal stability of nanocrystalline materials, Sadhana-Academy
Proceedings in Engineering Sciences, 28:23-45, Part 1-2, FEB-APR 2003
[4] Ponce F, Nakamura S, Chichibou S.F., Introduction to Nitride Semiconductor Blue Lasers and Light
Emitting Diodes, Taylor & Francis, 2000, p105
[5] Pearton SJ, Abernathy CR, Norton DP, Hebard AF, Park YD, Boatner LA, Budai JD, Advances in wide
bandgap materials for semiconductor spintronics, Materials Science & Engineering R-Reports, 40(4):137-168,
FEB 28 2003
[6] Fayyad UM, Smyth P, Cataloging and mining massive datasets for science data analysis, Journal of
Computational and Graphical Statistics, 8(3):589-610, SEP 1999
[7] Wang X, Wang JTL, Shasha D, Shapiro BA, Rigoutsos I, Zhang KZ, Finding patterns in three-dimensional
graphs: Algorithms and applications to scientific data mining, IEEE Transactions on Knowledge and Data
Engineering, 14(4):731-749, JUL-AUG 2002
[8] Bader GD, Donaldson I, Wolting C, Ouellete BFF, Pawson T, Hogue CWV, BIND-The Biomolecular
Interaction Network Database, Nucleic Acids Research, 29(1):242-245, JAN 1 2001
[9] Ashby MF, Materials Selection and Process in Mechanical Design, Butterworth Heinemann, Oxford, 1999
[10] Ashby MF, Cebon D, Case Studies in Materials Selection, First Edition, Granta
Design, Cambridge, 1996, Second Edition, Butterworth-Heinemann, Oxford, 1999
[11] S. Curtarolo, D. Morgan, K. Persson, J. Rodgers, and G. Ceder, “Predicting Crystal Structures with Data
Mining of Quantum Calculations”, Phys. Rev. Lett. 91, 135503 (2003)