Determining Trends in Seed Dispersal: A Pilot Study Using Data Mining Techniques
Selina A. Ruzi¹
¹Program in Ecology, Evolution, and Conservation Biology, UIUC, [email protected]
Results/Conclusions
Objectives
To determine whether data mining is a viable
technique for identifying trends in seed dispersal
research, particularly areas of recent interest
and knowledge gaps.
Methods
• The Scopus database was searched using the
following criteria:
  • Search term = “seed dispersal”
  • Document type = Article
  • Year published = 2016 or 2000
  • Subject area = Life sciences
• 163 abstracts were exported for the 2016
publishing year (including 1 from late 2015) and
170 abstracts for 2000
• Each line of the CSV file contained the abstract
of a different article
• Python was used to read the CSV file line by
line, extract the abstract, split the abstract into
words, and count how many times each search
word (Table 1) appeared in the abstract
• Word counts, author, title, year, and abstract were
exported to a newly created CSV file
• Word counts were processed in R
• Graphing and further analyses will be done
in R
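The reading, splitting, and counting steps listed above can be sketched in Python as follows. This is a minimal illustration, not the script used for the poster: the column names (Authors, Title, Year, Abstract), the function names, the made-up example row, and the short SEARCH_WORDS list are all assumptions. The csv module is used rather than raw line splitting so that commas inside quoted abstracts do not break a record.

```python
import csv
import io
import re
from collections import Counter

# Assumed subset of the Table 1 search words, lowercased for matching.
SEARCH_WORDS = ["ant", "ants", "formicidae", "distance", "distances"]
# Assumed column names for the exported file.
META_COLUMNS = ["Authors", "Title", "Year", "Abstract"]


def count_search_words(abstract, search_words=SEARCH_WORDS):
    """Return {word: count} for each search word present in the abstract."""
    # Lowercase, then split into words; hyphenated terms stay whole.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", abstract.lower())
    counts = Counter(tokens)
    return {w: counts[w] for w in search_words if counts[w] > 0}


def summarize(in_file, out_file):
    """Read abstracts from in_file and write metadata plus word counts."""
    reader = csv.DictReader(in_file)
    writer = csv.DictWriter(out_file, fieldnames=META_COLUMNS + SEARCH_WORDS)
    writer.writeheader()
    for row in reader:
        counts = count_search_words(row["Abstract"])
        out = {col: row.get(col, "") for col in META_COLUMNS}
        # Words absent from the abstract are written as 0.
        out.update({w: counts.get(w, 0) for w in SEARCH_WORDS})
        writer.writerow(out)


# Made-up, in-memory example row standing in for the real export.
source = io.StringIO(
    "Authors,Title,Year,Abstract\n"
    '"Doe J.",Example title,2016,"Ants move seeds, and ants eat elaiosomes."\n'
)
result = io.StringIO()
summarize(source, result)
print(result.getvalue())
```

Because hyphenated terms are kept as single tokens in this sketch, "dung-beetle" is counted under "dung-beetle" only and not also under "beetle"; a different tokenization would change the counts.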
Figure 1: The percentage of abstracts in which each
category appears, out of the abstracts sampled
for 2000 (170 total) or 2016 (163 total, including
one from late 2015).
• The “Fate” category appears in the most
abstracts in both publishing years: 29% of the
2016 abstracts and 26% of the 2000 abstracts
(Figure 1).
• The category whose share of abstracts
increased the most from 2000 to 2016 was the
“Tropics” category (Figure 2).
• The category whose share of abstracts
decreased the most from 2000 to 2016 was the
“Distances” category (Figure 2).
• A larger sample size and further analyses are
needed to establish whether data mining is
a valuable tool for determining trends in seed
dispersal research through time. However,
these methods can already identify initial
broad trends.
Figure 2: The change from 2000 to 2016 in the percentage of
abstracts in which each category appears. Blue indicates
categories that have increased in abundance and red indicates
categories that have decreased in abundance.
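The arithmetic behind Figures 1 and 2 can be sketched as below, assuming each percentage is the number of abstracts containing a category divided by the number of abstracts sampled that year, and each change is the 2016 percentage minus the 2000 percentage. The function names are illustrative, and the count of 47 abstracts is a hypothetical value chosen only to show the calculation; only the totals (163 and 170) and the reported 29% and 26% come from the poster.

```python
TOTAL_2016 = 163  # abstracts sampled for 2016 (from the poster)
TOTAL_2000 = 170  # abstracts sampled for 2000 (from the poster)


def percent_of_abstracts(n_with_category, n_total):
    """Percentage of sampled abstracts in which a category appears."""
    return 100.0 * n_with_category / n_total


def change_in_percent(pct_2016, pct_2000):
    """Change in percentage points; positive means an increase by 2016."""
    return pct_2016 - pct_2000


# Hypothetical count, used only to illustrate the percentage calculation.
example_pct = percent_of_abstracts(47, TOTAL_2016)
print(round(example_pct))

# Reported "Fate" percentages: 29% (2016) and 26% (2000).
print(change_in_percent(29, 26))
```

Under this reading, a positive change corresponds to a blue bar in Figure 2 and a negative change to a red bar.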
Category | Words Searched
Abiotic | Abiotic, Abiotic-factor, Abiotic-factors
Active | Active, Active-dispersal
Ant | Ant, Ants, Formicidae
Aril | Aril, Arils
Beetle | Beetle, Beetles, Dung-beetle, Dung-beetles
Biotic | Biotic, Biotic-factor, Biotic-factors
Bird | Bird, Birds, Avian
Chem | Chemical, Chemistry, Chemicals
Distance | Distance, Distances
Effect | Effectiveness
Elaiosome | Elaiosome, Elaiosomes
Fate | Fate, Destiny, Predation, Germination, Eaten, Loss, Predated
Germination | Germination, Germinated
Insect | Insect, Insects, Ant, Ants, Formicidae, Beetle, Beetles, Dung-beetle, Dung-beetles
Mammal | Mammal, Mammals, Rodent, Rodents
Myrmecochorous | Myrmecochore, Myrmecochores, Myrmecochorous
Non-myrmecochorous | Non-myrmecochore, Non-myrmecochores, Non-myrmecochorous
Passive | Passive, Passive-dispersal
Predation | Predation, Predated, Eaten
Primary | Primary, Primary-dispersal
Rodent | Rodent, Rodents
Secondary | Secondary, Secondary-dispersal
Temperate | Temperate
Tropics | Neotropics, Neotropic, Neotropical, Tropics, Tropical, Paleotropical, Paleotropics, Paleotropic
Table 1: Categories and the words searched within each to further subset
the abstracts. Words in black were searched in the Python script. Words in
red will be added to the Python script in future analyses along with other
categories.
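One way to represent Table 1 in code is a mapping from each category to its set of search words, with a category counted as present when any of its words occurs in an abstract. The sketch below assumes that rule and uses only three of the categories; the names CATEGORIES and categories_present are illustrative, and the black/red distinction between words already searched and words still to be added is not reproduced here.

```python
# Three categories from Table 1, with words lowercased for matching.
CATEGORIES = {
    "Ant": {"ant", "ants", "formicidae"},
    "Insect": {
        "insect", "insects", "ant", "ants", "formicidae",
        "beetle", "beetles", "dung-beetle", "dung-beetles",
    },
    "Tropics": {
        "neotropics", "neotropic", "neotropical", "tropics",
        "tropical", "paleotropical", "paleotropics", "paleotropic",
    },
}


def categories_present(tokens, categories=CATEGORIES):
    """Return the sorted names of categories with at least one word in tokens."""
    token_set = set(tokens)
    return sorted(
        name for name, words in categories.items() if token_set & words
    )


# Lowercased words from a made-up abstract fragment.
print(categories_present(["ants", "disperse", "seeds", "in", "tropical", "forest"]))
```

Because some words belong to more than one category (for example, "ants" appears under both Ant and Insect), a single abstract can count toward several categories, so the category percentages need not sum to 100%.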
Future Directions
• Include more articles from more years
• Determine a better way to conduct word counts
• Expand the search from abstracts alone to
other sections of the papers
• Make the Python and R code more efficient
• Fix bugs in the current code
Acknowledgements
Thanks to the instructors of the Focal Point: Data Science Across
Disciplines class. This work was part of a Focal Point grant funded by the
Graduate College at the University of Illinois at Urbana-Champaign.