CS 179 Database Project
Instructor: Dr Eamonn Keogh
Computer Science & Engineering Department
318 EBII
University of California - Riverside
Riverside, CA 92521
[email protected]
Class web page
www.cs.ucr.edu/~eamonn/cs179
Administration I
Class Meeting Times
Class Activities:
Discussion M 03:10 p.m. - 04:00 p.m. SPR 2339
LAB F 02:10 p.m. - 05:00 p.m. ENGR2 129
(first 15 minutes rule)
We will not meet every week. You are obliged to view the
class web page every Monday morning to check for
announcements.
You are 100% responsible for any announcements/changes I
might post to the web page.
Administration II
Presentation of Final Project:
You will need to give a short group presentation in the last two weeks
(details later).
You must show up to one of these final presentation sessions, or take
a failing grade (exception, you work out an alternate plan with me by
the end of week 4).
Note that the sessions may go very late! You must be prepared to
stay for the entire session. Sign-ups for time slots will be made
available on a first-come, first-served basis later in the quarter.
Administration III
Groups:
Groups may be of size 2 or 3.
At most one person in a group may be someone who did not get an A or B
in CS 166. (I may make exceptions if the numbers require it.)
If you need to be a “group” of one, talk to me after class.
You should take your responsibility to your group seriously.
In most cases I expect that everyone in the group will get the same
grade, but I reserve the right to give different grades where
warranted.
Administration IV
Grading:
• Project binder: 90%
• Presentation (including demonstration of project): 10%
Your project binder (exhaustive details in class handouts) is a document in
which you prove to me (or any reader) that you solved the problem
given to you using a good design process.
It must be in the format explicitly stated in the handouts.
Your presentation is your chance to review and highlight the quality
of your work.
Administration V
Office Hours:
I am normally in my office 6-7 days a week. You may visit me any
time.
If you wish to be 100% certain I am there you may make an
appointment by email with at least 24 hours' notice. (Note that if you
make an appointment, and then fail to keep it or show up late,
the grade for your entire group will suffer.)
If you email me, you must include “CS179” in the subject heading
and note your group name (e.g. CS179-smith-jones-zoe) in the body.
Administration VI
Important: If a member of your group commits an act of academic
dishonesty, all members of the group will receive a failing grade!
Don’t know the exact definition of academic dishonesty? It is your
job to find out! (This is true in general, not just for this class).
http://www.cs.ucr.edu/content/students/index.php?choice=academdis
http://cnas.ucr.edu/~cnas/student/dishonesty.pdf
Certain rules must be followed in this class. They are made clear in
the handouts. Follow them or get a written exception from me.
If you write
In order to handle spatial data efficiently, as required in computer
aided design, we decided to use an R-tree. We implemented it...
Everyone in your group gets a failing grade.
Instead you should write
It was noted by Guttman [12] that “In order to handle spatial data
efficiently, as required in computer aided…
XXX 2004: “Similarity matching is useful in two aspects. First, it is
a subroutine of many data mining tasks, such as
classification, clustering, rule discovery, outlier detection, and
query by contents. Second, it is important in its own right for
exploratory data analysis.” It is possible to convert the
subsequence matching problem into whole matching, by placing
a sliding window of size …. A time series of length N is by
definition a sequence of real numbers, and therefore can be
considered as a point in N-dimensional space. This
immediately suggests that … R-tree …. Since a time series
may contain thousands of points,.. This phenomenon is known
as the dimensionality curse problem, and in order to utilize the
powers of SAMs we need to first perform dimensionality
reduction.
.. three steps:
• Establish a distance measure Dist_true for the raw data series. In
this thesis, we focus on Euclidean distance Dist_true.
• Produce a feature extraction function F that reduces the
dimensionality of the data from the original length N to n that can
be handled by an appropriate index structure.
• Establish a distance measure Dist_feature in the feature space (of n
dimensions).
The first dimensionality reduction technique proposed for indexing
time series in the literature is to use the Discrete Fourier
Transform. The basic idea is that
any realistic signal can be characterized by the superposition of a
finite number of sine/cosine waves, each of which is
represented by a single complex number known as a Fourier
coefficient. … and many Fourier coefficients have a very low
amplitude and therefore can be discarded without much loss
of information….
Keogh 2000: “Similarity search is useful in its own right as a tool for
exploratory data analysis, and it is also an important
subroutine of many data mining applications such as clustering,
classification and mining of association rules.”
Keogh 2000: “it is possible to convert subsequence matching to whole
matching by sliding a "window" of length n….” “A time series C
= {c1…cn} with n datapoints can be considered as a point in
n-dimensional space. This immediately suggests that time series
could be indexed by multidimensional index structure such as the
R-tree and its many variants. Since realistic queries typically
contain 20 to 1,000 datapoints (i.e. n varies from 20 to 1000) and
most multidimensional index structures have poor performance at
dimensionalities greater than 8-12 [12], we need to first perform
dimensionality reduction.
… following three steps.
• Establish a distance metric from a domain expert (in this case
Euclidean distance).
• Produce a dimensionality reduction technique that reduces the
dimensionality of the data from n to N, where N can be
efficiently handled by your favorite index structure.
• Produce a distance measure defined ...
The first technique suggested for dimensionality reduction of time
series was the Discrete Fourier Transform (DFT) [1]. The basic
idea of spectral decomposition is that any signal, no matter how
complex, can be represented by the superposition of a finite
number of sine (and/or cosine) waves, where each wave is
represented by a single complex number known as a Fourier
coefficient [29]. … many of the Fourier coefficients have
very low amplitude and thus contribute little to reconstructed
signal. These low amplitude coefficients can be discarded
without much loss of information…
So, what are spatial queries?
Databases are applications which store data in a format
which supports querying.
Imagine we have a database of restaurants in California. The
database should probably be able to support queries like…
• Return a list of all vegetarian restaurants.
• Return the phone number of Marios Pizza at 123 Spruce St.
• Return the restaurants that have a 4-star or higher rating.
However, there are many reasonable queries that most off-the-shelf
database systems do not support…
• Return a list of all restaurants within 5 miles of my house. (Range query)
• Return (in order of distance) the 3 pizza restaurants nearest to UCR. (Nearest neighbor query)
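The two spatial query types can be sketched in a few lines of Python. This is only a brute-force illustration that scans every record, not an index structure. The flat x/y coordinates, the sample places and the function names are assumptions made for the example.

```python
import math

# Hypothetical sample data: (name, x, y) on a flat plane, arbitrary units.
PLACES = [
    ("Marios Pizza", 244, 365),
    ("Joes Bugers", 34, 764),
    ("Jo's Mexican", 123, 32),
    ("Sues Pasta", 876, 65),
]

def distance(ax, ay, bx, by):
    """Straight-line (Euclidean) distance between two points."""
    return math.hypot(ax - bx, ay - by)

def range_query(places, qx, qy, radius):
    """Range query: every place within `radius` of the query point."""
    return [name for name, x, y in places
            if distance(x, y, qx, qy) <= radius]

def nearest_neighbors(places, qx, qy, k):
    """Nearest neighbor query: the k closest places, nearest first."""
    ranked = sorted(places, key=lambda p: distance(p[1], p[2], qx, qy))
    return [name for name, _, _ in ranked[:k]]

if __name__ == "__main__":
    print(range_query(PLACES, 200, 300, 100))
    print(nearest_neighbors(PLACES, 200, 300, 2))
```

Both functions touch every record on every query. A spatial index exists to avoid that cost.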
Your project is to build a database that supports spatial queries,
as well as classic database queries.
Although you could do this from scratch, I highly recommend that you do
this by building some code that sits on top of an off-the-shelf database
(e.g. Microsoft Access, Oracle, FoxPro, PostgreSQL).
I also highly recommend that you do this by implementing an R-tree.
In some sense the sentence above, “Your project is to build a
database that…”, is misleading. I won’t be grading the quality
of your database directly.
Your project is really to demonstrate your ability to design
medium- to large-scale software.
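For orientation only, here is a minimal sketch of the idea behind the R-tree recommended above: points are grouped under minimum bounding rectangles (MBRs), and a query skips any group whose rectangle cannot contain an answer. It shows a single level of pruning for a rectangular window query. It is not a full R-tree (no insertion, node splitting or tree hierarchy), and the class and function names are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rect:
    """Axis-aligned minimum bounding rectangle (MBR)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other):
        """True if the two rectangles share at least one point."""
        return (self.xmin <= other.xmax and other.xmin <= self.xmax
                and self.ymin <= other.ymax and other.ymin <= self.ymax)

    def contains_point(self, x, y):
        return self.xmin <= x <= self.xmax and self.ymin <= y <= self.ymax

def bounding_rect(points):
    """Smallest rectangle enclosing all (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return Rect(min(xs), min(ys), max(xs), max(ys))

def window_query(leaves, window):
    """leaves is a list of (mbr, points) pairs, one per leaf.
    A leaf whose MBR misses the window is skipped without
    examining any of its points."""
    hits = []
    for mbr, points in leaves:
        if not mbr.intersects(window):
            continue  # the whole leaf is pruned
        hits.extend(p for p in points if window.contains_point(*p))
    return hits

if __name__ == "__main__":
    group_a = [(34, 764), (244, 365)]
    group_b = [(123, 32), (876, 65)]
    leaves = [(bounding_rect(g), g) for g in (group_a, group_b)]
    print(window_query(leaves, Rect(100, 0, 900, 100)))
```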
User Interface
Spatial Search Engine (probably R-Tree)
Classic Database

Name          ID  Type  Phone     Location  Grade
Marios Pizza  1   ITA   888-1212  244, 365  D
Joes Bugers   2   US    848-1298  34, 764   A
Jo’s Mexican  3   MEX   878-1333  123, 32   A
Sues Pasta    4   ITA   878-1342  876, 65   B
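The layering above (a spatial component sitting on top of a classic database) might look roughly like the sketch below. It uses Python's built-in sqlite3 module as a stand-in for the off-the-shelf database. The schema, the split of Location into x and y columns, and the function names are assumptions made for the example. The spatial layer here is a simple sort by distance rather than an R-tree.

```python
import math
import sqlite3

def build_database():
    """Classic database layer: an in-memory table with the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE restaurant ("
        "id INTEGER PRIMARY KEY, name TEXT, type TEXT, phone TEXT, "
        "x REAL, y REAL, grade TEXT)"
    )
    rows = [
        (1, "Marios Pizza", "ITA", "888-1212", 244, 365, "D"),
        (2, "Joes Bugers", "US", "848-1298", 34, 764, "A"),
        (3, "Jo's Mexican", "MEX", "878-1333", 123, 32, "A"),
        (4, "Sues Pasta", "ITA", "878-1342", 876, 65, "B"),
    ]
    conn.executemany(
        "INSERT INTO restaurant VALUES (?, ?, ?, ?, ?, ?, ?)", rows
    )
    return conn

def nearest_of_type(conn, cuisine, qx, qy, k=1):
    """Spatial layer: the classic filter runs in SQL, then the
    candidates are ranked by distance to the query point."""
    candidates = conn.execute(
        "SELECT name, x, y FROM restaurant WHERE type = ?", (cuisine,)
    ).fetchall()
    candidates.sort(key=lambda row: math.hypot(row[1] - qx, row[2] - qy))
    return [name for name, _, _ in candidates[:k]]

if __name__ == "__main__":
    conn = build_database()
    print(nearest_of_type(conn, "ITA", 800, 100, k=2))
```

Keeping the classic filter in SQL and the distance ranking in a separate function means the user interface only ever calls the spatial layer.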
User Interface example: address entry
[Figure: the interface sits on top of the Spatial Search Engine (probably R-Tree) and the Classic Database]
Prompt: Enter an address and we will find the location of the
nearest Californian university
Input: 221 Baker Street, Riverside
Options: Exclude Religious Schools, Exclude Cal States
Result: The nearest university is CSUSB. Click here for
admissions information
User Interface example: map click
Prompt: Click on the map and we will find the location of the
nearest Californian university
Options: Exclude Religious Schools, Exclude Cal States
Result: The nearest university is CSUSB. Click here for
admissions information
User Interface example: choose from a list
Prompt: Choose a location and we will find the location of the
nearest Californian university
Choices: LAX, Golden Gate Bridge, Balboa Park, SD, Ontario Mills
Options: Exclude Religious Schools, Exclude Cal States
Result: The nearest university is CSUSB. Click here for
admissions information
User Interface example: GPS
Prompt: The GPS unit tells me you are in UCR, Riverside
California. Do you want to know the location of the nearest
University?
Options: Exclude Religious Schools, Exclude Cal States
Result: The nearest university is CSUSB. Click here for
admissions information
To begin, you must come up with an application area which has a
spatial element (e.g. restaurants in Orange County, California
brown bear sightings, locations of car crashes in Riverside).
You must write a two-page description of the problem, in the first person.
The project description should begin by informally explaining the domain
from the customer’s perspective (“As a restaurant critic… ”). Then explain
the utility of the database for the customer (“The database will allow me to … …it
will also help me…”).
After I approve the project description, I (and/or our TA) will
assume the role of the customer (I may add some requirements).
Thereafter, any time you have a question about what the customer
wants, you must come to see me. If you make an assumption,
and it is the wrong assumption, you will have to redo your work,
or take a major grade penalty.
How am I going to get the spatial
locations of 500 places?
• The web.
• A GPS unit.
• Use a grid overlay.
If you use a grid overlay you must do it very
carefully, and document the process.
Note that treating the problem as existing on a
Euclidean plane is actually incorrect. Since the
locations are on a sphere there will be an inherent
error in the distances reported. This effect would not
show up in an area the size of Riverside, but would
show up for an area the size of California. However,
you may ignore it in this project.
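To get a feel for the size of this error, the sketch below compares a great-circle (haversine) distance with a flat-plane distance for a city-scale pair of points and a state-scale pair. Everything specific here is an assumption made for illustration: the approximate latitude/longitude values, the Earth radius, the spherical Earth model, and the choice of flat approximation (longitude scaled by the cosine of a single reference latitude, as a grid calibrated at one place would do).

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean radius, spherical model

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def planar_miles(lat1, lon1, lat2, lon2, ref_lat):
    """Flat-plane distance: treat lat/lon as x/y on one grid whose
    longitude scale is fixed at the reference latitude."""
    x = math.radians(lon2 - lon1) * math.cos(math.radians(ref_lat))
    y = math.radians(lat2 - lat1)
    return EARTH_RADIUS_MILES * math.hypot(x, y)

def relative_error(a, b, ref_lat):
    """Fractional difference between planar and great-circle distance."""
    true = haversine_miles(*a, *b)
    return abs(planar_miles(*a, *b, ref_lat) - true) / true

# Approximate, illustrative coordinates (assumptions).
REF_LAT = 33.95                       # grid calibrated near Riverside
CITY_A, CITY_B = (33.95, -117.40), (34.00, -117.35)   # a few miles apart
STATE_A, STATE_B = (32.72, -117.16), (41.76, -124.20)  # far south to far north

if __name__ == "__main__":
    print("city-scale error: ", relative_error(CITY_A, CITY_B, REF_LAT))
    print("state-scale error:", relative_error(STATE_A, STATE_B, REF_LAT))
```

Under these assumptions the state-scale error is noticeably larger than the city-scale one, which is the effect described above.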
Important Reminder
Do not leave here today thinking… “how am I going to code this
R-tree thing”, or “what language should I use”.
Leave here thinking… “How is our group going to elicit the
problem, design, build and test this piece of software? What is the
best design process to use? How are we going to convince the
professor (with the contents of our project binder) that we used a
high-quality process to solve this problem?”
In particular, you probably want to spend a few weeks
researching the design process before you even consider the
particular application problem.