Best Practices for Search
for the Federal Government
Marti Hearst
Web Manager University
November 10, 2009
The Importance of Search for Govt
• OMB memorandum, Dec 2005:
“When disseminating information to the public-at-large,
publish your information directly to the internet.”
• Pres. Obama’s memorandum, Jan 21, 2009:
“Information maintained by the Federal Government is a
national asset. My Administration will take appropriate
action, consistent with law and policy, to disclose
information rapidly in forms that the public can readily
find and use. ”
A bit about me
• Professor at the School of Information at
University of California, Berkeley.
 Teach masters students
 User Interface Design, Search Engines,
Computational Linguistics, Visualization.
• Search User Interfaces
• Visiting government for 1 year
 Updating usasearch.gov
 Looking at site search alternatives.
 Generally kibitzing
Two Focus Areas
• Web search engines
 The quality and form of your content
 How your results are viewed in search engine
listings
 How your site is crawled
• Site search
 The search interface
 What is crawled
 How results are presented
Outline
• Designing your site for effective search
• Site search interfaces
• Special considerations for web search
engines
• An example of what not to do.
Use Proven Interface Techniques
 Use modern search UI ideas that are
known to have good usability.
 Apply the principle: recognition over
recall.
•Related query suggestions
•Auto-suggest as the user types
•Use faceted navigation where appropriate.
Search-as-you-Type (SAYT)
• As the user types, shows other people’s
queries with the same word stems.
• Helps people think of additional words
 (recognition over recall)
• Proven to improve search results.
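The prefix-matching idea behind search-as-you-type can be sketched as follows; the query log and its counts are invented for illustration, and real systems use tries and richer popularity weighting rather than a linear scan.

```python
# Minimal sketch of search-as-you-type suggestions: match the user's
# partial query against a log of past queries (hypothetical data).
from collections import Counter

query_log = Counter({
    "swine flu symptoms": 120,
    "swine flu vaccine": 95,
    "student loans": 80,
    "swing state polls": 15,
})

def suggest(prefix, limit=3):
    """Return the most popular logged queries starting with `prefix`."""
    matches = [(q, n) for q, n in query_log.items() if q.startswith(prefix)]
    matches.sort(key=lambda qn: -qn[1])  # most frequent first
    return [q for q, _ in matches[:limit]]

print(suggest("swi"))
```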
Evidence-based Decision Making
DATA TRUMPS INTUITIONS
(Kohavi)
Use Evidence-based Decision Making
User behavior determines if an idea is retained.
A/B testing is a standard way to do this.
1) Make small changes to an interface.
2) Show the changed interface to a significant
sample of the user population, show
everyone else the original version.
3) Do this over time (~ two weeks) and for (tens
of) thousands of users.
4) Compare what the two groups do over time.
5) Based on this, decide whether to keep or
reject the feature.
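Step 4, comparing what the two groups do, is usually a significance test on the two groups' conversion rates. A minimal sketch, with invented counts and a standard two-proportion z-test standing in for whatever analysis a real A/B platform performs:

```python
# A/B test sketch: compare conversion rates (clicks, donations) of two
# interface variants with a two-proportion z-test. Counts are invented.
from math import sqrt, erf

def z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)           # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal tail
    return z, p_value

# Variant B shown to half the users for ~two weeks (hypothetical numbers).
z, p = z_test(conv_a=480, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z={z:.2f}, p={p:.4f}")
```

A small p-value here would justify keeping the changed interface; otherwise the original is retained.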
Evidence-based Decision Making
• Example:
 Dan Siroker on Obama for America’s
website and video design decisions
 Easy to measure the outcome: it is in
money donated.

http://www.siroker.com/archives/2009/05/14/obama_lessons_learned_talk_at_google.html
Vote: Which Button is Best?
[Slides: candidate button designs shown with a countdown counter for audience voting]
Ease of Use: Summary
USE PROVEN UI TECHNIQUES
REDUCE EXTRA STEPS
USE CLEAR LANGUAGE
MAKE EVIDENCE-BASED UI DECISIONS
How Web Search Engines Work
How Search Engines Work
Three main parts:
i. Gather the contents of all web pages
(crawling)
ii. Organize the contents of the pages in a
way that allows efficient retrieval
(indexing)
iii. Take in a query, determine which pages
match, and show the results (ranking and
display of results)
Standard Web Search Engine Architecture
[Diagram, built up over three slides: crawler machines crawl the web; pages are checked for duplicates and stored as documents (DocIds); an inverted index is created and placed on search engine servers, which take in the user’s query and show results to the user.]
i. Spiders or crawlers
• How to find web pages to visit and copy?
 Can start with a list of domain names and visit
the home pages there.
 Look at the hyperlinks on each home page, and
follow those links to more pages.
 Keep a list of URLs visited, and of those still to
be visited.
 Each time the program loads a new HTML
page, add the links in that page to the list to
be crawled.
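The frontier logic above can be sketched as a breadth-first traversal. Real crawlers fetch pages over HTTP (with politeness delays and robots.txt checks); here a hypothetical in-memory link graph stands in for the web so the logic is visible on its own.

```python
# Sketch of the crawl-frontier steps described above, over a
# hypothetical link graph: page -> hyperlinks found on that page.
from collections import deque

links = {
    "http://example.gov/":  ["http://example.gov/a", "http://example.gov/b"],
    "http://example.gov/a": ["http://example.gov/b", "http://example.gov/"],
    "http://example.gov/b": ["http://example.gov/c"],
    "http://example.gov/c": [],
}

def crawl(seed):
    visited, frontier, order = set(), deque([seed]), []
    while frontier:
        url = frontier.popleft()
        if url in visited:          # keep a list of URLs already visited
            continue
        visited.add(url)
        order.append(url)
        for link in links.get(url, []):   # add each page's links to the frontier
            if link not in visited:
                frontier.append(link)
    return order

print(crawl("http://example.gov/"))
```

Note the `visited` set: it is what keeps hyperlink cycles (a page linking back to the home page) from trapping the crawler.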
Spider behaviour varies
• Parts of a web page that are indexed
• How deeply a site is indexed
• Types of files indexed
• How frequently the site is spidered
Four Laws of Crawling
• A Crawler must show identification
• A Crawler must obey the robots
exclusion standard
http://www.robotstxt.org/wc/norobots.html
• A Crawler must not hog resources
• A Crawler must report errors
Lots of tricky aspects
• Servers are often down or slow
• Hyperlinks can get the crawler into cycles
• Some websites have junk in the web pages
• Now many pages have dynamic content
• The web is HUGE
The Internet Is Enormous
Image from http://www.nature.com/nature/webmatters/tomog/tomfigs/fig1.html
“Freshness”
• Need to keep checking pages
 Pages change
•At different frequencies
•Pages are removed
 Many search engines cache the pages
(store a copy on their own servers)
What really gets crawled?
• A small fraction of the Web that search
engines know about; no search engine is
exhaustive
• Not the “live” Web, but the search engine’s
index
• Not the “Deep Web”
• Mostly HTML pages but other file types too:
PDF, Word, PPT, etc.
ii. Index (the database)
Record information about each page
• List of words
 In the title?
 How far down in the page?
 Was the word in boldface?
• URLs of pages pointing to this one
• Anchor text on pages pointing to this one
Inverted Index
• How to store the words for fast lookup
• Basic steps:
 Make a “dictionary” of all the words in all of
the web pages
 For each word, list all the documents it occurs
in.
 Often omit very common words
• “stop words”
 Sometimes stem the words
• (also called morphological analysis)
• cats -> cat
• running -> run
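The steps above can be sketched in a few lines; the documents are invented, and the stemmer is deliberately crude (real engines use something like the Porter stemmer).

```python
# Sketch of an inverted index with stop-word removal and crude stemming.
docs = {  # hypothetical document collection
    1: "Recalls of pet food products",
    2: "The recall affected dog food brands",
}
STOP = {"of", "the"}  # very common "stop words" are omitted

def stem(word):
    # Toy stemmer for illustration only: strip a trailing -ed or -s.
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

index = {}  # the "dictionary": term -> set of documents it occurs in
for doc_id, text in docs.items():
    for word in text.lower().split():
        term = stem(word)
        if term not in STOP:
            index.setdefault(term, set()).add(doc_id)

print(sorted(index["recall"]))  # both documents, thanks to stemming
```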
Inverted Index Example
Image from http://developer.apple.com/documentation/UserExperience/Conceptual/SearchKitConcepts/searchKit_basics/chapter_2_section_2.html
Inverted Index
• In reality, this index is HUGE
• Need to store the contents across many
machines
• Need to do optimization tricks to make
lookup fast.
Query Serving Architecture
[Diagram: the query “travel” passes through a load balancer to front ends (FE1 … FE8), then to query integrators (QI1 … QI8), then to rows of index nodes (Node1,1 … Node4,N).]
• Index divided into segments, each served by a node
• Each row of nodes replicated for query load
• Query integrator distributes the query and merges results
• Front end creates an HTML page with the query results
iii. Results ranking
• The search engine receives a query, then
• looks up the words in the index and retrieves many
documents, then
• rank-orders the pages and extracts “snippets,” or
summaries containing the query words.
 Most web search engines assume the user
wants all of the words.
• These are complex and highly guarded algorithms,
unique to each search engine.
Some ranking criteria
• For a given candidate result page, use:
 Number of matching query words in the page
 Proximity of matching words to one another
 Location of terms within the page
 Location of terms within tags, e.g. <title>, <h1>,
link text, body text
 Anchor text on pages pointing to this one
 Frequency of terms on the page and in general
 Link analysis of which pages point to this one
 (Sometimes) Click-through analysis: how often
the page is clicked on
 How “fresh” the page is
• Complex formulae combine these together.
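A toy version of such a formula, combining a few of the criteria above, can make the idea concrete. The pages, weights, and fields here are all invented; production ranking functions are far more elaborate and secret, as the slide notes.

```python
# Toy ranking score combining term frequency, title match, and a
# stand-in for link analysis. Weights are invented for illustration.
def score(page, query_terms):
    body = page["body"].lower().split()
    title = page["title"].lower().split()
    s = 0.0
    for term in query_terms:
        s += body.count(term)      # frequency of the term on the page
        if term in title:
            s += 5.0               # terms inside <title> weigh more
    s += 0.5 * page["inlinks"]     # pages with more inbound links rank higher
    return s

pages = [
    {"title": "Recent Recalls", "body": "car recall notices", "inlinks": 10},
    {"title": "About Us", "body": "we publish recall data recall lists", "inlinks": 2},
]
ranked = sorted(pages, key=lambda p: score(p, ["recall"]), reverse=True)
print(ranked[0]["title"])
```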
Measuring Importance of Linking
• PageRank Algorithm
 Idea: important pages are pointed
to by other important pages
 Method:
• Each link from one page to another is counted as a “vote”
for the destination page
• But the importance of the starting page also influences
the importance of the destination page.
• And those pages’ scores, in turn, depend on the
scores of the pages linking to them.
Image and explanation from http://www.economist.com/science/tq/displayStory.cfm?story_id=3172188
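The circular definition above (a page's score depends on scores that depend on it) is resolved by iterating until the scores settle. A minimal power-iteration sketch of the PageRank idea, on an invented three-page graph:

```python
# Sketch of PageRank: each page's score is split as "votes" along its
# outgoing links, iterated until the scores converge.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                # an important page passes more weight to its targets
                new[target] += damping * rank[page] / len(outlinks)
        rank = new
    return rank

# Hypothetical graph: A and C both link to B; B links back to A.
links = {"A": ["B"], "B": ["A"], "C": ["B"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # B collects the most "votes"
```

B ranks highest because it receives links from two pages, and A outranks C because it is linked to by the important page B.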
CRAFT SITES FOR FINDABILITY
(SEO)
Making Web Sites Attractive to
Search Engines
• Called “Search Engine Optimization” (SEO)
• There is a LOT of information about this on
the web
 Most is about how to improve your site
 Some is about “cheating”; avoid this
• There are many tools to help you too.
The Most Important Principle:
Good, unique content trumps
everything else.
Content is Key
• Web sites that are primarily high-quality,
unique content will be ranked highly.
 Not just links to other content
 Not re-packaging of other content
• Example:
 My online book was top ranked for “search
user interfaces” within one day of site launch.
 It is also top ranked for many related queries.
Web Site Characteristics
• These can lead to high search engine
ranking (but no guarantees):
 High-quality, unique content.
 Linked to by high-quality sites.
 Been around a long time with consistent
content.
Keyword Placement
• Search engines place “weight” on words
according to where they are used
• Place important words in
 Title tags
 Headings (H1 is key) and emphasized text
 Visible body text
 Description metadata – often used in search
results snippets.
 Alt text in images
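The placements above can be illustrated with a skeletal page; the page and its wording are hypothetical.

```html
<!-- Sketch of keyword placement on a hypothetical recall page -->
<head>
  <title>Dog Food Recalls - Recalls.gov</title>
  <meta name="description"
        content="Current dog and pet food recall notices, updated daily.">
</head>
<body>
  <h1>Dog Food Recalls</h1>
  <p>Recent <strong>pet food recall</strong> notices appear below.</p>
  <img src="recall-map.png" alt="Map of dog food recalls by state">
</body>
```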
Keyword Variation
• Describe the same concepts using different
words within the relevant pages.
 Compare “search interfaces” with “search user
interfaces” in the next slide.
 1 hit versus 4 hits in the top 6
 I need to use more variation for the key concepts
• But it must make sense in your page;
 Don’t hide dictionaries of words!
 Can include them in the description metadata.
The Importance of URLs
• Meaningful, short URLs improve search engine
ranking and usability
• URLs that consist of computer-generated database
queries can hurt rankings.
• URLs with lots of redirects also hurt.
The Importance of Titles
• The title tag determines what words show up in
the search results title.
 Make them descriptive of the site
 Vary them to differentiate them.
• Example (next page)
 Consistently varies the title to show how they
differ.
 But makes a mistake in the metadata
description by putting the part that varies too
far from the start, so it all looks the same.
Robots Exclusion
• It is important to check your robots.txt files
to be sure they are allowing crawling.
• If your server can’t handle a lot of traffic,
use the site map file to slow crawlers down.
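A hypothetical robots.txt showing the pieces involved: the `User-agent`/`Disallow` lines are the robots exclusion standard itself; `Crawl-delay` is a nonstandard extension honored by some crawlers, and the `Sitemap` line is the autodiscovery convention from sitemaps.org.

```
# Hypothetical robots.txt: allow crawling generally, keep crawlers out
# of search-results pages, and point to the XML site map.
User-agent: *
Disallow: /search
Crawl-delay: 10
Sitemap: http://www.example.gov/sitemap.xml
```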
Site Maps
•
There are two kinds of site maps:
1. A navigation structure visible to users
2. An XML file visible only to search
engines
• The latter is important to help ensure the pages on
your site are crawled.
• You can also specify the frequency with which you
hope the pages will be crawled.
• There are free tools to help you do this.
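The XML kind follows the sitemaps.org protocol; a minimal hypothetical entry, where `changefreq` expresses the crawl frequency you hope for (search engines treat it as a hint):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.gov/recalls/</loc>
    <lastmod>2009-11-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```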
Examples of What Not To Do
For both site design and SEO.
Or … don’t mess with my dog!
Photo of Emmi
What happens when you
search for recalls
?
What happens when you type
http://recalls.gov
?
The lesson: make your url (web
address) easy to find
There should at least be a redirect from
recalls.gov to www.recalls.gov
Also, the url should match its description
in the site title field!
Where is search on this site?
The point: the search entry form
should be highly visible and in a
standard position.
Usually wide and centered towards
the top or else shorter and on the
upper right.
Where do I search for that
recent dog food recall?
The point: do not make the
user guess how your
information is structured.
There should be one search engine
for all government recall
information.
The point: do not require users
to fill out structured search
forms.
This can be an option but should not be
required.
Showing categories with previews of
how many hits are associated with each is
better than lots of entry forms.
What happens after I search?
The point:
use standard layout
(unless there is a good reason not to)
This site puts too much text at the top
before showing search results.
Also, searchers frequently modify their query,
so it is standard to show the search form with
the previous query at the top.
The point: do promote
commonly requested
information to the top of the
results.
This site uses “best bets” to
promote popular content to the
top; the user finds what they want.
What happens if I search for
recalls at usasearch.gov?
The point:
use descriptive titles.
It is important to put the distinguishing
information first so the repeated part
does not dominate. For example:
Home page: Recalls.gov
Recent Recalls
Food Safety Recalls
Automotive Recalls
What happens if I search for
car recalls at major search
engines?
Answer: I don’t see recalls.gov
What is on the page for car
recalls?
The point: use words that
your users use.
Notice that the main page for cars at
recalls.gov does not appear towards the
top. The word “car” does not play an
important role on the relevant page.
Tools for Improving Web Sites for
Search Engines
Search Engine Information
• SEO
 http://www.ninebyblue.com/
• Keep current with industry
 http://www.searchengineland.com
 http://battellemedia.com
• Search Interface Principles
 http://searchuserinterfaces.com
• Search Design Patterns (Peter Morville)

http://www.flickr.com/photos/morville/collections/72157603785835882/
Faceted Navigation
For Structured Web Site Search
The Idea of Facets
• Facets are a way of labeling data
 A kind of Metadata (data about data)
 Can be thought of as properties of items
• Facets vs. Categories
 Items are placed INTO a category system
 Multiple facet labels are ASSIGNED TO
items
The Idea of Facets
• Create INDEPENDENT categories (facets)
 Each facet has labels (sometimes arranged in
a hierarchy)
• Assign labels from the facets to every item
 Example: recipe collection
• Ingredient: Chicken, Bell Pepper
• Cooking Method: Stir-fry, Curry
• Course: Main Course
• Cuisine: Thai
The Idea of Facets
• Break out all the important concepts into
their own facets
• Sometimes the facets are hierarchical
 Assign labels to items from any level of the
hierarchy
 Example hierarchies:
• Desserts: Cakes, Cookies, Dairy (Ice Cream, Sorbet, Flan)
• Fruits: Cherries, Berries (Blueberries, Strawberries),
Bananas, Pineapple
• Preparation Method: Fry, Saute, Boil, Bake, Broil, Freeze
Using Facets
• Now there are multiple ways to get to each
item
 One item is reachable via Fruit > Pineapple,
Dessert > Cake, or Preparation > Bake
 Another via Dessert > Dairy > Sherbet,
Fruit > Berries > Strawberries, or
Preparation > Freeze
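The mechanics behind those multiple paths can be sketched simply: each item carries one label per facet, and choosing a label both filters the collection and regroups what remains. The items and labels below are invented to match the recipe example.

```python
# Sketch of faceted navigation: hypothetical items, each labeled with
# one value per facet; any facet can be the starting point.
from collections import Counter

items = {
    "Pineapple Upside-Down Cake": {"Fruit": "Pineapple",
                                   "Dessert": "Cake", "Preparation": "Bake"},
    "Strawberry Sherbet":         {"Fruit": "Strawberries",
                                   "Dessert": "Sherbet", "Preparation": "Freeze"},
    "Blueberry Cookies":          {"Fruit": "Blueberries",
                                   "Dessert": "Cookies", "Preparation": "Bake"},
}

def narrow(selection):
    """Items whose labels match every facet=label pair in `selection`."""
    return [name for name, labels in items.items()
            if all(labels.get(f) == v for f, v in selection.items())]

def counts(facet, selection):
    """Hit counts per label of `facet`, within the current selection."""
    return Counter(items[name][facet] for name in narrow(selection))

print(narrow({"Preparation": "Bake"}))
print(counts("Dessert", {"Preparation": "Bake"}))
```

The `counts` function is what lets an interface show each remaining label with its hit-count preview, rather than a dead-end link.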
Advantages of Faceted Navigation
• Systematically integrates search results:
 reflect the structure of the info
architecture
 retain the context of previous
interactions
• Gives users control and flexibility
 Over order of metadata use
 Over when to navigate vs. when to
search
Faceted Categories vs. Hierarchies
Stickers vs. Folders
Example:
Medicare Prescription Drug Plan Scam
• With folders, you have to place the item into
multiple folders:
 Health, Elderly, Safety, Drugs, Fraud
• Alternative: assign stickers to the item — assign
categories to the item, rather than putting the
item into categories:
 Health, Drugs, Elderly, Safety, Physicians, Scams
Faceted Navigation
• User can start with any category, and see the
results grouped by the other categories.
• Example:
 Start with Health
• See results grouped by subcategories of Health, such as Drugs,
Nutrition
 Alternatively, user can group results by other
categories:
• Click on Financial, see Insurance, Payments, etc.
• Click on Teens, see results relevant to teens
Examples of Faceted Layouts
Best Practices for Search
Thank you!
Marti Hearst
Web Manager University
November 10, 2009