Download Search - Andreas Weigend

WeigendNotesStanford2004.doc
Version history
V01
V00 created 2004.06.30
Week 1 - Class 1
Goals
Learning to ask questions
Actionable outcomes
Definitions
What is eBusiness
What is data mining?
Empirical
What else?
Insights
Observation vs Experiments
Experiment: vary the thing you can act upon
2 groups: A, B; or treatment and control
Do it in parallel, vs sequentially
Other effects, such as day of week can swamp out small effects
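A minimal sketch of comparing two groups that ran in parallel. The visitor and conversion counts are invented for illustration, and the two-proportion z-test is just one common way to ask whether a difference is larger than chance; the notes do not name a specific test.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (difference in conversion rate, z-score) for groups A and B."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return p_b - p_a, (p_b - p_a) / se

# Hypothetical numbers: both groups ran on the same days, so day-of-week
# effects hit A and B equally and cancel in the difference.
diff, z = two_proportion_z(conv_a=200, n_a=10_000, conv_b=250, n_b=10_000)
print(f"lift = {diff:.4f}, z = {z:.2f}")
```

Running A and B sequentially instead would mix the treatment effect with whatever else changed between the two periods.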
Levels of data, levels of analysis
Exponential growth of automatically collected data
Constrain problems from what actions are possible
Domain knowledge matters: directs us where to look
Behavioral economics
Levels of explanation
What? vs why?
Different depth of “why?”
Need to create insights that generalize
5 steps of modeling
Not a linear process, but a “dialog with the data”
not an IT function, but an iterative, creative process
Importance of defining metrics
Understand the trade-offs
Baseline
Recommendations were based on (anonymous) co-purchasing behavior
Arrow of time
Game metaphor
Example of tech transfer
Issues
Who owns the data
State street
Ppl who bought also bought
Adverse governments
Class 2
Insights
Technology enables: Key ingredients
Communication
Connectivity
Standardization
D:\840978123.doc
Page 1 of 18
From car parts to
Size of eBiz
O(10%) of overall biz
But: Multiple counting of parts, only single-counting of whole?
O(10%) of the world’s population
But: definition, vs weighted by wealth
Impact on productivity
Other
Organization
TAs
Project
Textbooks
Introduction to e-Business
Background reading
Technology: BFS Ch2
Statistics: B&L Ch5
Class 3
Student backgrounds
Understand backgrounds and expectations, what problems and data do they have
SCPD off-campus people to write in to Eric
What was important last class?
Define e-business
Internet age vs industrial age
2 economies of scale: production vs network
Key ingredients
Communication
Standardization
Storage is free
Search: from finding to ranking
1) context (local)
2) static (link)
3) dynamic (trajectories of customers)
New pricing mechanism
Buyers (for the company) -> power shifts to the consumer
Inventory
Comparison shopping
Pay by impression
Pay by click
Pay by buy
Priceline
Auctions
Primarily in B2B
Trade-offs
Privacy vs Convenience
Security vs Convenience
Power vs Ease of use
Review
Technology and information
Frame: why it matters. Why now.
Infocomm
Concepts of underlying technologies (and business relevance)
TCP/IP -> VoIP etc
Sources of data and scales
E-business: Retail, telecom, airlines, hotels
Trade-offs for customer (Cost of privacy. Power vs ease of use)
Server side data: Customer registration, purchase, payment. Session aggregates, clicks, all that was displayed.
Client side data: Toolbars (yahoo, google, alexa); ComScore.
Trade-offs in customer perception: privacy vs convenience.
Customer education, trust.
Web services
E-business
Scope, Size and growth
“Stanford01 ppt”
Statistics (B&L Ch 5)
Exploratory data analysis
Description / characterization of observations of single variable. What makes sense when? Power “laws”. Entropy.
Experiments vs observation. Causality vs Correlation
Many variables: visualizations.
Representation: time of year examples
Statistics: Quantifying uncertainty
Making predictions (and knowing how good or bad they are)
Making decisions: Combining predictions with preferences
Value of information
Exploratory: Also Decision Trees, Ch 6
Model based on decision trees can be understood / point to exactly the rule that led to it (e.g., denial of credit)
Quantifying uncertainty
Don’t blindly trust numbers presented to you
Plot early, plot often; look at raw data
Understand what questions to ask
If you don’t measure, you can’t manage effectively
Often, shining a bright light on the problem is already enough
Quantitative methods: At their heart, they are about quantifying and managing uncertainty
Don’t use intuitions because they can easily be wrong
If there was no uncertainty, method very simple
Understand what is likely and what is unlikely
Predictive ability on new data (out-of-sample) is king!
Learn to avoid mistakes
Make people aware that they are making decisions even by avoiding making decisions
Different kinds of random samples
Everything has trade-offs, nothing is black and white: we quantify the different shades of grey
Ultimately: need to drive business decisions (under uncertainty)
Just do it
If you understand the risks / the uncertainty, you can make decisions with more certainty

Specific learning
Empirical rule (one variable)
By observing a few points, you can make predictions for rare events
Distinguish chance fluctuations (statistical errors) from bias (systematic errors)
Observed X = true X + chance error + bias
Statistics can only tell you about the chance error
Chance error proportional to 1 / sqrt (N)
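A small simulation sketch of the 1 / sqrt(N) rule, assuming a fair coin as the quantity being measured (the coin, the number of trials and the seed are illustrative choices, not from the notes). It estimates the spread of the sample mean for several sample sizes and prints it next to the theoretical value.

```python
import random
import statistics

def spread_of_sample_mean(n, trials=1000, seed=0):
    """Standard deviation of the mean of n fair-coin flips, estimated by simulation."""
    rng = random.Random(seed)
    means = [
        sum(rng.random() < 0.5 for _ in range(n)) / n
        for _ in range(trials)
    ]
    return statistics.pstdev(means)

for n in (100, 400, 1600):
    # Theory for a fair coin: 0.5 / sqrt(n). Quadrupling n halves the chance error.
    print(n, round(spread_of_sample_mean(n), 4), round(0.5 / n ** 0.5, 4))
```

The simulation contains no bias term: collecting more data shrinks only the chance error, so a systematic error would remain no matter how large N gets.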
Understanding correlation (two variables)
Limitations, interpretation as linear measure of association
Correlation is not causation
Shoe size and linguistic ability are correlated (in children, both increase with age)
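A sketch of how a third variable can produce such a correlation. All numbers are made up: age is assumed to drive both shoe size and vocabulary, and neither influences the other. The Pearson coefficient is computed by hand to show that it measures linear association only.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient, a measure of linear association."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(1)
# Hypothetical children aged 3 to 12; age drives both variables.
ages = [rng.uniform(3, 12) for _ in range(500)]
shoe_size = [20 + 1.5 * a + rng.gauss(0, 1.5) for a in ages]
vocabulary = [1000 * a + rng.gauss(0, 1500) for a in ages]

print("shoe size vs vocabulary:", round(pearson(shoe_size, vocabulary), 2))
```

The correlation comes out strongly positive even though, by construction, buying larger shoes would do nothing for vocabulary.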
Chapter 17
What data actually look like, in contrast to what you would like them to look like
Today (Class 3)
Applications to Marketing
Definition / Why it is important
Segmentation, history of direct marketing
Note: acting more quantitatively / model-based allows us to create even better models
Active learning
(underwear)
Market research (general)
Survey
Comscore: Panel (implicit behavior)
Email campaigns
Measure response
Iowa electronic markets
P2P example for the labels
Observer: queries to download .NE. listening
Streaming service could be different
Cluster queries, determine segments
Issues
Sampling
How to sell it?
Paradigm shift: specific customer and specific session intention
Lesson: focus on current session (e.g., home loan vs SamTrans; Russian Ladies)
Occasionalization
ASSIGN PAPER
B&L
Ch4: Why DM is important in business
NYT p 90
Learn to formulate the right questions, to frame it right
Ch2 P 23
How important is it to understand the business case correctly
P59 / General Mills: Who is a yoghurt lover? They want to be the category manager; who determines what is placed where on
What they wanted: ZIP codes etc rather than printing coupons
Sources of data
Applications: advertising.com vs server
Session signatures
Cookie
IP address
Time of day
Browser info (color depth, window size)
Purchase y/n
HTTP referrer
Timing
Customer signatures and customer segmentation
Behavior-based vs demographics-based
Clustering algorithms
Driving and driven by actions
Conjoint analysis
Interaction with the Company: Customer experience and satisfaction models
Implicit vs explicit data
Surveys: Sampling (session-based or customer based?), non-response biases
Stated vs revealed preferences
(Guest: Ketchpel/Vividence)
Shopping process models (PRMs)
Bayes Nets
cf D’Ambrosio/CleverSet
Network effects (Pedro): Traditional marketing vs network marketing
Pedro Paper + Talk
Data mining for viral marketing
Customers influence each other
Traditional methods ignore this
Model markets as social networks
Networks mined from collaborative filtering systems and knowledge sharing sites
Some customers have very high network value
Targeted viral marketing much more profitable than direct marketing
Class 4
Summary of Class 3
Empower the end-customer
Closing the feedback loop
Marketing
Create product
Esp for digital products
Market research
P2P activity info
Observing
Experiments
Eg new song
Paradigm shifts
Time scales: Estimate CURRENT intentions
From customer to session
From corporate buyers to end-customers
Both for physical and for information products
Specific example of empowering the end-customer
Design of Experiments
Statistical design of experiments vs scientific design of experiments
Importance of having to pay for things
Trade-offs
Class 5
Logistics
IM [email protected] during class
Summary of Class 4: Design of experiments
Art notes
Key is to vary something…
Michelson-Morley
Look up year
.. but what do we vary: need your creativity, your playfulness
Data mining is not opening a book at a (random, often!) page, and applying a recipe (often wrongly)
Sad to see that people focus on statistical significance, and customers ask about statistical significance.
They should ask about relevance!! About actionability!
Examples
First, in all cases, agree on what you want to optimize
Probability of a worst case, average squared error, average absolute error…
Sponsorship matters: DARPA vs VCs
Response time of server
Assume a model, e.g., quadratic response
# of DOF so small that we saw little ellipses as confidence regions
AB Tests
Discrete
Main problem: Meta-theory of what experiments to do, how to generalize from one to the
other
Relationship to “active learning”
2 panel goldbox
Short-term vs long-term
Sponsored links
People are curious: lots of clicks initially, then no more
Vs heat bath / large pool: people always entering new
Search
Framing: This course is about technologies, about data, methods and the Internet.
One of the key technologies that has dramatically changed the way we do things, privately and publicly, is SEARCH.
In this course, we will present 3 perspectives on search: General (Web), Products (in a shopping context), People (in a dating context).
There are lots of insights you will gain, but let’s start simple:
How do you look for a string in a given doc, on a disk
Once vs many times: Trade-off
Similar to data bases
Idea: Indexing
[Reference: Managing gigabytes]
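A minimal sketch of the indexing idea, assuming whitespace tokenization and a tiny made-up corpus. Scanning every document is fine for a single lookup; building an inverted index once pays off when the same collection is searched many times.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def search(index, query):
    """Return ids of documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)

# Hypothetical mini-corpus.
docs = {
    1: "bird diapers for pet parrots",
    2: "adult diapers and beer",
    3: "vegan singles meet here",
}
index = build_index(docs)
print(search(index, "diapers"))
print(search(index, "diapers beer"))
```

The trade-off is the one named above: the index costs time and storage up front, and each later query becomes a dictionary lookup instead of a scan.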
Q: Spell correction? No, keep all the typos
(But might provide tools
(1) looking at transpositions etc, possibly having model of what mistakes people make
(2) trajectories
Show Andreas Weigand at yahoo, then google)
Example: Clearning
Meaning = use: Wittgenstein
We have talked about consumer behavior before … Note here the same shift from normative to descriptive
(What are the political interests to keep with the normative? What is the point of communication, vs the Academie Francaise, the French Academy)
Applications for personal search
Not: personals search…
Longhorn
Search deeply embedded in OS
Gmail
Filing is hard: never know what will be needed
E.g., real estate agent
Paradigm shift: from filing to (indexing and) searching
Q: Why does search on most intranets suck?
Q: Ask students how satisfied they are with their intranet
Blue pants example here
Search inside the book
1.07 times as many books sold with Search inside the book compared to books that don’t have that feature
How about the web?
Need to crawl the web
Classnotes: no links to them, need to know – no way to find this.
Hyperlinks!
Different crawl strategies / policies
[BFS]
Serious CS issues
Don’t try this at home
Order of magnitude: Number of computers
Underlying technologies
Communication
Standards
Storage
And, one level deeper, processing power, RAM…
Ok, so now we have the index of the web
Examples
"Bird Diapers"
Diapers and beer:
Highest nr of adult diapers sold per capita (by ZIP code)
"Is it a fruit or not"
"Vegan Singles"
But what about all the weigends?
(Also show the google number)
…Solved one problem, but create another one…
Like in the good old days, when you were successful, your server would melt down
… ranking
Paradigm shift: from finding to relevance ranking
Relevance of course context dependent
George Miller’s literacy test
What kind(s) of info can we use here?
Local info: where / how often does the word appear
Keyword match (title, abstract, body)
Anchor Text (referring text)
Static info: link structure
NYU undergrad course
Enough? No – need to consider the quality of the sites linking to me
Iterate, formally largest eigenvector.
Luckily easily gotten,
[BFS]
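A sketch of the "iterate" step as power iteration in the style of PageRank. The damping factor of 0.85, the 50 iterations and the four-page link graph are assumptions made for illustration; the notes only say that the ranking is, formally, the largest eigenvector.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a dict {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page without outlinks spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web: everyone links to "hub".
links = {
    "hub": ["a"],
    "a": ["hub"],
    "b": ["hub", "a"],
    "c": ["hub"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page "a" has fewer inbound links than "hub", yet it ranks far above "b" and "c" because its one strong link comes from the highest-ranked page: the quality of the linking sites counts, not just their number.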
Dynamic info
Click-through (Direct Hit)
But that only goes for the attractiveness of the link!
No statement about what is beyond the link!
(Can do smart evaluation, Thorsten Joachims, but this problem remains)
(Unless more complicated models)
Exploration vs exploitation trade-off
Trajectories of people
…. And… Money: Goto/Overture: rank by $$
Right combination: $$ * click-through
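A sketch of why "$$ * click-through" differs from ranking by bid alone. The advertisers, bids and click-through rates are invented; the only idea taken from the notes is to rank by the product of the two.

```python
# Hypothetical sponsored links: (advertiser, bid per click in $, observed click-through rate).
ads = [
    ("high_bidder", 2.00, 0.01),
    ("relevant_ad", 0.50, 0.08),
    ("middle_ad", 1.00, 0.03),
]

def expected_revenue(ad):
    """Expected revenue per impression = bid per click * probability of a click."""
    _, bid, ctr = ad
    return bid * ctr

by_bid = sorted(ads, key=lambda ad: ad[1], reverse=True)
by_revenue = sorted(ads, key=expected_revenue, reverse=True)

print([name for name, _, _ in by_bid])
print([name for name, _, _ in by_revenue])
```

With these numbers the two orderings are exact opposites: the cheapest bid wins the top slot once its much higher click-through rate is taken into account.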
Beliefs
No human editors
Focus on the process
Other people’s work
Economies of scale
Provide the means of production
“Platform”
Empower people to become authors, editors
People want to contribute, (1) make it easy for them to contribute, (2) organize their contributions, (3) learn, and channel them in the direction of max leverage
Limitations: political views enter:
Loved it: nobody answered.
just nobody clicked
Power of community
So far, from the perspective of the company providing search…
Economics of search
Advertising.com
Discuss who sees what
Product search
Price Comparison sites (aggregators)
Shopping.com
NexTag.com
How do ppl search on P2P networks
How does their behavior change according to what they get back / mental model of the space
What can be done to guide them to more advanced search?
People search
User = product
Consider IM activity
… Now from the perspective of the user
How much time do you spend searching per week?
No such notion of a “session” any more
Personalization
Any search engine in 5 years: knows your preferences, highly targeted
Focus on customer, focus on creating value. Not about window dressing and balance sheets; creating transparency through data mining
Localization
Knowing past searches
But for this, need to start a major collection engine that does collect data about the past
Toolbars: weak value props
Economics for the end-user
Advantage Amazon has over companies that don’t sell physical goods
Class 6/ Guests: Chris Pirkner, Jon Herlocker
Presentation on the Web
Class 7/ Guest: Jim Oliver, eBay
Discussion: PERSONALIZATION
Desiderata: What are desired properties of a personalization system?
Purpose: Why do we want to offer the customer a personalized experience?
Class
Not static, but adaptive (slow time scale)
Detect current situation (fast time scale)
Understandability / Interpretability
Making things easier
Save time
Especially on repeated tasks
Help user discover things
Metrics: How do we measure whether the system does a good / bad job?
Class
Accuracy – defn?
Relevance – who determines relevance?
Ask user explicitly?
Does user click at it (short-term)? Does the user return?
Timing / Context
Note: is a loop!!
P(purchase)??
Baselines?? Empirical density
Pick random items leaving the warehouse now
Customer satisfaction?
How do we measure that??
Time scales: Short-term vs long-term
Will it dramatically differ according to personality?
Track more things
Esp behavior over time
Proxies for:
Satisfaction
User experience
Inputs: What data are used to build the model (variable selection)?
Class
Device (mobile phone vs PC)
Geolocation / IP Address
Shopping history, both browsing and the order history
Past price sensitivity
Key distinction: Passive observation vs active collection
Personalization
Implicit: customer has no choice
…vs Customization
Explicit set of choices, but people are not doing it unless they see what’s in it for them
Explicit ratings (like / dislike)
e.g., as basis for recommender system
Disambiguates the otherwise unary bought vs don’t know
Amount of inference
From just storing past searches…
To having complicated models
User acceptance: are users willing to spend the time to understand why they see what they see?
Related
Multi-modal personalization
Can view as schizophrenic, multiple personalities
Either user explicitly logs in with different identities, or sorts actions to belong to identities, or
Occasionalization
Room-mate, visit, hook-up
Availability of historical data
Personalization by definition based on persistent data
Discuss range of data collection
Stateless
Scope = session
Scope = entire history
Do we want to enable the user to edit their history?
What people edit out is informative
Or at least turn it on or off?
Outputs: What does the model produce?
Class
Perhaps build model of preferences
Characterization of the situation
Mode: Push vs pull for delivery of the result
Independent of initial
Company-initiated vs customer-initiated activity?
Exploration vs exploitation
Wants to delight the customer
Acceptance: What does the customer like? What scares them?
Does it depend on the individual customer?
Psychology of customization
What drives people to get special ringtones?
Want to express themselves, define themselves
Who experiences the results of the customization?
Psychographics vs more “objective”
What does the customer consider personal or private (vs public)?
Blogs
Anonymity / Pseudonymity?
Self-realization
Psychology of personalization
Pay-off matrices: individual differences in false-positives, false negatives
Careful to keep in mind the alternative of no customization / personalization
Value propositions
What’s in it for the user?
Co-creation, participatory design
Explicit incentives
Implicit
E.g. better ranking of search results based on their past click behavior
How much are they willing to work for good personalization?
What’s in it for the sponsor?
Lifecycle
Up-sell
More features, understand how far user willing to go (e.g., good salesperson!)
What cues beyond clicks will be possible?
Interaction / Discourse / Mode switching: ask key questions for decision explicitly (rather than let user build a mental model of the system!)
Cross-sell
Market basket analysis
Association rules of stuff bought (or in shopping basket)
Info through navigation
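A sketch of association rules for market basket analysis, assuming five made-up baskets and the usual support, confidence and lift definitions. The notes do not specify which measures were used in class.

```python
from itertools import combinations
from collections import Counter

# Hypothetical shopping baskets.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "wipes"},
    {"beer", "chips"},
    {"diapers", "beer", "wipes"},
]

item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

def rule_stats(antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    n = len(baskets)
    both = pair_counts[tuple(sorted((antecedent, consequent)))]
    support = both / n
    confidence = both / item_counts[antecedent]
    lift = confidence / (item_counts[consequent] / n)
    return support, confidence, lift

support, confidence, lift = rule_stats("diapers", "beer")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

Lift compares the rule against the baseline rate of the consequent: a high confidence alone can simply mean that the consequent is popular with everyone.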
Risk
Trust vs bad PR
User education is key, but how? Who takes the time??
Privacy, Info leakage
What would be good examples?
Ads
Ask about elevator
Computer Operating system
Open Sesame, ca 1995, suggested what files to open next
Search: How would you do customized vs personalized search
Geolocation key
Supershuttle example
A historic perspective: Personalization 4 years ago
What has disappeared? Why? What has survived? Why?
Industry Standard summary
http://www.thestandard.com/article/display/0,1151,12444,00.html
March 06, 2000
The Profilers: Getting Personal
Web software applications that track consumer behavior are springing up all over, and
some of them aren't invasive at all.
By Jenny Oh
E-businesses and advertisers are continually searching for new ways
to personalize the Web experience and attract new customers.
Consumers, meanwhile, are increasingly concerned about controlling
their personal information on the Internet. Fueling the privacy race
is a variety of companies that offer software and Web-based
applications that serve both needs. The following is a rundown of
some of the services that are currently on the market, and others
that soon will be.
COMPANY: Alexa (San Francisco)
PRODUCTS: Alexa's downloadable navigational bar tracks and aggregates users' Web visits and provides a Related Links service. Owned by Amazon.com, the company also offers a beta version of zBubbles, a product-recommendation guide.
PRICING: Free to consumers.
CUSTOMERS: 5 million downloads of Alexa's navigational bar; 250,000 downloads of zBubbles.
COMPANY: Andromedia (acquired by Macromedia in October 1999) (San Francisco)
PRODUCTS: LikeMinds personalization software aggregates information provided by Web visitors and makes product recommendations based on data collected from "like-minded" users.
PRICING: Starts at $25,000 for 50,000 users and can go as high as $100,000.
CUSTOMERS: More than 120 customers, including the Boston Herald, Cinemax, E-Trade, Levi Strauss, Sun Microsystems and the U.S. Postal Service.
COMPANY: CoreMetrics (San Francisco)
PRODUCTS: ELuminate Web-based application tracks and analyzes consumers' browsing and buying behavior.
PRICING: Plans to charge a monthly service fee.
CUSTOMERS: Service will go live in mid-April.
COMPANY: DoubleClick (New York)
PRODUCTS: DART technology profiles users based on the ads they click. Boomerang technology tracks product and service preferences of consumers visiting any of DoubleClick Network's 750-plus sites and creates a list of frequent buyers or previous visitors for advertising. Through its merger with Abacus, DoubleClick offers consumer purchasing-behavior information to marketers.
PRICING: CPM rates vary depending on whether the advertiser belongs to the DoubleClick Network and the number of targeting filters.
CUSTOMERS: DoubleClick Network serves over 11,000 sites, including Kelley Blue Book, Thomson Financial Interactive and MindSpring.
COMPANY: Engage Technologies (Andover, Mass.)
PRODUCTS: ProfileServer technology tracks customer preferences at specific Web sites. AudienceNet tracks users' browsing habits, then draws on its database of 42 million demographic-based profiles to deliver targeted ads.
PRICING: Pricing for ProfileServer varies. Cost for AudienceNet is CPM-based.
CUSTOMERS: 1,400 customers, including CNET, NetNoir and Image Networks. This figure also includes users of Engage services and products other than those mentioned here.
COMPANY: Lumeria (Berkeley, Calif.)
PRODUCTS: SuperProfile technology lets consumers customize and selectively share their personal profiles with Web businesses. The Lumeria Ad Network delivers ads based on consumers' profiles and pays consumers for each click-through.
PRICING: SuperProfile is free to individuals. Lumeria charges an undisclosed ad rate to businesses.
CUSTOMERS: Scheduled to ship in April.
COMPANY: MatchLogic (Westminster, Colo.)
PRODUCTS: TrueMatch technology uses cookies to determine demographic information about a Web surfer, then uses MatchLogic's database of 72 million profiles to deliver targeted ads.
PRICING: CPM rate varies.
CUSTOMERS: 400 customers, primarily advertisers. This figure includes customers of other MatchLogic products as well as TrueMatch.
COMPANY: Net Perceptions (Eden Prairie, Minn.)
PRODUCTS: E-commerce and Ad Targeting software generate profiles through direct customer queries and by monitoring browsing habits. The software then makes product recommendations and delivers ads.
PRICING: The annual license fee for E-commerce starts at $75,000 for up to 50,000 users. Annual license fee for Ad Targeting begins at $25,000 for sites serving up to 1 million ads per month and goes as high as $250,000 for sites serving 500 million ads per month.
CUSTOMERS: 156 customers, including Dean & Deluca, eToys, HomeGrocer, Lycos and Virgin Online Megastore.
COMPANY: Personify (San Francisco)
PRODUCTS: Essentials server software identifies patterns in consumers' Web behavior and integrates data from user registrations and from offline databases. Proactive e-mail app generates targeted e-mail lists based on the same data.
PRICING: License fee for the basic package, which includes Essentials plus one add-on option, is $385,000; the premiere package, which includes Essentials, Proactive and four add-on options, is $490,000.
CUSTOMERS: 75 customers including Drkoop.com, J. Crew, Onsale and Patagonia.
COMPANY: Predictive Networks (Boston)
PRODUCTS: Predictive Networks' technology enables ISPs to track the ads and Web sites that consumers click, then aggregate their preferences to deliver highly targeted advertising.
PRICING: N/A
CUSTOMERS: Internet service providers and advertising agencies.
COMPANY: PrivaSeek (Broomfield, Colo.)
PRODUCTS: Persona technology allows consumers to customize and selectively share their personal profiles with Web businesses; functions as an e-wallet.
PRICING: Free to consumers.
CUSTOMERS: N/A
COMPANY: WinWin (Boston)
PRODUCTS: WinWin technology delivers ads based on demographic and psychographic information provided by individual consumers.
PRICING: Advertisers pay anywhere from 1 cent to $1 per ad, depending on the level of interaction with the consumer.
CUSTOMERS: Hook Media.
COMPANY: Younology (New York)
PRODUCTS: Orby technology creates profiles by tracking click-throughs and directly asking consumers about their preferences. Total Perspective for e-businesses tracks Orby users to create a personalized shopping experience. SmartSense Server links consumer profiles to business suppliers and partners in real time.
PRICING: Orby is free to consumers. Total Perspective costs $1,500 per site server plus monthly fees ranging from $2,250 to $60,000 depending on the number of registered Orby users. There is a $9,995 site license fee to link to SmartSense Server.
CUSTOMERS: Scheduled to launch in mid-March. Customers include the Knot, Travelbreak.com and Fuxito Worldwide.
Article from ZDNet
http://www.zdnet.com/pcmag/stories/reviews/0,6755,2327781,00.html
Class 8: Infomediary + Unbundling the corporation
Reduce cost of interaction / information / transaction
US 70% interaction activities
India 40%
Negotiation tools – Impact of IT?
Where to spend the time of the human??
Good point: free up / reallocate that time to more creativity / redeploy
Existence of firm: shaped by interaction costs
As technology shifts, customers empowered to obtain more info as well as NEGOTIATE with the customers
What is a market
Customers finding vendors that are best / good matches
Infomediary: power is shifting (not sudden, over decades in US: from producers to retailers!!)
Cf: Wal*Mart becoming a customer agent, squeezing out suppliers
Also: Dell.
What is an Infomediary
Agent who acts on behalf of the customer
Net worth: NOT a privacy event, only an element of the concept
Greater value: to be helpful to find vendors
What factors need to come together to have these new businesses develop?
Examples of startups that emerged (and failed)?
Late 90’s: about eyeballs, fast growth
(1) BUT not enough quality commitment, takes time to build profiles
Q: sub-markets?
(2) Overly optimistic on technology (XML already 5 years ago!)
(3) Successful intermediary requires trust and profiles / customer information (exists in larger companies, banks, retailers…)
… but their mindset is not that of an agent!!! They just want to sell more products
-> clean-slate startups / entrepreneurial
Q: verisign? Deutsche Post?
Positive examples in business setting
Li & Fung
HK based, 100 years old
They act as agent on behalf of apparel designers
They have 7.5 companies whose capabilities they know very well
They engineer business processes on behalf of their customers
Of course, done with very little technology. Only now moving toward technology
They are spending a lot of time up front to discuss trade-offs etc
After 9/11, they managed to move production from Pakistan to “safer” countries
USD 5bn in revenues, similar to Amazon.com
What is the bottleneck?
Not technology, but mindset
Business issue of not thinking to send traffic to others
Cf Alibaba.com
Freemarkets (Pittsburgh), recently acquired by Ariba
They would go out and qualify vendors based on the needs
Then they create a reverse auction, having
Alternative technologies, but different business models
Has search replaced the Infomediary?
BUT: only a small part is searchable!!
What about personal agents (“agent-based technologies”)?
Technology: What can be automated?
Platform
Reputation systems
Market design: What are good market mechanisms for information?
Economic schemes
Is it just that companies can get away without paying??
Or do we need to define paying more broadly (with attention, with time)
Example
Hire-right: stupid that they don’t contact the customer
LinkedIn
Why is it so hard to get companies (e.g. HR departments) to close a loop
Issues
Validity of information!!
Acxiom, but what’s in it for the end-customer??
Also: Need mechanisms for fixing simply incorrect information
Unbundling the Corporation
Less visionary, more what’s on the way already
Initially bundled because of communication overhead… but very different mindsets and cultures!!
1. Customer relationship
2. Infrastructure
High-volume, routine
E.g., manage a logistics network
3. Product innovation
Product lifecycle
What has happened in the 5 years since this article?
Companies outsourcing core operations etc
Companies driven by taking out cost of their business
Sport shoes
Design (cf GA’s)
www.spectorCNE.com (98 SLAC story)
Thoughts on privacy
Auction off information?
Add: RFIDs, .4mm on the side, 5cent, dropping to 1c, typically 128bits
Up to 10+ meters
Wireless cameras
Micro-GPS
My HK transmitter
Header info/ Mike Schwartz paper
Blogs.
IBM CN example: characters expressed in Pinyin and translated into English
Understand what is non-communication recording (eg GPS in rental cars), and what is communicating (Payless: $1 / mile outside CA/NV)
Can we restrict propagation of existing info (whether correct or not)
Drive through toll-station
Clothing: conclude from RFIDs + mobile phones who you are
(But then of course lost and found)
Yet -- same tech can make inventory / supply chain etc way more effective (Wal-mart estimate: save 8bn / year; total sales: 250bn)
31M surveillance cameras in the world. Data production amount??
4M in UK (-10yrs: 0.15M)
Smart badges, ca 1990 at then-Xerox PARC: made your phone ring in the office you happen to be in (pre-mobile era)
Phone sensing volume and being smart about it; elevators
A dozen syphilis cases in SF ca 1999 could be traced back to a chatroom
SH jiao tong card/ metro: by the minute and second which gate you entered, which you left
HK: Octopus card
Should people be made aware when they are being checked out?
(Show log of weigend.com)
In this world: ppl worry about cookies!
Requesting authentication...: listening habits
Already: TiVo PVR (Personal Video Recorder, cf the NYT story)
So, will my fridge be talking to my mobile when I am at safeway, and they have caviar on sale?
Or will safeway be talking to me since they know?
Class 9: Spam: Rick Giarrusso
Email: Relatively recent phenomenon
Mid-80s: email rare
Mid-90s: email popular, but not spam yet
What is Spam?
Unsolicited email
(Will frequency of sent-out emails make a difference to the perception of whether it is spam?)
Individual vs community
1) Individual recipient
Manually maintaining a list
Of senders
Of keywords…
But lots of difficulties with rules, such as the order they are applied in
2) Community based approach
Implicit data
Lots of emails arriving from a certain sender
Explicit
Empowering users to label emails as spam
Probabilistic model (Bayesian network)
Explicit vs implicit data
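A sketch of a probabilistic spam model trained on explicit user labels. It uses naive Bayes, the simplest special case of a Bayesian network (each word depends only on the spam / not-spam label); this is a stand-in, not the model presented in class. The four training emails and the add-one smoothing are illustrative assumptions.

```python
import math
from collections import Counter

def train(labeled_emails):
    """Count words per class from (text, is_spam) pairs labeled by users."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for text, is_spam in labeled_emails:
        class_counts[is_spam] += 1
        word_counts[is_spam].update(text.lower().split())
    return word_counts, class_counts

def spam_probability(text, word_counts, class_counts):
    """P(spam | words) under the naive Bayes independence assumption."""
    vocab = set(word_counts[True]) | set(word_counts[False])
    log_score = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the product.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        log_score[label] = score
    # Normalize the two log scores into a probability.
    top = max(log_score.values())
    exp = {label: math.exp(s - top) for label, s in log_score.items()}
    return exp[True] / (exp[True] + exp[False])

# Hypothetical user-labeled messages.
emails = [
    ("cheap pills buy now", True),
    ("buy cheap watches now", True),
    ("meeting notes for class", False),
    ("project meeting on friday", False),
]
word_counts, class_counts = train(emails)
print(round(spam_probability("buy cheap pills", word_counts, class_counts), 3))
print(round(spam_probability("class project meeting", word_counts, class_counts), 3))
```

Because the model learns only from the labels it receives, users who press the spam button when they mean delete feed it wrong training data.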
Issues
Problem: Users need to understand the result of their actions
People using the spam button instead of the delete button
How do you train users?
Different people have different modes to learn things
What info to use?
Header info, including TCP/IP
Arms race
White-colored text on white background
(non-copyrighted stuff) to swamp the filter
Return address from
Note: Spammers and virus creators make interesting bedfellows
People and Processes
Ex: Scenarios
Interesting: build a Bayes Net to model the outcome of a litigation
Helps to focus people on what they really are thinking about / communication issues
But: communication does not mean PPT slides…
Why do people not have those conversations??
Logical, rational way of doing problem solving?
What role does data mining play?
Processes
How do you quantify the benefits?
Chuck Lam’s insight:
Scientists are trained to ask questions
MBAs are trained to give answers :-)
Pricing
Class 10: Analyzing proxy data and data from other sources to create information products for Wall Street
www.weigend.com/tmp has several papers by Majestic Research (MR…)
Speaker: Seth Goldstein, CEO of Majestic
Followed by Reception in the stats department
Goal
Create transparency
Key driver
Focus on proprietary data
Not accounting irregularities, forensic stuff
Dimensions of information products
Latency
Granularity
Exclusivity
2-dimensional plot:
Along the diagonal: Tools – Reports – Services – Custom
X: $1 … $1bn
Y: 1000’s of clients -> 1 client
Which are good stocks for them to predict?
Retail, not Enterprise software
Enterprise: might be made or killed by one consumer
Travel
Initially
Cool: Then look at their costs! I.e., prices they are paying for keywords (Isn’t that smart?!)
And 2 more dimensions
Thin – Liquid
Calm - Volatile
Data Company or Research Company?
Service company?
How are the data analyzed?
For any equity, there is a model out there
Same store sales
New subscribers
License revenue
NPD monthly sales
…
Import, scrub and categorize initial data
Build models
Assess results
Confidence intervals
Analysis of residuals
Normality assumptions
Out-of-sample testing
If positive assessment, automate the data collection and model updating process
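A sketch of the build / assess steps on synthetic data. The proxy variable, the linear relationship and the noise level are invented; a one-variable least-squares fit stands in for whatever model is appropriate for a given equity.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def rmse(xs, ys, a, b):
    """Root mean squared error of the fitted line on the given points."""
    return (sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5

rng = random.Random(42)
# Hypothetical proxy (e.g. weekly site traffic) and target (reported sales).
proxy = [rng.uniform(0, 100) for _ in range(60)]
sales = [5 + 2 * x + rng.gauss(0, 10) for x in proxy]

# Fit on the first 40 observations, hold out the last 20.
a, b = fit_line(proxy[:40], sales[:40])
in_sample = rmse(proxy[:40], sales[:40], a, b)
out_of_sample = rmse(proxy[40:], sales[40:], a, b)
residuals = [y - (a + b * x) for x, y in zip(proxy[:40], sales[:40])]

print(f"slope={b:.2f} in-sample RMSE={in_sample:.1f} out-of-sample RMSE={out_of_sample:.1f}")
```

The held-out error is the number to trust. The `residuals` list is the starting point for the residual analysis and normality checks listed above.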
What makes an analyst?
Quantitative
Creative