蔣以仁
Search Problem
Search Query: Jaguar




Jaguar(Animal)
Jaguar(Automobile)
Jaguar(Watch)
Jaguar(OS)
Monika Henzinger, Search Technologies for the Internet. Science, Vol. 317, no. 5837, 468–471, 27 July 2007.
The purpose of records is to create knowledge for the future
 …records are recognized as agency assets used to
underpin current business and legal needs, as well as
the basis for a knowledge management system to meet
future goals. – HOWARD P. LOWELL
Director
Modern Records Programs
NARA
資料探勘走向決策支援
 彙整同一性質資料
 資料探勘以產生關聯相依規律
 視覺化顯示協助專家研判主題
 定義處理指引方便建立決策支援
KDD Process
(Figure) Data → Selection → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge
Data Warehouse BI Architecture
(Figure)
Data Sources: Operational DBs, Other sources
Monitor & Integrator: Extract, Transform, Load, Refresh
Complete Data Warehouse (with metadata), Data Marts
OLAP Servers
Business Intelligence Tools:
1. Comprehensive Performance Management
2. Analysis
3. Query
4. Reports
5. Data mining
Gaining market intelligence from news feeds
Sreekumar Sukumaran and Ashish Sureka
Signal
• Dr. Bhandari said, “I first noticed this when the
New York Times did an analysis after the fact
showing that early indications of the Ford Explorer-Firestone tire
problem went undetected in a federal database. Recently, a similar analysis by
CNN showed that early indications of security
problems at Logan, Dulles, and Newark airports
went undetected in a federal database well before
the September 11 tragedy. It is clear that the cost of
missing these patterns is too high to be ignored.”
Information Integration
Mining target: individual text
Mining unit:
> texts
> category-labeled items extracted from text using NLP
(Figure) IBM TAKMI pipeline
Original Data
  Structured Data: Call Taker: James; Date: Aug. 30, 2002; Duration: 10 min.; CustomerID: ADC00123
  Unstructured Data: Q: cust sys has stopped working. A: checked cust bios and it need updated. …
Linguistic Analysis: Tagging, Dependency Analysis, Named Entity Extraction, Intention Analysis
Dictionaries: Category Dictionary, Synonym Dictionary
Meta Data (Category: Item)
  [Call Taker] James
  [Date] 2002/08/30
  [Duration] 10 min.
  [CustomerID] ADC00123
  [Noun] Customer
  [Software] BIOS
  [Subj...Verb] customer system..stop
  [SW..Problem] BIOS..need
Mining
Visualization & Interactive Mining
IBM TAKMI (Nasukawa, Nagano, 1999)
What the Medical Literature Tells Us
• Source of medical literature: Medline
• Causal associations between diseases, symptoms, and drugs or compounds can be discovered
1. Swanson DR. Searching natural language text by computer. Machine indexing and text searching offer an approach to the basic problems of library automation. Science. 132:1099–1104, 21 Oct. 1960.
2. Swanson DR. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect Biol Med. 30(1):7–18, 1986.
3. Swanson, D.R., Complementary structures in disjoint science literatures. In A. Bookstein, et al (Eds.), SIGIR91: Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, Chicago, Oct 13-16, 280-289, 1991.
Migraine?
• Stress is associated with migraines
• Stress can lead to loss of magnesium
• Calcium channel blockers prevent some migraines
• Magnesium is a natural calcium channel blocker
• Spreading cortical depression (SCD) is implicated in some migraines
• High levels of magnesium inhibit SCD
• Migraine patients have high platelet aggregability
• Magnesium can suppress platelet aggregability
Smalheiser, N.R. & Swanson, D.R. Assessing a gap in the biomedical literature: Magnesium deficiency and neurologic disease. Neuroscience Research Communications, 15, 1-9, 1994.
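The paired statements above follow a pattern: each intermediate concept is tied to migraine in one literature and to magnesium in another. A minimal Python sketch of that linking step is given here, assuming the concepts have already been extracted; the two dictionaries are hand-coded stand-ins built from the bullets above, and the function names are hypothetical, not part of any published tool.

```python
# Hand-coded stand-ins for concepts mined from two literatures.
# Assumption: a real system would extract these from Medline records.
MIGRAINE_LITERATURE = {
    "stress": "Stress is associated with migraines",
    "calcium channel blockers": "Calcium channel blockers prevent some migraines",
    "spreading cortical depression": "SCD is implicated in some migraines",
    "platelet aggregability": "Migraine patients have high platelet aggregability",
}
MAGNESIUM_LITERATURE = {
    "stress": "Stress can lead to loss of magnesium",
    "calcium channel blockers": "Magnesium is a natural calcium channel blocker",
    "spreading cortical depression": "High levels of magnesium inhibit SCD",
    "platelet aggregability": "Magnesium can suppress platelet aggregability",
}


def linking_concepts(lit_a, lit_c):
    """Return the intermediate concepts that occur in both literatures."""
    return sorted(set(lit_a) & set(lit_c))


def hypothesis_evidence(lit_a, lit_c):
    """Pair up the two statements that share each intermediate concept."""
    return [(b, lit_a[b], lit_c[b]) for b in linking_concepts(lit_a, lit_c)]


for concept, a_side, c_side in hypothesis_evidence(
        MIGRAINE_LITERATURE, MAGNESIUM_LITERATURE):
    print(f"{concept}: {a_side} / {c_side}")
```

Each printed line is one candidate link between the two literatures; whether the link is clinically meaningful still has to be judged by an expert.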
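Evidence from the Literature
(Figure) Two disjoint literatures, All Migraine Research (migraine) and All Nutrition Research (magnesium), linked by the intermediate concepts CCB, PA, SCD, and stress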
Finding New Clues
Hypothesis generation
Raynaud's phenomenon (Raynaud's)
Fish oils
Intermediate concepts:
• vasoconstriction
• platelet aggregation
• blood viscosity
Swanson, D.R. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect Biol Med. 30(1):7-18, Autumn 1986.
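A Technology That Must Be Mentioned: Natural Language Processing (NLP)
• Began in 1948 with a dictionary look-up system at Birkbeck College, London
• 1949: Warren Weaver, American interest, code breaking
• 1950: machine translation (German to English, Russian to English)
• 1966 onward: much thunder, little rain
• Word-for-word machine translation (Dr. Eye?)
• NLP brought the first hostility of research funding agencies.
• NLP gave AI a bad name before AI had a name.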
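Massive Growth of Information
In 2006 the volume of digital information reached 161 billion GB (161 exabytes).
IDC estimates that information will grow roughly sixfold between 2006 and 2010.
By 2010, nearly 70% of the information in the digital universe will be created by individual users, while organizations will be responsible for the security, privacy, reliability, and regulatory compliance of at least 85% of it.
The Expanding Digital Universe,
http://www.emc.com/leadership/digitaluniverse/expanding-digital-universe.htm
(Figure) Chart on a 0-100 scale comparing data volume and market value for unstructured data (web messages, news reports, patents, e-mail, documents, …) and structured data (Oracle)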
Search Engine Roadmap
Exploratory Search
Affiliation
(Topic Relevance
Analysis)
Dictionary/Ontology
Wikis
Full Text Search
Including complex
Boolean search
Clustering/Categorize
Synonym/Anatomy
Document Abstraction
Custom Search
Knowledge collaborative
search
Filtering
Crawler
Integrate other
search engines
Summarization
(mobile)
Multiple abstracts
organization
Search log recorder
Personal tagging
Sharing
Forum/Blogger
Customized meta-search
Taxonomy search
Natural language processing/
understanding
Web Page Features
Extraction
(semi- and un-structure)
Feature Ranking
Feature Mapping
Recommendation
Taxonomy Search
Visual Technology
Ajax
Topological
Graphics
Web 2.0 or upper
Collaborative Filtering
Visualization
Web Search Engines
• Crawl web pages offline and store the data in an internal structure called an inverted index
• Online retrieval
Monika Henzinger, Search Technologies for the Internet. Science, Vol. 317, no. 5837, 468–471, 27 July 2007.
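The offline/online split above can be illustrated with a toy inverted index. This is a minimal sketch, not a description of any production engine: the three "pages" are made-up strings, and the function names are assumptions introduced only for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Offline step: map each term to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in tokenize(text):
            index[term].add(page_id)
    return index


def search(index, query):
    """Online step: return the ids of pages containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Made-up pages echoing the ambiguous "Jaguar" query from the opening slide.
pages = {
    "p1": "Jaguar is a big cat found in the Americas",
    "p2": "Jaguar cars are built in England",
    "p3": "Mac OS X Jaguar is an operating system",
}
index = build_inverted_index(pages)
print(search(index, "jaguar"))       # all three pages match
print(search(index, "jaguar cars"))  # only the automobile page matches
```

The single-word query matches every sense of "Jaguar", which is the ambiguity problem the deck opens with; adding a second term narrows the posting-list intersection.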
Search Engine Problems
Index Comprehensiveness
Relevance
Deterministic Search
 Search Query
 Jaguar(Animal)
 Jaguar(Automobile)
 Jaguar(Watch)
 Jaguar(OS)
 Problem: Scalability
J. Beall, The Weaknesses of Full-Text Searching. The Journal of Academic Librarianship, 34(5):438-444, 2008.
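Evolution of Search Engines
• First generation: uses only on-page text data (1995-1997: AV, Excite, Lycos, etc.)
  • Term frequency, language
• Second generation: uses off-page, web-specific data (from 1998; made popular by Google, but everyone now)
  • Link analysis
  • Click-through data (what results people click on)
  • Anchor text (hyperlinks, how people refer to this page)
• Third generation: answers "the need behind the query" (still experimental)
  • Semantic analysis: what is this about?
  • Focus on what the user needs, not just the query
  • Inference of key information
  • Assisting the user
  • Integration of search and document analysis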
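Web Search Problems
• Problems
  • Queries are too short and not precise enough
  • Synonyms and similar terms make query matching hard to predict
  • Deliberately misleading arrangements by page authors make search results unsatisfactory
  • Users need extra functions, such as filters
• Solutions
  • Better understanding
  • Result ranking
• Trailblazer
  • Car
  • Basketball team
Monika Henzinger, Search Technologies for the Internet. Science, Vol. 317, no. 5837, 468–471, 27 July 2007.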
Expand
Crawler
Basic Crawler
Wrapper/Clipper
XHTML, DHTML Parser
Feature Transformation
XML Parser
Structural Features
Extraction
HTML Parser
…
Scheduling
Clipper Windows
Specify
Heritrix
Crawler
Unstructured Document
Features Extractions
(NLP)
Feature Mapping
Ontological Organization
Specific Feature Parse
…
Filtering
…
Ontology
Machine Learning Approach
Semantic Crawler
P2P Knowledge Sharing
Crawler
Crawler Classes
Annotated Crawler
Crawl with specific terms/phrases
Crawler
Outside Search
Engine
Supporting Information
from original sources &
Reference contents
Filter
Data Sources
Learner
Relevant
Information
Feed into Reference List
Authoring
User process
Filtering NE
records
Web
Crawler Classes
Page/Section/Block/Item Specify
GUI Specification System
Scheduler
Crawler
Notify for manual tuning
Logger
Adaptor
Log
Named Entity Recognition
Comparator
Compare the extracted structure
between two stages
Feature Extractor
Repository
Temporal Information Aggregation
Event analysis
Clustered retrieval
1. Walter Warnick, Problems of Searching in Web Databases. Science, Vol. 316, no. 5829, 1284, June 2007.
2. I-Jen Chiang, Discover the Semantic Topology in High-Dimensional Data. Expert Systems with Applications, 33(1), September 2007.
Technical Architecture Sketch
(Figure) Raw text → Tokenized text → Stemming & Stop words → Term Weighting → document-term matrix
Document-term matrix (rows: documents d1 … dm; columns: terms t1 … tn):
      t1    t2    …    tn
d1    w11   w12   …    w1n
d2    w21   w22   …    w2n
…
dm    wm1   wm2   …    wmn
Term similarity, Doc similarity → Clustering
Vector centroid, Sentence selection → Summary
META-DATA/ANNOTATION → Classification / document tracking
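A minimal Python sketch of the term-weighting step and the vector centroid follows. The slide does not say which weighting scheme is used; TF-IDF is assumed here as one common choice, the stop-word list is purely illustrative, stemming is omitted, and all function names are hypothetical.

```python
import math
from collections import Counter

# Illustrative stop-word list (an assumption, not taken from the slides).
STOP_WORDS = {"the", "a", "of", "in", "is", "and"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (no stemming)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf_matrix(docs):
    """Return (terms, matrix) with matrix[i][j] = tf(d_i, t_j) * log(m / df(t_j))."""
    tokenized = [tokenize(d) for d in docs]
    terms = sorted({t for doc in tokenized for t in doc})
    m = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(t for doc in tokenized for t in set(doc))
    matrix = []
    for doc in tokenized:
        tf = Counter(doc)
        matrix.append([tf[t] * math.log(m / df[t]) for t in terms])
    return terms, matrix


def centroid(rows):
    """Component-wise mean of a group of document vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


docs = ["the jaguar is a cat", "the jaguar is a car", "cat and car"]
terms, matrix = tfidf_matrix(docs)
print(terms)
print(centroid(matrix[:2]))  # centroid of the two "jaguar" documents
```

A cluster centroid computed this way can serve as the reference vector when selecting representative sentences for a summary.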
Salton’s Vector Space Model
Bag of Words
• Cosine similarity: the cosine of the angle θ between document vectors A and B
• Jaccard index (Jaccard similarity coefficient, Tanimoto coefficient)
G. Salton, A. Wong, and C. S. Yang, "A Vector Space Model for Automatic Indexing," Communications of the ACM, vol. 18, no. 11, 613–620, 1975.
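Both measures can be sketched in a few lines of Python. Cosine similarity is computed here on weight vectors, and the Jaccard index on term sets; the handling of empty inputs is a convention chosen for this example, not something the slides specify.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (|A| |B|); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_index(a, b):
    """|A intersect B| / |A union B| on term sets; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


print(cosine_similarity([1, 2, 0], [2, 4, 0]))          # parallel vectors
print(jaccard_index({"jaguar", "cat"}, {"jaguar", "car"}))
```

Cosine similarity ignores vector length, so a long and a short document on the same topic still score close to 1.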
Curse of Dimensionality
Ambiguous sentence: I saw the man on the hill with telescope
1. Using a telescope, I saw a man who was on a hill.
2. I saw a man who was on a hill and who had a telescope.
3. I saw a man who was on the hill that has a telescope on it.
New Directions in Natural Language Processing
(Figure) Training: training sentences and answers → Training Program → NE Models
(Figure) Extraction: Speech → Speech Recognition → Text → Extractor (with NE Models) → Entities
Example text: The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic.
Entity types: Location, Person, Organization
• Prior to 1997: no learning approach competitive with hand-built rule systems
• Since 1997: statistical approaches (BBN (Bikel et al. 1997), NYU, MITRE, CMU/JustSystems) achieve state-of-the-art performance
1. M. Marcus. New trends in natural language processing: Statistical natural language processing. PNAS. 92. 10052-10059, 1995.
2. Current Trends in Biomedical Natural Language Processing, Ohio State University, June 2008.
3. Tanveer Siddiqui. Natural Language Processing and Information Retrieval. Oxford Univ Press, 2008.
4. Yorick Wilks. Natural Language Processing as a Foundation of the Semantic Web. Foundations and Trends® in Web Science, 1(3-4). 199-327, 2009.
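Knowledge Map
I-Jen Chiang
Event tracking
Information retrieval
Knowledge concepts
Dependencies among the events that occur within a topic
Query to understand the related arguments within a topic
Perspective of an argument (by agency, case subject, etc.)
Influences acting on an event within a topic
Influence exerted by an event within a topic
Track the handling status of events over time
Drill into details to understand phenomena and responses
Weigh priorities to understand handling principles
Event tracking analyzes shifts in the main thread of a topic
Under the prefabricated-housing topic
The government's 100-billion credit guarantee fund for rebuilding victims' housing in earthquake-stricken areas lets victims obtain loans
Under the prefabricated-housing topic
Enactment of the reconstruction statute, covering construction works and subsidies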
Integrated BI Systems
ETL
Complete Data
Warehouse
RDBMS
Structural Data
File System
XML
XML
Text tagger & Annotator
ETL
DBMS
Intermedia Data
EA
Unstructured Data
Legacy
CMS
Scanned
Documents
Email
Sreekumar Sukumaran and Ashish Sureka
Annotation
Input: On November 16, 2005, IBM announced it had acquired Collation, a privately held company based in Redwood City, California for undisclosed amount.
Annotation types: Date, Acquiring Organization, Acquisition Event, Acquired Organization, Place, Amount
Text Annotator: output to RDBMS
Date | Organization | Place | Amount
Nov. 16 | IBM | Redwood City, CA | Undisclosed
Text Annotator: XML output
On <Date>November 16, 2005</Date>, <ACQUIRING ORG>IBM</ACQUIRING ORG> announced it had <ACQUISITION EVENT>acquired</ACQUISITION EVENT> <ACQUIRED ORG>Collation</ACQUIRED ORG>, a privately held company based in <PLACE>Redwood City, California</PLACE> for <AMOUNT>undisclosed</AMOUNT> amount.
McIlraith, S.A., Son, T.C., Zeng, H.: Semantic web services. IEEE Intelligent Systems 16, 46–53, 2001
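The two annotator outputs (a tagged string and a database-style record) can be sketched in Python as below. This assumes the entity spans are already known, so it shows only the output step, not the recognition step. The tag names use underscores instead of spaces because XML element names cannot contain spaces; that choice, like the function name, is an assumption made for this example.

```python
def annotate(text, entities):
    """Wrap the first occurrence of each surface string in <LABEL>...</LABEL>.

    Returns the tagged text and a flat record (label -> surface string)
    of the kind that could be loaded into a relational table.
    Surface strings are inserted as-is; XML escaping is not handled.
    """
    record = {}
    for label, surface in entities:
        start = text.find(surface)
        if start == -1:
            continue  # entity not present in the text; skip it
        end = start + len(surface)
        text = f"{text[:start]}<{label}>{surface}</{label}>{text[end:]}"
        record[label] = surface
    return text, record


sentence = ("On November 16, 2005, IBM announced it had acquired Collation, "
            "a privately held company based in Redwood City, California "
            "for undisclosed amount.")
entities = [
    ("DATE", "November 16, 2005"),
    ("ACQUIRING_ORG", "IBM"),
    ("ACQUISITION_EVENT", "acquired"),
    ("ACQUIRED_ORG", "Collation"),
    ("PLACE", "Redwood City, California"),
    ("AMOUNT", "undisclosed"),
]
tagged, record = annotate(sentence, entities)
print(tagged)
print(record)
```

Keeping the flat record alongside the tagged text mirrors the slide's split between RDBMS output and XML output.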
Knowledge-based Persistent Archives
Knowledge
Repository for
Rules
Access
Rules - KQL
Knowledge
Relationships
Between
Concepts
Manage
XTM DTD
Ingest
Knowledge or
Topic-Based
Query
Attributes
Semantics
Information
Repository
EMCAT /
MIX
Information
XML DTD
(Topic Maps / Model-based Access)
Attribute- based
Query
Fields
Containers
Folders
Storage
(Replicas,
Persistent IDs)
GRIDS
Data
MCAT/HDF
(Data Handling System - Storage Resource Broker)
Feature-based
Query
NExIOM Ontology Models
(Figure) NASA iLoC SBA Workspace
Semantic applications, each exposed as a web service (WS) through a mapping: Electrical Power Analysis, Structure and Connectivity, Trade-Offs Analysis, Risk Modeling, Cost Modeling, Performance Modeling
Ontology layer: Ontology Authoring, TopSCAPE, COVE, Discipline Ontology Models, Translation Models
Semantic Application layers: Interaction Logic (IL), Application Logic (AL), Semantic Interface (SI), BL
Data stores: IDT DB, RFx DB (T1, T2)
Text Mining for Hypertext Creation
(Figure) Concept map: a general topic with Subtopic 1 … Subtopic i … Subtopic M
(Figure) Hypertext: Doc 1, Doc 2, …, Doc N
Type of Links
Term-Term Links
Doc-Term Links
Term-Doc Links
Doc-Doc Links
Example from an Enterprise Architecture
Process Ontology
Agent
Role
Process
Task
Measure
Goal
FEA-RMO delivers “Line of Sight”
fea: Mission
fea: intentOf
prm: GenericMeasurementIndicator
fea: Agency
prm: PerformanceMeasure
brm: provides
fea: hasIntent
prm:hasIndicator
brm: SubFunction
brm: hasProcess
brm: Process
brm: usesResource
brm: Resource
prm:hasSpecialization
brm: hasPerformance
brm: realizedWith
brm: hasCustomer
fea: Customer
prm: OperationalizedMeasurementIndicator
srm: Service
Medical Record Integration
(Figure) Four overlapping patient records, separated here

ROYAL MARSDEN NHS TRUST - PATIENT CASE NOTE
######:MRS ##### #######
15 Dec 1993 General Surgical
I reviewed this patient in clinic today. She has been followed up for a left breast carcinoma for which she was treated with a mastectomy. She had a prosthesis removed last year and has had some improvement in the symptoms of chest wall discomfort since then although she still gets quite sharp pains intermittently. She has been reviewed in the pain clinic local to where she lives but has not had much relief of her symptoms. She feels though that she can bear with these and does not want any further intervention at present.
On examination today there is no sign of recurrence of her disease. Chest and abdominal examination were unremarkable. We will see her again in a year's time.
28/03/2003, 10:35:26

ROYAL MARSDEN NHS TRUST - DIAGNOSTIC RADIOLOGY - CT REPORT
######:#######,MRS #####
Exam 18 Dec Examination LIVER/THORAX/ABDOMEN/PELVIS
Exam Number [NUM]
Date of Birth 17 May 1933
Ref [HCA1]
Clinical OUTPATIENT BR
Verified by [HCA2]
DIAGNOSIS: Carcinoma of breast.
CT scans have been obtained through chest, abdomen and pelvis with oral contrast only.
There is thickening in the left clavicular fossa and small-volume residual abnormalities in the mediastinum. Comparison was made with the most recent scan (21.7.95) and there is no discernible change.
Lung changes, which may have been related to radiotherapy, are now less extensive.
There are no abnormally-enlarged nodes in the retroperitoneum or pelvis. There are no focal hepatic masses.
CONCLUSION: No CT evidence of disease progression.
28/03/2003, 12:35:06

ROYAL MARSDEN NHS TRUST - PATIENT CASE NOTE
######:MRS ##### #######
24 Jan 1997 Seen in the Chemotherapy Clinic (TPFRIDAY)
I saw ##### today in clinic. I am very pleased to say that she has had a complete response in her superior mediastinum and right supraclavicular fossa lymphadenopathy. There is some minimal remaining thickening in the soft tissues around the superior mediastinum and in fact it is felt that this might now be related to previous radiotherapy. To be honest, however, symptomatically there has been little in the way of benefit with overall palliative response by CT criteria of no change. She is tolerating the treatment fairly well. Interestingly she has had virtually complete alopecia with the treatment. She has been on warfarin for about the same amount of time and I wonder whether this may be partly responsible. We have given her a fourth cycle of treatment today and we will see her in three weeks for consideration of her fifth.
28/03/2003, 10:44:20

ROYAL MARSDEN NHS TRUST - PATIENT CASE NOTE
######:MRS ##### #######
27 Aug 1998 Seen in the Follow Up Staging Clinic
This 65 year old lady has been reviewed in the Breast staging clinic. As you know, she was originally diagnosed with a carcinoma of the left breast in 1974 and treated with a total mastectomy. This was followed with MEFUP chemotherapy. In 1982 she noticed a lump in the infraclavicular region which was excised and this was followed by radiotherapy. In 1994 she developed a tumour in the chest cavity that was diagnosed with a CT guided biopsy and this was treated with VAC chemotherapy and radiotherapy to the mediastinum. Since 1994 she had noticed a slight deterioration and earlier this year she had problems with occasional episodes of vomiting, nausea and general lethargy. She was found to have lymphadenopathy in the right supraclavicular fossa and was treated with Arimidex. Since being on Arimidex there was originally stabilisation of her disease but recently it appears that the node has started to enlarge.
On examination today, she has a 1.5x1cm lymph node in the right supraclavicular fossa and an essence of thickening probably due to previous therapy in the left supraclavicular fossa. She also has radiation changes in the lung which produced some physical sign at both bases and there was no evidence of abdominal organomegaly.
Her recent staging investigations show that she has C5 carcinoma cells present in the lymph node fine needle aspirate. A right mammogram is unremarkable. An ultrasound of the liver is normal and a chest x-ray showed some soft tissue thickening present in the left axilla due to previous therapy. There is also some loss of volume in the left upper zone but no lung nodules seen. A bone scan shows evidence of degenerative changes but no specific evidence of bony metastases. Her thyroid function tests show that the TSH is 0.12 and her free T3 are 4 which indicates that the TSH is slightly low. This does not amount to primary hypothyroidism but it would be worth repeating the thyroid function tests in three months time.
Overall, it appears that the patient has stable disease on Arimidex apart from in the right supraclavicular fossa. The Arimidex is not holding the disease completely and we feel that the best approach to management would be to consider some radiotherapy to the right supraclavicular fossa. She has previously had radiation therapy to the left clavicular region and mediastinum. We have discussed performing a CT scan of the thorax but she was unable to lie flat for the duration of the investigation some months ago. We shall ask our radiotherapy colleagues to review her and consider her for therapy. We shall review her again in the follow up clinic in six weeks time.
28/03/2003, 10:50:25
Disease Diagnosis
• Consider a 62-year-old man with a 3-month history of severe back pain. His weight remained stable. CBC and routine biochemistry were normal. ESR was 52 mm/hour. An x-ray of the lumbar and thoracic spine was reported to show degenerative changes. → Cancer
(Figure) Decision tree for low back pain; node labels in extraction order:
Low back pain
Features
History and physical examination
Age > 50 years, or failure of treatment, or weight loss
History of previous cancer
ESR, spine films
9% with cancer
No significant finding
ESR
ESR < 20 and only one clinical finding
No cancer
ESR > 20 or more than one clinical finding
X-ray
2.3% cancer
What was done… What happened… And why
(Figure) Concept graph of a patient history
Nodes: Human:1382, Pain:5735, Ulcer:1945, Breast:1492, Clinic:4096, Clinic:1024, Biopsy:1066, Radio:1812, Chemo:6502, Mass:1666, Clinic:2010, Cancer:1914
Edge labels: locus, attends, reason, finding, plans, target, treats, time
Concept Lattice
Given the context (D1,T1) where
D1 = {d1,d2,d3,d4} & T1 = {t1,t2,t3,t4,t5,t6}
Table: The input relation R = documents × keywords
R    t1  t2  t3  t4  t5  t6
d1   1   0   1   0   1   1
d2   1   0   1   0   1   1
d3   0   1   0   1   0   0
d4   1   0   0   1   0   1
Hasse Diagram
C1: (D1, Ø)
C2: ({d1,d2,d4}, {t1,t6})
C3: ({d3,d4}, {t4})
C4: ({d1,d2}, {t1,t3,t5,t6})
C5: ({d4}, {t1,t4,t6})
C6: ({d3}, {t2,t4})
C7: (Ø, T1)
The formal concept C4 has two own terms {t3,t5} and two inherited terms {t1,t6}
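The seven concepts can be reproduced from the input relation with a brute-force Python sketch. It enumerates every subset of documents, which is only practical for a toy context like this one; the function names are assumptions for this example, and larger contexts would call for a dedicated algorithm instead.

```python
from itertools import combinations

# The input relation R from the table above: document -> set of keywords.
CONTEXT = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}
ALL_TERMS = {"t1", "t2", "t3", "t4", "t5", "t6"}


def common_terms(docs):
    """Intent: the terms shared by every document in docs (all terms if docs is empty)."""
    shared = set(ALL_TERMS)
    for d in docs:
        shared &= CONTEXT[d]
    return shared


def docs_having(terms):
    """Extent: the documents that contain every term in terms."""
    return {d for d, doc_terms in CONTEXT.items() if terms <= doc_terms}


def formal_concepts():
    """Enumerate all (extent, intent) pairs by closing every subset of documents."""
    concepts = set()
    names = sorted(CONTEXT)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_terms(subset)
            extent = docs_having(intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


for extent, intent in sorted(formal_concepts(), key=lambda c: -len(c[0])):
    print(sorted(extent), sorted(intent))
```

Sorting by extent size prints the concepts from the top of the Hasse diagram (all documents, no shared terms) down to the bottom (no documents, all terms).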
Text Analysis Spectrum
Classification
Concept
Identification
Targeted Facts
and Events
Entity Extraction
Clustering
What is this
document about?
Who did
what to
whom when
where, etc.
Why is getting dimensional data so hard?
Hank bought plastic explosives from Henry in Tucson yesterday.
Named Entity Extraction (NER Engine): People, Weapons, Vehicles, Dates
Extracted entities: Hank, Henry, Plastic explosives, Tucson, 11/01/07
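A toy Python sketch of how such entities could be pulled out of the sentence follows. It uses a hand-made lookup list (gazetteer) and resolves "yesterday" against a reference date, which is far simpler than a real NER engine. The gazetteer contents, the "Place" label for Tucson, the function name, and the reading of 11/01/07 as month/day/year are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Hand-made gazetteer (assumption): surface form -> entity type.
GAZETTEER = {
    "Hank": "Person",
    "Henry": "Person",
    "plastic explosives": "Weapon",
    "Tucson": "Place",
}


def extract_entities(sentence, today):
    """Return (surface, type) pairs found by case-insensitive substring lookup.

    Substring matching is crude: it would also fire inside longer words.
    The relative expression "yesterday" is resolved against `today`.
    """
    lowered = sentence.lower()
    found = [(surface, label) for surface, label in GAZETTEER.items()
             if surface.lower() in lowered]
    if "yesterday" in lowered:
        resolved = today - timedelta(days=1)
        found.append((resolved.strftime("%m/%d/%y"), "Date"))
    return found


sentence = "Hank bought plastic explosives from Henry in Tucson yesterday."
# Reference date assumed to be 2 November 2007 so that "yesterday" is 11/01/07.
entities = extract_entities(sentence, today=date(2007, 11, 2))
print(entities)
```

The date shows why dimensional data is hard to obtain: "yesterday" carries no value until it is combined with context from outside the sentence.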
Automatic Pattern-Learning Systems
(Figure) Training: Language Input and Answers → Trainer → Model
(Figure) Decoding: Language Input → Decoder (with Model) → Answers
• Pros:
  • Portable across domains
  • Tend to have broad coverage
  • Robust in the face of degraded input
  • Automatically find appropriate statistical patterns
  • System knowledge not needed by those who supply the domain knowledge
• Cons:
  • Annotated training data, and lots of it, is needed
  • Isn’t necessarily better or cheaper than a hand-built solution
• Examples:
  • Riloff et al., AutoSlog; Soderland, WHISK (UMass); Mooney et al., Rapier (UTexas); Ciravegna (Sheffield)
  • Learn lexico-syntactic patterns from templates
Explicit Events, Object Identity,
Symmetry
E52 Time-Span
E39 Actor
E53 Place
7012124
February 1945
P82 at some
time within
E7 Activity
E39 Actor
“Crimea Conference”
E38 Image
P86 falls
within
E65 Creation Event
*
E39 Actor
P81
ongoing
throughout
E52 Time-Span
11-2-1945
E31 Document
“Yalta Agreement”
Rules Extraction
• The formal concept C4 yields the following rules:
  R1: t3 → t1 ∧ t6
  R2: t5 → t1 ∧ t6
  R3: t3 ↔ t5
• Interpretation of R1 and R2: the use of term t3 or t5 is always associated with that of terms t1 and t6.
• Rule R3 expresses the mutual equivalence of the terms {t3,t5}: all documents that have term t3 also have term t5, and vice versa.
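The three rules can be checked directly against the document-keyword relation from the Concept Lattice slide. The Python sketch below tests whether an implication holds in every document; the helper name `implies` is an assumption for this example.

```python
# Document -> keywords, copied from the Concept Lattice input relation.
CONTEXT = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}


def implies(context, premise, conclusion):
    """True if every document containing all premise terms also contains all conclusion terms."""
    return all(conclusion <= terms
               for terms in context.values()
               if premise <= terms)


print("R1:", implies(CONTEXT, {"t3"}, {"t1", "t6"}))
print("R2:", implies(CONTEXT, {"t5"}, {"t1", "t6"}))
print("R3:", implies(CONTEXT, {"t3"}, {"t5"}) and implies(CONTEXT, {"t5"}, {"t3"}))
# The converse of R1 does not hold: d4 has t1 and t6 but not t3.
print("t1 and t6 imply t3:", implies(CONTEXT, {"t1", "t6"}, {"t3"}))
```

The last check shows that the rules are directional: t1 and t6 are inherited terms of C4, so they also occur in documents outside its extent.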
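Post-Disaster Reconstruction Fund
Causal diagram: children left without guardians
(Figure) Node labels:
Status of children left without guardians in each county and city
Intervention by county and city governments, social affairs bureaus, etc.
County and city welfare, establishment of trust funds
Subsidies for low- and middle-income households
Post-disaster reconstruction and rules on the use of funds for subsidies to single-parent families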
Chinese NER – Example 2
 黑色當道 少了尖叫 女星太規矩 城城活跳跳 金馬獎星光大道不若前晚金鐘獎
「峰芒」畢露,女星們規矩平穩的服裝,讓星光大道上少了一些特色,並未出
現讓人眼睛一亮的驚喜。其中,在金鐘獎上讓人血脈僨張的蕭淑慎,在金馬獎
上可以看出服裝「規矩」了些。總體來說,今年的星光大道造型略顯平庸。
秋冬主流黑色更在金馬星光大道上大量出現,凱渥模特兒公司老闆、也是專業
資深時尚人洪偉明說:「可以發現他們選擇合適的服裝,規矩、正式的選擇,
可避免遭受批評,今年確實少了些特色,但重要的國際場合,平穩的黑色服裝,
也是出席正式場合的安全造型。」 洪偉明表示:「楊千嬅的服裝和她的人很
搭,黑色蕾絲讓她不至於顏色過重,正式中又帶點活潑,感覺很棒。」台中市
長胡志強女兒胡婷婷桃紅色的緞面禮服,也讓洪偉明很欣賞,他說:「整體感
覺落落大方,亮色服裝和她的人也很適合,她的自信和星光大道主持人蔣怡的
乾淨大方一樣,讓人感覺舒服,也是不錯的造型。」 舒淇鵝黃色的禮服,洪
偉明笑說:「羅曼蒂克的感覺和她的笑容很搭配,讓氣色宛如戀愛中的女人一
樣美好。」梁詠琪的黑色短禮服,雖然露出她的修長美腿,但洪偉明也建議:
「她至少可以搭雙絲襪,整體感覺會更好。她在演唱會上展現性感,其實星光
大道上也可以大膽改變。」 至於男星們的服裝,今年則是絲絨的天下,洪偉
明笑說:「男星們服裝不易做出變化,敢大膽嘗試不同造型的人也不多,其中
郭富城神采奕奕的精神,十分突出,張震的服裝則顯得穩重而規矩。」
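Proper nouns
Term | POS tag | Count
舒淇 | [Nb] proper name | 2
張震 | [Nb] proper name | 1
高達 | [Nb] proper name | 1
賴雅妍 | [Nb] proper name | 1
白 | [Nb] proper name | 1
米蘭 | [Nb] proper name | 1
竹幼婷戴榮賢 | [Nb] proper name | 1
林熙蕾 | [Nb] proper name | 2
郭富城 | [Nb] proper name | 1
楊貴媚 | [Nb] proper name | 1
范文芳 | [Nb] proper name | 1
林志玲 | [Nb] proper name | 1
金馬獎 | [Nb] proper name | 3
楊采妮 | [Nb] proper name | 1
舒淇鵝 | [Nb] proper name | 1
藍正龍 | [Nb] proper name | 1
金城武 | [Nb] proper name | 2
侯佩岑 | [Nb] proper name | 3
蕭淑慎 | [Nb] proper name | 4
梁詠琪 | [Nb] proper name | 2
黃志瑋 | [Nb] proper name | 1
黃子佼 | [Nb] proper name | 1
天心 | [Nb] proper name | 1
楊千嬅 | [Nb] proper name | 1
洪偉明 | [Nb] proper name | 2
胡婷婷 | [Nb] proper name | 2
師李 | [Nb] proper name | 1
戴起 | [Nb] proper name | 1

Place words
Term | POS tag | Count
背後 | [Nc] place word | 1
中途 | [Nc] place word | 1
世界 | [Nc] place word | 1
天下 | [Nc] place word | 1
原地 | [Nc] place word | 1

Time words
Term | POS tag | Count
昨天 | [Nd] time word | 4
新春 | [Nd] time word | 1
昨晚 | [Nd] time word | 1
早春 | [Nd] time word | 1
前晚 | [Nd] time word | 2
先後 | [Nd] time word | 1
今年 | [Nd] time word | 6
週末 | [Nd] time word | 1

Person-name class
Term | POS tag | Count
露美腿 | [LN] person-name class | 2
台中市長胡志強女兒胡婷婷桃紅色 | [LN] person-name class | 1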
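Generative / Discriminative
(Figure) Object model for text on earthquake reconstruction loans, read upward to Generalize and downward to Specify
Slot labels: Object, attribute, attribute (condition), condition, method
Terms: loan; Provisional Statute for Earthquake Reconstruction; affected households; Home Reconstruction Project; disaster-hit households; financial institutions; interest; houses; damage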
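Example
Well suited to machine washing
Pleasant fragrance
Strong stain removal
Washing takes less effort
Fresh scent
Removes 99 kinds of stains
Washes especially clean
Pleasant fragrance
Gets white socks the cleanest
Smells very nice
Gentle on hands
Removes stains well
Clothes do not fade easily
Washing is effortless
Removes 99 kinds of stains
Small amount needed
Washes clean
Little skin irritation
Cleans all kinds of stains well
Washes clean
Reasonable price
Better washing results
Nice smell
Have always used this brand
Washed clothes are whiter
Pleasant smell
Memorable advertising
Washes clean
Rinses out easily
Fairly gentle on hands
Washes clean
Small amount needed
Washes clean
Needs less than other brands
Heavy advertising
Washes clean
Small amount needed
Good quality
Small amount needed
Washes clean
Good packaging
Lots of appealing advertising
Pleasant fragrance
Washes clean and white
Good promotion, entertaining ads
Many people say it is good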
Semantic Concept Extraction for Malignancy DSS
Patient (Patient ID), ESR, Screening (Positive), Symptom (Positive Indication) → Cancer
(Figure) Information Retrieval → Information Extraction → Knowledge Inference
Bag of “Words” extraction: Patient ID, ESR, severe, back, pain, x-ray, lumbar, spine, degenerative, changes
Expressions extraction: Patient ID, ESR, severe back pain, x-ray, lumbar spine, degenerative changes
Named Entities extraction:
  Patient ID → Diagnostic term
  ESR → screening test
  Lumbar, Spine → Anatomy term
  degenerative changes → Symptom
Events/Sentiment extraction
Combined with structured data
Decision Making: malignancy? → Treatment
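(Document) data mining moves toward decision support
• Aggregate data of the same nature → Clustering
• Mine the data to derive association and dependency rules → Association Rules
• Visualize the results to help experts judge topics → Visualization
• Define processing guidelines to ease building decision support → Processing Guideline
Development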
Local data
FTP
Gopher
HTML
More structure
Indexing
Search
Relevance Ranking
Latent Semantic Topology
Crawling
WebSQL
Social
Network
of
Hyperlinks
WebL
XML
Clustering
Collaborative
Filtering
ScatterGather
Topic Directories
Semi-supervised
Automatic
Learning
Classification
Web
Communities
Web Servers
Topic
Distillation
Focused
Crawling
Monitor
Mine
Modify
User
Profiling
Web Browsers