Download 1 Python Basics - Temple Fox MIS

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
PythonandWebDataExtraction:
Introduction
Alvin Zuyin Zheng
[email protected]
http://community.mis.temple.edu/zuyinzheng/
About
• AboutMe:Alvin Zuyin Zheng ,MIS
• Following up with the workshop in May
• Organizers:Sudipta Basu, Lalitha Naveen, Jing
Gong
•
Thisworkshop issupported bytheOfficeofResearch
andDoctoral ProgramsatFox.ThanksPaul,Lindsayand
everyone intheOfficeofResearch!
About
• StudentAssistants:
– ShawnJNiederriter
– Xue Guo
– Zhe Deng
• Website:
http://community.mis.temple.edu/zuyinzheng
/pythonworkshop/
• Maintenance from 12:00 to 2:00pm!
Topics
1. PythonBasics
2. WebScraping
3. IntroductiontoNaturalLanguageProcessing
• Nopriorprogrammingexperienceneeded
Schedule
9:50am
WelcomeandSetUp
10:00am
Session1–Python basics
11:00am
CoffeeBreak
11:20am
Session2–WebScraping(Part1)
12:20pm
LunchBreak
1:20pm
Session3–WebScraping(Part2)
2:20pm
CoffeeBreak
2:40pm
Session4– BasicIntrotoNatural
LanguageProcessing
3:45pm
ClosingRemarksandQuestions
PythonandWebDataExtraction:
PythonBasics
JingGong
[email protected]
http://community.mis.temple.edu/gong
Prerequisites
• Beforetheworkshop,yourcomputerneedsthe
followingtoolsinstalledandworkingto
participate.
– Acommand-line interface tointeract withyour
computer
– Atexteditortoworkwithplaintextfiles
– Python2.7
– Thepip packagemanagerforPython
– A browser that can view web source code like Chrome
(Pleasefollow thesetupguidepostedhere)
Outline
•
•
•
•
•
•
•
Overview
DataTypes
ControlFlow
PackagesandFunctions
FileInput/Output
RegularExpression
Tutorial1.FirstRunningtheFirstPythonScript
WhyPython?
•
•
•
•
•
Simple
Easytolearn
Freeandopensource
Portableacrossplatforms
Withextensivelibraries
• Python2versus3:
– Verydifferent
– WewillusethelatestversionofPython 2(Latest
versionisPython 2.7.12)
PythonIDLE(InteractiveShell)
• ThePython IDLE providesaninteractive environment to
playwiththelanguage
• OpenPythonIDLE
– OnWindows:taptheWindowskeyonyourkeyboardandtype“idle”
toopenthe“IDLE(PythonGUI)”
– OnMac:useCmd+Space andtype“idle”toselectthe“IDLE.”
PythonIDLE(InteractiveShell)
•Youcantypecommands directlyintotheinteractive
shell
•Resultsofexpressions areprintedonthescreen
>>> 1+3
4
>>> workshop = "Python"
>>> workshop
'Python'
>>> print "hello"
hello
Variablesareassigned
usingthe“=“sign
Indentation
• Pythonusesindentation(usuallyfourspaces) to
structureablockofcodes
– nocurlybraces {} tomarkwherethefunctioncodestarts
andstops
>>> x = 10
>>> if x>10:
...
print "x is larger than 10"
... else:
...
print "x is less than or equal to 10"
Returns:
x is less than or equal to 10
Comments
• Comments: Textsmainlyusefulasnotesforthe
readerofthescript.
• Aretotherightofthe#symbol
>>> print "hello" #this is a comment
Hello
>>> #this is a comment
>>>
Outline
•
•
•
•
•
•
•
Overview
DataTypes
ControlFlow
PackagesandFunctions
FileInput/Output
RegularExpression
Tutorial1.FirstRunningtheFirstPythonScript
BasicDataTypes
• Numbers
– Integers
>>> month = 3
– Floats
>>> income = 100.2
– Calculations
>>> 100/month
33
>>> income/month
33.4
BasicDataTypes
• Strings
– Specifystringsusingeithersinglequotesor
doublequotes
>>> gender = "Male"
>>> name = 'Jack'
– Usetriplequotesforstringsacrossmultiplelines
>>> paragraph = """This is a long
paragraph with multiple lines."""
– Concatenate stringsusing“+”
>>> name + gender
'JackMale'
List[]
• List:Anorderedcollectionofdata
– Youcanhaveanything inalist:
>>>
>>>
>>>
>>>
[0,
[0]
[2.3, 4.5]
[5, "Hello", "there", 9.8]
range(4)
1, 2, 3]
– Uselen()togetthelengthofalist
>>> names=["Ben","Jack","Lee","Nick"]
>>> len(names)
4
AccessingandUpdatingValuesinLists
• Use[]toaccessvaluesinthelist
>>> names=["Ben","Jack","Lee","Nick"]
>>> names[0]
[0]:the1st value
'Ben'
>>> names[1:3]
[1]:the2nd value
['Jack', 'Lee']
…
>>> names[1:]
['Jack', 'Lee', 'Nick']
>>> [names[i] for i in [1,3]]
['Jack', 'Nick']
• Updatevalues
>>> names[1]="Ann"
>>> names
['Ben', 'Ann', 'Lee', 'Nick']
TolearnmoreaboutPythonLists:Visithere
OtherTypes
• Dictionaries{}
– consistofkey-valuepairs.
>>> dict = {'name': 'Ann', 'age': 10}
• Tuples()
– Similartolist,butcannotbeupdated.
>>> tup = ('Ann', 10)
>>> tup[1]=12
Traceback (most recent call last):
File "<pyshell#25>", line 1, in <module>
tup[1]=12
TypeError: 'tuple' object does not support
item assignment
Outline
•
•
•
•
•
•
•
Overview
DataTypes
ControlFlow
PackagesandFunctions
FileInput/Output
RegularExpression
Tutorial1.FirstRunningtheFirstPythonScript
If
• Theif statementchecksacondition
• Andcanbecombined with…elif …else
>>>
>>>
...
...
...
...
...
age = 10
if age>10:
print "age is greater than 10"
elif age<10:
print "age is less than 10"
else:
print "age is equal to 10"
age is equal to 10
For
• Thefor statementloopsiterateovereachvalueina
list
>>> for i in range(4):
...
print "i: ",i
i:
i:
i:
i:
0
1
2
3
Break
• break canbeusedtostoptheforloop
>>> for i in range(4):
...
print "i: ",i
...
if i>1:
...
print "Exiting the for loop"
...
break
i: 0
i: 1
i: 2
Exiting the for loop
Outline
•
•
•
•
•
•
•
Overview
DataTypes
ControlFlow
PackagesandFunctions
FileInput/Output
RegularExpression
Tutorial1.FirstRunningtheFirstPythonScript
Packages(Modules)
• Pythonitselfonlyhasalimitedsetof
functionalities.
• Packagesprovideadditionalfunctionalities.
• Use“import”toloadapackage
>>> import math
>>> import os
InstallPackagesUsingpip
• Forstandardpackages
• Youdonotneedtobeinstalledmanually
• Forthird-partypackages
• Usepip inyourcommand lineinterface toinstall
pip install SomePackage
• Forexample,toinstallthebeautifulsouppackage:
pip install beautifulsoup4
Seetheseinstructionsforhowtoopenthecommandlineinterface.
• OnWindowsitiscalled“CommandPrompt.”
• OnMacitiscalled“Terminal.”
Functions
• Afunction isanamedsequenceofstatements
thatperformsadesiredoperation
• Defineafunction:
>>> def f(x):
...
return x*2
>>> f(1)
2
>>> f(0.5)
1
Outline
•
•
•
•
•
•
•
Overview
DataTypes
ControlFlow
PackagesandFunctions
FileInput/Output
RegularExpression
Tutorial1.FirstRunningtheFirstPythonScript
WorkingDirectory
• WorkingDirectory:Thinkofitasthefolderyour
Pythonisoperatinginsideatthemoment
• Getthecurrentworkingdirectory:
>>> import os
>>> os.getcwd()
'C:\\Python27'
• Changethecurrentworkingdirectory:
>>> os.chdir("C:/users/jing")
>>> os.getcwd()
'C:\\users\\jing'
Useforwardslash (/)or
doublebackslash(\\)when
specifyingdirectories.
FileOpen/Close
• Touseafile,youhavetoopenitusingtheopen function
(Note:Thefollowingcodesassumesthatyourfilesareinyourcurrent
workingdirectory)
>>> input_file = open("in.txt", "r")
>>> output_file = open("out.txt", "w")
>>> for line in input_file:
"r" meansreading
output_file.write(line)
"w" meanswriting
"a" meansappending
• Whendone,youhavetocloseit usingtheclose function
>>> input_file.close()
>>> output_file.close()
Outline
•
•
•
•
•
•
•
Overview
DataTypes
ControlFlow
PackagesandFunctions
FileInput/Output
RegularExpression
Tutorial1.FirstRunningtheFirstPythonScript
RegularExpressions(RE)
• RegularExpressionsareapowerfultext
manipulationtoolfor
– searching, replacing,andparsingtextpatterns
• Youneedtoloadthe“re”package
>>> import re
Exactmatch:Anexample
• Suppose wehaveatextstring,andwanttoknowifthe
stringhastheword“cat”init…
String: " A fat cat doesn't eat oat
but a rat eats bats. "
• re.search(pattern, string[, flags])
– findthefirst locationwherethepattern producesamatch
>>>
>>>
rat
>>>
>>>
cat
import re
teststring = """ A fat cat doesn't eat oat but a
eats bats. """
match = re.search("cat",teststring)
print match.group()
ExtractingTexts
• Howtoextract everything between“cat”and
“rat”?
"A fat cat doesn't eat oat but
String:
a rat eats bats."
• Wecandefineapattern "cat(.*?)rat"
– (.*?) representseverything inbetween“cat”and“rat”
>>> match = re.search("cat(.*?)rat",teststring)
>>> print match.group(0)
cat doesn't eat oat but a rat
>>> print match.group(1)
doesn't eat oat but a
FindAllMatchedSubtrings
• re.findall(pattern, string[, flags])
– Findallmatchedsubstringswithagivenpattern
• ".at" isapatternthatmatchesanyof'fat','cat',
'eat','oat','rat','eat'(or'1at','2at',‘aat',‘bat‘…)
>>> import re
>>> teststring = """ A fat cat doesn't eat oat but
a rat eats bats. """
>>> match = re.findall(".at",teststring)
>>> print match
['fat', 'cat', 'eat', 'oat', 'rat', 'eat']
EscapeCharacter
• Specialcharactersoftencauseproblemsbecausetheyare
usedtodefinepatterns
• Ifyouwantaspecialcharacter tojustbehavenormally
(mostofthetime)youprefixitwithbackslash(\ )
• Escape
\\
\ '
\ "
\n
\t
\r
\.
Meaning
\
'
"
newline
tab
return
.
MoreaboutRegularExpression
• Python’sofficialRegularExpressionHOWTO:
– https://docs.python.org/2/howto/regex.html#regexhowto
• Google’sPythonRegularExpressionTutorial:
– https://developers.google.com/edu/python/regularexpressions
• Testyourregularexpression:
– https://regex101.com/#python
Outline
•
•
•
•
•
•
•
Overview
DataTypes
ControlFlow
PackagesandFunctions
FileInput/Output
RegularExpression
Tutorial1.FirstRunningtheFirstPythonScript
Tutorial1:RunningYourFirstPythonScript
• DownloadtheFirstPythonScript.py filefromthe
websiteandtrytorunit
• Steps
1. Locatethe.py fileyou’dliketoruninyourfolder
2. Openthethe .py filewithIDLE
3. Clickthe“Run”menuandchoose“RunModule”
OnlineRecoursesforPython
• Python'sBeginnersGuide listedmanyonlinebooksandtutorials:
https://wiki.python.org/moin/BeginnersGuide/Programmers
• Python'sOfficialTutorial: https://docs.python.org/2/tutorial/
• PythonBasicTutorialby
TutorialsPoint: http://www.tutorialspoint.com/python/
• learnpython.org: http://www.learnpython.org/
• Google'sPythonClass: https://developers.google.com/edu/python/