Python and Web Data Extraction: Introduction
Alvin Zuyin Zheng
[email protected]
http://community.mis.temple.edu/zuyinzheng/

About
• About Me: Alvin Zuyin Zheng, MIS
• Following up on the workshop in May
• Organizers: Sudipta Basu, Lalitha Naveen, Jing Gong
• This workshop is supported by the Office of Research and Doctoral Programs at Fox. Thanks Paul, Lindsay and everyone in the Office of Research!

About
• Student Assistants:
  – Shawn J Niederriter
  – Xue Guo
  – Zhe Deng
• Website: http://community.mis.temple.edu/zuyinzheng/pythonworkshop/
• Maintenance from 12:00 to 2:00pm!

Topics
1. Python Basics
2. Web Scraping
3. Introduction to Natural Language Processing
• No prior programming experience needed

Schedule
9:50am   Welcome and Set Up
10:00am  Session 1 – Python Basics
11:00am  Coffee Break
11:20am  Session 2 – Web Scraping (Part 1)
12:20pm  Lunch Break
1:20pm   Session 3 – Web Scraping (Part 2)
2:20pm   Coffee Break
2:40pm   Session 4 – Basic Intro to Natural Language Processing
3:45pm   Closing Remarks and Questions

Python and Web Data Extraction: Python Basics
Jing Gong
[email protected]
http://community.mis.temple.edu/gong

Prerequisites
• Before the workshop, your computer needs the following tools installed and working to participate.
  – A command-line interface to interact with your computer
  – A text editor to work with plain text files
  – Python 2.7
  – The pip package manager for Python
  – A browser that can view web source code, like Chrome
(Please follow the setup guide posted here)

Outline
• Overview
• Data Types
• Control Flow
• Packages and Functions
• File Input/Output
• Regular Expression
• Tutorial 1. Running the First Python Script

Why Python?
• Simple
• Easy to learn
• Free and open source
• Portable across platforms
• With extensive libraries
• Python 2 versus 3:
  – Very different: code written for one version often does not run unchanged on the other
  – We will use the latest version of Python 2 (the latest version is Python 2.7.12)

Python IDLE (Interactive Shell)
• The Python IDLE provides an interactive environment to play with the language
• Open Python IDLE
  – On Windows: tap the Windows key on your keyboard and type “idle” to open the “IDLE (Python GUI)”
  – On Mac: use Cmd+Space and type “idle” to select “IDLE”

Python IDLE (Interactive Shell)
• You can type commands directly into the interactive shell
• Results of expressions are printed on the screen
• Variables are assigned using the “=” sign

    >>> 1+3
    4
    >>> workshop = "Python"
    >>> workshop
    'Python'
    >>> print "hello"
    hello

Indentation
• Python uses indentation (usually four spaces) to structure a block of code
  – no curly braces {} to mark where a block of code starts and stops

    >>> x = 10
    >>> if x>10:
    ...     print "x is larger than 10"
    ... else:
    ...     print "x is less than or equal to 10"
    ...
    x is less than or equal to 10

Comments
• Comments: Text mainly useful as notes for the reader of the script.
• Are to the right of the # symbol

    >>> print "hello" #this is a comment
    hello
    >>> #this is a comment
    >>>

Basic Data Types
• Numbers
  – Integers
        >>> month = 3
  – Floats
        >>> income = 100.2
  – Calculations (in Python 2, dividing two integers drops the remainder)
        >>> 100/month
        33
        >>> income/month
        33.4

Basic Data Types
• Strings
  – Specify strings using either single quotes or double quotes
        >>> gender = "Male"
        >>> name = 'Jack'
  – Use triple quotes for strings across multiple lines
        >>> paragraph = """This is a long paragraph
        with multiple lines."""
  – Concatenate strings using “+”
        >>> name + gender
        'JackMale'

List []
• List: An ordered collection of data
  – You can have anything in a list:
        [0]
        [2.3, 4.5]
        [5, "Hello", "there", 9.8]
        >>> range(4)
        [0, 1, 2, 3]
  – Use len() to get the length of a list
        >>> names=["Ben","Jack","Lee","Nick"]
        >>> len(names)
        4

Accessing and Updating Values in Lists
• Use [] to access values in the list
  – [0]: the 1st value, [1]: the 2nd value, …
  – A slice [1:3] starts at index 1 and stops before index 3

    >>> names=["Ben","Jack","Lee","Nick"]
    >>> names[0]
    'Ben'
    >>> names[1:3]
    ['Jack', 'Lee']
    >>> names[1:]
    ['Jack', 'Lee', 'Nick']
    >>> [names[i] for i in [1,3]]
    ['Jack', 'Nick']

• Update values

    >>> names[1]="Ann"
    >>> names
    ['Ben', 'Ann', 'Lee', 'Nick']

To learn more about Python Lists: Visit here

Other Types
• Dictionaries {}
  – consist of key-value pairs.
        >>> person = {'name': 'Ann', 'age': 10}
• Tuples ()
  – Similar to a list, but cannot be updated.
        >>> tup = ('Ann', 10)
        >>> tup[1]=12
        Traceback (most recent call last):
          File "<pyshell#25>", line 1, in <module>
            tup[1]=12
        TypeError: 'tuple' object does not support item assignment

If
• The if statement checks a condition
• And can be combined with … elif … else

    >>> age = 10
    >>> if age>10:
    ...     print "age is greater than 10"
    ... elif age<10:
    ...     print "age is less than 10"
    ... else:
    ...     print "age is equal to 10"
    ...
    age is equal to 10

For
• The for statement loops over each value in a list

    >>> for i in range(4):
    ...     print "i: ",i
    ...
    i:  0
    i:  1
    i:  2
    i:  3

Break
• break can be used to stop the for loop

    >>> for i in range(4):
    ...     print "i: ",i
    ...     if i>1:
    ...         print "Exiting the for loop"
    ...         break
    ...
    i:  0
    i:  1
    i:  2
    Exiting the for loop

Packages (Modules)
• Python itself only has a limited set of functionalities.
• Packages provide additional functionalities.
• Use “import” to load a package

    >>> import math
    >>> import os

Install Packages Using pip
• For standard packages
  – They do not need to be installed manually
• For third-party packages
  – Use pip in your command line interface to install
        pip install SomePackage
  – For example, to install the beautifulsoup package:
        pip install beautifulsoup4
See these instructions for how to open the command line interface.
• On Windows it is called “Command Prompt.”
• On Mac it is called “Terminal.”

Functions
• A function is a named sequence of statements that performs a desired operation
• Define a function:

    >>> def f(x):
    ...     return x*2
    ...
    >>> f(1)
    2
    >>> f(0.5)
    1.0

Working Directory
• Working Directory: Think of it as the folder your Python is operating inside at the moment
• Get the current working directory:

    >>> import os
    >>> os.getcwd()
    'C:\\Python27'

• Change the current working directory:

    >>> os.chdir("C:/users/jing")
    >>> os.getcwd()
    'C:\\users\\jing'

Use forward slash (/) or double backslash (\\) when specifying directories.
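
As a hedged illustration of the working-directory calls above, the sketch below switches into a folder, does some work there, and then switches back to the starting folder. The helper name run_in_directory and the throwaway folder created with tempfile.mkdtemp() are assumptions made for this example; they are not part of the workshop materials. os.path.join builds a file path without typing slashes by hand. print is written with parentheses so the snippet should run under both Python 2.7 and Python 3.

```python
import os
import tempfile


def run_in_directory(path, action):
    """Switch to `path`, call action(), then return to the original folder."""
    original = os.getcwd()
    os.chdir(path)
    try:
        return action()
    finally:
        # Always switch back, even if action() raised an error
        os.chdir(original)


# A throwaway folder stands in for a real project folder (assumption)
demo_dir = tempfile.mkdtemp()

before = os.getcwd()
inside = run_in_directory(demo_dir, os.getcwd)
after = os.getcwd()

# Build a path to a hypothetical file without hard-coding / or \\
example_path = os.path.join(demo_dir, "in.txt")

print("inside: " + inside)
print("restored: " + str(before == after))
print("example path: " + example_path)
```

Restoring the original folder in a finally clause keeps later relative paths such as "in.txt" pointing where the rest of the script expects them.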
File Open/Close
• To use a file, you have to open it using the open function
  (Note: The following code assumes that your files are in your current working directory)
  – "r" means reading
  – "w" means writing
  – "a" means appending

    >>> input_file = open("in.txt", "r")
    >>> output_file = open("out.txt", "w")
    >>> for line in input_file:
    ...     output_file.write(line)
    ...

• When done, you have to close it using the close function

    >>> input_file.close()
    >>> output_file.close()

Regular Expressions (RE)
• Regular Expressions are a powerful text manipulation tool for
  – searching, replacing, and parsing text patterns
• You need to load the “re” package

    >>> import re

Exact match: An example
• Suppose we have a text string, and want to know if the string has the word “cat” in it…
  String: "A fat cat doesn't eat oat but a rat eats bats."
• re.search(pattern, string[, flags])
  – find the first location where the pattern produces a match

    >>> import re
    >>> teststring = """A fat cat doesn't eat oat but a rat eats bats."""
    >>> match = re.search("cat",teststring)
    >>> print match.group()
    cat

Extracting Text
• How to extract everything between “cat” and “rat”?
  String: "A fat cat doesn't eat oat but a rat eats bats."
• We can define a pattern "cat(.*?)rat"
  – (.*?) represents everything in between “cat” and “rat”
  – By default "." does not match a line break, so the string is kept on one line here

    >>> match = re.search("cat(.*?)rat",teststring)
    >>> print match.group(0)
    cat doesn't eat oat but a rat
    >>> print match.group(1)
     doesn't eat oat but a

Find All Matched Substrings
• re.findall(pattern, string[, flags])
  – Find all matched substrings with a given pattern
• ".at" is a pattern that matches any single character followed by “at”, such as 'fat', 'cat', 'eat', 'oat', 'rat', 'bat' (or '1at', '2at', 'aat', …)

    >>> import re
    >>> teststring = """A fat cat doesn't eat oat but a rat eats bats."""
    >>> match = re.findall(".at",teststring)
    >>> print match
    ['fat', 'cat', 'eat', 'oat', 'rat', 'eat', 'bat']

Escape Character
• Special characters often cause problems because they are used to define patterns
• If you want a special character to just behave normally (most of the time) you prefix it with a backslash (\)

    Escape    Meaning
    \\        \
    \'        '
    \"        "
    \n        newline
    \t        tab
    \r        return
    \.        .

More about Regular Expression
• Python’s official Regular Expression HOWTO:
  – https://docs.python.org/2/howto/regex.html#regexhowto
• Google’s Python Regular Expression Tutorial:
  – https://developers.google.com/edu/python/regularexpressions
• Test your regular expression:
  – https://regex101.com/#python

Tutorial 1: Running Your First Python Script
• Download the FirstPythonScript.py file from the website and try to run it
• Steps
  1. Locate the .py file you’d like to run in your folder
  2. Open the .py file with IDLE
  3. Click the “Run” menu and choose “Run Module”

Online Resources for Python
• Python's Beginners Guide lists many online books and tutorials: https://wiki.python.org/moin/BeginnersGuide/Programmers
• Python's Official Tutorial: https://docs.python.org/2/tutorial/
• Python Basic Tutorial by TutorialsPoint: http://www.tutorialspoint.com/python/
• learnpython.org: http://www.learnpython.org/
• Google's Python Class: https://developers.google.com/edu/python/
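
For practice after the workshop, the regular expression slides (re.search, the non-greedy group (.*?), re.findall, and the escaped dot) can be collected into one small script. This is only a sketch: the variable names found, middle, at_words, and periods are chosen for this example, and the checks against None are an added safeguard for the case where a pattern does not match. print is written with parentheses so the script should run under both Python 2.7 and Python 3.

```python
import re

teststring = "A fat cat doesn't eat oat but a rat eats bats."

# re.search returns a match object, or None when nothing matches
match = re.search("cat", teststring)
found = match.group() if match else None

# Non-greedy group: the shortest text between "cat" and "rat"
between = re.search("cat(.*?)rat", teststring)
middle = between.group(1).strip() if between else None

# re.findall returns every non-overlapping match as a list
at_words = re.findall(".at", teststring)

# Escaping the dot makes it match a literal period only
periods = re.findall(r"\.", teststring)

print(found)
print(middle)
print(at_words)
print(periods)
```

Writing the escaped pattern as a raw string (r"\.") keeps the backslash from being interpreted by Python before it reaches the re package.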