Digital Text and Data Processing
Week 1

Course background
□ Future of reading
□ Understanding “Machine reading”:
   □ Text analysis tools
   □ Visualisation tools
□ Differences between machine reading and human reading
Images taken from textarc.org and from Google App store, Javelin for Android

Scale

Text Mining
□ “a collection of methods used to find patterns and create intelligence from unstructured text data” (1)
□ Related to data mining
□ Information is found “not among formalised database records, but in the unstructured textual data” (2)
(1) Francis, Louise. “Taming Text: An Introduction to Text Mining.” Casualty Actuarial Society Forum Winter (2006), p. 51
(2) Feldman, Ronen. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge: Cambridge University Press, 2007, p. 1

Difficulties of natural language

One thing was certain, that the WHITE kitten had had nothing to do with it:--it was the black kitten's fault entirely. For the white kitten had been having its face washed by the old cat for the last quarter of an hour (and bearing it pretty well, considering); so you see that it COULDN'T have had any hand in the mischief.

Down, down, down. There was nothing else to do, so Alice soon began talking again. 'Dinah'll miss me very much tonight, I should think!' (Dinah was the cat.) … And here Alice began to get rather sleepy, and went on saying to herself, in a dreamy sort of way, 'Do cats eat bats? Do cats eat bats?'

In a Wonderland they lie,
Dreaming as the days go by,
Dreaming as the summers die:
Ever drifting down the stream,
Lingering in the golden gleam.
Life, what is it but a dream?
□ Semantic categories are generally implicit
□ Inflections: conjugations and declensions
□ Homonyms and synonyms
□ Meaning is context-specific
□ Spelling changes over time or may vary across regions

I trod on grass made green by summer's rain,
Through the fast-falling rain and highwrought sea
'Tis like a wondrous strain that sweeps
And suddenly my brain became as sand
She mixed; some impulse made my heart refrain
were found where the rainbow quenches its points upon the earth

Rain   rain   rains   rain’s   Rain’s   Rain.   rain.   Rain!   ‘rain’

Two stages in text mining
□ Data creation
□ Data analysis

Weekly Programme
Cluster 1: Data creation
□ W1: Introduction to the course and introduction to the Perl programming language
□ W2: Regular expressions, word segmentation, frequency lists, types and tokens
□ W3: Natural language processing: Part of Speech tagging, lemmatisation
□ W4: Exploration of existing text mining tools

Weekly Programme
Cluster 2: Data analysis
□ W5: Introduction to the R package
□ W6: Multivariate analysis: Principal Component Analysis, Clustering techniques
□ W7: Visualisation
□ W8: Conclusion: What type of knowledge can we create?

Individual Research project
□ Techniques taught in DTDP generally enable you to study formal differences and similarities between texts, e.g. vocabulary, sentence length, grammatical structure
□ Create a corpus of at least four different texts, of ca. 5,000 words each; you can copy texts from existing corpora
□ You can apply the techniques which are explained in this class to your own corpus
□ Formulate your own research question

Course evaluation
□ 3 assignments (1 point to be earned for each)
□ Final essay (ca. 3,000 words)
   □ Report of your individual research project (3 points)
   □ Critical reflection on digital humanities research (4 points)
      □ What sort of knowledge can be produced? How does this type of research relate to traditional scholarship?
      □ Is programming a legitimate scholarly activity in the humanities?
      □ Can visualisations of texts function as independent scholarly resources?

Introduction to programming
□ Programming languages: used to give instructions to a computer
□ There is a gap between human language and machine language
□ Digital information is information represented as combinations of 1s and 0s, e.g. in ASCII: A = 01000001
□ Low-level programming languages: assembly language (Assembler), e.g. ADD X1 Y1
□ Higher-level programming languages: translated by a compiler or an interpreter
Human programmer → Programming language, e.g. Perl → Language processor → Machine language 0101100101010 → Computer

Algorithm
□ Etymology: from the name of Muhammad ibn Musa al-Khwarizmi, author of Al-kitāb al-mukhtaṣar fī ḥisāb al-ğabr wa’l-muqābala
□ Unambiguous descriptions of the steps which need to be followed to arrive at a well-defined result
□ Developed by human beings!

Getting started
1. Create a working directory on your computer
2. Open a code editor and type the following line:
   print "It works!" ;
3. Use the .bat file that is provided

Variables
□ Scalar variables are always preceded by a dollar sign
   $keyword
□ Variables can be assigned a value with a specific data type (‘string’ or ‘number’)
   $keyword = "time" ;
   $number = 10 ;
□ Three types of variables: scalar ($), array (@), hash (%)

Strings
□ Can be created with single quotes and with double quotes
□ In the case of double quotes, the contents of the string will be interpreted.
□ You can then use “escape characters” in your string to add basic formatting:
   "\n" new line
   "\t" tab

Statements
□ Perl statements can be compared to sentences.
□ Perl statements end in a semi-colon!
   print "This is a statement!" ;

Exercise
Print a string that looks as follows:
   This is the first line.
   This is the second line.
   This line contains a tab.

Operators
= Assignment, e.g.
   $a = 5 ;

Arithmetic operators
+ Addition
- Subtraction
* Multiplication

Exercise
Create two variables and assign a numerical value to each of them. Print their sum, their difference and their product.

Reading a file
Is done as follows:
   # Open the file for reading; stop with an error message if this fails
   open ( IN , "shelley.txt" ) or die "Cannot open shelley.txt: $!" ;
   # Read the file line by line; each line is placed in $_
   while ( <IN> ) {
       print $_ ;
   }
   close ( IN ) ;

Exercise
Create a Perl application which can read the text file “shelley.txt” and which can print all the lines.

Control keywords
   if ( <condition> ) {
       <first block of code>
   } elsif ( <condition> ) {
       <second block of code>
   } else {
       <last block of code; default option>
   }

Regular expressions
□ The pattern is given within two forward slashes
□ Use the =~ operator to test if a given string contains the regex.
□ Example:
   $keyword =~ /rain/

Exercise
Create an application in Perl which can read a machine-readable version of Shelley’s Collected Poems (file is provided) and which can print all lines that contain a given keyword. (Suggestions: “fire”, “rain”, “moon”, “storm”, “time”)

Regular expressions (2)
□ If you place “i” directly after the second forward slash, the comparison will take place in a case-insensitive manner.
□ \b can be used in regular expressions to represent word boundaries
   if ( $keyword =~ /\btime\b/i ) { }

Additional exercises
□ Create a program that can count the total number of lines in the file “shelley.txt”
□ Create a program that can calculate the length of each line, using the length() function
   length( $line ) ;
□ Calculate the average line length (in characters) for the entire file.
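
Worked sketch
The techniques on the slides above (reading lines, testing them against a regular expression with \b and the “i” flag, and measuring them with length()) can be combined roughly as follows. This is only a sketch under some assumptions: a few lines of sample text stand in for “shelley.txt” and are read through an in-memory file handle, and the names matching_lines and line_statistics are hypothetical helpers, not part of the course material.

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for "shelley.txt".
my $sample = <<'END';
I trod on grass made green by summer's rain,
Through the fast-falling rain and highwrought sea
'Tis like a wondrous strain that sweeps
And suddenly my brain became as sand
END

# Return the lines of $text that contain $keyword as a whole word,
# ignoring case (uses \b for word boundaries and the "i" flag).
sub matching_lines {
    my ( $text, $keyword ) = @_;
    my @found;
    # Assumption: read from a string instead of a file on disk
    open( my $in, '<', \$text ) or die "Cannot read text: $!";
    while ( my $line = <$in> ) {
        chomp $line;    # remove the trailing newline
        push @found, $line if $line =~ /\b\Q$keyword\E\b/i;
    }
    close($in);
    return @found;
}

# Return the number of lines and the average line length in characters.
sub line_statistics {
    my ($text) = @_;
    my ( $count, $total ) = ( 0, 0 );
    open( my $in, '<', \$text ) or die "Cannot read text: $!";
    while ( my $line = <$in> ) {
        chomp $line;
        $count++;
        $total += length($line);
    }
    close($in);
    my $average = $count ? $total / $count : 0;    # avoid dividing by zero
    return ( $count, $average );
}

# Print every sample line that contains the word "rain"
print "$_\n" for matching_lines( $sample, 'rain' );

# Print the line count and the average line length
my ( $count, $average ) = line_statistics($sample);
printf "%d lines, average length %.1f characters\n", $count, $average;
```

With this sample only the first two lines are printed: “strain” and “brain” contain the letters “rain”, but \b requires a word boundary on both sides. To work on the real file, the in-memory handle could be replaced by open ( IN , "shelley.txt" ) as shown on the “Reading a file” slide.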