EM (Expectation maximization) algorithm
Comp Ling Glossary
- context-free grammar (CFG) - in formal language theory, a formal grammar in which every production rule has the form A -> b, where A is a single nonterminal and b is a string of zero or more terminals and/or nonterminals.
- linguistics sometimes calls them phrase structure grammars
- comp sci often uses Backus-Naur Form (BNF) for CFGs.
<symbol> ::= __expression__
<US-address> ::= <name> <street> <zip> "USA"
- perplexity - a measure of how well test data is predicted by a model.
- I read that the lowest perplexity found using a trigram model on the Brown Corpus is 247.
- The lower bound is 1, reached only by a model that assigns probability 1 to every item of the test data; lower perplexity means better prediction.
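A minimal sketch of how perplexity can be computed, assuming the model supplies a probability for each test token. The function name and the example probabilities are made up for illustration. Perplexity is taken here as the exponential of the average negative log-probability per token.

```python
import math


def perplexity(token_probs):
    """Perplexity of a test sequence, given the model's probability for each token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-probability per token (cross-entropy in nats).
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(cross_entropy)


# Hypothetical per-token probabilities from two imaginary models.
confident_model = [0.5, 0.5, 0.5, 0.5]
unsure_model = [0.5, 0.1, 0.5, 0.1]

print(perplexity(confident_model))  # lower value: the data was predicted better
print(perplexity(unsure_model))     # higher value: the data was more "surprising"
```

A model that spreads its probability evenly over k choices at every step comes out with a perplexity of about k, which is why perplexity is often read as an effective branching factor.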
- precision & recall - In information retrieval (IR), precision is the fraction of retrieved instances that are relevant, and recall is the fraction of relevant instances that are retrieved.
- Max precision is no false positives (being conservative about finding a match).
- Max recall is no false negatives (being exhaustive and thorough).
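The two definitions above can be sketched with plain sets. The document IDs are invented, and returning 0.0 when a set is empty is just a convention chosen for this sketch.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and a set of relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Assumed convention: report 0.0 instead of dividing by zero on an empty set.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search result: 4 documents retrieved, 3 documents actually relevant.
p, r = precision_recall(retrieved={"d1", "d2", "d3", "d4"},
                        relevant={"d1", "d2", "d5"})
print(p, r)  # 2 of 4 retrieved are relevant; 2 of 3 relevant were retrieved
```

Retrieving only "d1" would give perfect precision but lower recall, while retrieving every document would give perfect recall at the cost of precision.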
Morphology review, glossary
(from low morpheme-to-word ratio to high morpheme-to-word ratio)
- lexeme – a word concept. A lexeme can represent multiple surface forms of a word that comprise the paradigm of the lexeme (e.g. run, runs, ran and running are the same lexeme, RUN). As parts of a single concept, the forms of a lexeme will have the same syntactic category (runner is a different lexeme).
- lemma – a particular form of a lexeme conventionally chosen to represent the canonical form. In a dictionary the lemma is the headword.
- paradigm - the complete set of word forms associated with a given lexeme.
- inflectional rules – relate a lexeme to its forms. Generally the syntactic category remains the same (e.g. eat and eaten, boy and boys)
- derivational rules – relate a lexeme to a new lexeme. Generally, but not always, the syntactic category changes (e.g. slow and slowly). Necessarily the meaning of the base changes (e.g. write and rewrite, circle and encircle). Derivational affixes are bound morphemes.
- zero derivation or conversion – changing from one lexeme to another without any surface change (e.g. telephone and to telephone).
- endocentric compound – The compound has a head which represents the basic meaning (and same part-of-speech) of the whole (e.g. house and doghouse)
- exocentric compound – The compound has no head, so its meaning is not transparent from the constituents (e.g. white-collar and must-have).
- copulative – …
- appositional – …
Review of common grammatical cases
“Among modern languages, cases still feature prominently in most of the Balto-Slavic languages, with most having six to eight cases, as well as German and Modern Greek, which have four. In German, cases are mostly marked on articles and adjectives, and less so on nouns.” (wikipedia)
- nominative – subject of finite verb
- accusative – direct object of verb
- dative – indirect object of a verb
- ablative – indicates movement from something or causality
- genitive – possessive
- vocative – addressee
- locative – a location
- instrumental – object used in performing an action
Ergative versus Accusative languages
In nominative-accusative languages (like English) the agent of a transitive verb and the solitary argument of an intransitive verb are treated alike. They are both called the subject and they have a syntactic/morphological parity which might be word order or grammatical case (nominative). The object of a transitive verb (patient) is treated differently (accusative).
- We run marathons. We sleep.
- They remember us.
In ergative-absolutive languages (e.g. Basque) it is the solitary argument of an intransitive verb and the object of a transitive verb that are treated the same (morphologically or syntactically). In languages with case, the solitary argument of an intransitive verb and the object of a transitive verb would have absolutive case. The agent of a transitive verb is treated differently (ergative case).
Some languages have both ergative and accusative morphology.
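The contrast between the two alignments can be sketched as a lookup table. The role labels S (solitary argument of an intransitive verb), A (agent of a transitive verb) and P (patient/object of a transitive verb) are shorthand added for this sketch and do not appear in the notes above.

```python
# Case assigned to each role under the two alignment types described above.
ALIGNMENTS = {
    "nominative-accusative": {"S": "nominative", "A": "nominative", "P": "accusative"},
    "ergative-absolutive": {"S": "absolutive", "A": "ergative", "P": "absolutive"},
}


def grouped_with_s(alignment):
    """Return the transitive roles (A and/or P) that are marked the same way as S."""
    cases = ALIGNMENTS[alignment]
    return [role for role in ("A", "P") if cases[role] == cases["S"]]


for name in ALIGNMENTS:
    print(name, "treats S like", grouped_with_s(name))
```

In "We sleep" and "We run marathons", English puts both S and A in the same form (we), while P takes a different form (us in "They remember us"), which matches the first row of the table.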
I'd like to know more about...
Lorenzo di Medici http://en.wikipedia.org/wiki/Lorenzo_di_Medici
Trees of WA state
The human brain
Weather
Alfred the Great
Vegetarian cooking
Ayurvedic cooking
basic biology
Essential Perl
- Program Stub / Typical Flow
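#!/usr/bin/perl
use strict;
use warnings;
# open the input for reading and the output for writing
open IN, "<", "inputfile.txt" or die $!;
open OUT, ">", "outputfile.txt" or die $!;
# @ARGV in scalar context is the number of arguments ($#ARGV is the last index)
if (@ARGV != 2) {
    print "ERROR - need 2 args!\n";
    exit 1;
}
my $arg1 = $ARGV[0];
my $arg2 = $ARGV[1];
while (<IN>) {
    my $l = $_;
    chomp($l); # remove the trailing newline
    …
}
close IN;
close OUT;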
- Read/Write files
open FILE, "first2.txt" or die "Personalized error message!!!";
open FILE, "first2.txt" or die $!; # generic error message will be stored in $! variable
open FILE, ">output.txt" or die $!;
# to use a variable for the filename, it is easier to write the mode in its own comma-separated quotes like this:
open FILE, "<", $mine or die $!;
open OUT, ">>", $yours or die $!;
<file.txt (read but DON’T create or truncate/delete/overwrite)
>file.txt (write, create and truncate/overwrite)
>>file.txt (append or create)
* adding ‘+’ allows for simultaneous reading and writing
+< (read/write, but DON’T create or truncate/delete/overwrite)
+> (read/write, create and truncate/overwrite)
+>> (read/append or read/create-write)
-- check if a file exists
$file = '/dir/file.txt';
if (-e $file) {
    print "File exists!\n";
}
- FILEHANDLE directly to array
- The file will only be read once per open statement so you can’t do @lines = <FILE> and then while(<FILE>) without closing FILE and re-opening it in between the two <FILE> lines of code.
my @lines = <FILE>;
- Arrays
@myArray = ();
$length = @myArray; # an array in scalar context gives its number of elements
if (exists $myArray[$ind]) # Value EXISTS, but may be undefined.
if (defined $myArray[$ind]) # Value is DEFINED, but may be false.
if ($myArray[$ind]) # Value at array index $ind is TRUE.
- Hash
# initialize by assigning to an empty list
%hash = ();
# add value
$hash{'key'} = 'value';
$hash{$key} = $value; # with vars
%hash = (
    key1 => $val1,
    key2 => $val2,
    key3 => $val3,
);
# access values through a hash reference
$href = \%hash;
$href->{'key'} = 'value';
$href->{$key} = $value; # with vars
MORE ON HASHES AND HASH REFERENCES
keys()
values()
- Sub routines
# call subroutine
$result = doSomething($input);
# actual subroutine
sub doSomething { # no "()" after the name: an empty prototype would declare a sub that takes no arguments
    my $var1 = shift(@_);
    ….
    return $var2;
}
- Regex
I remember pretty well
- Loops
foreach (@myArray) {
    print $_;
}
foreach $item (@myArray) { # use scalar as iterator for readability
    print $item;
}
$linecount++ while (<FILE>);
do { # execute do statement before testing expression
    $calc += ($fact * $val); # using an assignment operator
    # equivalent to $calc = $calc + ($fact * $val);
} while ($calc < 100);
split, push, pop, shift etc??????
- Good Resources
http://en.wikibooks.org/wiki/Perl_Programming/Operators
http://www.cs.mcgill.ca/~abatko/computers/programming/perl/howto/hash/
http://www.troubleshooters.com/codecorn/littperl/perlsub.htm
http://www.cs.cmu.edu/afs/cs/usr/rgs/mosaic/pl-predef.html Predefined names in Perl
http://cslibrary.stanford.edu/108/EssentialPerl.html
http://perldoc.perl.org/functions
Perl one-liners
- Read line, substitute regex and print line.
- perl -ne '{$l = $_; $l=~ s/dede/frfr/g; print $l;}' input1.txt
- perl -ne '$l = $_; $l =~ s/dede/frfr/g; print $l;' input1.txt input2.txt
- -e means “execute” (run the code given on the command line) and -n makes it loop line by line
- Multiple input files are fine.
- Snazzier replacement per line
- perl -pe 's/a/b/g' < input > newfile
- Read line and print the (1-indexed) line number and line.
- perl -ne 'print "$. - $_"' input1.txt
- output looks like:
- 1 - My first line of input text
- 2 - My second line of input text
- With no input, print 0 to 999
- perl -e ' for ($i=0;$i<1000;$i++) { print "$i\n"} '
Programming Glossary
- scalar variable – a non-composite (non-object) value. Primitive data types like booleans, integers, floating points, characters and strings are scalars. (bool, int, float, double, char and string).