learned and not forgotten: 2012

caffeine content

Caffeine in Chocolate
Ghirardelli 72% Cacao Twilight Delight: 22mg per 17g square
Ghirardelli 86% Cacao Midnight Reverie: 20mg per 13g square
dark chocolate in general is 28 mg per 1.4oz (40g)
1 oz (28 g) Cadbury (small) bar is 15mg
milk chocolate in general 3-10 mg per 1.4 oz (40g)

Caffeine in Tea
Black Tea: 23 - 110 mg
Oolong Tea: 12 - 55 mg
Green Tea: 8 - 36 mg
White Tea: 6 - 25 mg

Green, white and lightly oxidized oolong teas are good choices, as they tend to benefit from lower water temperatures and shorter steeping times.

Guidelines
Adverse effects have been reported when pregnant women consume more than 200mg a day.

I aim to have less than 100mg a day, which would be:
- 1 green tea and two Ghirardelli squares
- three Ghirardelli squares
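
The arithmetic behind that budget can be checked with a short Python sketch. The per-square figure is the 72% value from the chocolate list above; the green tea figure is an assumption made here (the top of the 8 - 36 mg range, as a worst case), and the names are only illustrative.

```python
# Caffeine per serving in mg, taken from the lists above.
SQUARE_72 = 22       # Ghirardelli 72% square
GREEN_TEA_MAX = 36   # assumption: worst case, top of the green tea range
DAILY_TARGET = 100   # personal target in mg


def total_caffeine(servings):
    """Sum caffeine for a list of (mg_per_serving, count) pairs."""
    return sum(mg * count for mg, count in servings)


tea_and_two_squares = total_caffeine([(GREEN_TEA_MAX, 1), (SQUARE_72, 2)])
three_squares = total_caffeine([(SQUARE_72, 3)])

print(tea_and_two_squares, tea_and_two_squares < DAILY_TARGET)
print(three_squares, three_squares < DAILY_TARGET)
```

Both combinations stay under the target even with the strongest green tea in the range.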

pre-pregnancy list

- preconception checkup
- ask about immunizations, get a flu shot
- take 400 mcg of folic acid a day for one month before conception and during first trimester
- make sure that your multivitamin doesn't contain more than 770 mcg RAE (2565 IU) of vitamin A, unless it is in beta-carotene form
- don't drink during last two weeks of cycle (or at all)
- limit caffeine to 200 mg per day
- get my BMI below 24 (160lbs would be 24.3, 158lbs would be 24.0)
- don't eat: shark, swordfish, king mackerel or tilefish
- don't eat more than 6 oz per week of canned tuna; pick chunk light (solid white and albacore are worse), which is good when it comes from skipjack and bad when it comes from yellowfin
- do eat two servings of fish a week: salmon, herring (Atlantic, jack, chub), farm-raised rainbow trout, sardines, whitefish
- floss
- call insurance about prenatal coverage, ask about deductibles, tests and procedures covered
- avoid soft cheese, cold deli meat, raw fish, unpasteurized juice

Grep Regex

grep "twitter \|wikipedia " filename # grep for A or B
grep -P "twitter |wikipedia " filename # grep for A or B
grep "\\\\" filename # grep for a single \

grep -P "\t" filename # grep for tab

grep -h - suppress the filename prefix on each matching line (useful when searching several files)
grep -A 2 - return each matching line and the two lines After it
grep -B 2 - return each matching line and the two lines Before it

Grep Regular Expressions

grep regex
http://www.bo.infn.it/alice/alice-doc/mll-doc/usrgde/node28.html

more about unicode hex etc

common python snippets

Python Modules http://wiki.python.org/moin/UsefulModules

regex substitution in myString

myString= re.sub(r"\([0-9][0-9]*\)", "", myString) # remove a number in literal parentheses, e.g. (12)
myString= re.sub("\\\\", "", myString) # remove a single backslash
myString= re.sub(r"\\", "", myString) # remove a single backslash using a "raw string"
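
The two backslash lines above build the same pattern. A small sketch to confirm it, using a made-up sample string:

```python
import re

sample = "a\\b\\c"   # the five characters a\b\c

# "\\\\" and r"\\" are the same two-character string, which the regex
# engine reads as "match one literal backslash".
escaped = re.sub("\\\\", "", sample)
raw = re.sub(r"\\", "", sample)

print(escaped)         # abc
print(escaped == raw)  # True
```

The raw string is usually easier to read because only the regex layer of escaping is left.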

string substitution (*not* for regex)

c = c.replace(u'\xae', '') # ®

c = c.replace(u'\xbb', '') # »

c = c.replace(u'\x99', '') # char for TM

c = c.replace(u'\xa9', '') # ©

c = c.replace(u'—', '-') # replace mdash u2014 with regular dash

c = c.replace('?', '') # ?

c = c.replace(':', '') # :

c = c.replace(';', '') # ;

save regex-matching string to a new variable
import re

pageObj = re.search("od [0-9][0-9]*", pagetext)
if pageObj:
    # pull just the digits out of the matched text
    pageObj2 = re.search("[0-9][0-9]*", pageObj.group())
    numb = pageObj2.group()
    print("numb is %s" % numb)
    pages = range(2, int(numb) + 1)

    for p in pages:
        newUrl = locUrl + 'afpg/' + str(p) + '/Default.aspx'
        print("adding %s" % newUrl)

stuff to write about

BIO notation

Inside-Outside Algorithm

HPSG

HPSG (Head-Driven Phrase Structure Grammar)

Stanford and Berkeley espouse this syntactic theory.
It is a "highly lexicalized, non-derivational generative grammar theory."
It is a type of phrase structure grammar, as opposed to a dependency grammar (The syntactic trees I learned are dependency grammars).

sed one-liners

sed 's/ Телефон: /\t/' filename # basic search and replace

sed 's/ [-] .*//' filename # put tricky chars in []

sed 's/\&amp/\&/' filename # example of escaped char

sed 's/.\{1\}/& /g' filename # insert 1 space between each char

sed 's/.\{2\}/& /g' filename # insert 1 space after every 2 chars

sed 's/\(.*\)\/word$/word\1/' filename # search and replace with a captured pattern \(\) on LHS and \1 on RHS

sed 's/word/&\n/' filename # search and replace insert a newline
sed '/^\s*$/d' filename # delete blank lines

Good resources for one-liners with examples http://sed.sourceforge.net/sed1line.txt

Starting Off in Python

We’ll start off by looking at material from these sites.
http://www.astro.ufl.edu/~warner/prog/python.html
http://www.sthurlow.com/python/

http://learnpythonthehardway.org/book/ <- recommended for new programmers

free online courses

http://www.udacity.com/ <- cs classes
http://coursera.org/

http://www.khanacademy.org/
http://www.khanacademy.org/#biology
http://www.khanacademy.org/#computer-science
UnCollege.org has a nice collection of online resources, including a lot of computer science resources: http://www.uncollege.org/resources.

- MIT Open Course Ware: http://ocw.mit.edu/courses/#electrical-engineering-and-computer-science
- Google Code University: http://code.google.com/edu/
- Stanford's Education Everywhere: http://see.stanford.edu/see/courses.aspx

using "screen" command to handle multiple sessions

from http://www.cyberciti.biz/tips/linux-screen-command-howto.html

$ screen -S 1
CTRL+a, c -- create another screen window
CTRL+a, n -- switch to next screen window I've got open

To list all windows, press CTRL+a followed by the " key (first hit CTRL+a, release both keys, then press " ).
To switch to a window by number, press CTRL+a followed by ' (first hit CTRL+a, release both keys, then press ' and it will prompt for the window number).

Common screen commands

screen command	Task
Ctrl+a c	Create new window
Ctrl+a k	Kill the current window / session
Ctrl+a w	List all windows
Ctrl+a 0-9	Go to a window numbered 0-9; use Ctrl+a w to see the numbers
Ctrl+a Ctrl+a	Toggle / switch between the current and previous window
Ctrl+a S	Split terminal horizontally into regions and press Ctrl+a c to create new window there
Ctrl+a :resize	Resize region
Ctrl+a :fit	Fit screen size to new terminal size. You can also hit Ctrl+a F for the same task
Ctrl+a :remove	Remove / delete region. You can also hit Ctrl+a X for the same task
Ctrl+a tab	Move to next region
Ctrl+a D (Shift-d)	Power detach and logout
Ctrl+a d	Detach but keep shell window open
Ctrl+a Ctrl+\	Quit screen
Ctrl+a ?	Display help screen, i.e. a list of commands

awk

specify tab-delimited columns
awk -F'\t' '{ print $1}' file

print column $13 and column $10 where $13 matches "MY"
awk -F"\t" '$13 ~ /MY/ {print $13"\t"$10}' file | less

awk -F"\t" '$7 == "restaurant" ' japan.osm.poi > restaurants

awk '{ print \$1 }' file # print 1st column; escape the $ when the command is inside a Perl system() call

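This one splits each line on commas with optional surrounding double quotes (a rough way to read quoted CSV) and prints fields 3 and 5 separated by a tab: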

awk -F "\"*,\"*" '{print $3"\t"$5}' jigyosyo.csv
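
For quoted CSV like this, Python's csv module is a sturdier alternative to a hand-written field separator. The sketch below is only illustrative: the sample rows are hypothetical stand-ins, since the contents of jigyosyo.csv are not shown here, and the function name is made up.

```python
import csv
import io

# Hypothetical sample rows; the real jigyosyo.csv is not reproduced here.
sample = io.StringIO(
    '"a1","b1","c1","d1","e1"\n'
    '"a2","b2","c, with comma","d2","e2"\n'
)


def columns_3_and_5(handle):
    """Return fields 3 and 5 of each row (1-indexed, as in awk), tab-joined."""
    return [row[2] + "\t" + row[4] for row in csv.reader(handle)]


lines = columns_3_and_5(sample)
for line in lines:
    print(line)
```

Unlike a separator regex, the csv module also copes with commas inside quoted fields and strips the quotes from the first and last field.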

iconv and recode to change file encoding

List all the encoding codes:

iconv --list

iconv --from-code LATIN1 --to-code UTF-8 --output adoos.com.my.categories.zlm-MYS.UTF.txt adoos.com.my.categories.zlm-MYS.LATIN1.txt

recode ....

windows 1252 encoding

CP1252 is windows encoding

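The en dash (byte 0x96 in CP1252; the em dash is 0x97) often ends up double encoded: the byte is read as Latin-1 and written back out as UTF-8, which gives the bytes C2 96.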
to find it do:
zcat file | grep -P '\xc2\x96'

0x91, 0x92, 0x93 and 0x94 (the curly single and double quotes) are other troublesome Windows chars
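
The byte pair C2 96 that the grep looks for is what the CP1252 byte 0x96 (an en dash) turns into when it is decoded as Latin-1 and then saved as UTF-8. A minimal Python sketch of how the damage happens and one way to undo it, assuming the text went through exactly that one wrong round trip:

```python
# An en dash as CP1252 stores it: the single byte 0x96.
cp1252_bytes = "\u2013".encode("cp1252")

# The mistake: decode those bytes as Latin-1, then write the result as UTF-8.
mangled = cp1252_bytes.decode("latin-1").encode("utf-8")
print(mangled)  # b'\xc2\x96'

# The repair: undo the two steps, then decode with the right codec.
repaired = mangled.decode("utf-8").encode("latin-1").decode("cp1252")
print(repaired == "\u2013")  # True
```

The same round trip applies to the curly quotes at 0x91 to 0x94.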

my .bashrc file

So you don't have to do source ~/.bashrc every time you open a terminal, put it in ~/.bash_profile.
~> cat .bash_profile
source ~/.bashrc

Things I like to have in my .bashrc file are:
.... in progress

file transfer

If you're transferring between Windows and linux, use winSCP.

If you're transferring linux to linux use scp like this:

>> scp ......

EM (Expectation maximization) algorithm

- an iterative method for finding maximum likelihood estimates of parameters in a statistical model

- E step: compute the expectation of the log-likelihood using the current estimates for the parameters

- M step: compute the parameters maximizing the expected log-likelihood found during the E step

I was trying to find weights for rules of my PCFG.

I used gigaword news data.
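
To make the E and M steps concrete, here is a small Python sketch on a much simpler model than a PCFG: two coins with unknown head probabilities, where we never see which coin produced each trial. The data, starting values and function name are made up for illustration, and the two coins are assumed equally likely a priori.

```python
def em_two_coins(heads, flips, theta_a, theta_b, iterations=20):
    """Estimate head probabilities of two coins from unlabeled trials."""
    for _ in range(iterations):
        # E step: split each trial's heads/tails between the coins in
        # proportion to how likely each coin is under the current estimates.
        # (The binomial coefficient cancels in the ratio, so it is omitted.)
        heads_a = tails_a = heads_b = tails_b = 0.0
        for h in heads:
            like_a = theta_a ** h * (1 - theta_a) ** (flips - h)
            like_b = theta_b ** h * (1 - theta_b) ** (flips - h)
            weight_a = like_a / (like_a + like_b)
            weight_b = 1.0 - weight_a
            heads_a += weight_a * h
            tails_a += weight_a * (flips - h)
            heads_b += weight_b * h
            tails_b += weight_b * (flips - h)
        # M step: re-estimate each parameter from the expected counts.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b


# Made-up data: number of heads in five trials of 10 flips each.
theta_a, theta_b = em_two_coins([5, 9, 8, 4, 7], flips=10, theta_a=0.6, theta_b=0.5)
print(round(theta_a, 2), round(theta_b, 2))
```

For PCFG rule weights the same loop applies, but the expected counts in the E step come from the inside-outside algorithm rather than a closed-form likelihood.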

Comp Ling Glossary

context free grammar (CFG) - in formal language theory, a formal grammar where every production rule has the form A -> b, where A is a single nonterminal and b is a string of zero or more terminals and/or nonterminals.

linguistics sometimes calls them phrase structure grammars
comp sci often uses Backus-Naur Form (BNF) for CFGs.
<symbol> ::= __expression__
<US-address> ::= <name> <street> <zip> "USA"
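
A rough Python sketch of how such rules rewrite a start symbol into a string of terminals. The US-address rule mirrors the BNF line above; the rules for name, street and zip are hypothetical fillers added so the example runs.

```python
# A toy grammar: nonterminals are dict keys. Each maps to a list of
# right-hand sides, and each right-hand side is a list of symbols.
GRAMMAR = {
    "US-address": [["name", "street", "zip", "USA"]],
    "name": [["Pat", "Smith"]],        # hypothetical rule
    "street": [["1", "Main", "St"]],   # hypothetical rule
    "zip": [["98101"]],                # hypothetical rule
}


def expand(symbol, grammar):
    """Rewrite symbol depth-first, always using its first rule."""
    if symbol not in grammar:   # terminal: nothing left to rewrite
        return [symbol]
    words = []
    for child in grammar[symbol][0]:
        words.extend(expand(child, grammar))
    return words


sentence = " ".join(expand("US-address", GRAMMAR))
print(sentence)  # Pat Smith 1 Main St 98101 USA
```

Each rule has exactly one nonterminal on its left-hand side, which is what makes the grammar context free.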

perplexity - a measure of how well test data is predicted by a model.

I read that the lowest perplexity found using a 3gram on the Brown Corpus is 247.
A model that predicted the test data with certainty would have a perplexity of 1, the lowest possible value (a cross-entropy of 0 bits).
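
A minimal Python sketch of the usual definition: perplexity is 2 raised to the average negative log2 probability the model assigns to each test word. The probabilities below are made up.

```python
import math


def perplexity(probabilities):
    """Perplexity from the probability a model gave each test word."""
    n = len(probabilities)
    log_sum = sum(math.log2(p) for p in probabilities)
    return 2 ** (-log_sum / n)


uniform = perplexity([0.25] * 8)   # model always torn between 4 equal choices
certain = perplexity([1.0] * 8)    # model that is always sure and right

print(uniform)  # 4.0
print(certain)  # 1.0
```

One way to read the number: a perplexity of 247 means the model is, on average, as uncertain as if it were choosing uniformly among 247 words at each step.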

precision & recall - In IR precision is the fraction of retrieved instances that are relevant and recall is the fraction of relevant instances that are retrieved.

Max precision is no false positives (being conservative about finding a match).
Max recall is no false negatives (being exhaustive and thorough).
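
Both fractions are easy to compute from two sets. A short Python sketch with made-up document IDs:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# 4 documents retrieved, 3 documents actually relevant, 2 in common.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5"],
)
print(p, r)
```

Here d3 and d4 are false positives (they lower precision) and d5 is a false negative (it lowers recall).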

Morphology review, glossary

(less complex morphology -> more complex morphology)
(low morpheme-to-word ratio -> high morpheme-to-word ratio)

analytic / isolating > agglutinative / fusional > polysynthetic

lexeme – a word concept. A lexeme can represent multiple surface forms of a word that comprise the paradigm of the lexeme (e.g. run, runs, ran and running are the same lexeme, RUN). As parts of a single concept, the forms of a lexeme will have the same syntactic category (runner is a different lexeme).
lemma – a particular form of a lexeme conventionally chosen to represent the canonical form. In a dictionary the lemma is the headword.
paradigm – the complete set of words associated with a given lexeme.
inflectional rules – relate a lexeme to its forms. Generally the syntactic category remains the same (e.g. eat and eaten, boy and boys)
derivational rules – relate a lexeme to a new lexeme. Generally, but not always, the syntactic category changes (e.g. slow and slowly). Necessarily the meaning of the base changes (e.g. write and rewrite, circle and encircle). Derivational affixes are bound morphemes.
zero derivation or conversion – changing from one lexeme to another without any surface change (e.g. telephone and to telephone).

endocentric compound – The compound has a head which represents the basic meaning (and same part-of-speech) of the whole (e.g. house and doghouse)
exocentric compound – meaning of the compound is not transparent from the constituents (e.g. white-collar and must-have).
copulative – …
appositional – …

Review of common grammatical cases

“Among modern languages, cases still feature prominently in most of the Balto-Slavic languages, with most having six to eight cases, as well as German and Modern Greek, which have four. In German, cases are mostly marked on articles and adjectives, and less so on nouns.” (wikipedia)

nominative – subject of finite verb
accusative – direct object of verb
dative – indirect object of a verb
ablative – indicates movement from something or causality
genitive – possessive
vocative – addressee
locative – a location
instrumental – object used in performing an action

Ergative versus Accusative languages

In nominative-accusative languages (like English) the agent of a transitive verb and the solitary argument of an intransitive verb are treated alike. They are both called the subject and they have a syntactic/morphological parity which might be word order or grammatical case (nominative). The object of a transitive verb (patient) is treated differently (accusative).

- We run marathons. We sleep.
- They remember us.

In ergative-absolutive languages (e.g. Basque) it is the solitary argument of an intransitive verb and the object of a transitive verb that are treated the same (morphologically or syntactically). In languages with case, the solitary argument of an intransitive verb and the object of a transitive verb would both have absolutive case. The agent of a transitive verb is treated differently (ergative case).

Some languages have both ergative and accusative morphology.

I'd like to know more about...

Brainstorm for Annual Topics
Lorenzo di Medici http://en.wikipedia.org/wiki/Lorenzo_di_Medici
Trees of WA state
The human brain
Weather
Alfred the Great
Vegetarian cooking
Ayurvedic cooking
basic biology

Essential Perl

Program Stub / Typical Flow

#!/usr/bin/perl
open IN, "<", "inputfile.txt" or die $!;
open OUT, ">", "outputfile.txt" or die $!;

if (@ARGV != 2) {    # @ARGV in numeric context is the arg count; $#ARGV would be 1 for 2 args
    print "ERROR - need 2 args!\n";
    exit;
}

$arg1 = $ARGV[0];
$arg2 = $ARGV[1];

while (<IN>) {
    $l = $_;
    chomp($l);
    …
}
close IN;
close OUT;

Read/Write files

open FILE, "first2.txt" or die "Personalized error message!!!";
open FILE, "first2.txt" or die $!; # generic error message will be stored in $! variable
open FILE, ">output.txt" or die $!;

# to use a variable for the filename, it is easier to write the mode in its own comma-separated quotes like this:
open FILE, "<", $mine or die $!;
open OUT, ">>", $yours or die $!;

<file.txt (read but DON’T create or truncate/delete/overwrite)
>file.txt (write, create and truncate/overwrite)
>>file.txt (append or create)
* adding ‘+’ allows for simultaneous reading and writing
+< (read/write, but DON’T create or truncate/delete/overwrite)
+> (read/write, create and truncate/overwrite)
+>> (read/append or read/create-write)

-- check if a file exists
$file = '/dir/file.txt';
if (-e $file) {
    print "File exists!";
}

- FILEHANDLE directly to array
- The file will only be read once per open statement so you can’t do @lines = <FILE> and then while(<FILE>) without closing FILE and re-opening it in between the two <FILE> lines of code.
my @lines = <FILE>;

Arrays

@myArray = ();

$length = @myArray;

if (exists $myArray[$ind]) #Value EXISTS, but may be undefined.
if(defined $myArray[$ind]) #Value is DEFINED, but may be false.
if($myArray[$ind]) #Value at array index $ind is TRUE.

Hash

# initialize by assigning to an empty list
%hash = ();

# add value
$hash{'key'} = 'value';
$hash{$key} = $value; # with vars

%hash = (
key1 => $val1,
key2 => $val2,
key3 => $val3,
);

# access values through a hash reference
$href->{'key'} = 'value';
$href->{$key} = $value; # with vars

Perl one-liners

Read line, substitute regex and print line.

perl -ne '{$l = $_; $l=~ s/dede/frfr/g; print $l;}' input1.txt
perl -ne '{$l = $_; $l =~ s/dede/frfr/g; print $l;}' input1.txt input2.txt
-e means "execute" and -n makes it loop line by line
Multiple input files are fine.

Snazzier replacement per line

perl -pe 's/a/b/g' < input > newfile

Read line and print the (1-indexed) line number and line.

perl -ne 'print "$. - $_"' input1.txt
output looks like:

1 - My first line of input text
2 - My second line of input text

With no input, print 0 to 999

perl -e ' for ($i=0;$i<=999;$i++) { print "$i\n"} '

THESE DON'T WORK WELL WITH sed

perl -pe "s/\&apos/'/g" < 2.2_cleaner > new

perl -pe 's/\&quot/"/g' < 2.2_cleaner > new

Programming Glossary

scalar variable – a non-composite (non-object) value. Primitive data types like booleans, integers, floating points, characters and strings are scalars. (bool, int, float, double, char and string).

using tar to compress and extract files

Using tar to compress file(s)

tar -czvf destination.tar.gz source/*

Using tar to uncompress/extract file(s)

tar -xzvf destination.tar.gz

where:
c = create a new tar file
v = verbose, list each file as it is compressed or extracted
f = use the archive filename given as the next argument
z = use gzip to zip it
x = extract file
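
When the same job has to happen from inside a Python script, the standard tarfile module covers the c, z and f flags. A rough sketch: the directory and file names are placeholders created in a temporary folder just for the demo, and the helper names are made up.

```python
import os
import tarfile
import tempfile


def make_archive(archive_path, source_dir):
    """Rough equivalent of: tar -czf archive_path source_dir"""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))


def list_archive(archive_path):
    """Rough equivalent of: tar -tzf archive_path"""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Demo in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "source")
    os.mkdir(source)
    with open(os.path.join(source, "notes.txt"), "w") as handle:
        handle.write("hello\n")
    archive = os.path.join(tmp, "destination.tar.gz")
    make_archive(archive, source)
    names = list_archive(archive)

print(names)
```

The "w:gz" mode string plays the role of -cz, and "r:gz" the role of -z when reading.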
