TinyCC 2.0 User’s Manual
Chris Biemann
March 2007
TinyCC 2.0 is a text corpus production engine that can be used to produce corpora in Leipzig Corpus Collection (LCC) format. The LCC-DVD 1.0, distributed from May 2006 on, was created using tinyCC 1.1 and other procedures. For further explanations on LCC corpus building, see [Quasthoff et al. 2006].
TinyCC 2.0 splits the text into sentences and creates a set of tab-separated files; their contents are described in the output section below.
The log-likelihood ratio [Dunning 1993] is used as significance test.
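As a sketch, the log-likelihood ratio for a word pair can be computed from a 2×2 contingency table of its frequencies. The function below is an illustrative Python reimplementation of Dunning's formula, not tinyCC's actual Perl code:

```python
# Illustrative reimplementation of Dunning's log-likelihood ratio for a
# 2x2 contingency table -- a sketch, not tinyCC's own implementation.
from math import log

def llr(k11, k12, k21, k22):
    """k11: joint frequency of word A and word B; k12: A without B;
    k21: B without A; k22: sentences/positions with neither word."""
    def h(*ks):
        # sum of k*ln(k) terms, skipping zero counts
        return sum(k * log(k) for k in ks if k > 0)
    n = k11 + k12 + k21 + k22
    return 2 * (h(k11, k12, k21, k22) + h(n)
                - h(k11 + k12, k21 + k22)    # row marginals
                - h(k11 + k21, k12 + k22))   # column marginals

independent = llr(10, 10, 10, 10)   # independent counts: ratio close to 0
associated = llr(20, 5, 5, 100)     # strongly associated pair: large ratio
```

Pairs whose ratio exceeds the significance threshold (see the SMINSIG/NMINSIG parameters below) are kept as significant co-occurrences.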
The implementation consists of a shell script calling programs written in Java and Perl. It is platform-dependent and has been tested only on Linux.
Download the archive tinyCC2.tar.gz into a folder of your choice and unpack it with:
gzip -d tinyCC2.tar.gz
tar -xf tinyCC2.tar.gz
A maintenance update which covers some problems with processing UTF-8 text is available from tinyCC2.1.1.tar.gz.
!! Windows users !!: A somewhat slower and less comfortable version of tinyCC is available at tinyCC1.5win.zip. Use this version only if you cannot run tinyCC 2.0 in a UNIX-like environment.
To run tinyCC 2.0, you need a Java Runtime Environment (JRE) of version 1.5 or later. You can obtain it at http://java.sun.com/j2se/1.5.0/download.jsp. Please ensure that java is in the path: to check this, type “java -version” in your shell; it should respond with a version number of 1.5.0 or higher. Further, you need Perl version 5 or higher. The latest version can be downloaded at http://www.perl.com/download.csp. Please ensure that perl is in the path: to check this, type “perl -v” in your shell; it should respond with a version number of 5 or higher.
Raw text corpora can be fed into tinyCC 2.0 in three different formats: plain text, HTML, and the SATZ.S format. To retain the source per sentence (e.g. the name of the text), the SATZ.S format allows providing this information directly. Otherwise, the source will carry the name of the file the sentence was found in.
The text is given in plain format. Sentences should not cross lines: if your corpus is formatted such that carriage-returns can be found within sentences, please remove them beforehand. The text must be given in files with “.txt” extension.
The text is given in HTML encoding in files with “.htm” or “.html” extension. In pre-processing, all HTML elements will be removed.
In the SATZ.S format, the text can only be given in plain format. To feed sources to the process, the following line must be present BEFORE every different text source:
 <quelle><name>NAME_OF_SOURCE</name><name_lang>NAME_OF_SOURCE</name_lang></quelle>
Please note that this line starts with a space-character.
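The steps above can be sketched in a few lines of Python; the source name and sentences here are invented for illustration:

```python
# Sketch of generating SATZ.S-format input; "mybook.txt" and the
# sentences are invented sample data, not part of the distribution.
def satz_s_lines(source_name, sentences):
    """Yield the <quelle> header line (note the mandatory leading
    space), followed by one sentence per line."""
    yield (" <quelle><name>{0}</name><name_lang>{0}</name_lang></quelle>"
           .format(source_name))
    for sentence in sentences:
        yield sentence

lines = list(satz_s_lines("mybook.txt",
                          ["First sentence.", "Second sentence."]))
```

Every sentence following such a header is attributed to that source until the next header appears.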
To provide text data to the process, please put all files containing the text in these formats into one folder (subfolders are possible).
Change to the directory you unpacked the archive into. The program accepts three parameters (see the sample call below).
The distribution comes with a small sample in all three formats in the folder sampledata. You can check the functionality of tinyCC by typing
./tinyCC mycorpusPLAIN sampledata/PLAIN none
in your shell. The program will produce seven files in a subfolder “result”:
bytes filename
1301 mycorpusPLAIN.co_n
9888 mycorpusPLAIN.co_s
576 mycorpusPLAIN.inv_so
8531 mycorpusPLAIN.inv_w
4470 mycorpusPLAIN.sentences
62 mycorpusPLAIN.sources
4306 mycorpusPLAIN.words
For corpora this small, please do not expect meaningful co-occurrences.
This section describes the format of the seven output files and what the output means.
This file has two columns and contains the sentences of the corpus.
1st column: sentence-id, as used in inv_so and inv_w
2nd column: sentence text as in the original; the internal tokenization of tinyCC is not reflected here.
This file has three columns and contains the words of the corpus.
1st column: word-id, as used in inv_w, co_s, co_n
2nd column: word
3rd column: word frequency count in the corpus
The first 100 word-ids are reserved for special characters such as punctuation, begin-of-sentence (%^%), end-of-sentence (%$%) and numeral (_NUMBER_).
This file has two columns and contains the sources.
1st column: source-id as used in inv_so
2nd column: source name: either the file name or the contents of the <name> tag in SATZ.S format
This file has two columns and indexes sentences by source.
1st column: sentence-id as used in sentences
2nd column: source-id as used in sources
This file has four columns, of which the fourth is optional. It indexes sentences by words.
1st column: word-id as in words
2nd column: sentence-id as in sentences
3rd column: position in sentence. Here, the internal tokenization is reflected.
4th column (optional): contains “-” if the word is part of a multi-word unit.
This file has four columns and contains significant neighbour-based co-occurrences.
1st column: word-id of left word in a word bigram
2nd column: word-id of right word in a word bigram
3rd column: frequency of word bigram consisting of left and right word
4th column: log-likelihood ratio
This file has four columns and contains significant sentence-based co-occurrences.
1st column: word-id of word 1
2nd column: word-id of word 2
3rd column: frequency of joint occurrence
4th column: log-likelihood ratio
The data is symmetric in columns 1 and 2.
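Since all files share the tab-separated layout and the word-id key, they are easy to join. A minimal Python sketch resolving the word-ids of a co_n file back to word strings (the sample rows below are invented for illustration; note that ids up to 100 are reserved for special tokens):

```python
# Sketch: joining a .words file with a .co_n file to resolve word-ids.
# The rows below are invented sample data in the documented layout.
words_data = "101\thouse\t42\n102\tgarden\t17\n"   # id, word, frequency
co_n_data = "101\t102\t5\t12.34\n"                 # id1, id2, freq, sig

# word-id -> word (1st column -> 2nd column of the .words file)
id2word = {}
for row in words_data.splitlines():
    wid, word, freq = row.split("\t")
    id2word[int(wid)] = word

# resolve both word-ids of each neighbour co-occurrence
pairs = []
for row in co_n_data.splitlines():
    w1, w2, freq, sig = row.split("\t")
    pairs.append((id2word[int(w1)], id2word[int(w2)],
                  int(freq), float(sig)))
```

The same pattern works for inv_w, inv_so and co_s, substituting the respective id columns.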
The parameters of tinyCC 2.0 have been carefully set in a sensible way. For normal text corpora, there should be no need to change them. However, this section describes how to do exactly this.
To change internal parameters, open the file “tinyCC.sh” in the main folder of your installation and look for the following section, starting after the initial comments:
# input text is in format (latin|utf8)
export TEXTFORM=utf8
# locales for latin MUST be installed on your system!
# See `localedef --list-archive` for a list of installed locales
# Edit /etc/locale.gen and sudo locale-gen to enable specific locales
# locale to be used for processing ISO 8859 text
export LTYPE=de_DE@euro
# name of this locale as understood by `recode`
export LNAME=latin1
# locale to be used for processing UTF-8 text
export UTYPE=de_DE.UTF-8
# memory max usage in MB (approximate)
export MAXMEM=600
# min frequency for scoocs
export SMINFREQ=2
# min sig for scooc
export SMINSIG=6.63
# min freq for nbcooc
export NMINFREQ=2
# min sig for NBcooc
export NMINSIG=3.84
# number of digits after .
export DIGITS=2
# temp directory
export TEMP=temp
# result directory
export RES=result
These parameters are explained below:
· TEXTFORM: Specifies which format to assume for the processed texts. `latin` should work for most 8-bit encodings, such as ISO-8859-* and the Windows-125x code pages.
· LTYPE: Name of an available locale for ISO-8859-* encoding.
· LNAME: Name of the above encoding as understood by GNU recode (see `recode -l` for the list of supported encodings).
· UTYPE: Name of an available locale for UTF-8 encoding. See `localedef --list-archive` for a list of the locales installed on your system. If no UTF-8 locale is available on your system, select one from /usr/share/i18n/SUPPORTED, add it to /etc/locale.gen, and run `sudo locale-gen` to rebuild the locales on your system.
· MAXMEM: The maximum RAM in megabytes the process is allowed to use. As this value is very approximate, please set it considerably lower than your main memory. Larger values speed up co-occurrence computation (especially for large corpora), but too large values will result in swapping.
· SMINFREQ/NMINFREQ: The minimum joint occurrence frequency to be taken into account for sentence-based/neighbour-based co-occurrences. A value of 1 should not be used, see [Moore 2004].
· SMINSIG/NMINSIG: The minimum log-likelihood ratio to be taken into account for co-occurrences. 3.84 corresponds to 5% error probability, 6.63 corresponds to 1% error probability; cf. [Moore 2004].
· DIGITS: Output precision for log-likelihood ratios.
· TEMP: temporary working directory.
· RES: where to store the output.
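The default significance thresholds are the chi-square critical values for one degree of freedom. For 1 df the chi-square CDF at x equals erf(sqrt(x/2)), so the stated error probabilities can be checked with the Python standard library alone:

```python
# Checking that the default SMINSIG/NMINSIG thresholds match the
# stated error probabilities, using the 1-df chi-square CDF.
from math import erf, sqrt

def chi2_cdf_1df(x):
    """Chi-square CDF with one degree of freedom."""
    return erf(sqrt(x / 2.0))

p_5 = 1 - chi2_cdf_1df(3.84)   # approx. 0.05 (NMINSIG default)
p_1 = 1 - chi2_cdf_1df(6.63)   # approx. 0.01 (SMINSIG default)
```

Raising the thresholds yields fewer but more reliable co-occurrences; lowering them increases recall at the cost of noise.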
Further, you might change:
· tokenisation: dive into “perl/tokenize.pl” (latin1)
· behaviour on carriage-returns inside sentences: remove “-n” in the text2satz call
· significance formula: dive into “perl/nbcooc.pl” and “perl/ssig.pl”
· platform dependence: the most crucial point is the usage of “bin/sort64”, which is UNIX sort compiled for 64 bits. 32-bit sorts do not handle temporary files larger than 2 GB.
TinyCC 2.0 merely converts plain text data into the LCC format, computing co-occurrences along the way. Duplicates and ‘dirt’ are not removed. TinyCC was tested on corpora of up to 50 million sentences (750 million words).
TinyCC 2.0 was developed by Chris Biemann at the University of Leipzig. The component handling sources and performing sentence splitting was developed by Fabian Schmidt. Some fixes for UTF-8 handling were implemented by Matthias Richter. Thanks go to all the testers from the NLP Department, University of Leipzig.
[Dunning 1993] Dunning, T. E. (1993): Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics 19(1). http://www.comp.lancs.ac.uk/ucrel/papers/tedstats.pdf
[Moore 2004] Moore, R. C. (2004): On Log-Likelihood-Ratios and the Significance of Rare Events. Proceedings of EMNLP 2004, Barcelona, Spain. http://research.microsoft.com/users/bobmoore/rare-events-final-rev.pdf
[Quasthoff et al. 2006] Quasthoff, U., Richter, M. and Biemann, C. (2006): Corpus Portal for Search in Monolingual Corpora. Proceedings of LREC-06, Genoa, Italy. QuasthoffBiemannRichter06portal.pdf