TinyCC 2.0 User’s Manual

Chris Biemann

March 2007

 

 

Introduction

 

TinyCC 2.0 is a text corpus production engine that can be used to produce corpora in Leipzig Corpus Collection (LCC) format. The LCC-DVD 1.0, distributed from May 2006 on, was created using tinyCC 1.1 and other procedures. For further explanations on LCC corpus building, see [Quasthoff et al. 2006].

 

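TinyCC 2.0 splits the text into sentences and creates tab-separated files containing the sentences, the words with their frequencies, the sources, inverted indices of the sentences by source and by word, and significant neighbour-based and sentence-based word co-occurrences.

The log-likelihood ratio [Dunning 1993] is used as the significance test.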

Installation

The implementation consists of a shell script calling programs written in JAVA and PERL. It is platform-dependent and was tested only on LINUX.

Download the archive tinyCC2.tar.gz into a folder of your choice and unpack it with:

gzip -d tinyCC2.tar.gz

tar -xf tinyCC2.tar

A maintenance update which covers some problems with processing UTF-8 text is available from tinyCC2.1.1.tar.gz.

!! Windows Users !!: A somewhat slower and less comfortable version of tinyCC is available at tinyCC1.5win.zip. Only use this version if you cannot run tinyCC 2.0 in a UNIX-like environment.

System requirements

To run tinyCC 2.0, you need a Java Runtime Environment (JRE) of version 1.5 or later. You can obtain it at http://java.sun.com/j2se/1.5.0/download.jsp. Please ensure that java is in the path: to check this, type “java -version” in your shell. It should respond with a version number of 1.5.0 or higher. Further, you need PERL version 5 or higher. The latest version can be downloaded at http://www.perl.com/download.csp. Please ensure that PERL is in the path: to check this, type “perl -v” in your shell. It should respond with a version number of 5 or higher.
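Both checks can be combined in a small shell helper. The following is a minimal sketch; the function name check_tool is an illustrative assumption and is not part of the tinyCC distribution. It only tests whether the interpreters can be found on the path, the version numbers still have to be inspected with “java -version” and “perl -v”.

```shell
# Hypothetical helper: report whether a required program is on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found: $(command -v "$1")"
    else
        echo "$1 not found on PATH" >&2
        return 1
    fi
}

# tinyCC 2.0 needs both interpreters.
for tool in java perl; do
    check_tool "$tool" || echo "install $tool or extend PATH before running tinyCC" >&2
done
```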

Operation of tinyCC 2.0

Input formats

Raw text corpora can be fed into tinyCC 2.0 in three different formats: plain text, HTML and the SATZ.S-Format. To retain the source of each sentence (e.g. the name of the text), the SATZ.S-Format lets you provide this information directly. Otherwise, the source will carry the name of the file the sentence was found in.

 

Plain text

The text is given in plain format. Sentences should not cross lines: if your corpus is formatted such that carriage-returns can be found within sentences, please remove them beforehand. The text must be given in files with “.txt” extension.
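If a corpus is hard-wrapped, the line breaks can be removed with standard tools before running tinyCC. The sketch below is one possible approach and rests on an assumption that has to hold for your data: paragraphs are separated by blank lines, and no sentence continues across a blank line. The function name unwrap_paragraphs is hypothetical.

```shell
# Hypothetical helper: join hard-wrapped lines so that each paragraph
# ends up on a single line. Reads stdin, writes stdout.
# Assumption: paragraphs are separated by blank lines.
unwrap_paragraphs() {
    # Strip carriage returns first (Windows line endings), then let awk
    # read blank-line separated paragraphs (RS = "") and replace every
    # line break inside a paragraph by a single space.
    tr -d '\r' | awk 'BEGIN { RS = "" } { gsub(/[ \t]*\n[ \t]*/, " "); print }'
}
```

A call could look like “unwrap_paragraphs < wrapped.txt > data/mytext.txt”, where both file names are placeholders.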

 

HTML

The text is given in HTML encoding in files with “.htm” or “.html” extension. In pre-processing, all HTML elements will be removed.

 

SATZ.S-Format

In this format, the text itself must be given as plain text. To pass source information to the process, the following line must precede the text of every source:

 <quelle><name>NAME_OF_SOURCE</name><name_lang>NAME_OF_SOURCE </name_lang></quelle>

Please note that this line starts with a space-character.
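Source lines of this kind can be generated with a short shell function. This is a sketch under assumptions: the function names are hypothetical, the same name is written into both tags as in the template above, and the template is copied literally, including the leading space and the space before the closing name_lang tag.

```shell
# Hypothetical helper: print one source line for the given source name.
# The line starts with a space character, as required.
satz_source_line() {
    printf ' <quelle><name>%s</name><name_lang>%s </name_lang></quelle>\n' "$1" "$1"
}

# Hypothetical helper: write each given text file to stdout, preceded by
# a source line that carries the file name without directory.
satz_from_files() {
    for f in "$@"; do
        satz_source_line "$(basename "$f")"
        cat "$f"
    done
}
```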

 

To provide text data to the process, please put all files containing the text in these formats into one folder (subfolders are possible).

 

Running the program

Change to the directory you unpacked the archive to. The program accepts three parameters:

  1. Name of the corpus, e.g. “mycorpus”
  2. Data folder for texts, e.g. “sampledata/PLAIN”
  3. List of multiword units (MWUs). If you do not want to index MWUs, use “none”

 

The distribution comes with a small sample in all three formats in the folder sampledata. You can check the functionality of tinyCC by typing

./tinyCC mycorpusPLAIN sampledata/PLAIN none 

in your shell. The program will produce seven files in a subfolder “result”:

bytes   filename
1301    mycorpusPLAIN.co_n
9888    mycorpusPLAIN.co_s
576     mycorpusPLAIN.inv_so
8531    mycorpusPLAIN.inv_w
4470    mycorpusPLAIN.sentences
62      mycorpusPLAIN.sources
4306    mycorpusPLAIN.words

 

For corpora this small, please do not expect meaningful co-occurrences.
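Whether a run produced all seven files can be verified with a small check. The following is a sketch only; the function name check_result is hypothetical, and the default folder name “result” is taken from the description above.

```shell
# Hypothetical helper: check that all seven output files exist.
# $1 = corpus name, $2 = result folder (defaults to "result")
check_result() {
    corpus=$1
    dir=${2:-result}
    missing=0
    for suffix in sentences words sources inv_so inv_w co_n co_s; do
        if [ ! -f "$dir/$corpus.$suffix" ]; then
            echo "missing: $dir/$corpus.$suffix" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

After the sample run, “check_result mycorpusPLAIN” should return without printing anything.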

 

Output Format

This section describes the format of the seven output files and what the output means.

File: sentences

This file has two columns and contains the sentences of the corpus.

1st column: sentence-id, as used in inv_so and inv_w

2nd column: sentence text as in the original. The internal tokenization of tinyCC is not reflected here.

 

File: words

This file has three columns and contains the words of the corpus.

1st column: word-id, as used in inv_w, co_s, co_n

2nd column: word

3rd column: word frequency count in the corpus

The first 100 word-ids are reserved for special characters such as punctuation, begin-of-sentence (%^%), end-of-sentence (%$%) and numeral (_NUMBER_).

 

File: sources

This file has two columns and contains the sources

1st column: source-id as used in inv_so

2nd column: Source name: either file name or contents of <name>-tag in SATZ.S format

 

File: inv_so

This file has two columns and indexes sentences by source.

1st column: sentence-id as used in sentences

2nd column: source-id as used in sources

 

File: inv_w

This file has four columns of which the fourth is optional. It indexes sentences by words

1st column: word-id as in words

2nd column: sentence-id as in sentences

3rd column: position in sentence. Here, the internal tokenization is reflected.

4th column (optional): Contains “-” if the word is part of a multiword unit.
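The three files words, inv_w and sentences together allow looking up all sentences that contain a given word. The sketch below shows one way to do this with awk, based only on the column layout described above. The function name sentences_with_word is hypothetical; a sentence is printed once even if the word occurs in it several times.

```shell
# Hypothetical helper: print all sentences containing a word.
# $1 = path prefix of the corpus files, e.g. result/mycorpusPLAIN
# $2 = word to look up (exact match against column 2 of "words")
sentences_with_word() {
    awk -F'\t' -v w="$2" '
        # 1st file (words): find the word-id of the requested word
        FILENAME == ARGV[1] { if (($2 "") == (w "")) { id = $1; found = 1 } next }
        # 2nd file (inv_w): collect the sentence-ids for that word-id
        FILENAME == ARGV[2] { if (found && $1 == id) hit[$2] = 1; next }
        # 3rd file (sentences): print the collected sentences
        ($1 in hit) { print $2 }
    ' "$1.words" "$1.inv_w" "$1.sentences"
}
```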

 

File: co_n

This file has four columns and contains significant neighbour-based co-occurrences

1st column: word-id of left word in a word bigram

2nd column: word-id of right word in a word bigram

3rd column: frequency of word bigram consisting of left and right word

4th column: log-likelihood ratio

 

File: co_s

This file has four columns and contains significant sentence-based co-occurrences

1st column: word-id of word 1

2nd column: word-id of word 2

3rd column: frequency of joint occurrence

4th column: log-likelihood ratio

The data is symmetric in columns 1 and 2.
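Since co_n and co_s contain only word-ids, they are easier to inspect after the ids have been replaced by the words themselves. A minimal sketch, assuming the column layout described above; the function name cooc_with_words is hypothetical.

```shell
# Hypothetical helper: replace the word-ids in a co-occurrence file
# by the corresponding words.
# $1 = words file, $2 = co_s or co_n file
cooc_with_words() {
    awk -F'\t' '
        # first file: remember the word for every word-id
        NR == FNR { word[$1] = $2; next }
        # second file: print words instead of ids, keep frequency and significance
        { printf "%s\t%s\t%s\t%s\n", word[$1], word[$2], $3, $4 }
    ' "$1" "$2"
}
```

For the sample run, this could be called as “cooc_with_words result/mycorpusPLAIN.words result/mycorpusPLAIN.co_s”.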

Tuning tinyCC 2.0

The default parameters of tinyCC 2.0 have been chosen to suit normal text corpora, so there should rarely be a need to change them. This section describes how to change them if necessary.

For changing internal parameters, open the file “tinyCC.sh” in the main folder of your installation and look for the following section, starting after the initial comments:

# input text is in format (latin|utf8)
export TEXTFORM=utf8

# locales for latin __must__ be installed on your system!
# See `localedef --list-archive` for a list of installed locales
# Edit /etc/locale.gen and sudo locale-gen to enable specific locales

# locale to be used for processing ISO 8859 text
export LTYPE=de_DE@euro
# name of this locale as understood by `recode`
export LNAME=latin1

# locale to be used for processing UTF-8 text
export UTYPE=de_DE.UTF-8

# Memory max usage in MB (approximate)
export MAXMEM=600
# min frequency for scoocs
export SMINFREQ=2
# min sig for scooc
export SMINSIG=6.63
# min freq for nbcooc
export NMINFREQ=2
# min sig for NBcooc
export NMINSIG=3.84
# number of digits after .
export DIGITS=2
# temp directory
export TEMP=temp
# result directory
export RES=result

 

The parameters have the following meaning:

·         TEXTFORM: Specifies which encoding to assume for the processed texts. `latin` should work for most single-byte encodings, such as ISO-8859-* and the Windows-125x code pages.

·         LTYPE: Name of an available locale for ISO-8859-* encoding.

·         LNAME: Name of the above encoding as understood by GNU recode (see recode -l for the list of supported encodings)

·         UTYPE: Name of an available locale for UTF-8 encoding. See localedef --list-archive for a list of the locales installed on your system. If no locale supporting UTF-8 is available on your system, select one from /usr/share/i18n/SUPPORTED and add it to your locales list with sudo $EDITOR /etc/locale.gen. Then run sudo locale-gen to rebuild the locales on your system.

·         MAXMEM: The maximum RAM in megabytes the process is allowed to use. As this value is very approximate, please set it considerably lower than your main memory. Larger values speed up co-occurrence computation (especially for large corpora), but too large values will result in swapping.

·         SMINFREQ/NMINFREQ: The minimum joint occurrence frequency to be taken into account for sentence/neighbour-based co-occurrences. A value of 1 should not be used, see [Moore 2004].

·         SMINSIG/NMINSIG: The minimum log-likelihood ratio to be taken into account for co-occurrences. 3.84 corresponds to 5% error probability, 6.63 corresponds to 1% error probability, also cf. [Moore 2004].

·         DIGITS: Output precision for log-likelihood ratios.

·         TEMP: temporary working directory

·         RES: where to store the output
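To get a feeling for the SMINSIG and NMINSIG thresholds, it can help to compute a log-likelihood ratio by hand. The sketch below uses the common 2x2 contingency table form of the statistic from [Dunning 1993]; whether “perl/ssig.pl” and “perl/nbcooc.pl” implement exactly this variant is an assumption. The function name llr and its argument order are illustrative. The thresholds 3.84 and 6.63 are the critical values of the chi-square distribution with one degree of freedom at the 5% and 1% level.

```shell
# Hypothetical helper: log-likelihood ratio for a 2x2 contingency table.
# $1 = k, joint frequency of words A and B
# $2 = a, frequency of word A
# $3 = b, frequency of word B
# $4 = n, number of units (sentences, or bigram positions)
llr() {
    awk -v k="$1" -v a="$2" -v b="$3" -v n="$4" '
        # contribution of one table cell: observed * ln(observed / expected)
        function term(o, e) { return (o > 0) ? o * log(o / e) : 0 }
        BEGIN {
            # observed counts
            o11 = k; o12 = a - k; o21 = b - k; o22 = n - a - b + k
            # expected counts if A and B were independent
            e11 = a * b / n;       e12 = a * (n - b) / n
            e21 = (n - a) * b / n; e22 = (n - a) * (n - b) / n
            printf "%.2f\n", 2 * (term(o11, e11) + term(o12, e12) + term(o21, e21) + term(o22, e22))
        }'
}
```

If two words co-occur exactly as often as expected by chance, e.g. “llr 1 10 10 100”, the ratio is zero and the pair is filtered out; the more the joint frequency exceeds the expectation, the higher the value.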

 

Further, you might change

·         tokenisation: edit “perl/tokenize.pl” (latin1) and “perl/tokenize_utf8.pl” (UTF-8). Please be careful to preserve the file's encoding.

·         behaviour on carriage-returns inside sentences: remove “-n” in the text2satz call

·         significance formula: edit “perl/nbcooc.pl” and “perl/ssig.pl”

·         platform dependence: the most crucial point is the usage of “bin/sort64”, which is UNIX sort compiled for 64 bits. 32-bit sorts do not handle temporary files larger than 2GB.

 

Limitations

TinyCC 2.0 merely converts plain text data into the LCC format and computes co-occurrences along the way. Duplicates and ‘dirt’ are not removed. TinyCC was tested on corpora of up to 50 million sentences (750 million words).

 

Acknowledgements

TinyCC 2.0 was developed by Chris Biemann at the University of Leipzig. The component handling sources and performing sentence splitting was developed by Fabian Schmidt. Some fixes for UTF-8 handling were implemented by Matthias Richter. Thanks go to all the testers from the NLP Department, University of Leipzig.

References

[Dunning 1993] Dunning, T. E. (1993): Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics 19(1) http://www.comp.lancs.ac.uk/ucrel/papers/tedstats.pdf

[Moore 2004] Moore, R. C. (2004): On Log-Likelihood-Ratios and the Significance of Rare Events. Proceedings of EMNLP 2004, Barcelona, Spain http://research.microsoft.com/users/bobmoore/rare-events-final-rev.pdf

[Quasthoff et al. 2006] Quasthoff, U., Richter, M. and Biemann, C. (2006): Corpus Portal for Search in Monolingual Corpora. Proceedings of LREC-06, Genoa, Italy QuasthoffBiemannRichter06portal.pdf