ASPN ActiveState Programmer Network  
ActiveState, a division of Sophos
/ Home / Perl / PHP / Python / Tcl / XSLT /
/ Safari / My ASPN /
Cookbooks | Documentation | Mailing Lists | Modules | News Feeds | Products | User Groups
Submit Recipe
My Recipes

All Recipes
All Cookbooks


View by Category

Title: Language detection using character trigrams
Submitter: Douglas Bagnall (other recipes)
Last Updated: 2004/11/07
Version no: 1.1
Category: Algorithms

 

4 stars 2 vote(s)


Description:

The Trigram class can be used to compare blocks of text based on their local structure, which is a good indicator of the language used. It could also be used within a language to discover and compare the characteristic footprints of various registers or authors. As all n-gram implementations should, it has a method to make up nonsense words.

Source: Text Source

#!/usr/bin/python

import random
from urllib import urlopen

class Trigram:
    """From one or more text files, the frequency of three character
    sequences is calculated.  When treated as a vector, this information
    can be compared to other trigrams, and the difference between them
    seen as an angle.  The cosine of this angle varies between 1 for
    complete similarity, and 0 for utter difference.  Since letter
    combinations are characteristic to a language, this can be used to
    determine the language of a body of text. For example:

        >>> reference_en = Trigram('/path/to/reference/text/english')
        >>> reference_de = Trigram('/path/to/reference/text/german')
        >>> unknown = Trigram('url://pointing/to/unknown/text')
        >>> unknown.similarity(reference_de)
        0.4
        >>> unknown.similarity(reference_en)
        0.95

    would indicate the unknown text is almost cetrtainly English.  As
    syntax sugar, the minus sign is overloaded to return the difference
    between texts, so the above objects would give you:

        >>> unknown - reference_de
        0.6
        >>> reference_en - unknown    # order doesn't matter.
        0.05

    As it stands, the Trigram ignores character set information, which
    means you can only accurately compare within a single encoding
    (iso-8859-1 in the examples).  A more complete implementation might
    convert to unicode first.

    As an extra bonus, there is a method to make up nonsense words in the
    style of the Trigram's text.

        >>> reference_en.makeWords(30)
        My withillonquiver and ald, by now wittlectionsurper, may sequia,
        tory, I ad my notter. Marriusbabilly She lady for rachalle spen
        hat knong al elf

    Beware when using urls: HTML won't be parsed out.

    Most methods chatter away to standard output, to let you know they're
    still there.
    """
    length = 0

    def __init__(self, fn=None):
        self.lut = {}
        if fn is not None:
            self.parseFile(fn)

    def parseFile(self, fn):
        pair = '  '
        if '://' in fn:
            print "trying to fetch url, may take time..."
            f = urlopen(fn)
        else:
            f = open(fn)
        for z, line in enumerate(f):
            if not z % 1000:
                print "line %s" % z
            # \n's are spurious in a prose context
            for letter in line.strip() + ' ':
                d = self.lut.setdefault(pair, {})
                d[letter] = d.get(letter, 0) + 1
                pair = pair[1] + letter
        f.close()
        self.measure()


    def measure(self):
        """calculates the scalar length of the trigram vector and
        stores it in self.length."""
        total = 0
        for y in self.lut.values():
            total += sum([ x * x for x in y.values() ])
        self.length = total ** 0.5

    def similarity(self, other):
        """returns a number between 0 and 1 indicating similarity.
        1 means an identical ratio of trigrams;
        0 means no trigrams in common.
        """
        if not isinstance(other, Trigram):
            raise TypeError("can't compare Trigram with non-Trigram")
        lut1 = self.lut
        lut2 = other.lut
        total = 0
        for k in lut1.keys():
            if k in lut2:
                a = lut1[k]
                b = lut2[k]
                for x in a:
                    if x in b:
                        total += a[x] * b[x]

        return float(total) / (self.length * other.length)

    def __sub__(self, other):
        """indicates difference between trigram sets; 1 is entirely
        different, 0 is entirely the same."""
        return 1 - self.similarity(other)


    def makeWords(self, count):
        """returns a string of made-up words based on the known text."""
        text = []
        k = '  '
        while count:
            n = self.likely(k)
            text.append(n)
            k = k[1] + n
            if n in ' \t':
                count -= 1
        return ''.join(text)


    def likely(self, k):
        """Returns a character likely to follow the given string
        two character string, or a space if nothing is found."""
        if k not in self.lut:
            return ' '
        # if you were using this a lot, caching would a good idea.
        letters = []
        for k, v in self.lut[k].items():
            letters.append(k * v)
        letters = ''.join(letters)
        return random.choice(letters)


def test():
    en = Trigram('http://gutenberg.net/dirs/etext97/lsusn11.txt')
   #NB fr and some others have English license text.
    #   no has english excerpts.
    fr = Trigram('http://gutenberg.net/dirs/etext03/candi10.txt')
    fi = Trigram('http://gutenberg.net/dirs/1/0/4/9/10492/10492-8.txt')
    no = Trigram('http://gutenberg.net/dirs/1/2/8/4/12844/12844-8.txt')
    se = Trigram('http://gutenberg.net/dirs/1/0/1/1/10117/10117-8.txt')
    no2 = Trigram('http://gutenberg.net/dirs/1/3/0/4/13041/13041-8.txt')
    en2 = Trigram('http://gutenberg.net/dirs/etext05/cfgsh10.txt')
    fr2 = Trigram('http://gutenberg.net/dirs/1/3/7/0/13704/13704-8.txt')
    print "calculating difference:"
    print "en - fr is %s" % (en - fr)
    print "fr - en is %s" % (fr - en)
    print "en - en2 is %s" % (en - en2)
    print "en - fr2 is %s" % (en - fr2)
    print "fr - en2 is %s" % (fr - en2)
    print "fr - fr2 is %s" % (fr - fr2)
    print "fr2 - en2 is %s" % (fr2 - en2)
    print "fi - fr  is %s" % (fi - fr)
    print "fi - en  is %s" % (fi - en)
    print "fi - se  is %s" % (fi - se)
    print "no - se  is %s" % (no - se)
    print "en - no  is %s" % (en - no)
    print "no - no2  is %s" % (no - no2)
    print "se - no2  is %s" % (se - no2)
    print "en - no2  is %s" % (en - no2)
    print "fr - no2  is %s" % (fr - no2)

    print "\nmaking up English"
    print en.makeWords(30)
    print "\nmaking up French"
    print fr.makeWords(30)


if __name__ == '__main__':
    test()

Discussion:

For some reason I am on the W3C's www-international mailing list,
where I read this message:

http://lists.w3.org/Archives/Public/www-international/2004OctDec/0062.html

mentioning that people use n-grams to guess languages. That is to say, you
look at the micro-structure of a block of text, and count how many times
sequences of length n occur. If you count pairs it is called a 2-gram (or
bigram), and so with any value of n. I have used a 3-gram, or trigram.

I combined this with a vector search as described by Maciej Ceglowski in his
famous O'Reilly article:

http://www.perl.com/pub/a/2003/02/19/engine.html

It would be quicker, simpler, and more memory efficient to use a bigram, for
perhaps no worse results. If speed really mattered, it might also be
tempting to use an array.array of ints indexed by the characters' ordinal
numbers, rather than nested dictionaries.

On the other hand, converting everything to unicode would make it slower
and more complicated (because you have to be sure of the source material
encoding), but would of course be more useful.

The greatest improvement is probably to be found in integrating ad-hoc
short-cuts (akin to search engine stopwords), or hybridising with other
techniques.

Related links:

how the pros do it (bigram combined with chararacter distribution and
encoding analysis):

http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

Maciej Ceglowski (of the O'Reilly article above) uses what seems to be
a cleverly augmented trigram method.

Maciej Ceglowski's Language Guesser:

http://languid.cantbedone.org/

GPL source (perl):

http://www.idlewords.com/2004/11/source_code_for_language_guesser.htm


Wikipedia recommend removing the spaces first:

http://en.wikipedia.org/wiki/N-gram



Add comment

No comments.



Highest rated recipes:

1. A simple XML-RPC server

2. Web service accessible ...

3. Treat the Win32 Registry ...

4. Watching a directory ...

5. Union Find data structure

6. Function Decorators by ...

7. MS SQL Server log monitor

8. Table objects with ...

9. wx twisted support using ...

10. More accurate sum




Privacy Policy | Email Opt-out | Feedback | Syndication
© 2006 ActiveState Software Inc. All rights reserved.