Archive for category Text processing

Textual difference detector

comparedifflargeToday I uploaded my textual difference detector to the eDesign examples. This is an example application demonstrating the theory of applying the Levenshtein algorithm to detect differences between two versions of the same text. Also, the ‘Find the differences‘ post is updated with a link to this example.

This example takes two texts as input and outputs one merged text marked with what was deleted and what was added. Take a look and feel free to download the source code. This also inlcludes the Levenshtein algorithm source code.

No Comments

Character entities

Character encodingAs in real life characters that build written language differ from system to system. Ελληνικά characters differ from Русский,  汉语 and Latin characters. Fortunately these character sets have been standardized and called alphabets. The same goes for character sets in the digital world. As computers can only process binary data, all characters are mapped to a number. In the early days such a mapping of the Latin alphabet, along with some other graphical ‘characters’, digits and control characters (e.g. escape, tab, line feed, carriage return) was standardized. This standard is known as  the American Standard Code for Information Interchange (ASCII) and was developed by the American Standards Association (currently: ANSI). This 7-bit encoding lacked digital representations for many characters of e.g. foreign characters (as respectively Greek, Russian and Chinese are mentioned above) but also accents like å, è, ï, ó and û were not represented in the set. But as you can see in this paragraph, improvements have been made to facilitate such ’special’ characters. Read the rest of this entry »

No Comments

Find the differences

Spot differences city pictureComparing files is something developers do every once in a while. For example, comparing configuration files to see what is different in the other environment or compare programming files to see what has changed in the source code. Implementations of text comparison algorithms are therefore widespread and used in several fields. For instance, in blogs and content managements systems, one might need to know what was altered in an update of a text (in cms like systems) or a programmer in a team would like to see what changed in the source code (svn). Also a lot of (combined) search, spell checking, speech recognition and plagiarism detection software compare texts (strings) in a certain way. This article covers the Levenshtein distance algorithm and how to use it to indicate alterations to texts. Read the rest of this entry »

1 Comment

Regular expression tester

O'Reilly Regular Expression OwlFirst one to be back is the simple but very useful regular expression tester. What it does is simply dump the contents of a pattern match and its subpattern matches. Developers might find this a useful tool not only to test their regular expressions, but also to see the way subpatterns are counted to use backreferences.

At the tester links to PCRE documentation (Perl Compatible Regular Expressions) are available. Also I would like to point to a handy cheat sheet on general regular expressions by Dave Child and another cheat sheet on JavaScript regular expressions from Visibone.

This tool is accessible again at regex.edesign.nl and can be downloaded there as wel (use the src link).

No Comments