<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>eDesign.nl &#187; Text processing</title>
	<atom:link href="http://www.edesign.nl/category/text-processing/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.edesign.nl</link>
	<description>Thoughts and concepts on software development</description>
	<lastBuildDate>Wed, 22 May 2013 17:08:34 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Textual difference detector</title>
		<link>http://www.edesign.nl/2009/05/07/textual-difference-detector/</link>
		<comments>http://www.edesign.nl/2009/05/07/textual-difference-detector/#comments</comments>
		<pubDate>Thu, 07 May 2009 15:55:13 +0000</pubDate>
		<dc:creator>Jurgen</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Text processing]]></category>

		<guid isPermaLink="false">http://www.edesign.nl/?p=397</guid>
		<description><![CDATA[Today I uploaded my textual difference detector to the eDesign examples. This is an example application demonstrating the theory of applying the Levenshtein algorithm to detect differences between two versions of the same text. Also, the &#8216;Find the differences&#8217; post is updated with a link to this example.
This example takes two texts as input and [...]]]></description>
			<content:encoded><![CDATA[<p><a rel="attachment wp-att-398" href="http://www.edesign.nl/2009/05/07/textual-difference-detector/comparedifflarge/"><img class="alignleft size-thumbnail wp-image-398" title="comparedifflarge" src="http://www.edesign.nl/wp-content/uploads/2009/05/comparedifflarge-150x150.jpg" alt="comparedifflarge" width="150" height="150" /></a>Today I uploaded my <a href="http://www.edesign.nl/examples/levenshtein/" target="_blank">textual difference detector</a> to the eDesign examples. This is an example application demonstrating the theory of applying the <a href="http://www.edesign.nl/2009/04/12/find-the-differences/" target="_self">Levenshtein algorithm</a> to detect differences between two versions of the same text. Also, the &#8216;<a href="http://www.edesign.nl/2009/04/12/find-the-differences/" target="_self">Find the differences</a>&#8217; post is updated with a link to this example.</p>
<p>This example takes two texts as input and outputs one merged text marked with what was deleted and what was added. <a href="http://www.edesign.nl/examples/levenshtein/" target="_blank">Take a look</a> and feel free to download the <a href="http://www.edesign.nl/examples/levenshtein/levenshtein.zip">source code</a>. This also includes the <a href="http://www.edesign.nl/examples/levenshtein/levenshtein.zip">Levenshtein algorithm source code</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.edesign.nl/2009/05/07/textual-difference-detector/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Character entities</title>
		<link>http://www.edesign.nl/2009/05/04/character-entities/</link>
		<comments>http://www.edesign.nl/2009/05/04/character-entities/#comments</comments>
		<pubDate>Mon, 04 May 2009 09:40:58 +0000</pubDate>
		<dc:creator>Jurgen</dc:creator>
				<category><![CDATA[Character Encoding]]></category>
		<category><![CDATA[Web standards]]></category>

		<guid isPermaLink="false">http://www.edesign.nl/?p=180</guid>
		<description><![CDATA[As in real life, the characters that make up written language differ from system to system. Ελληνικά characters differ from Русский, 汉语 and Latin characters. Fortunately these character sets have been standardized and called alphabets. The same goes for character sets in the digital world. As computers can only process binary data, all characters are mapped to [...]]]></description>
			<content:encoded><![CDATA[<p><a rel="attachment wp-att-317" href="http://www.edesign.nl/2009/05/04/character-entities/codage2/"><img class="alignleft size-medium wp-image-317" title="Character encoding" src="http://www.edesign.nl/wp-content/uploads/2009/05/codage2-300x214.jpg" alt="Character encoding" width="191" height="136" /></a>As in real life, the characters that make up written language differ from system to system. Ελληνικά characters differ from Русский, <span lang="zh-Hans">汉语</span> and Latin characters. Fortunately these character sets have been standardized and called alphabets. The same goes for character sets in the digital world. As computers can only process binary data, all characters are mapped to a number. In the early days such a mapping of the Latin alphabet, along with some other graphical &#8216;characters&#8217;, digits and control characters (e.g. escape, tab, line feed, carriage return) was standardized. This standard is known as the American Standard Code for Information Interchange (<a href="http://en.wikipedia.org/wiki/Ascii" target="_blank">ASCII</a>) and was developed by the American Standards Association (currently: <a href="http://en.wikipedia.org/wiki/American_National_Standards_Institute" target="_blank">ANSI</a>). This 7-bit encoding has no representation for the characters of many other scripts (such as the Greek, Russian and Chinese mentioned above), and accented letters like å, è, ï, ó and û are not in the set either. But as you can see in this paragraph, improvements have been made to facilitate such &#8216;special&#8217; characters.<span id="more-180"></span></p>
<h2>Character sets</h2>
<p>Other character sets have been defined. The International Organization for Standardization came up with the <a href="http://alis.isoc.org/codage/iso8859/jeuxiso.en.htm" target="_blank">ISO-8859</a> series to address this shortcoming, defining several character sets that use 8 bits per character. Microsoft developed its own schemes too, such as <a href="http://en.wikipedia.org/wiki/Windows-1252" target="_blank">cp-1252</a>, as did others (like IBM). Some local institutes also needed to create encodings for the characters of their native language that the standardized sets still lacked. This introduced the problem of multiple interpretations of numbers. What characters are they mapped to? What character set do I need to use to decode 65 for instance? Does <em>65</em> mean <em>A</em> or <em>a</em> or <em>ä</em> or <em>ç</em> or <em>R</em> or&#8230;</p>
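<p>To make the ambiguity concrete, here is a minimal Python sketch (Python and the helper name <em>decode_everywhere</em> are my own choices for illustration, they are not part of the original tools). Note that 65 itself decodes to <em>A</em> in every ASCII-compatible set; the interpretations start to diverge for values above 127, so the sketch uses byte 228.</p>

```python
# One byte value, several code pages: the number alone does not say
# which character is meant.
RAW = bytes([0xE4])  # decimal 228, outside the 7-bit ASCII range

# Python's standard codec names for four 8-bit character sets.
CODE_PAGES = ["latin-1", "iso8859_7", "cp437", "koi8_r"]


def decode_everywhere(raw, code_pages):
    """Return a mapping of code page name to the text `raw` decodes to."""
    return {name: raw.decode(name) for name in code_pages}


if __name__ == "__main__":
    for name, text in decode_everywhere(RAW, CODE_PAGES).items():
        print(f"{name:10} {text}")
    # 65 is the same letter in every ASCII-compatible character set.
    print(bytes([65]).decode("ascii"))
```

<p>The same byte comes out as a Latin, a Greek or a Cyrillic letter depending on the code page the reader assumes.</p>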
<p>With the ISO 8-bit character sets, 256 characters were possible (about 190 characters without control characters, etc). This was sufficient as the sets covered the top ten most used languages. Still, Chinese and Japanese for instance were not covered at all. This is where Unicode, developed by the Unicode Consortium, and its Unicode Transformation Formats (<a href="http://en.wikipedia.org/wiki/Unicode" target="_blank">UTF</a>) come into play. UTF-8 is a multi-byte encoding, which means that per character one, two, three or four 8-bit bytes are used to identify that character. The UTF-8 scheme is nowadays the most commonly used encoding. It is backward compatible with ASCII, and with more than 100,000 characters Unicode is able to represent most of the living languages with a single code. The Unicode standard also contains information on how to convert lowercase characters to uppercase and vice versa, which is not the same or even symmetric for every character, and it has sorting rules. E.g. traditional Spanish knows a single <em>ch</em> character which is sorted between the <em>c</em> and <em>d</em>, and in Greek the uppercase <em>Σ</em> in lowercase is a <em>σ</em>, but if it is the last character of a word the lowercase of <em>Σ</em> is <em>ς</em>.</p>
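<p>A small Python sketch (my own illustration, with a hypothetical helper <em>utf8_length</em>) shows the variable byte length of UTF-8, its compatibility with ASCII and the position-dependent lowercase of the Greek sigma. It assumes Python 3.3 or later, where <em>str.lower</em> applies the final sigma rule.</p>

```python
def utf8_length(char):
    """Number of bytes UTF-8 needs to encode a single character."""
    return len(char.encode("utf-8"))


# A Latin letter, an accented letter, the euro sign, a Chinese character
# and a musical symbol from outside the Basic Multilingual Plane.
SAMPLES = ["A", "\u00f8", "\u20ac", "\u6c49", "\U0001D11E"]

if __name__ == "__main__":
    for char in SAMPLES:
        print(char, utf8_length(char), char.encode("utf-8").hex())
    # Plain ASCII text is byte-for-byte identical in UTF-8.
    print("Hello".encode("utf-8") == "Hello".encode("ascii"))
    # The lowercase of a capital sigma depends on its position in the word.
    print("\u03a3".lower(), "\u039f\u0394\u039f\u03a3".lower())
```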
<h2>Problems or challenges</h2>
<p>When programming web applications, for instance, a programmer often has to work with a database, a file system, a web server and one or several browsers. These systems and the data traffic between them (protocols) need to be configured to use the same character set. If they are not, errors can occur. For instance, if you send &#8220;Hellø world&#8221; from the database in UTF-8 to a browser that interprets the byte stream it is receiving as CP-1252, the string is displayed as &#8220;HellÃ¸ world&#8221;. In the opposite direction, when CP-1252 bytes are read as UTF-8, you have probably seen the <em>€</em> sign appear as <em>�</em>. The solution to this problem is, of course, to keep all character sets the same when different systems (applications) communicate with each other.</p>
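<p>Both failure modes can be reproduced in a few lines. This is a sketch in Python with a hypothetical helper <em>misread</em>; it only simulates what happens when the writer and the reader of a byte stream disagree about the character set.</p>

```python
def misread(text, written_as, read_as):
    """Encode `text` with one character set, then decode the bytes with another."""
    return text.encode(written_as).decode(read_as, errors="replace")


if __name__ == "__main__":
    # UTF-8 bytes read as CP-1252: each multi-byte character turns into
    # two or three unrelated characters.
    print(misread("Hell\u00f8 world", "utf-8", "cp1252"))
    # CP-1252 bytes read as UTF-8: the single byte of the euro sign is not
    # valid UTF-8, so the decoder substitutes the replacement character.
    print(misread("\u20ac", "cp1252", "utf-8"))
    # Same character set on both sides: the text survives unchanged.
    print(misread("Hell\u00f8 world", "utf-8", "utf-8"))
```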
<div id="attachment_350" class="wp-caption alignright" style="width: 308px"><a rel="attachment wp-att-350" href="http://www.edesign.nl/2009/05/04/character-entities/copy-html-entity/"><img class="size-full wp-image-350" title="Copyright sign and it's NCR" src="http://www.edesign.nl/wp-content/uploads/2009/05/copy-html-entity.jpg" alt="Copy sign sgml entity" width="298" height="71" /></a><p class="wp-caption-text">Copyright sign, its Numeric Character Reference and alias</p></div>
<p>A way to prevent this problem from occurring is to follow the HTML standard, which requires characters that are not defined in the character set in use to be converted to their <a href="http://www.w3.org/International/tutorials/tutorial-char-enc/#Slide0430" target="_blank">Numeric Character Reference</a> (<a href="http://www.w3.org/TR/html4/sgml/entities.html" target="_blank">HTML entity</a>). It is little known that this is mandatory for element attribute values as well. Such a reference is written as &amp;#[code]; where [code] is replaced by the number of the code point. For the most used characters an alias is available. E.g. the equivalent for <em>&amp;</em> is <em>&amp;amp;</em>, <em>ë</em> has <em>&amp;euml;</em>, a non-breaking space is denoted by <em>&amp;nbsp;</em> and the copyright sign has equivalent <em>&amp;#169;</em> with alias <em>&amp;copy;</em>. The codes (digits) and aliases themselves only contain characters from the ASCII set, meaning an entire SGML (including HTML, XML, etc) document can be composed of only ASCII characters. PHP has a <a href="http://www.php.net/htmlentities" target="_blank">built in function</a> that searches and converts such characters in an input. Also, a nice tool to <a href="http://leftlogic.com/lounge/articles/entity-lookup/" target="_blank">look up entities</a> is available at LeftLogic.com.</p>
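<p>As an illustration, the same conversion can be sketched in Python (used here as a stand-in; this is not the PHP function linked above, and the two helper names are hypothetical). The standard <em>xmlcharrefreplace</em> error handler writes every character outside ASCII as its numeric character reference, and the <em>html</em> module escapes and resolves references.</p>

```python
import html


def to_ascii_with_references(text):
    """Replace every non-ASCII character with its numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


def escape_markup(text):
    """Escape the characters that are special in HTML markup, quotes included."""
    return html.escape(text, quote=True)


if __name__ == "__main__":
    # The copyright sign becomes reference 169, the e with diaeresis 235.
    print(to_ascii_with_references("\u00a9 2009 Zo\u00eb"))
    # Quotes are escaped too, so the result is safe inside attribute values.
    print(escape_markup('Tom "and" Jerry'))
    # Resolving the reference again gives back the original character.
    print(html.unescape(to_ascii_with_references("\u00a9")) == "\u00a9")
```

<p>The output of the first call contains ASCII characters only, which is exactly the property described above.</p>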
]]></content:encoded>
			<wfw:commentRss>http://www.edesign.nl/2009/05/04/character-entities/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Find the differences</title>
		<link>http://www.edesign.nl/2009/04/12/find-the-differences/</link>
		<comments>http://www.edesign.nl/2009/04/12/find-the-differences/#comments</comments>
		<pubDate>Sun, 12 Apr 2009 10:34:00 +0000</pubDate>
		<dc:creator>Jurgen</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Text processing]]></category>

		<guid isPermaLink="false">http://www.edesign.nl/?p=103</guid>
		<description><![CDATA[Comparing files is something developers do every once in a while. For example, comparing configuration files to see what is different in the other environment or compare programming files to see what has changed in the source code. Implementations of text comparison algorithms are therefore widespread and used in several fields. For instance, in blogs [...]]]></description>
			<content:encoded><![CDATA[<p><a rel="attachment wp-att-167" href="http://www.edesign.nl/2009/04/12/find-the-differences/spot-differences-city-picture/"><img class="alignleft size-thumbnail wp-image-167" src="http://www.edesign.nl/wp-content/uploads/2009/04/spot-differences-city-picture-150x150.jpg" alt="Spot differences city picture" width="150" height="150" /></a>Comparing files is something developers do every once in a while. For example, comparing configuration files to see what is different in the other environment, or comparing source files to see what has changed in the code. Implementations of text comparison algorithms are therefore widespread and used in several fields. For instance, in blogs and content management systems, one might need to know what was altered in an update of a text (in <a href="http://wordpress.org/" target="_blank">CMS-like systems</a>), or a programmer in a team would like to see what changed in the source code (<a href="http://subversion.tigris.org/" target="_blank">svn</a>). Also a lot of (combined) search, spell checking, speech recognition and plagiarism detection software compares texts (strings) in a certain way. This article covers the <a href="http://en.wikipedia.org/wiki/Levenshtein_distance" target="_blank">Levenshtein distance algorithm</a> and how to use it to indicate alterations to texts.<span id="more-103"></span></p>
<p>There are several ways to compare texts and find <a href="http://en.wikipedia.org/wiki/Diff" target="_blank">differences</a> and <a href="http://en.wikipedia.org/wiki/Category:String_similarity_measures" target="_blank">similarity scores</a>. For this article the similarity scores are not relevant because these scores are just numbers. We are interested in what is added, deleted or substituted in the transformation from text <em>A</em> to text <em>B</em>. In other words, we would like to mark the minimal number of primitive operations needed to transform <em>A</em> to <em>B</em>. To do this we&#8217;ll need the basics from the classic computer science problem: the <a href="http://en.wikipedia.org/wiki/Longest_common_subsequence_problem" target="_blank">longest common subsequence problem</a>. The technique described below, which is derived from this problem, is the Levenshtein distance algorithm. The algorithm was developed by <a href="http://www.keldysh.ru/departments/dpt_10/lev.html" target="_blank">Vladimir Levenshtein</a> to replace the <a href="http://en.wikipedia.org/wiki/Hamming_distance">Hamming distance</a>. The result of Levenshtein&#8217;s algorithm is exactly the minimal number of operations, but you can use the intermediate result of this algorithm to determine what parts of the text were added, deleted or substituted. Originally this is done per character, but with a little tweak this can be changed to a per word, line or paragraph level function. After reading this article you will know how this works, both in theory and in the demo that is referred to at the end. This demo takes two texts as input and outputs what was added, deleted and replaced.</p>
<p>For this article and the demo I used <a href="http://www.google.nl/search?q=levenshtein" target="_blank">search results</a> for some inspiration. A nice explanation already out there is this <a href="http://www.merriampark.com/ld.htm" target="_blank">description of the Levenshtein algorithm</a>, as well as the <a href="http://en.wikipedia.org/wiki/Levenshtein_distance" target="_blank">Wiki page on it</a>. For this article, let&#8217;s use two sample lines: &#8220;The brown dog jumped away from the sprinkler&#8221; and &#8220;The dog ran towards the green sprinkler&#8221;. Now we want to know which words were added, deleted or replaced in the transition from the first sentence to the second. To do this, let&#8217;s take a closer look at how the iterative process of the Levenshtein algorithm is executed.</p>
<ol>
<li>The first step is to construct a matrix of <em>n</em>+1 by <em>m</em>+1, where <em>n</em> is the number of words in the first line and <em>m</em> the number of words in the second line.</li>
<li>Secondly, fill the first row (from left to right) with zero to <em>n</em> and the first column (from top to bottom) with zero to <em>m</em>.</li>
<li>Now for each word of the first line, evaluate each word of the second line. If the two words match, the cost is 0, otherwise it&#8217;s 1.</li>
<li>Fill in cell (<em>n</em>, <em>m</em>) with the minimum of:
<ul>
<li>The value of the cell above + 1</li>
<li>The left neighbour cell value + 1</li>
<li>The above left cell value + cost</li>
</ul>
</li>
</ol>
<p>For the Levenshtein distance it stops right here, when the iteration is completed. The distance is the value in the lower right cell. Here we are not interested in the Levenshtein distance itself, but in the matrix we&#8217;ve just constructed. The lowest cost route from bottom right to top left reveals information on what words have been added, deleted and/or substituted. To show how, we need to construct the matrix with the algorithm above.</p>
<table border="0">
<tbody>
<tr>
<th></th>
<th></th>
<th>The</th>
<th>brown</th>
<th>dog</th>
<th>jumped</th>
<th>away</th>
<th>from</th>
<th>the</th>
<th>sprinkler</th>
</tr>
<tr>
<th></th>
<td>0</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
</tr>
<tr>
<th>The</th>
<td>1</td>
<td>*</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<th>dog</th>
<td>2</td>
<td>**</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<th>ran</th>
<td>3</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<th>towards</th>
<td>4</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<th>the</th>
<td>5</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<th>green</th>
<td>6</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<th>sprinkler</th>
<td>7</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
<p>For the first cell (*), the words &#8220;The&#8221; versus &#8220;The&#8221; are equal, so the cost is 0. Now the minimum of the cell above + 1 (2), the cell to the left + 1 (2) and the above left + cost (0) is the latter one.</p>
<p>The cell underneath it (**) has cost 1 (&#8220;The&#8221; versus &#8220;dog&#8221;) and gets a value equal to the minimum of the cell above + 1 (1), the cell to the left + 1 (3) and the above left + cost (2), which is 1.</p>
<p>Continue to fill this out and the table will look like this:</p>
<table border="0">
<tbody>
<tr>
<th></th>
<th></th>
<th>The</th>
<th>brown</th>
<th>dog</th>
<th>jumped</th>
<th>away</th>
<th>from</th>
<th>the</th>
<th>sprinkler</th>
</tr>
<tr>
<th></th>
<td>0</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
</tr>
<tr>
<th>The</th>
<td>1</td>
<td>0</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
</tr>
<tr>
<th>dog</th>
<td>2</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
</tr>
<tr>
<th>ran</th>
<td>3</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
</tr>
<tr>
<th>towards</th>
<td>4</td>
<td>3</td>
<td>3</td>
<td>3</td>
<td>3</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
</tr>
<tr>
<th>the</th>
<td>5</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>5</td>
</tr>
<tr>
<th>green</th>
<td>6</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
</tr>
<tr>
<th>sprinkler</th>
<td>7</td>
<td>6</td>
<td>6</td>
<td>6</td>
<td>6</td>
<td>6</td>
<td>6</td>
<td>6</td>
<td>5</td>
</tr>
</tbody>
</table>
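<p>The filled table above can be reproduced with a short sketch. It is written in Python for illustration (the demo source itself is PHP), the function name <em>build_matrix</em> is my own, and it assumes a substitution cost of 1, which is the cost used in the worked cells above.</p>

```python
def build_matrix(first, second):
    """Levenshtein matrix for two word lists.

    Columns follow the words of `first`, rows follow the words of `second`.
    """
    n, m = len(first), len(second)
    matrix = [[0] * (n + 1) for _ in range(m + 1)]
    # First row counts 0..n, first column counts 0..m.
    for col in range(n + 1):
        matrix[0][col] = col
    for row in range(m + 1):
        matrix[row][0] = row
    for row in range(1, m + 1):
        for col in range(1, n + 1):
            # Equal words cost nothing, differing words cost 1 (assumed).
            cost = 0 if second[row - 1] == first[col - 1] else 1
            matrix[row][col] = min(
                matrix[row - 1][col] + 1,         # cell above
                matrix[row][col - 1] + 1,         # left neighbour
                matrix[row - 1][col - 1] + cost,  # above left
            )
    return matrix


FIRST = "The brown dog jumped away from the sprinkler".split()
SECOND = "The dog ran towards the green sprinkler".split()

if __name__ == "__main__":
    for line in build_matrix(FIRST, SECOND):
        print(" ".join(str(value) for value in line))
```

<p>The comparison is case sensitive, so &#8220;The&#8221; and &#8220;the&#8221; count as different words, just as in the table.</p>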
<p>Now we need to find the lowest cost path from the bottom right to the zero in the top left. To do this, simply jump to the cell with the lowest value adjacent to the current cell (to the left, above or diagonally). Jumping diagonally is only allowed if the words are the same (column and row). If two or more have the same (lower) value, the priority of choosing a route is to try diagonally first, then either left or above. So, from the bottom right 5 we start the route to the diagonally adjacent 5 (because &#8216;sprinkler&#8217; equals &#8216;sprinkler&#8217;). From that 5 the next step would be the lower 4 above it, then the diagonally adjacent 4, etc&#8230; The route table will look like this:</p>
<table border="0">
<tbody>
<tr>
<th></th>
<th></th>
<th>The</th>
<th>brown</th>
<th>dog</th>
<th>jumped</th>
<th>away</th>
<th>from</th>
<th>the</th>
<th>sprinkler</th>
</tr>
<tr>
<th></th>
<td><span style="color: #ff8401;">0</span></td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
</tr>
<tr>
<th>The</th>
<td>1</td>
<td><span style="color: #ff8401;">0</span></td>
<td><span style="color: #ff8401;">1</span></td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
</tr>
<tr>
<th>dog</th>
<td>2</td>
<td>1</td>
<td>1</td>
<td><span style="color: #ff8401;">1</span></td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
</tr>
<tr>
<th>ran</th>
<td>3</td>
<td>2</td>
<td>2</td>
<td><span style="color: #ff8401;">2</span></td>
<td><span style="color: #ff8401;">2</span></td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
</tr>
<tr>
<th>towards</th>
<td>4</td>
<td>3</td>
<td>3</td>
<td>3</td>
<td><span style="color: #ff8401;">3</span></td>
<td><span style="color: #ff8401;">3</span></td>
<td><span style="color: #ff8401;">4</span></td>
<td>5</td>
<td>6</td>
</tr>
<tr>
<th>the</th>
<td>5</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td><span style="color: #ff8401;">4</span></td>
<td>5</td>
</tr>
<tr>
<th>green</th>
<td>6</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td><span style="color: #ff8401;">5</span></td>
<td>5</td>
</tr>
<tr>
<th>sprinkler</th>
<td>7</td>
<td>6</td>
<td>6</td>
<td>6</td>
<td>6</td>
<td>6</td>
<td>6</td>
<td>6</td>
<td><span style="color: #ff8401;">5</span></td>
</tr>
</tbody>
</table>
<p>After the route is calculated, every step in it tells something about the operations needed (from top left to bottom right).</p>
<ul>
<li>Every diagonal step (not increasing the score) tells us nothing happened. E.g. the first step from 0 to 0 tells us &#8220;The&#8221; from the first line stays &#8220;The&#8221; in the second line.</li>
<li>Every horizontal step means a word is deleted. E.g. the step from 0 to 1 tells us &#8220;brown&#8221; was deleted at this point.</li>
<li>Every vertical step means a word is added. E.g. the step from 4 to 5 tells us &#8220;green&#8221; was added at this point.</li>
<li>Every diagonal step that increases the score means a word is substituted (added and deleted) at this point. E.g. the step from 2 to 3 tells us &#8220;away&#8221; is substituted by &#8220;ran&#8221;. This is an illegal operation in the detection of addition and deletion of words, which is why the route above only moves diagonally when the words are equal.</li>
</ul>
<p>This way a text indicating the operations can be constructed:</p>
<p>The <span style="color: #ff0000;">brown</span> dog <span style="color: #ff0000;">jumped</span> <span style="color: #008000;">ran</span><span style="color: #ff0000;">away</span> <span style="color: #008000;">towards</span><span style="color: #ff0000;">from</span> the <span style="color: #008000;">green</span> sprinkler.</p>
<p>Red indicates deletion, green indicates insertion and a red and green pair indicates substitution.</p>
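<p>The route finding can be sketched as follows, again in Python and under assumptions of my own: the names <em>build_matrix</em> and <em>trace_operations</em> are hypothetical, the substitution cost is 1, and a tie between left and above is resolved by going left, which reproduces the highlighted route. The demo implementation may differ in these details.</p>

```python
def build_matrix(first, second):
    """Levenshtein matrix; columns follow `first`, rows follow `second`."""
    n, m = len(first), len(second)
    matrix = [[0] * (n + 1) for _ in range(m + 1)]
    matrix[0] = list(range(n + 1))
    for row in range(1, m + 1):
        matrix[row][0] = row
        for col in range(1, n + 1):
            cost = 0 if second[row - 1] == first[col - 1] else 1
            matrix[row][col] = min(matrix[row - 1][col] + 1,
                                   matrix[row][col - 1] + 1,
                                   matrix[row - 1][col - 1] + cost)
    return matrix


def trace_operations(first, second):
    """Walk from bottom right to top left and list the steps in reading order."""
    matrix = build_matrix(first, second)
    row, col = len(second), len(first)
    operations = []
    while row > 0 or col > 0:
        if row > 0 and col > 0 and second[row - 1] == first[col - 1]:
            # Diagonal step, only taken when both words are equal.
            operations.append(("keep", first[col - 1]))
            row, col = row - 1, col - 1
        elif row == 0 or (col > 0 and matrix[row - 1][col] >= matrix[row][col - 1]):
            # Horizontal step: the word of the first line is deleted.
            operations.append(("delete", first[col - 1]))
            col -= 1
        else:
            # Vertical step: the word of the second line is added.
            operations.append(("add", second[row - 1]))
            row -= 1
    operations.reverse()
    return operations


if __name__ == "__main__":
    first = "The brown dog jumped away from the sprinkler".split()
    second = "The dog ran towards the green sprinkler".split()
    for kind, word in trace_operations(first, second):
        print(kind, word)
```

<p>Because diagonal steps are restricted to equal words, the result only contains keep, delete and add steps; substitutions appear as neighbouring delete and add steps.</p>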
<p>Of course some optimizations can be performed. The above does give a good indication of what happened to the text. Imagine a larger text than just these lines and the relevance of marking changes this way becomes more obvious. But because only the primitive operations are detected at the word level, word groups are not taken into account. In this example, for instance, the algorithm would be better if it marked &#8220;jumped away from&#8221; as replaced by &#8220;ran towards&#8221; instead of marking each separate word as it does now:</p>
<p>The <span style="color: #ff0000;">brown</span> dog <span style="color: #008000;">ran towards</span><span style="color: #ff0000;">jumped away from</span> the <span style="color: #008000;">green</span> sprinkler.</p>
<p>This operation is not that hard to implement: simply replace consecutive differing operations by a single substitute operation.</p>
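<p>One possible way to do this grouping is sketched below in Python. The function name <em>group_operations</em> and the tuple layout (kind, deleted words, added words) are assumptions for illustration, not the patch used in the demo. The input is the list of steps read from the route of the worked example.</p>

```python
from itertools import groupby

# Steps of the worked example, in reading order.
OPERATIONS = [
    ("keep", "The"), ("delete", "brown"), ("keep", "dog"),
    ("add", "ran"), ("delete", "jumped"), ("add", "towards"),
    ("delete", "away"), ("delete", "from"), ("keep", "the"),
    ("add", "green"), ("keep", "sprinkler"),
]


def group_operations(operations):
    """Merge every run of consecutive add/delete steps into one operation."""
    grouped = []
    for changed, run in groupby(operations, key=lambda step: step[0] != "keep"):
        steps = list(run)
        deleted = " ".join(w for kind, w in steps if kind in ("keep", "delete"))
        added = " ".join(w for kind, w in steps if kind in ("keep", "add"))
        if not changed:
            kind = "keep"
        elif deleted and added:
            # The run both removes and inserts words: one substitution.
            kind = "substitute"
        else:
            kind = "delete" if deleted else "add"
        grouped.append((kind, deleted, added))
    return grouped


if __name__ == "__main__":
    for kind, deleted, added in group_operations(OPERATIONS):
        print(kind, repr(deleted), repr(added))
```

<p>A run that only deletes or only adds words stays a plain deletion or addition, so &#8220;brown&#8221; and &#8220;green&#8221; are still marked on their own.</p>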
<p>An <a href="http://www.edesign.nl/examples/levenshtein/" target="_blank">implementation of this algorithm</a>, with the optimization patch suggested here, is <a href="http://www.edesign.nl/examples/levenshtein/" target="_blank">now available as an example</a>. Source code (PHP) is available as well.</p>
<p>And about the featured picture on top: there are <a href="http://www.smart-kit.com/s749/birds-eye-view-can-you-spot-all-12-differences/" target="_blank">12 differences to spot</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.edesign.nl/2009/04/12/find-the-differences/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Regular expression tester</title>
		<link>http://www.edesign.nl/2009/03/26/regular-expression-tester/</link>
		<comments>http://www.edesign.nl/2009/03/26/regular-expression-tester/#comments</comments>
		<pubDate>Thu, 26 Mar 2009 02:41:36 +0000</pubDate>
		<dc:creator>Jurgen</dc:creator>
				<category><![CDATA[Regular Expressions]]></category>
		<category><![CDATA[Text processing]]></category>

		<guid isPermaLink="false">http://www.edesign.nl/?p=36</guid>
		<description><![CDATA[First one to be back is the simple but very useful regular expression tester. What it does is simply dump the contents of a pattern match and its subpattern matches. Developers might find this a useful tool not only to test their regular expressions, but also to see the way subpatterns are counted to use [...]]]></description>
			<content:encoded><![CDATA[<p><a rel="attachment wp-att-44" href="http://www.edesign.nl/2009/03/26/regular-expression-tester/hipowl/"><img class="alignleft size-medium wp-image-44" title="O'Reilly Regular Expression Owl" src="http://www.edesign.nl/wp-content/uploads/2009/03/hipowl-234x300.png" alt="O'Reilly Regular Expression Owl" width="140" height="180" /></a>First one to be back is the simple but very useful <a href="http://search.oreilly.com/?q=regular+expressions">regular expression</a> tester. What it does is simply dump the contents of a pattern match and its subpattern matches. Developers might find this a useful tool not only to test their regular expressions, but also to see the way subpatterns are counted to use backreferences.</p>
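<p>To give an idea of what such a dump looks like, here is a minimal sketch using Python&#8217;s <em>re</em> module. This is not the tester&#8217;s own code, Python&#8217;s syntax is close to PCRE but not identical, and the helper name <em>dump_match</em> is hypothetical. Subpatterns are numbered by the position of their opening parenthesis, which is the number to use in a backreference.</p>

```python
import re


def dump_match(pattern, subject):
    """Return the full match followed by every subpattern match, in order."""
    match = re.search(pattern, subject)
    if match is None:
        return []
    return [match.group(0), *match.groups()]


if __name__ == "__main__":
    # Group 1 wraps groups 2 and 3; group 4 follows after them.
    pattern = r"((\d{4})-(\d{2}))-(\d{2})"
    for index, text in enumerate(dump_match(pattern, "Posted on 2009-03-26")):
        print(index, text)
    # Backreference \1 matches whatever group 1 captured: a doubled word.
    print(re.sub(r"\b(\w+) \1\b", r"\1", "the the owl"))
```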
<p>The tester provides links to the <a href="http://www.php.net/manual/en/pcre.pattern.php" target="_blank">PCRE documentation</a> (Perl Compatible Regular Expressions). I would also like to point to a handy <a href="http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/" target="_blank">cheat sheet</a> on general regular expressions by <a href="http://www.addedbytes.com/" target="_blank">Dave Child</a> and another <a href="http://www.visibone.com/products/bbk14-15_425.html" target="_blank">cheat sheet</a> on JavaScript regular expressions from <a href="http://www.visibone.com/" target="_blank">Visibone</a>.</p>
<p>This tool is accessible again at <a href="http://regex.edesign.nl/" target="_blank">regex.edesign.nl</a> and can be downloaded there as well (use the <a href="http://regex.edesign.nl/?show_source" target="_blank">src</a> link).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.edesign.nl/2009/03/26/regular-expression-tester/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
