Source for file Query.php
* ----------------------------------------------------------------------
* Copyright (c) 2006-2016 Khaled Al-Sham'aa.
* ----------------------------------------------------------------------
* This program is open source product; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public License (LGPL)
* as published by the Free Software Foundation; either version 3
* of the License, or (at your option) any later version.
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU Lesser General Public License for more details.
* You should have received a copy of the GNU Lesser General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/lgpl.txt>.
* ----------------------------------------------------------------------
* Class Name: Arabic Query Class
* Original Author(s): Khaled Al-Sham'aa <khaled@ar-php.org>
* Purpose: Build WHERE condition for SQL statement using MySQL REGEXP and
* Arabic lexical rules
* ----------------------------------------------------------------------
* PHP class that builds a WHERE condition for SQL statements using MySQL REGEXP
* and Arabic lexical rules.
* With the exception of the Qur'an and pedagogical texts, Arabic is generally
* written without vowels or other graphic symbols that indicate how a word is
* pronounced. The reader is expected to fill these in from context. Some of the
* graphic symbols include sukuun, which is placed over a consonant to indicate that
* it is not followed by a vowel; shadda, written over a consonant to indicate it is
* doubled; and hamza, the sign of the glottal stop, which can be written above or
* below (alif) at the beginning of a word, or on (alif), (waaw), (yaa'),
* or by itself on the line elsewhere. Also, common spelling differences regularly
* appear, including the use of (haa') for (taa' marbuuta) and (alif maqsuura)
* for (yaa'). These features of written Arabic, which are also seen in Hebrew as
* well as other languages written with Arabic script (such as Farsi, Pashto, and
* Urdu), make analyzing and searching texts quite challenging. In addition, Arabic
* morphology and grammar are quite rich and present some unique issues for
* information retrieval applications.
* There are essentially three ways to search an Arabic text with Arabic queries:
* literal, stem-based or root-based.
* A literal search, the simplest search and retrieval method, matches documents
* based on the search terms exactly as the user entered them. The advantage of this
* technique is that the documents returned will without a doubt contain the exact
* term for which the user is looking. But this advantage is also the biggest
* disadvantage: many, if not most, of the documents containing the terms in
* different forms will be missed. Given the many ambiguities of written Arabic, the
* success rate of this method is quite low. For example, if the user searches
* for (kitaab, book), he or she will not find documents that only
* contain (`al-kitaabu, the book).
* Stem-based searching, a more complicated method, requires some normalization of
* the original texts and the queries. This is done by removing the vowel signs,
* unifying the hamza forms and removing or standardizing the other signs.
* Additionally, grammatical affixes and other constructions which attach directly
* to words, such as conjunctions, prepositions, and the definite article, should be
* identified and removed. Finally, regular and irregular plural forms need to be
* identified and reduced to their singular forms. Performing this type of stemming
* leads to more successful searches, but can be problematic due to over-generation
* or incorrect generation of stems.
* A third method for searching Arabic texts is to index and search for the root
* forms of each word. Since most verbs and nouns in Arabic are derived from
* triliteral (or, rarely, quadriliteral) roots, identifying the underlying root of
* each word theoretically retrieves most of the documents containing a given search
* term regardless of form. However, there are some significant challenges with this
* approach. Determining the root for a given word is extremely difficult, since it
* requires a detailed morphological, syntactic and semantic analysis of the text to
* fully disambiguate the root forms. The issue is complicated further by the fact
* that not all words are derived from roots. For example, loan words (words
* borrowed from another language) are not based on root forms, although there are
* even exceptions to this rule. For example, some loans that have a structure
* similar to triliteral roots, such as the English word film, are handled
* grammatically as if they were root-based, adding to the complexity of this type
* of search. Finally, the root can serve as the foundation for a wide variety of
* words with related meanings. The root (k-t-b) is used for many words related
* to writing, including (kataba, to write), (kitaab, book), (maktab,
* office), and (kaatib, author). But the same root is also used for regiment/
* battalion, (katiiba). As a result, searching based on root forms results in
* very high recall, but precision is usually quite low.
* While search and retrieval of Arabic text will never be an easy task, relying on
* linguistic analysis tools and methods can help make the process more successful.
* Ultimately, the search method you choose should depend on how critical it is to
* retrieve every conceivable instance of a word or phrase and the resources you
* have to process search returns in order to determine their true relevance.
* Source: Volume 13 Issue 7 of MultiLingual Computing &
* Technology published by MultiLingual Computing, Inc., 319 North First Ave.,
* Sandpoint, Idaho, USA, 208-263-8178, Fax: 208-263-6310.
* include('./I18N/Arabic.php');
* $obj = new I18N_Arabic('Query');
* try {
*     // $dbname, $dbuser and $dbpwd are assumed to be defined by the caller
*     $dbh = new PDO('mysql:host=localhost;dbname='.$dbname, $dbuser, $dbpwd);
*     // Set the error reporting attribute
*     $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
*     $dbh->exec("SET NAMES 'utf8'");
*     if (isset($_GET['keyword']) && $_GET['keyword'] != '') {
*         $keyword = $_GET['keyword'];
*         $keyword = str_replace('\"', '"', $keyword);
*         $obj->setStrFields('headline');
*         $obj->setMode($_GET['mode']);
*         $strCondition = $obj->getWhereCondition($keyword);
*     } else {
*         // No keyword given: match every row (assumed fallback)
*         $strCondition = '1';
*     }
*     $StrSQL = "SELECT `headline` FROM `aljazeera` WHERE $strCondition";
*     // Row background colour (assumed; its definition is not shown here)
*     $bg = '#ffffff';
*     foreach ($dbh->query($StrSQL) as $row) {
*         $headline = $row['headline'];
*         echo "<tr bgcolor=\"$bg\"><td>$headline</td></tr>";
*     }
*     // Close the database connection
*     $dbh = null;
* } catch (PDOException $e) {
*     echo $e->getMessage();
* }
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
* @copyright 2006-2016 Khaled Al-Sham'aa
* @license LGPL <http://www.gnu.org/licenses/lgpl.txt>
* @link http://www.ar-php.org
* This PHP class builds a WHERE condition for SQL statements using MySQL REGEXP
* and Arabic lexical rules
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
* @copyright 2006-2016 Khaled Al-Sham'aa
* @license LGPL <http://www.gnu.org/licenses/lgpl.txt>
* @link http://www.ar-php.org
private $_fields = array();
private $_lexPatterns = array();
private $_lexReplacements = array();
* Loads initial values
foreach ($xml->xpath("//preg_replace[@function='__construct']/pair")
array_push($this->_lexPatterns, (string) $pair->search);
array_push($this->_lexReplacements, (string) $pair->replace);
* Set the value of the $_fields array
* @param array $arrConfig Names of the fields that the SQL statement will
* search (as an array whose items are the field names)
* @return object $this to build a fluent interface
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
$this->_fields = $arrConfig;
* Set the value of the $_fields array
* @param string $strConfig Names of the fields that the SQL statement will
* search (as a comma-delimited string)
* @return object $this to build a fluent interface
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
$this->_fields = explode(',', $strConfig);
* Set the $mode property, which selects the search mode
* [0 for OR logic | 1 for AND logic]
* @param integer $mode Value to store in the $mode property
* @return object $this to build a fluent interface
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
// Set search mode [0 for OR logic | 1 for AND logic]
* Get the $mode property, which selects the search mode
* [0 for OR logic | 1 for AND logic]
* @return integer Value of the $mode property
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
// Get search mode value [0 for OR logic | 1 for AND logic]
* Get the values of the $_fields array in array format
* @return array Values of the $_fields array in array format
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
$fields = $this->_fields;
* Get the values of the $_fields array in string format (comma-delimited)
* @return string Values of the $_fields array in string format (comma-delimited)
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
$fields = implode(',', $this->_fields);
* Build the WHERE section of the SQL statement using the defined lexical rules
* and the search mode [AND | OR]; phrases (enclosed in "") are matched as-is
* using a normal LIKE condition.
* @param string $arg String that user search for in the database table
* @return string The WHERE section in SQL statement
* (MySQL database engine format)
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
//$arg = mysql_real_escape_string($arg);
$search = array("\\", "\x00", "\n", "\r", "'", '"', "\x1a");
$replace = array("\\\\","\\0","\\n", "\\r", "\'", '\"', "\\Z");
// Check whether $arg contains quoted phrases that should be matched as-is
if (count($phrase) > 2) {
// (It will contain the rest of $arg except phrases).
for ($i = 0; $i < count($phrase); $i++ ) {
$subPhrase = $phrase[$i];
if ($i % 2 == 0 && $subPhrase != '') {
// Re-build the $arg variable after extracting the phrases
} elseif ($i % 2 == 1 && $subPhrase != '') {
// Handle phrases using regular LIKE matching in MySQL
$this->wordCondition[] = $this->getWordLike($subPhrase);
// Handle the rest of $arg using the lexical rules and regular expressions
foreach ($words as $word) {
//if (is_numeric($word) || strlen($word) > 2) {
// Take off all the punctuation
//$word = preg_replace("/\p{P}/", '', $word);
$exclude = array('(', ')', '[', ']', '{', '}', ',', ';', ':',
'?', '!', '،', '؛', '؟');
if (!empty($this->wordCondition)) {
if ($this->_mode == 0) {
$sql = '(' . implode(') OR (', $this->wordCondition) . ')';
} elseif ($this->_mode == 1) {
$sql = '(' . implode(') AND (', $this->wordCondition) . ')';
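A minimal sketch of how quoted phrases and loose words might be separated and how the per-term conditions might then be joined by mode. The helpers `splitPhrases` and `joinConditions` are hypothetical and are not part of this class; only the even/odd index logic and the `0 = OR`, `1 = AND` convention come from the code above. Unlike the class, which only enters its phrase branch when `count($phrase) > 2`, this sketch splits on every double quote.

```php
<?php
// Hypothetical helper: split a query on double quotes. After explode('"', ...)
// the odd-indexed pieces are quoted phrases and the even-indexed pieces are
// the loose text between them.
function splitPhrases(string $arg): array
{
    $phrases = [];
    $rest    = '';
    foreach (explode('"', $arg) as $i => $piece) {
        if ($piece === '') {
            continue;
        }
        if ($i % 2 === 1) {
            $phrases[] = $piece;   // matched as-is (LIKE)
        } else {
            $rest .= $piece;       // handled word by word (REGEXP)
        }
    }
    $words = preg_split('/\s+/u', trim($rest), -1, PREG_SPLIT_NO_EMPTY);
    return ['phrases' => $phrases, 'words' => $words];
}

// Hypothetical helper: join per-term conditions; mode 0 is OR, mode 1 is AND.
function joinConditions(array $conditions, int $mode): string
{
    if (empty($conditions)) {
        return '';
    }
    $glue = ($mode === 1) ? ') AND (' : ') OR (';
    return '(' . implode($glue, $conditions) . ')';
}

$parts = splitPhrases('foo "bar baz" qux');
$where = joinConditions(['a', 'b'], 1);
```

Here `$parts['phrases']` holds the quoted phrase and `$parts['words']` the remaining words, while `$where` shows the AND form of the joined condition.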
* Search condition in SQL format for one word in all defined fields using
* a REGEXP clause and the lexical rules
* @param string $arg String (one word) that you want to build a condition for
* @return string sub SQL condition (for internal use)
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
//$sql = implode(" REGEXP '$arg' OR ", $this->_fields) . " REGEXP '$arg'";
$sql = ' REPLACE(' .
implode(", 'ـ', '') REGEXP '$arg' OR REPLACE(", $this->_fields) .
", 'ـ', '') REGEXP '$arg'";
* Search condition in SQL format for one word in all defined fields using
* a normal LIKE clause
* @param string $arg String (one word) that you want to build a condition for
* @return string sub SQL condition (for internal use)
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
$sql = implode(" LIKE '$arg' OR ", $this->_fields) . " LIKE '$arg'";
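The two per-word builders above follow the same shape: one condition per field, joined with OR. The sketch below restates that shape with explicit loops. The function names `regexpCondition` and `likeCondition` are hypothetical, and the sketch assumes `$pattern` and `$phrase` have already been escaped; in new code, bound parameters would usually be a safer choice than string interpolation.

```php
<?php
// Hypothetical helper: wrap each field in REPLACE(field, tatweel, '') so the
// tatweel (kashida, U+0640) elongation character is ignored, then match the
// result against the regular expression.
function regexpCondition(array $fields, string $pattern): string
{
    $tatweel = "\u{0640}";
    $parts   = [];
    foreach ($fields as $field) {
        $parts[] = "REPLACE($field, '$tatweel', '') REGEXP '$pattern'";
    }
    return implode(' OR ', $parts);
}

// Hypothetical helper: plain LIKE match of a phrase against each field.
function likeCondition(array $fields, string $phrase): string
{
    $parts = [];
    foreach ($fields as $field) {
        $parts[] = "$field LIKE '$phrase'";
    }
    return implode(' OR ', $parts);
}

$regexpSql = regexpCondition(['headline', 'body'], 'x');
$likeSql   = likeCondition(['headline', 'body'], 'x y');
```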
* Get more relevant order by section related to the user search keywords
* @param string $arg String that user search for in the database table
* @return string sub SQL ORDER BY section
* @author Saleh AlMatrafe <saleh@saleh.cc>
// Check whether $arg contains quoted phrases that should be matched as-is
if (count($phrase) > 2) {
// (It will contain the rest of $arg except phrases).
for ($i = 0; $i < count($phrase); $i++ ) {
if ($i % 2 == 0 && $phrase[$i] != '') {
// Re-build the $arg variable after extracting the phrases
} elseif ($i % 2 == 1 && $phrase[$i] != '') {
// Handle phrases using regular LIKE matching in MySQL
// Handle the rest of $arg using the lexical rules and regular expressions
foreach ($words as $word) {
$wordOrder[] = 'CASE WHEN ' .
$order = '((' . implode(') + (', $wordOrder) . ')) DESC';
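The ordering expression above sums one score per search term, so rows matching more terms sort first. A minimal sketch of that idea follows. The helper name `relevanceOrder` is hypothetical, and the `THEN 1 ELSE 0 END` tail of each `CASE` is an assumption, since the listing shows only the start of that expression.

```php
<?php
// Hypothetical helper: turn a list of SQL conditions into an ORDER BY
// expression that counts how many of them each row satisfies.
function relevanceOrder(array $conditions): string
{
    if (empty($conditions)) {
        return '';
    }
    $scores = [];
    foreach ($conditions as $condition) {
        // Assumed scoring: 1 point per matching condition
        $scores[] = "CASE WHEN $condition THEN 1 ELSE 0 END";
    }
    return '((' . implode(') + (', $scores) . ')) DESC';
}

$order = relevanceOrder(["headline LIKE 'a'", "headline LIKE 'b'"]);
```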
* This method applies regular expression rules based on
* pre-defined Arabic lexical rules
* @param string $arg String of one word user want to search for
* @return string Regular Expression format to be used in MySQL query statement
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
protected function lex($arg)
$arg = preg_replace($this->_lexPatterns, $this->_lexReplacements, $arg);
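To illustrate how parallel pattern and replacement arrays can expand a word into a regular expression, here is a small sketch. The two rules are hypothetical and chosen only to mirror the spelling variations described in the header (hamza forms on alif, and haa' written for taa' marbuuta); the real rules are loaded from an XML file in the constructor and are not shown in this listing.

```php
<?php
// Hypothetical rules, for illustration only.
$patterns = [
    '/[اأإآ]/u',   // any alif, with or without hamza or madda
    '/[ةه]$/u',    // word-final taa' marbuuta or haa'
];
$replacements = [
    '(ا|أ|إ|آ)',
    '(ة|ه)',
];

function lexSketch(string $word, array $patterns, array $replacements): string
{
    // preg_replace applies the pairs in order, each one operating on the
    // output of the previous pair.
    return preg_replace($patterns, $replacements, $word);
}

$regex = lexSketch('إدارة', $patterns, $replacements);

// The generated expression accepts a common alternative spelling.
$matchesVariant = preg_match('/^' . $regex . '$/u', 'اداره') === 1;
```

Because the pairs run sequentially, rule order matters: a later pattern can match text produced by an earlier replacement, so rules of this kind need to be written with that in mind.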
* Get the most likely Arabic lexical forms for a given word
* @param string $word String that user search for
* @return string list of the most likely Arabic lexical forms for a given word
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
$wordForms = array($word);
$postfix1 = array('كم', 'كن', 'نا', 'ها', 'هم', 'هن');
$postfix2 = array('ين', 'ون', 'ان', 'ات', 'وا');
if ($len >= 6 && $last3 == 'تين') {
$wordForms[] = $str3 . 'ة';
$wordForms[] = $word . 'ة';
if ($len >= 6 && ($last3 == 'كما' || $last3 == 'هما')) {
$wordForms[] = $str3 . 'كما';
$wordForms[] = $str3 . 'هما';
if ($len >= 5 && in_array($last2, $postfix2)) {
$wordForms[] = $str2. 'ة';
$wordForms[] = $str2. 'تين';
foreach ($postfix2 as $postfix) {
$wordForms[] = $str2 . $postfix;
if ($len >= 5 && in_array($last2, $postfix1)) {
$wordForms[] = $str2. 'ي';
$wordForms[] = $str2. 'ك';
$wordForms[] = $str2. 'كما';
$wordForms[] = $str2. 'هما';
foreach ($postfix1 as $postfix) {
$wordForms[] = $str2 . $postfix;
if ($len >= 5 && $last2 == 'ية') {
if (($len >= 4 && ($last1 == 'ة' || $last1 == 'ه' || $last1 == 'ت'))
|| ($len >= 5 && $last2 == 'ات')
$wordForms[] = $str1 . 'ة';
$wordForms[] = $str1 . 'ه';
$wordForms[] = $str1 . 'ت';
$wordForms[] = $str1 . 'ات';
if ($len >= 4 && $last1 == 'ى') {
$wordForms[] = $str1 . 'ا';
$trans = array('أ' => 'ا', 'إ' => 'ا', 'آ' => 'ا');
foreach ($wordForms as $word) {
$normWord = strtr($word, $trans);
if ($normWord != $word) {
$wordForms[] = $normWord;
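The final step above adds a hamza-free spelling for every form that contains an alif with hamza or madda. A standalone sketch of that step is shown below. The function name `addHamzaNormalizedForms` is hypothetical, and the duplicate check is an addition that the listing does not show.

```php
<?php
// Hypothetical helper: for each form, also include the spelling in which
// alif-with-hamza and alif-with-madda are written as a bare alif.
function addHamzaNormalizedForms(array $forms): array
{
    $trans  = ['أ' => 'ا', 'إ' => 'ا', 'آ' => 'ا'];
    $result = $forms;
    // Iterate over the original list; forms appended to $result are not
    // revisited.
    foreach ($forms as $form) {
        $norm = strtr($form, $trans);
        if ($norm !== $form && !in_array($norm, $result, true)) {
            $result[] = $norm;
        }
    }
    return $result;
}

$forms = addHamzaNormalizedForms(['أحمد', 'كتاب']);
```

Since `strtr` with an array replaces whole UTF-8 keys, it works on these multibyte characters without needing the mbstring extension.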
* Get the most likely Arabic lexical forms of the user's search keywords
* @param string $arg String that user search for
* @return string list of the most likely Arabic lexical forms for given keywords
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
foreach ($words as $word) {