File Source for Query.php

<?php
/**
* ----------------------------------------------------------------------
*
* Copyright (c) 2006-2016 Khaled Al-Sham'aa.
*
* http://www.ar-php.org
*
* PHP Version 5
*
* ----------------------------------------------------------------------
*
* LICENSE
*
* This program is open source product; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public License (LGPL)
* as published by the Free Software Foundation; either version 3
* of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU Lesser General Public License for more details.
*
* You should have received a copy of the GNU Lesser General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/lgpl.txt>.
*
* ----------------------------------------------------------------------
*
* Class Name: Arabic Query Class
*
* Filename: Query.php
*
* Original Author(s): Khaled Al-Sham'aa <khaled@ar-php.org>
*
* Purpose: Build WHERE condition for SQL statement using MySQL REGEXP and
* Arabic lexical rules
*
* ----------------------------------------------------------------------
*
* Arabic Query Class
*
* PHP class that builds a WHERE condition for an SQL statement using MySQL
* REGEXP and Arabic lexical rules.
*
* With the exception of the Qur'an and pedagogical texts, Arabic is generally
* written without vowels or other graphic symbols that indicate how a word is
* pronounced. The reader is expected to fill these in from context. Some of the
* graphic symbols include sukuun, which is placed over a consonant to indicate that
* it is not followed by a vowel; shadda, written over a consonant to indicate it is
* doubled; and hamza, the sign of the glottal stop, which can be written above or
* below (alif) at the beginning of a word, or on (alif), (waaw), (yaa'),
* or by itself on the line elsewhere. Also, common spelling differences regularly
* appear, including the use of (haa') for (taa' marbuuta) and (alif maqsuura)
* for (yaa'). These features of written Arabic, which are also seen in Hebrew as
* well as other languages written with Arabic script (such as Farsi, Pashto, and
* Urdu), make analyzing and searching texts quite challenging. In addition, Arabic
* morphology and grammar are quite rich and present some unique issues for
* information retrieval applications.
*
* There are essentially three ways to search an Arabic text with Arabic queries:
* literal, stem-based or root-based.
*
* A literal search, the simplest search and retrieval method, matches documents
* based on the search terms exactly as the user entered them. The advantage of this
* technique is that the documents returned will without a doubt contain the exact
* term for which the user is looking. But this advantage is also the biggest
* disadvantage: many, if not most, of the documents containing the terms in
* different forms will be missed. Given the many ambiguities of written Arabic, the
* success rate of this method is quite low. For example, if the user searches
* for (kitaab, book), he or she will not find documents that only
* contain (`al-kitaabu, the book).
*
* Stem-based searching, a more complicated method, requires some normalization of
* the original texts and the queries. This is done by removing the vowel signs,
* unifying the hamza forms and removing or standardizing the other signs.
* Additionally, grammatical affixes and other constructions which attach directly
* to words, such as conjunctions, prepositions, and the definite article, should be
* identified and removed. Finally, regular and irregular plural forms need to be
* identified and reduced to their singular forms. Performing this type of stemming
* leads to more successful searches, but can be problematic due to over-generation
* or incorrect generation of stems.
*
* A third method for searching Arabic texts is to index and search for the root
* forms of each word. Since most verbs and nouns in Arabic are derived from
* triliteral (or, rarely, quadriliteral) roots, identifying the underlying root of
* each word theoretically retrieves most of the documents containing a given search
* term regardless of form. However, there are some significant challenges with this
* approach. Determining the root for a given word is extremely difficult, since it
* requires a detailed morphological, syntactic and semantic analysis of the text to
* fully disambiguate the root forms. The issue is complicated further by the fact
* that not all words are derived from roots. For example, loan words (words
* borrowed from another language) are not based on root forms, although there are
* even exceptions to this rule. For example, some loans that have a structure
* similar to triliteral roots, such as the English word film, are handled
* grammatically as if they were root-based, adding to the complexity of this type
* of search. Finally, the root can serve as the foundation for a wide variety of
* words with related meanings. The root (k-t-b) is used for many words related
* to writing, including (kataba, to write), (kitaab, book), (maktab,
* office), and (kaatib, author). But the same root is also used for regiment/
* battalion, (katiiba). As a result, searching based on root forms results in
* very high recall, but precision is usually quite low.
*
* While search and retrieval of Arabic text will never be an easy task, relying on
* linguistic analysis tools and methods can help make the process more successful.
* Ultimately, the search method you choose should depend on how critical it is to
* retrieve every conceivable instance of a word or phrase and the resources you
* have to process search returns in order to determine their true relevance.
*
* Source: Volume 13 Issue 7 of MultiLingual Computing &
* Technology published by MultiLingual Computing, Inc., 319 North First Ave.,
* Sandpoint, Idaho, USA, 208-263-8178, Fax: 208-263-6310.
*
* Example:
* <code>
* include('./I18N/Arabic.php');
* $obj = new I18N_Arabic('Query');
*
* $dbuser = 'root';
* $dbpwd = '';
* $dbname = 'test';
*
* try {
*     $dbh = new PDO('mysql:host=localhost;dbname='.$dbname, $dbuser, $dbpwd);
*
*     // Set the error reporting attribute
*     $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
*
*     $dbh->exec("SET NAMES 'utf8'");
*
*     if (isset($_GET['keyword']) && $_GET['keyword'] != '') {
*         $keyword = $_GET['keyword'];
*         // Undo escaped double quotes so that quoted phrases are detected
*         $keyword = str_replace('\"', '"', $keyword);
*
*         $obj->setStrFields('headline');
*         $obj->setMode(isset($_GET['mode']) ? $_GET['mode'] : 0);
*
*         $strCondition = $obj->getWhereCondition($keyword);
*     } else {
*         $strCondition = '1';
*     }
*
*     $StrSQL = "SELECT `headline` FROM `aljazeera` WHERE $strCondition";
*
*     $i = 0;
*     foreach ($dbh->query($StrSQL) as $row) {
*         // Escape the value before writing it into the HTML output
*         $headline = htmlspecialchars($row['headline'], ENT_QUOTES, 'UTF-8');
*         $i++;
*         if ($i % 2 == 0) {
*             $bg = "#f0f0f0";
*         } else {
*             $bg = "#ffffff";
*         }
*         echo "<tr bgcolor=\"$bg\"><td>$headline</td></tr>";
*     }
*
*     // Close the database connection
*     $dbh = null;
*
* } catch (PDOException $e) {
*     echo $e->getMessage();
* }
* </code>
*
* @category I18N
* @package I18N_Arabic
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
* @copyright 2006-2016 Khaled Al-Sham'aa
*
* @license LGPL <http://www.gnu.org/licenses/lgpl.txt>
* @link http://www.ar-php.org
*/
/**
* This PHP class builds a WHERE condition for an SQL statement using MySQL
* REGEXP and Arabic lexical rules
*
* @category I18N
* @package I18N_Arabic
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
* @copyright 2006-2016 Khaled Al-Sham'aa
*
* @license LGPL <http://www.gnu.org/licenses/lgpl.txt>
* @link http://www.ar-php.org
*/
class I18N_Arabic_Query
{
private $_fields = array();
private $_lexPatterns = array();
private $_lexReplacements = array();
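private $_mode = 0;
// Conditions collected by getWhereCondition(); declared explicitly instead of
// being created on the fly as a dynamic property
protected $wordCondition = array();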
/**
* Loads the lexical search/replace rules from data/ArQuery.xml
*/
public function __construct()
{
$xml = simplexml_load_file(dirname(__FILE__).'/data/ArQuery.xml');
foreach ($xml->xpath("//preg_replace[@function='__construct']/pair")
as $pair) {
array_push($this->_lexPatterns, (string)$pair->search);
array_push($this->_lexReplacements, (string)$pair->replace);
}
}
/**
* Sets the list of fields to search, given as an array
*
* @param array $arrConfig Names of the fields that the SQL statement will
* search (an array whose items are the field names)
*
* @return object $this to build a fluent interface
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
*/
public function setArrFields($arrConfig)
{
if (is_array($arrConfig)) {
// Get _fields array
$this->_fields = $arrConfig;
}
return $this;
}
/**
* Sets the list of fields to search, given as a comma-delimited string
*
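* Example (a minimal sketch; the field names `headline` and `body` are
* assumptions for illustration, and the class file is assumed to be loaded):
* <code>
* $query = new I18N_Arabic_Query();
* $query->setStrFields('headline,body');
*
* // The string is split on commas only, so avoid spaces around the names
* $fields = $query->getArrFields(); // array('headline', 'body')
*
* // Equivalent configuration using the array-based setter
* $query->setArrFields(array('headline', 'body'));
* echo $query->getStrFields(); // headline,body
* </code>
*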
* @param string $strConfig Names of the fields that the SQL statement will
* search (a string using a comma as the delimiter)
*
* @return object $this to build a fluent interface
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
*/
public function setStrFields($strConfig)
{
if (is_string($strConfig)) {
// Get _fields array
$this->_fields = explode(',', $strConfig);
}
return $this;
}
/**
* Sets the $_mode property, which selects the search mode
* [0 for OR logic | 1 for AND logic]
*
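* Example (a minimal sketch, assuming the class file is already loaded):
* <code>
* $query = new I18N_Arabic_Query();
*
* // Values other than 0 or 1 are ignored and the previous mode is kept
* $query->setMode(1)->setMode(5);
* echo $query->getMode(); // 1
*
* // In mode 1 the per-word conditions are joined as (cond1) AND (cond2);
* // in mode 0, the default, they are joined as (cond1) OR (cond2)
* </code>
*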
* @param integer $mode Value to be saved in the $_mode property
*
* @return object $this to build a fluent interface
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
*/
public function setMode($mode)
{
if (in_array($mode, array('0', '1'))) {
// Set search mode [0 for OR logic | 1 for AND logic]
$this->_mode = $mode;
}
return $this;
}
/**
* Gets the $_mode property, which selects the search mode
* [0 for OR logic | 1 for AND logic]
*
* @return integer Value of the $_mode property
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
*/
public function getMode()
{
// Get search mode value [0 for OR logic | 1 for AND logic]
return $this->_mode;
}
/**
* Gets the values of the $_fields array in array format
*
* @return array Values of the $_fields array
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
*/
public function getArrFields()
{
$fields = $this->_fields;
return $fields;
}
/**
* Gets the values of the $_fields array in string format (comma delimited)
*
* @return string Values of the $_fields array in string format (comma delimited)
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
*/
public function getStrFields()
{
$fields = implode(',', $this->_fields);
return $fields;
}
/**
* Builds the WHERE section of the SQL statement using the defined lexical
* rules and the search mode [AND | OR]. Phrases (enclosed by "") are handled
* with a normal LIKE condition so that they are matched as they are.
*
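* Example (a sketch; `headline` and `aljazeera` are the column and table
* names used in the file-level example, and the exact REGEXP patterns depend
* on the rules loaded from data/ArQuery.xml, so they are shown as ...):
* <code>
* $query = new I18N_Arabic_Query();
* $query->setStrFields('headline')->setMode(1);
*
* // Two keywords separated by whitespace, each becomes one condition
* $where = $query->getWhereCondition('القدس فلسطين');
*
* // $where has the general shape:
* // ( REPLACE(headline, 'ـ', '') REGEXP '...') AND ( REPLACE(headline, 'ـ', '') REGEXP '...')
* $sql = "SELECT `headline` FROM `aljazeera` WHERE $where";
* </code>
*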
* @param string $arg String that the user searches for in the database table
*
* @return string The WHERE section in SQL statement
* (MySQL database engine format)
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
*/
public function getWhereCondition($arg)
{
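$sql = '';
// Start from an empty list so that repeated calls do not accumulate conditions
$this->wordCondition = array();
//$arg = mysql_real_escape_string($arg);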
$search = array("\\", "\x00", "\n", "\r", "'", '"', "\x1a");
$replace = array("\\\\","\\0","\\n", "\\r", "\'", '\"', "\\Z");
$arg = str_replace($search, $replace, $arg);
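// Check whether $arg contains quoted phrases that should be matched as they are.
// The double quotes were escaped above, so split on the escaped form (\") to
// avoid leaving a trailing backslash on each segment
$phrase = explode('\"', $arg);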
if (count($phrase) > 2) {
// Re-init $arg variable
// (It will contain the rest of $arg except phrases).
$arg = '';
for ($i = 0; $i < count($phrase); $i++) {
$subPhrase = $phrase[$i];
if ($i % 2 == 0 && $subPhrase != '') {
// Re-build $arg variable without the quoted phrases
$arg .= $subPhrase;
} elseif ($i % 2 == 1 && $subPhrase != '') {
// Handle phrases using regular LIKE matching in MySQL
$this->wordCondition[] = $this->getWordLike($subPhrase);
}
}
}
// Handle the remaining words in $arg using the lexical rules and regular expressions
$words = preg_split('/\s+/', trim($arg));
foreach ($words as $word) {
//if (is_numeric($word) || strlen($word) > 2) {
// Take off all the punctuation
//$word = preg_replace("/\p{P}/", '', $word);
$exclude = array('(', ')', '[', ']', '{', '}', ',', ';', ':',
'?', '!', '،', '؛', '؟');
$word = str_replace($exclude, '', $word);
// Skip empty tokens (e.g. a phrase-only query) to avoid an empty REGEXP pattern
if ($word === '') {
continue;
}
$this->wordCondition[] = $this->getWordRegExp($word);
//}
}
if (!empty($this->wordCondition)) {
if ($this->_mode == 0) {
$sql = '(' . implode(') OR (', $this->wordCondition) . ')';
} elseif ($this->_mode == 1) {
$sql = '(' . implode(') AND (', $this->wordCondition) . ')';
}
}
return $sql;
}
/**
* Builds the search condition in SQL format for one word in all defined fields
* using the REGEXP clause and the lexical rules
*
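* Example of the generated SQL (a sketch assuming the fields `headline` and
* `body`; PATTERN stands for the output of lex() for the given word):
* <code>
* // REPLACE(headline, 'ـ', '') REGEXP 'PATTERN' OR REPLACE(body, 'ـ', '') REGEXP 'PATTERN'
* </code>
* The REPLACE() call strips the tatweel (kashida) character from each field
* before the pattern is matched.
*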
* @param string $arg String (one word) that you want to build a condition for
*
* @return string sub SQL condition (for internal use)
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
*/
protected function getWordRegExp($arg)
{
$arg = $this->lex($arg);
//$sql = implode(" REGEXP '$arg' OR ", $this->_fields) . " REGEXP '$arg'";
$sql = ' REPLACE(' .
implode(", 'ـ', '') REGEXP '$arg' OR REPLACE(", $this->_fields) .
", 'ـ', '') REGEXP '$arg'";
return $sql;
}
/**
* Builds the search condition in SQL format for one word in all defined fields
* using a normal LIKE clause
*
* @param string $arg String (one word) that you want to build a condition for
*
* @return string sub SQL condition (for internal use)
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
*/
protected function getWordLike($arg)
{
$sql = implode(" LIKE '$arg' OR ", $this->_fields) . " LIKE '$arg'";
return $sql;
}
/**
* Builds an ORDER BY expression that ranks rows by how many of the user's
* search keywords they match
*
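* Example (a sketch; the table `aljazeera` and the column `headline` are
* taken from the file-level example, and $keyword is assumed to hold the
* search string):
* <code>
* $query = new I18N_Arabic_Query();
* $query->setStrFields('headline');
*
* $where = $query->getWhereCondition($keyword);
* // Note: getOrderBy() embeds $keyword as given, without escaping it,
* // so pass only trusted or pre-validated input
* $order = $query->getOrderBy($keyword);
*
* // Rows matching more of the keywords are listed first
* $sql = "SELECT `headline` FROM `aljazeera` WHERE $where ORDER BY $order";
* </code>
*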
* @param string $arg String that the user searches for in the database table
*
* @return string sub SQL ORDER BY section
* @author Saleh AlMatrafe <saleh@saleh.cc>
*/
public function getOrderBy($arg)
{
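// Start with an empty list so implode() below always receives an array
$wordOrder = array();
// Check whether $arg contains quoted phrases that should be matched as they are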
$phrase = explode("\"", $arg);
if (count($phrase) > 2) {
// Re-init $arg variable
// (It will contain the rest of $arg except phrases).
$arg = '';
for ($i = 0; $i < count($phrase); $i++) {
if ($i % 2 == 0 && $phrase[$i] != '') {
// Re-build $arg variable without the quoted phrases
$arg .= $phrase[$i];
} elseif ($i % 2 == 1 && $phrase[$i] != '') {
// Handle phrases using regular LIKE matching in MySQL
$wordOrder[] = $this->getWordLike($phrase[$i]);
}
}
}
// Handle the remaining words in $arg using the lexical rules and regular expressions
$words = explode(' ', $arg);
foreach ($words as $word) {
if ($word != '') {
$wordOrder[] = 'CASE WHEN ' .
$this->getWordRegExp($word) .
' THEN 1 ELSE 0 END';
}
}
$order = '((' . implode(') + (', $wordOrder) . ')) DESC';
return $order;
}
/**
* Applies the regular expression rules that are based on
* pre-defined Arabic lexical rules
*
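* Example of the mechanism (a sketch; the rule below is hypothetical and is
* not necessarily one of the pairs stored in data/ArQuery.xml):
* <code>
* // Hypothetical rule: a final taa' marbuuta or haa' matches either letter
* $patterns = array('/(ة|ه)$/u');
* $replacements = array('(ة|ه)');
*
* // Pairs are applied in order, each to the result of the previous one
* echo preg_replace($patterns, $replacements, 'مدرسة'); // مدرس(ة|ه)
* </code>
*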
* @param string $arg The single word that the user wants to search for
*
* @return string Regular Expression format to be used in MySQL query statement
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
*/
protected function lex($arg)
{
$arg = preg_replace($this->_lexPatterns, $this->_lexReplacements, $arg);
return $arg;
}
/**
* Get the most likely Arabic lexical forms for a given word
*
* @param string $word Word that the user searches for
*
* @return array List of the most likely Arabic lexical forms for the given word
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
*/
protected function allWordForms($word)
{
$wordForms = array($word);
$postfix1 = array('كم', 'كن', 'نا', 'ها', 'هم', 'هن');
$postfix2 = array('ين', 'ون', 'ان', 'ات', 'وا');
$len = mb_strlen($word);
if (mb_substr($word, 0, 2) == 'ال') {
$word = mb_substr($word, 2);
}
$wordForms[] = $word;
$str1 = mb_substr($word, 0, -1);
$str2 = mb_substr($word, 0, -2);
$str3 = mb_substr($word, 0, -3);
$last1 = mb_substr($word, -1);
$last2 = mb_substr($word, -2);
$last3 = mb_substr($word, -3);
if ($len >= 6 && $last3 == 'تين') {
$wordForms[] = $str3;
$wordForms[] = $str3 . 'ة';
$wordForms[] = $word . 'ة';
}
if ($len >= 6 && ($last3 == 'كما' || $last3 == 'هما')) {
$wordForms[] = $str3;
$wordForms[] = $str3 . 'كما';
$wordForms[] = $str3 . 'هما';
}
if ($len >= 5 && in_array($last2, $postfix2)) {
$wordForms[] = $str2;
$wordForms[] = $str2.'ة';
$wordForms[] = $str2.'تين';
foreach ($postfix2 as $postfix) {
$wordForms[] = $str2 . $postfix;
}
}
if ($len >= 5 && in_array($last2, $postfix1)) {
$wordForms[] = $str2;
$wordForms[] = $str2.'ي';
$wordForms[] = $str2.'ك';
$wordForms[] = $str2.'كما';
$wordForms[] = $str2.'هما';
foreach ($postfix1 as $postfix) {
$wordForms[] = $str2 . $postfix;
}
}
if ($len >= 5 && $last2 == 'ية') {
$wordForms[] = $str1;
$wordForms[] = $str2;
}
if (($len >= 4 && ($last1 == 'ة' || $last1 == 'ه' || $last1 == 'ت'))
|| ($len >= 5 && $last2 == 'ات')
) {
$wordForms[] = $str1;
$wordForms[] = $str1 . 'ة';
$wordForms[] = $str1 . 'ه';
$wordForms[] = $str1 . 'ت';
$wordForms[] = $str1 . 'ات';
}
if ($len >= 4 && $last1 == 'ى') {
$wordForms[] = $str1 . 'ا';
}
$trans = array('أ' => 'ا', 'إ' => 'ا', 'آ' => 'ا');
foreach ($wordForms as $word) {
$normWord = strtr($word, $trans);
if ($normWord != $word) {
$wordForms[] = $normWord;
}
}
$wordForms = array_unique($wordForms);
return $wordForms;
}
/**
* Get the most likely Arabic lexical forms of the user's search keywords
*
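* Example (a minimal sketch; it assumes the mbstring internal encoding is
* UTF-8, which the mb_* calls in allWordForms() rely on):
* <code>
* mb_internal_encoding('UTF-8');
* $query = new I18N_Arabic_Query();
*
* // The definite article is stripped, giving the bare noun as a second form
* echo $query->allForms('الكتاب'); // الكتاب كتاب
*
* // A final taa' marbuuta yields the stem plus its common ending variants
* echo $query->allForms('مدرسة'); // مدرسة مدرس مدرسه مدرست مدرسات
* </code>
*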
* @param string $arg String that the user searches for
*
* @return string Space-separated list of the most likely Arabic lexical forms
* for the given keywords
* @author Khaled Al-Sham'aa <khaled@ar-php.org>
*/
public function allForms($arg)
{
$wordForms = array();
$words = explode(' ', $arg);
foreach ($words as $word) {
$wordForms = array_merge($wordForms, $this->allWordForms($word));
}
$str = implode(' ', $wordForms);
return $str;
}
}
