In a previous post I explored some Linux commands to find root words in the entire Quran or in a particular sura of the Quran.

While Linux commands can be very productive in certain cases, they are not meant for data analysis. Hence, we need to resort to a proper data science programming language like R or Python.

In this post, I will explore the power of Python to address the same topic (i.e., root words in the Quran) and extend the problem to much more interesting queries.

This post is not intended to be a beginner’s tutorial on either Python or pandas, a Python package for data analysis. I expect you to have some experience with both tools. If you do not, there are plenty of resources out there; one good place to start is the Kaggle website.

Without further ado, let us get started.

First, let us start with a few setup steps, like loading the pandas package and renaming it as pd for ease of use.

import pandas as pd

Next, read the file that contains the morphological information. Pandas has the read_csv function, which can read directly from a URL. Note that the file contains some copyright information in the first 56 lines, hence the skiprows option. Also note that read_csv assumes by default that the separator is a comma. When it is not, as is the case in this file, we need to specify the delimiter explicitly, hence the sep='\t' option. Finally, we display a few lines from the top with the head() function.

url = 'http://textminingthequran.com/data/quranic-corpus-morphology-0.4.txt'
qdforiginal = pd.read_csv(url, sep='\t',skiprows=56)
qdforiginal.head()
LOCATION FORM TAG FEATURES
0 (1:1:1:1) bi P PREFIX|bi+
1 (1:1:1:2) somi N STEM|POS:N|LEM:{som|ROOT:smw|M|GEN
2 (1:1:2:1) {ll~ahi PN STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN
3 (1:1:3:1) {l DET PREFIX|Al+
4 (1:1:3:2) r~aHoma`ni ADJ STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN
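To see what skiprows and sep do without downloading anything, here is a minimal sketch on an in-memory string. The text in raw is a made-up miniature of the corpus file (two notice lines instead of 56), and the names raw and demo are only for illustration.

```python
import io
import pandas as pd

# Hypothetical miniature of the corpus file: two notice lines standing in
# for the 56-line copyright header, then a tab-separated header and two rows.
raw = (
    "# copyright notice line 1\n"
    "# copyright notice line 2\n"
    "LOCATION\tFORM\tTAG\tFEATURES\n"
    "(1:1:1:1)\tbi\tP\tPREFIX|bi+\n"
    "(1:1:1:2)\tsomi\tN\tSTEM|POS:N|LEM:{som|ROOT:smw|M|GEN\n"
)

# skiprows drops the leading notice lines; sep='\t' splits on tabs, not commas.
demo = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=2)

print(demo.columns.tolist())
print(demo.shape)
```

Without sep='\t', each line would be read as a single field, because none of these lines contains a comma.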

It is a good idea to save this first version locally with the to_csv() function.

qdforiginal.to_csv('quran-morphology-v1.csv')

Looking at the first few lines of the file above we see that the LOCATION and FEATURES columns need to be split further.

Our file contains 128k lines (you can verify that with qdforiginal.shape). I prefer to take a small sample of this big file and experiment with the splitting on it. Once that works, we can run it on the entire file.

Splitting columns

Here is my strategy: since I am interested in root words, I first want to select all rows that contain the text ROOT: in the FEATURES column. This can be done with a command like the following:

qdforiginal.FEATURES.str.contains('ROOT:')[:3]
0    False
1     True
2     True
Name: FEATURES, dtype: bool

I only took the first 3 of the 128k lines. The command returns boolean values, True or False, so we can pass this boolean result to filter the entire dataframe:

qdforiginal[qdforiginal.FEATURES.str.contains('ROOT:')].head(3)
LOCATION FORM TAG FEATURES
1 (1:1:1:2) somi N STEM|POS:N|LEM:{som|ROOT:smw|M|GEN
2 (1:1:2:1) {ll~ahi PN STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN
4 (1:1:3:2) r~aHoma`ni ADJ STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN
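Here is a small sketch of the same boolean-mask filtering on a toy dataframe. The regex=False and na=False options are my additions, not part of the commands above: the first treats 'ROOT:' as a literal substring, and the second counts a missing value as "no match" instead of propagating it into the mask.

```python
import pandas as pd

# Toy rows; the last one has a missing FEATURES value on purpose.
demo = pd.DataFrame({
    "FORM": ["bi", "somi", "{l", "x"],
    "FEATURES": [
        "PREFIX|bi+",
        "STEM|POS:N|LEM:{som|ROOT:smw|M|GEN",
        "PREFIX|Al+",
        None,
    ],
})

# Boolean Series: True where FEATURES contains the literal text 'ROOT:'.
mask = demo["FEATURES"].str.contains("ROOT:", regex=False, na=False)

# Indexing the dataframe with the mask keeps only the True rows.
with_root = demo[mask]

print(mask.tolist())
print(with_root["FORM"].tolist())
```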

To get a random sample, I can use the sample() method as follows. I will name this sample qsample.

qsample = qdforiginal[qdforiginal.FEATURES.str.contains('ROOT:')].sample(10); qsample
LOCATION FORM TAG FEATURES
46292 (10:89:3:1) >ujiybat V STEM|POS:V|PERF|PASS|(IV)|LEM:>ujiybat|ROOT:jw...
37295 (7:193:7:1) sawaA^'N N STEM|POS:N|LEM:sawaA^'|ROOT:swy|M|INDEF|NOM
30207 (6:118:9:2) _#aAya`ti N STEM|POS:N|LEM:'aAyap|ROOT:Ayy|FP|GEN
94595 (36:43:2:1) n~a$a>o V STEM|POS:V|IMPF|LEM:$aA^'a|ROOT:$yA|1P|MOOD:JUS
51228 (12:48:13:1) qaliylFA N STEM|POS:N|LEM:qaliyl|ROOT:qll|MS|INDEF|ACC
107812 (46:17:28:2) >aw~aliyna N STEM|POS:N|LEM:>aw~al|ROOT:Awl|MP|GEN
87207 (30:38:16:2) mufoliHuwna N STEM|POS:N|ACT|PCPL|(IV)|LEM:mufoliHuwn|ROOT:f...
78161 (25:74:8:2) *ur~iy~a`ti N STEM|POS:N|LEM:*ur~iy~a`t|ROOT:*rr|FP|GEN
123443 (74:41:2:2) mujorimiyna N STEM|POS:N|ACT|PCPL|(IV)|LEM:mujorim|ROOT:jrm|...
68794 (21:1:3:1) HisaAbu N STEM|POS:N|VN|(III)|LEM:HisaAb|ROOT:Hsb|M|NOM
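Since sample() draws different rows on every run, the output above will differ from yours. If you want a repeatable sample, one option is the random_state parameter, sketched below on a toy dataframe (the seed value 0 is arbitrary).

```python
import pandas as pd

demo = pd.DataFrame({"n": range(100)})

# The same seed yields the same rows on every call.
a = demo.sample(5, random_state=0)
b = demo.sample(5, random_state=0)

print(a.index.equals(b.index))
```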

My intention is to split the LOCATION column into four columns, and then the FEATURES column into a column for Root and another for Lemma.

First, I am going to split the LOCATION column into four columns. This is done with the str.extract method, which takes a regular expression, hence the r'...' input. The ?P<...> construct within the regular expression names the resulting columns. The four pairs of parentheses within the regular expression specify the four groups we want to collect.

tmp1 = qsample.LOCATION.str.extract(r'(?P<sura>\d*):(?P<aya>\d*):(?P<word>\d*):(?P<w_seg>\d*)'); tmp1
sura aya word w_seg
46292 10 89 3 1
37295 7 193 7 1
30207 6 118 9 2
94595 36 43 2 1
51228 12 48 13 1
107812 46 17 28 2
87207 30 38 16 2
78161 25 74 8 2
123443 74 41 2 2
68794 21 1 3 1
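One thing worth knowing is that str.extract returns strings, even when the captured text looks like a number. The sketch below repeats the split on two LOCATION values and then casts the result to integers, which is useful if you later want to sort or join on these columns. Using \d+ and matching the outer parentheses are small variations of mine on the pattern above.

```python
import pandas as pd

loc = pd.Series(["(10:89:3:1)", "(7:193:7:1)"], name="LOCATION")

# Four named groups, one per part of the location.
pattern = r"\((?P<sura>\d+):(?P<aya>\d+):(?P<word>\d+):(?P<w_seg>\d+)\)"
parts = loc.str.extract(pattern)

# The extracted values are text, so '10' would sort before '7'.
print(type(parts.loc[0, "sura"]))

# Cast every column to int (safe here because every row matched the pattern).
parts = parts.astype(int)
print(parts.loc[0].tolist())
```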

Now, let us extract the roots from the FEATURES column in the same way.

tmp2 = qsample.FEATURES.str.extract(r'ROOT:(?P<Root>[^|]*)'); tmp2
Root
46292 jwb
37295 swy
30207 Ayy
94595 $yA
51228 qll
107812 Awl
87207 flH
78161 *rr
123443 jrm
68794 Hsb

Similarly, I can extract Lemmas as well. Note that Lemmas are actual words, whereas Roots are not actual words, so at times Lemmas could be more informative.

tmp3 = qsample.FEATURES.str.extract(r'LEM:(?P<Lemma>[^|]*)'); tmp3
Lemma
46292 >ujiybat
37295 sawaA^'
30207 'aAyap
94595 $aA^'a
51228 qaliyl
107812 >aw~al
87207 mufoliHuwn
78161 *ur~iy~a`t
123443 mujorim
68794 HisaAb

Finally, all that is left is to concatenate the original sample qsample with the three splits tmp1, tmp2 and tmp3, as follows. The axis=1 option means the concatenation runs along columns (not rows).

pd.concat([tmp1, qsample, tmp2,tmp3], axis=1)
sura aya word w_seg LOCATION FORM TAG FEATURES Root Lemma
46292 10 89 3 1 (10:89:3:1) >ujiybat V STEM|POS:V|PERF|PASS|(IV)|LEM:>ujiybat|ROOT:jw... jwb >ujiybat
37295 7 193 7 1 (7:193:7:1) sawaA^'N N STEM|POS:N|LEM:sawaA^'|ROOT:swy|M|INDEF|NOM swy sawaA^'
30207 6 118 9 2 (6:118:9:2) _#aAya`ti N STEM|POS:N|LEM:'aAyap|ROOT:Ayy|FP|GEN Ayy 'aAyap
94595 36 43 2 1 (36:43:2:1) n~a$a>o V STEM|POS:V|IMPF|LEM:$aA^'a|ROOT:$yA|1P|MOOD:JUS $yA $aA^'a
51228 12 48 13 1 (12:48:13:1) qaliylFA N STEM|POS:N|LEM:qaliyl|ROOT:qll|MS|INDEF|ACC qll qaliyl
107812 46 17 28 2 (46:17:28:2) >aw~aliyna N STEM|POS:N|LEM:>aw~al|ROOT:Awl|MP|GEN Awl >aw~al
87207 30 38 16 2 (30:38:16:2) mufoliHuwna N STEM|POS:N|ACT|PCPL|(IV)|LEM:mufoliHuwn|ROOT:f... flH mufoliHuwn
78161 25 74 8 2 (25:74:8:2) *ur~iy~a`ti N STEM|POS:N|LEM:*ur~iy~a`t|ROOT:*rr|FP|GEN *rr *ur~iy~a`t
123443 74 41 2 2 (74:41:2:2) mujorimiyna N STEM|POS:N|ACT|PCPL|(IV)|LEM:mujorim|ROOT:jrm|... jrm mujorim
68794 21 1 3 1 (21:1:3:1) HisaAbu N STEM|POS:N|VN|(III)|LEM:HisaAb|ROOT:Hsb|M|NOM Hsb HisaAb
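The reason this works is that concat with axis=1 lines rows up by their index labels, not by their position. Here is a small sketch with two toy frames whose rows are deliberately in a different order.

```python
import pandas as pd

left = pd.DataFrame({"FORM": ["somi", "HisaAbu"]}, index=[1, 68794])

# Same index labels as left, but listed in the opposite order.
right = pd.DataFrame({"Root": ["Hsb", "smw"]}, index=[68794, 1])

# Rows are matched on the index label, so each FORM gets its own Root.
combined = pd.concat([left, right], axis=1)

print(combined.loc[1].tolist())
print(combined.loc[68794].tolist())
```

This is why the pieces extracted from qsample can be glued back onto it safely: they all carry the original row labels.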

Now that the experiment ran successfully on the sample, let us repeat it on the actual file qdforiginal.

tmp1 = qdforiginal.LOCATION.str.extract(r'(?P<sura>\d*):(?P<aya>\d*):(?P<word>\d*):(?P<w_seg>\d*)')
tmp2 = qdforiginal.FEATURES.str.extract(r'ROOT:(?P<Root>[^|]*)')
tmp3 = qdforiginal.FEATURES.str.extract(r'LEM:(?P<Lemma>[^|]*)')
df_qruan = pd.concat([tmp1, qdforiginal, tmp2,tmp3], axis=1)

To confirm the shape of the new dataframe df_qruan, I can use the shape attribute. I can also display some random rows.

df_qruan.shape
(128219, 10)
df_qruan.sample(5)
sura aya word w_seg LOCATION FORM TAG FEATURES Root Lemma
33209 7 50 19 1 (7:50:19:1) EalaY P STEM|POS:P|LEM:EalaY` NaN EalaY`
26277 5 106 40 1 (5:106:40:1) wa REM PREFIX|w:REM+ NaN NaN
61047 17 55 4 1 (17:55:4:1) fiY P STEM|POS:P|LEM:fiY NaN fiY
125339 81 14 1 1 (81:14:1:1) Ealimato V STEM|POS:V|PERF|LEM:Ealima|ROOT:Elm|3FS Elm Ealima
120502 67 19 1 1 (67:19:1:1) >a INTG PREFIX|A:INTG+ NaN NaN

Our newly introduced columns may contain extra spaces, which we can remove with the str.strip() method as follows.

df_qruan['Root'] = df_qruan['Root'].str.strip()
df_qruan['Lemma'] = df_qruan['Lemma'].str.strip()

It is a good idea to save this version into a csv file. index=False avoids unnecessarily including an extra index column in the output file.

df_qruan.to_csv('quran-morphology-v2.csv', index=False)
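To make the effect of index=False concrete, the sketch below writes a toy dataframe to a string both ways (to_csv returns the text when no path is given) and reads each version back.

```python
import io
import pandas as pd

demo = pd.DataFrame({"Root": ["smw", "Alh"]})

with_index = demo.to_csv()                 # index written as an unnamed first column
without_index = demo.to_csv(index=False)   # only the data columns

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading the first version back yields an extra 'Unnamed: 0' column.
reread_with = pd.read_csv(io.StringIO(with_index))
reread_without = pd.read_csv(io.StringIO(without_index))
print(reread_with.columns.tolist())
print(reread_without.columns.tolist())
```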

Join with Meccan/Medinan file

It would be very useful to augment our file with a new column that tells us whether a sura is Meccan or Medinan. This will later allow us to answer questions like: what are the unique root words in the Quran that appear only in Meccan suras?

To do this, I am referring to a table of contents page I created some time back using Angular here

My idea is to go to that page, select the table with the mouse, copy it to the clipboard, and then read the clipboard into a dataframe qtoc as follows.

qtoc=pd.read_clipboard()
qtoc.head()
No. Name Arabic Name English Meaning No of verses Place Chronology
0 1 الفاتحة Al-Fatiha The Opening 7 Meccan 5
1 2 البقرة Al-Baqara The Cow 286 Medinan 87
2 3 آل عمران Al Imran The House of Joachim 200 Medinan 89
3 4 النساء An-Nisa' Women 176 Medinan 92
4 5 المائدة Al-Ma'ida The Table Spread 120 Medinan 112

Again, let me save this dataframe locally.

qtoc.to_csv('toc.csv', index=False)

I will now use the merge function to merge our file df_qruan with qtoc on the sura number (the sura column in the left df_qruan file and the No. column in the right qtoc file). A left join is the one that makes sense here, since it keeps every row of df_qruan. The new dataframe is saved as quran.

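# str.extract returns strings, so cast sura to int to match the integer No. column
df_qruan['sura'] = df_qruan['sura'].astype(int)
quran = df_qruan.merge(qtoc.loc[:,['No.', 'Place']], how='left', left_on='sura', right_on='No.')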

I can display some useful information with the info() method.

quran.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 128219 entries, 0 to 128218
Data columns (total 12 columns):
sura        128219 non-null int64
aya         128219 non-null object
word        128219 non-null object
w_seg       128219 non-null object
LOCATION    128219 non-null object
FORM        128011 non-null object
TAG         128219 non-null object
FEATURES    128219 non-null object
Root        49968 non-null object
Lemma       74608 non-null object
No.         128219 non-null int64
Place       128219 non-null object
dtypes: int64(2), object(10)
memory usage: 12.7+ MB
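Here is a toy version of the same left join, as a sketch. The two small frames are made up, and the validate='many_to_one' argument is an optional extra of mine: it makes merge raise an error if a sura number appears more than once in the table of contents, which would otherwise silently duplicate rows.

```python
import pandas as pd

# Toy stand-ins for df_qruan and qtoc; sura is text, as str.extract returns it.
segments = pd.DataFrame({"sura": ["1", "2", "2"], "FORM": ["bi", "fiy", "min"]})
toc = pd.DataFrame({"No.": [1, 2], "Place": ["Meccan", "Medinan"]})

# Both key columns need the same dtype before merging.
segments["sura"] = segments["sura"].astype(int)

merged = segments.merge(
    toc,
    how="left",              # keep every row of segments
    left_on="sura",
    right_on="No.",
    validate="many_to_one",  # many segments per sura, one toc row per sura
)

print(merged["Place"].tolist())
print(len(merged) == len(segments))
```

A left join never drops rows from the left frame, so the row count staying the same is a quick sanity check.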

I no longer need the LOCATION and No. columns, as they are now redundant, so I just drop them.

quran.drop(columns=['LOCATION','No.'], inplace=True)

As usual, here is a local copy of the final file after doing all these setup steps.

quran.to_csv('quran-morphology-final.csv', index=False)

Converting Buckwalter to Arabic

Our file contains Quranic words and roots in Buckwalter form, and I wanted a handy function to convert that into Arabic form. Here is how we do it.

First, referencing this site, I can construct the following dictionary, which maps each Unicode character to its Buckwalter symbol. I will call this dictionary abjad.

abjad = {u"\u0627":'A',
u"\u0628":'b', u"\u062A":'t', u"\u062B":'v', u"\u062C":'j',
u"\u062D":'H', u"\u062E":'x', u"\u062F":'d', u"\u0630":'*', u"\u0631":'r',
u"\u0632":'z', u"\u0633":'s', u"\u0634":'$', u"\u0635":'S', u"\u0636":'D',
u"\u0637":'T', u"\u0638":'Z', u"\u0639":'E', u"\u063A":'g', u"\u0641":'f',
u"\u0642":'q', u"\u0643":'k', u"\u0644":'l', u"\u0645":'m', u"\u0646":'n',
u"\u0647":'h', u"\u0648":'w', u"\u0649":'Y', u"\u064A":'y'}
abjad[' ']=' '
abjad[u"\u0621"] = '\''
abjad[u"\u0623"] = '>'
abjad[u"\u0625"] = '<'
abjad[u"\u0624"] = '&'
abjad[u"\u0626"] = '}'
#abjad[u"\u0655"] = '\'' # Hamza below
abjad[u"\u0622"] = '|'
abjad[u"\u064E"] = 'a'
abjad[u"\u064F"] = 'u'
abjad[u"\u0650"] = 'i'
abjad[u"\u0651"] = '~'
abjad[u"\u0652"] = 'o'
abjad[u"\u064B"] = 'F'
abjad[u"\u064C"] = 'N'
abjad[u"\u064D"] = 'K'
abjad[u"\u0640"] = '_'
abjad[u"\u0670"] = '`'
abjad[u"\u0629"] = 'p'
abjad[u"\u0653"] = '^'
abjad[u"\u0654"] = '#'
abjad[u"\u0671"] = '{'
abjad[u"\u06DC"] = ':'
abjad[u"\u06DF"] = '@'
abjad[u"\u0653"] = '^'
abjad[u"\u06E0"] = '"'
abjad[u"\u06E2"] = '['
abjad[u"\u06E3"] = ';'
abjad[u"\u06E5"] = ','
abjad[u"\u06E6"] = '.'
abjad[u"\u06E8"] = '!'
abjad[u"\u06EA"] = '-'
abjad[u"\u06EB"] = '+'
abjad[u"\u06EC"] = '%'
abjad[u"\u06ED"] = ']'

Let us also construct the reverse dictionary, called alphabet, which maps the Buckwalter symbols back to Unicode and hence lets us display Arabic words.

# Create the reverse
alphabet = {}
for key in abjad:
    alphabet[abjad[key]] = key

Using these two dictionaries, we can convert a string from one form to the other with the following two handy functions.

def arabic_to_buc(ara):
    return ''.join(map(lambda x:abjad[x], list(ara)))

def buck_to_arabic(buc):
    return ''.join(map(lambda x:alphabet[x], list(buc)))

Here is a small test.

buck_to_arabic('EalaY`')
'عَلَىٰ'
arabic_to_buc('الحمد لله')
'AlHmd llh'
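Note that both functions raise a KeyError when they meet a character that is not in the dictionary. If you prefer to pass unknown characters through unchanged, here is a lenient variant as a sketch. The name buck_to_arabic_lenient and the three-letter mini_alphabet are hypothetical, only there to keep the example self-contained; in practice you would pass the full alphabet dictionary.

```python
# Hypothetical three-letter subset of the Buckwalter-to-Unicode table.
mini_alphabet = {"E": "\u0639", "l": "\u0644", "m": "\u0645"}

def buck_to_arabic_lenient(buc, table=mini_alphabet):
    """Convert known Buckwalter symbols; leave unknown characters as they are."""
    # dict.get(ch, ch) falls back to the character itself when it is unmapped.
    return "".join(table.get(ch, ch) for ch in buc)

print(buck_to_arabic_lenient("Elm"))
print(buck_to_arabic_lenient("Elm?"))
```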

Counting roots

Now it is time to get to the core of our query: what are the unique root words that appear in Meccan suras but not in Medinan suras?

As we saw before, we can (1) filter the dataframe with a logical check like quran.Place == 'Meccan' to (2) get all Meccan rows, (3) select only the Root column, (4) call the unique() method to get an array of unique roots, (5) convert it to a list with tolist(), and finally (6) wrap the whole thing in set() to get the set of unique Meccan roots, called k here. Note how chaining lets me perform six operations in one statement.

k = set(quran[quran.Place == 'Meccan'].Root.unique().tolist())

With the same logic, we produce the set of unique Medinan roots, called d.

d = set(quran[quran.Place == 'Medinan'].Root.unique().tolist())

With this we can now remove from the Meccan set the roots that also appear in the Medinan set, using the following set operations. We find that 547 roots appear only in Meccan suras, 198 only in Medinan suras, and 898 in both.

makki_words = k-d; len(makki_words)
547
madani_words = d - k; len(madani_words)
198
both = k & d
len(both)
898
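The same three set operations on a toy dataframe, as a sketch. One difference from the commands above: I call dropna() before building each set, because unique() keeps the missing value as an entry, and rows without a root would otherwise show up in the sets.

```python
import pandas as pd

# Toy data: one root per row, with one missing root on purpose.
demo = pd.DataFrame({
    "Root": ["smw", "Alh", None, "rHm", "Alh", "ktb"],
    "Place": ["Meccan", "Meccan", "Meccan", "Medinan", "Medinan", "Medinan"],
})

meccan = set(demo.loc[demo.Place == "Meccan", "Root"].dropna())
medinan = set(demo.loc[demo.Place == "Medinan", "Root"].dropna())

print(sorted(meccan - medinan))   # only in Meccan rows
print(sorted(medinan - meccan))   # only in Medinan rows
print(sorted(meccan & medinan))   # in both
```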

We now have all the nuts and bolts at hand to define two useful functions.

Our first function is sura_words. It takes as input a list of sura numbers (for example, [113,114] means suras 113 and 114). It also takes the kind of unique words we want for this list of suras: W, the default, is the word list, R is the root list and L is the lemma list. Note how we use the isin() method to filter the dataframe down to the suras we provide. Also note the dropna() function, which drops null values from the list. Finally, note how we return the Arabic form of the results using the buck_to_arabic() function we defined earlier.

# function to return words given a list of sura
def sura_words(s_list, kind='W'):
    if (kind=='R'):
        result = quran[quran.sura.isin(s_list)].Root.dropna().unique().tolist()
    elif (kind=='L'):
        result = quran[quran.sura.isin(s_list)].Lemma.dropna().unique().tolist()
    else:
        result = quran[quran.sura.isin(s_list)].FORM.dropna().unique().tolist()
    return [buck_to_arabic(x) for x in result]

Here is a test on the lemmas of sura No. 111.

sura_words([111],'L')
['تَبَّ',
 'يَد',
 'أَبٌ',
 'لَهَب',
 'مَا',
 'أَغْنَىٰ',
 'عَن',
 'مَال',
 'كَسَبَ',
 'يَصْلَى',
 'نَار',
 'ذُو',
 'ٱمْرَأَت',
 'حَمَّالَة',
 'حَطَب',
 'فِى',
 'جِيد',
 'حَبْل',
 'مِن',
 'مَّسَد']

The above function has many uses. For example, you may want to build your Quranic vocabulary gradually by memorizing the roots of one sura at a time. This function conveniently gives you the unique list of roots (or lemmas, or just words).

With a small variation that exploits set operations, we can define another function called unique_sura_words. It again takes a list of suras, but returns the roots (or lemmas, or raw words) that appear only in this list of suras. Note the ~ operator, which negates a condition. So ~quran.sura.isin([113,114]) means all suras except 113 and 114.

# function to return words that appear only in the given list of suras
def unique_sura_words(s_list, kind='W'):
    if (kind=='R'):
        first = quran[quran.sura.isin(s_list)].Root.dropna().unique().tolist()
        second = quran[~quran.sura.isin(s_list)].Root.dropna().unique().tolist()
        result = list(set(first)-set(second))
    elif (kind=='L'):
        first = quran[quran.sura.isin(s_list)].Lemma.dropna().unique().tolist()
        second = quran[~quran.sura.isin(s_list)].Lemma.dropna().unique().tolist()
        result = list(set(first)-set(second))
    else:
        first = quran[quran.sura.isin(s_list)].FORM.dropna().unique().tolist()
        second = quran[~quran.sura.isin(s_list)].FORM.dropna().unique().tolist()
        result = list(set(first)-set(second))
    return [buck_to_arabic(x) for x in result]

Using this function, we learn that sura 113 has these two root words that can be found nowhere else in the Quran.

unique_sura_words([113],'R')
['نفث', 'وقب']
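The three branches of unique_sura_words differ only in the column they read, so one possible tidy-up is to look the column up in a small dictionary. The sketch below is one way to do it, not a replacement for the function above: the names COLUMN_FOR_KIND and unique_values are hypothetical, the dataframe is passed in as an argument, the toy data is made up, and the result stays in Buckwalter form to keep the example self-contained.

```python
import pandas as pd

# Maps the 'kind' letter to the column it refers to.
COLUMN_FOR_KIND = {"W": "FORM", "L": "Lemma", "R": "Root"}

def unique_values(df, s_list, kind="W"):
    """Return the sorted values of a column that occur only in the given suras."""
    column = COLUMN_FOR_KIND.get(kind, "FORM")  # unknown kinds fall back to FORM
    in_list = df["sura"].isin(s_list)
    inside = set(df.loc[in_list, column].dropna())
    outside = set(df.loc[~in_list, column].dropna())
    return sorted(inside - outside)

# Made-up rows, just enough to exercise the function.
demo = pd.DataFrame({
    "sura": [113, 113, 114, 114],
    "FORM": ["a", "b", "b", "c"],
    "Lemma": ["la", "lb", "lb", "lc"],
    "Root": ["nfv", "Ewd", "Ewd", "wsws"],
})

print(unique_values(demo, [113], "R"))
print(unique_values(demo, [114]))
```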

From here, one can extend this utility into a web app using a framework like Flask. I am leaving a Jupyter notebook for you to try things out yourself.