In a previous post I explored some Linux
commands to find root words in the entire Quran or in a particular sura. While Linux
commands can be very productive in certain cases, they are not meant for data analysis. Hence, we need to resort to a proper data science programming language like R or Python.
In this post, I will explore the power of Python
to address the same topic (i.e., root words in the Quran), but I will extend the problem to much more interesting queries.
This post is not intended to be a beginner’s tutorial to either Python
or pandas
, the Python package for data analysis; I expect you to have some experience with both. In any case, there are tons of resources out there, and one good place to start is the Kaggle website.
Without further ado, let us get started.
First, let us go through a few setup steps: load the pandas
package and alias it as pd
for ease of use.
import pandas as pd
Next, read the file that contains the morphological information. Pandas has the read_csv
function, which can read directly from a URL. Note that the file contains some copyright information in its first 56 lines, hence the skiprows
option. Also note that read_csv
assumes the separator is a comma by default; since this file is tab-separated, we need to specify the delimiter explicitly with the sep='\t'
option. Finally, we display a few lines from the top with the head()
function.
url = 'http://textminingthequran.com/data/quranic-corpus-morphology-0.4.txt'
qdforiginal = pd.read_csv(url, sep='\t',skiprows=56)
qdforiginal.head()
  | LOCATION | FORM | TAG | FEATURES
---|---|---|---|---
0 | (1:1:1:1) | bi | P | PREFIX|bi+ |
1 | (1:1:1:2) | somi | N | STEM|POS:N|LEM:{som|ROOT:smw|M|GEN |
2 | (1:1:2:1) | {ll~ahi | PN | STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN |
3 | (1:1:3:1) | {l | DET | PREFIX|Al+ |
4 | (1:1:3:2) | r~aHoma`ni | ADJ | STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN |
It is a good idea to save this first version locally using the to_csv()
method.
qdforiginal.to_csv('quran-morphology-v1.csv')
Looking at the first few lines of the file above, we see that the LOCATION
and FEATURES
columns need to be split further.
Our file contains 128k rows (you can verify that with qdforiginal.shape
). I prefer to take a small sample of this big file and experiment with the splitting on it; once successful, we can run the same steps on the entire file.
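As a toy illustration of this check-the-shape-then-sample workflow (the dataframe here is made up for illustration, not the real corpus):

```python
import pandas as pd

# Tiny made-up dataframe standing in for the 128k-row corpus
df = pd.DataFrame({'FORM': ['bi', 'somi', 'l', 'r~aHoma`ni'],
                   'TAG':  ['P', 'N', 'DET', 'ADJ']})

print(df.shape)       # (4, 2): (rows, columns)
small = df.sample(2)  # random 2-row subset to experiment on
print(small.shape)    # (2, 2)
```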
Splitting columns
Here is my strategy: since I am interested in root words, I first want to select all rows that contain ROOT:
in the FEATURES
column. This can be done with a command like the following:
qdforiginal.FEATURES.str.contains('ROOT:')[:3]
0 False
1 True
2 True
Name: FEATURES, dtype: bool
I only took the first 3 of the entire 128k rows. The expression returns Boolean values of True
or False
, so we can pass this Boolean result to filter the entire dataframe:
qdforiginal[qdforiginal.FEATURES.str.contains('ROOT:')].head(3)
  | LOCATION | FORM | TAG | FEATURES
---|---|---|---|---
1 | (1:1:1:2) | somi | N | STEM|POS:N|LEM:{som|ROOT:smw|M|GEN |
2 | (1:1:2:1) | {ll~ahi | PN | STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN |
4 | (1:1:3:2) | r~aHoma`ni | ADJ | STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN |
To ensure some random sampling, I can always use the sample()
method as follows. I will name this sample qsample
.
qsample = qdforiginal[qdforiginal.FEATURES.str.contains('ROOT:')].sample(10); qsample
  | LOCATION | FORM | TAG | FEATURES
---|---|---|---|---
46292 | (10:89:3:1) | >ujiybat | V | STEM|POS:V|PERF|PASS|(IV)|LEM:>ujiybat|ROOT:jw... |
37295 | (7:193:7:1) | sawaA^'N | N | STEM|POS:N|LEM:sawaA^'|ROOT:swy|M|INDEF|NOM |
30207 | (6:118:9:2) | _#aAya`ti | N | STEM|POS:N|LEM:'aAyap|ROOT:Ayy|FP|GEN |
94595 | (36:43:2:1) | n~a$a>o | V | STEM|POS:V|IMPF|LEM:$aA^'a|ROOT:$yA|1P|MOOD:JUS |
51228 | (12:48:13:1) | qaliylFA | N | STEM|POS:N|LEM:qaliyl|ROOT:qll|MS|INDEF|ACC |
107812 | (46:17:28:2) | >aw~aliyna | N | STEM|POS:N|LEM:>aw~al|ROOT:Awl|MP|GEN |
87207 | (30:38:16:2) | mufoliHuwna | N | STEM|POS:N|ACT|PCPL|(IV)|LEM:mufoliHuwn|ROOT:f... |
78161 | (25:74:8:2) | *ur~iy~a`ti | N | STEM|POS:N|LEM:*ur~iy~a`t|ROOT:*rr|FP|GEN |
123443 | (74:41:2:2) | mujorimiyna | N | STEM|POS:N|ACT|PCPL|(IV)|LEM:mujorim|ROOT:jrm|... |
68794 | (21:1:3:1) | HisaAbu | N | STEM|POS:N|VN|(III)|LEM:HisaAb|ROOT:Hsb|M|NOM |
My intention is to split the LOCATION
column into four columns, and then split the FEATURES
column into one column for the Root and another for the Lemma.
First, I am going to split the LOCATION
column into four columns. This is done with the extract
method, which takes a regular expression
, hence the r'...'
input. The ?P<...>
construct within the regular expression names the resulting columns, and the four parenthesized groups specify the four pieces we want to capture.
tmp1 = qsample.LOCATION.str.extract(r'(?P<sura>\d*):(?P<aya>\d*):(?P<word>\d*):(?P<w_seg>\d*)'); tmp1
  | sura | aya | word | w_seg
---|---|---|---|---
46292 | 10 | 89 | 3 | 1 |
37295 | 7 | 193 | 7 | 1 |
30207 | 6 | 118 | 9 | 2 |
94595 | 36 | 43 | 2 | 1 |
51228 | 12 | 48 | 13 | 1 |
107812 | 46 | 17 | 28 | 2 |
87207 | 30 | 38 | 16 | 2 |
78161 | 25 | 74 | 8 | 2 |
123443 | 74 | 41 | 2 | 2 |
68794 | 21 | 1 | 3 | 1 |
Now, let us extract the roots from the FEATURES
column in the same way.
tmp2 = qsample.FEATURES.str.extract(r'ROOT:(?P<Root>[^|]*)'); tmp2
  | Root
---|---
46292 | jwb |
37295 | swy |
30207 | Ayy |
94595 | $yA |
51228 | qll |
107812 | Awl |
87207 | flH |
78161 | *rr |
123443 | jrm |
68794 | Hsb |
Similarly, I can extract the Lemmas as well. Note that Lemmas are actual words, whereas Roots are not, so at times Lemmas can be more informative.
tmp3 = qsample.FEATURES.str.extract(r'LEM:(?P<Lemma>[^|]*)'); tmp3
  | Lemma
---|---
46292 | >ujiybat |
37295 | sawaA^' |
30207 | 'aAyap |
94595 | $aA^'a |
51228 | qaliyl |
107812 | >aw~al |
87207 | mufoliHuwn |
78161 | *ur~iy~a`t |
123443 | mujorim |
68794 | HisaAb |
Finally, all that is left is to concatenate the original sample qsample
with the three splits tmp1, tmp2, tmp3
, as follows. The axis=1
option means the concatenation runs along columns (not rows).
pd.concat([tmp1, qsample, tmp2,tmp3], axis=1)
  | sura | aya | word | w_seg | LOCATION | FORM | TAG | FEATURES | Root | Lemma
---|---|---|---|---|---|---|---|---|---|---
46292 | 10 | 89 | 3 | 1 | (10:89:3:1) | >ujiybat | V | STEM|POS:V|PERF|PASS|(IV)|LEM:>ujiybat|ROOT:jw... | jwb | >ujiybat |
37295 | 7 | 193 | 7 | 1 | (7:193:7:1) | sawaA^'N | N | STEM|POS:N|LEM:sawaA^'|ROOT:swy|M|INDEF|NOM | swy | sawaA^' |
30207 | 6 | 118 | 9 | 2 | (6:118:9:2) | _#aAya`ti | N | STEM|POS:N|LEM:'aAyap|ROOT:Ayy|FP|GEN | Ayy | 'aAyap |
94595 | 36 | 43 | 2 | 1 | (36:43:2:1) | n~a$a>o | V | STEM|POS:V|IMPF|LEM:$aA^'a|ROOT:$yA|1P|MOOD:JUS | $yA | $aA^'a |
51228 | 12 | 48 | 13 | 1 | (12:48:13:1) | qaliylFA | N | STEM|POS:N|LEM:qaliyl|ROOT:qll|MS|INDEF|ACC | qll | qaliyl |
107812 | 46 | 17 | 28 | 2 | (46:17:28:2) | >aw~aliyna | N | STEM|POS:N|LEM:>aw~al|ROOT:Awl|MP|GEN | Awl | >aw~al |
87207 | 30 | 38 | 16 | 2 | (30:38:16:2) | mufoliHuwna | N | STEM|POS:N|ACT|PCPL|(IV)|LEM:mufoliHuwn|ROOT:f... | flH | mufoliHuwn |
78161 | 25 | 74 | 8 | 2 | (25:74:8:2) | *ur~iy~a`ti | N | STEM|POS:N|LEM:*ur~iy~a`t|ROOT:*rr|FP|GEN | *rr | *ur~iy~a`t |
123443 | 74 | 41 | 2 | 2 | (74:41:2:2) | mujorimiyna | N | STEM|POS:N|ACT|PCPL|(IV)|LEM:mujorim|ROOT:jrm|... | jrm | mujorim |
68794 | 21 | 1 | 3 | 1 | (21:1:3:1) | HisaAbu | N | STEM|POS:N|VN|(III)|LEM:HisaAb|ROOT:Hsb|M|NOM | Hsb | HisaAb |
Now that we have run the experiment successfully on the sample, let us repeat it on the full dataframe qdforiginal:
tmp1 = qdforiginal.LOCATION.str.extract(r'(?P<sura>\d*):(?P<aya>\d*):(?P<word>\d*):(?P<w_seg>\d*)')
tmp2 = qdforiginal.FEATURES.str.extract(r'ROOT:(?P<Root>[^|]*)')
tmp3 = qdforiginal.FEATURES.str.extract(r'LEM:(?P<Lemma>[^|]*)')
df_qruan = pd.concat([tmp1, qdforiginal, tmp2,tmp3], axis=1)
To confirm the shape of the new dataframe df_qruan
, I can use the shape
attribute; I can also display some random rows.
df_qruan.shape
(128219, 10)
df_qruan.sample(5)
  | sura | aya | word | w_seg | LOCATION | FORM | TAG | FEATURES | Root | Lemma
---|---|---|---|---|---|---|---|---|---|---
33209 | 7 | 50 | 19 | 1 | (7:50:19:1) | EalaY | P | STEM|POS:P|LEM:EalaY` | NaN | EalaY` |
26277 | 5 | 106 | 40 | 1 | (5:106:40:1) | wa | REM | PREFIX|w:REM+ | NaN | NaN |
61047 | 17 | 55 | 4 | 1 | (17:55:4:1) | fiY | P | STEM|POS:P|LEM:fiY | NaN | fiY |
125339 | 81 | 14 | 1 | 1 | (81:14:1:1) | Ealimato | V | STEM|POS:V|PERF|LEM:Ealima|ROOT:Elm|3FS | Elm | Ealima |
120502 | 67 | 19 | 1 | 1 | (67:19:1:1) | >a | INTG | PREFIX|A:INTG+ | NaN | NaN |
Our newly introduced columns may have extra spaces, which we can strip off using the string strip()
method as follows.
df_qruan.Root = df_qruan.Root.str.strip()
df_qruan.Lemma = df_qruan.Lemma.str.strip()
It is a good idea to save this version into a csv
file. The index=False
option avoids unnecessarily including an extra index column in the output file.
df_qruan.to_csv('quran-morphology-v2.csv', index=False)
join with Meccan/Medinan file
It would be very useful to augment our file with a new column that tells us whether a sura is Meccan or Medinan. This will later allow us to answer questions like: what are the unique root words in the Quran that appear only in Meccan suras?
To do this, I am referring to a table-of-contents page I created some time back using Angular
here
. My idea is to go to that page, select the table with the mouse, copy it to the clipboard, and then read the clipboard into a dataframe called qtoc
as follows.
qtoc=pd.read_clipboard()
qtoc.head()
  | No. | Name Arabic | Name | English Meaning | No of verses | Place | Chronology
---|---|---|---|---|---|---|---
0 | 1 | الفاتحة | Al-Fatiha | The Opening | 7 | Meccan | 5 |
1 | 2 | البقرة | Al-Baqara | The Cow | 286 | Medinan | 87 |
2 | 3 | آل عمران | Al Imran | The House of Joachim | 200 | Medinan | 89 |
3 | 4 | النساء | An-Nisa' | Women | 176 | Medinan | 92 |
4 | 5 | المائدة | Al-Ma'ida | The Table Spread | 120 | Medinan | 112 |
Again, let me save this dataframe locally.
qtoc.to_csv('toc.csv', index=False)
I will now use the merge
function to merge our original dataframe df_qruan
with qtoc
on the sura number (the sura
column in the left df_qruan
dataframe and the No.
column in the right qtoc
dataframe). The left
join is the one that makes sense here. Note that sura
was produced by str.extract and may still be of string type; if so, convert it first with df_qruan.sura = df_qruan.sura.astype(int)
so it matches the integer No.
column. The merged dataframe is saved as quran
.
quran = df_qruan.merge(qtoc.loc[:,['No.', 'Place']], how='left', left_on='sura', right_on='No.')
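To see what this left merge does, here is a minimal sketch on toy dataframes (the data is made up for illustration):

```python
import pandas as pd

# Toy stand-ins: word rows keyed by sura, and a tiny table of contents
words = pd.DataFrame({'sura': [1, 1, 2], 'Root': ['smw', 'rHm', 'ktb']})
toc = pd.DataFrame({'No.': [1, 2], 'Place': ['Meccan', 'Medinan']})

# Left join: every row of `words` is kept, and gains its sura's Place
merged = words.merge(toc.loc[:, ['No.', 'Place']],
                     how='left', left_on='sura', right_on='No.')
print(merged.Place.tolist())  # ['Meccan', 'Meccan', 'Medinan']
```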
I can display some useful information through the info()
method.
quran.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 128219 entries, 0 to 128218
Data columns (total 12 columns):
sura 128219 non-null int64
aya 128219 non-null object
word 128219 non-null object
w_seg 128219 non-null object
LOCATION 128219 non-null object
FORM 128011 non-null object
TAG 128219 non-null object
FEATURES 128219 non-null object
Root 49968 non-null object
Lemma 74608 non-null object
No. 128219 non-null int64
Place 128219 non-null object
dtypes: int64(2), object(10)
memory usage: 12.7+ MB
I notice that I no longer need the LOCATION
and No.
columns, as they are now redundant, so let us just drop them.
quran.drop(columns=['LOCATION','No.'], inplace=True)
As usual, here is a local copy of the final file after doing all these setup steps.
quran.to_csv('quran-morphology-final.csv', index=False)
converting Buckwalter to Arabic
Our file contains Quranic words and roots in Buckwalter transliteration, and I wanted a handy function to convert that into Arabic script. Here is how we do it.
First, referencing this site, I construct the following dictionary mapping the Unicode Arabic characters to their Buckwalter symbols. I will call this dictionary abjad
.
abjad = {u"\u0627":'A',
u"\u0628":'b', u"\u062A":'t', u"\u062B":'v', u"\u062C":'j',
u"\u062D":'H', u"\u062E":'x', u"\u062F":'d', u"\u0630":'*', u"\u0631":'r',
u"\u0632":'z', u"\u0633":'s', u"\u0634":'$', u"\u0635":'S', u"\u0636":'D',
u"\u0637":'T', u"\u0638":'Z', u"\u0639":'E', u"\u063A":'g', u"\u0641":'f',
u"\u0642":'q', u"\u0643":'k', u"\u0644":'l', u"\u0645":'m', u"\u0646":'n',
u"\u0647":'h', u"\u0648":'w', u"\u0649":'Y', u"\u064A":'y'}
abjad[' ']=' '
abjad[u"\u0621"] = '\''
abjad[u"\u0623"] = '>'
abjad[u"\u0625"] = '<'
abjad[u"\u0624"] = '&'
abjad[u"\u0626"] = '}'
#abjad[u"\u0655"] = '\'' # Hamza below
abjad[u"\u0622"] = '|'
abjad[u"\u064E"] = 'a'
abjad[u"\u064F"] = 'u'
abjad[u"\u0650"] = 'i'
abjad[u"\u0651"] = '~'
abjad[u"\u0652"] = 'o'
abjad[u"\u064B"] = 'F'
abjad[u"\u064C"] = 'N'
abjad[u"\u064D"] = 'K'
abjad[u"\u0640"] = '_'
abjad[u"\u0670"] = '`'
abjad[u"\u0629"] = 'p'
abjad[u"\u0653"] = '^'
abjad[u"\u0654"] = '#'
abjad[u"\u0671"] = '{'
abjad[u"\u06DC"] = ':'
abjad[u"\u06DF"] = '@'
abjad[u"\u0653"] = '^'
abjad[u"\u06E0"] = '"'
abjad[u"\u06E2"] = '['
abjad[u"\u06E3"] = ';'
abjad[u"\u06E5"] = ','
abjad[u"\u06E6"] = '.'
abjad[u"\u06E8"] = '!'
abjad[u"\u06EA"] = '-'
abjad[u"\u06EB"] = '+'
abjad[u"\u06EC"] = '%'
abjad[u"\u06ED"] = ']'
Let us also construct the reverse dictionary, called alphabet
, which maps the Buckwalter symbols back to Unicode and hence can render Arabic words.
# Create the reverse mapping: Buckwalter symbol -> Unicode character
alphabet = {v: k for k, v in abjad.items()}
Using these two dictionaries, we can always convert a string from one form to the other using the following two handy functions.
def arabic_to_buc(ara):
    # Map each Arabic character to its Buckwalter symbol
    return ''.join(abjad[ch] for ch in ara)

def buck_to_arabic(buc):
    # Map each Buckwalter symbol back to its Arabic character
    return ''.join(alphabet[ch] for ch in buc)
Here is a small test.
buck_to_arabic('EalaY`')
'عَلَىٰ'
arabic_to_buc('الحمد لله')
'AlHmd llh'
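Since alphabet is the exact inverse of abjad, the two functions round-trip cleanly. Here is a self-contained sanity check, re-declaring only a small subset of the mapping for illustration:

```python
# A small subset of the abjad mapping, just enough to spell EalaY`
abjad = {u"\u0639": 'E', u"\u064E": 'a', u"\u0644": 'l',
         u"\u0649": 'Y', u"\u0670": '`'}
alphabet = {v: k for k, v in abjad.items()}

def arabic_to_buc(ara):
    return ''.join(abjad[ch] for ch in ara)

def buck_to_arabic(buc):
    return ''.join(alphabet[ch] for ch in buc)

word = 'EalaY`'
# Converting to Arabic and back recovers the original Buckwalter string
assert arabic_to_buc(buck_to_arabic(word)) == word
```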
counting roots
Now it is time to get to the core of our query: what are the unique root words that appear in Meccan suras but not in Medinan suras?
As we saw before, we can (1) filter a dataframe with a logical check like quran.Place == 'Meccan'
to (2) get all the Meccan words, (3) select only the Root
column, (4) run the unique()
method to get an array of unique roots, (5) convert that array to a list with tolist()
, and finally (6) wrap the whole thing in set()
to obtain the set of unique Meccan root words, called k
here. Note how, through chaining, I could perform six operations in a single line.
k = set(quran[quran.Place == 'Meccan'].Root.unique().tolist())
With the same logic, we produce the set of unique Medinan roots, called d
.
d = set(quran[quran.Place == 'Medinan'].Root.unique().tolist())
With these we can now remove the Meccan roots that also appear in the Medinan set, using the following set operations. We find that there are 547 such Meccan-only root words, 198 Medinan-only root words, and 898 roots that appear in both.
makki_words = k-d; len(makki_words)
547
madani_words = d - k; len(madani_words)
198
both = k & d
len(both)
898
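These three groups always partition the union of the two sets, which we can illustrate on toy data (the root strings below are made up):

```python
# Made-up root sets standing in for the real Meccan/Medinan sets
k = {'ktb', 'qwl', 'nfv', 'wqb'}  # pretend Meccan roots
d = {'ktb', 'qwl', 'jhd'}         # pretend Medinan roots

# Meccan-only, Medinan-only, and shared roots partition the union
assert (k - d) | (d - k) | (k & d) == k | d
# So the Meccan count splits into Meccan-only plus shared
assert len(k) == len(k - d) + len(k & d)
```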
We now have all the nuts and bolts at hand to define two useful functions.
Our first function is sura_words
. It takes as input a list of sura numbers (for example, [113,114]
means suras 113 and 114). It also takes the kind of unique words we want for that list of suras: W
(the default) for the raw word list, R
for the Root list, and L
for the Lemma list. Note how we use the isin()
method to filter the dataframe to only the suras we provide, and the dropna()
method to drop null
values. Finally, note how we return the Arabic form of the final results using the buck_to_arabic()
function we defined earlier.
# function to return words given a list of sura
def sura_words(s_list, kind='W'):
    if kind == 'R':
        result = quran[quran.sura.isin(s_list)].Root.dropna().unique().tolist()
    elif kind == 'L':
        result = quran[quran.sura.isin(s_list)].Lemma.dropna().unique().tolist()
    else:
        # dropna() here too: FORM has a few null entries that would break buck_to_arabic
        result = quran[quran.sura.isin(s_list)].FORM.dropna().unique().tolist()
    return [buck_to_arabic(x) for x in result]
Here is a test on the Lemma
words of sura No. 111.
sura_words([111],'L')
['تَبَّ',
'يَد',
'أَبٌ',
'لَهَب',
'مَا',
'أَغْنَىٰ',
'عَن',
'مَال',
'كَسَبَ',
'يَصْلَى',
'نَار',
'ذُو',
'ٱمْرَأَت',
'حَمَّالَة',
'حَطَب',
'فِى',
'جِيد',
'حَبْل',
'مِن',
'مَّسَد']
The above function has many uses. Among them, you may want to grow your Quranic vocabulary gradually by memorizing the roots of one sura at a time; this function conveniently gives you the unique list of roots (or lemmas, or just words).
With a small variation, and exploiting set operations, we can define another function called unique_sura_words
that again takes a list of suras and returns the roots (or lemmas, or raw words) that appear only in those suras. Note the ~
operator, which negates a condition: ~quran.sura.isin([113,114])
means all suras except 113 and 114.
# function to return words that appear only in the given list of sura
def unique_sura_words(s_list, kind='W'):
    # Pick the column for the requested kind, defaulting to the raw word form
    col = {'R': 'Root', 'L': 'Lemma'}.get(kind, 'FORM')
    first = set(quran[quran.sura.isin(s_list)][col].dropna().unique())
    second = set(quran[~quran.sura.isin(s_list)][col].dropna().unique())
    return [buck_to_arabic(x) for x in first - second]
Using this function, we learn that sura 113 has these two root words, found nowhere else in the Quran.
unique_sura_words([113],'R')
['نفث', 'وقب']
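The ~ negation used in unique_sura_words can be illustrated on a toy dataframe (made-up data):

```python
import pandas as pd

df = pd.DataFrame({'sura': [1, 113, 114], 'Root': ['smw', 'nfv', 'xls']})

inside = df[df.sura.isin([113, 114])]    # rows whose sura is in the list
outside = df[~df.sura.isin([113, 114])]  # everything else

print(inside.Root.tolist())   # ['nfv', 'xls']
print(outside.Root.tolist())  # ['smw']
```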
From here, one can extend this utility to create a web app using a framework like Flask
.
I am leaving a Jupyter notebook for you to try things out yourself.