Finding Unique Words in a Quranic Chapter

A Quranic chapter is a world by itself. In Arabic a chapter is called a surah, and one of the two meanings of this word is a boundary wall: each surah is like a wall that keeps its own meaning and theme inside.

This gave me the idea to write a small Python script that takes a surah number as input and returns the words that appear only in that surah.

Let’s get started.

First, I downloaded the simple text from the Tanzil project (http://tanzil.net/download/) as a txt file. The file has no header row, and each line has the format:

surah no|ayah no|ayah text

I will manipulate this data using a pandas DataFrame, so let us load the pandas module.

import pandas as pd

I am using read_csv to read the text file, with | as the separator so that the three fields are split into separate columns.

quran = pd.read_csv('quran-simple-clean.txt', sep="|", header=None)

Check the first few lines to make sure everything looks OK.

quran.head()

	0	1	2
0	1	1.0	بسم الله الرحمن الرحيم
1	1	2.0	الحمد لله رب العالمين
2	1	3.0	الرحمن الرحيم
3	1	4.0	مالك يوم الدين
4	1	5.0	إياك نعبد وإياك نستعين

Also check the end of the file.

quran.tail()

	0	1	2
6259	# derived from or containing substantial po...	NaN	NaN
6260	#	NaN	NaN
6261	# Please check updates at: http://tanzil.net/...	NaN	NaN
6262	#	NaN	NaN
6263	#=============================================...	NaN	NaN

We notice some trailing text at the end of the file. The ayahs occupy indices 0 to 6235, so everything from index 6236 to the end is comment text that we need to drop.

quran.drop(quran.index[6236:], inplace=True)

The above code drops those lines and, because inplace=True, modifies the dataframe directly instead of returning a new one.
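
The hard-coded index 6236 works for this particular file. As a hedged alternative, the footer rows can be recognised by their missing values instead. The sketch below uses a tiny in-memory sample (the footer text and the variable names `sample` and `df` are made up for illustration) rather than the real Tanzil file.

```python
import io
import pandas as pd

# Tiny stand-in for the Tanzil file: three ayah lines followed by footer comments.
# The footer text here is invented for the example.
sample = (
    "1|1|بسم الله الرحمن الرحيم\n"
    "1|2|الحمد لله رب العالمين\n"
    "114|6|من الجنة والناس\n"
    "\n"
    "# footer line one\n"
    "#\n"
)

df = pd.read_csv(io.StringIO(sample), sep="|", header=None)

# Footer lines contain no '|' separators, so columns 1 and 2 are NaN for them.
df = df.dropna(subset=[1, 2]).reset_index(drop=True)

print(len(df))  # number of ayah rows that survived
```

This keeps only rows that have both an ayah number and ayah text, so it does not depend on knowing where the footer starts.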

quran.tail()

	0	1	2
6231	114	2.0	ملك الناس
6232	114	3.0	إله الناس
6233	114	4.0	من شر الوسواس الخناس
6234	114	5.0	الذي يوسوس في صدور الناس
6235	114	6.0	من الجنة والناس

The original file had no header row, so we need to give our dataframe some meaningful column names.

quran.columns = ['sura_no', 'aya_no', 'text']

Let us check the data types.

quran.dtypes

sura_no     object
aya_no     float64
text        object
dtype: object

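sura_no came out as object because the trailing comment lines put text into the first column, and aya_no came out as float64 because those same lines left it as NaN. We convert both with pd.to_numeric(), downcasting to the smallest integer type that fits.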

quran['sura_no'] = pd.to_numeric(quran['sura_no'], downcast = 'integer')

quran['aya_no'] = pd.to_numeric(quran['aya_no'], downcast = 'integer')

quran.dtypes

sura_no      int8
aya_no      int16
text       object
dtype: object
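
To see why the two columns end up with different integer widths, here is a minimal sketch on stand-alone Series (the names `sura_raw` and `aya_raw` are illustrative and not part of the code above).

```python
import pandas as pd

# Strings, as the surah column looks when it is read alongside footer text.
sura_raw = pd.Series(["1", "2", "114"], dtype=object)
# Floats, as the ayah column looks when it is read alongside NaN rows.
aya_raw = pd.Series([1.0, 7.0, 286.0])

# downcast="integer" asks pandas for the smallest signed integer dtype that fits.
sura = pd.to_numeric(sura_raw, downcast="integer")
aya = pd.to_numeric(aya_raw, downcast="integer")

print(sura.dtype, aya.dtype)
```

The largest surah number, 114, fits in an 8-bit signed integer (maximum 127), while the largest ayah number, 286, does not, so that column needs 16 bits.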

One final check.

quran.head()

	sura_no	aya_no	text
0	1	1	بسم الله الرحمن الرحيم
1	1	2	الحمد لله رب العالمين
2	1	3	الرحمن الرحيم
3	1	4	مالك يوم الدين
4	1	5	إياك نعبد وإياك نستعين

Now, let us get to the core of our topic.

Here is a function that takes a surah number and returns a set of the words in that surah. Because the result is a set, duplicates are removed automatically.

# Return the set of distinct words in one surah (neg=0),
# or in all the other surahs combined (neg=1).
def unique_words(sura, neg=0):
    if neg == 0:
        selection = quran[quran['sura_no'] == sura].text.str.split().tolist()
    else:
        selection = quran[quran['sura_no'] != sura].text.str.split().tolist()
    # selection is a list of word lists, one per ayah; flatten it
    flat_list = [item for aya in selection for item in aya]
    return set(flat_list)

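Notice that the optional parameter neg flips the selection: when it is non-zero, the function returns the unique words of all the other surahs combined, that is, the rest of the Quran with that surah left out.

So, unique_words(1) means all unique words in Surah no. 1 (i.e., Fatihah), and unique_words(1,1) means all unique words in the other 113 surahs. The two sets can overlap, since a word in Fatihah may also occur elsewhere.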
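
To make the meaning of neg concrete, here is a small self-contained sketch on toy data. The placeholder words, the `toy` dataframe and the helper `words_in` are assumptions for illustration; `words_in` mirrors the logic of unique_words but takes the dataframe as an argument.

```python
import pandas as pd

# Toy data with placeholder words, not Quranic text.
toy = pd.DataFrame({
    "sura_no": [1, 1, 2, 3],
    "text": ["alpha beta", "beta gamma", "beta delta", "delta epsilon"],
})

def words_in(df, sura, neg=0):
    """Same idea as unique_words, with the dataframe passed in explicitly."""
    mask = df["sura_no"] != sura if neg else df["sura_no"] == sura
    return {word for aya in df.loc[mask, "text"].str.split() for word in aya}

inside = words_in(toy, 1)      # words used in chapter 1
outside = words_in(toy, 1, 1)  # words used anywhere outside chapter 1

print(sorted(inside))
print(sorted(outside))
print(sorted(inside & outside))  # words that appear in both sets
print(sorted(inside - outside))  # words found only in chapter 1
```

Here "beta" lands in both sets, which is why a set difference is still needed to isolate the words that belong to one chapter alone.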

Now, using the above function, we can easily define a function that takes a surah number and returns the words found only in that surah and nowhere else in the Quran. This is a set difference: the words of the surah minus the words of all the other surahs. Note that the result is returned as a sorted list.

def unique(sura):
    # set difference: words in this surah minus words found anywhere else
    return sorted(unique_words(sura) - unique_words(sura, 1))

Let us put it to the test. Here are the unique words in surah Fatihah.

unique(1)

['إياك', 'المغضوب', 'اهدنا', 'نستعين', 'وإياك']

unique(111)

['الحطب', 'جيدها', 'حمالة', 'سيصلى', 'لهب', 'مسد', 'يدا']
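
Calling unique() once per surah rescans the whole dataframe every time. As a hedged alternative, the sketch below finds the single-chapter words for every chapter in one pass by counting how many chapters each word appears in. It runs on toy placeholder data, and the names `toy`, `words`, `sura_count` and `only_here` are illustrative, not part of the code above.

```python
import pandas as pd

# Toy data with placeholder words, not Quranic text.
toy = pd.DataFrame({
    "sura_no": [1, 1, 2, 3],
    "text": ["alpha beta", "beta gamma", "beta delta", "delta epsilon"],
})

# One row per (chapter, word) occurrence.
words = toy.assign(word=toy["text"].str.split()).explode("word")

# For each word, count the number of distinct chapters it occurs in.
sura_count = words.groupby("word")["sura_no"].nunique()
single = sura_count[sura_count == 1].index

# Group the single-chapter words by the chapter they belong to.
only_here = (
    words[words["word"].isin(single)]
    .groupby("sura_no")["word"]
    .apply(lambda s: sorted(set(s)))
)

print(only_here.to_dict())
```

Chapters with no word of their own, such as chapter 2 in the toy data, simply do not appear in the result.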


This function can be enhanced by working at the root-word level instead of on individual word forms, which would give a more concise list. Nevertheless, try out small chapters and you will notice that the unique word list of a surah truly captures some of the key concepts and themes of that surah.
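
As a rough sketch of what the root-level idea could look like, assume we had a lookup from each word form to its root. The tiny `ROOTS` dictionary and the helpers `to_roots` and `unique_roots` below are hypothetical and hand-written for illustration; a real version would need a proper morphological resource, which this post does not cover.

```python
# Hypothetical hand-written lookup from word form to root (illustration only).
ROOTS = {
    "الرحمن": "رحم",
    "الرحيم": "رحم",
    "رحمة": "رحم",
    "الحمد": "حمد",
}

def to_roots(words, roots=ROOTS):
    """Map each word to its root, keeping the word itself when no root is known."""
    return {roots.get(word, word) for word in words}

def unique_roots(words_here, words_elsewhere, roots=ROOTS):
    """Roots that occur in words_here but not in words_elsewhere, sorted."""
    return sorted(to_roots(words_here, roots) - to_roots(words_elsewhere, roots))

here = {"الرحمن", "الحمد"}
elsewhere = {"رحمة"}

print(sorted(here - elsewhere))       # word-level difference
print(unique_roots(here, elsewhere))  # root-level difference
```

At the word level both words count as unique, but at the root level only one root remains, because the other root also occurs elsewhere in a different form.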

Also, here is the Jupyter notebook for you to try.

I also have other posts that investigate this question using Linux commands or Python pandas.