A Quranic chapter is a world by itself. Quranic chapters are called surahs in Arabic, and one of the two meanings of this word is a boundary wall: each surah is like a wall that keeps its own meaning and theme inside.
This gave me the idea of writing a small Python script that takes a surah number as input and returns the words that appear only in that surah.
Let’s get started.
First, I downloaded the simple text from the Tanzil project (http://tanzil.net/download/) as a txt file. The file has no header, and each line has the format:
surah no|aya no| ayah text
I will manipulate this data using a pandas DataFrame, so let us load the pandas module.
import pandas as pd
I am using read_csv to read the text file, with | as the separator to identify the three fields.
quran = pd.read_csv('quran-simple-clean.txt', sep="|", header=None)
Check the first few lines to make sure everything looks OK.
quran.head()
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 1.0 | بسم الله الرحمن الرحيم |
| 1 | 1 | 2.0 | الحمد لله رب العالمين |
| 2 | 1 | 3.0 | الرحمن الرحيم |
| 3 | 1 | 4.0 | مالك يوم الدين |
| 4 | 1 | 5.0 | إياك نعبد وإياك نستعين |
Also, check the end of the file.
quran.tail()
| | 0 | 1 | 2 |
|---|---|---|---|
| 6259 | # derived from or containing substantial po... | NaN | NaN |
| 6260 | # | NaN | NaN |
| 6261 | # Please check updates at: http://tanzil.net/... | NaN | NaN |
| 6262 | # | NaN | NaN |
| 6263 | #=============================================... | NaN | NaN |
Notice that there is some trailing text, running from index 6236 to the end of the file, so we need to drop those lines.
quran.drop(quran.index[6236:], inplace=True)
The above code drops those lines and updates the dataframe in place (hence inplace=True).
quran.tail()
| | 0 | 1 | 2 |
|---|---|---|---|
| 6231 | 114 | 2.0 | ملك الناس |
| 6232 | 114 | 3.0 | إله الناس |
| 6233 | 114 | 4.0 | من شر الوسواس الخناس |
| 6234 | 114 | 5.0 | الذي يوسوس في صدور الناس |
| 6235 | 114 | 6.0 | من الجنة والناس |
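As an alternative to dropping rows by position, read_csv can skip such footer lines while reading. The sketch below assumes every footer line starts with #, as the tail output above suggests. The sample rows, the footer text, and the name sample_df are stand-ins for illustration, not the real file.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the downloaded file (illustrative rows only).
sample = io.StringIO(
    "1|1|بسم الله الرحمن الرحيم\n"
    "1|2|الحمد لله رب العالمين\n"
    "\n"
    "# a footer comment line (placeholder text)\n"
    "#\n"
)

# Assumption: footer lines start with '#', so comment="#" skips them.
# Blank lines are skipped by default (skip_blank_lines=True).
sample_df = pd.read_csv(
    sample,
    sep="|",
    header=None,
    names=["sura_no", "aya_no", "text"],
    comment="#",
)

print(sample_df)
print(sample_df.dtypes)
```

With the footer gone at read time there is no hard-coded row index to maintain, and in this sample the two numeric columns are parsed as integers straight away.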
The original file had no header, so we need to give our dataframe some meaningful column names.
quran.columns = ['sura_no', 'aya_no', 'text']
Let us check the data types.
quran.dtypes
sura_no object
aya_no float64
text object
dtype: object
Notice that sura_no is stored as object and aya_no as float64, so we convert both with pd.to_numeric() and downcast them to integers.
quran['sura_no'] = pd.to_numeric(quran['sura_no'], downcast = 'integer')
quran['aya_no'] = pd.to_numeric(quran['aya_no'], downcast = 'integer')
quran.dtypes
sura_no int8
aya_no int16
text object
dtype: object
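To see why the two columns end up with different integer widths, here is a small sketch of how downcast='integer' picks the smallest integer type that holds every value. The series below are made-up values shaped like the two columns, not the real data.

```python
import pandas as pd

# Made-up values shaped like the two columns before conversion.
sura_raw = pd.Series(["1", "2", "114"], dtype=object)  # numbers stored as strings
aya_raw = pd.Series([1.0, 2.0, 286.0])                 # whole numbers stored as floats

sura_int = pd.to_numeric(sura_raw, downcast="integer")
aya_int = pd.to_numeric(aya_raw, downcast="integer")

print(sura_int.dtype)  # int8: every value fits in -128..127
print(aya_int.dtype)   # int16: 286 does not fit in int8
```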
A final check again.
quran.head()
| | sura_no | aya_no | text |
|---|---|---|---|
| 0 | 1 | 1 | بسم الله الرحمن الرحيم |
| 1 | 1 | 2 | الحمد لله رب العالمين |
| 2 | 1 | 3 | الرحمن الرحيم |
| 3 | 1 | 4 | مالك يوم الدين |
| 4 | 1 | 5 | إياك نعبد وإياك نستعين |
Now, let us get to the core of our topic.
Here is a function that takes a surah number and returns a set of the words in that surah (being a set, it contains no duplicates).
# a function to find unique words
def unique_words(sura, neg=0):
    if neg == 0:
        selection = quran[quran['sura_no'] == sura].text.str.split().tolist()
    else:
        selection = quran[quran['sura_no'] != sura].text.str.split().tolist()
    flat_list = [item for aya in selection for item in aya]
    return set(flat_list)
Notice that the optional parameter neg makes the function return the unique words of all the other surahs instead. So, unique_words(1) returns all unique words in surah no. 1 (i.e., Fatihah), and unique_words(1,1) returns all unique words in the rest of the Quran, outside Fatihah. A word that occurs both in Fatihah and elsewhere appears in both sets.
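To make the two modes concrete, here is a minimal sketch on a toy dataframe. The tokens are made-up placeholders, and unique_words_in is a hypothetical variant of unique_words that takes the dataframe as an argument so the example runs on its own.

```python
import pandas as pd

# Toy stand-in for the quran dataframe (placeholder tokens, not real text).
toy = pd.DataFrame({
    "sura_no": [1, 1, 2],
    "text": ["a b a", "b c", "c d"],
})

def unique_words_in(df, sura, neg=0):
    """Same logic as unique_words, with the dataframe passed in explicitly."""
    if neg == 0:
        rows = df[df["sura_no"] == sura]
    else:
        rows = df[df["sura_no"] != sura]
    # str.split() with no arguments splits each ayah on whitespace.
    selection = rows["text"].str.split().tolist()
    return {word for aya in selection for word in aya}

inside = unique_words_in(toy, 1)      # words of sura 1, duplicates removed
outside = unique_words_in(toy, 1, 1)  # words of every other sura
print(sorted(inside), sorted(outside))
```

Here the token c lands in both sets, because it occurs in sura 1 and in sura 2.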
Now, using the above function, we can easily define a function that takes a surah number and returns the words found only in that surah and nowhere else in the Quran. This is done with a set difference, as below. Note that the returned list is sorted.
def unique(sura):
    return sorted(unique_words(sura) - unique_words(sura, 1))
Let us put it to the test. Here are the unique words in surah Fatihah.
unique(1)
['إياك', 'المغضوب', 'اهدنا', 'نستعين', 'وإياك']
unique(111)
['الحطب', 'جيدها', 'حمالة', 'سيصلى', 'لهب', 'مسد', 'يدا']
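Calling unique() once per surah rescans the whole text each time. If the lists for all surahs are wanted, one possible approach is a single pass that counts in how many surahs each word occurs. This is a sketch on a toy dataframe, not the method used above; the tokens are made up and unique_per_sura is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-in for the quran dataframe (placeholder tokens, not real text).
toy = pd.DataFrame({
    "sura_no": [1, 1, 2, 2, 3],
    "text": ["a b c", "a d", "a b e", "e f", "c g"],
})

def unique_per_sura(df):
    """Map each sura number to the sorted words that occur in no other sura."""
    # One row per (sura_no, word) pair, with repeats inside a sura removed.
    words = (
        df.assign(word=df["text"].str.split())
          .explode("word")[["sura_no", "word"]]
          .drop_duplicates()
          .reset_index(drop=True)
    )
    # A word is exclusive to a sura when it appears in exactly one sura.
    sura_count = words.groupby("word")["sura_no"].transform("nunique")
    exclusive = words[sura_count == 1]
    # Suras with no exclusive words are simply absent from the result.
    return exclusive.groupby("sura_no")["word"].apply(sorted).to_dict()

result = unique_per_sura(toy)
print(result)
```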
This function can be enhanced by working at the root-word level instead of on individual word forms, which would give a more concise list. Nevertheless, try it on short chapters and you will notice that the unique word list of a surah truly captures some of its key concepts and themes.
Also, here is the jupyter notebook for you to try.
I also have other posts that investigate this question using Linux commands or Python pandas.