Linux command line is a powerful and free tool. Using just few commands one can productively do powerful text analysis. In this quick tutorial I will demonstrate few of such Linux commands.
Any data science task starts with a question. So, what is our question in this exercise?
The Question is: what are the top root words in the Quran?
For this, we are going to use the Quranic Arabic Corpus (QAC) which contains morphological information (among them root word) for each word of the Qruan.
Without further ado, let us get started.
curl
curl -s
allows us to visit a URL and display its contents. -s
option allows that to be in silent mode. I have downloaded the original file for QAC and kept a version in my website. Let us use curl
to visit my website and download that file.
!curl -s http://textminingthequran.com/data/quranic-corpus-morphology-0.4.txt > quran_tags.txt
Above, curl
brings the file and displays it into screen, but I wanted to send those lines to a file instead by using >
.
We can get some initial information about this file using word count command wc
!wc -l quran_tags.txt
128276 quran_tags.txt
This file contains 128,276 lines. The -l
option of wc
indicates number of lines.
Let me display first few lines of this file to see what it contains. This can be done by using first cat
command to list the entire file but instead of displaying this big file, I will pipe
the results to another command called head
to display only first 10 lines. This piping business is a very powerful UNIX tool and we used |
to do that.
! cat quran_tags.txt | head
# PLEASE DO NOT REMOVE OR CHANGE THIS COPYRIGHT BLOCK
#====================================================================
#
# Quranic Arabic Corpus (morphology, version 0.4)
# Copyright (C) 2011 Kais Dukes
# License: GNU General Public License
#
# The Quranic Arabic Corpus includes syntactic and morphological
# annotation of the Quran, and builds on the verified Arabic text
# distributed by the Tanzil project.
We notice that the file contains some copyright block at the beginning and the actual Quranic annotation starts from line number 57. To prove this, I will take the first 60 lines and then pipe
the results to tail -5
to show me the last 5 lines.
! cat quran_tags.txt | head -60 | tail -5
LOCATION FORM TAG FEATURES
(1:1:1:1) bi P PREFIX|bi+
(1:1:1:2) somi N STEM|POS:N|LEM:{som|ROOT:smw|M|GEN
(1:1:2:1) {ll~ahi PN STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN
So, we know that our file has 128,276 lines of code and 57 lines from top has some copyright notes. Using tail
I can get a working copy of the file by chopping these few lines from top and display only (128,276 - 57 = 128,219) lines from button and save it into a handy file called qt
(short for Quranic tags).
! cat quran_tags.txt | tail -128219 > qt
Let us display the top of the new file to make sure things went as intended.
!cat qt | head
(1:1:1:1) bi P PREFIX|bi+
(1:1:1:2) somi N STEM|POS:N|LEM:{som|ROOT:smw|M|GEN
(1:1:2:1) {ll~ahi PN STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN
(1:1:3:1) {l DET PREFIX|Al+
(1:1:3:2) r~aHoma`ni ADJ STEM|POS:ADJ|LEM:r~aHoma`n|ROOT:rHm|MS|GEN
(1:1:4:1) {l DET PREFIX|Al+
(1:1:4:2) r~aHiymi ADJ STEM|POS:ADJ|LEM:r~aHiym|ROOT:rHm|MS|GEN
(1:2:1:1) {lo DET PREFIX|Al+
(1:2:1:2) Hamodu N STEM|POS:N|LEM:Hamod|ROOT:Hmd|M|NOM
(1:2:2:1) li P PREFIX|l:P+
By looking into this structure, here is a brief on what each column means.
first column (1:1:1:1) (Seq No.:sura no.:verse no.:word no. (within that verse):segment no. (within that word))
Second column This is word form in Buckwalter transliteration as documented in Kais Dukes work here.
Third Column The part-of-speech tagging of this word, see here for a listing of these tags.
Fourth column contains a number of morphological features seperated by |
. The one concerning us in this exercise is the feature prefixed by ROOT:
. See this documentation for more details of Qruanic morphological features.
cut
and sed
cut
is cool. It is a handy swiss knife at the hand of a data scientist. It splits the line by a delimiter with option -d
and then pick whatever column you specify by option f
. In our case, the fields are separated by a tab for which I need to specify -d'\t\'
as the delimiter, this however will throw an error, because -d
accepts only one character as delimiter but \t
is two characters. (I found a solution to use -d$'\t'
but for some reason the Jupyter notebook is not allowing me to use it). If it were just comma, I would have used -d','
instead.
This gives me an opportunity to introduce another giant called sed
. Among other things, it will find and replace texts in our file. Let us use it to replace those tabs with comma, so later cut
can use cut -d','
without any error.
!cat qt | sed 's/\t/,/g'| tail
(114:5:3:1),fiY,P,STEM|POS:P|LEM:fiY
(114:5:4:1),Suduwri,N,STEM|POS:N|LEM:Sador|ROOT:Sdr|MP|GEN
(114:5:5:1),{l,DET,PREFIX|Al+
(114:5:5:2),n~aAsi,N,STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN
(114:6:1:1),mina,P,STEM|POS:P|LEM:min
(114:6:2:1),{lo,DET,PREFIX|Al+
(114:6:2:2),jin~api,N,STEM|POS:N|LEM:jin~ap|ROOT:jnn|F|GEN
(114:6:3:1),wa,CONJ,PREFIX|w:CONJ+
(114:6:3:2),{l,DET,PREFIX|Al+
(114:6:3:3),n~aAsi,N,STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN
Now pipe that with cut
!cat qt | sed 's/\t/,/g'| cut -d',' -f4 | tail
STEM|POS:P|LEM:fiY
STEM|POS:N|LEM:Sador|ROOT:Sdr|MP|GEN
PREFIX|Al+
STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN
STEM|POS:P|LEM:min
PREFIX|Al+
STEM|POS:N|LEM:jin~ap|ROOT:jnn|F|GEN
PREFIX|w:CONJ+
PREFIX|Al+
STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN
So, above we asked to split the Quranic root files by comma -d','
and wanted to preserve only the fourth column -f4
, and then we are showing only the tail
.
(a side note: I initially tried head
but was getting a broken pipe error, because head hurried to show things before it is ready by the process before it, so, I switched to tail to give enough time for cut
to finish its business)
grep
And now the powerful grep
. I will use it to pick the lines that has the ROOT:
words. I am using regular expressions hence the -E
option and want to output only the captured word and not the entire lines by specifying the -o
option. The pattern I used is 'ROOT:[^|]*'
which in plain English means: Traverse all lines and return only the lines that has the word ROOT:followed by anything except the character |
.
Regular Expression is a wild beast and worth investing time if you want to analyze text. You can always test various patterns on-line for example at regexr.com
!cat qt | sed 's/\t/,/g'| cut -d',' -f4 | grep -oE 'ROOT:[^|]*' |tail
ROOT:Alh
ROOT:nws
ROOT:$rr
ROOT:wsws
ROOT:xns
ROOT:wsws
ROOT:Sdr
ROOT:nws
ROOT:jnn
ROOT:nws
I want to the exact the word that appears after the prefix ROOT:
. To do this I can employ a cut
again on the :
delimiter and take the second column. (I am sure there are better ways though).
!cat qt | sed 's/\t/,/g'| cut -d',' -f4 | grep -oE 'ROOT:[^|]*' |\
cut -d':' -f2 | tail
Alh
nws
$rr
wsws
xns
wsws
Sdr
nws
jnn
nws
This way I have the list of all roots in the Quran.
Next, I want to sort them and count them.
sort
!cat qt | sed 's/\t/,/g'| cut -d',' -f4 | grep -oE 'ROOT:[^|]*' |\
cut -d':' -f2 |\
sort | tail
zyn
zyn
zyn
zyt
zyt
zyt
zyt
zyt
zyt
zyt
All sort
does is to alphabetically sort the list, hence all the repeated roots are stacked. What I wanted is a unique list of those roots. Here comes the uniq
command.
uniq
!cat qt | sed 's/\t/,/g'| cut -d',' -f4 | grep -oE 'ROOT:[^|]*' |\
cut -d':' -f2 |\
sort | uniq | tail
zwd
zwj
zwl
zwr
zxrf
zyd
zyg
zyl
zyn
zyt
The above gives the unique list of all roots of the Quran without preserving how many of each roots are there.
While here, let me use wc
with the -l
option to tell me the number of lines, and hence number of unique root words in the Quran.
!cat qt | sed 's/\t/,/g'| cut -d',' -f4 | grep -oE 'ROOT:[^|]*' |\
cut -d':' -f2 |\
sort | uniq | wc -l
1651
Wow! with just few commands I managed to discover that the Quran contains a total of 1651 roots. This should give hope to those who intend to learn vocabulary of the Quran, which has nearly 77k words, but only 1,651 root words to memorize.
Now, let us revisit uniq
with the -c
option to preserve the counts.
!cat qt | sed 's/\t/,/g'| cut -d',' -f4 | grep -oE 'ROOT:[^|]*' |\
cut -d':' -f2 |\
sort | uniq -c | tail
2 zwd
81 zwj
4 zwl
6 zwr
4 zxrf
61 zyd
9 zyg
10 zyl
46 zyn
7 zyt
All that is left is to do a sort
again but with options -n
to make numeric sort on the counts and -r
to make reverse sort in the descending order. Let us take the most popular 20 roots.
(note: again with the above I run into broken pipe issue, so I resorted to tail, then removed the reverse order)
!cat qt | sed 's/\t/,/g'| cut -d',' -f4 | grep -oE 'ROOT:[^|]*' |\
cut -d':' -f2 |\
sort | uniq -c | sort -n | tail -20
346 jEl
360 Eml
363 kll
373 E*b
381 smw
382 Ayy
405 ywm
461 ArD
513 rsl
514 byn
519 $yA
525 kfr
549 Aty
660 qwm
854 Elm
879 Amn
980 rbb
1390 kwn
1722 qwl
2851 Alh
Great revelations
See the power of the Linux shell. With just few piped commands, I produced a sorted list of the roots and their frequency in the Quran. Here are some winning roots.
Alh
Sure, no doubt that the word ‘Allah’ will be the winner far exceeding the runner’s up.
qwl
The second most frequent root word in the Quran is (قول) qwl
which are the derivatives of saying
. After all, the Quran is the sayings of Allah. Anyone would recognize the thousands of times Allah command His prophet Muhammad - peace be upon him - to say
whatever Allah wants him to say, and hence it is the greatest prove that the job of Prophet Muhammad is to convey whatever he is told to convey and never to author anything himself. This data analysis just comes to prove that.
kwn
In the third position comes all derivatives of (كون) kwn
which refers to the verb to be. And verb to be is the mother of all actions, and it is the word through which Allah executes His orders. When He intends something, He just says ‘Be’ and it becomes. Shakespeare might had some glimpse of the significance of this verb when he said, “To be or not to be, that is the Question”.
rbb
The fourth most frequent word refers to the word Lord which is another way to say Allah. Actually -with few exceptions- you can add this count to the counts of the word Allah.
Amn and Elm
These includes all derivatives of faith/believe Amn
and knowledge Elm
, showing great emphasis on these two qualities as essential ingredients for salvation. Islam is nothing but seeking knowledge and having faith accordingly. Also note Amn
brings in all derivatives of peace
as well.
One can derive much more insights from just this root frequencies. Spend some time studying the significance of each of those roots, but here I focused more on the technical bits.
Extension: Roots of a surah
We are very close to another handy extension of finding roots of not the entire Quran, rather only a particular surah. First let us revisit the file again.
!cat qt|tail
(114:5:3:1) fiY P STEM|POS:P|LEM:fiY
(114:5:4:1) Suduwri N STEM|POS:N|LEM:Sador|ROOT:Sdr|MP|GEN
(114:5:5:1) {l DET PREFIX|Al+
(114:5:5:2) n~aAsi N STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN
(114:6:1:1) mina P STEM|POS:P|LEM:min
(114:6:2:1) {lo DET PREFIX|Al+
(114:6:2:2) jin~api N STEM|POS:N|LEM:jin~ap|ROOT:jnn|F|GEN
(114:6:3:1) wa CONJ PREFIX|w:CONJ+
(114:6:3:2) {l DET PREFIX|Al+
(114:6:3:3) n~aAsi N STEM|POS:N|LEM:n~aAs|ROOT:nws|MP|GEN
So, the secret of grabbing a surah is to grep '(114:'
to get all lines for surah no. 114. With this in mind, I will just add this small addition at the beginning of my already existing pipe as follows.
!cat qt| grep '(114:'|sed 's/\t/,/g'| cut -d',' -f4 | grep -oE 'ROOT:[^|]*' |\
cut -d':' -f2 |\
sort | uniq -c | sort -nr
5 nws
2 wsws
1 xns
1 rbb
1 qwl
1 mlk
1 jnn
1 Sdr
1 Ew*
1 Alh
1 $rr
All that is left is to place this entire code inside a shell script and allow the user to run the script passing the surah no. as parameter and get the list of roots for that surah.
Open up a file using your favorite editor (I use mcedit
), and create a file, say qr.sh
(short for Quran Roots) with the following content:
#!/usr/bin/env bash
NUM="$1"
cat qt| grep '('$NUM':'|sed 's/\t/,/g'| cut -d',' -f4 | grep -oE 'ROOT:[^|]*' |\
cut -d':' -f2 |\
sort | uniq -c | sort -n
The first line is for the shell to recognize where to find the bash
shell if needed. The second line is a variable that captures the first argument from user as we will see later. The rest is the exact code we have seen earlier, just note how I am using this variable $NUM
in the first grep
.
After that, you need to make this script executable by the following command.
!chmod u+x qr.sh
Now, let us try to launch our new script asking to return all roots for sura no. 1
!./qr.sh 1
1 Dll
1 Ebd
1 Elm
1 Ewn
1 Hmd
1 dyn
1 gDb
1 gyr
1 hdy
1 mlk
1 nEm
1 qwm
1 rbb
1 smw
1 ywm
2 Alh
2 SrT
4 rHm
Sure enough, the derivatives of the root word of rHm
(رحم) (meaning mercy) is the most frequent word in this surah.
Just out of curiosity, we know Surah No. 55 (ar-Rahman) has lots of repetition of the verse:
(فبأي آلاء ربكما تكذبان)
Let us test that
!./qr.sh 55 | tail
4 Hsn
4 byn
4 mrj
4 wzn
5 smw
6 Ans
8 jnn
31 Alw
32 k*b
36 rbb
Conclusion
This file is a gold mine that we only are scratching its surface, and starting with Linux command line. I will take this problem into more interesting one using python
’s pandas
tool here.
You will find useful material in this book and as well as in the data36 blog.