1. Phrase Mining

This vignette will introduce you to phrase mining using the phm package. Those who are familiar with the tm package will recognize that there are similarities in functionality between the two.

Phrase Mining is done on a corpus with texts, or on a vector where each element is a text, via the function phraseDoc. This function will create a phraseDoc object, which is equivalent to a term-document matrix stored in a more efficient manner. To see the term-document matrix, use the function as.matrix on its output.

The term-document matrix will have phrases on its rows and documents on its columns, and on the intersection of row and column there will be a frequency indicating the number of times a phrase occurs in the document.

In the case that the phrase document is created on a vector with texts, each element in the vector is considered to be a document, with as its ID the index of the element.

The phraseDoc function will extract principal phrases from the texts given to it. A principal phrase is a phrase that is frequent in its own right (so not as part of a different phrase), is meaningful, does not cross punctuation marks, and does not start or end with so-called stop-words (with a few exceptions).

The phraseDoc function gives progress updates, since at times it may take a while to complete. These can be silenced if desired. When using this function in a Shiny application, these progress updates can be given via a Shiny progress meter; the function uses about 100 progress steps, so it should be created inside a withProgress function with the argument max set to at least 100. The argument shiny in the phraseDoc function should be set to TRUE in that case.

When converting a phraseDoc object to a term-document matrix using as.matrix (see ?as.matrix.phraseDoc) by default the Ids of the documents are displayed on the columns. This can be changed to display the indices of the documents instead.

Once the phraseDoc object has been created, there are several functions available that will obtain information from it:

As an example, we create the following vector with texts:

tst=c("This is a test text",
      "This is a test text 2",
      "This is another test text",
      "This is another test text 2",
      "This girl will test text that man",
      "This boy will test text that man")

Create the phraseDoc object on it:

#> [1] "2024-01-26 17:23:49 EST"
#> [1] "2024-01-26 17:23:49 EST"
#> [1] "Rectifying frequencies..."
#> [1] "2024-01-26 17:23:49 EST"

Display the term-document matrix:

#>                          docs
#> phrases                   1 2 3 4 5 6
#>   another test text       0 0 1 1 0 0
#>   test text               1 1 0 0 0 0
#>   will test text that man 0 0 0 0 1 1

Get the 3 most frequent principal phrases:

#>                         frequency
#> will test text that man         2
#> test text                       2
#> another test text               2

Obtain all frequencies for documents with the phrases “test text” or “another test text”:

getDocs(pd,c("test text","another test text"))
#>                   1 2 3 4
#> another test text 0 0 1 1
#> test text         1 1 0 0

Obtain all frequencies for principal phrases in documents 1 and 2:

getPhrases(pd, 1:2)
#>           1 2
#> test text 1 1

Remove the phrase “test text” from the phrase document:

pd=removePhrases(pd, "test text")
#>                          docs
#> phrases                   1 2 3 4 5 6
#>   another test text       0 0 1 1 0 0
#>   will test text that man 0 0 0 0 1 1

Note that removePhrases will remove a phrase from the phrase document, but it is not able to restore the frequencies of the phrases inside the removed phrases. If this is desired, the phrases should be removed when creating the phrase document using the argument sp instead.

2. Text Clustering

The phm package also provides a distance measure that is optimal for text. Text distance is calculated as the proportion of unmatched frequencies, i.e., the number of unmatched frequencies divided by the total frequencies among the two vectors. Text clustering functions can be used for term-document matrices with phrases, as well as for regular term-document matrices where the terms are words (usually obtained via functions in the tm package).

Text distance is a number between 0 and 1, where 0 means that the two texts have the same terms and the same frequencies of those terms, and 1 indicates that they have no terms in common. A smaller number means that the texts are more alike, while a larger number (closer to 1) means they are less alike.

The function textDist will calculate the text distance between two numeric vectors:

#> [1] 0.6

Each vector represents a document, and the numbers in the vectors are the frequencies of terms. In the example, the first document/vector has one occurrence of the first term, while the second document has none.

The function textDist can also be used on matrices, in which case a vector with the text distance between corresponding columns is returned:

#>      [,1] [,2]
#> [1,]    0    0
#> [2,]    1   10
#> [3,]    0    0
#> [4,]    2   14
#>      [,1] [,2]
#> [1,]   12    1
#> [2,]    0    3
#> [3,]    8    1
#> [4,]    0    2
#> [1] 1.0000000 0.6774194

Note that the first columns of the two matrices have no terms in common, and so their distance is the highest possible: 1.

The function textDistMatrix calculates the text distance between all combinations of the columns of a matrix:

#>   1  2  3 4
#> A 0  0 12 1
#> B 1 10  0 0
#> C 0  0  8 1
#> D 2 14  0 0
#>           1         2         3
#> 2 0.7777778                    
#> 3 1.0000000 1.0000000          
#> 4 1.0000000 1.0000000 0.8181818
#> [1] "dist"

Note that the output of this function is of type dist.

The function textCluster will use text clustering to cluster any term-document matrix. Its output is similar to the output of the function kmeans. However, note that, if there are any documents without terms, they will all be stored in the last cluster.

First we create a term-document matrix:

#>   1 2  3  4 5 6
#> A 0 0  0 12 1 0
#> B 0 1 10  0 0 0
#> C 0 0  0  8 1 0
#> D 0 2 14  0 0 0

Then we cluster it into 3 clusters:

#> << textCluster >>
#> 3 clusters
#> 6 documents

We can look at the output of the function:

#This shows for each document what cluster it is in
#> 1 2 3 4 5 6 
#> 3 1 1 2 2 3

#This shows for each cluster how many documents it contains
#> 1 2 3 
#> 2 2 2

#This matrix shows the centroid for each cluster on the columns, with terms on
#the rows
#>     1   2 3
#> A 0.0 6.5 0
#> B 5.5 0.0 0
#> C 0.0 4.5 0
#> D 8.0 0.0 0

The function showCluster will show the contents of one specific cluster. It will also show a column with for each term the number of documents it appears in, and a column with the total frequency of each term in the cluster. The terms are displayed in descending order of those last two columns, so the most common terms are displayed first.

Note that this function can be used with any clustering method; all it needs is the term-document matrix, a vector with the cluster ID for each document, and the number of the cluster to be displayed.

Let’s take a look at the clusters we have created:

#>   2  3 nDocs totFreq
#> D 2 14     2      16
#> B 1 10     2      11
#>    4 5 nDocs totFreq
#> A 12 1     2      13
#> C  8 1     2       9
#> $docs
#> [1] "1" "6"
#> $note
#> [1] "Documents have no terms"

We see for example that the first cluster consists of documents 2 and 3, which contain only terms “B” and “D”, both occurring in both documents, with “D” having the greatest overall frequency in the cluster, so it occurs first.

The second cluster consists of documents 4 and 5, which contain only terms “A” and “C”. Both documents contain those two terms, but the frequency of “A” is the largest and thus it appears first.

Note that the last cluster looks different from the others; it contains all documents without terms. These are documents 1 and 6.

3. Create a Corpus from a Data Frame

We can create a corpus from a data frame such that all variables except the text variable are stored in the meta fields of the documents. This is done using the function DFSource, in conjunction with the VCorpus function, which resides in the tm package:

(df=data.frame(id=LETTERS[1:3],text=c("First text","Second text","Third text"),
#>   id        text title author
#> 1  A  First text    N1  Smith
#> 2  B Second text    N2  Jones
#> 3  C  Third text    N3  Jones

#Create the corpus

#The content of one of the documents
#> [1] "First text"

#The meta data of one of the documents; all variables are present.
#>   author       : Smith
#>   datetimestamp: 2024-01-26 22:23:49.3588500022888
#>   description  : character(0)
#>   heading      : character(0)
#>   id           : A
#>   language     : en
#>   origin       : character(0)
#>   title        : N1

Note that the data frame must have the variables text and id.

4. Obtain Data from PubMed

The PubMed website will allow a user to enter search criteria, and will return abstracts of medical publications related to those search criteria. These abstracts can be saved in PubMed format, in which case a file will be created on the user’s host system containing these abstracts together with some additional information for each publication.

Running the function getPubMed on this file will create a data table with its contents, with each row representing a publication. The data table will have an id variable, containing the PMIDs of the publications, a text variable containing the text of the abstracts, and several other variables.

Running the function tm::VCorpus on DFSource on this data table will create a corpus with a plain text document for each publication. This corpus may then be used to perform phrase mining or regular (word) text mining.