Tutorial LDABiplots

Luis Pilacuan-Bonete, Purificación Galindo-Villardón, Javier De La Hoz-Maestre and Francisco Javier Delgado Álvarez

Introduction

LDABiplots is an extraction, analysis, and visualization tool for the exploratory analysis of news published on the web by digital newspapers. By extracting data from the web (Bradley et al., 2019), it allows the user to fit the Latent Dirichlet Allocation (LDA) probabilistic model (Blei, Ng and Jordan, 2003) and to generate Biplot (Gabriel, 1971) and HJ-Biplot (Galindo-Villardón, 1986) visualizations of the main topics in the headlines of the news published on the web. LDABiplots streamlines the data extraction from the web, the LDA modeling routine, and the generation of Biplot visualizations in an interactive way, for users who are not familiar with R.

Download & installation

To download and install the stable version from the Comprehensive R Archive Network (CRAN):

# install.packages("LDABiplots")
# library(LDABiplots)

Once the library is loaded, type the following in the R console to start the web interface:

# runLDABiplots()

Import or Load of Data

LDABiplots can extract data from the news section of the Google search engine (www.google.com). For users who obtain their data from a different source, LDABiplots also allows files in Excel format to be loaded.

Importing Data from File

Data can be imported from a file in a local directory: in the Import or Load Data tab, select the Import excel file option, choose the file to upload with Browse, and indicate in Worksheet Name the worksheet where the data is located. The data to be uploaded must have the header and format shown in Figure 1.

Figure 1. File format to import

Video 1. Importing Data

Data Extraction from the WEB

Before extracting news through LDABiplots, it is recommended that you open Advanced Search in the Google search engine on your computer and set the region and preferred search language in the news section, for better extraction results (Video 2).

Video 2. Browser Configuration

The data can be extracted by web scraping directly in LDABiplots: in the Import or Load Data tab, select the Load web data option and enter the search keywords (use a maximum of 4 keywords for better search performance). Then select the search language in Choose Language and the number of result pages to extract (by default, Google shows 10 news items per page), and select Run to execute the search and extraction (Video 3).

Video 3. Web Scraping News from the Web

Application Example

To illustrate how LDABiplots works, we extract from the web the news related to “covid, coronavirus, France”, as shown in Video 3. The extraction produces two tables: one lists the newspapers with their respective number of news items, and the other shows each newspaper with the headlines of its news. Both tables can be downloaded from the application in various formats. To process this data, we proceed as follows:

Selection of Digital Newspapers to Analyze: it is recommended to select the newspapers with the highest number of news items.

Inclusion of n-grams: an n-gram is a contiguous sequence of n words. Depending on your study, select unigrams, bigrams, or trigrams; for the example, bigrams are selected.

Remove numbers: this option removes numbers when they are not informative.

Select language for stopwords: stopwords are words that have no lexical meaning and appear with high frequency in the news, such as articles or pronouns. Select the language in which the news was extracted; in the example, English.

Add stopword words: this option removes from the study additional words that are highly frequent and do not add value to it; in the example, the words covid and France were added.

Select Lemmatization: this option reduces words to their base form. It should be used with caution; in the example, lemmatization was not selected.

Selecting Sparsity: this option removes, before the models are generated, terms that appear in very few news items. It improves computational performance because it eliminates information that does not contribute to the model. In our case, a sparsity of 0.985 (98.5%) was used; that is, the DTM is generated with the terms that appear in at least 1.5% of the news headlines.

Create DTM: finally, the document-term matrix (DTM) is generated. Once the process is finished, the dimensions of the DTM are shown in a summary table (see Video 4).
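The preprocessing steps above can be sketched in base R on a few toy headlines. This is only an illustration of the ideas (number removal, stopwords, bigrams, DTM, sparsity filter), not the code that LDABiplots runs internally: the headlines are invented, the helpers tokenize() and ngrams() are hypothetical, the stopword list is a tiny stand-in, and the sparsity rule (keep terms present in at least 1 - sparsity of the documents) is an assumption.

```r
# Toy headlines, invented for illustration (not real extracted data)
headlines <- c(
  "Covid cases rise in France in 2021",
  "France eases covid travel rules",
  "New vaccine rules for travel",
  "Vaccine doses arrive as cases rise"
)

# Hypothetical helper: lower-case, optionally drop numbers, split into words
tokenize <- function(text, remove_numbers = TRUE) {
  text <- tolower(text)
  if (remove_numbers) text <- gsub("[0-9]+", " ", text)
  text <- gsub("[^a-z0-9 ]", " ", text)
  words <- strsplit(text, " +")[[1]]
  words[nzchar(words)]
}

# Hypothetical helper: contiguous n-grams, joined with "_"
ngrams <- function(words, n = 2) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = "_"),
         character(1))
}

stopwords_en    <- c("in", "for", "as", "the", "a")  # tiny stand-in list
extra_stopwords <- c("covid", "france")              # words added by the user

# Clean each headline, then keep unigrams plus bigrams
docs <- lapply(headlines, function(h) {
  w <- tokenize(h)
  w <- w[!w %in% c(stopwords_en, extra_stopwords)]
  c(w, ngrams(w, 2))
})

# Document-term matrix: one row per headline, one column per term
vocab <- sort(unique(unlist(docs)))
dtm <- t(vapply(docs,
                function(d) as.numeric(table(factor(d, levels = vocab))),
                numeric(length(vocab))))
dimnames(dtm) <- list(paste0("doc", seq_along(docs)), vocab)

# Sparsity filter (assumed rule): keep terms that appear in at least
# (1 - sparsity) of the documents
sparsity    <- 0.6
doc_share   <- colSums(dtm > 0) / nrow(dtm)
dtm_reduced <- dtm[, doc_share >= 1 - sparsity, drop = FALSE]
dim(dtm_reduced)
```

With the value 0.985 used in the example, the same rule would keep the terms that appear in at least 1.5% of the headlines.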

Video 4. Obtaining the DTM

Visualization of Terms Matrix

After processing the original data of the selected newspapers, a matrix with 402 unique terms was obtained, out of the 444 in the original corpus; this reduction improves computational performance in the subsequent analyses. The DTM can be viewed in tabular form by selecting Document Term Matrix Visualization in the Data tab, and it can be downloaded in different formats, such as Excel, CSV, and PDF. This table shows the frequency of each term, the number of documents in which each term appears, and the inverse document frequency (IDF), which is a measure of the importance of the term.
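As a rough illustration of the three quantities in that table, the following base R sketch computes them for a small invented DTM. It assumes the common definition IDF = log(N / document frequency), where N is the number of documents; the exact formula used by LDABiplots may differ.

```r
# Toy DTM, invented for illustration: 4 documents, 3 terms
dtm <- matrix(c(2, 0, 1,
                0, 1, 1,
                1, 1, 1,
                0, 0, 3),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("vaccine", "travel", "rules")))

term_freq <- colSums(dtm)             # total occurrences of each term
doc_freq  <- colSums(dtm > 0)         # number of documents containing the term
idf       <- log(nrow(dtm) / doc_freq)  # assumed IDF definition

summary_table <- data.frame(term = colnames(dtm), term_freq, doc_freq, idf,
                            row.names = NULL)
summary_table
```

A term that appears in every document, such as "rules" here, gets an IDF of zero under this definition, which reflects that it does not help to distinguish documents.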

The Barplot option generates an ordered bar graph in which the words are displayed according to their frequency; the color of the bars can be changed, and the graph can be downloaded in various formats with the export button. The Wordcloud option shows a word cloud, which can be adjusted by selecting the number of words to show; the graph can also be downloaded in various formats with the export button.

Co-occurrence displays a word co-occurrence plot, which draws the sparse term correlations as a graph structure. It is based on the glasso procedure (Lasso Plot), which shrinks the correlation matrix and keeps only the relevant correlations between terms. The Select Number option sets the number of terms in the correlation graph, and Download the plot downloads the graph in PNG or PDF format. See Video 5.
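The idea of keeping only the relevant term correlations can be sketched in base R. Note that this sketch does not implement glasso: it swaps in a plain hard threshold on the correlation matrix as a simpler stand-in, and the DTM and the threshold of 0.5 are invented for illustration.

```r
# Toy DTM, invented for illustration: 6 documents, 4 terms
dtm <- matrix(c(1, 1, 0, 0,
                1, 1, 0, 0,
                1, 0, 0, 1,
                0, 0, 1, 1,
                0, 0, 1, 1,
                0, 0, 1, 0),
              nrow = 6, byrow = TRUE,
              dimnames = list(paste0("doc", 1:6),
                              c("vaccine", "doses", "travel", "rules")))

# Term-by-term correlation matrix; self-correlations are not of interest
term_cor <- cor(dtm)
diag(term_cor) <- 0

# Stand-in for glasso: keep only correlations with absolute value >= 0.5
adjacency <- abs(term_cor) >= 0.5
edges <- which(adjacency & upper.tri(adjacency), arr.ind = TRUE)

# Edge list of the resulting co-occurrence graph
edge_list <- data.frame(
  from   = colnames(dtm)[edges[, "row"]],
  to     = colnames(dtm)[edges[, "col"]],
  weight = term_cor[edges]
)
edge_list
```

Glasso reaches a similar goal in a more principled way, by estimating a sparse inverse covariance matrix under an L1 penalty instead of cutting the correlations at a fixed value.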

Video 5. DTM visualization

Selection of the Optimal K

To infer and select the optimal number of topics for the LDA model, we start from the DTM, bearing in mind that a small K can generate broad and heterogeneous topics, while a large K produces very specific topics. LDABiplots obtains the optimal K from the topic coherence, a measure of topic quality from the point of view of human interpretability. It is based on the distributional hypothesis, which states that words with similar meanings tend to occur in similar contexts. The best number of topics is the one with the highest coherence; it is found with an approach based on probability theory, which consists of fitting several models with different numbers of topics and calculating the coherence of each of them. To obtain this number, the models to be checked must be specified in the Inference section: in Candidate number of topics K, set the range of topics to test. In the Parameters section Gibbs sampling control, select the number of Gibbs sampling iterations (Iteration) and the number of initial samples to discard (Burn-in), choosing a burn-in N that is large enough.
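To make the idea of coherence concrete, here is a minimal base R sketch of one probabilistic coherence score: for each ordered pair of top terms (w_i ranked before w_j), it averages P(w_j | w_i) - P(w_j), estimated from document co-occurrence. This definition is an assumption chosen for illustration and may not match the exact measure computed by LDABiplots; the function prob_coherence() and the data are hypothetical.

```r
# Toy DTM, invented for illustration: 6 documents, 4 terms
dtm <- matrix(c(1, 1, 0, 0,
                1, 1, 0, 0,
                1, 1, 0, 0,
                0, 0, 1, 1,
                0, 0, 1, 1,
                0, 0, 1, 1),
              nrow = 6, byrow = TRUE,
              dimnames = list(paste0("doc", 1:6),
                              c("vaccine", "doses", "travel", "rules")))

# Hypothetical helper: coherence of one topic, given its top terms in
# decreasing order of probability (at least two terms, each present in
# at least one document)
prob_coherence <- function(top_terms, dtm) {
  present <- dtm[, top_terms, drop = FALSE] > 0
  scores <- c()
  for (i in seq_len(length(top_terms) - 1)) {
    for (j in (i + 1):length(top_terms)) {
      p_i  <- mean(present[, i])                 # P(w_i)
      p_j  <- mean(present[, j])                 # P(w_j)
      p_ij <- mean(present[, i] & present[, j])  # P(w_i and w_j)
      scores <- c(scores, p_ij / p_i - p_j)      # P(w_j | w_i) - P(w_j)
    }
  }
  mean(scores)
}

# Terms that always appear together form a coherent topic
coherent   <- prob_coherence(c("vaccine", "doses"), dtm)
# Terms that never appear together form an incoherent topic
incoherent <- prob_coherence(c("vaccine", "travel"), dtm)
c(coherent = coherent, incoherent = incoherent)
```

In a model-selection loop, a score of this kind would be averaged over the topics of each fitted model, and the candidate K with the highest average would be retained.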

An Alpha hyperparameter value must also be selected. A high alpha value means that each document is likely to contain a mixture of most topics rather than a single particular topic. A low alpha value places more restrictions on documents and means that a document is more likely to contain a mixture of only a few, or even just one, of the topics. Several authors have proposed rules to determine the value of alpha: α = (0.1, 50/K) (Griffiths and Steyvers, 2004), (0.1, 0.1) (Asuncion et al., 2009), and (1/K, 1/K) (Řehůřek and Sojka, 2010). By default, LDABiplots uses a value of 0.1 for the calculation (see Video 6).
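The effect of alpha can be seen by sampling document-topic distributions from a symmetric Dirichlet prior. The base R sketch below is only an illustration of the prior, not of the LDA fitting itself; the helper rdirichlet_sym() is hypothetical and draws Dirichlet samples by normalizing independent gamma variates.

```r
set.seed(1)

# Hypothetical helper: n draws from a symmetric Dirichlet with k components,
# obtained by normalizing independent Gamma(alpha, 1) variates
rdirichlet_sym <- function(n, k, alpha) {
  g <- matrix(rgamma(n * k, shape = alpha, rate = 1), nrow = n)
  g / rowSums(g)
}

K <- 4
theta_low  <- rdirichlet_sym(1000, K, alpha = 0.1)  # low alpha
theta_high <- rdirichlet_sym(1000, K, alpha = 10)   # high alpha

# Average weight of the dominant topic in each simulated document
dominant_low  <- mean(apply(theta_low, 1, max))
dominant_high <- mean(apply(theta_high, 1, max))
c(low_alpha = dominant_low, high_alpha = dominant_high)
```

With a low alpha, the dominant topic takes most of the probability mass in a typical document, whereas with a high alpha the mass is spread more evenly over the K topics.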

Video 6. Obtaining the optimal K

Obtaining LDA Model

Once the candidate models were evaluated, the highest coherence obtained was 0.069, from which it was inferred that the best number of topics is 4. With this optimal K, the LDA model is generated from the DTM. The parameters must be defined as in the inference step; for the example, 100 iterations, a Burn-in of 5, and an Alpha of 0.1 were selected. LDABiplots returns two matrices. The first is the Theta matrix, which shows in the columns an identifier of the news items of the analyzed newspapers and in the rows the distribution of topics over the analyzed documents. The second is the Phi matrix, whose rows represent the distribution of words over the topics.

Both matrices can be downloaded in the Tabular result section. Before downloading them, you can set the number of terms in Select number of term, the number of labels in Select number of label, and the value of the assignments in Select Assignments, to control how many words and labels are displayed and downloaded.

In Wordcloud, a word cloud shows which words have the greatest weight in each topic. In Heatmap, a heat map shows the probability of each topic in each of the newspapers; according to the color scale, it can be seen which topics are most prevalent in a particular digital newspaper.

In the Cluster tab, you can see the grouping of the topics found. The grouping method is selected in Agglomeration method; the methods included in LDABiplots are complete, single, ward.D, ward.D2, average, mcquitty, median, and centroid. Ward’s minimum variance method aims to find compact, spherical groups. The complete method finds similar groups. The single method, which is closely related to the minimal spanning tree, adopts a “friends of friends” grouping strategy. The other methods can be thought of as aiming for groups with characteristics somewhere between those of the single and complete methods. The median and centroid methods do not lead to a monotone distance measure or, equivalently, the resulting dendrograms can have inversions, which are hard to interpret. In the type of plot section, you can select the type of graph to display: rectangle, which draws rectangles around the branches of a dendrogram to highlight the corresponding groups, or circular, which draws a circular, phylogenetic-style tree showing how the topics are related to each other. A slider selects the number of clusters to form among the topics, and the plot can be downloaded in PDF or PNG format. See Video 7.
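The agglomeration methods listed above are those of the hclust() function in base R (package stats). The sketch below clusters four invented topics with ward.D2; the toy Phi matrix and the use of Euclidean distance between topics are assumptions for illustration, and LDABiplots may measure the distance between topics differently.

```r
# Toy Phi matrix, invented for illustration: 4 topics (rows) over 4 words
phi <- rbind(
  topic1 = c(0.40, 0.40, 0.10, 0.10),
  topic2 = c(0.35, 0.45, 0.10, 0.10),
  topic3 = c(0.10, 0.10, 0.40, 0.40),
  topic4 = c(0.05, 0.15, 0.45, 0.35)
)
colnames(phi) <- c("vaccine", "doses", "travel", "rules")

# Distance between topics (assumed Euclidean) and hierarchical clustering
d    <- dist(phi)
tree <- hclust(d, method = "ward.D2")

# Cut the dendrogram into the chosen number of clusters
groups <- cutree(tree, k = 2)
groups

# To draw the dendrogram with rectangles around the groups, uncomment:
# plot(tree); rect.hclust(tree, k = 2)
```

Changing the method argument to "single", "complete", "average", "mcquitty", "median", "centroid", or "ward.D" reproduces the other options described above.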

Video 7. LDA Model and Representations

Biplot Representations

Biplot graphs approximate the distribution of a multivariate sample in a space of reduced dimension and superimpose on it representations of the variables on which the sample is measured. The graph displays the information of the rows (represented by points, the row markers) and of the columns (represented by vectors, the column markers). LDABiplots displays the results of the Biplot both graphically and in tabular form. Three types of Biplot can be selected. In the JK-Biplot, the coordinates of the rows are the coordinates on the principal components and the coordinates of the columns are the eigenvectors of the covariance or correlation matrix; the Euclidean distances between row points in the Biplot approximate the Euclidean distances between rows in multidimensional space. In the GH-Biplot, the coordinates of the rows are standardized and the distance between rows approximates the Mahalanobis distance in multidimensional space. The HJ-Biplot achieves a high quality of representation for both rows and columns; because both have identical goodness of fit, the row-column relationship can be interpreted.
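The three Biplot types can be written in terms of the singular value decomposition X = U D V' of the centered data matrix. The base R sketch below is a minimal illustration under that standard formulation, not the code used by LDABiplots; the data are invented, and the row and column labels (topics and newspapers) are assumptions. Scaling constants such as sqrt(n - 1) are omitted.

```r
# Toy matrix, invented for illustration: 4 topics (rows) x 3 newspapers
X <- matrix(c(0.50, 0.20, 0.10,
              0.20, 0.40, 0.30,
              0.10, 0.10, 0.40,
              0.20, 0.30, 0.20),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("topic", 1:4),
                            c("paperA", "paperB", "paperC")))

Xc <- scale(X, center = TRUE, scale = FALSE)  # column-centered matrix
s  <- svd(Xc)
U  <- s$u
D  <- diag(s$d)
V  <- s$v

# JK-Biplot: rows in principal coordinates, columns as eigenvectors
jk <- list(rows = U %*% D, cols = V)
# GH-Biplot: rows standardized, columns in principal coordinates
gh <- list(rows = U, cols = V %*% D)
# HJ-Biplot: both rows and columns in principal coordinates
hj <- list(rows = U %*% D, cols = V %*% D)

# JK and GH reproduce the centered matrix as rows %*% t(cols)
jk_ok <- all.equal(jk$rows %*% t(jk$cols), Xc, check.attributes = FALSE)
gh_ok <- all.equal(gh$rows %*% t(gh$cols), Xc, check.attributes = FALSE)

# In full dimension, JK row distances equal the distances between rows of X
dist_ok <- all.equal(as.vector(dist(jk$rows)), as.vector(dist(Xc)))
c(jk_ok = isTRUE(jk_ok), gh_ok = isTRUE(gh_ok), dist_ok = isTRUE(dist_ok))
```

The HJ-Biplot gives up the exact reconstruction of the matrix (the product of its markers is U D² V', not X) in exchange for the same quality of representation for rows and columns. A plot would normally keep only the first two columns of each set of coordinates.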

For our example, the HJ-Biplot was selected. The following rules are used for its interpretation (see Figure 2).

Figure 2. Interpretation of the HJ-Biplot

Where:
  • The distances between row markers are interpreted as an inverse function of their similarity, so that neighboring markers are more similar.

  • The length of the vectors (column markers) approximates the standard deviation of the newspapers.

  • The cosines of the angles between the column markers approximate the correlations between the newspapers: acute angles indicate a high positive correlation, obtuse angles indicate a negative correlation, and right angles indicate uncorrelated variables.

  • The order of the orthogonal projections of the points (row markers) onto a vector (column marker) approximates the order of the row elements in that column. The greater the projection of a point on a vector, the more that row element deviates from the mean of that newspaper.
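The rules about vector lengths and angles can be checked numerically. The base R sketch below uses random data and assumes a column-centered matrix with HJ column markers V D taken in all dimensions, where the relations are exact (the length up to the constant factor sqrt(n - 1)); in a two-dimensional plot they hold only approximately. The labels are hypothetical.

```r
set.seed(42)

# Random toy matrix: 8 rows (for example topics) x 3 columns (newspapers)
X <- matrix(rnorm(8 * 3), nrow = 8,
            dimnames = list(paste0("topic", 1:8),
                            c("paperA", "paperB", "paperC")))

Xc <- scale(X, center = TRUE, scale = FALSE)  # column-centered matrix
s  <- svd(Xc)

# HJ column markers in all dimensions: V D
col_markers <- s$v %*% diag(s$d)
rownames(col_markers) <- colnames(X)

# Rule 1: marker length is proportional to the standard deviation
marker_length <- sqrt(rowSums(col_markers^2))
sd_from_length <- marker_length / sqrt(nrow(X) - 1)

# Rule 2: cosine of the angle between two markers is their correlation
unit_markers <- col_markers / marker_length   # each row scaled to length 1
cosines <- tcrossprod(unit_markers)

round(cbind(sd_from_length, sd = apply(X, 2, sd)), 4)
round(cosines, 4)
round(cor(X), 4)
```

The two printed matrices at the end coincide, which is why acute, right, and obtuse angles between column markers can be read as positive, null, and negative correlations.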

Before generating the Biplot representation, it is necessary to indicate how the matrix will be centered and scaled. LDABiplots offers 4 options: scale, center, center_scale, or none.
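One plausible reading of these four options is sketched below in base R. The helper transform_matrix() is hypothetical, and the mapping of each option name to an operation (in particular, dividing by the column standard deviation for scale) is an assumption; the package may define them differently.

```r
# Toy matrix, invented for illustration
X <- matrix(c(0.50, 0.20, 0.10,
              0.20, 0.40, 0.30,
              0.10, 0.10, 0.40,
              0.20, 0.30, 0.20),
            nrow = 4, byrow = TRUE)

# Hypothetical helper: assumed meaning of the four options
transform_matrix <- function(X, option = c("center_scale", "center",
                                           "scale", "none")) {
  option <- match.arg(option)
  col_mean <- colMeans(X)
  col_sd   <- apply(X, 2, sd)
  switch(option,
    none         = X,                                    # leave data as is
    center       = sweep(X, 2, col_mean, "-"),           # subtract column means
    scale        = sweep(X, 2, col_sd, "/"),             # divide by column sd
    center_scale = sweep(sweep(X, 2, col_mean, "-"), 2, col_sd, "/")
  )
}

X_center       <- transform_matrix(X, "center")
X_center_scale <- transform_matrix(X, "center_scale")

round(colMeans(X_center), 10)        # column means after centering
round(apply(X_center_scale, 2, sd), 10)  # column sd after standardizing
```

Centering makes the Biplot describe deviations from the column means, while scaling puts columns with different spreads on a comparable footing.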

Clicking Run generates the selected Biplot and the results in tabular form, which can be downloaded in different formats. The tabular results are: Eigenvalues, a vector with the eigenvalues; Variance explained, a vector with the proportion of variance explained by the first 1, 2, ..., K principal components; loadings, the loadings of the principal components; Coordinates of individuals, a matrix with the coordinates of the individuals; and Coordinates of variables, a matrix with the coordinates of the variables.
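Quantities of this kind can be obtained from a singular value decomposition, as in the following base R sketch for an HJ-Biplot of a centered toy matrix. The variable names are hypothetical, and the eigenvalues are taken here as squared singular values, which is an assumption: some implementations also divide them by n - 1.

```r
# Toy matrix, invented for illustration: 4 individuals x 3 variables
X <- matrix(c(0.50, 0.20, 0.10,
              0.20, 0.40, 0.30,
              0.10, 0.10, 0.40,
              0.20, 0.30, 0.20),
            nrow = 4, byrow = TRUE)

Xc <- scale(X, center = TRUE, scale = FALSE)
s  <- svd(Xc)

eigenvalues        <- s$d^2                        # assumed definition
variance_explained <- cumsum(eigenvalues) / sum(eigenvalues)
loadings           <- s$v                          # principal component loadings
coord_individuals  <- s$u %*% diag(s$d)            # row markers (HJ-Biplot)
coord_variables    <- s$v %*% diag(s$d)            # column markers (HJ-Biplot)

round(variance_explained, 4)
```

The first two entries of variance_explained indicate how much of the total variability is captured by the plane that is usually displayed.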

The Biplot tab shows the graphical representation generated with the previously selected parameters. The graph can be customized in the Options to Customize the Biplot section: the theme, the axes to display in Axis-X and Axis-Y, the color of the column markers, and the color of the row markers. You can also change the size of the markers and add labels for both types of markers in different sizes. The representation can be downloaded in PNG or PDF format. See Video 8.

Video 8. Biplot Representations

Citation

If you use LDABiplots, please cite it in your work as:

Pilacuan-Bonete L., Galindo-Villardón P., De la Hoz-M J., & Delgado-Álvarez F. (2022). LDABiplots: Biplot Graphical Interface for LDA Models. R package version 0.1.2.

References

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.

Gabriel, K. R. (1971). The biplot graphic display of matrices with application to principal component analysis. Biometrika, 58(3), 453-467.

Galindo-Villardón, P. (1986). Una alternativa de representación simultánea: HJ-Biplot (An alternative of simultaneous representation: HJ-Biplot). Qüestiió, 10, 13–23.

Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228-5235.