Convert data frame or character vector to a ‘corporaexplorerobject’ for subsequent exploration.
prepare_data(dataset, ...)
# S3 method for class 'data.frame'
prepare_data(
dataset,
date_based_corpus = TRUE,
text_column = "Text",
grouping_variable = NULL,
within_group_identifier = "sequential",
columns_doc_info = c("Date", "Title", "URL"),
corpus_name = NULL,
use_matrix = TRUE,
matrix_without_punctuation = TRUE,
tile_length_range = c(1, 10),
columns_for_ui_checkboxes = NULL,
...
)
# S3 method for class 'character'
prepare_data(
dataset,
corpus_name = NULL,
use_matrix = TRUE,
matrix_without_punctuation = TRUE,
...
)
dataset: Object to convert to corporaexplorerobject:
A data frame with a specified column containing text (default column name: "Text", class character), and optionally other columns. If date_based_corpus is TRUE (the default), dataset must contain a column "Date" (of class Date).
Or a non-empty character vector.
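The requirements on a data frame input can be checked before calling prepare_data. The following is a minimal base R sketch, not part of the package; the data frame and the variable names are made up for illustration.

```r
# Hypothetical input where the dates were read in as character strings.
df <- data.frame(
  Date = "2020-01-21",
  Text = "A text",
  stringsAsFactors = FALSE
)

# A character column is not of class Date, so this is FALSE.
is_date_before <- inherits(df$Date, "Date")

# Convert the column, then check both requirements for a date based corpus.
df$Date <- as.Date(df$Date)
ready <- is.character(df$Text) && inherits(df$Date, "Date")
```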
...: Other arguments to be passed to prepare_data.
date_based_corpus: Logical. Set to FALSE if the corpus is not to be organised according to document dates.
text_column: Character. Default: "Text". The column in dataset containing texts to be explored.
Character string indicating column name in dataset. If date_based_corpus is TRUE, this argument is ignored. If date_based_corpus is FALSE, this argument is used to group the documents, e.g., if dataset is organised by chapters belonging to different books. The order of groups in the app is determined as follows:
If grouping_variable is a factor column, the factor levels determine the order.
If grouping_variable is not a factor, the order is determined by the sequence in which unique values first appear in the dataset.
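The two ordering rules can be illustrated in base R. This is only a sketch of the described behaviour with made-up values, not the package's internal code.

```r
# Hypothetical grouping column: chapters belonging to three books.
book <- c("Emma", "Persuasion", "Emma", "Dracula", "Persuasion")

# Not a factor: groups follow the order in which values first appear.
order_by_appearance <- unique(book)

# Factor: groups follow the factor levels, whatever the row order.
book_factor <- factor(book, levels = c("Dracula", "Emma", "Persuasion"))
order_by_levels <- levels(book_factor)
```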
within_group_identifier: Character string indicating column name in dataset. If date_based_corpus is TRUE, this argument is ignored. If date_based_corpus is FALSE, "sequential", the default, means the rows in each group are assigned a numeric sequence 1:n, where n is the number of rows in the group. Used in the document tab title in non-date based corpora.
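What a "sequential" identifier looks like can be sketched with base R's ave(). The group values are made up, and the package may compute the sequence differently.

```r
# Hypothetical grouping values, one per row of the dataset.
group <- c("Book A", "Book A", "Book B", "Book A", "Book B")

# Number the rows 1:n within each group, keeping the original row order.
within_group_id <- ave(seq_along(group), group, FUN = seq_along)
```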
columns_doc_info: Character vector. The columns from dataset to display in the "document information" tab in the corpus exploration app. By default "Date", "Title" and "URL" will be displayed, if included. If columns_doc_info includes a column which is not present in dataset, it will be ignored.
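The effect of ignoring missing columns is equivalent to intersecting the requested names with the names in dataset. The following base R sketch illustrates this with a made-up data frame; it is not the package's own code.

```r
# Hypothetical dataset without a "URL" column.
df <- data.frame(
  Date = as.Date("2020-01-01"),
  Text = "A text",
  Title = "A title"
)

columns_doc_info <- c("Date", "Title", "URL")

# Only the requested columns that exist in the dataset remain.
shown <- intersect(columns_doc_info, names(df))
```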
corpus_name: Character string with name of corpus.
use_matrix: Logical. Should the function create a document term matrix for fast searching? If TRUE, data preparation will run longer and demand more memory. If FALSE, the returned corporaexplorerobject will be more lightweight, but searching will be slower.
matrix_without_punctuation: Should punctuation and digits be stripped from the text before constructing the document term matrix?
If TRUE, the default:
The corporaexplorer object will be lighter and most searches in the corpus exploration app will be faster.
Searches including punctuation and digits will be carried out in the full text documents.
The only "risk" with this strategy is that the corpus exploration app can in some cases produce false positives. E.g. searching for the term "donkey" will also find the term "don%key". This should not be a problem for the vast majority of use cases, but if one so desires, there are three different solutions: set this parameter to FALSE, create a corporaexplorerobject without a matrix by setting the use_matrix parameter to FALSE, or run explore with the use_matrix parameter set to FALSE.
If FALSE, the corporaexplorer object will be larger, and most simple searches will be slower.
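The "donkey" false positive can be reproduced in base R. The stripping pattern below is an assumption chosen for illustration; the exact pattern used by the package is not shown in this documentation.

```r
# Assumed stripping rule: drop all punctuation and digits.
strip_punct_digits <- function(x) gsub("[[:punct:][:digit:]]", "", x)

original <- "don%key"
stripped <- strip_punct_digits(original)

# The search term matches the stripped text but not the original text.
match_in_stripped <- grepl("donkey", stripped, fixed = TRUE)
match_in_original <- grepl("donkey", original, fixed = TRUE)
```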
tile_length_range: Numeric vector of length two. Fine-tune the tile lengths in document wall and day corpus view. Tile length is calculated by
scales::rescale(nchar(dataset[[text_column]]), to = tile_length_range, from = c(0, max(.)))
Default is c(1, 10).
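The rescaling is a linear map from the interval [0, longest text] to tile_length_range. The sketch below writes that map out in base R so it runs without the scales package; rescale_tiles is a hypothetical helper, not a function in the package.

```r
# Hypothetical helper: linear rescaling from c(0, max(n_chars)) to the range.
rescale_tiles <- function(n_chars, tile_length_range = c(1, 10)) {
  from <- c(0, max(n_chars))
  (n_chars - from[1]) / diff(from) * diff(tile_length_range) +
    tile_length_range[1]
}

# Three made-up texts of 5, 50 and 100 characters.
texts <- c("short", strrep("a", 50), strrep("a", 100))
tile_lengths <- rescale_tiles(nchar(texts))
```

Under these assumptions the longest text always gets the upper end of the range, and a text of zero characters would get the lower end.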
columns_for_ui_checkboxes: Character. Character or factor column(s) in dataset. Include sets of checkboxes in the app sidebar for convenient filtering of corpus. Typically useful for columns with a small set of unique (and short) values. Checkboxes will be arranged by sort(), unless the column named in columns_for_ui_checkboxes is a factor, in which case the order will be according to factor level order (easy relevelling with forcats::fct_relevel()). To use a different label in the sidebar than the column name, simply pass a named character vector to columns_for_ui_checkboxes. If columns_for_ui_checkboxes includes a column which is not present in dataset, it will be ignored.
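The difference between the two checkbox orderings can be sketched in base R. The column values are made up, and sorting the unique values is an assumption about how sort() is applied.

```r
# Hypothetical checkbox column.
party <- c("Labour", "Conservative", "Green", "Labour")

# Character column: checkboxes in sorted order.
order_sorted <- sort(unique(party))

# Factor column: checkboxes in factor level order.
party_factor <- factor(party, levels = c("Labour", "Conservative", "Green"))
order_levels <- levels(party_factor)
```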
A corporaexplorer object to be passed as argument to explore and run_document_extractor.
For data.frame: Each row in dataset is treated as a base differentiating unit in the corpus, typically a chapter in a book, or a single document in a document collection.
The following column names are reserved and cannot be used in dataset:
"Date_", "cx_ID", "Text_original_case", "Text_column_", "Tile_length", "Year_", "cx_Seq", "Weekday_n", "Day_without_docs", "Invisible_fake_date".
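A dataset can be checked for clashes with the reserved names before it is passed to prepare_data. This is a minimal base R sketch with a made-up data frame; the check is not a function of the package.

```r
# Reserved column names, as listed in this documentation.
reserved <- c(
  "Date_", "cx_ID", "Text_original_case", "Text_column_", "Tile_length",
  "Year_", "cx_Seq", "Weekday_n", "Day_without_docs", "Invisible_fake_date"
)

# Hypothetical dataset with one offending column.
df <- data.frame(Text = "A text", Year_ = 2020)

# Columns that would have to be renamed first.
clashes <- intersect(names(df), reserved)
```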
A character vector will be converted to a simple corporaexplorerobject with no metadata.
## From data.frame
# Constructing test data frame:
dates <- as.Date(paste(2011:2020, 1:10, 21:30, sep = "-"))
texts <- paste0(
"This is a document about ", month.name[1:10], ". ",
"This is not a document about ", rev(month.name[1:10]), "."
)
titles <- paste("Text", 1:10)
test_df <- tibble::tibble(Date = dates, Text = texts, Title = titles)
# Converting to corporaexplorerobject:
corpus <- prepare_data(test_df, corpus_name = "Test corpus")
#> Starting.
#> Document data frame done.
#> Calendar data frame done.
#> Document term matrix: text processed.
#> Document term matrix: tokenising completed.
#> Document term matrix: word list created.
#> Document term matrix done.
#> Done.
if (interactive()) {
  # Running exploration app:
  explore(corpus)
  # Running app to extract documents:
  run_document_extractor(corpus)
}
## From character vector
alphabet_corpus <- prepare_data(LETTERS)
#> Starting.
#> Document data frame done.
#> Corpus is not date based. Calendar data frame skipped.
#> Document term matrix: text processed.
#> Document term matrix: tokenising completed.
#> Document term matrix: word list created.
#> Document term matrix done.
#> Done.
if (interactive()) {
  # Running exploration app:
  explore(alphabet_corpus)
}