% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/data_aggregation.R
\name{trade_classification}
\alias{trade_classification}
\alias{classify_trades}
\alias{aggregate_trades}
\title{Classification and aggregation of high-frequency data}
\usage{
classify_trades(data, algorithm = "Tick", timelag = 0, ..., verbose = TRUE)

aggregate_trades(
  data,
  algorithm = "Tick",
  timelag = 0,
  frequency = "day",
  unit = 1,
  ...,
  verbose = TRUE
)
}
\arguments{
\item{data}{A dataframe with 4 variables in the following
order (\code{timestamp}, \code{price}, \code{bid}, \code{ask}).}

\item{algorithm}{A character string specifying the algorithm used
to determine the trade initiator (buyer or seller). It takes one of four
values: \code{"Tick"}, \code{"Quote"}, \code{"LR"}, or \code{"EMO"}. The default value is
\code{"Tick"}. For more information about the different algorithms, see the
details section.}

\item{timelag}{A number referring to the time lag in milliseconds
used to calculate the lagged midquote, bid and ask for the algorithms
\code{"Quote"}, \code{"EMO"} and \code{"LR"}.}

\item{...}{Additional arguments passed on to the functions \code{classify_trades()}
and \code{aggregate_trades()}. The recognized arguments are \code{fullreport}
and \code{is_parallel}. Other arguments are ignored.
\itemize{
\item \code{fullreport} is a logical variable passed to \code{aggregate_trades()} that
specifies whether the variable \code{freq} is returned. The default value is
\code{FALSE}.
\item \code{is_parallel} is a logical variable passed to \code{classify_trades()} that
specifies whether the computation is performed using parallel or sequential
processing. The default value is \code{TRUE}. For more details, please refer to the
vignette 'Parallel processing' in the package, or
\href{https://pinstimation.com/articles/parallel_processing.html}{online}.
}}

\item{verbose}{A binary variable that determines whether detailed
information about the progress of the trade classification is displayed.
No output is produced when \code{verbose} is set to \code{FALSE}. The default
value is \code{TRUE}.}

\item{frequency}{The frequency used to aggregate intraday data. It takes one
of the following values: \code{"sec"}, \code{"min"}, \code{"hour"}, \code{"day"}, \code{"week"},
\code{"month"}. The default value is \code{"day"}.}

\item{unit}{An integer referring to the size of the aggregation window
used to aggregate intraday data. The default value is \code{1}. For example, when
the parameter \code{frequency} is set to \code{"min"}, and the parameter \code{unit} is set
to 15, then the intraday data is aggregated every 15 minutes.}
}
\value{
The function \code{classify_trades()} returns a dataframe of five variables. The
first four variables are obtained from the argument \code{data}: \code{timestamp},
\code{price}, \code{bid}, \code{ask}. The fifth variable is \code{isbuy}, which takes the value
\code{TRUE} when the trade is classified as a buyer-initiated trade, and \code{FALSE}
when the trade is classified as a seller-initiated trade.

The function \code{aggregate_trades()} returns a dataframe of two
(or three) variables. If \code{fullreport} is set to \code{TRUE}, then
the returned dataframe has three variables \verb{\{freq, b, s\}}. If
\code{fullreport} is set to \code{FALSE}, then the returned dataframe has
two variables \verb{\{b, s\}} and can therefore be used directly for the
estimation of the \code{PIN} and \code{MPIN} models.
}
\description{
\code{classify_trades()} classifies high-frequency trading data into
buyer-initiated and seller-initiated trades using different algorithms and
different time lags.
\cr \code{aggregate_trades()} aggregates high-frequency trading data at the
provided frequency of aggregation. The aggregation is preceded by
a trade classification step, which classifies trades using the chosen trade
classification algorithm and time lag.
}
\details{
The argument \code{algorithm} takes one of four values:
\itemize{
\item \code{"Tick"} refers to the tick algorithm: a trade is classified as a
buy (sell) if its price is above (below) the closest different price of
a previous trade.
\item \code{"Quote"} refers to the quote algorithm: a trade is classified
as a buy (sell) if its price is above (below) the midpoint of the
bid-ask spread. Trades executed at the mid-spread are not classified.
\item \code{"LR"} refers to the \code{LR} algorithm as in
\insertCite{LeeReady1991;textual}{PINstimation}. It classifies a trade
as a buy (sell) if its price is above (below) the mid-spread (quote
algorithm), and uses the tick algorithm if the trade price is at
the mid-spread.
\item \code{"EMO"} refers to the \code{EMO} algorithm as in
\insertCite{Ellis2000;textual}{PINstimation}.
It classifies trades at the bid (ask) as sells (buys) and uses the tick
algorithm to classify trades within the then prevailing bid-ask spread.
}

\code{LR} recommend using the mid-spread from five seconds earlier (the
'5-second' rule), which mitigates trade misclassification for many of the
\code{150} NYSE stocks they analyze. On the other hand, in more recent studies
such as \insertCite{piwowar2006;textual}{PINstimation} and
\insertCite{Aktas2014;textual}{PINstimation}, the use of
1-second lagged midquotes is shown to yield lower rates of
misclassification. The default value of \code{timelag} is \code{0} (no time lag).
Considering the ultra-fast nature of today’s financial markets, the time lag
is expressed in milliseconds. Lags shorter than one second can therefore be
specified by entering values such as \code{100} or \code{500}.
}
\examples{
# The package contains a preloaded dataset called 'hfdata'. It is an
# artificially created high-frequency trading dataset that contains
# 100 000 trades and five variables: 'timestamp', 'price', 'volume',
# 'bid', and 'ask'. For more information, type ?hfdata.

xdata <- hfdata
xdata$volume <- NULL
\donttest{
# Use the EMO algorithm with a timelag of 500 milliseconds to classify
# high-frequency trades in the dataset 'xdata'

ctrades <- classify_trades(xdata, algorithm = "EMO", timelag = 500, verbose = FALSE)

# Use the LR algorithm with a timelag of 1 second to aggregate intraday data
# in the dataset 'xdata' at a frequency of 15 minutes.

lrtrades <- aggregate_trades(xdata, algorithm = "LR", timelag = 1000,
  frequency = "min", unit = 15, verbose = FALSE)

# Use the Quote algorithm with a timelag of 1 second to aggregate intraday data
# in the dataset 'xdata' at a daily frequency.

qtrades <- aggregate_trades(xdata, algorithm = "Quote", timelag = 1000,
  frequency = "day", unit = 1, verbose = FALSE)

# Because the argument 'fullreport' is set to FALSE by default, the
# output 'qtrades' can be used directly for the estimation of the PIN
# model, namely using pin_ea().

estimate <- pin_ea(qtrades, verbose = FALSE)

# Show the estimate

show(estimate)
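
# The optional arguments documented under '...' can be supplied by name.
# The two calls below are a sketch based only on the argument descriptions
# in this help page: 'fullreport = TRUE' asks aggregate_trades() to return
# the additional variable 'freq', and 'is_parallel = FALSE' asks
# classify_trades() to use sequential processing.

fulltrades <- aggregate_trades(xdata, algorithm = "Tick", frequency = "hour",
  unit = 1, fullreport = TRUE, verbose = FALSE)

# Inspect the first rows of the aggregated data

head(fulltrades)

seqtrades <- classify_trades(xdata, algorithm = "Tick", is_parallel = FALSE,
  verbose = FALSE)

# Inspect the first rows of the classified trades

head(seqtrades)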
}
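
# The classification rules in the details section can be illustrated with
# a minimal base-R sketch on toy data. This is an illustration under
# simplifying assumptions (no time lag, no missing values), not the
# implementation used by the package. The helper names 'tick_rule_sketch'
# and 'lr_rule_sketch' are hypothetical and are not exported by the package.

tick_rule_sketch <- function(price) {
  n <- length(price)
  isbuy <- rep(NA, n)
  last_diff <- NA  # sign of the most recent non-zero price change
  for (i in seq_len(n)[-1]) {
    d <- price[i] - price[i - 1]
    if (d != 0) last_diff <- sign(d)
    # A trade is a buy if its price is above the closest different
    # previous price, a sell if below, and NA if no such price exists
    isbuy[i] <- if (is.na(last_diff)) NA else last_diff > 0
  }
  isbuy
}

lr_rule_sketch <- function(price, bid, ask) {
  mid <- (bid + ask) / 2
  isbuy <- price > mid       # quote rule: above the mid-spread is a buy
  at_mid <- price == mid     # trades exactly at the mid-spread
  # Fall back on the tick rule for trades at the mid-spread
  isbuy[at_mid] <- tick_rule_sketch(price)[at_mid]
  isbuy
}

toy_price <- c(10.25, 10, 10, 9.75, 10)
toy_bid <- rep(9.75, 5)
toy_ask <- rep(10.25, 5)

# The first trade has no previous price, so the tick rule returns NA for it

tick_rule_sketch(toy_price)

# The first trade is above the mid-spread, so the quote rule classifies it

lr_rule_sketch(toy_price, toy_bid, toy_ask)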
}
\references{
\insertAllCited
}