Introduction

The brain–computer interface (BCI) is an evolving technology that facilitates communication between the human brain and an external device without using the normal output pathways [1]. A BCI translates human brain signals into machine control signals to be used where no muscular movement is made. The machine can be a computer, wheelchair, robot, assistive device, or alternative communication device. The BCI has a broad range of applications in both the medical and non-medical domains. Using a BCI for speech communication is one such application: the attempted speech is used to actuate a speech synthesizer, enabling a person to communicate with the external world through brain signals. An electroencephalography (EEG)-based BCI for speech communication measures the brain electrical activity of an individual during attempted speech through scalp electrodes. The brain signals picked up by the electrodes are sent to the computer, processed, and converted into meaningful words that can be communicated as aural information. Though the primary goal of the speech BCI is to act as an alternative communication device for physically challenged people, its applications also extend to non-medical fields such as silent speech communication, synthetic telepathy, and cognitive biometrics [2, 3].

A survey of leading-edge literature identifies a gap in the ability to provide speech communication using brain signals to produce meaningful words; such a capability so far exists only for syllables and phonemes. Silent speech can be produced in three ways: (1) talking by moving the speech articulators but without producing any audible sound, in which case the signals are captured mainly by EMG sensors placed around the neck and mouth; (2) speech imagery, i.e., imagining the word to be produced; and (3) talking in the mind without moving any speech articulators and without making any audible sound (subvocalization). Although earlier research has demonstrated that EEG-based BCI for speech communication is possible with imagined speech, the lack of lateralization presents a significant challenge in analyzing the neural signals of imagined speech [4]. To date, most of the studies on speech communication are based on invasive approaches, and the few researchers using noninvasive EEG signals have decoded only phonemes and syllables (shown in Table 4). Therefore, in the current study an effort is made to develop a BCI for speech communication using the neural activity of the brain during subvocalized speech. The authors tested the subvocalized speech behavior of the subjects, for a selected number of words, measured by the scalp electrodes. Subvocalization is the mental rehearsal of a word without making any audible sound and without moving any speech articulators; it refers to the subconscious motor activity that occurs during speech without the presence of a sound wave. Neuroscience studies have shown that the subvocalization of speech plays several roles in auditory imagery. Subvocalization activates motor and auditory pathways, so during subvocal verbalization additional brain pathways are engaged, inducing significantly different activation patterns compared with imagined speech or visual imagery.

One of the main techniques for studying subvocalization is electromyography, which detects minute muscle potentials in the speech organs. The procedure records diffuse muscle activity and hence shows only the overall activity level, not the exact words or sounds being subvocalized; the articulatory pattern cannot be identified accurately. Hence, in the current work, EEG is used to acquire the brain signals during subvocal verbalization of the words. The basis of this effort is the presumption that, whether the speech is overt or covert, it always originates in the mind. The human act of talking involves a complex set of phonatory and articulatory mechanisms, but even when the acoustic aspects of phonetics are removed, we are still “speaking” in our head. This produces brain activation and changes in the power dynamics of the brain. So, in the current work, EEG is used to measure changes in voltage in the brain during subvocalized speech production. As a preliminary investigation, the subvocalized speech behavior of three normal subjects was tested for later comparison with speech-disabled subjects. The experiment was conducted for a selected number of words measured from the scalp EEG electrodes. The model used in developing a BCI for speech communication is presented in the Methods section. Limitations of the methods used and future enhancements are also discussed.

Methods

The architecture used in the BCI speech communication system is shown in Fig. 1. The data acquisition system captures the EEG data from electrodes at specific locations on the scalp for input to the BCI system. The preprocessing module involves amplification, analog filtering, A/D conversion, and artifact removal, thereby improving the signal-to-noise ratio of the EEG signal. Next, in the feature selection stage, the discriminatory features of each subvocalized word are extracted from the preprocessed signals. These features form a feature vector, upon which the classification is done. In the classification stage, feature patterns are identified to determine the subvocalized word spoken by the user. Once the classification into these categories is done, the words are recognized and the speech sounds are produced by the speech synthesizer.
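To make this data flow concrete, the following minimal Python sketch lays out the pipeline as a sequence of stages; the function names, array shapes, and stub bodies are hypothetical placeholders rather than the authors' implementation, and each stage is detailed in the sections that follow.

```python
import numpy as np

def preprocess(raw_eeg):
    """Stand-in for amplification, filtering, and artifact removal."""
    return raw_eeg                                   # see "Preprocessing"

def select_features(epoch, n_features=4):
    """Stand-in for the subset selection stage (variance-based features)."""
    return np.sort(np.var(epoch, axis=1))[::-1][:n_features]

def classify(features, model):
    """Stand-in for the multiclass SVM stage; `model` is a trained classifier."""
    return model.predict(features.reshape(1, -1))[0]

def synthesize(word_label):
    """Stand-in for the speech synthesizer output."""
    print(f"speak: word{word_label}")
```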

Fig. 1
figure 1

Functional model of the EEG-based BCI system

To detect the electrical activity of the brain from scalp recordings, the signals must be of sufficient strength and duration and drawn from a considerable number of trials. Because of volume conduction in the brain, the EEG signals are recorded from a large number of channels, yielding a large dataset and a significant computational challenge. Selecting relevant features from such large datasets is a fundamental challenge in EEG-based BCI: much of the data extracted from electrodes placed over various regions may be extraneous, irrelevant, or even “noise” for the classification problem at hand. Hence, the objective of this study is to investigate data reduction and classification methods that minimize the computational complexity of analyzing the EEG signals. EEG signals suffer from the curse of dimensionality due to their intrinsic biological and electromagnetic complexities. In this context, the subset selection method (SSM), based on a focused set of principal representative features (PRF), is used to select data and reduce the dimensionality. The respective variance contributions, computed for an optimal number of channels, are considered as principal features, and the prominent variances obtained from the SSM are selected for multiclass EEG signal analysis. A multiclass support vector machine (SVM) is then used to classify the EEG signals of five subvocalized words extracted from the scalp electrodes.

Data Acquisition Paradigm

The EEG data are recorded using a Neuroscan 64-channel cap-based EEG device with the standard 10–20 montage. Vertical eye movements were detected using separate channels placed above and below the subject’s right eye, and horizontal eye movements were recorded by electrodes placed on either side of the eyes (temple region). In this study, meaningful words catering to the basic needs of a person are considered. The EEG data are extracted during subvocalization of the word, i.e., when the subject talks silently in his mind without any overt physical manifestation of speech. The data acquisition paradigm is shown in Fig. 2. The experiment involved three volunteer participants, referred to as subject1 through subject3. The five words selected are “water,” “help,” “thanks,” “food,” and “stop,” referred to as word1, word2, word3, word4, and word5, respectively, in subsequent sections. Subject1 had been trained in BCI experiments on subvocalized speech; the other subjects had never participated in BCI experiments before. All volunteer subjects were right-handed male students between the ages of 20 and 25, otherwise normal, and underwent the EEG procedure without any neurological antecedents.

Fig. 2
figure 2

Data acquisition paradigm includes the experiment design to capture brain signal behavior during subvocalized speech. The diagram shows capturing one trial of a particular word

While the participant subvocalized the word in his mind, the brain electrical activity was recorded by the EEG system. The experimental paradigm was presented with E-Prime 2.0 software. In each trial, a word is presented on the computer screen at time zero. The display of the word is followed by three beeps in a particular rhythm. Approximately 2 s after the third beep, the participant starts to subvocalize the word shown on the monitor five times, in the same rhythm as the beeps; no audio stimulus is presented during this period. The participant is instructed to avoid blinking or moving any muscles and to concentrate on the word shown. Each trial therefore contains five instances of a particular word, and the duration of a single trial was 17 s, followed by a short break before the next word was displayed. The rest interval between trials was randomized between 8 and 10 s to prevent the subjects from becoming accustomed to the length of the rest period. A single experimental session comprised the EEG acquisition for 25 trials of each word. The data were recorded over two separate sessions with varying word order, contributing a total of 50 trials of each word (250 trials in all = 50 trials × 5 words), each trial containing five instances of the subvocalized word. The EEG was recorded in a controlled environment, in continuous mode, with Neuroscan Synamps 2 amplifiers at a sampling rate of 1000 Hz.

Preprocessing

The EEG data were analyzed off-line using Neuroscan’s SCAN 4.5 software. The signals were filtered between 0.5 and 40 Hz using a band-pass filter and down-sampled to 250 Hz. Eyeblink artifact reduction was performed in all of our experiments: the vertical and horizontal ocular artifacts were reduced using the independent component analysis-based blink artifact reduction algorithm implemented in SCAN 4.5, and all blink activity was reduced in the continuous signal. Artifacts other than eye blinks were not removed. After artifact removal, the signals were epoched and averaged.
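The same preprocessing chain can be sketched with the open-source MNE-Python toolbox (the authors used Neuroscan SCAN 4.5; the file name, number of ICA components, and EOG channel label below are assumptions for illustration only):

```python
import mne

# Load a continuous Neuroscan recording (hypothetical file name)
raw = mne.io.read_raw_cnt("subject1_session1.cnt", preload=True)

raw.filter(l_freq=0.5, h_freq=40.0)      # 0.5-40 Hz band-pass
raw.resample(250)                        # down-sample from 1000 Hz to 250 Hz

# ICA-based ocular artifact reduction, analogous to the SCAN blink-reduction tool
ica = mne.preprocessing.ICA(n_components=20, random_state=0)
ica.fit(raw)
eog_inds, _ = ica.find_bads_eog(raw, ch_name="VEOG")   # assumed EOG channel label
ica.exclude = eog_inds
raw_clean = ica.apply(raw.copy())
```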

For each trial, the signal was extracted with reference to the stimulus onset and offset markers on the continuous file. Each trial has five instances of a word (called epochs), each of approximately 2-s duration. The epochs are extracted and averaged over each trial, leaving only the activity that is consistently associated with the stimulus in a time-locked manner; the spontaneous EEG activity that is random with respect to stimulus onset is averaged out, leaving only the event-related potentials. Finally, there are 50 averaged EEG epochs for each word, forming a total of 250 epochs per subject (50 epochs/word × 5 words).
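A minimal NumPy sketch of this epoching and averaging step is given below; it assumes the onset samples of the five instances are known from the stimulus markers, and the timing constants are illustrative:

```python
import numpy as np

fs = 250                       # sampling rate after down-sampling
epoch_len = 2 * fs             # ~2-s epoch per subvocalized instance
n_instances = 5                # five instances of the word per trial

def average_trial(trial, onsets):
    """trial: (n_channels, n_samples); onsets: start samples of the 5 instances.
    Returns the time-locked average of shape (n_channels, epoch_len)."""
    epochs = np.stack([trial[:, o:o + epoch_len] for o in onsets])
    return epochs.mean(axis=0)          # activity not time-locked averages out

# Example with synthetic data: one 64-channel, 17-s trial
trial = np.random.randn(64, 17 * fs)
onsets = 5 * fs + np.arange(n_instances) * epoch_len     # assumed instance timing
avg_epoch = average_trial(trial, onsets)                  # shape (64, 500)
```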

Feature Selection

Feature selection is a form of dimensionality reduction that efficiently identifies and generates discriminatory features among different classes of data as a compact feature vector. In the current work, EEG is measured from 64 channels with a 2-s epoch for each word, contributing a huge amount of data; hence, dimension reduction is crucial in the EEG data analysis. Due to volume conduction, the electric potential observed at the scalp is spatially widespread, and the channels are therefore highly correlated. Prominent signals are measured by scalp electrodes located above the active cerebral area involved in the mental processing, so in multi-channel EEG data groups of channels are interrelated: more than one channel may be measuring the same EEG potential evoked by the mental processing. To avoid this redundancy of information, a group of channels can be replaced by a single new variable/channel. In our work, the representative-feature SSM, using pairwise cross-correlation among the features, is used to reduce the size of the dataset with minimal loss of information. The desired outcome of the SSM, based on principal representative features (PRF), is to project the feature space onto a smaller subspace that represents the data with significant discrimination. This facilitates the analysis, as explained in the subsequent discussion.

figure a

The subset method generates a new set of variables, called PRFs. Each PRF is a linear combination of the original variables. All the PRFs are orthogonal to each other, so there is no redundant information; this is ascertained by a simple pairwise cross-correlation coefficient computation. The PRFs as a whole form an orthogonal basis for the space of the data. The coefficients are calculated so that the first PRF captures the maximum variance. The second PRF is calculated to have the second highest variance and, importantly, to be uncorrelated with the first PRF. Subsequent PRFs exhibit decreasing contributions of variance and are uncorrelated with all other PRFs. The full set of PRFs is as large as the original set of variables. However, it is common for the cumulative sum of the variances of the first few PRFs to exceed 80 % of the total variance of the original data, as observed in our experimental procedure. Only the first few variances need be retained; the remainder is discarded, thus reducing the dimensionality of the data. The output generated by the SSM based on principal features is described in Algorithm (1).

The algorithm is explained in detail as follows. Let \(X \in R^{m \times N}\) denote the original matrix, where m and N represent the number of channels and number of samples per channel, respectively. Let \(Y \in R^{m \times N}\) denote the transformed matrix derived from a linear transformation P on X. The sample mean M, of each channel, given by \(M = \frac{1}{N}\sum\nolimits_{i = 1}^{N} {X_{i} }\), is subtracted from every measurement of each channel. For m channels, the covariance matrix C is computed, which is an m × m square symmetrical matrix. The elements of C are defined as:

$$c_{ik} = c_{ki} = \frac{1}{N - 1}\mathop \sum \limits_{t = 1}^{N} \left( {X_{it} - M_{i} } \right)\left( {X_{kt} - M_{k} } \right)$$
(1)

where X is the dataset with N samples and \(M_i\) denotes the mean of channel i. The entry \(c_{ik}\) in C, for i ≠ k, is called the covariance of \(X_i\) and \(X_k\). C is positive semi-definite [5] since it is of the form \(XX^{\text{T}}\).

The SSM based on principal features finds an orthonormal m × m matrix P that transforms X into Y such that X = PY.

$$\left[ {\begin{array}{*{20}c} {x_{1} } \\ {x_{2} } \\ \vdots \\ \vdots \\ {x_{m} } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {u_{1} } & {\begin{array}{*{20}c} {u_{2} } & \ldots \\ \end{array} } & {u_{m} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {y_{1} } \\ {y_{2} } \\ \vdots \\ \vdots \\ {y_{m} } \\ \end{array} } \right]$$
(2)

Each column of P is a new basis vector for expressing the columns of X. The new variables \(y_1, y_2, \ldots, y_m\) are uncorrelated and are arranged in decreasing order of variance. Each observation vector x is transformed, by rotation and scaling, into the new variables such that x = Py, aligning the basis with the directions of maximum variance; the original variables \(x_i\) are thereby rendered into m new uncorrelated variables \(y_i\). Obtaining the principal feature axes involves an eigenanalysis of the covariance matrix C. The eigenvalues \(\lambda_i\) are found by solving the characteristic equation \(\left| {{\mathbf{C}} - {\mathbf{\lambda I}}} \right| = 0\); each eigenvalue denotes the amount of variability captured along its dimension. The eigenvectors are the columns of matrix P such that

$$C = {\text{PDP}}^{\text{T}} ,\quad {\text{where}}\quad D = \left[ {\begin{array}{*{20}c} {\lambda_{1} } & 0 & 0 & 0 \\ 0 & {\lambda_{2} } & 0 & 0 \\ 0 & 0 & \ddots & 0 \\ 0 & 0 & 0 & {\lambda_{m} } \\ \end{array} } \right]$$
(3)

The vectors \(u_1, u_2, \ldots, u_m\) are the unit vectors corresponding to the columns of the orthogonal matrix P. These unit vectors are called the PRF vectors and are derived in decreasing order of importance. The first PRF \(u_1\) determines the new variable \(y_1\) as shown in Eq. (4). Thus, \(y_1\) is a linear combination of the original variables \(x_1, x_2, \ldots, x_m\), where \(a_1, a_2, \ldots, a_m\) are the entries of the PRF vector \(u_1\). Similarly, \(u_2\) determines the variable \(y_2\), and so on.

$$y_{1} = a_{1} x_{1} + a_{2} x_{2} + \cdots + a_{m} x_{m}$$
(4)

Hence, the SSM based on principal features generates a subset of features endowed with large representative variances, thus embodying impressive structure, while the features with lower variances represent noise and are omitted from the feature space.
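A compact NumPy sketch of this computation (mean removal, covariance, eigen-decomposition, and ordering by variance) is shown below; it follows the description of Algorithm (1) but is not the authors' code, and the synthetic input sizes are illustrative:

```python
import numpy as np

def principal_representative_features(X):
    """X: (m channels, N samples). Returns the variances (eigenvalues) in
    decreasing order, the matrix P whose columns are the PRF vectors, and
    the transformed data Y with mutually uncorrelated rows."""
    Xc = X - X.mean(axis=1, keepdims=True)        # subtract the channel means
    C = (Xc @ Xc.T) / (X.shape[1] - 1)            # m x m covariance matrix, Eq. (1)
    eigvals, P = np.linalg.eigh(C)                # eigen-decomposition, Eq. (3)
    order = np.argsort(eigvals)[::-1]             # sort by decreasing variance
    eigvals, P = eigvals[order], P[:, order]
    Y = P.T @ Xc                                  # new uncorrelated variables, x = Py
    return eigvals, P, Y

# Example: one 64-channel averaged epoch of 500 samples (synthetic data)
X = np.random.randn(64, 500)
variances, P, Y = principal_representative_features(X)
```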

Figure 3 shows the variances of the covariance matrix, computed using the SSM based on principal features. The cumulative variance (CV) illustrates that the first four variances explain 99 % of the total variance; the remaining components contribute less than 1 % each. Therefore, the first four components are chosen to form the feature vector, and the remaining variances are discarded. A scree plot helps to select the number of variances to retain, which depends on the “elbow” point: after the elbow, the remaining variance values are relatively small and of about the same size, and hence can be discarded.
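Continuing the sketch above, the number of variances to retain can be chosen from the cumulative variance; the 99 % threshold follows the text, while the variable names are illustrative:

```python
import numpy as np

relative = variances / variances.sum()                # relative variance (RV)
cumulative = np.cumsum(relative)                      # cumulative variance (CV)
n_keep = int(np.searchsorted(cumulative, 0.99)) + 1   # first n with CV >= 99 %
feature_vector = variances[:n_keep]                   # e.g. the first four variances
```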

Fig. 3
figure 3

Variance, relative variance (RV), and cumulative variance (CV) for word1 (first ten values), and the corresponding scree plot is shown

The first two PRFs are typically responsible for the bulk of the variance in the data: the first PRF gives the direction of maximum spread of the data, and the second gives the direction of maximum spread perpendicular to the first. The loading plot in Fig. 4a reveals the relationships between the variables/channels in the space of the first two PRFs. An intense loading for PRF-1 is observed in channels FPZ, F2, CP6, and FZ; similarly, an intense loading for PRF-2 is found in channels PO4 and FC2. A three-dimensional loading plot of PRF-1, PRF-2, and PRF-3 is shown in Fig. 4b.

Fig. 4
figure 4

a Two-dimensional PRFs plot and b three-dimensional PRFs plot reveal the relationship between variables in different subspaces

In Table 1, a significant difference in the values is observed for FPZ, FZ, F2, and CP6 of PRF-1. Note also that the majority of the variance in the dataset lies along these channels, so the information from these channels alone is sufficient to infer the result; the information from the remaining channels can be discarded, reducing the computational burden on the system. PRF-1 expressed as a linear combination of the original variables is shown in Eq. (5).

$$\begin{aligned} {\text{PRF}}1 & = 0.0022*{\text{FP}}1 + 0.6192*{\text{FPZ}} + 0.0016*{\text{FP}}2 \\ & \quad + 0.0018 *{\text{AF}}3 + 0.0016*{\text{AF}}4 + 0.0008*{\text{F}}7 \\&\quad + 0.0028*{\text{F}}5 + \cdots \\ \end{aligned}$$
(5)

The current study shows significant differences in the variance across different words (given in Table 2). These variances correspond to the EEG change due to a particular component of mental rehearsal of different words. These features from the training set and the testing set were fed to the classifier, in the form of feature vectors, for classification of the test data.

Table 1 Coefficients of the first four PRFs
Table 2 Range (mean ± standard deviation) of the first four features (variances) across 50 trials of EEG signals for the five subvocalized words

Feature Classification

An SVM is a supervised learning model that classifies data by finding the best hyperplane [6] separating all data points of one class from those of the other class. The best hyperplane for an SVM is the one with the maximum margin between the two classes, where the margin is the width of the slab, parallel to the hyperplane, that contains no interior data points. Given the features of the data, the support vector machine is first trained to compute a model that distinguishes the two classes; the trained model is then used to classify new incoming data. The details are given in Algorithm (2).

figure b

For a specified set of training data \((x_i, y_i)\), where i = 1, …, N, with \(x_i \in R^d\) and \(y_i \in \{+1, -1\}\) (representing two different classes of the subvocalized word), train a classifier f(x) such that:

$$f(x_{i} )\left\{ {\begin{array}{*{20}c} {\begin{array}{*{20}c} { \ge 0} \\ { < 0} \\ \end{array} } & {\begin{array}{*{20}c} {y_{i} = + 1} \\ {y_{i} = - 1} \\ \end{array} } \\ \end{array} } \right.$$
(6)

The linear classifier is of the form \(f(x_i) = w^{\text{T}} x_i + b\) (a dot product), where w, the normal to the hyperplane, is known as the weight vector and b is the bias. For a linear classifier, w is learned from the training data and is needed for classifying the new incoming data. The support vectors are the data points \(x_i\) on the boundary, for which \(y_i f(x_i) = 1\); at these points the optimal hyperplane can be represented as \(\left| {w^{\text{T}} x_i + b} \right| = 1\). The distance between a support vector \(x_i\) and the hyperplane can be written as shown in Eq. (7). For a canonical hyperplane, the numerator is equal to one; therefore, the distance from the hyperplane to the support vectors is \(\frac{1}{{\left| {\left| w \right|} \right|}}\).

$${\text{Distance}} = \frac{{\left| {w^{\text{T}} x_{i} + b} \right|}}{{\left| {\left| w \right|} \right|}} = \frac{1}{{\left| {\left| w \right|} \right|}}$$
(7)

The margin M is twice the distance from the hyperplane to the support vectors; therefore, \(M = 2/\left| {\left| w \right|} \right|\). To find the best separating hyperplane, estimate w and b that maximize the margin \(2/\left| {\left| w \right|} \right|\), such that \(w^{\text{T}} x_i + b \ge 1\) for \(y_i = +1\) and \(w^{\text{T}} x_i + b \le -1\) for \(y_i = -1\), or equivalently, minimize \(\frac{1}{2}\left| {\left| w \right|} \right|^{2}\) subject to the constraint \(y_i \left( {w^{\text{T}} x_i + b} \right) \ge 1\). Learning an SVM can thus be formulated as a convex quadratic optimization problem, subject to linear inequality constraints, with a unique solution. The objective function [7] of this problem is formulated as:

$$\begin{aligned} & {}_{{w \in R^{d} }}^{\text{min }} \;\;J(w) = \frac{1}{2}\left| {\left| w \right|} \right|^{2} \\ & {\text{s}} . {\text{t}}\quad \left\{ { y_{i} \left( {w^{\text{T}} x_{i} + b} \right) \ge 1} \right.,\quad i = 1,2, \ldots ,N \\ \end{aligned}$$
(8)

We can express the inequality constraints as \(C_i(w) = y_i \left( {w^{\text{T}} x_i + b} \right) - 1 \ge 0\). The Lagrangian method finds the solution of constrained optimization problems with equality constraints; when the problem has inequality constraints, the method must be extended to the Karush–Kuhn–Tucker (KKT) conditions. The KKT conditions are necessary conditions for a local minimum of a constrained optimization problem, stated in terms of the gradients of the objective and constraint functions. Using the KKT dual complementarity condition, \(\alpha_i C_i(w) = 0\), the objective function of Eq. (8) can be expressed by the Lagrangian function shown in Eq. (9).

$$\begin{aligned} & { \hbox{min} }\quad L\left( {w,b,\alpha_{i} } \right) = \frac{1}{2}\left| {\left| w \right|} \right|^{2} - \mathop \sum \limits_{i = 1}^{N} \alpha_{i} \left[ {y_{i} \left( {w^{\text{T}} x_{i} + b} \right) - 1} \right] \\ & {\text{s}}.{\text{t}}\quad \alpha_{i} \ge 0,\quad i = 1,2, \ldots ,N \\ \end{aligned}$$
(9)

The scalar quantity \(\alpha_i\) is the Lagrange multiplier for the corresponding data point \(x_i\). The Lagrangian is optimal at a point w where no first-order feasible descent direction exists (a saddle point). At this point w, there exist scalars \(\alpha_i\) such that

$$\frac{\partial L}{\partial w} = 0 \Rightarrow w = \mathop \sum \limits_{i = 1}^{N} \alpha_{i} y_{i} x_{i}$$
(10)

and

$$\frac{\partial L}{\partial b} = 0 \Rightarrow \mathop \sum \limits_{i = 1}^{N} \alpha_{i} y_{i} = 0$$
(11)

If we exploit the definition of w from Eq. (10) and substitute it in the Lagrangian Eq. (9), then simplify, we get

$$L\left( {w,b,\alpha_{i} } \right) = \mathop \sum \limits_{i = 1}^{N} \alpha_{i} - \frac{1}{2}\mathop \sum \limits_{i,j = 1}^{N} \alpha_{i} \alpha_{j} y_{i} y_{j} x_{i}^{\text{T}} x_{j} - b\mathop \sum \limits_{i = 1}^{N} \alpha_{i} y_{i}$$
(12)

However, from Eq. (11), the last term in Eq. (12) must be zero. Imposing the constraints \(\alpha_i \ge 0\) and the constraint given in Eq. (11), we obtain the dual optimization problem shown in Eq. (13).

$$\begin{aligned} \hbox{max} \;\;W(\alpha ) = \mathop \sum \limits_{i = 1}^{N} \alpha_{i} - \frac{1}{2}\mathop \sum \limits_{i,j = 1}^{N} \alpha_{i} \alpha_{j} y_{i} y_{j} x_{i}^{\text{T}} x_{j} \hfill \\ {\text{s}} . {\text{t}}\quad \mathop \sum \limits_{i = 1}^{N} \alpha_{i} y_{i} = 0\quad {\text{and}}\quad \alpha_{i} \ge 0\quad \forall i \hfill \\ \end{aligned}$$
(13)

The optimal value of α, substituted in Eq. (10), gives the optimal value of w in terms of α. There exists a Lagrange multiplier \(\alpha_i\) for every training data point \(x_i\). Suppose we have fit our model’s parameters to a training set and now wish to make a prediction at a new input point x. We then calculate the linear discriminant function \(g(x) = w^{\text{T}} x + b\) and predict y = 1 if and only if this quantity is greater than zero. Using Eq. (10), the discriminant function can be written as:

$$\begin{aligned} w^{\text{T}} x + b & = \left( {\mathop \sum \limits_{i = 1}^{N} \alpha_{i} y_{i} x_{i} } \right)^{\text{T}} x + b \\ w^{\text{T}} x + b & = \mathop \sum \limits_{i = 1}^{N} \alpha_{i} y_{i}\, \left\langle {x_{i } , x} \right\rangle + b \\ \end{aligned}$$
(14)

The prediction of the class label from Eq. (14) depends on the inner product between the input point x and the support vectors \(x_i\) of the training set. In the solution, the points that have \(\alpha_i > 0\) are called the support vectors.
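The discriminant function of Eq. (14) can be written directly as a few lines of NumPy; the sketch below assumes the multipliers, support vectors, and bias have already been obtained by solving the dual problem:

```python
import numpy as np

def svm_decision(x, support_x, support_y, alphas, b):
    """Eq. (14): sum of alpha_i * y_i * <x_i, x> over the support vectors, plus b.
    support_x: (n_sv, d); support_y, alphas: (n_sv,); x: (d,)."""
    return np.sum(alphas * support_y * (support_x @ x)) + b

def predict_label(x, support_x, support_y, alphas, b):
    return 1 if svm_decision(x, support_x, support_y, alphas, b) >= 0 else -1
```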

In general, if the problem does not admit a simple hyperplane as a separating criterion, nonlinear separators are needed. A nonlinear classifier can be created by applying the kernel trick. A kernel function implicitly maps the data points into a higher-dimensional space, with the aim of improving the separability of the data; it is expressed as a dot product in that (possibly infinite-dimensional) feature space. Therefore, the dot product between the input point x and the support vectors \(x_i\) in Eq. (14) can be computed by a kernel function. Using kernels, the discriminant function g(x) with support vectors \(x_i\) can be written as:

$$g(x) = w^{\text{T}} x + b = \mathop \sum \limits_{i = 1}^{N} \alpha_{i} y_{i} k\left( {x_{i} , x} \right) + b$$
(15)

Using the kernel function, the algorithm can be carried into a higher-dimensional space without explicitly mapping the input points into that space. This is highly desirable because the higher-dimensional feature space may even be infinite-dimensional and thus infeasible to compute in explicitly. Standard kernel functions include the polynomial kernel and the radial basis function (Gaussian) kernel.
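For reference, these standard kernels can be written as simple functions; the parameter values (degree, gamma) below are free choices for illustration, not values used in the study:

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, degree=3, c=1.0):
    return (x @ z + c) ** degree

def rbf_kernel(x, z, gamma=0.1):
    """Radial basis function (Gaussian) kernel."""
    return np.exp(-gamma * np.sum((x - z) ** 2))
```

Replacing the inner product in the decision-function sketch above with any of these kernels yields the nonlinear classifier of Eq. (15).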

In the present work, a one-against-all multiclass SVM with a linear kernel was constructed to discriminate the five subvocalized words. The feature classification using the SVM classifier is described in Algorithm (2). A linear kernel was used because the data were found to be linearly separable; the linearity of the data was verified using the perceptron learning algorithm. The one-against-all model constructs N (N = 5 in the present work) binary SVM classifiers, each of which separates one class from the rest. The jth SVM is trained with the features of the jth class labeled as the positive class and all of the others labeled as the negative class. The N classes can be linearly separated such that the jth hyperplane puts the jth class on its positive side and the rest of the classes on its negative side. However, a drawback of this method is that when the results from the multiple classifiers are combined into the final decision, the outputs of the decision functions are compared directly, without considering the competence of the individual classifiers [8]. Another drawback of the SVM is that there is no definite method for selecting the kernel best suited to the problem at hand.
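The one-against-all arrangement described here maps directly onto scikit-learn; the sketch below mirrors the setup (linear kernel, five classes) but uses synthetic feature vectors, since it is not the authors' implementation:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X_train = np.random.randn(200, 4)              # 4 PRF variances per trial (synthetic)
y_train = np.random.randint(1, 6, size=200)    # labels word1..word5

clf = OneVsRestClassifier(SVC(kernel="linear"))  # five binary one-vs-rest SVMs
clf.fit(X_train, y_train)
predicted_word = clf.predict(np.random.randn(1, 4))
```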

Results and Discussion

A number of experiments were conducted to evaluate the performance of the designed BCI model in classifying the EEG signals of subvocalized words. The SSM, based on principal features, was applied to the preprocessed EEG signals of five subvocalized words. The dataset had 50 trials of each word, measured for 2 s, from a 64-channel EEG headset. A total of 250 trials, measured from five subvocalized words, were used for evaluating the possibility of recognizing the subvocalized word from the EEG signals. Due to the high dimensionality of the dataset, the SSM, based on principal features, was used to project the data to a reduced dimension while preserving the maximum useful information. An optimal number of coefficients, contributing 99 % of the variance, were selected as features for each trial of the EEG signal. The classification performance is evaluated by a multiclass SVM (one-against-all) using a fivefold cross-validation procedure. The features selected using the SSM based on principal features were used to build the classifier. To develop a generalized, robust classifier that performs well on new samples, a fivefold cross-validation re-sampling technique was chosen for training and testing the classifier. In this procedure, the data are split into five equal-sized subsamples: four subsamples are used for training the classifier, and one subsample is used for testing. This procedure is repeated five times, using a different randomly picked subsample for testing each time. Based on the results obtained, the precision, recall, F-measure, and accuracy are calculated. The average performance over the five folds is taken as the estimate of the classifier’s performance.
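The fivefold procedure can be sketched with scikit-learn as follows; the feature matrix and labels are synthetic stand-ins for the 250 averaged epochs (50 trials × 5 words):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X = np.random.randn(250, 4)                # placeholder PRF features per epoch
y = np.repeat(np.arange(1, 6), 50)         # word labels 1..5, 50 epochs each

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = OneVsRestClassifier(SVC(kernel="linear"))
y_pred = cross_val_predict(clf, X, y, cv=cv)   # held-out prediction for every epoch
```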

The classifier performance is determined by computing the precision, recall, F-measure, and classification accuracy drawn from the confusion matrix. The confusion matrix illustrates the true positive (TP), false negative (FN), false positive (FP), and true negative (TN) of the classified data. The metrics are calculated using the following formulae:

$${\text{Recall}} = {\text{TP}}/({\text{TP}} + {\text{FN}})$$
(16)
$${\text{Precision}} = {\text{TP}}/({\text{TP}} + {\text{FP}})$$
(17)
$$\begin{aligned} F{\text{-measure}} & = 2(({\text{precision}}*{\text{recall}}) \\ & \quad /({\text{precision}} + {\text{recall}})) \\ \end{aligned}$$
(18)
$${\text{Accuracy}} = ({\text{TP}} + {\text{TN}})/({\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}})$$
(19)

Table 3 shows the precision, recall, F-measure, and accuracy of the model in classifying the data. The recall represents the ability of the test to retrieve the correct information. In the present work, a recall of 0.6 was achieved, meaning that 60 % of the activity was detected (TP) while 40 % of the activity went undetected (FN). Precision identifies the percentage of the selected information that is correct: a precision of 0.5 means that 50 % of the detections were correct (TP), while the remaining 50 % were incorrectly assigned to that class (FP). Higher recall indicates that most of the relevant information was retrieved, and higher precision means that substantially more relevant than irrelevant information was retrieved. Precision and recall are often inversely related, and it is frequently possible to increase one at the cost of reducing the other. The feature selection and classifier model used in the data analysis affect the levels of recall and precision. The balanced F-measure is a combined measure that assesses the precision-recall trade-off: it is the harmonic mean of the two parameters and varies between a best value of 1 and a worst value of 0. In the current work, the F-measure ranged between 0.27 and 0.75. The classification accuracy varied between 60 and 92 %, which compares favorably with the results given in Table 4. The results indicate that there is significant potential for the use of subvocalized speech in EEG-based direct speech communication.

Table 3 Precision, recall, F-measure, and accuracy assessment for different words with three subjects
Table 4 Comparison of the results obtained for speech communication using ECoG and EEG signals by different researchers

The scalp maps in Fig. 5a show the brain electrical activity during subvocal verbalization of the words. The neural activations are significantly prominent in the frontal lobes during subvocalized speech; since these regions are directly involved in speech production, the results appear promising. Figure 5b shows the scalp maps after applying the SSM: the discrete sources of EEG signals are decomposed to distinguish the co-occurrence of brain electrical activity in the spatial domain, and the signal components are then mapped to a lower-dimensional space retaining the most discriminative features. The discrimination of the brain signals corresponding to the five subvocalized words is shown in Fig. 5. The classification accuracy plotted against the number of features used in the feature vector to discriminate the subvocalized speech of word1 is displayed in Fig. 6. The classification accuracy increases as the number of features increases and remains constant after the fourth feature; therefore, only four discriminating features are used to form the feature vector in the current work. The receiver operating characteristic (ROC) curve is drawn to show the effective discrimination of the proposed SSM algorithm through the multiclass SVM classifier (Fig. 7). The ROC curve serves as a measure of the performance of the algorithm by plotting the true positive rate versus the false positive rate in a unit square. The ROC curve in Fig. 7 shows that the performance of the proposed BCI model is better than chance level.
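A one-vs-rest ROC analysis of the kind shown in Fig. 7 can be sketched as follows, using cross-validated decision values for the five words; X and y are the synthetic stand-ins from the sketches above, so the resulting curves are illustrative only:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import cross_val_predict
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize
from sklearn.svm import SVC

clf = OneVsRestClassifier(SVC(kernel="linear"))
scores = cross_val_predict(clf, X, y, cv=5, method="decision_function")
y_bin = label_binarize(y, classes=[1, 2, 3, 4, 5])

for k in range(5):                                   # one curve per word
    fpr, tpr, _ = roc_curve(y_bin[:, k], scores[:, k])
    plt.plot(fpr, tpr, label=f"word{k + 1} (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], "k--")                      # chance level
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```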

Fig. 5
figure 5

Scalp topographical presentation of the brain activity, recorded from 64-electrode EEG, during subvocal verbalization of the words by subject-1. a The brain activity during subvocal verbalization of five different words. b The signal components mapped to a lower-dimensional space using the SSM

Fig. 6
figure 6

Graph shows the variation in classification accuracy versus number of features used in the feature vector to discriminate the subvocalized speech of word1

Fig. 7
figure 7

ROC curve drawn for the classification of the subvocalized words using the proposed SSM algorithm

Related Work

Research on synthetic telepathy is being carried out by the US Army, with the intention of allowing its soldiers to communicate just by thinking [9]. The aim is to build a thought-helmet, a device that can read and broadcast the unspoken speech of soldiers, enabling them to communicate silently. Silent speech communication is one of the most exciting future technologies: it allows people to communicate with each other by whispering or even without producing any sound [10]. This technology is used by NASA astronauts who need to communicate despite surrounding noise. Currently, electromyography (EMG) signals captured by small, button-sized sensors affixed below the jawbone on either side of the throat [11] are used to collect the signals. A new class of biometrics, based on the cognitive aspects of human behavior and called cognitive biometrics, presents a novel approach to user authentication. Using the brain state of an individual as the authentication mechanism increases robustness and enables cross-validation when combined with traditional biometric methods. Cognitive biometrics cannot be hacked, stolen, or transferred from one person to another; they are unique to each individual. The BCI for speech communication can also serve as an augmentative and alternative communication (AAC) device for severely disabled people who can communicate only through computer interfaces that interpret neurological signals; people suffering from amyotrophic lateral sclerosis (ALS) and locked-in syndrome (LIS) are the targeted beneficiaries.

In the past decade, several BCI techniques have been developed to restore communication in patients with varied and severe paralysis. The indirect communication devices generally used for this purpose, such as the speller device or a virtual keyboard, suffer from slow selection rates of just one word per minute [12], which limits the user’s fluency and comprehension. Moreover, these indirect methods do little to improve the patients’ behavioral abnormalities or psychological condition and constrain natural speech communication. To address these problems and make BCI speech production more natural and fluent, direct methods are being developed. The direct approach involves capturing the neural activity of the intended speech through EEG; the signals are then processed to predict the speech and to synthesize it in real time. Suppes et al. [13] used electrical and magnetic brain waves for the recognition of words and sentences that were silently spoken. DaSalla et al. [14] developed a BCI using EEG for imagined speech of the English vowels /a/ and /u/, with a no-action state as a control. The potential use of EEG as a means of silent communication has been explored by D’Zmura et al. [15] and Brigham et al. [16], in which subjects imagining two syllables, /ba/ and /ku/, without speaking or performing any overt actions, were assessed for feasibility and considered for subject identification [4, 17]. Table 4 shows that, although a considerable amount of research has been carried out on silent speech, most of the work uses invasive methods, which have many shortcomings, and the noninvasive studies have decoded only phonemes and syllables. To our knowledge, decoding a complete meaningful word from EEG signals during subvocalized speech production has not been reported, which is why we consider the present work novel.

Though speech communication has extensive scope in various domains of application, the challenges in processing the EEG signals in real-time are significant. At the very outset, it must be acknowledged that the EEG signals are extremely complex and prone to internal and external interference.

Conclusion

The motivation for this study was to build a practical BCI framework for speech communication using current technology. The priority is to enable communication with a simple BCI setup providing high performance and speed. The study was conducted with an eye on the vast number of applications for BCI speech communication. Potential applications include synthetic telepathy, speech communication in LIS patients, silent communication, and cognitive biometrics. EEG was chosen for this experiment since it is low cost, portable, and has high temporal resolution compared to other brain-imaging modalities. Also, EEG can detect covert processing in the brain, even without the external stimulus; our input was mainly from covert activities.

An essential contribution of the present work is the use of subvocalized speech for the development of an EEG-based BCI for speech communication. Subvocal verbalization is associated with activation in the frontal and temporal cortex, with bi-hemispheric lateralization. This activation suggests that the frontal and temporal lobes are involved in the articulation of speech output.

The EEG signals were acquired from three healthy subjects, while they subvocally articulated one of the five words. The EEG patterns for those five essential words were then selected from each subject. The data acquired were for five complete words that were felt to relate to patients’ needs as opposed to phonemes and syllables. The signals were processed using an SSM algorithm devised to reduce the magnitude of sensor data processed for feature selection.

A multiclass SVM classifier (one-against-all) was used to classify the data. The developed model was verified/evaluated using standard metrics. Several performance measures were used to investigate the feasibility and limitations of the developed BCI model. In the present system, a satisfactory accuracy in the range 80–92 % was achieved. The results show that the presented model of BCI for speech communication, using subvocalized speech, is viable, but needs improvement in classification accuracy.

The significant challenge in analyzing EEG signals is their low signal-to-noise ratio; they are prone to internal and external noise. More refined data analysis and a comparatively large amount of data are required to extract useful information from EEG. In addition, as the number of words to be classified increases, intelligent algorithms that learn the most discriminatory features need to be built. For real-time application, classification of a large vocabulary of words must be supported to make the system scalable. Furthermore, an accurate mechanism to capture the subject’s focus is required to enforce discrimination among the different words. Advances in sensor technology, acquisition protocols, and intelligent algorithms will be needed for BCIs to meet the desired performance. The data were acquired in a controlled environment, but the approach is also expected to work in less restricted settings.

Future Work

As a future work, an improved method of feature selection and classification, using advanced machine learning techniques, could be explored. Furthermore, classifier ensembles could be used to capture the significant variability in EEG data and augment the accuracy and receptiveness of the system. The next step would be to work on an expanded list of recognized subvocalized words.

The work reported so far is an early step in the development of a BCI system for speech communication. Future steps would be to build a commercial headset with a minimum number of electrodes to enable speech communication through subvocal verbalization. We are confident that the technologies and methodologies presented in this study provide a foundation for future development that will enable the speechless to generate assisted speech at an ever-increasing rate.