Can EEG headsets be used for software usability testing?

Riordan Panayides
This is a guest post by our developer intern Riordan Panayides, who has been investigating potential business applications of consumer-grade EEG headsets.

Ideas & Aims

During my internship, we wanted to complete a project involving EEG headsets. These headsets, which measure electrical potentials on the skin generated by brain activity, can be used to give an indication of the wearer's mental state. The Neurosky Mindwave Mobile is an affordable consumer-grade headset featuring only two dry electrodes, which simplifies the setup process so that no professional knowledge is needed.

The Neurosky Mindwave Mobile Headset

Normally, the continuous EEG signal is split into frequency bands using spectral analysis. Increased levels of activity in these bands can correlate with different mental states. For example, increased alpha wave activity can indicate that someone is relaxing with their eyes closed. Drawing on research into these bands, Neurosky has built a number of algorithms that estimate states such as Attention, Meditation, Familiarity and more.
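As a rough illustration of the band-splitting step, the sketch below estimates the power in the conventional EEG bands for one short window of samples using a plain discrete Fourier transform. The band edges, the sample rate, the synthetic signal and the function name band_powers are all assumptions made for illustration; this is not Neurosky's algorithm, which is not described here.

```python
import cmath
import math

# Conventional band edges in Hz; exact boundaries vary between sources (assumption).
BANDS = {
    "delta": (0.5, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 13.0),
    "beta": (13.0, 30.0),
}


def band_powers(samples, sample_rate):
    """Return summed spectral power per band for one window of samples."""
    n = len(samples)
    mean = sum(samples) / n
    centred = [s - mean for s in samples]  # remove the DC offset
    powers = {name: 0.0 for name in BANDS}
    # Naive DFT over the positive frequencies; adequate for short windows.
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(centred)
        )
        power = abs(coeff) ** 2 / n
        for name, (low, high) in BANDS.items():
            if low <= freq < high:
                powers[name] += power
    return powers


# Synthetic one-second window: a strong 10 Hz (alpha) rhythm plus a weak 20 Hz one.
RATE = 256  # hypothetical sample rate in Hz
window = [
    math.sin(2 * math.pi * 10 * t / RATE) + 0.2 * math.sin(2 * math.pi * 20 * t / RATE)
    for t in range(RATE)
]
powers = band_powers(window, RATE)
dominant = max(powers, key=powers.get)
print(dominant)
```

For this synthetic window the alpha band dominates, which is the kind of signature associated with the eyes-closed relaxation example above. Real EEG is far noisier, so practical pipelines usually average over many windows.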

Our aim was to use the output from these algorithms in the usability testing of interfaces. A quantitative measure of the mental effort required to complete a task in a piece of software, for example, would serve as a useful metric for optimising interfaces and deciding whether changes to designs are improvements.

Since some of the algorithms best suited to this application were still in beta, we decided to explore their validity for our purposes. We started by checking that the Mental Effort reading was up to the task.

The flow of data from headset to visualisation.

To verify the mental effort algorithm, we designed a test to put participants into a state of reliably increased mental workload, and checked that the headset reported what we were expecting. We gave participants a test with seven maths and logic questions, starting with a hard question, followed by five easier ones, and ending with another hard question.

We expected to see increased mental effort while the participant was completing the first and last questions, relative to the ones in between. We also recorded a two-minute baseline reading at the beginning of the test, during which the participant was asked to relax. We expected this period to require less mental effort than any of the questions.

When running the tests, the participant was seated in a separate room without distractions and had the questions displayed to them in a PowerPoint presentation on a laptop. After completing each question, the participant was asked to note how difficult they found it on a scale of Easy, Medium or Hard. We used these ratings to check our assumptions about question difficulty when correlating difficulty with the level of mental effort.

I developed a data logger application which receives the output of user-selected algorithms from the SDKs provided by Neurosky. It graphs the data in real time for the test facilitator to view and saves it to a CSV file for later analysis in Microsoft Excel. It was written in C#, making use of the Windows Forms Chart control.
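The CSV-saving part of such a logger can be sketched in a few lines. The example below is in Python rather than the C# of the actual application, and it does not call the Neurosky SDK: the column names, the log_readings function and the sample readings are all hypothetical stand-ins for whatever the SDK delivers.

```python
import csv
import io
import time


def log_readings(readings, out_file, clock=time.time):
    """Write (algorithm name, value) pairs to CSV with a timestamp per row."""
    writer = csv.writer(out_file)
    writer.writerow(["timestamp", "algorithm", "value"])
    count = 0
    for algorithm, value in readings:
        writer.writerow([f"{clock():.3f}", algorithm, value])
        count += 1
    return count


# Hypothetical readings standing in for SDK output.
sample = [("MentalEffort", 41.5), ("Attention", 63), ("MentalEffort", 44.0)]
buffer = io.StringIO()  # a real logger would open a file on disk instead
rows_written = log_readings(sample, buffer)
```

Keeping one row per reading, with the algorithm name in its own column, makes it straightforward to filter and pivot the file later in a spreadsheet.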

The data logger, mid-session. The graphs show live data from the EEG headset.

Summary of Results

To analyse the results, I first averaged the Mental Effort reading over each question, using a trimmed mean to reduce the influence of outlier values. I then created a graph for each testing session, marking when each question was completed, with a 30-second moving average (orange line) to show the general trend in mental effort (raw values in blue).
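The two calculations in this step can be sketched as follows. The analysis itself was done in Excel; this Python version is only an illustration, and the 10% trim fraction, the trailing (rather than centred) window and the sample readings are assumptions, since the post does not state them.

```python
def trimmed_mean(values, trim_fraction=0.1):
    """Mean after discarding trim_fraction of the values from each end."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    cut = int(len(ordered) * trim_fraction)
    kept = ordered[cut:len(ordered) - cut] if cut else ordered
    return sum(kept) / len(kept)


def moving_average(values, window):
    """Trailing moving average; early points average over what is available."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        out.append(running / min(i + 1, window))
    return out


# Hypothetical readings, one per second, with a single outlier spike.
readings = [40, 42, 41, 43, 95, 42, 40, 41, 44, 42]
question_average = trimmed_mean(readings, trim_fraction=0.1)
trend = moving_average(readings, window=3)
```

With one reading per second, a window of 30 would correspond to the 30-second trend line; the short window here just keeps the example small. The trimmed mean drops the spike of 95 before averaging, whereas a plain mean would be pulled upwards by it.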

A graph showing the 'raw' mental effort (blue) and trend line (orange) from Participant A's first trial.

Once all the tests were completed, I took an average of the mental effort for each question across all trials to find a general trend. I plotted this, along with the individual question averages, on the graph below. I also averaged the reported difficulty for each question. The two green lines on the graph appear to show a good correlation; however, the values for individual participants varied wildly.

Mental Effort averages from each testing session, the overall average (dark green), and the average reported question difficulty (light green)

Three sets of results were unfortunately invalidated by poor signal quality interrupting the output of the algorithm SDK. We attribute this to the electrode losing contact with the forehead. If the signal quality is bad for an extended period of time, the SDK attempts to re-calibrate itself by collecting a new baseline reading before outputting any new data, which makes subsequent readings incomparable to earlier ones.

From the data collected we can make a few key observations:

  • The average mental effort for the questions is a lot greater than the baseline average.
  • Reported question difficulty matches what we expected when averaged, but participants did not rate the question difficulty consistently compared to each other. A possible reason for this is that the difficulty ratings were not clearly defined, which could be improved by using an accepted questionnaire for rating mental workload, such as NASA-TLX.
  • Recorded mental effort has a positive correlation with reported difficulty (r=0.65). However, this is not as strong as we would like, and the data has been distorted by averaging.
  • The absolute values are not comparable from person to person, or even between separate tests on the same person. This is because the headset calibrates the algorithm with a ~5s baseline reading, which the subsequent values are relative to. If this baseline is different, then the readings cannot be compared on the same scale.
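For readers unfamiliar with the r value quoted above, the sketch below shows how a Pearson correlation coefficient between reported difficulty and recorded effort can be computed. The numbers are hypothetical per-question averages, not the study's data, and coding the ratings as Easy=1, Medium=2, Hard=3 is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-question averages for the seven questions (not the study's data).
# Difficulty is coded Easy=1, Medium=2, Hard=3, which is an assumption.
difficulty = [2.8, 1.2, 1.5, 1.1, 1.4, 1.3, 2.6]
effort = [62.0, 48.0, 55.0, 51.0, 47.0, 53.0, 58.0]
r = pearson_r(difficulty, effort)
```

With only seven averaged points, a coefficient like this is sensitive to one or two questions, which is part of why the averaged figure should be read with caution.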

Outcomes & Conclusion

I believe that the headset and provided algorithms can distinguish between levels of mental effort. This is shown by the average baseline readings compared to the question average, and also by the apparent correlation between reported difficulty and recorded mental effort (see green lines on graph). However, these conclusions are drawn from averages for each person, which were then averaged again into values for the entire trial. This could mean that at finer time resolutions the data is less useful for interpretation. This poses a problem for usability testing, as individual actions performed in an interface may be even shorter than the questions.

Given this conclusion, we have decided as a team to pivot our focus to another application of the headset which works within its limitations. We are considering investigating how detrimental workplace distractions can be to focus, and ways to improve productivity by mitigating their effects. Watch this space in the coming weeks for more outcomes!