
Big data, low reliability

Professors Phil Xiang and Patrick Fan
"Use with care," say Patrick Fan and Phil Xiang about online reviews. "Comments are meaningful, but ratings may not be."

An award-winning paper by Virginia Tech data analytics scholars offers a caveat about using data from customer reviews on social media sites for research. There’s little discussion of data quality in big data and social media analytics research, especially in the hospitality and tourism field, says Phil Xiang, an associate professor of hospitality and tourism management.

Big data is all the rage at the moment, and for good reason. The insights that can be mined from large and increasingly common data sets are useful across many industries and fields.

But one timeless rule still applies: “Garbage in; garbage out.”

Missing data, mislabeled data, inconsistent data, and even fake reviews are not uncommon, Xiang says. “Many existing studies take data from social media platforms to make predictive analyses without first assessing the reliability and validity of the data.”

Examining the reliability of data

Xiang co-wrote a paper that examined the reliability of social media data by mining TripAdvisor hotel reviews. The paper — co-authored with Patrick Fan, a professor of accounting and information systems; Qianzhou Du, a business information technology doctoral student; and Yufeng Ma, a computer science doctoral student — won a best paper award at a tourism conference in Rome in January.

“We did this study to shed light about the reliability issues in using online review data,” says Fan. “Users should be cautious in using online reviews. Comments are meaningful, but ratings may not be, especially when the number of ratings is low.”

Though websites like TripAdvisor have been considered premier sampling sources for social media research in hospitality and tourism, the study casts doubt on the quality of the available data.

“This study demonstrates that drawing data from even a highly reputable website like TripAdvisor might yield unreliable results and thus potentially invalid conclusions,” the paper states.


The paper analyzed the quality of the data by using algorithms to predict whether a reviewer was a business traveler or tourist.

The researchers pulled in hundreds of thousands of reviews from around 1 million reviewers of hotels in 18 U.S. cities selected to represent a variety of population sizes, locations, and levels of attraction to tourists. They used data from New York City to build and “train” the text classifier. New York was used because of its large number of hotels and reviews, and because it attracted a good mix of business and leisure travelers.

Potential problems revealed

The results revealed potential problems with the data. The classifier performed well in predicting leisure travelers, but did not perform as expected in predicting business travelers.
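A gap like this is easy to miss when only overall accuracy is reported, so it helps to score each class separately. The following is a minimal sketch of per-class recall in Python. The labels are made up for illustration and are not data from the study, and `per_class_recall` is a hypothetical helper name, not something taken from the paper.

```python
from collections import Counter

def per_class_recall(true_labels, predicted_labels):
    """Return {label: share of that label's reviews that were predicted correctly}."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: correct[label] / totals[label] for label in totals}

# Made-up labels for illustration only (assumption, not study data).
true_labels = ["leisure"] * 6 + ["business"] * 4
predicted_labels = ["leisure"] * 6 + ["business", "leisure", "leisure", "leisure"]

recall = per_class_recall(true_labels, predicted_labels)
accuracy = sum(t == p for t, p in zip(true_labels, predicted_labels)) / len(true_labels)
print(recall, accuracy)
```

In this toy case the overall accuracy is 70 percent, yet only one in four business reviews is recovered, which is the kind of imbalance that would prompt a closer look at the labels.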

Using a variety of methods, the researchers concluded that the problem was in the data itself. A word cloud generated from the misclassified reviews suggested a business purpose to the trips, even when the reviewers had classified them as leisure. The researchers then developed a method to clean the data by identifying the reviews that were mislabeled by reviewers.
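The article does not describe the researchers' cleaning method in detail. One common way to surface suspect labels, sketched below under that caveat, is to train a simple text classifier and flag reviews where it confidently disagrees with the label the reviewer chose. The tiny naive Bayes model, the sample reviews, the 0.9 threshold, and all function names here are assumptions for illustration, not the paper's algorithm or data.

```python
import math
from collections import Counter, defaultdict

def train(reviews):
    """reviews: list of (text, label). Returns per-label word counts and label counts."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in reviews:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict_proba(text, word_counts, label_counts):
    """Naive Bayes posterior over labels, with add-one (Laplace) smoothing."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label in label_counts:
        n_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        log_scores[label] = score
    # Normalize in a numerically stable way.
    top = max(log_scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {label: v / norm for label, v in exp_scores.items()}

def flag_suspect_labels(reviews, word_counts, label_counts, threshold=0.9):
    """Return reviews whose self-reported label the model confidently contradicts."""
    flagged = []
    for text, reported in reviews:
        probs = predict_proba(text, word_counts, label_counts)
        predicted = max(probs, key=probs.get)
        if predicted != reported and probs[predicted] >= threshold:
            flagged.append((text, reported, predicted))
    return flagged

# Made-up training reviews (assumption, not study data).
training = [
    ("conference meeting client presentation wifi desk", "business"),
    ("meeting client conference airport shuttle", "business"),
    ("family pool beach vacation kids", "leisure"),
    ("vacation sightseeing museum family fun", "leisure"),
]
# Made-up reviews to audit; the first reads like a business trip but is labeled leisure.
to_audit = [
    ("client meeting at the conference center", "leisure"),
    ("family vacation by the pool", "leisure"),
]

word_counts, label_counts = train(training)
flagged = flag_suspect_labels(to_audit, word_counts, label_counts)
print(flagged)
```

Flagged reviews are candidates for manual inspection rather than automatic relabeling, since a confident classifier can still be wrong.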

The authors hope the study will raise awareness of data quality issues in hospitality and tourism research. “Our findings raise a number of questions regarding the existing approaches in research based on social media data,” says Xiang.

The study will likely lead to further research and, Fan hopes, to the development of tools that help other researchers, as well as the average consumer, make better use of this kind of social media data. The text classification algorithm could be refined to detect other travel purposes and used to build segmentation tools for targeted marketing.

– Dan Radmacher