

—
Understanding text data through NLP (Natural Language Processing) is a broad area that is receiving
renewed attention: for instance, sentiment analysis algorithms are very popular these days, as online
opinion has turned into a kind of virtual currency for businesses looking to market their products and
manage their reputations. Data quality challenges related to text include identifying misspelled words, managing synonym lists, recognizing abbreviations, accounting for industry-specific terminology (e.g., the word “stock” means very different things depending on the industry), and leveraging context to filter out noise in textual data and attach the correct meaning (e.g., for a company name, differentiating among Amazon the company, Amazon the river and Amazon the female warrior).
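To make these checks concrete, the following is a minimal sketch (not taken from any particular tool) of a text quality pass that flags likely misspellings against a domain vocabulary and normalizes known synonyms and abbreviations; the vocabulary, synonym list, and similarity cutoff are illustrative assumptions.

```python
# Illustrative sketch only: a tiny text quality pass that flags likely
# misspellings against a domain vocabulary and normalizes known synonyms
# and abbreviations. Vocabulary, synonym map, and cutoff are assumptions.
import difflib

DOMAIN_VOCAB = {"stock", "inventory", "warehouse", "shipment", "invoice"}
SYNONYMS = {"stocks": "stock", "whse": "warehouse"}  # managed synonym/abbreviation list

def clean_token(token: str, cutoff: float = 0.8) -> str:
    """Return a normalized token, or flag it when no close match exists."""
    t = token.lower()
    if t in DOMAIN_VOCAB:
        return t
    if t in SYNONYMS:
        return SYNONYMS[t]
    close = difflib.get_close_matches(t, DOMAIN_VOCAB, n=1, cutoff=cutoff)
    if close:                                # likely misspelling, e.g., "invoce"
        return close[0]
    return f"<UNRESOLVED:{token}>"           # leave for manual review

print([clean_token(t) for t in ["Stocks", "invoce", "whse", "Amazon"]])
# ['stock', 'invoice', 'warehouse', '<UNRESOLVED:Amazon>']
```

Context-dependent disambiguation (the “Amazon” example above) requires richer signals, such as surrounding words or linking against a knowledge base, and is beyond this sketch.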
Interestingly, sometimes these two topics are connected: a frequent problem is to identify whether a
reference in a social site (e.g., a Facebook identifier of a person) corresponds to an entity (e.g., a customer)
known in an organization’s big data repository or CRM database. In this case, extracting information from the text can help entity matching: for instance, a product name mentioned in a post about the author’s experience with that product can be extracted and used to match the author to a customer, provided the repository contains transactional information reflecting the purchase.
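As a rough illustration of this idea (the record layouts, the fuzzy name score, and the scoring boost are assumptions, not a method described here), a product mention extracted from a post can serve as extra evidence when ranking candidate customers:

```python
# Illustrative sketch only: use a product name extracted from a social post as
# extra evidence when matching the post's author to a known customer. Record
# layouts, the fuzzy name score, and the 0.5 boost are assumptions.
from difflib import SequenceMatcher

customers = [
    {"id": 1, "name": "Maria Garcia"},
    {"id": 2, "name": "M. Garcia"},
    {"id": 3, "name": "John Smith"},
]
transactions = [  # purchases already recorded in the repository
    {"customer_id": 2, "product": "AcmePhone X"},
    {"customer_id": 3, "product": "AcmeTablet"},
]

def match_author(author: str, extracted_product: str) -> dict:
    """Rank customers by name similarity, boosting those who bought the product."""
    buyers = {t["customer_id"] for t in transactions
              if t["product"].lower() == extracted_product.lower()}

    def score(c: dict) -> float:
        name_sim = SequenceMatcher(None, author.lower(), c["name"].lower()).ratio()
        return name_sim + (0.5 if c["id"] in buyers else 0.0)  # purchase evidence

    return max(customers, key=score)

post = {"author": "maria garcia", "text": "Loving my new AcmePhone X!"}
# Assume an upstream extractor pulled the product mention out of the post text.
print(match_author(post["author"], "AcmePhone X"))
# -> {'id': 2, 'name': 'M. Garcia'}
```

In this sketch the purchase evidence outweighs the slightly better literal name match, which is exactly the kind of signal the extracted product mention contributes.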
—
Applications dealing with continuous streams of data, e.g., measurements, or real-time customer engagement systems, should check this incoming data for quality in real time. If rates of change are very high, data quality techniques such as outlier detection and record matching cannot keep up, and users may end up with outdated and invalid information. On the first topic, data errors such as switching from packets/second to bytes/second in a measurement feed may cause significant changes in the distributions of the data being streamed in. Standard outlier-detection techniques must be made considerably more flexible to account for the variability of data feeds during normal operation, so as to avoid raising unnecessary alerts. The system reported in [21] builds simple statistical models over the most recently seen data (identified by a sliding window) to predict future trends and identifies outliers as significant deviations from the predictions. To ensure statistical robustness, these models are built over time-interval aggregated data rather than point-wise data. On the topic of record matching, when data sources are continuously evolving at speed, applying record matching from scratch for each update becomes unaffordable. In [22], incremental clustering techniques have been proposed to deal with change streams that include not only inserting a record, but also deleting or changing an existing record.
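By way of illustration only, and not as a description of the system in [21], the sliding-window idea can be sketched as follows: aggregate the stream into fixed time intervals, keep simple statistics over the most recent interval aggregates, and flag an interval whose aggregate deviates strongly from the recent mean; the window length and deviation threshold used here are assumptions.

```python
# Illustrative sketch only, not the system of [21]: aggregate a stream into
# fixed time intervals, keep a sliding window of the most recent interval
# aggregates, and flag intervals that deviate strongly from the recent mean.
# Window size and the 3-sigma threshold are assumptions for illustration.
from collections import deque
from statistics import mean, pstdev

WINDOW = 12      # number of recent interval aggregates kept in the window
THRESHOLD = 3.0  # deviations beyond this many standard deviations are flagged

def detect_outliers(interval_totals):
    """Yield (index, value) for interval aggregates that look anomalous."""
    window = deque(maxlen=WINDOW)
    for i, value in enumerate(interval_totals):
        if len(window) == WINDOW:
            mu, sigma = mean(window), pstdev(window)
            # Predict the next interval as the recent mean; flag large deviations.
            if sigma > 0 and abs(value - mu) > THRESHOLD * sigma:
                yield i, value
        window.append(value)

# Per-interval aggregates of a measurement feed; the sudden jump mimics a
# unit switch such as packets/second being reported as bytes/second.
feed = [100, 102, 98, 101, 99, 103, 100, 97, 102, 101, 99, 100, 100, 64000, 63800]
print(list(detect_outliers(feed)))  # -> [(13, 64000), (14, 63800)]
```

A production system would also need to keep confirmed outliers from contaminating the window and to maintain separate statistics per metric and aggregation interval; those concerns are omitted here.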
As with warehousing, profiling tools are very important for understanding data. With big data, however, it is not always straightforward to understand the content, as it is stored in raw form: for instance, the numbers you want to profile may have to be obtained through some extraction transform. Data volume may also become an issue, so sampling and summarization methods may have to be brought into play. It then becomes crucial to discover relationships among datasets, since big data repositories frequently lack schemas, as well as correlations among attributes. Data scientists may then select attributes (features, in their parlance) for their machine learning tasks (e.g., classification or anomaly detection), in which they try to come up with hypotheses that shed light on a real business opportunity. Profiling, again, is crucial to this selection activity.
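As a hedged sketch of what such profiling might look like in practice (the file name, sampling rate, column indices, and the choice of Pearson correlation are illustrative assumptions), one could sample a raw file, apply a small extraction transform to obtain numbers, and compute summary statistics and pairwise correlations to guide feature selection:

```python
# Illustrative sketch only: profile a raw, schema-less CSV-like dump by
# sampling rows, applying a small extraction transform (parsing numbers),
# and computing summary statistics plus pairwise correlations to guide
# feature selection. File name, sampling rate, and columns are assumptions.
import csv
import random
from itertools import combinations
from statistics import mean

def sample_rows(path, rate=0.01, seed=42):
    """Stream roughly `rate` of the rows without loading the whole file."""
    rng = random.Random(seed)
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if rng.random() < rate:
                yield row

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length numeric lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def profile(path, columns):
    """Print min/max/mean per column and correlations between column pairs."""
    data = {c: [] for c in columns}
    for row in sample_rows(path):
        try:
            parsed = {c: float(row[c]) for c in columns}  # extraction transform
        except (ValueError, IndexError):
            continue  # raw data: skip rows that do not parse cleanly
        for c, v in parsed.items():
            data[c].append(v)
    for c, vals in data.items():
        if vals:
            print(f"col {c}: n={len(vals)} min={min(vals)} max={max(vals)} mean={mean(vals):.2f}")
    for i, j in combinations(columns, 2):
        print(f"corr(col {i}, col {j}) = {pearson(data[i], data[j]):.2f}")

# Hypothetical usage: profile numeric columns 2 and 5 of a raw event dump.
# profile("events_raw.csv", columns=[2, 5])
```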