

—
Understanding text data through NLP (Natural Language Processing) is a broad area that is receiving
renewed attention: for instance, sentiment analysis algorithms are very popular these days, as online
opinion has turned into a kind of virtual currency for businesses looking to market their products and
manage their reputations. Data quality challenges related to text include identifying misspelled words, managing synonym lists, recognizing abbreviations, accounting for industry-specific terminology (e.g., the word “stock” means very different things depending on the industry), and leveraging context to filter out noise in textual data and attach the correct meaning (e.g., for a company name, differentiating among Amazon the company, Amazon the river and Amazon the female warrior).
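To make these checks concrete, the following is a minimal sketch (not taken from any particular tool) of a text quality pass that flags likely misspellings against a domain vocabulary and normalizes known synonyms and abbreviations; the vocabulary, synonym list, and similarity cutoff are illustrative assumptions.

```python
# Illustrative sketch only: a tiny text quality pass that flags likely
# misspellings against a domain vocabulary and normalizes known synonyms
# and abbreviations. Vocabulary, synonym map, and cutoff are assumptions.
import difflib

DOMAIN_VOCAB = {"stock", "inventory", "warehouse", "shipment", "invoice"}
SYNONYMS = {"stocks": "stock", "whse": "warehouse"}  # managed synonym/abbreviation list

def clean_token(token: str, cutoff: float = 0.8) -> str:
    """Return a normalized token, or flag it when no close match exists."""
    t = token.lower()
    if t in DOMAIN_VOCAB:
        return t
    if t in SYNONYMS:
        return SYNONYMS[t]
    close = difflib.get_close_matches(t, DOMAIN_VOCAB, n=1, cutoff=cutoff)
    if close:                                # likely misspelling, e.g., "invoce"
        return close[0]
    return f"<UNRESOLVED:{token}>"           # leave for manual review

print([clean_token(t) for t in ["Stocks", "invoce", "whse", "Amazon"]])
# ['stock', 'invoice', 'warehouse', '<UNRESOLVED:Amazon>']
```

Context-dependent disambiguation (the “Amazon” example above) requires richer signals, such as surrounding words or linking against a knowledge base, and is beyond this sketch.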
Interestingly, sometimes these two topics are connected: a frequent problem is to identify whether a
reference in a social site (e.g., a Facebook identifier of a person) corresponds to an entity (e.g., a customer)
known in an organization’s big data repository or CRM database. In this case, extracting information from the text can help entity matching: for instance, a product name mentioned in a post about the author’s experience with that product can be extracted and used to match the author to a customer, provided the repository contains transactional information reflecting the purchase.
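As a rough illustration of this idea (the record layouts, the fuzzy name score, and the scoring boost are assumptions, not a method described here), a product mention extracted from a post can serve as extra evidence when ranking candidate customers:

```python
# Illustrative sketch only: use a product name extracted from a social post as
# extra evidence when matching the post's author to a known customer. Record
# layouts, the fuzzy name score, and the 0.5 boost are assumptions.
from difflib import SequenceMatcher

customers = [
    {"id": 1, "name": "Maria Garcia"},
    {"id": 2, "name": "M. Garcia"},
    {"id": 3, "name": "John Smith"},
]
transactions = [  # purchases already recorded in the repository
    {"customer_id": 2, "product": "AcmePhone X"},
    {"customer_id": 3, "product": "AcmeTablet"},
]

def match_author(author: str, extracted_product: str) -> dict:
    """Rank customers by name similarity, boosting those who bought the product."""
    buyers = {t["customer_id"] for t in transactions
              if t["product"].lower() == extracted_product.lower()}

    def score(c: dict) -> float:
        name_sim = SequenceMatcher(None, author.lower(), c["name"].lower()).ratio()
        return name_sim + (0.5 if c["id"] in buyers else 0.0)  # purchase evidence

    return max(customers, key=score)

post = {"author": "maria garcia", "text": "Loving my new AcmePhone X!"}
# Assume an upstream extractor pulled the product mention out of the post text.
print(match_author(post["author"], "AcmePhone X"))
# -> {'id': 2, 'name': 'M. Garcia'}
```

In this sketch the purchase evidence outweighs the slightly better literal name match, which is exactly the kind of signal the extracted product mention contributes.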
—
Applications dealing with continuous streams of data, e.g., measurements, or real-time customer engagement systems, should check this incoming data for quality in real time. If rates of change are very high, data quality techniques such as outlier detection and record matching cannot keep up, and users may end up with outdated and invalid information. On the first topic, data errors such as switching from packets/second to bytes/second in a measurement feed may cause significant changes in the distributions of the data being streamed in. Standard outlier-detection techniques must be made considerably more flexible to account for the variability of data feeds during normal operation, so as to avoid raising unnecessary alerts. The system reported in [21] builds simple statistical models over the most recently seen data (identified by a sliding window) to predict future trends and identifies outliers as significant deviations from the predictions. To ensure statistical robustness, these models are built over time-interval aggregated data rather than point-wise data. On the topic of record matching, when data sources are continuously evolving at speed, applying record matching from scratch for each update becomes unaffordable. In [22], incremental clustering techniques have been proposed to deal with change streams that include not only inserting a record, but also deleting or changing an existing record.
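By way of illustration only, and not as a description of the system in [21], the sliding-window idea can be sketched as follows: aggregate the stream into fixed time intervals, keep simple statistics over the most recent interval aggregates, and flag an interval whose aggregate deviates strongly from the recent mean; the window length and deviation threshold used here are assumptions.

```python
# Illustrative sketch only, not the system of [21]: aggregate a stream into
# fixed time intervals, keep a sliding window of the most recent interval
# aggregates, and flag intervals that deviate strongly from the recent mean.
# Window size and the 3-sigma threshold are assumptions for illustration.
from collections import deque
from statistics import mean, pstdev

WINDOW = 12      # number of recent interval aggregates kept in the window
THRESHOLD = 3.0  # deviations beyond this many standard deviations are flagged

def detect_outliers(interval_totals):
    """Yield (index, value) for interval aggregates that look anomalous."""
    window = deque(maxlen=WINDOW)
    for i, value in enumerate(interval_totals):
        if len(window) == WINDOW:
            mu, sigma = mean(window), pstdev(window)
            # Predict the next interval as the recent mean; flag large deviations.
            if sigma > 0 and abs(value - mu) > THRESHOLD * sigma:
                yield i, value
        window.append(value)

# Per-interval aggregates of a measurement feed; the sudden jump mimics a
# unit switch such as packets/second being reported as bytes/second.
feed = [100, 102, 98, 101, 99, 103, 100, 97, 102, 101, 99, 100, 100, 64000, 63800]
print(list(detect_outliers(feed)))  # -> [(13, 64000), (14, 63800)]
```

A production system would also need to keep confirmed outliers from contaminating the window and to maintain separate statistics per metric and aggregation interval; those concerns are omitted here.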
As with warehousing, profiling tools are very important for understanding data. With big data, however, it is not always straightforward to understand the content, as it is stored in raw form: for instance, the numbers you want to profile may have to be obtained through some extraction transform. Data volume may also become an issue, so sampling and summarization methods may have to be brought into play. It then becomes crucial to discover relationships among datasets, since big data repositories frequently lack schemas, as well as correlations among attributes. Data scientists may then select attributes (features, in their parlance) for their machine learning tasks (e.g., classification or anomaly detection), in which they try to come up with hypotheses that shed light on a real business opportunity. Profiling, again, is crucial to this selection activity.
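As a hedged sketch of what such profiling might look like in practice (the file name, sampling rate, column indices, and the choice of Pearson correlation are illustrative assumptions), one could sample a raw file, apply a small extraction transform to obtain numbers, and compute summary statistics and pairwise correlations to guide feature selection:

```python
# Illustrative sketch only: profile a raw, schema-less CSV-like dump by
# sampling rows, applying a small extraction transform (parsing numbers),
# and computing summary statistics plus pairwise correlations to guide
# feature selection. File name, sampling rate, and columns are assumptions.
import csv
import random
from itertools import combinations
from statistics import mean

def sample_rows(path, rate=0.01, seed=42):
    """Stream roughly `rate` of the rows without loading the whole file."""
    rng = random.Random(seed)
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if rng.random() < rate:
                yield row

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length numeric lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def profile(path, columns):
    """Print min/max/mean per column and correlations between column pairs."""
    data = {c: [] for c in columns}
    for row in sample_rows(path):
        try:
            parsed = {c: float(row[c]) for c in columns}  # extraction transform
        except (ValueError, IndexError):
            continue  # raw data: skip rows that do not parse cleanly
        for c, v in parsed.items():
            data[c].append(v)
    for c, vals in data.items():
        if vals:
            print(f"col {c}: n={len(vals)} min={min(vals)} max={max(vals)} mean={mean(vals):.2f}")
    for i, j in combinations(columns, 2):
        print(f"corr(col {i}, col {j}) = {pearson(data[i], data[j]):.2f}")

# Hypothetical usage: profile numeric columns 2 and 5 of a raw event dump.
# profile("events_raw.csv", columns=[2, 5])
```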