Brief description: 

The Language Processing use case will focus on processing large volumes of messages from social media, such as Twitter, in order to perform semantic information extraction, sentiment analysis, summarization, interpretation and organization of content. The analysis extracts phrases with specific syntactic forms from each tweet. The process uses a number of different dictionary types that store a diverse range of information, from word lists (vocabularies) to complex network structures expressing syntactic patterns. These dictionaries provide the hints with which each tweet (or arbitrary text) is marked. The execution involves critical and complex algorithms (word proximity, fuzzy matching, etc.) that are invoked on each new text, so they need to be accelerated in order to become as efficient and scalable as possible.
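As a rough illustration of how dictionaries can provide hints for marking a text, the following minimal Python sketch tags each token of a tweet with the names of the dictionaries it appears in. The dictionary names, their entries, and the functions `tokenize` and `mark` are hypothetical and chosen only for this example; the actual engines use far richer dictionary types and matching algorithms.

```python
import re

# Hypothetical dictionaries: names and entries are illustrative only.
DICTIONARIES = {
    "positive": {"great", "love", "excellent"},
    "negative": {"bad", "awful", "delay"},
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def mark(text, dictionaries=DICTIONARIES):
    """Return (token, labels) pairs for every token found in a dictionary.

    `labels` is the sorted list of dictionary names containing the token.
    Tokens that hit no dictionary are skipped.
    """
    hints = []
    for token in tokenize(text):
        labels = sorted(
            name for name, words in dictionaries.items() if token in words
        )
        if labels:
            hints.append((token, labels))
    return hints


if __name__ == "__main__":
    for token, labels in mark("Love the hotel, but the flight delay was awful!"):
        print(token, labels)
```

Exact set membership is used here only to keep the sketch short; in the use case this lookup step is where fuzzy matching and proximity algorithms would apply.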

Main Features: 

Processing of unstructured data (text) is widely used to extract knowledge from articles and messages, including social media. It is applied in several business domains to support various types of operations in which sentiment analysis and opinion mining are of significant importance (e.g., tourism, marketing, press, etc.). In the financial sector, the processing of online text and order streams is useful for correlating financial news with trade facts, especially in the domain of fraud detection. Extreme time constraints on the execution of such algorithms make it challenging for them to achieve their real-time business goals.

Areas of Application: 

Finance, Tourism, Marketing, ICT, Politics

Market Trends and Opportunities: 

Biometric authentication using facial recognition is fast becoming a mainstream method of authenticating customers for high-value transactions, such as the creation of bank accounts, the issuing of travel visas and unmanned border crossing by pre-registered users. Such processes are coupled with tight SLAs to ensure the best possible user experience. E2Data will both optimize the cost base of the platform and automate the performance optimization of code, something that until now has required skilled and expensive developers.

Customer Benefits: 

Higher efficiency in the exploitation of available hardware will lead to reduced customer costs. Dynamic adaptation of the code execution to the available hardware can ensure that SLAs are being met, and customers enjoy a high level of services with minimized latency.

Technological novelty: 

Three of the most important NLP engine types incorporate functions that will be accelerated. These engines share two characteristics: a) they use a type of dictionary that is static and constant, and b) they work in stream mode, i.e. the engine is fed with input (words or texts) and returns answers. The engine types that will be accelerated are:

1. Lexicographical fuzzy matching search in vocabularies using a Directed Acyclic Word Graph (DAWG), which is a deterministic acyclic finite state automaton that can be accessed with regular expressions. Levenshtein distances between the dictionary words and the input words can also be used.
2. Statistical fuzzy matching and classification applied to multiword expressions and/or documents, using cosine similarity or TF-IDF applied on words or q-grams. Okapi BM25, which is similar to the TF-IDF algorithm for ranking documents, is also a candidate for acceleration through the E2Data platform.
3. Fuzzy matching of multiword expressions using Compressed Tries. Compressed Tries are used as indexes to various types of lexicons (morphological, terminological, syntactical, etc.).

The E2Data platform will be stressed with demanding and realistic datasets: high volumes of data will be injected at high rates and must be processed fast enough to provide the required real-time response.
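To make the first two engine types more concrete, here is a minimal Python sketch of Levenshtein-based vocabulary lookup and of cosine similarity over TF-IDF weighted character q-grams. It is an illustration under stated assumptions, not the project's implementation: the vocabulary is scanned linearly instead of being indexed by an automaton or trie, the smoothed IDF formula is one common variant chosen for this example, and all function names and sample strings are hypothetical.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def fuzzy_lookup(word, vocabulary, max_distance=1):
    """Vocabulary entries within max_distance edits of word (linear scan)."""
    return sorted(v for v in vocabulary if levenshtein(word, v) <= max_distance)


def qgrams(text, q=3):
    """Character q-grams of the lowercased text, padded with '#'."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def tfidf_vectors(documents, q=3):
    """One sparse TF-IDF vector (dict) per document, over q-grams.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
    """
    counts = [Counter(qgrams(doc, q)) for doc in documents]
    n = len(documents)
    df = Counter(gram for c in counts for gram in c)
    idf = {g: math.log((1 + n) / (1 + df[g])) + 1.0 for g in df}
    return [{g: tf * idf[g] for g, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    print(fuzzy_lookup("twiter", ["twitter", "tweet", "litter"]))
    vecs = tfidf_vectors(["bank account", "bank acount", "travel visa"])
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The linear scan in `fuzzy_lookup` costs one distance computation per vocabulary entry. Reducing that cost on large static dictionaries is what index structures such as automata and tries are for.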



Sotiris Diamantopoulos
United Kingdom of Great Britain and Northern Ireland
BDV Reference categories: 
Data processing architectures
Financial and insurance activities
Information Service activities
Readiness Level: 
Natural Language
Artificial Intelligence
social media
unstructured data