Academic Publications
Feature Selection for Spectral Clustering: to Help or Not to Help Spectral Clustering when Performing Sense Discrimination for IR?
Open Computer Science, Volume 8, Issue 1, p. 218–227.
Keywords: word sense discrimination; information retrieval; query disambiguation; spectral clustering
About The Publication
Whether or not word sense disambiguation (WSD) can improve information retrieval (IR) results represents a topic that has been intensely debated over the years, with many inconclusive or contradictory findings. The most rarely used type of WSD for this task is the unsupervised one, although it has been proven to be beneficial at a large scale. Our study builds on existing research and tries to improve the most recent unsupervised method, which is based on spectral clustering. It investigates the possible benefits of “helping” spectral clustering through feature selection when it performs sense discrimination for IR. Results obtained so far, involving large data collections, encourage us to point out the importance of feature selection even in the case of this advanced, state-of-the-art clustering technique that is known for performing its own feature weighting. By suggesting an improvement of what we consider the most promising approach to the usage of WSD in IR, and by commenting on its possible extensions, we argue that WSD still holds promise for IR and hope to stimulate the continuation of this line of research, perhaps at an even more successful level.
Statistical Analysis to Establish the Importance of Information Retrieval Parameters
Journal of Universal Computer Science, Consortium J.UCS, Special Issue Information Retrieval and Recommendation, Vol. 21 N. 13 (2015), p. 1767-1789.
Keywords: Information Retrieval, query difficulty, query clustering, IR system parameters, Random Forest
About The Publication
Search engines are based on models to index documents, match queries and documents, and rank documents. Research in Information Retrieval (IR) aims at defining these models and their parameters in order to optimize the results. Using benchmark collections, it has been shown that there is no single best system configuration that works for every query, but rather that performance varies from one query to another. It would be interesting if a meta-system could decide which system configuration should process a new query by learning from the context of previous queries. This paper reports a deep analysis considering more than 80,000 search engine configurations applied to 100 queries and the corresponding performance. The goal of the analysis is to identify which configuration responds best to a certain type of query. We considered two approaches to define query types: one is post-evaluation, based on query clustering according to the performance measured with Average Precision, while the second approach is pre-evaluation, using query features (including query difficulty predictors) to cluster queries. Globally, we identified two parameters that should be optimized: the retrieving model and the TrecQueryTags process. One could expect such results, as these two parameters are major components of the IR process. However, our work results in two main conclusions: (1) based on the post-evaluation approach, we found that the retrieving model is the most influential parameter for easy queries, while the TrecQueryTags process is for hard queries; (2) for pre-evaluation, current query features do not allow clustering queries in a way that identifies differences in the influential parameters.
Word Sense Discrimination in Information Retrieval: A Spectral Clustering-based Approach
Information Processing & Management, Elsevier, Vol. 51, p. 16-31
Keywords: Information retrieval, Word sense disambiguation, Word sense discrimination, Spectral clustering, High precision
About The Publication
Word sense ambiguity has been identified as a cause of poor precision in information retrieval (IR) systems. Word sense disambiguation and discrimination methods have been defined to help systems choose which documents should be retrieved in relation to an ambiguous query. However, the only approaches that show a genuine benefit for word sense discrimination or disambiguation in IR are generally supervised ones. In this paper we propose a new unsupervised method that uses word sense discrimination in IR. The method we develop is based on spectral clustering and reorders an initially retrieved document list by boosting documents that are semantically similar to the target query. For several TREC ad hoc collections we show that our method is useful in the case of queries which contain ambiguous terms. We are interested in improving the level of precision after 5, 10 and 30 retrieved documents (P@5, P@10, P@30) respectively. We show that precision can be improved by 8% above current state-of-the-art baselines. We also focus on poorly performing queries.
Word Sense Disambiguation to Improve Precision for Ambiguous Queries
Central European Journal of Computer Science, Versita, co-published with Springer Verlag, London, UK, Vol. 2 N. 4, p. 398-411
Keywords: Information retrieval, Word sense disambiguation, Naïve bayes classification, Difficult queries, Ambiguous queries, Document clustering, Fusion functions
About The Publication
Success in Information Retrieval (IR) depends on many variables. Several interdisciplinary approaches try to improve the quality of the results obtained by an IR system. In this paper we propose a new way of using word sense disambiguation (WSD) in IR. The method we develop is based on Naïve Bayes classification and can be used both as a filtering and as a re-ranking technique. We show on the TREC ad-hoc collection that WSD is useful in the case of queries which are difficult due to sense ambiguity. We focus on improving precision after 5, 10 and 30 retrieved documents (P@5, P@10, P@30), respectively, for such lowest-precision queries.
Prédire l’intensité de contradiction dans les commentaires : faible, forte ou très forte ? (to appear)
Revue d'Intelligence Artificielle (RIA 2020)
Prédire l’intensité de contradiction dans les commentaires : faible, forte ou très forte ?
Le Bulletin de l'Association Française pour l'Intelligence Artificielle (AFIA 2019)
Keywords: Sentiment analysis, Aspects detection, Criteria evaluation, Contradiction intensity
About The Publication
Reviews on web resources (e.g. courses, movies) are increasingly exploited in text analysis tasks (e.g. opinion detection, controversy detection). This paper investigates contradiction intensity in reviews, exploiting different features such as variation of ratings and variation of polarities around specific entities (e.g. aspects, topics). Firstly, aspects are identified according to the distributions of emotional terms in the vicinity of the most frequent nouns in the reviews collection. Secondly, the polarity of each review segment containing an aspect is estimated. Only resources containing these aspects with opposite polarities are considered. Finally, several features are evaluated, using feature selection algorithms, to determine their impact on the effectiveness of contradiction intensity detection. The selected features are used to train several state-of-the-art learning approaches. The experiments are conducted on the Massive Open Online Courses data set containing 2,244 courses and their 73,873 reviews, collected from coursera.org. Results showed that variation of ratings, variation of polarities, and reviews quantity are the best predictors of contradiction intensity. Also, J48 was the most effective learning approach for this type of classification.
Fair Exposure of Documents in Information Retrieval: a Community Detection Approach
CIRCLE2020
Keywords: Information systems, Information retrieval, Fair document exposure, Document network, Document communities, Document re-ranking
About The Publication
While (mainly) designed to answer users’ needs, search engines and recommendation systems do not necessarily guarantee the exposure of the data they store and index, even though such exposure can be essential for information providers. A recent research direction, the so-called “fair” exposure of documents, tackles this problem in information retrieval. It has mainly been cast as a re-ranking problem with constraints and optimization functions. This paper presents the first steps toward a new framework for fair document exposure. This framework is based on document linking and document community detection; communities are used to rank the documents to be retrieved according to an information need. In addition to the first step of this new framework, we present its potential through both a toy example and a few illustrative examples from the 2019 TREC Fair Ranking Track data set.
DeepNLPF: A Framework for Integrating Third Party NLP Tools
LREC2020
Keywords: Natural Language Processing, NLP tools integration, Framework
About The Publication
Natural Language Processing (NLP) of textual data is usually broken down into a sequence of several subtasks, where the output of one of the subtasks becomes the input to the following one, which constitutes an NLP pipeline. Many third-party NLP tools are currently available, each performing distinct NLP subtasks. However, it is difficult to integrate several NLP toolkits into a pipeline due to many problems, including different input/output representations or formats, distinct programming languages, and tokenization issues. This paper presents DeepNLPF, a framework that enables easy integration of third-party NLP tools, allowing the user to preprocess natural language texts at the lexical, syntactic, and semantic levels. The proposed framework also provides an API for complete pipeline customization, including the definition of input/output formats, integration plugin management, transparent multiprocessing execution strategies, corpus-level statistics, and database persistence. Furthermore, the DeepNLPF user-friendly GUI allows its use even by non-expert NLP users. We conducted a runtime performance analysis showing that DeepNLPF not only easily integrates existing NLP toolkits but also significantly reduces runtime compared to executing the same NLP pipeline sequentially.
The R2I_LIS Team Proposes Majority Vote for VarDial’s MRC Task
Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, co-located with NAACL 2019 (VarDial2019 @ NAACL2019)
Keywords: Dialect classification, Feature engineering, Majority vote, Competition
About The Publication
This article presents the model that generated the runs submitted by the R2I_LIS team to the VarDial2019 evaluation campaign, more particularly, to the binary classification by dialect sub-task of the Moldavian vs. Romanian Cross-dialect Topic identification (MRC) task. The team proposed a majority-vote-based model, combining five supervised machine learning models trained on forty manually crafted features. One of the three submitted runs was ranked second at the binary classification sub-task, with a performance of 0.7963 in terms of macro-F1 measure. The other two runs were ranked third and fourth, respectively.
On the Use of Dependencies in Relation Classification of Text with Deep Learning
20th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing2019)
Keywords: Dependencies, Relation Classification, Deep Learning, Word Embedding, Compositional Word Embedding
About The Publication
Deep Learning is increasingly used in NLP tasks, such as relation classification of texts. This paper assesses the impact of syntactic dependencies in this task at two levels. The first level concerns the generic Word Embedding (WE) used as input to the classification model; the second level concerns the corpus whose relations have to be classified. Two classification models are studied: the first one is based on a CNN using a generic WE and does not take into account the dependencies of the corpus to be processed, while the second one is based on a compositional WE combining a generic WE with syntactic annotations of the corpus to classify. The impact of dependencies in relation classification is estimated using two different WEs. The first one is essentially lexical and trained on the English Wikipedia corpus, while the second one is also syntactic, trained on the same corpus previously annotated with syntactic dependencies. The two classification models are evaluated on the SemEval 2010 reference corpus using these two generic WEs. The experiments show the importance of taking dependencies into account at different levels in relation classification.
Query Performance Prediction Focused on Summarized Letor Features
41st International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR2018
Keywords: Query performance prediction, Query difficulty prediction, Query features, Post retrieval features, Letor features
About The Publication
Query performance prediction (QPP) aims at automatically estimating information retrieval system effectiveness for any user’s query. Previous work has investigated several types of pre- and post-retrieval query performance predictors; the latter have been shown to be more effective. In this paper we investigate the use of features that were initially defined for learning to rank in the task of QPP. While these features have been shown to be useful for learning to rank documents, they have never been studied as query performance predictors. We developed more than 350 variants of them based on summary functions. Conducting experiments on four TREC standard collections, we found that Letor-based features appear to be better query performance predictors than those from the literature. Moreover, we show that combining the best Letor features outperforms the state-of-the-art query performance predictors. This is the first study that considers such an amount and variety of Letor features for QPP and demonstrates that they are appropriate for this task.
Predicting Contradiction Intensity: Low, Strong or Very Strong?
41st International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR2018
Keywords: Sentiment, Aspect, Feature evaluation, Contradiction intensity
About The Publication
Reviews on web resources (e.g. courses, movies) are increasingly exploited in text analysis tasks (e.g. opinion detection, controversy detection). This paper investigates contradiction intensity in reviews, exploiting different features such as variation of ratings and variation of polarities around specific entities (e.g. aspects, topics). Firstly, aspects are identified according to the distributions of emotional terms in the vicinity of the most frequent nouns in the reviews collection. Secondly, the polarity of each review segment containing an aspect is estimated. Only resources containing these aspects with opposite polarities are considered. Finally, several features are evaluated, using feature selection algorithms, to determine their impact on the effectiveness of contradiction intensity detection. The selected features are used to train several state-of-the-art learning approaches. The experiments are conducted on the Massive Open Online Courses data set containing 2,244 courses and their 73,873 reviews, collected from coursera.org. Results showed that variation of ratings, variation of polarities, and reviews quantity are the best predictors of contradiction intensity. Also, J48 was the most effective learning approach for this type of classification.
Challenges to knowledge organization in the era of social media. The case of social controversies
15th International ISKO Conference, ISKO2018
Keywords: controversy mediation, social media, Twitter, post-truth, social capital, societal challenges to knowledge organization
About The Publication
In this paper, we look at how social media, in particular Twitter, are used to trigger, propagate and regulate opinions and social controversies. Social media platforms are displacing the mainstream media and traditional sources of knowledge by facilitating the propagation of ideologies and causes championed by different groups of people. This results in pressure being brought to bear on institutions in the real world, which are forced to make hasty decisions based on social media campaigns. The new forms of activism and the public arena enabled by social media platforms have also facilitated the propagation of so-called “post-truth” and “alternative facts” that obfuscate the traditional processes of knowledge elaboration, which took decades to develop. This poses serious challenges for Knowledge Organization systems (KOS) that the KO community needs to find ways to address.
Contradiction in Reviews: is it Strong or Low?
40th European Conference on Information Retrieval (ECIR 2018), BroDyn workshop
Keywords: sentiment analysis, aspect detection, contradiction intensity
About The Publication
Analysis of opinions (reviews) generated by users is increasingly exploited by a variety of applications. It makes it possible to follow the evolution of opinions or to carry out investigations on web resources (e.g. courses, movies, products). The detection of contradictory opinions is an important task when evaluating such resources. This paper focuses on the problem of detecting and estimating contradiction intensity based on sentiment analysis around specific aspects of a resource. Firstly, certain aspects are identified according to the distributions of emotional terms in the vicinity of the most frequent nouns across the reviews. Secondly, the polarity of each review segment containing an aspect is estimated using the state-of-the-art approach SentiNeuron. Then, only the resources containing these aspects with opposite polarities (positive, negative) are considered. Thirdly, a measure of contradiction intensity is introduced, based on the joint dispersion of the polarity and the rating of the reviews containing the aspects within each resource. The evaluation of the proposed approach is conducted on the Massive Open Online Courses collection containing 2,244 courses and their 73,873 reviews, collected from Coursera. The results revealed the effectiveness of the proposed approach to detect and quantify contradictions.
Les réactions des stakeholders aux allégations d’irresponsabilité organisationnelle : le cas du scandale Volkswagen
12ème congrès du RIODD
Finding and Quantifying Temporal-Aware Contradiction in Reviews
The 13th Asia Information Retrieval Societies Conference, AIRS2017
Keywords: Sentiment analysis, Aspect detection, Contradiction intensity
About The Publication
Opinions (reviews) on web resources (e.g. courses, movies), generated by users, are increasingly exploited in text analysis tasks, the detection of contradictory opinions being one of them. This paper focuses on the quantification of sentiment-based contradictions around specific aspects in reviews. It is necessary, however, to study contradictions with respect to the temporal dimension of reviews (their sessions). In general, for web resources such as online courses (e.g. Coursera or edX), reviews are often generated during course sessions. Between sessions, users stop reviewing courses, and courses may be updated. So, in order to avoid confusing contradictory reviews coming from two or more different sessions, the reviews related to a given resource should first be grouped according to their corresponding session. Secondly, aspects are identified according to the distributions of emotional terms in the vicinity of the most frequent nouns in the reviews collection. Thirdly, the polarity of each review segment containing an aspect is estimated. Then, only resources containing these aspects with opposite polarities are considered. Finally, contradiction intensity is estimated based on the joint dispersion of polarities and ratings of the reviews containing aspects. The experiments are conducted on the Massive Open Online Courses data set containing 2,244 courses and their 73,873 reviews, collected from coursera.org. The results confirm the effectiveness of our approach to find and quantify contradiction intensity.
Role of social media in propagating controversies: the case of cultural microblog feeds
The 8th Conference and Labs of the Evaluation Forum, Microblog Cultural Contextualization lab, CLEF2017
Keywords: Focus IR, opinion mining, information visualization
About The Publication
The aim of this research is to investigate how social media mediate social controversies in the public arena. For that, we will use the CLEF MC2 corpus of microblogs that captured long term political and cultural controversies in order to follow the birth and development of controversies across time and pinpoint the increasing role that social media play in their propagation, regulation and resolution.
Harnessing Ratings and Aspect-Sentiment to Estimate Contradiction Intensity in Temporal-Related Reviews
21st International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, KES2017
Keywords: Sentiment Analysis, Aspect Extraction, Rating, Review, Time, Contradiction Intensity
About The Publication
Analysis of opinions (reviews) generated by users is increasingly exploited by a variety of applications. It makes it possible to follow the evolution of opinions or to carry out investigations on products. The detection of contradictory opinions about a web resource (e.g. courses, movies, products) is an important task when evaluating that resource. This paper focuses on the problem of detecting contradictions in reviews based on sentiment analysis around specific aspects of a resource (document). In general, for web resources such as online courses (e.g. on Coursera or edX), reviews are often generated during course sessions. Between sessions, users stop reviewing the course, and the course may be updated. So, in order to avoid confusing contradictory reviews coming from two or more different sessions, the reviews related to a given resource should first be grouped according to their session. Secondly, certain aspects are extracted according to the distributions of emotional terms in the vicinity of the most frequent nouns in the reviews collection. Thirdly, the polarity of each review segment containing an aspect is identified. Then, only the resources containing these aspects with opposite polarities (positive, negative) are retained. Finally, we propose a measure of contradiction intensity based on the joint dispersion of the polarity and the rating of the reviews containing the aspects within each resource. The evaluation of our approach is conducted on the Massive Open Online Courses (MOOC) collection containing 2,244 courses and their 73,873 reviews, collected from Coursera. The results of the experiments revealed the effectiveness of the proposed approach to capture and quantify contradiction intensity.
Human-Based Query Difficulty Prediction
39th European Conference on Information Retrieval, ECIR 2017
Keywords: Free Text, Query Term, Free Text Comment, Human Annotator, Query Suggestion
About The Publication
The purpose of an automatic query difficulty predictor is to decide whether an information retrieval system is able to provide the most appropriate answer for a given query. Researchers have investigated many types of automatic query difficulty predictors. These are mostly related to how search engines process queries and documents: they are based on the inner workings of searching/ranking system functions, and therefore they do not provide any really insightful explanation as to the reasons for the difficulty, and they neglect user-oriented aspects. In this paper we study whether humans can provide useful explanations, or reasons, of why they think a query will be easy or difficult for a search engine. We run two experiments with variations in the TREC reference collection, the amount of information available about the query, and the method of annotation generation. We examine the correlation between the human prediction, the reasons they provide, the automatic prediction, and the actual system effectiveness. The main findings of this study are twofold. First, we confirm the result of previous studies stating that human predictions correlate only weakly with system effectiveness. Second, and probably more important, after analyzing the reasons given by the annotators we find that: (i) overall, the reasons seem coherent, sensible, and informative; (ii) humans have an accurate picture of some query or term characteristics; and (iii) yet, they cannot reliably predict system/query difficulty.
SegChainW2V: Towards a generic automatic video segmentation framework, based on lexical chains of audio transcriptions and word embeddings
20th International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, KES2016
Keywords: Video retrieval, Story segmentation, Lexical chains, Word embeddings, Transcriptions
About The Publication
With the advances in multimedia broadcasting through a rich variety of channels and with the popularization of video production, it becomes essential to be able to provide reliable means of retrieving information within videos, not only the videos themselves. Research in this area has largely focused on the context of TV news broadcasts, for which the structure itself provides clues for story segmentation. The systematic employment of these clues would lead to thematically driven systems that would not be easily adaptable to videos of other types. Such systems are therefore dependent on the type of videos for which they have been designed. In this paper we introduce SegChainW2V, a generic unsupervised framework for story segmentation, based on lexical chains from transcriptions and their vectorization. SegChainW2V takes into account topic changes by perceiving the fluctuations of the most frequent terms throughout the video, as well as their semantics through word embedding vectorization.
SegChain: Towards a generic automatic video segmentation framework, based on lexical chains of audio transcriptions
6th International Conference on Web Intelligence, Mining and Semantics (WIMS'2016)
Keywords: Video retrieval, Story segmentation, Lexical chains, Transcriptions
About The Publication
With the advances in multimedia broadcasting through a rich variety of channels and with the popularization of video production, it becomes essential to be able to provide reliable means of retrieving information within videos, not only the videos themselves. Research in this area has largely focused on the context of TV news broadcasts, for which the structure itself provides clues for story segmentation. The systematic employment of these clues would lead to thematically driven systems that would not be easily adaptable to videos of other types. Such systems are therefore dependent on the type of videos for which they have been designed. In this paper we introduce SegChain, a generic unsupervised framework for story segmentation, based on lexical chains from transcriptions. SegChain takes into account topic changes by perceiving the fluctuations of the most frequent terms throughout the video.
DeShaTo: Describing the Shape of Cumulative Topic Distributions to Rank Retrieval Systems without Relevance Judgments
Symposium on String Processing and Information Retrieval (SPIRE 2015)
Keywords: information retrieval, topic modeling, LDA, document topic distribution, skewness, kurtosis, ranking retrieval systems
About The Publication
This paper investigates an approach for estimating the effectiveness of any IR system. The approach is based on the idea that a set of documents retrieved for a specific query is highly relevant if there are only a small number of predominant topics in the retrieved documents. The proposed approach is to determine the topic probability distribution of each document offline, using Latent Dirichlet Allocation. Then, for a retrieved set of documents, a set of probability distribution shape descriptors, namely the skewness and the kurtosis, are used to compute a score based on the shape of the cumulative topic distribution of the respective set of documents. The proposed model is termed DeShaTo, which is short for Describing the Shape of cumulative Topic distributions. In this work, DeShaTo is used to rank retrieval systems without relevance judgments. In most cases, the empirical results are better than the state-of-the-art approach. Compared to other approaches, DeShaTo works independently for each system. Therefore, it remains reliable even when there are fewer systems to be ranked by relevance.
Prédire l’intensité de contradiction dans les commentaires : faible, forte ou très forte ?
29es journées francophones d'Ingénierie des Connaissances (IC2018)
Keywords: Sentiment analysis, Aspect detection, Criteria evaluation, Contradiction intensity. (2nd best paper)
National Conference Papers
Prédire l’intensité de contradiction dans les commentaires : faible, forte ou très forte ?
About The Publication
Reviews of Web resources (e.g. courses, films) are increasingly exploited in text-analysis tasks (e.g. opinion detection, controversy detection). This article studies contradiction intensity in reviews by exploiting various criteria, such as the variation of ratings and the variation of polarities around specific entities (e.g. aspects, topics). First, aspects are identified according to the distributions of emotional terms in the vicinity of the most frequent nouns in the review collection. Second, the polarity of each review segment containing an aspect is estimated. Only resources whose reviews contain aspects with opposite polarities are considered. Finally, the criteria are evaluated, using attribute-selection algorithms, to determine their impact on the effectiveness of contradiction-intensity detection. The selected criteria are then fed into learning models to predict contradiction intensity. The experimental evaluation is carried out on a collection of 2244 courses and their 73873 reviews, collected from coursera.org. The results show that the variation of ratings, the variation of polarities, and the number of reviews are the best predictors of contradiction intensity. Moreover, J48 is the most effective learning approach for this task.
Détection de contradiction dans les commentaires
COnférence en Recherche d'Information et Applications (CORIA2017)
Keywords: Sentiment analysis, User generated content, Contradiction
National Conference Papers
Détection de contradiction dans les commentaires
About The Publication
Abstract:
The analysis of opinions (reviews) generated by users is increasingly exploited by a variety of applications. It makes it possible to follow the evolution of opinions or to carry out product surveys. Detecting contradictory opinions about a Web resource (e.g., courses, movies, products) is an important task for evaluating that resource. In this paper, we focus on the problem of detecting contradictions, and measuring their intensity, based on sentiment analysis around aspects specific to a resource (document). First, we identify aspects according to the distributions of emotional terms in the vicinity of the most frequent nouns across all reviews. Second, we estimate the polarity of each review segment containing an aspect. We then keep only the resources containing these aspects with opposite polarities (positive, negative). Third, we introduce a measure of contradiction intensity based on the joint dispersion of the polarity and the rating of the reviews containing the aspects within each resource. We evaluate the effectiveness of our approach on a Massive Open Online Courses (MOOC) collection containing 2244 courses and their 73873 reviews, collected from Coursera. Our results show that the proposed approach captures contradictions effectively.
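The intensity measure can be illustrated by a minimal sketch, assuming each review segment mentioning an aspect has already been assigned a polarity and carries the review's rating. The `contradiction_intensity` function and the product of variances are hypothetical stand-ins for the paper's joint-dispersion measure.

```python
from statistics import pvariance

def contradiction_intensity(reviews):
    """`reviews`: list of (polarity, rating) pairs for one aspect of
    one resource; polarity in [-1, 1], rating e.g. on a 1..5 scale.
    Intensity grows with the joint dispersion of both signals:
    opposite polarities AND divergent ratings => strong contradiction."""
    polarities = [p for p, _ in reviews]
    ratings = [r for _, r in reviews]
    return pvariance(polarities) * pvariance(ratings)
```

Reviews that agree (similar polarities, similar ratings) score near zero, while a resource splitting reviewers into opposed camps scores high.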
MyBestQuery: A serious game to collect manual query reformulation
Colloque Veille Stratégique Scientifique et Technologique (VSST 2016), Rabat (Morocco)
Keywords: Information retrieval, Query reformulation, Serious game, Human annotation
National Conference Papers
MyBestQuery: A serious game to collect manual query reformulation
About The Publication
This paper presents MyBestQuery, a serious game designed to collect query reformulations from players. Query reformulation is a hot topic in information retrieval and covers many aspects. One of them is query reformulation analysis based on users’ sessions: it can be used to understand users’ intent or to measure their satisfaction with the results obtained when querying the search engine. Automatic query reformulation is another aspect: it automatically expands the initial user query in order to improve the quality of the retrieved document set. This mechanism relies on document analysis but could also benefit from the analysis of manually reformulated queries. Web search engines collect millions of search sessions and possible query reformulations; as academics, we can hardly access this information. MyBestQuery is designed as a serious game in order to collect the various reformulations users suggest. The longer-term objective of this work is to analyse manually produced query reformulations and compare them with automatically produced ones. Preliminary results are reported in this paper.
MyBestQuery : un jeu sérieux pour apprendre des utilisateurs
Conférence francophone en Recherche d'Information et Applications (CORIA 2016), Toulouse
Keywords: Serious game, Crowdsourcing, User study, Search engine, Query annotation, User assistance
National Conference Papers
MyBestQuery : un jeu sérieux pour apprendre des utilisateurs
About The Publication
Abstract:
MyBestQuery is a serious game designed to collect information about queries submitted to a search engine: (i) the player's prediction of the query difficulty; (ii) possible reasons for this difficulty; (iii) suggested reformulations.
La prédiction efficace de la difficulté des requêtes : une tâche impossible ?
Conférence francophone en Recherche d'Information et Applications (CORIA 2015), Paris
Keywords: Information retrieval, query difficulty predictor, data mining, evaluation
National Conference Papers
La prédiction efficace de la difficulté des requêtes : une tâche impossible ?
About The Publication
Abstract:
Search engines return answers whatever the user query is, but some queries are more difficult than others for the system. For difficult queries, ad hoc treatments must be applied. Predicting query difficulty is therefore crucial, and various predictors have been proposed. In this paper, we revisit these predictors. First, we check that the predictors are not statistically redundant. Then, we show that the correlation between predictor values and system performance gives little hope of these predictors being truly effective. Finally, we study the ability of predictors to predict query difficulty classes, relying on a variety of exploratory and learning methods. We show that despite the (low) correlations observed with performance measures, current predictors are not robust enough to be used in practical IR applications.
Performance Analysis of Information Retrieval Systems
Spanish Conference on Information Retrieval, Coruna
Keywords: Information Retrieval, Classification, Query difficulty, Optimization, Random Forest, Adaptive Information Retrieval
National Conference Papers
Performance Analysis of Information Retrieval Systems
About The Publication
It has been shown that there is no single best information retrieval system configuration that works for every query; rather, performance can vary from one query to another. It would be interesting if a meta-system could decide which system should process a new query by learning from previously submitted queries. This paper reports a deep analysis of more than 80,000 search engine configurations applied to 100 queries and the corresponding performance. The goal of the analysis is to identify which search engine configuration responds best to a certain type of query. We considered two approaches to defining query types: one clusters queries according to their performance (their difficulty), while the other clusters queries using various query features (including query difficulty predictors). We identified two parameters that should be optimized first. An important outcome is that we could not obtain strongly conclusive results; considering the large number of systems and methods we used, this suggests that current query features do not fit the optimization problem.
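The meta-system idea, learning from past queries which configuration to apply, can be sketched as a nearest-neighbour router. The `route` function, the feature vectors, and the configuration names below are illustrative assumptions, not the paper's actual Random Forest analysis.

```python
def route(query_vec, past):
    """Pick a configuration for a new query: find the most similar
    past query (squared Euclidean distance over its feature vector,
    e.g. difficulty predictors) and reuse the configuration that
    performed best on it.
    `past`: list of (feature_vector, {config_name: performance})."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    feats, perfs = min(past, key=lambda q: dist(q[0], query_vec))
    return max(perfs, key=perfs.get)
```

With two past queries where different (hypothetical) configurations won, a new query is routed to whichever configuration dominated its neighbourhood.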
Expansion sélective de requêtes par apprentissage
Conférence francophone en Recherche d'Information et Applications (CORIA 2014), Nancy, France
Keywords: Selective information retrieval, Difficulty predictors, Query difficulty, Query expansion, Machine learning
National Conference Papers
Expansion sélective de requêtes par apprentissage
About The Publication
Abstract:
Query expansion (QE) improves retrieval quality on average, even though it can dramatically decrease performance for certain queries. This observation drives the trend to propose selective approaches that choose the best function to apply for each query. Most selective approaches use a learning process on past query features and results. This paper presents a new selective QE method that relies on query difficulty predictors, combining statistically and linguistically based predictors. The decision model is learned by an SVM. We demonstrate the efficiency of the proposed method on standard TREC benchmarks. The supervised learning models classified the test queries with more than 90% accuracy. Our approach improves MAP by more than 11%, compared to non-selective methods.
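The decision step, expand or not based on difficulty predictors, can be sketched with a tiny linear classifier. The paper uses an SVM over statistical and linguistic predictors; here a simple perceptron stands in for it, and the feature names and labels are hypothetical.

```python
def train_perceptron(X, y, epochs=50, lr=0.1):
    """Tiny linear classifier (perceptron) standing in for the SVM:
    decide, from difficulty-predictor features, whether query
    expansion should be applied (label 1) or not (label 0)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0
            err = yi - pred                      # 0 when correct
            w = [wj + lr * err * xj for wj, xj in zip(w, xi)]
            b += lr * err
    return lambda x: 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

# Hypothetical features per query: (mean IDF, ambiguity score);
# label 1 means expansion helped on that past query.
```

Once trained on past queries, the returned function acts as the selective switch in front of the expansion module.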
Prédire la difficulté des requêtes : la combinaison de mesures statistiques et sémantiques
Conférence francophone en Recherche d'Information et Applications (CORIA 2013), Neuchatel, Suisse
Keywords: Information Retrieval, performance prediction, query difficulty, query ambiguity, combined predictors, measure correlation
National Conference Papers
Prédire la difficulté des requêtes : la combinaison de mesures statistiques et sémantiques
About The Publication
Abstract:
The performance of an Information Retrieval System (IRS) is closely related to the query. Queries that lead to retrieval failure are referred to in the literature as “difficult queries”. This study aims at analysing, adapting and combining several difficulty predictors. The evaluation of the prediction is based on the correlation between the predicted difficulty and the actual IRS performance. As predictors, we considered an ambiguity predictor, the IDF measure and a score distribution measure. We show that combining the proposed predictors produces good results. The evaluation framework consists of the TREC7 and TREC8 ad hoc collections.
Vers une personnalisation des environnements d’apprentissages à l’expérience émotionnelle de l’apprenant
ORPHEE RDV 2017, Font Romeu (France)
Workshop: Réalités mixtes, virtuelles et augmentées pour l'apprentissage : perspectives et challenges pour la conception, l'évaluation et le suivi
Position Papers
Vers une personnalisation des environnements d’apprentissages à l’expérience émotionnelle de l’apprenant
About The Publication
A learner's emotions play a decisive role in learning, strongly influencing their cognitive abilities (Lafortune et al., 2004; Cuisinier and Pons, 2011). Today, one of the major challenges for learning environments is to integrate a form of emotional intelligence (Mayer et al., 2001) that can automatically adapt learning to the learner's emotions (Harley et al., 2015; Ochs and Frasson, 2004). The issues underlying the creation of an “emotionally intelligent” learning environment are shared with Affective Computing (Picard, 2003):
- automatic emotion recognition;
- managing the user's emotions;
- expressing emotions through interactive systems (e.g. via the verbal and non-verbal behaviour of virtual characters or humanoid robots).
In this position paper, we focus on the first two points: recognising and managing the user's emotions. The objective is to model the learner's emotional experience (understanding the causes and effects of their emotions during the learning process) in order to adapt learning to the learner's automatically detected emotions and optimise knowledge acquisition. The underlying research questions and directions are described in the following section.
Presentation, Thesis research & SegChainW2V: Towards a Generic Automatic Video Segmentation Framework, based on Lexical Chains of Audio Transcriptions and Word Embeddings
Seminar (Séminaire d'accueil des enseignants-chercheurs de la FEG)
Other Publications