Enhancing low-level features with mid-level cues

Local features have become an essential tool in visual recognition. Much of the progress in computer vision over the past decade has built on simple, local representations such as SIFT or HOG. SIFT in particular shifted the paradigm in feature representation. Subsequent works have often focused on improving either computational efficiency or invariance properties. This thesis belongs to the latter group.

Invariance is particularly relevant when working with dense features. The traditional approach to sparse matching is to rely on stable interest points, such as corners, where scale and orientation can be reliably estimated, which enforces invariance. Dense features, by contrast, must be computed at arbitrary points. Dense features have been shown to outperform sparse matching techniques in many recognition problems, and form the bulk of our work.

In this thesis we present strategies to enhance low-level, local features with mid-level, global cues. We devise techniques to construct better features, and use them to handle complex ambiguities, occlusions and background changes. To deal with ambiguities, we explore the use of motion to enforce temporal consistency with optical flow priors. We also introduce a novel technique to exploit segmentation cues, and use it to extract features invariant to background variability. For this, we downplay image measurements most likely to belong to a region different from that where the descriptor is computed. In both cases we follow the same strategy: we incorporate mid-level, "big picture" information into the construction of local features, and proceed to use them in the same manner as we would the baseline features.

We apply these techniques to different feature representations, including SIFT and HOG, and use them to address canonical vision problems such as stereo and object detection, demonstrating that the introduction of global cues yields consistent improvements. We prioritize solutions that are simple, general, and efficient.

Our main contributions are as follows:
(a) An approach to dense stereo reconstruction with spatiotemporal features, which unlike existing works remains applicable to wide baselines.
(b) A technique to exploit segmentation cues to construct dense descriptors invariant to background variability, such as occlusions or background motion.
(c) A technique to integrate bottom-up segmentation with recognition efficiently, amenable to sliding window detectors.
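The idea of downplaying measurements that likely belong to a different region can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes a per-pixel soft segmentation embedding is already available, and the exponential weighting, the parameter `lam`, and both function names are hypothetical choices made for illustration only.

```python
import numpy as np

def segmentation_weights(embedding, center, lam=1.0):
    """Soft mask for one descriptor (illustrative assumption).

    embedding: (H, W, D) array of per-pixel segmentation embeddings.
    center:    (row, col) of the pixel where the descriptor is computed.
    Pixels whose embedding is far from the center's get a weight near 0.
    """
    cy, cx = center
    dist = np.linalg.norm(embedding - embedding[cy, cx], axis=-1)
    return np.exp(-lam * dist)

def masked_orientation_histogram(grad_mag, grad_ori, weights, n_bins=8):
    """Gradient-orientation histogram with each vote scaled by the mask."""
    two_pi = 2.0 * np.pi
    bins = np.floor((grad_ori % two_pi) / two_pi * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(),
                       weights=(grad_mag * weights).ravel(),
                       minlength=n_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

# Toy 4x4 patch: left half is one region, right half another.
emb = np.zeros((4, 4, 1))
emb[:, 2:, 0] = 1.0
mag = np.ones((4, 4))
ori = np.zeros((4, 4))
ori[:, 2:] = np.pi / 2 + 0.1  # the right region has a different gradient direction

w = segmentation_weights(emb, center=(1, 1), lam=10.0)
h_masked = masked_orientation_histogram(mag, ori, w)
h_plain = masked_orientation_histogram(mag, ori, np.ones((4, 4)))
```

In this toy patch the unmasked histogram mixes both regions equally, while the masked one is dominated by the region containing the center pixel. The masked histogram is then used exactly as the baseline one would be, which matches the strategy stated in the abstract.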

Universitat Politècnica de Catalunya

Eduard Trulls Fortuny

2023
