Visualizing Traces of the Urban Flâneur with Computer Vision and Generative Digital Media
Design Lab, Faculty of Architecture, Design, and Planning, The University of Sydney, Australia
Reference this essay: Hespanhol, Luke. “Visualizing Traces of the Urban Flâneur with Computer Vision and Generative Digital Media.” In Leonardo Electronic Almanac 22, no. 1, edited by Lanfranco Aceti, Paul Thomas, and Edward Colless. Cambridge, MA: LEA / MIT Press, 2017.
Published Online: May 15, 2017
Published in Print: To Be Announced
This essay describes a digital media artwork that utilizes computer vision and generative digital effects to capture the immaterial character of crowd dynamics in public spaces. The process allows for the emergence of a double-layered audiovisual impression of population movement through the city, and includes: (1) a series of short videos shot at various urban spaces and digitally manipulated in real time to register the flow of pedestrians, and (2) a gallery exhibition of the collective footage collated as an interactive movie strip. The outcome of an art and science residency, the artwork refashions surveillance media, video art, and painting by offering traces of pedestrians’ daily walking activities in public spaces, re-rendered artistically in real time. By doing so, it proposes a new idiom for the visualization of patterns of human movement in the digitally augmented city. In isolation, each movie offers a high level of transparent immediacy, positioning the audience at the center of the space, albeit hidden behind the camera’s window frame. At the gallery, however, the collective footage attains a high degree of ‘hypermediacy,’ simultaneously highlighting the hedonic character of the interactive medium itself while offering a critique of the surveillance practices that are increasingly common in contemporary society.
Generative art, computer vision, crowd dynamics, cloud aesthetics
In this essay, I refer to the first execution and exhibition of Milieu Reverb. This project was conceived as part of my artist residency at Culture at Work, an organization based in Sydney, Australia, that promotes works at the intersection of art and science. Milieu Reverb consists of a series of short videos recorded in public, urban spaces and reworked through digital media. Computer-vision algorithms are used to detect movements of people in real time and generate an artistic impression of their movement through predefined locations in the city. It brings to the fore issues such as surveillance, social behavior, crowd dynamics, and the effective utilization of public space.
To create Milieu Reverb, videos were recorded at different locations and times of the day (figure 1). As people walked across the public space, the tracking software picked up their movements. Each time of the day entailed different visual and sonic aesthetics. Each moving section of the image was assigned a colorful view frame; once moved, such view frames left permanent marks on the image, akin to long-exposure photographs. Over time, patterns of movement emerged from the cumulative stains in the video record, revealing the underlying movement pattern of the local population as they traversed public space.
The videos were collected daily and contributed content to a generative digital art exhibition at Culture at Work's Accelerator Gallery. A depth-view-camera-based interface allowed visitors to play various movies concurrently and tap into the continuously evolving soundtrack of the city, inadvertently composed each day by the city’s own inhabitants.
Milieu Reverb stems from my desire to remediate [1] the aesthetics of surveillance technologies. This remediation is made possible by anonymously monitoring the movement of citizens at predefined urban locations and converting these movements into audiovisual expressions that enable hidden patterns of movement behavior to be visualized.
We do not make use of public spaces freely. Control over our behavior in the street is forged and regulated by forces manifested both at the interpersonal and sociopolitical levels. Spaces, albeit public, are naturally still subject to the law, as well as social conventions that are tacitly enforced [2,3] due to fear of social embarrassment and risk of damaging our social acceptability. [4] The fact that a space is public does not mean that it is free. Indeed, rules regulating what behaviors are and are not acceptable in and around such public space are determined by factors such as its purpose, social context, cultural priming, and even its spatial layout. [5]
The scope of our freedom while in public is very narrow as we continuously and rapidly negotiate the use of the space with our fellow citizens. Social islands emerge, whereby groups of acquaintances congregate in particular spots with suitable affordances that somehow facilitate a social agenda shared amongst them. Likewise, we tend to establish paths of movement across the space, using them routinely and avoiding areas of threat or discomfort. Over time, the repeated utilization of particular routes creates an emerging de facto way of moving across the space, characterizing what in the field of urban planning has been called ‘desired paths’ or ‘desired lines’: a worn path showing where people naturally walk, arguably representing the ultimate expression of human desire or natural purpose. [6] In natural environments, desired paths are often revealed, for example, as emerging walking trails through grasslands or snowfields. Another example is the skid marks created by cars braking or turning on specific sections of roads. Similar patterns created by pedestrians walking through urban environments have notably been identified by researchers like William H. Whyte, who conducted a series of seminal field observations on the behavior of people in public spaces. [7] It is the physical public space and how we use it that truly bring us together as citizens.
Besides architectural constraints, other factors play their role in the modern city to promote or harness freedom of movement. On the one hand, top-down strategies for enforcing acceptable behavior in public spaces are ostensive and often easily identifiable; in addition to official signage communicating rules and regulations, for example, public surveillance of shared urban locations is also increasingly ubiquitous. On the other hand, with the increasing availability and affordability of cameras and sensors, as well as the advent of cloud-based services and the Internet of Things, [8] we live in an era where do-it-yourself (DIY) surveillance, also known as sousveillance [9,10], has become commonplace, giving rise to bottom-up interpersonal mutual regulation. In public space, it is almost certain that we are continuously watched as much by our peers as by official agencies. Yet not all tracking exercises need to be coercive to accomplish their objectives; likewise, not all tracking goals represent breaches of individual privacy. Admittedly, disclosing the tracking practices employed by crowd-behavior studies would make the process rather artificial, heavily compromising the results. In those circumstances, however, the object of interest is not necessarily an individual, but rather the collective of people. Therefore, tracking under such circumstances assumes a much less threatening verve. As Christian Fuchs puts it, “there are other examples of information gathering that are oriented on care, benefits, solidarity, aid, and co-operation. I term such processes monitoring.” [11]
Milieu Reverb explicitly downplays actual surveillance practices by blurring the video image to the point of making people unidentifiable. Conversely, it refines the impression left by people moving through the city, revealing otherwise invisible urban desired paths as if people were brushes drawing permanent lines across the video image, staining the city canvas for posterior analysis. Through this process, people disappear, but their traces remain. The movies, created and edited in real time, therefore retain the immediacy of the moment they were recorded. At the same time, rearranging the movies as a continuous media ribbon in a gallery setting, with videos playing simultaneously side-by-side and responding to movement from the visitors themselves, creates a multilayered field of observation wherein attention is drawn as much to the movies as to the ribbon interface itself. ‘Hypermediacy,’ or the conscious and purposeful utilization of media as content, [12] is thus expressed through the curated exhibition of the original movies, sharpening the qualities of each video through contrast with the others playing at the same time. As Jay David Bolter and Richard Grusin point out, it is typical of new digital media to “oscillate between immediacy and hypermediacy, between transparency and opacity,” which “inevitably leads us to become aware of the new medium as a medium. Thus, immediacy leads to hypermediacy.” [13]
Hypermediated visualizations of human movement across cities are in fact an emerging field, especially empowered by the spread of technologies like computer vision and GPS tracking. Yet current practices tend to favor a macro vision of the city, mapping people’s changing positions over time as colored lines on a city map. For example, a recent study mapped the movements of 169 young individuals in the city of Aalborg, Denmark. [14] Each individual was given a GPS unit by the researchers, who then tracked their locations around the city for a week. The results show the locations where the young people spent most of their time (e.g., shopping areas), as well as the main routes leading through and between urban areas. Similarly, data collected from the smartphone-tracking fitness app Human [15] has been used to create emerging maps of the city drawn from the accumulation of the positions of individual app users during their sport activities (e.g., walking, running, or cycling), collected and collated over time. [16] With users distributed worldwide, the company was able to create comprehensive and accurate visualizations of many global cities.
Milieu Reverb’s goal is, on one level, to filter the traces left by citizens in regular urban situations. Making these traces visible, the work becomes an aesthetic expression of the tacit patterns of movement through the city environment to which we collectively and unconsciously subscribe: our invisible urban desired paths. In this sense, it is related to the GPS-based traces emerging from pedestrians’ movements described above, albeit smaller in scope and far less precise. However, it is distinctive of the Milieu Reverb proposition that (1) each visualization happens at the level of a specific urban site in order to facilitate the remediation of surveillance cameras, and that (2) the movement produces audio as ‘urban music’ composed in real time by real people moving through the city. Additionally, a second level of appreciation is created for the gallery audience, immersed in the experience enacted through the movies and made aware of the ever-present interface for retrieving such an experience. As spectator-actors, visitors are continuously reminded of their role as co-creators of the broader experience of appreciating and affecting the reality there described.
Generative Video Art
The videos created for the first instance of Milieu Reverb resulted from impromptu recordings of people moving through preselected urban spaces, transformed in real time by computer-vision and video-editing algorithms. Recordings were performed with a webcam connected to a laptop running a custom software application.
Audio and Video Generation from Urban Movement
At its core, the Milieu Reverb recording software employed background subtraction (a well-known computer-vision technique) to detect movement on each frame received from the live camera feed. Background subtraction works by taking each frame as a pixel matrix and then subtracting, pixel by pixel, the average value each pixel had on previous frames. As a consequence, regions of the image where no movement occurs keep the same pixel values across frames, so the subtraction amounts to zero (i.e., a black pixel). Conversely, for regions where movement does occur, the new pixel will have a non-zero value, which is then marked as white. The resulting image, therefore, represents a black-and-white map of the movement observed in the scene. It is important to point out that thresholds are used so that even pixels that are not identical across frames will be considered equivalent if chromatically similar enough. This helps to compensate, for example, for subtle variations in lighting conditions.
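The thresholded subtraction described above can be illustrated with a minimal sketch in plain Python. This is not the artwork's Processing/OpenCV code: the function names, the blending factor `alpha`, and the threshold of 25 are assumptions chosen for illustration, and the frame is reduced to a tiny grayscale grid.

```python
def update_background(background, frame, alpha=0.1):
    """Blend the new frame into a running-average background model.

    alpha is an assumed blending factor: higher values forget the past faster.
    """
    return [
        [(1 - alpha) * b + alpha * f for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


def movement_map(background, frame, threshold=25):
    """Return 255 (white) where a pixel differs from the background by more
    than the threshold, and 0 (black) everywhere else."""
    return [
        [255 if abs(f - b) > threshold else 0 for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


# A static 3x4 scene, then a frame with a lighting flicker and one moving pixel.
background = [[100.0] * 4 for _ in range(3)]
frame = [
    [100, 102, 100, 100],  # +2: small lighting change, below the threshold
    [100, 100, 220, 100],  # +120: treated as movement
    [100, 100, 100, 100],
]

mask = movement_map(background, frame)
background = update_background(background, frame)
```

In this example only the pixel that changed by 120 is marked white; the flicker of 2 stays black because it falls under the threshold.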
Once a single frame is composed within the movement map, it is stored in the software memory, and the generation of the corresponding final frame starts. This is a twofold task that consists of generating both digital imagery and audio to match the movement detected. Although responding to the same input, visual and audial generation each follow a different logic. With video, the goal is to “stain” the screen over time, such that a trace of the movement is created and stays visible on the final footage, revealing individual desired paths, as discussed above. This is achieved by keeping another image in memory, representing a movement canvas that is empty when the application starts and gets updated every frame with the current movement detected, but is never reset. For every frame, continuous regions of movement are marked as hollow rectangles with thick, colorful edges “staining” the canvas. Since this canvas is never reset between frames, it reflects over time the accumulated movement since the recording started. Moreover, the stains get more pronounced in areas where repeated movement has occurred, therefore highlighting patterns of crowd movement as they emerge.
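A minimal sketch of this staining step, under simplifying assumptions, might look as follows. Continuous regions are found with a 4-connected flood fill, and the canvas stores an integer stain count per pixel rather than color; both choices, and the names `bounding_boxes` and `stain`, are illustrative and not taken from the original software.

```python
def bounding_boxes(mask):
    """Find 4-connected regions of non-zero pixels in a movement map and
    return one (top, left, bottom, right) box per region."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == 0 or seen[r][c]:
                continue
            # Flood-fill this region, growing its bounding box as we go.
            stack = [(r, c)]
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while stack:
                y, x = stack.pop()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] != 0 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def stain(canvas, box, amount=1):
    """Add to the outline of the box only, leaving its interior untouched
    (a hollow rectangle). The canvas is modified in place and never cleared."""
    top, left, bottom, right = box
    for y in range(top, bottom + 1):
        for x in range(left, right + 1):
            if y in (top, bottom) or x in (left, right):
                canvas[y][x] += amount


# The same 3x3 blob of movement is seen on two consecutive frames.
mask = [
    [0, 0, 0, 0, 0],
    [0, 255, 255, 255, 0],
    [0, 255, 255, 255, 0],
    [0, 255, 255, 255, 0],
    [0, 0, 0, 0, 0],
]
canvas = [[0] * 5 for _ in range(5)]
for _ in range(2):
    boxes = bounding_boxes(mask)
    for box in boxes:
        stain(canvas, box)
```

Because the canvas is only ever added to, the outline visited twice ends up with a stain count of 2, which is the sense in which repeated movement produces more pronounced stains.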
Generative audio, on the other hand, is intended to convey a real-time impression of the level of movement. For that, I adopted a 1:1 mapping between each rectangle rendered in the final movie frame and a short audio tune, such that every movement detected on the camera stream causes a tune to play. The tune itself is also distorted in pitch depending on where on the screen the movement occurs, lending it a spatial property. From the perspective of a person watching the final movie, it should be as if the people moving through the public space are generating the sounds themselves, immediately. In addition to these, a more generic and upbeat percussive tune plays whenever more than two rectangles are produced, increasing in volume the longer the movement persists, thus denoting the emergence of crowd movement in the target public space.
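The mapping from rectangles to sound can be sketched roughly as below. The essay does not specify how screen position affects pitch or how quickly the percussive volume grows, so the linear horizontal mapping, the rate range of 0.5 to 2.0, and the per-frame volume step are all assumptions; the actual synthesis was done with the Beads library and is not reproduced here.

```python
def pitch_for(box, frame_width, low=0.5, high=2.0):
    """Playback-rate multiplier for the tune triggered by one rectangle.

    Assumption: the horizontal centre of the box is mapped linearly from
    `low` (left edge of the frame) to `high` (right edge).
    """
    top, left, bottom, right = box
    centre_x = (left + right) / 2
    return low + (high - low) * centre_x / (frame_width - 1)


class CrowdPercussion:
    """Volume of the percussive layer: it rises while more than two
    rectangles are present and drops to silence otherwise."""

    def __init__(self, step=0.25, max_volume=1.0):
        self.step = step              # assumed volume gain per frame
        self.max_volume = max_volume
        self.volume = 0.0

    def update(self, box_count):
        if box_count > 2:
            self.volume = min(self.max_volume, self.volume + self.step)
        else:
            self.volume = 0.0
        return self.volume


# One tune per rectangle, pitched by position, on a 101-pixel-wide frame.
rate = pitch_for((0, 40, 10, 60), frame_width=101)

# Rectangle counts on four consecutive frames.
percussion = CrowdPercussion()
volumes = [percussion.update(n) for n in (3, 4, 3, 1)]
```

Here a rectangle centred in the frame plays at a rate of 1.25, and the percussive volume climbs over three crowded frames before falling silent when only one rectangle remains.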
Movies recorded at different times of day have different sets of visual and audial artifacts in order to provide clearer contrast between patterns of movement during the day. For this initial study, I adopted five different sets, as illustrated in figure 2.
After the movement canvas in memory is updated with the new rectangles, it is laid over a filtered version of the corresponding live-feed frame, resulting in the output frame. That is then drawn to the screen, accompanied by the audio generated in the process. Figure 3 illustrates the steps involved in rendering an edited frame in real time from the live feed.
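As a rough illustration of this compositing step, the sketch below blurs a grayscale frame with a 3x3 mean filter and then lays the stained canvas over it. The essay states only that the live frame is filtered (and, elsewhere, blurred); the specific filter, the rule that any stained pixel fully covers the frame, and the names `box_blur` and `compose` are assumptions.

```python
def box_blur(frame):
    """3x3 mean filter; pixels at the border average over the neighbours
    that actually exist."""
    rows, cols = len(frame), len(frame[0])
    blurred = []
    for y in range(rows):
        row = []
        for x in range(cols):
            values = [
                frame[ny][nx]
                for ny in range(max(0, y - 1), min(rows, y + 2))
                for nx in range(max(0, x - 1), min(cols, x + 2))
            ]
            row.append(sum(values) / len(values))
        blurred.append(row)
    return blurred


def compose(frame, canvas, stain_value=255):
    """Lay the movement canvas over the blurred frame: stained pixels show
    the stain, all other pixels show the blurred live feed."""
    blurred = box_blur(frame)
    return [
        [stain_value if canvas[y][x] > 0 else blurred[y][x]
         for x in range(len(frame[0]))]
        for y in range(len(frame))
    ]


frame = [
    [0, 0, 0],
    [0, 90, 0],
    [0, 0, 0],
]
canvas = [
    [1, 0, 0],  # a single stained pixel in the top-left corner
    [0, 0, 0],
    [0, 0, 0],
]
output = compose(frame, canvas)
```

The bright centre pixel is spread across its neighbours by the blur, while the stained corner is drawn at full intensity on top.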
The Milieu Reverb recording software application consists of a Processing [17] sketch running the OpenCV library [18] for manipulation of the live camera feed and the Beads library [19] for the generative audio synthesis. For its initial execution, the application ran on a MacBook Pro laptop connected to a USB webcam (see fig. 4). Such a lightweight setup was important to enable swift recording in a chosen location, with minimal disruption to its local dynamics. The intention was that, from the pedestrians’ perspectives, I would be no more noticeable than any ordinary individual casually working on their laptop in the city, thereby blending into the surrounding environment much like a security camera.
For this initial study, locations were chosen based on their physical layout, and more specifically with regard to the ways they allowed crowds to move through them. Preference was given to thoroughfares and plazas. For the purposes of this study, thoroughfares are defined as urban spaces with high pedestrian traffic of a transitional nature, usually not perceived as destinations themselves, but rather as conduits. Plazas are defined as large urban areas delimited by buildings on their peripheries and with minimal urban furniture and other obstacles in their centers. Thus, while thoroughfares essentially constrain crowd movement to two opposite directions, plazas are more accommodating and allow a distributed flow across their centers and along their edges.
Recording a Movie
Once a location is chosen, recording the movie itself is a very straightforward process, involving (1) setting the webcam and laptop in a stationary position facing the public space; (2) running the video generation software for approximately five minutes with screen capture activated; and then (3) stopping both, thereby recording the video on disk. For the sake of the integrity of the generative process and the possibility of comparison between the various movies, the recording duration was enforced. For the same reason, once the video was recorded by the Milieu Reverb video app, no further editing was done; the video stood as an immediate expression of the city at the particular moment it was conceived. Retaining such immediacy was crucial to ensure the conceptual strength of the pieces produced.
Interactive Gallery Exhibition
A second goal of the Milieu Reverb’s initial deployment was to produce a gallery installation for exhibiting the movies soon after they were recorded. More than displaying the movies individually, the intention is to display them collectively so that they can function as small city windows that visitors can peep through. Even more so, given their subject matter, placing the movies side-by-side naturally evokes the aesthetics of security rooms, where various monitors—or frames within the same monitor—display different views of public spaces, all while the people actually moving through them remain oblivious to the fact they are being observed.
I designed the installation as two sets of media ribbons running as projections onto opposite walls of the gallery. The ribbons display movies side-by-side and can be slid left or right in response to the viewer’s position in the room. In other words, visitors must walk across the room to move the media ribbons and consequently switch the channels (movies), which in turn display other people walking in the city. This yields a playful symmetry between the subjects in the movies and their spectators. Such a simple interactive mechanism is implemented with depth-view cameras used to track the position of individuals between 1.5 and 3 meters of each wall. The visible section of each media ribbon is twice as large as each movie, and can therefore potentially accommodate up to three movies simultaneously (either two full movie windows or one full movie window in the center with two other half windows, one at each side). Consequently, up to six movies can be played simultaneously in the gallery space, surrounding visitors with the emerging sounds of the city. Visitors are given the ability to ‘play the city’ by sliding the media ribbons sideways and thus selecting which ‘channels’ to watch. The sequence of videos on each ribbon is randomly established at the start of the application. Figure 5 shows the installation on each wall of the gallery.
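The sliding behavior can be sketched as a small geometry calculation. How the depth-camera reading is converted into a ribbon position is not described in the essay, so the normalized visitor coordinate, the linear mapping, and the function names below are assumptions; only the 1.5 to 3 meter tracking band and the viewport of two movie widths come from the text.

```python
import math


def ribbon_offset(visitor_x, depth_m, movie_count, movie_width,
                  current_offset, near=1.5, far=3.0):
    """Horizontal scroll offset of the ribbon, in pixels.

    visitor_x is the visitor's position along the wall, normalised to 0..1
    (an assumed representation). Visitors outside the tracked depth band
    leave the ribbon where it is.
    """
    if not (near <= depth_m <= far):
        return current_offset
    x = min(1.0, max(0.0, visitor_x))
    # The viewport is two movies wide, so the ribbon can scroll at most
    # (movie_count - 2) movie widths.
    max_offset = max(0, movie_count - 2) * movie_width
    return x * max_offset


def visible_movies(offset, movie_count, movie_width):
    """Indices of the movies that are at least partly inside the viewport."""
    first = int(offset // movie_width)
    end = math.ceil((offset + 2 * movie_width) / movie_width)
    return list(range(first, min(end, movie_count)))


# A ribbon of six movies, each 640 pixels wide.
aligned = ribbon_offset(0.5, 2.0, 6, 640, current_offset=0)
between = ribbon_offset(0.125, 2.0, 6, 640, current_offset=0)
too_far = ribbon_offset(0.9, 4.0, 6, 640, current_offset=aligned)

shown_aligned = visible_movies(aligned, 6, 640)
shown_between = visible_movies(between, 6, 640)
```

When the offset lands on a movie boundary, two full movies are shown; anywhere in between, three movies are partly visible, which matches the limit of three per ribbon described above.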
The resulting exhibition presents an intriguing interplay between two perspectives on the technology-mediated reality we ordinarily switch between: ‘augmented reality’ and ‘hypermediacy.’
In isolation, each movie naturally implies a high level of immediacy; the viewers see themselves immersed in the unfolding movement as if they were in situ, witnessing firsthand the events taking place. The stationary camera offers a static viewpoint for watching the scene, as if the viewers themselves were seated on a bench in the actual site, passively watching people walking by. This illusion of immediate reality is, however, augmented by the blurred image, the otherworldly color palette, and the progressive staining of regions where movement occurs, transforming the original camera footage into a richer visual expression generated in real time by real humans moving across a real urban location. That visual schema, combined with the generative audio, articulates what I call the ‘immediacy of the intangible.’ Just as clouds are condensed humidity, here we can perceive condensed movement. The images transmitted to the passive observer not only show individuals going by, but also their movement through space. Just like electrons around the nucleus of an atom, the crowd dynamics are traced as a ‘pedestrian cloud’ through an urban space. By analyzing the stained footage a posteriori, it is possible to tell—with a high degree of confidence—which specific zones within a given public space would have a high probability of accommodating walking humans (see fig. 6).
When observed from the macro perspective offered by the gallery exhibition, however, the artwork reveals a more complex construction of such an augmented reality. In fact, the augmentation itself is a direct outcome of the sequential layers of graphical edits applied over the original, raw live feed using digital media—just like layers in graphic software applications are used to organize the different visual elements in a scene. Each new media layer ‘augments’ the reality slightly further, resulting in a new view of reality, while still retaining a clear link to the original scene. Conversely, it also loses definition in the process, with the image becoming blurred and shedding its original colors. In this regard, Milieu Reverb creates five views of reality filtered through five layers of media, as illustrated in figure 7.
The first view of reality is the trivial one; that is, the reality of people walking through the public space itself. Application of the first media layer—the Milieu Reverb recording software—yields two additional views of reality: the raw camera footage and its filtered version, accompanied by generative sounds. Each individual movie thus results from the application of three media layers to the original scene, producing three compounded views of it. The installation at the gallery takes such a process further by applying a fourth media layer to it: the random sequencing of movies as a media ribbon, yielding yet another version of reality, zoomed out and fragmented into separate yet simultaneous scenes that evolve in parallel. A wider view of the urban reality then emerges. Finally, the interactive features of the installation represent a fifth media layer that enables an even more complex version of reality, now defined by audience participation (or lack thereof) and ultimately determining which ‘perspectives’ of reality are displayed by shifting some movies out of view and bringing others to the fore. This view of urban reality is thus characterized by its ability to be redefined and recombined by the gallery visitors.
Marc Hassenzahl and colleagues [20] have written extensively about the perceived features of interactive products, and have argued that these can be classified into two broad categories: pragmatic and hedonic. Pragmatic qualities are those related to the function or functionality of the product. Hedonic features, in turn, relate to the appreciation of the interfaces on their own, the pleasure their use evokes, and the emotional landscapes they set, thereby fulfilling so-called “be-goals” (e.g., to be admired, to be stimulated). [21] In other words, hedonic features are closely related to (and their perception is potentially a result of) the hypermediacy of the interface; that is, the awareness by the user/audience of its aesthetic and sensorial aspects, as well as the pleasure they derive from experiencing it. Analyzed from that perspective, it is precisely the hypermediacy made apparent in the gallery version of Milieu Reverb that highlights the hedonic nature of its interaction, here characterized as artistic appreciation. By interacting with the work, visitors are able to access various layers of content related to the movement of crowds in public spaces, all while moving through the semi-public space of the gallery themselves.
The content of the artwork—as presented in the gallery exhibition—is, in fact, a direct function of its structure. This is realized through ‘multilayered hypermediacy,’ here defined as a strategy for constructing a technology-mediated version of reality through the observation of the following properties: (1) the use of multiple simultaneous media channels, each conveying immediacy and transparency (in the case of Milieu Reverb, those channels are the movies created in real time); (2) simultaneous play of various channels, highlighting the individual properties of each by visually contrasting them (achieved in the Milieu Reverb exhibition through the reorganization of movies as media ribbons); and (3) selection of active channels by the audience (achieved in Milieu Reverb by physically sliding the media ribbon within the exhibition space).
It is precisely this coexistence of multiple media and media strategies within this installation—multiple, simultaneous ‘transparent channels’—that conveys the feeling of mass surveillance and sets the tone for the viewer experience. As in a surveillance control room, tracking footage taken from different locations around the city is displayed side-by-side, and it is up to the user/audience to select which is on, and which is off. Rather than tracking individuals, however, this is an observation of patterns of crowd dynamics, revealed through multiple layers of audiovisual media applied over the original footage. This multilayered hypermediacy, wherein the subject matter of the videos blends with and is made more powerful by the multiple media it is expressed through, constitutes the installation’s unique artistic statement.
This essay has discussed Milieu Reverb, an audiovisual-generative expression of pedestrian movement through urban public spaces. The work consists of a series of videos that are digitally manipulated in real time and a corresponding gallery installation that exhibits them simultaneously. Employing tracking techniques at its core, the artwork depersonalizes an otherwise familiar form—video monitoring of a public space—by blurring the original camera feed to make people unidentifiable, while retaining information about their movement through the urban space and highlighting it using a colorful visual language. The artwork is also innovative in its use of multilayered hypermediacy to provide transparency to multiple, simultaneously playing videos via an interface built primarily on hedonic interactive features. The random assemblage of movies available at the gallery exhibition, compounded by the possibility of interactive editing by the audience itself, represents an innovative approach to the augmentation of reality through digital technology.
References and Notes
[1] Jay David Bolter and Richard Grusin, Remediation: Understanding New Media (Cambridge, MA: MIT Press, 1998), 56-62.
[2] Erving Goffman, Behaviour in Public Places: Notes on the Social Organisation of Gatherings (New York, NY: The Free Press, 1963).
[3] Erving Goffman, The Presentation of Self in Everyday Life (Woodstock, NY: The Overlook Press, 1973).
[4] Julie Rico, Giulio Jacucci, Stuart Reeves, Lone Koefoed Hansen, and Stephen Brewster, “Designing for Performative Interactions in Public Spaces,” Proceedings of the UbiComp'10, Copenhagen, Denmark, September 26-29, 2010.
[5] Bill Hillier and Julienne Hanson, The Social Logic of Space (Cambridge: Cambridge University Press, 1984).
[6] Carl Myhill, “Commercial Success by Looking for Desire Lines,” Proceedings of the 6th Asia Pacific Computer-Human Interaction Conference, Rotorua, New Zealand, June 29-July 2, 2004.
[7] William H. Whyte, The Social Life of Small Urban Spaces (Washington, DC: The Conservation Foundation, 1980).
[8] Dieter Uckelmann, Mark Harrison, and Florian Michahelles, Architecting the Internet of Things (Berlin: Springer, 2011).
[9] Jan Fernback, “Sousveillance: Communities of Resistance to the Surveillance Environment,” Telematics and Informatics 30, no. 1 (2013): 11-21.
[10] Steve Mann, “Sousveillance: Inverse Surveillance in Multimedia Imaging,” Proceedings of the 12th Annual ACM International Conference on Multimedia, New York, NY, USA, October 10-16, 2004.
[11] Christian Fuchs, “New Media, Web 2.0 and Surveillance,” Sociology Compass 5, no. 2 (2011): 134-147.
[12] Marshall McLuhan, Understanding Media: The Extensions of Man (New York, NY: New American Library, Times Mirror, 1964).
[13] Jay David Bolter and Richard Grusin, Remediation: Understanding New Media (Cambridge, MA: MIT Press, 1998), 18.
[14] Henrik Harder, Akkelies van Nes, Anders Sorgenfri Jensen, Kristian Hegner Reinau, and Michael Weber, “Time Use and Movement Behaviour of Young People in Cities. The application of GPS tracking in tracing movement pattern of young people for a week in Aalborg,” Proceedings of the Eighth International Space Syntax Symposium, Santiago, Chile, January 3-6, 2012.
[15] Human Official Website, “Human - Activity & Calorie Tracker,” http://human.co/ (accessed November 15, 2014).
[16] “Smartphone App Human Draws Maps of Urban Movement,” Dezeen Magazine, July 13, 2014, http://www.dezeen.com/2014/07/13/human-app-maps-urban-movement-with-wearable-technology/ (accessed November 15, 2014).
[17] Processing Official Website, “Processing,” http://www.processing.org (accessed November 15, 2014).
[18] GitHub, “OpenCV for Processing. A Creative Coding Computer Vision Library Based on the Official OpenCV Java API,” https://github.com/atduskgreg/opencv-processing (accessed November 15, 2014).
[19] Beads Project, “Beads,” http://www.beadsproject.net (accessed November 15, 2014).
[20] Marc Hassenzahl, Axel Platz, Michael Burmester, and Katrin Lehner, “Hedonic and Ergonomic Quality Aspects Determine a Software’s Appeal,” Proceedings of the CHI 2000 Conference on Human Factors in Computing, The Hague, Netherlands, April 1-6, 2000.
[21] Marc Hassenzahl, Sarah Diefenbach and Anja Goritz, “Needs, Affect, and Interactive Products – Facets of User Experience,” Interacting with Computers 22 (2010): 353–362.
This project has been realized as part of an art residency at Culture at Work, Pyrmont, NSW, Australia. The exhibition of the videos was held at the Accelerator Gallery, Culture at Work, from 17th August to 6th September 2013. Additional support was received from the Design Lab – Faculty of Architecture, Design and Planning, University of Sydney, Australia. All movies conceived for this project were recorded in Sydney, Australia, between July and September 2013.
Dr. Luke Hespanhol is a media artist and design researcher based in Sydney, Australia. His practice investigates the potential of digital media to enable public expression and reflection on the relationship between individuals and the built environment. He has developed interactive installations for academic research, galleries, and public art festivals, including multiple editions of Vivid Sydney. Luke holds a PhD on interactive media architecture and the user-centred design of hybrid urban environments, is a guest researcher at the Department of Aesthetics and Communication at Aarhus University, Denmark, and currently teaches Design and Computation at the University of Sydney.