Overview
The world of consumer electronics is growing very fast. New devices, services, media, and delivery networks are constantly being introduced. Novel description methods for audiovisual content are necessary so that it can be used on different platforms. A common standard is important for this purpose, because the description should be platform independent. The MPEG-4 standard of the ISO Moving Picture Experts Group defines a collection of coding processes and principles which may bring us closer to meeting these demands. Only a few of the defined principles have been implemented so far. One important idea in this object-oriented framework is the coding of each audiovisual object as an element of a scene. Complete application chains are needed to demonstrate and verify such principles.
1. Introduction
MPEG-4 was passed as a standard for the coding of audiovisual objects in 1998. The Moving Picture Experts Group (MPEG) is a working group of the International Organization for Standardization (ISO). In the early 1990s, MPEG-1 and MPEG-2 became two very successful standards for coding video and associated audio. They are used, for example, for DVD and DVB.
MPEG-4 is not merely an improvement on MPEG-2. It defines novel approaches for coding aural and visual content. The main idea is to code aural, visual, or audiovisual content as separate media objects in a 2D or 3D scene. The coding of these objects depends on their origin. This means that, e.g., a chair can either be a real object captured by a camera or a synthetic one modelled on a graphics computer. Because of the many different processes defined in MPEG-4, it has become a very extensive standard.
Due to the extent of the standard, it is difficult to get an overview. Many catchwords are associated with MPEG-4, to name only a few:
Advanced Audio Coding (AAC), facial animation, low bit rate video codecs (exchange of video clips via the Internet)
Each catchword stands for a coding method. There are, of course, further approaches:
arbitrarily shaped video, 3D audio objects, synthetic visual objects (e.g. meshes)
Most of these processes for audiovisual objects are defined in
Part 2: Visual
and Part 3: Audio
of the MPEG-4 standard. Because of the number of different coding processes and approaches, the MPEG-4 standard is often called a toolbox. A successful MPEG principle is to define only the coded signal and how to decode it, but not how to encode it. Thus the standard can remain up to date for a long time, because encoder technology can be improved while the standard remains unaltered.
The description of the spatial and temporal relationships of these objects is laid down in
Part 1: Systems
of MPEG-4. Further parts cover, e.g., the reference software and the transmission via IP.
The relationships between the media objects can be defined using a special language. It includes definitions of the spatial arrangement and the temporal sequence as well as interaction functions for the user. Furthermore, the details of the coding method used and information about, e.g., the origin of the object can be stored in object descriptors. All these data are combined into a BIFS file (Binary Format for Scenes), which is explained in Part 1 of the MPEG-4 standard.
Fig. 1: Parts of MPEG-4
Different kinds of media objects may come together in a scene. This raises new questions about how the scene can be generated and represented. New procedures are necessary especially when natural and artificial objects are combined. This principle is called Synthetic Natural Hybrid Coding (SNHC).
The use of scenes (BIFS) with several media objects has not yet become common. At present, MPEG-4 is mostly used as a coding standard for only one type of media object. Applications using the scene concept are scarcely available, even though the standard is already some years old.
There are several benefits to using the MPEG-4 scene concept. These will become clear when the significance of complete application systems is explained in the next section.
2. Why Application Systems
The digital media chain is the starting point for the application systems under development.
The main idea is a content-adaptive and object-based description, beginning at the creation and ending at the presentation of the content in a 3D environment. Thus, innovative features such as individual views, presentation on a 3D display, and object-based interactions within the content become possible.
Reproduction and interactive use take place on the respective consumer devices. Scalability and reusability of the content on many platforms or for many productions are further aspects. This is practicable because of the standardized description.
The parts of the digital media chain are:
- Production
The authoring process comprises the design of a scene, the recording of natural objects, the design of synthetic objects, and the acquisition of additional data, e.g. metadata.
- Coding and Transmission
This part contains the data management, the coding of the media objects, and the adaptation to the transmission layer.
- Interactive Application
This part represents the hardware and the user interface on the user side, including a back channel.
Control over the complete media chain is necessary in order to demonstrate the benefits of the MPEG-4 Systems philosophy as realized in the targeted application system.
3. Aspects of Object Based Media Processing
In a conventional scene as produced for TV or cinema, a scriptwriter defines both the plot (the temporal sequence of events) and the viewpoint on the scenery (viewer's angle, distance, etc.). A director makes sure that these instructions are followed properly by the actors and that the cameraman chooses the right settings. There is more than a century of experience with these things at our disposal.
In contrast, interactivity calls for new ways of media production. If we want to allow the user to move freely inside a scenery, we need to deliver more (geometrical, textural, aural, etc.) information about it. One way of doing this is to create a detailed virtual scenery consisting of natural and/or synthetic objects.
These objects then need to be related to each other in order to form one consistent scene. Relationships between objects can have temporal, geometrical or logical qualities. They are defined in a tree structure called a scene graph.
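The idea of a scene graph can be illustrated with a minimal sketch. It is not BIFS and does not use any MPEG-4 node types; the class SceneNode, its fields, and the object names are hypothetical and chosen only for this example. Each node stores a geometrical relationship (an offset relative to its parent) and a temporal one (a start time), and the logical grouping is expressed by the tree itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vector = Tuple[float, float, float]


@dataclass
class SceneNode:
    """One node of a toy scene graph (hypothetical, not a BIFS node type)."""
    name: str
    # Geometrical relationship: offset relative to the parent node.
    translation: Vector = (0.0, 0.0, 0.0)
    # Temporal relationship: when the object appears, in seconds.
    start_time: float = 0.0
    # Logical relationship: children belong to this node.
    children: List["SceneNode"] = field(default_factory=list)

    def add(self, child: "SceneNode") -> "SceneNode":
        """Attach a child node and return it for further nesting."""
        self.children.append(child)
        return child


def world_positions(node: SceneNode,
                    origin: Vector = (0.0, 0.0, 0.0)) -> Dict[str, Vector]:
    """Walk the tree and accumulate translations from the root down."""
    position = tuple(o + t for o, t in zip(origin, node.translation))
    result = {node.name: position}
    for child in node.children:
        result.update(world_positions(child, position))
    return result


# A small scene: a synthetic room containing a table with a lamp on it,
# and a natural video object placed elsewhere in the room.
root = SceneNode("room")
table = root.add(SceneNode("table", translation=(2.0, 0.0, 1.0)))
table.add(SceneNode("lamp", translation=(0.0, 0.75, 0.0), start_time=5.0))
root.add(SceneNode("presenter_video", translation=(-1.0, 0.0, 3.0)))

positions = world_positions(root)
print(positions["lamp"])  # (2.0, 0.75, 1.0)
```

Because the lamp is a child of the table, moving the table moves the lamp with it. This is the practical benefit of describing relationships in a tree rather than as a flat list of objects.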
3.1 Requirements
The main goal of object based media processing is the integration of natural and synthetic sources into defined environments. The source data has to be acquired, coded and transmitted in an efficient way. These requirements shall be realized with standard consumer hardware.
Interactivity is another important requirement when developing user applications. An example already in existence is interactive broadcast TV. A uniform format is required for the description of realistic audiovisual scenes that integrate different levels of interactivity. MPEG-4 provides efficient support for this. New markets are emerging for commercial applications with interactive use.
3.2 Acquisition of Synthetic and Natural Objects
Natural objects are captured from a natural environment by means of microphones and broadcast cameras. The sources of synthetic objects are technical devices or software applications.
Let's first discuss the acquisition of synthetic objects. It is possible to differentiate between three main object types:
- 3D-geometry
- synthetic visual objects
- synthetic audio
The 3D geometry is created with the help of modelling software, e.g. 3D computer animation or CAD software. To a great extent, synthetic visual objects like textures are produced with image processing software. Synthetic audio comprises various types of signals. Once created, these can be rendered and stored just like any natural audio data (e.g. in the common PCM format). Alternatively, synthetic sounds and sound effects can be created using the Structured Audio Orchestra Language (SAOL). These single sounds can then be put together to form a melody or a more complex score using the Structured Audio Score Language (SASL). For synthetic speech generation, there are special modules available in MPEG-4. Meta information is not a part of the MPEG-4 description. As described, the basic acquisition of synthetic objects is not very difficult today.
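The first route mentioned above, rendering a synthetic sound and storing it as ordinary PCM data, can be sketched with the Python standard library. This is not SAOL or SASL; the sample rate, the note list, and the function names are assumptions made for this example only.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44100  # samples per second (assumed value)


def synthesize_tone(frequency_hz, duration_s, amplitude=0.5):
    """Return a sine tone as a list of floats in the range [-1.0, 1.0]."""
    n_samples = int(SAMPLE_RATE * duration_s)
    return [
        amplitude * math.sin(2.0 * math.pi * frequency_hz * n / SAMPLE_RATE)
        for n in range(n_samples)
    ]


def write_pcm_wav(target, samples):
    """Store samples as 16-bit mono linear PCM in a WAV container."""
    frames = b"".join(
        struct.pack("<h", int(round(s * 32767))) for s in samples
    )
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16 bit
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(frames)


# A tiny "score": (frequency in Hz, duration in seconds) for each note.
score = [(440.0, 0.1), (494.0, 0.1), (523.0, 0.2)]
melody = []
for frequency, duration in score:
    melody.extend(synthesize_tone(frequency, duration))

# Write to an in-memory buffer; pass a file name instead to write to disk.
buffer = io.BytesIO()
write_pcm_wav(buffer, melody)
```

Once stored this way, the synthetic melody is indistinguishable in format from a recorded natural sound and can be handled by the same tools.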
In contrast, the acquisition of natural objects is much more difficult. Natural objects have to be captured free from the surrounding environmental influences. Additionally, aspects of post-processing for interaction with natural objects have to be considered.
For the production of natural objects, ideal conditions are given in a virtual studio environment. There, it is possible to get time-dependent tracking data of natural audiovisual objects together with the objects themselves, using methods of pattern recognition, sensors and infrared monitoring.
The recording of natural audio objects in the virtual studio can easily be done using standard lavalier microphones. They offer a lot of advantages compared to hand-held microphones or microphones mounted on a stand or a boom. First of all, they are nearly invisible and do not disturb the visual appearance of the audiovisual object. Second, when applied in the correct way, they deliver an acoustically fairly dry signal, because they are very close to the actual sound source. This is very important with regard to aspects of immersion and interactivity (see sections 3.4, 3.6). Finally, they are easy to use, and as most models are available as wireless devices, they don't hinder the actor (the natural audiovisual object).
One aspect of research in MPEG-4 is the acquisition of natural video, a task for which various methods can be distinguished according to the later use of the video:
- ordinary 2D rectangular video
- 2D shaped video
- stereoscopic video
- omnidirectional video
- multi-view video
- 3D free viewpoint video objects
This selection gives an impression of how different the demands for natural video representations can be.
3.3 "As realistic as possible"
The fusion of natural and synthetic content is defined in the Systems part of the MPEG-4 standard. It is stored in a BIFS file. The multimedia content can be delivered to a number of different user devices, e.g. DVB receivers, mobile phones or PDAs. The large number of different user devices creates the need for a quality of service (QoS) definition. Profiles for levels of applications are defined to support QoS.
In MPEG-4 the "complete graphics profile" is the highest profile definition and ensures the greatest realism. The profile enables the composition of natural and synthetic objects into a complex 3D scenery. It is possible to define lights, sound sources, viewpoints, acoustic and visual material properties and so on. However, the descriptions in the "complete graphics profile" are not sufficient in some areas. Extensions are necessary to produce a higher level of scene realism. They can be integrated with the help of the Animation Framework eXtension (AFX). New parts of AFX shall include the possibility to reproduce natural and technical phenomena, e.g. shadows (see fig. 2).
Fig. 2: An animated synthetic bumble bee casts shadows onto a 2D shaped video
Future applications need a higher level of interactivity to offer more fun and a deeper immersion of the user. At present, interactive TV supports only some very basic features.
By using the MPEG-4 scene description, new interaction features can be introduced, for example a free choice of viewpoint, the exchange of individual media objects, and so on. The definition of sensitive areas allows the integration of metadata for commercial use.
Another important point is the ability to reuse media objects in different sceneries. Evidently, objects become more and more complex as their naturalness is raised by providing a more detailed description. For synthetic objects, if we want them to appear in a more detailed way, we need a more detailed description to be provided by the object designer. For natural objects, the expenditure of manpower and technical devices is even higher if we think of e.g. 3D free viewpoint video objects (see section 3.2).
In conclusion, both the higher level of interactivity and the wish to reuse media objects are factors (amongst others) which call for an advanced management of the objects involved and of the necessary data in general.
3.5 Data Management
These new methods for representing audiovisual content require novel ideas for managing the resulting data. Audio and video objects are usually large binary files. Scene descriptions and metadata are complex textual files with a tree structure.
Support for a group-based authoring process, including rights management for authors and users, is a particular concern.
The similarity of MPEG-4 scene description data to XML data allows the use of an XML server. Some of these methods and aims are addressed in MPEG-21 and MPEG-7.
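The split described above, large binary media files on the one hand and a tree-structured textual description on the other, can be illustrated with a short sketch. The XML below is hypothetical: its element and attribute names are invented for this example and do not follow any MPEG-4, MPEG-7 or MPEG-21 schema. It only shows how a tree-shaped description that references binary objects can be queried with standard XML tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical scene description; element and attribute names are invented
# for illustration and do not follow any MPEG schema. The src attributes
# stand for the large binary media files kept outside the description.
SCENE_XML = """
<scene name="studio">
  <object id="presenter" type="natural-video" src="presenter.bin"/>
  <object id="bee" type="synthetic-mesh" src="bee.bin">
    <meta key="author" value="design-team"/>
  </object>
  <object id="voice" type="natural-audio" src="voice.bin"/>
</scene>
"""

scene = ET.fromstring(SCENE_XML)


def objects_of_type(root, prefix):
    """Return the ids of all objects whose type starts with the prefix."""
    return [
        obj.get("id")
        for obj in root.iter("object")
        if obj.get("type", "").startswith(prefix)
    ]


natural_ids = objects_of_type(scene, "natural")
print(natural_ids)  # ['presenter', 'voice']
```

A query of this kind is what makes reuse practical: an author can look up all natural objects, or all objects by a given author, without opening the binary files themselves.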
3.6 New Methods of Presentation
As a novelty, MPEG-4 allows for a reproduction of the same content on different playback devices. Imagine an interactive video clip being reproduced on a TV screen in your living room, on a PDA in the office or on a cell phone on the road. On each of the three platforms named, we need a different level of detail for the objects to be displayed. Of course, in return we also get a different level of user immersion.
Taking into account not only visual aspects, but also the reproduction of sound, it becomes very clear that user immersion is a question of central interest. While 3D visual displays still lie ahead of us and can only be used for very cost-insensitive applications, 2D/3D sound displays (surround sound loudspeaker setups) can be found in many homes nowadays. Using a surround setup of loudspeakers (e.g. 5.1 surround sound), everyday experiences can be reproduced when the consumer navigates inside the scenery.
A human voice sounds different in a church than in a small room. This is due to the sound reflection pattern inherent in each specific environment. Whenever the position of the sound source (the audio object) or the receiver (the user navigating in the scenery) changes, this reflection pattern changes as well. With the help of room simulation algorithms that run in real time, the reflection pattern can be calculated based on the geometrical data of the scene and the objects' and receiver's coordinates. Of course, acoustic qualities of (virtual) walls and objects have to be taken into account for a convincing result.
When using a larger number of loudspeakers, immersion is even higher. But while a demonstration system using a surround speaker setup is up and running in the IAVAS project, integrating Wave Field Synthesis (WFS) into this context will take some more research effort.
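How a reflection pattern follows from geometry and coordinates can be sketched for the simplest case. The example below uses the image-source method for first-order reflections in a box-shaped room. This technique is named here as an assumption; the text does not state which room simulation algorithm is used. The absorption coefficient, the room sizes, and all function names are likewise illustrative.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value for air at room temperature


def first_order_reflections(room, source, receiver, absorption=0.3):
    """Return sorted (delay_s, gain) pairs for the direct path and the six
    first-order wall reflections in a box-shaped room.

    room, source and receiver are (x, y, z) tuples in metres; the room
    spans from (0, 0, 0) to room. absorption is one assumed energy
    absorption coefficient shared by all walls. Source and receiver must
    not coincide.
    """
    images = [(tuple(source), 1.0)]  # direct sound, no wall involved
    reflection_gain = math.sqrt(1.0 - absorption)  # amplitude factor
    for axis in range(3):
        for wall in (0.0, room[axis]):
            image = list(source)
            image[axis] = 2.0 * wall - source[axis]  # mirror at the wall
            images.append((tuple(image), reflection_gain))

    paths = []
    for position, gain in images:
        distance = math.dist(position, receiver)
        # Delay from path length; gain falls off with 1/distance.
        paths.append((distance / SPEED_OF_SOUND, gain / distance))
    return sorted(paths)


source = (1.0, 1.0, 1.5)
receiver = (3.0, 2.0, 1.5)
small_room = first_order_reflections((4.0, 3.0, 2.5), source, receiver)
church = first_order_reflections((40.0, 20.0, 15.0), source, receiver)

for label, pattern in (("small room", small_room), ("church", church)):
    last_delay_ms = pattern[-1][0] * 1000.0
    print(f"{label}: latest first-order reflection after {last_delay_ms:.1f} ms")
```

With identical source and receiver positions, the direct sound arrives at the same time in both rooms, while the reflections in the larger room arrive much later and much weaker. A real-time system would recompute this pattern whenever the source or the receiver moves, and would also need higher-order reflections and per-wall material properties.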
4. The IAVAS-Project
The Institute of Media Technology at Technische Universität Ilmenau (university of technology) comprises several professorships in the field of the technology of electronic media. More than 10 years' experience and more than 700 students in the Media Technology degree course (the degree is comparable to a Master of Science) make the institute an academic center in this field.
The director of the institute, Prof. Dr. Karlheinz Brandenburg, gained extensive experience during the invention of MPEG Audio Layer 3 (the so-called MP3). In view of the trends in electronic media technology, the idea for the IAVAS project was born in November 2001. Its title, Interactive Audiovisual Application Systems, explains the aim. Novel approaches for coding audiovisual content are to be evaluated, utilized, and improved. One focus is on MPEG-4.
The project is assigned to the Institute of Media Technology and is funded by the Thuringian Ministry of Science, Research, and Art.
The project began on October 1st, 2001 and runs for at least three years. Eight scientists work in and for this project under the leadership of Prof. Brandenburg. In addition, many students work on special tasks for the project during their diploma theses.
There is already cooperation with other institutions and companies. Very close teamwork has been established with the Fraunhofer Institute for Integrated Circuits in Erlangen and its working group Electronic Media Technology in Ilmenau.
5. Summary
There are promising novel ideas in the field of digital media technology. Implementing them requires new methods of creating, coding, managing, delivering, and utilizing interactive audiovisual content. The development of new generations of devices is absolutely necessary for the success of these concepts.