Multimedia and the Web Workshop Report

WWW9, 15th May 2000, Amsterdam, The Netherlands

Lynda Hardman, Jacco van Ossenbruggen

Original call, Workshop Web page

Text documents containing images have long been commonplace within the Web infrastructure. Multimedia documents, which rely much more heavily on timing and on synchronization among their elements, are only beginning to make an entrance into the Web world -- SMIL 1.0 was released as a W3C Recommendation in June 1998, and SMIL Boston, the public draft for the next release of SMIL, is currently being worked on.

Web-based multimedia, however, covers broader issues than a single document language. The goals of the workshop were to explore the open issues for Web-based multimedia, to gain insight into the potential directions Web-based multimedia could take, and to get a feel for which of these are the most fruitful. We selected presentations from the submitted position papers in order to focus our ideas and to bring different perspectives to bear on similar topics. After presentations grouped into four "themes" we devoted time to issues that were raised during the presentation and discussion sessions.

The agenda was deliberately limited to allow a large amount of time for discussion. The topics chosen for the talks, each covered in one of the sections below, were broadly:

Integration of timing into XML documents

SMIL 1.0 is a language developed by W3C to enable synchronized multimedia presentations to be described declaratively and to provide a functional specification of presentation behaviour at runtime. A declarative specification supports the preservation of presentation behaviour through different authoring tools -- the document needs to be able to survive a "round trip", i.e. being created in tool 1, saved from tool 1, read into tool 2, saved from tool 2 and read back into tool 1.
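
As a concrete illustration of the declarative style, the following minimal SMIL 1.0 fragment (with invented media file names) lets the player schedule two narrated slides; the <seq> and <par> containers carry all of the synchronization information:

  <smil>
    <head>
      <layout>
        <root-layout width="320" height="240"/>
        <region id="slides" top="0" left="0" width="320" height="240"/>
      </layout>
    </head>
    <body>
      <seq>   <!-- play the two slide/narration pairs one after the other -->
        <par> <!-- image and audio start together -->
          <img src="slide1.png" region="slides" dur="10s"/>
          <audio src="narration1.wav"/>
        </par>
        <par>
          <img src="slide2.png" region="slides" dur="8s"/>
          <audio src="narration2.wav"/>
        </par>
      </seq>
    </body>
  </smil>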

The existence of SMIL 1.0 as a single document format provides only a partial solution to synchronized behaviour on the Web. XML allows multiple document languages to be defined, some of which could benefit from added temporal functionality. For example, XHTML would benefit from the addition of temporal behaviour for emulating, e.g., slide presentations. SMIL Boston defines a timing module which can also be used in other document formats. This approach has already been used in HTML+SMIL and SVG Animation.
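
As a sketch of what such reuse could look like, timing attributes from the SMIL timing module might be attached to ordinary XHTML elements to turn a page into a timed slide show. The timing namespace and attribute spellings below are invented for illustration; the actual HTML+SMIL work defines its own profile:

  <html xmlns="http://www.w3.org/1999/xhtml"
        xmlns:t="urn:example:timing">   <!-- hypothetical timing namespace -->
    <body>
      <div t:timecontainer="seq">       <!-- children are shown one after another -->
        <div t:dur="10s"><h1>Slide 1</h1><p>First point.</p></div>
        <div t:dur="10s"><h1>Slide 2</h1><p>Second point.</p></div>
      </div>
    </body>
  </html>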

While defining a timing model is necessary, it is not sufficient for including temporal information in XML documents -- a flexible syntactic approach also needs to be defined. Patrick Schmitz and Ramon Clout presented such an approach, which combines three ways of adding temporal information to XML documents: inline timing markup on the content elements themselves, timing expressed via style properties, and separate timesheets.

Patrick discussed several issues arising from the mutual interaction of the three approaches, especially in the context of the general W3C document processing and filter chain. These include the interaction of temporal behaviour with CSS style properties, DOM manipulations and XSLT transformations.

Ramon focused on the concept of a timesheet, and explained two principles underlying it. The first is to associate three functional sections with a document: content, formatting and timing. The second is to relate the timing section to the other two by expressing timing information as relations between the entities in those sections.
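
A timesheet might thus play a role analogous to an external style sheet: the content document stays free of timing, while the timesheet selects entities in the other sections and relates them in time. The syntax below is purely illustrative and not taken from any published draft:

  <!-- content.xml: structured content, no timing information -->
  <report>
    <section id="intro">...</section>
    <section id="results">...</section>
  </report>

  <!-- timesheet.xml: hypothetical external timing relating the content entities -->
  <timesheet xmlns="urn:example:timesheet">
    <seq>
      <item select="#intro"   dur="20s"/>
      <item select="#results" dur="40s" begin="2s"/>  <!-- starts 2s after intro ends -->
    </seq>
  </timesheet>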

One discussion point was whether the SMIL Boston timing model is appropriate for generic XML document types. A common declarative timing model allows authoring tools for different XML document formats to present temporal information to users in a consistent way. Another question was which information should be shown on a timeline during authoring. For example, the ordering of elements is useful, but for adding transition information (e.g. a fade between images) a full timeline view is probably overkill.

Later during the day it was noted that the approach to adding timing proposed by Patrick and Ramon (inline, style and timesheets) could also be applied to other types of information. Important examples include meta-data (e.g. MPEG-7), adaptation hints (for device or human accessibility) and navigation structures.

The relation between MPEG and W3C work

Carsten Herpel gave an overview of MPEG-4. He emphasized the difference between a scene-based model of a presentation and a document-based model. Furthermore, the MPEG-4 scene may have a timeline that stretches indefinitely, and changes to the scene can be conveyed incrementally. Another important difference is MPEG-4's emphasis on delivery timing, in addition to the presentation timing emphasized by SMIL. Presentation time is the time at which the parts of the document are calculated to play, given ideal conditions. Delivery timing takes into account multiplexing and streaming mechanisms and network load, and ensures that the available network bandwidth (assuming QoS) is not exceeded by the streamed content. (Note that while SMIL Boston allows specification of prefetching strategies, its delivery model is not as sophisticated as that of MPEG-4.)

The MPEG-4 community is considering producing a text representation of BIFS (BInary Format for Scenes). Some of the BIFS functionality is proposed to be mapped to the appropriate SMIL modules. Carsten also discussed the overlap between the MPEG and W3C work in this area, and put up the following equation:
(BIFS-2D + MPEG-J) ~= (HTML + SVG + SMIL + DOM)
The equation is intentionally provocative and taken from a 3000 ft. view. While each side offers some functionality that the other cannot match, many applications could be realized in either domain.

Jane Hunter and Frank Nack explained the MPEG-7 standard for describing semantic annotations of continuous media. They raised the question of whether streaming meta-data would be useful, or whether the audio/video data so dominates the bandwidth requirements that even a large XML meta-data file would not need to be streamed. One application where streaming meta-data would be useful regardless of size is a live broadcast, where semantic information has to be generated on the fly.
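
The live-broadcast case can be pictured as time-coded annotation fragments emitted while the programme runs, to be delivered alongside the audio/video streams. The element names below are invented for illustration and are not taken from the MPEG-7 drafts:

  <!-- hypothetical annotation fragment generated during a live sports broadcast -->
  <annotations xmlns="urn:example:av-metadata">
    <segment start="00:12:05" end="00:12:40">
      <event>goal</event>
      <agent>player 10</agent>
    </segment>
    <segment start="00:14:10" end="00:14:25">
      <event>substitution</event>
    </segment>
  </annotations>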

Adaptation

Masahiro Hori described the on-the-fly adaptation of XML or XHTML to formats suitable for small devices such as PDAs and mobile phones. To enable such a process, extra semantic information is needed encoding, for example, the importance of each part of the original document. This information is authored externally with a tool that encodes it in RDF and associates it with the original document using XPointer and XLink.
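
The mechanism can be sketched as an external RDF description pointing back into the source page; the adaptation vocabulary and the exact XPointer form below are invented for illustration:

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:ad="urn:example:adaptation">  <!-- hypothetical adaptation vocabulary -->
    <!-- attach an importance rating to one element of the original page -->
    <rdf:Description rdf:about="http://example.org/page.html#xpointer(id('headline'))">
      <ad:importance>high</ad:importance>
      <ad:keepOnSmallScreen>true</ad:keepOnSmallScreen>
    </rdf:Description>
  </rdf:RDF>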

Markku Hakkinen described a method of creating navigation structures over and above existing document structures, in particular for digital talking books (DTBs). A DTB provides spoken or synthesized renditions of a textual source for non-visual readers. While SMIL can be used to synchronize the audio with the corresponding text elements, it provides inadequate navigation facilities. He described NCX, a format for providing flexible navigation structures on top of existing DTBs, which allows navigational access to a wide variety of source documents such as well-structured XML, legacy HTML or audio-only books.
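
The idea can be sketched as a separate navigation document whose entries point into the synchronized presentation; the element names below are only indicative of the NCX approach, not its exact syntax:

  <!-- sketch of a navigation layer over an existing talking book -->
  <ncx>
    <navMap>
      <navPoint id="ch1">
        <navLabel><text>Chapter 1</text></navLabel>
        <content src="book.smil#ch1"/>  <!-- points into the synchronized SMIL -->
        <navPoint id="ch1s1">
          <navLabel><text>Section 1.1</text></navLabel>
          <content src="book.smil#ch1s1"/>
        </navPoint>
      </navPoint>
    </navMap>
  </ncx>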

While the talk by Masahiro emphasized the on-the-fly creation of WML by transcoding existing HTML pages, Daniel Ockeloen focused on a common underlying data model that can be used for generating multiple delivery formats, including WML, HTML and SMIL. Daniel also stressed the importance for content owners of archiving high-quality media data which can be converted at run-time to the particular format required. Large-scale production of multimedia on the Web requires the integration of Web authoring tools into the everyday production environment of the content providers. For example, authoring tools need to be able to re-use valuable information such as that stored in edit decision lists (EDLs).
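
The single-source idea can be sketched with a small, invented content model transformed by XSLT 1.0 into one of the delivery formats; a similar style sheet would produce HTML or SMIL from the same source:

  <!-- item.xml: invented common model for a news item -->
  <item>
    <title>Election results</title>
    <summary>First projections are in.</summary>
    <video src="results.mpg"/>
  </item>

  <!-- to-wml.xsl: one of several delivery-format transforms -->
  <xsl:stylesheet version="1.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/item">
      <wml>
        <card title="{title}">
          <p><xsl:value-of select="summary"/></p>  <!-- video omitted for phones -->
        </card>
      </wml>
    </xsl:template>
  </xsl:stylesheet>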

Discussion

Discussion points were collected from the presentations and ensuing discussions, and a few were then chosen as topics for discussion by the whole group: application areas and business models, content annotation, and SMIL/MPEG.

Application areas and business models

If we are trying to design multimedia technology for the Web, then who are we designing for? Who are our users? What will their applications be? How will the convergence of the Web, TV and mobile devices (fundamentally) influence these applications and development of new ones? Is Web technology able to handle a huge number of requests when triggered by a mass medium such as TV?

Whatever they are, new applications need to be defined at a level that does not depend on current standards. New tools need to take into account legacy content as well as legacy tools. In order for new tools to be accepted they need to integrate into the current production methods, processes and tools or they will not be used.

Traditional storytelling is focussed on relatively short time frames, whereas in TV entertainment storylines, for example in soap operas or coverage of ongoing elections, can last weeks or months. This requires support from the content provider's infrastructure to maintain consistency over such a period.

Entertainment broadcasting can allow recording on the fly, eroding the notion of "prime time", e.g. TiVo and Replay. This allows viewers to see what they want to see at a time convenient to them. It also, however, allows them to skip commercials. This removes the traditional payment model for the content, so business models will have to change. How can we ensure that consumers pay for content if they can choose to skip adverts? Should there be different subscription levels, so that consumers can choose to pay to avoid seeing (all or some of) the ads? Ads can be pushed on the basis of consumer interests or pulled by an explicit request from the user; in both cases the ads would be more interesting to viewers, and more effective for the advertisers. There is already a model with DVD where consumers buy a disc and, if they want access to material already on the disc (e.g. the director's cut), can visit a Web site, give information about themselves and then be given a key to unlock the extra material. Another approach is to let people vote by phone, so that the extra phone calls become a source of revenue.

Annotation of content

Annotation of non-textual media already occurs during production, that is, when recording shots for inclusion in film or video productions, but currently the annotations are not in a form suitable for storing with the media. A first step towards providing online annotations of non-textual media is thus to retain the meta-data used and generated during production. Examples are the script of the storyline and camera meta-data such as focal length, pans and zooms. Camera meta-data can also be determined beforehand to control the camera shot and then stored with the video recording -- provided the technology allows it. With only a little more work, GPS technology (e.g. see TomTom) can be used to add real location data to the camera meta-data stream. Another way of collecting meta-data is to have users submit it, e.g. CDDB, where the huge effort required to populate the annotation database is distributed over massive numbers of users.
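
For instance, the production-time record for a single shot could be retained as a small structured fragment stored with the footage; the schema below is invented and merely indicative:

  <!-- hypothetical per-shot record kept alongside the recorded material -->
  <shot id="shot-042" start="00:03:12" end="00:03:58">
    <camera focal-length="35mm" pan="left-to-right" zoom="none"/>
    <location lat="52.37" lon="4.90"/>  <!-- e.g. from a GPS receiver -->
    <script-ref scene="12" line="4"/>
  </shot>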

While camera meta-data is uncontroversial, other types of meta-data are more subjective. Rather than restricting annotators to rigid predefined vocabularies or to unmanageable free text, a more balanced approach is to allow users to define their own ontologies and to extend them as needed. Just as librarians have procedures for updating their classification schemes, so must online media annotation schemes be able to grow.

MPEG-4 is about streaming media, MPEG-7 about semantic annotations. MPEG is currently trying to answer the question of how MPEG-7 annotations can be structured into "a priori" (static) and streamable portions, so that the relevant portions can be streamed along with the MPEG-4 streams. What applications can be built from that data? Can the same meta-data concepts be applied to media other than MPEG-4?

SMIL/MPEG

The different levels of the processing chain are better integrated in MPEG-4 than the corresponding levels covered by separate IETF and W3C specifications. SMIL does not address transport issues, so the author does not know what different implementations will do when, for example, the audio drifts out of sync. SMIL's support for hypermedia is more explicit. MPEG incorporates stricter timing in the streaming and accounts for prefetching time, while SMIL has a higher-level model of hierarchical time definitions (relative to a time container). It would therefore seem a good idea to combine the MPEG notions of delivery/decoding time with the presentation time common to MPEG and SMIL, along with W3C hyperlinking. Note, however, that the MPEG standard itself does not specify how the streams are transported; this is achieved by defining mappings onto either RTP (for IP transport) or MPEG-2 (for digital TV applications).

Bandwidth will always remain a bottleneck, and a standard needs to include priority indications of which streams, or media items, may be dropped or allowed to hiccough. In some cases, for example video conferencing, real-time delivery is critical and delays (to allow for buffering) cannot be introduced. While this may be useful to specify within a standard, it is an area in which companies may prefer to compete.

David Shrimpton mentioned the Multimedia Home Platform (MHP) as a good example of the use of MPEG standards.