Virtual Reality, Volume Visualization and Video


From: herbt@apollo.sarnoff.com (Herbert H Taylor III)

Subject: Virtual Reality, Volume Visualization and Video (LONG) 

Date: Wed, 05 Jun 91 11:27:20 EDT





 The recent series of postings on VR/Video following the exchange
between myself and Chris Shaw has been most informative. I would also

like to publicly thank Chris for asking some very tough questions. I

can certainly take a "whipping" without losing my sense of humour -

although I hope we can both avoid the use of the pejorative. It was

unfortunate that the HDTV emphasis of previous posts defocused the

more important topic of VR Architectures. Chris challenged the

applicability of those ideas to VR, and while I remain convinced that
they will prove important, they seem to represent a small conceptual

detour. Likewise, a strong technical challenge was raised to our

description of several 3D video based scenarios. Much of the technical

criticism which followed was the result of our poor description of our

ideas. Real-time 3D imagery alone will not be sufficient to construct

and manage entirely "simulated" virtual worlds. We will not be able to

look under the Kitchen table for the "Chewing Gum". (Of course if the

CGI modeler doesn't have a "Chewing Gum" model, "it", to borrow a

favorite colloquial expression of Chris', "ain't there either.")

Whether these limitations preclude the use of 3D imagery as a useful

interface component of VR remains a topic of further research...


 In our original speculative post of ca. 3/10/91 (responding to our
moderator's request for summaries of current research), we mused about

our desire to explore what VR will be like in ten years. We admitted

that the system we were using (aka the Princeton Engine) was not a

general solution to VR processing. I hope, however, that we use

Supercomputers as vehicles to explore otherwise impossible ideas.

Machines such as the UNC PxPL5, the CM2 or the Princeton Engine are

not "practical" single user architectures - at least not yet - but

surely in not too many years we will have desktop massive parallelism

with the potential for real-time interactive VR applications. This

leads naturally to motivating questions: will the "ultimate" VR system

of the year 2000 be more SGI-MIMD like - with a large number of

powerful processors in the rendering pipeline - or will it be a

hybrid of SIMD and MIMD, as in the PxPL5?  Perhaps VR specific

architectures will emerge. What impact does our choice of VR world

model have on architecture? Is the evolution of VR only going to be in

the direction of increasingly realistic CGI rendering?  Are worlds

derived from sampled data going to become more viable?


 What are the applications which motivate future VR architectures?

Certainly, the potential for applications in medicine, architecture,

simulated experience, "gaming" and product design will continue to

provide motivation for developing systems with improved visual realism

and more natural interactions. Likewise, scientific data visualization

offers fertile ground for VR research and future application - where

we have this notion of interacting with and literally "experiencing"

our data. There has been significant independent progress in recent
years in each of the fields of interactive data visualization, VR and,
specifically, Volume Visualization (VV); however, it will be the
convergence of all three technologies into a single computing and
interaction framework where the true enabling leap of functionality
will occur. Scientists will be able to simulate and visualize complex
phenomena and, in some sense, actually "participate" in their experiments.

This kind of interaction will revolutionize scientific research in

much the same way that the computer itself has.


 There are probably those who will question our enthusiasm and observe

that our scientific forebears drew marvelous insight from very simple

physical models of complex structure without the benefit of computers

or graphics. It is said that the structure of benzene came to Kekule

in a dream. Certainly, Watson and Crick were able to visualize amazing

structure without the benefit of complex "tools"... which is exactly

the point. When the visualization systems of the future are as easy to

use as a box of snap together molecular models, as interactive as the

microscope or as "free associative" as a dream - only then will they

realize their full potential. These advances, however, will come at

great computational cost.


 Where are the computational boundaries for VR? To address these

issues we must first establish complexity bounds for VR in terms of

computation (rendering, dynamics, constraints, etc) and I/O. The

processing requirements of VR have been studied in terms of system

dynamics and constraint satisfaction by [Pentland90], giving O(n^3)
"calculations" per vertex for the dynamical system. For 1,000 objects
of 1,000 vertices each, roughly 100 TFLOPS of performance is required
to achieve interactivity (assuming 100 floating point operations per
system "calculation"). That astounding number is still two orders of
magnitude removed from the NEXT generation of supercomputers. The
authors propose a reduced complexity model - still with computational
complexity in the 10 GFLOPS range to satisfy system constraints and
100 MFLOPS for the dynamics. A system which implements this approach
is described in [PentE90]. [Witkin90] also discusses the constrained
dynamical system in some detail. Polygon rendering has been discussed
in [Ake88], with floating point requirements in the range of 40 MFLOPS
for 100,000 polygons. Further research needs to be done to reduce
world complexity and to make higher resolution worlds with complex
objects more tractable on near-term computers.
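
 To make the scale of that estimate concrete, the short Python sketch
below shows one plausible reading of the figures quoted above - it is
NOT a reproduction of Pentland's derivation, and the per-"calculation"
FLOP count is the stated assumption:

  # Back-of-the-envelope reading of the dynamics cost quoted above.
  n_objects = 1000          # objects in the virtual world
  n_vertices = 1000         # vertices per object
  flops_per_calc = 100      # assumed floating point ops per "calculation"

  # O(n^3) "calculations" in the per-object vertex count, summed over objects.
  calcs_per_update = n_objects * n_vertices ** 3
  flops_per_update = calcs_per_update * flops_per_calc
  print("%.1e floating point ops per update" % flops_per_update)  # ~1e14, i.e. ~100 TFLOP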


 Several posters to this group have suggested that VR input processing

requirements are quite modest - at least in terms of the data glove.

We might ask how the complexity scales as more, and higher
"resolution", input devices are introduced. Devices such as the eye
tracker [Levoy90] and 3D head tracker [Wang90] would seem to add

significant complexity to VR processing requirements even with

dedicated interface hardware.  Comparatively less research has been

done on the physical "output" side of VR. [Minsky90] describes a

system using the sense of touch. What other input or output devices

can we look forward to and what are the performance specs?


  Clearly, the exponential growth of the computational and I/O

requirements of VR will motivate both algorithmic and architectural

solutions. A recent estimate by the US Government projects "teraflop"
commercial Supercomputers before the year 2000 [USG91], with the first

demonstration systems emerging from the DARPA High Performance

Computing Systems (HPCS) initiative in 1992-3. Whether the spectacular

performance of these machines can be fully harnessed for VR remains to

be seen. Perhaps what is really needed is a combination of VR specific

architectures with VR specific algorithms - architectures which are on

the HPCS technology learning curve (i.e. that employ MCM packaging,
superdense ULSI, optical interconnect, etc.) COMBINED with algorithms
which can replace O(n^3) with, say, O(n log n). A number of research

groups have proposed (and in some cases actually built) "application

specific" or "algorithm specific" visualization computers, including

the well-known UNC PxPL5 [Fuchs82], the SUNY CUBE [Kauf88] and the
Stanford SLAM [Dem86] systems. (We are not sure if a full version of
the latter machine was ever built.)  In general, these researchers were

motivated by the desire to explore "future" visualization algorithms.


  Our original motivation in developing the Princeton Engine was the

desire to explore "future" video systems. However, we have found that

its applicability is in no way "limited" to video and it can serve as

a useful architecture to study visualization algorithms and future

visualization systems. We certainly do believe we can accomplish much
of what we described in our previous posts. After all, the system can
turn over 30 (16-bit) GIGAOPS, or over 1 GFLOPS (for you scientific
types). More important, ALL system I/O is continuous and transparent
to the CPU. A 48-bit x 28 MHz (~1.4 Gbps) digital input bus and a
64-bit x 28 MHz (~1.8 Gbps) digital output bus can drive any combination

of analog or digital I/O devices. Transparent, continuous gigabit I/O

should be important to the NEXT generation of VR peripherals: Data

Gloves, eye trackers, you invent it. Finally, while there is no

special system requirement that either the I/O or the application be

"video", a typical application will often COMBINE scientific computing

AND real-time data visualization.
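
 For what it is worth, the bus figures above are simply bus width times
clock rate; a quick Python check using only the numbers as quoted:

  # Sanity check of the I/O bus figures: bandwidth = width x clock.
  for width_bits, label in [(48, "input"), (64, "output")]:
      gbps = width_bits * 28e6 / 1e9              # 28 MHz bus clock
      print("%s bus: %.2f Gbps" % (label, gbps))  # ~1.34 and ~1.79 Gbps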


 Real-time Interactive Volume Visualization

 ------------------------------------------

 To calibrate this machine for graphics applications, we are in the

process of implementing a real-time volume rendering system that can

arbitrarily rotate and render 256x256x256 volumes at 30fps. We believe

we can volume render 1Kx1Kx1K at about 8fps using single axis

rotation. This is possible because the Princeton Engine can perform a

"continuous" real-time transpose (512x512x32bitsx30fps) for very

little CPU cost.  The programmer effectively has an array and its

transpose as working data structures. At any line "time" each

processor has a row and a column of the current frame in hand.

Therefore, "scanline" algorithms are relatively straightforward...
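
 To illustrate why per-scanline access to both rows and columns makes
single-axis rotation cheap, here is a minimal sketch in present-day
Python/NumPy (purely illustrative - it is not the Engine
implementation, and the volume size and angle are made-up values): a
2D rotation of each slice is decomposed into three shears, and each
shear only shifts samples along one axis, i.e. it is a pure scanline
operation, with the transpose supplying column access.

  import numpy as np

  def shear_rows(img, factor):
      """Shift each row horizontally by factor * (row index - centre)."""
      out = np.zeros_like(img)
      h, w = img.shape
      for y in range(h):
          shift = int(round(factor * (y - h / 2)))
          if -w < shift < w:
              if shift >= 0:
                  out[y, shift:] = img[y, :w - shift]
              else:
                  out[y, :w + shift] = img[y, -shift:]
      return out

  def rotate_slice(img, theta):
      """Rotate one slice by theta using a three-shear decomposition."""
      a = -np.tan(theta / 2.0)
      b = np.sin(theta)
      step1 = shear_rows(img, a)            # shear along rows
      step2 = shear_rows(step1.T, b).T      # shear along columns (via the transpose)
      return shear_rows(step2, a)           # shear along rows again

  def rotate_volume(vol, theta):
      """Single-axis rotation: rotate every z-slice independently."""
      return np.stack([rotate_slice(s, theta) for s in vol])

  vol = np.random.rand(64, 64, 64).astype(np.float32)  # stand-in for 256^3 data
  print(rotate_volume(vol, np.deg2rad(15)).shape)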


 A number of recent papers suggest that Volume Visualization (VV) and

Virtual Reality are closely related, convergent applications. The

"edvol" system combines a VPL Data Glove and 3SPACE Polhemus Tracker

to provide direct interaction with volumetric data [Kauf90]. The

authors do not characterize the size and complexity of either the VR

or the VV but do describe it as "small scale". It often seems as

though the electronic "media" believe that VV is "already" a standard

component of VR. This obvious misconception has occurred because volume

visualization demos are usually presented as real-time interactive

simulations, when the visualization actually took CPU hours to

orchestrate. But the "dream" clearly is real-time interactive volume

rendering and visualization.  One of the best demonstrations I have

seen of the potential for a VV fly-through was presented by Marc Levoy

(then of UNC) using volume rendered CT. In [Levoy89] the intended use

of a head mounted display interface to the PxPL5 is described while in

[Levoy90] the use of eye tracking hardware is described - in both

cases specifically for VV.  UNC's Steve Pizer showed a video of 8fps

single axis rotation volume rendering on PxPL5 at the San Diego

Workshop on VV. A head display system has also been developed at UNC

to assist radiologists in treatment planning.  Although these examples

may not in all cases qualify as pure "VR" they certainly speak to the

potential for a real-time interactive VR interface to a volume

visualization environment.


  Volumetric data sets come in two basic forms: "real" sampled data

(as in CT, MRI, ultrasound, optical or X-ray microscopy, etc.) and
computed or "synthetic" data (weather models, CFD, etc.). In the latter

case the volume is usually "simulated" while in the former the "raw"

data is sampled and sometimes preprocessed before the volume is

rendered. For example, before an MRI image is produced the "raw"

sampled data must be Fourier Transformed.  With either approach the

resulting data set is a 3D spatial volume. In the case of synthetic

simulated data there is also a "timestep" - the fluid flows, the

turbine spins, etc. With sampled data there is often no clear notion
of time; the data is entirely static. However, the interaction with

the data can be dynamic and even involve the "introduction" of time. A

traveler passing through a sampled and rendered volume certainly

experiences the passage of time, however, the "world" itself remains

static. By analogy one can imagine walking through a museum (static

sampled "objects") versus walking along the bank of a river (dynamic

simulated "objects"). In our conceptual museum, as we begin to

interact with objects we can simplify the system constraint dynamics

as much as desired, literally "determining" the laws of physics. That

Ming Dynasty vase I knocked over? It never touched the ground. If we

are "in" an MRI or CT museum we might wish to change opacity, point of

view or other parameters which affect our visual perception of the
phenomena we are studying. Of course, the same control of time is
possible in the "synthetic" case, but only at the risk of
undermining the scientific interpretation of the simulation, i.e.

correctly visualizing and understanding the physics is often

fundamental to the experiment.
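
 As a tiny illustration of the preprocessing step mentioned above (raw
MRI samples live in the frequency domain and are reconstructed by an
inverse Fourier transform before any rendering), here is a minimal
Python/NumPy sketch; the random array is only a stand-in for real
scanner output:

  import numpy as np

  kspace = np.random.randn(256, 256) + 1j * np.random.randn(256, 256)  # stand-in raw "k-space" samples
  image = np.abs(np.fft.ifft2(np.fft.ifftshift(kspace)))               # reconstructed slice
  print(image.shape)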


  With the emergence of increasingly real-time instrumentation there is
a second form of sampled data to consider: 3D spatial volumes with
time-varying data. Imagine a sampled volume of a living organism or
dynamic microstructure which is updated 30 times a second. If we were

"inside" this museum while it was "open" we could watch cells as they

proceed through nuclear envelope breakdown, divide and emerge as two

identical cells. (We are working with this kind of data now.) The

degree to which this form of interaction is a Virtual Reality results

not from our ability to "alter the experiment" on the fly, but from

our ability to control the dynamics of how we "view" the experiment

while it is taking place. We may ultimately be able to turn up the

heat or add some catalyst to a chemical reaction from within the

Virtual experience but that ability neither defines VR nor does the

absence of that ability preclude VR. IMO, it is the VR observer's

perceived sense of simulated presence combined with the ability to

control the visual experience which principally defines the

interaction as VR.


 Two related projects provide sources of volumetric data which we are

using at Sarnoff and which we feel have VR "prospects": from an

experimental ultrasound instrument and from a differential interference
contrast (DIC) microscope which produces a sequence of image slices

through a cell embryo. The DIC volume can be acquired at or near

real-time with the latest instrumentation - hence, a "video volume"

(sorry Chris, it really can be...) In the case of the ultrasound

instrument the Princeton Engine will also perform the front-end signal

processing required to produce a volume from the sampled data.

Presently, a "raw" 3D ultrasound data set cannot be acquired in

real-time; however, the signal processing to produce a data volume

from an unprocessed data set can potentially be accomplished in

real-time, as can the Volume Rendering process. The KEY POINT is that

we are going to see more and more real-time instrumentation which can

produce true sampled data volumes. As the acquisition time of systems

such as MRI, ultrasound, Electron Microscopy and DIC decrease we will

also see greater coupling between the front-end signal processing, the

data visualization and perhaps even the user interaction. At a recent

workshop at Princeton University, "Seeing Into Materials: Imaging

Complex Structures", both optical and EM microscopy systems capable of

real-time 3D acquisition were described... Actually "4D" is used to

refer to a 3D spatial volume "moving" in time. BTW, these scientists

have a strong intuitive feel for the potential of VR - at least what

they call "VR". That is, the ability to interact with an experiment -

either in situ OR as part of post analysis - as in our museum

examples.


 Is it fair to ask where in the taxonomy of VR systems we should place

these kinds of applications? True, the worlds are derived from various

real world spectra but the interactions are entirely SIMULATED, one

can change viewing parameters, etc. The exact meaning of virtually

touching "objects" or surfaces in such worlds remains unclear... but

really no more so than in a CGI simulated world where everything is

built from models. In either case the eventual consequence of our

fundamental interactions within a world must be determined by a "law

giver". If I put my data gloved hand directly into the burner of a VR

Kitchen stove what happens?


 It should be noted that there are several potential problems with

these methods of data visualization. First, a number of people report

varying degrees of motion sickness when observing through a head

mounted display. That may be acceptable if I am performing a
"hammerhead" maneuver in my Super Decathlon simulator, but probably is not
acceptable if I am inside someone's brain. (Informal Poll: How many of
you have experienced this effect? RSVP and I will tabulate.) A second
potential problem results from "persistence" effects on the human
visual system. We vividly recall in the early days of VLSI CAD
workstations the problem IC draftspersons experienced after long hours
staring at color stripes and squares. [Frome83] describes the
so-called "McCollough effect", wherein, after looking at color stripes

for only a short time, high contrast B&W stripes suddenly appear to

have color where none is present. To dramatize this effect during her

talk at DAC83, Francine Frome periodically displayed slides with green

and red stripes. About halfway through the talk she displayed a slide

with a striped "BTL" in bright green foreground offset from a bright

red background - before informing the audience that the slide was

totally black and white! It was quite remarkable. These and other
"human factors" issues will need to be fully understood before

head mounted displays achieve broad use and certainly before we let

Nintendo sell one to every ten year old...


 More VR/Video

 -------------

  In our original post we also speculated that multiple cameras might

be used to develop a "Video" data glove or "whole body" interface to a

virtual world. In particular, we asked how such an interface would

impact the future design of the data glove.  We received a number of

thoughtful comments following this post. It is important to note that

the VR world itself COULD STILL be CGI - with video only providing a

framework for interaction. With support for up to six cameras one
could surround participants either individually or collectively with
video. (Remember we pay no CPU cost to load frames into memory from
each camera; however, we do pay once we start to do something with the
data.) Participants might wear "chroma keyed" gloves (wireless gloves
of a reserved key color) or even body suits. Chroma keying is a well

known technique for creating simple special effects such as the

"weatherman" overlay [Ennes77] [Watk90]. We would NOT use this merely

for special effects, however, but to provide a means of isolating the

hands so we can build a useful model. On the Engine the amount of

processing for each chroma key is only about 5-10% of the "real-time

budget" at 30fps. A second chroma key is used for the background. This

is similar to Myron Krueger's Videoplace, which uses white backing

screens. It differs in that Videoplace produces only silhouettes of

"Artificial Reality" participants as a group and provides a limited

framework for identifying individual participants.  ( Don't get me

wrong - Videoplace is still a lot of Fun! - I recently spent a day at

the Franklin Institute watching kids play in it and was impressed by

the overall effect produced. )
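
 A minimal sketch of the chroma-key isolation step described above, in
present-day Python/NumPy and purely illustrative: the key color and
tolerance are made-up values, not the ones we use, and the Engine
implementation is of course not written in Python.

  import numpy as np

  KEY_RGB = np.array([0, 200, 60], dtype=np.float32)  # hypothetical glove key color
  TOLERANCE = 60.0                                     # Euclidean distance in RGB

  def glove_mask(frame_rgb):
      """Boolean mask of pixels within TOLERANCE of the key color."""
      diff = frame_rgb.astype(np.float32) - KEY_RGB
      dist = np.sqrt((diff * diff).sum(axis=-1))
      return dist < TOLERANCE

  frame = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)  # stand-in video frame
  print("glove pixels:", int(glove_mask(frame).sum()))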


 The next major technical step is to be able to exploit this interface

in a useful way. In particular we want to study the effect of multiple

"individual" participants. If two video channels are paired to each

participant with a distinct chroma key, can we construct a useful model
and use that to interact with and control the dynamics of the
visualization process? Our present plan will focus on the real-time

recognition of simple hand gestures from each "pair of hands".


 The idea of using sign language as some posters have suggested is

very interesting - particularly coupled with a neural net based

recognizer. Recognizing full ASL in "continuous" real-time by any

approach is probably ambitious, however, a useful subset might be

possible. We have used a neural net approach to detect and remove

characteristic AM impulse noise (aka "hair dryer noise") in a TV

receiver [Pearson90].  One network is trained to detect AM impulses on

an image line and a second network is trained to look at the entire

image and determine which of the detected pulses are really "false"

positives. This program runs in continuous real-time on the Princeton

Engine. We also demonstrated real-time BEP training on a simple three

layer MLP (a total of 86 weights and thresholds). For hand signs, however, a new

network topology would be required - with the input to the network

derived from the subsampled chroma key image segment of the original

image.  However, if my understanding of "conversational" ASL is

correct - and each hand sign is typically an entire word or concept -

then the resulting training set still might be huge. Also, I believe

that hand motion itself plays a significant role in the interpretation

of signs - not just in the transition from one sign to another - as in

cursive writing. This implies that a robust sign recognition system

would need to compute a motion vector and use that as part of the

training set.  We would appreciate references to current work...

particularly how one detects individual signs when in "continuous"
conversation, i.e. when does one sign end and the next begin?
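
 A minimal sketch of the kind of network topology suggested above, in
present-day Python/NumPy: the input is a subsampled chroma-key mask
concatenated with a 2-component hand-motion vector, feeding a small MLP
that scores a handful of candidate signs. The layer sizes, sign
vocabulary and random weights are illustrative assumptions only - this
is not our trained network.

  import numpy as np

  N_MASK = 16 * 16        # subsampled chroma-key image segment
  N_MOTION = 2            # (dx, dy) hand motion vector
  N_HIDDEN = 32
  N_SIGNS = 10            # hypothetical gesture vocabulary

  rng = np.random.default_rng(0)
  W1 = rng.normal(0, 0.1, (N_MASK + N_MOTION, N_HIDDEN))
  b1 = np.zeros(N_HIDDEN)
  W2 = rng.normal(0, 0.1, (N_HIDDEN, N_SIGNS))
  b2 = np.zeros(N_SIGNS)

  def classify(mask_16x16, motion_xy):
      """Forward pass: subsampled mask + motion vector -> sign probabilities."""
      x = np.concatenate([mask_16x16.ravel().astype(np.float32), motion_xy])
      h = np.tanh(x @ W1 + b1)                 # hidden layer
      logits = h @ W2 + b2
      e = np.exp(logits - logits.max())
      return e / e.sum()                       # softmax over candidate signs

  mask = (rng.random((16, 16)) > 0.7).astype(np.float32)  # stand-in glove mask
  scores = classify(mask, np.array([0.3, -0.1]))
  print("most likely sign index:", int(scores.argmax()))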


  Lastly, a second video experiment would involve the use of the

chroma key to present to each of three remote participants a composite

image of their two neighbors, to form a virtual conference. While this

interaction is entirely real-time we recognize that there will be a

significant limit to the quality of interaction between subjects. We

are interested in the degree of "total" immersion each person

experiences.  If we also mix the audio, does the participant "feel"
like he or she is having a conversation with three people?

Unfortunately, I would imagine that the head mounted displays would

tend to undermine intimacy - "perhaps" we could image warp new faces

on everybody - just kidding Chris - although now that I think about

it...


 References

 ----------


[Pentland90] "Computational Complexity Versus Virtual Worlds", A

Pentland, 1990 Symposium on Interactive 3D Graphics. Vol 24,

No 2, March 1990, ACM SIGGRAPH


 ( Based on the quality of papers in the proceedings, this must have

been a great conference! )


[Witkin90] "Interactive Dynamics", A Witkin, M Gleicher, W Welch, 

1990 Symposium on Interactive 3D Graphics. Vol 24, No 2, March

1990, ACM SIGGRAPH


[PentE90] "The ThingWorld Modeling System: Virtual Sculpting by Modal

Forces", A Pentland, I Essa, M Friedmann, B Horowitz.

1990 Symposium on Interactive 3D Graphics. Vol 24, No 2, March

1990, ACM SIGGRAPH


[Ake88] "High Performance Polygon Rendering", K Akeley, T Jermoluk,

Computer Graphics Vol 22, No 4, August 1988. ACM


[Levoy90] "Gaze Directed Volume Visualization", M Levoy, W Whitaker.

1990 Symposium on Interactive 3D Graphics. Vol 24, No 2, March

1990, ACM SIGGRAPH


[Wang90] "A Real-time Optical 3D Tracker For Head Mounted Display

Systems" J Wang, V Chi, H Fuchs. 1990 Symposium on Interactive

3D Graphics. Vol 24, No 2, March 1990, ACM SIGGRAPH


[Minsky90] "Feeling and Seeing: Issues in Force Display", M

Minsky, O Ming, O Steele, F Brooks, Jr. 1990 Symposium on

Interactive 3D Graphics. Vol 24, No 2, March 1990, ACM

SIGGRAPH


[USG91] "Grand Challenges: High Performance Computing and

Communications" A Report by the Committee on Physical,

Mathematical and Engineering Sciences; Federal Coordinating

Council for Science, Engineering, and Technology; Office of

Science and Technology Policy.


[Fuchs82] "Developing Pixel-Planes, A Smart Memory Based Raster

Graphics System", H Fuchs, J Poulton, A Paeth, A Bell. 1982

Conference on Advanced Research in VLSI.


[Kauf88] "Memory and Processing Architecture for 3D Voxel-Based

Imagery" A Kaufman, R Bakalash, IEEE Computer Graphics and

Applications, Vol 8, No 11, November 1988, pg 10-23;

reprinted in "Volume Visualization", edited by A Kaufman, IEEE

Computer Society, 1991.


[Dem86] "Scan Line Access Memories for High Speed Image

Rasterization", S.G. Demetrescu. PhD Dissertation, Stanford

University. June 1986.


[Kauf90] "Direct Interaction with a 3D Volumetric Environment", A

Kaufman, R Yagel, R Bakalash, 1990 Symposium on Interactive 3D

Graphics. Vol 24, No 2, March 1990, ACM SIGGRAPH


  (We also highly recommend, "Volume Visualization", edited by A

Kaufman, IEEE Computer Society, 1991 which contains a large survey of

relevant publications.)


[Levoy89] "Design for a Real-Time High Quality Volume Rendering
Workstation", M Levoy. Chapel Hill Workshop on Volume Visualization,

1989, Department of Computer Science, University of North

Carolina. C. Upson, Editor.


[Frome83] "Incorporating the Human Factor in Color CAD Systems", F.S.

Frome, 20th Design Automation Conference, June 1983, IEEE

Computer Society. 


[Ennes77] "Television Broadcasting: Equipment, Systems and Operating
Fundamentals", Howard W. Sams, 1979. pg 319-323


[Watk90] "The Art of Digital Video", John Watkinson, Focal Press,

1990, pg 75-77


[Pearson90] "Artificial Neural Networks as TV Signal Processors"

Clay D. Spence, John C. Pearson, Ronald Sverdlove, SPIE
Proceedings Vol. 1469: Applications of Artificial Neural
Networks, 1991






 
