For use in a wide-issue pipelined processor, a mechanism and method for reducing pipeline stalls between nested calls and supporting early prefetching of instructions in nested subroutines and a digital signal processor (DSP) incorporating the mechanism or the method. In one embodiment, the mechanism includes: (1) a program counter (PC) generator that generates return PC values for call instructions in a pipeline of the processor and (2) return PC storage, coupled to the PC generator and located in an execution core of said processor, that stores the return PC values and makes ones of the return PC values available to a PC of the processor upon execution of corresponding return instructions.
A method for processing divergent samples in a programmable graphics processing unit is described. In one embodiment, the method includes the step of incrementing a subroutine depth of a first sample to designate that first call instructions are to be executed on the first sample. The method also includes the steps of pushing state data of a second sample upon which the first call instructions are not to be executed onto a global stack and executing the first call instructions on the first sample.
A method for synchronizing divergent samples in a programmable graphics processing unit is described. In one embodiment, the method includes the steps of determining that a divergence has occurred and detecting that a first sample of a group of samples has encountered a first synch token. The method also includes the steps of determining whether each of the other samples of the group has encountered a synch token and determining whether the synch token encountered by each of the other samples of the group is the first synch token.