environment
ClaudiuZissulescu,BartKienhuis,EdDeprettere
LeidenEmbeddedResearchCenter,
LeidenInstituteofAdvancedComputerScience(LIACS),
LeidenUniversity,TheNetherlands{claus,kienhuis,edd}@liacs.nl
Presentedat
Field-ProgrammableLogicandApplicationsconference(FPL’05)
Tampere,FinlandAugust24-26,2005
Abstract.AtLeidenUniversity,wearedevelopingadesignmethodologythatallowsforfastmappingofnested-loopapplications(e.g.DSP,Imaging,orMulti-Media)writteninasubsetofMatlabontoreconfigurabledevices.ThisdesignmethodologyisimplementedintoatoolchainthatwecallCOMPAAN/LAURA[8].Thismethodologygeneratesaprocessnetworkinwhichtheinter-processcom-municationtakesplaceinapoint-to-pointfashion.Fourtypesofpoint-to-pointinter-processorcommunicationexistinthePN.TwoofthemuseaFIFOlikecommunicationandtheothertwouseacachelikememorytoexchangedata.Inthispaper,weinvestigatetherealizationsforthefourcommunicationtypesandshowthatpoint-to-pointcommunicationatthelevelofscalarscanberealizedautomaticallyandveryefficientlyintoday’sFPGAs.
1Introduction
Tobetterexploitthereconfigurablehardwaredevicesthatarecomingtomarket,anum-beroftechnologiesaredevelopedtohandlebillionsoftransistorsavailableinthesenewchips.Akeyideainthesetechnologiesisdecouplingthecommunicationfromcomputation.ThisdecouplingallowstheIPcores(thecomputationpart)andtheinter-connect(thecommunicationpart)tobedesignseparately[5].Respectingthisdesignconcept,wearedevelopingadesignmethodologythatallowsfastmappingofnested-loopapplications(e.g.DSP,Imaging,orMulti-Media)writteninasubsetofMatlabontoreconfigurabledevices.ThisdesignmethodologyisimplementedintoatoolchainthatwecallCOMPAAN/LAURA[8].
TheCOMPAANtoolanalyzestheMatlabapplicationandderivesautomaticallyaparallelrepresentation,expressedasaProcessNetwork(PN).APNconsistsofcon-currentprocessesthatareinterconnectedviaasynchronousFIFOs.ThecontroloftheinputMatlabprogramisdistributedovertheprocessesandthememoryisdistributedovertheFIFOs.TheLAURAtoolsynthesizesanetworkofhardwareprocessorsfromthegivenPN.AkeyoperationinLAURAistogeneratetheproperhardwarecommu-nicationmechanismforthepointtopointinter-processorscommunication,whichmaybeadifferentcommunicationstructurethanthePNcommunicationFIFOmodel.
2ClaudiuZissulescu,BartKienhuis,EdDeprettere
Ourtoolflowhasbeendevelopedfordata-flowalgorithms,havingcommunica-tionatthelevelofscalars(e.g.bytesorwords).ThecommunicationtopologyofthePNisstatic,derivedatcompiletime.Torealizetheinter-processorcommunication,weusepoint-to-pointcommunicationmechanism.Employingbussesand/orcomplexNetworks-on-Chips(NoCs)[7,4]forthecommunicationisnotfeasibleduetothede-laysintheroutingprocessandtheusageoflargepacketsinsteadofscalarsinthecom-municationprotocol.
Aswefoundoutin[10],fourtypesofpoint-to-pointinter-processorcommunicationexistinthePNwegenerate.TwoofthemuseaFIFOlikecommunicationandtheothertwouseacachelikememorytoexchangedata.Inthispaper,weinvestigatetherealiza-tionsforthefourcommunicationtypesanddiscusstheeffectivenessoftherealizations.Therestofthepaperisorganizedasfollows:first,wepresenttheCOMPAAN/LAURAflowinordertounderstandhowthehardwaremappingisdonebyLAURA.Next,weaddressthecommunicationchannelgenerationandproposeanapproachtosolveeachcommunicationtype.Wefinishthispaperwithadiscussionoverthemeritsandim-provementsoftheseapproachesinthecontextofourtoolchain.1.1
COMPAAN/LAURAdesignflow
TheprocessnetworksweconsiderinthispaperarederivedusingtheCOMPAANtoolchain.COMPAANtakesasinputparameterizedstaticnestedloopprogramswritteninMatlabandconvertsthiscodetoprocessnetworks.Anintermediatestepistransforma-tionoftheinitialMatlabcodeintosingleassignmentcode(SAC)usingexactdataflowanalysis[3].Thelasttooloftheflow,calledLAURA,isusedtogenerateaVHDLde-scriptionofanarchitecturefromaPNdescription.Duringthisstep,eachprocessofthePNismappedtoanabstractarchitecturalmodelcalledVirtualProcessor.Eachvirtualprocessorconsistsofthreedistinctcomponents:
–AnExecuteUnit,whichisthecomputationalpartofthevirtualprocessor.ThisunitwrapsinanIPcorethatimplementsthefunctionalityoftheprocess.ItsinterfaceconsistsofanumberofInputargumentsandOutputarguments.
–AReadUnit,whichisresponsibleforassigningvalidtokenstotheinputargumentsoftheExecuteUnit.Sincetherearemoreinputportsthanarguments,theReadUnithastoselectatrun-timefromwhichChanneltoreadtokensusingacontrolprogramthatisderivedbyCOMPAAN.
–AndaWriteUnit,whichisresponsiblefordistributingtheresultsoftheExecuteUnittodifferentChannels.AwriteoperationcanexecuteonlywhenalltheoutputargumentsoftheExecuteUnitareavailablefortheWriteUnit.SimilarlytotheReadUnit,theWriteUnithastoselectachannelatrun-timetowritetokensinto,usingacontrolprogramthatisderivedbyCOMPAAN.
TheapplicationstargetedbyCOMPAANareusuallydata-flowintensive,requiringlargecomputationalpower.Therefore,animportantissueinLAURAisthederivationofef-ficientcommunicationstructuresinhardwareandisthefocusofthispaper.Initially,COMPAANfindsintheinputMatlabfile,allthepossibleproducer-consumerpairs.Atthatlevelthecommunicationbetweentwoprocessesisdoneusingamultidimensional
CommunicationSynthesisinamultiprocessorenvironment3
array,representedasapolytope.ToselectthetypeofcommunicationalinearizationprocedureisemployedbytheCOMPAANtoolwhichselectstherighttypeofcommuni-cationchannel.
2CommunicationGeneration
InahardwarenetworkgeneratedbyLAURA,eachprocessorexecutesaninternalcontrolprogramatboththeReadandWriteUnits.ThisprogramdescribesalocalscheduleintermsofExecuteUnitexecutions.Ateachexecution,alsorefertoasaniteration,aReadUnitreadsdatafromaChannelandaWriteUnitwritesdatatoaChannel.IntheoriginalMatlabcode,theChannelrepresentsthecommunicationonan-Dimensionalarray(e.g.,a[i,j]).Thisarrayisreplacedby1-Darraybyourtoolchaininthelinearizationstep.
Virtual Processor 1Channel 1ReadExecuteWriteChannel 2Channel 3ReadExecuteWriteVirtual Processor 2ProducerConsumerFig.1.AProducer-Consumerpair
Figure1depictsaclassicalproducerconsumerpair.AVirtualProcessor1sendsdatatothesecondprocessorcalledVirtualProcessor2viatheChannel2.ThechannelrepresentsthedatadependencybetweenaReadUnitandaWriteUnit.ThisrelationisgivenbyaMappingfunction.Thelinearizationstepreplacestheaddressingofanarraywithrelativeaddressingschemebasedonputandgetprimitives.Usually,thede-rivedcommunicationchannelisaFIFO,however,therearecasesinwhichaFIFOisnotsufficienttolinearizean-dimensionalarray[10].WehavefoundthatfourtypesofcommunicationcanbedistinguishedasgiveninFigure2.Theyresultfromtheorder-ingoftheiterationsattheProducerandtheConsumerprocessesandtheexistenceofmultiplicityforagiventoken,whichmeansthatatokenthatissentbyProducerisreadmorethanonceattheConsumerside.Hence,dependingontheorderandexistenceofmultiplicity,anarbitrarycommunicationchannelbelongstooneoffourdisjointclasses:in-orderwithoutmultiplicity(IOM-),in-orderwithmultiplicity(IOM+),out-of-orderwithoutmultiplicity(OOM-),andout-of-orderwithmultiplicity(OOM+).Foreachclassanadequatecommunicationmechanismneedstobeefficientlysynthesizedinhardwareintermsofcyclesperoperation,areaandspeed.
Fromexperience[10],weknowthatonaveragethefollowingdistributioncanbeex-pectedoverthevariouscommunicationtypes:typeIOM-(80%),IOM+(10%),OOM-(9%),OOM+(1%).TypeIOM-togetherwithtypeIOM+,resultinthat90%ofthecommunicationchannels,requireaFIFObuffertorealizethecommunication.Intheremaining10%ofthecases,amorecomplexReorderingChannelisneeded.2.1InOrdercommunication(IOM-)
IntheInOrdercommunication(IOM-)case,theProducerwritesdataintheChannelinthesameorderastheConsumerreadsfromtheChannel.Therefore,thisChannelis
4ClaudiuZissulescu,BartKienhuis,EdDeprettere
In−order without multiplicity (IOM−) :
Produceri4321In−order with multiplicity (IOM+) :Produceri4data dependecyloop schedulej4321Consumerj4321Consumer3211 2 3 4 5 i1 2 3 4 5 iOut−of−order without multiplicity (OOM−) :
Produceri4321j43211 2 3 4 5 iOut−of−order with multiplicity (OOM+) :
Producerij43211 2 3 4 5 i4321ConsumerConsumerFig.2.ThefourcasesofcommunicationbetweenProducerandConsumer
implementedinhardwareusingaFIFObuffer.Itisaccessedusingthetwoprimitivesput(implementedintheWriteunit)andget(implementedintheReadunit).BecausehighlyoptimizedimplementationsofFIFObuffersexistfortoday’sFPGAs,ittakeseachprimitiveonlyasinglecycletowritedataortoreaddatafromaChannel.AhardwareFIFOhasfinitememory,thusbothprimitivesareblocking,e.g.theyhaltaprocessorwhennodataisavailableinaFIFOorwhenaFIFOisfull.FindingalowerboundonaChannelisahardprobleminPNsanditisoutsidethescopeofthispaper,althoughasmalldiscussionisgiveninSection3.
2.2InOrderwithMultiplicitycommunication(IOM+)
IntheInOrderwithMultiplicityCommunication(IOM+)case,theorderdataispro-ducedisthesameastheorderinwhichdataisconsumed.However,somedataiscon-sumedmorethanonce,breakingthecommunicationmodelofaFIFOwhereagetoperationisdestructive.Inthismodel,thelife-timeofatokenneedstobetakenintoaccount.Onlyattheendofthelife-timeofatoken,thetokencanbereleasedfromtheFIFO.WhiletheputprimitiveremainsthesameasinthecaseofIOM-,weaddedanewcommunicationprimitivewhichwecalledthepeekprimitive.ThepeekprimitivefetchesdatafromaFIFObufferwithoutdestroyingit.TodestroythecurrentFIFOdataareleasecontrolissynthesizedintheConsumerReadUnit.Also,theoutputoftheFIFOisregisteredbytheMultiplicityRegisterwhichiscontrolledbythereleasecontrol.Apeekisonlyreadingthecontentsoftheregister,whileagetoperationisreadinganewvaluefromtheFIFObufferandplaceitintheMultiplicityRegister.Thecontrolthatdeterminesthelife-timeofatokenisexpressedinthesamewaythecontrolprogramsintheReadandWriteUnitareexpressed.2.3OutofOrdercommunication(OOM-)
IntheOut-of-Ordercommunication(OOM-)case,aConsumerreadsdatainadifferentorderithasbeenwrittenbytheProducer.Hence,thecommunicationchannelallowsaConsumertofetchdataintheorderitexpectit.WerefertothiskindofChannel,
CommunicationSynthesisinamultiprocessorenvironment5
whichallowsforrun-timereorderingoftokens,asaReorderChannel.Themainele-mentsofthisReorderChannelisthereordermemoryandthetaggingoftokensthatarewritten/readto/fromthereorderingmemory.EachtokenneedstobetaggedtoallowtheConsumerProcesstorequestparticulartokensintheordergivenbyitslocalsched-ule.Thetagcomputationtakesplaceinbothpartsinvolvedinthetransaction,i.e.,theProducersideandtheConsumerside.
ValidDataProducerProducerData Out &AddressDataConsumerConsumerReorder MemoryReorder Memory(RAM)(RAM)Data InDataRequestAddress?ACK ProducerACK ConsumerFig.3.TheorganizationoftheReorderchannel
InFigure3,theorganizationoftheReorderChannelisgiven.Themainelementistherandomaccessmemory,calledtheReorderMemory.ThefigurealsoshowsaProducerthatwantstowritetokenstotheChannelandaConsumerthatwantstoreadtokens.Atoken(Data)thatiswrittenbytheProducer,istemporarilystoredinareg-istertogetherwithanaddress.ThisaddressiscalculatedbythetaggeneratoroftheProducer.Eachtokenthatisstoredinmemoryhasavalidbit,whichindicatesthataparticularlocation(address)containsvaliddata.Ifthevalidbitisset,theProducerisnotallowedtowritedataandiscompletelystalleduntiltheaddressbecomesavailableagain(ACKProducer).Otherwise,theProducerwritesthetemporarilystoreddataintothememoryandsetsthevalidbit.Attheotherside,theConsumerplacesarequestcommandtotheReorderingChannelforaparticularlocationgivenbyanaddress.Iftherequestedlocationcontainsvaliddata,theConsumerreceivesanacknowledgesig-nal(ACKConsumer),and,atthesametime,thedesireddata.Ifthelocationdoesnotcontainvaliddata,theConsumerstallsuntilvaliddatabecomesavailable.GiventheorganizationoftheReorderingChannelinFigure3,twoissuesdeterminethedesign.Oneisrelatedwiththecomplexityofthetaggenerationandtheotheroneisrelatedwiththeperformanceofthereorderchannelintermsofclockcycles.2.4Taggeneration
Inthegenerationoftags,wetakeadvantagesofthefactthatweoperateonpolytopeswithinCOMPAAN.Thisleadstotwodifferentapproachwecanusetogeneratethetag.OneapproachisbasedontheEhrhartenumerationtheory[2].Usingthistheory,weobtainapseudo-polynomialexpressionthatgivesanuniqueintegervalueforeachpointenclosedbythepolytope.Thisapproachhasbeensuccessfullyexploredinsoftwarein[9].However,thisapproachisnotsuitableforhardwareimplementationduetothecomplexityoftheobtainedpseudo-polynomialexpression.InourexampleinFigure4,
6ClaudiuZissulescu,BartKienhuis,EdDeprettere
thepseudopolynomialforthepolytopeenclosingtheproducerpointsisgivenbythefollowexpression:(−1/4)∗i2+(N−5/2)∗i+j+[−1,5/2]i.Inthisexpression,thepseudopolynomialterm[−1,5/2]iindicatesthatwhentheevaluationofmod(i,2)isequaltozero,thevalue-1isselected;otherwisethevalue5/2isselectedinthepolynomial.
Inthesecondapproach,werelaxtheshapethatenclosestheproducerpolytopetoahyper-rectangularshape.Wecallthishyper-rectangularshapetheBoundingBox.ForaBoundingBox,wecanmakeuseofclassicallinearizationtoconvertan-dimensionrectangletoanone-dimensionalarray[1,6].FindtheBoundingBoxthatbestenclosesthepolytopeisaminimizationproblemthatwesolveusingintegerlinearprogram-ming.TheimprovementovertheEhrhartapproachisthateachBoundingBoxcanbeaddressedusingasimplepolynomialthatcanbeimplementedefficientlyinhardware.ThetagforatokenisobtainedasafunctionoftheiteratorsofaprocessorandhastheN
formtag=k=1ck∗xk+x0+c0whereckrepresentsaconstant,xkisaniterationspaceindex.EachtagbecomesanaddressforaRAMmemory.
ProducerMappingN=87654321543211098761413121118171615212019242322262528272930Consumer(i,j) <−− (y,x)N=87654321211373122148422395241062517112618122719282029301516BoundingBoxPolytopej
i12345678910 = N+2y12345678910 = N+2W:(N−3)*(i−1) + j−400010203040506070809xR:(N−3)*(x−1) + y−420 NULL21 NULL22token 1923token 2024token 2125 NULL26 NULL27token 2228token 2329token 2430313233343536373839token 1token 2token 3token 4token 5token 6token 7token 8token 9token 1010111213141516171819 NULLtoken 11token 12token 13token 14 NULLtoken 15token 16token 17token 18 NULL NULL NULLtoken 25token 26 NULL NULL NULLtoken 27token 2840414243444546474849 NULL NULL NULL NULLtoken 29 NULL NULL NULL NULLtoken 30Reorder memory The Linearization shape is given by the smallestrectagular containing the Producer domain
Fig.4.TheLinearrealizationofthereordermemory
IfweextracttheBoundingBoxattheProducersideinFigure4,weobtainthewriteaddress,whichis(N−3)∗(i−1)+j−4.Fromtheaddressgeneratedbythewriteaddress,theConsumerhastoreadtokensintheorderitdesires.Itthereforerequiresareadaddresstoaccessthetokens.Sincethereisawelldefinedmathematicalrelation-shipbetweentheProducerandConsumerasgivenbytheMapping,wecanobtainthereadaddressbymathematicallycomposingthewriteaddresswiththemappingfunc-tion.Thus,theparameterizedreadaddressisequalto(N−3)∗(x−i)+y−4.Inthiscase,reorderingisinducedbythefactthatthepointsintheConsumerpolytopeareaccessedinadifferentorderthentheProducerhaswrittenthantotheReorderChannel.
ThedownsideoftheBoundingBoxapproachisthatthememoryallocatedfortheReorderMemoryislargerthanneededwhencomparedwiththeEhrhartapproach.InFigure4,thecontentoftheReorderMemoryisgiven.Itshowsatwhichaddressapar-ticulartokeniswritten.Forexample,thetokenfromlocation(6,7)iswrittenataddress28.Thisisthe23rdtokenwrittenbytheProducer.ThistokenisreadbytheConsumer
CommunicationSynthesisinamultiprocessorenvironment7
atlocation(6,7)asthe18thtokenread.Fromthecontentofthereorderingmemory,weobservethatmemorylocationsexistthatareneverwrittenorreadduringtheexecutionofanetwork,asgivenbythe”Nulls”.InthecaseofanEhrhartgeneratedaddress,these”Nulls”wouldnotexistattheexpenseofmorecomplexaddresspolynomialsthataredifficulttobeimplementedefficientlyinhardware.2.5Communicationprotocol
TheReorderingChannelimplementsaparticularcommunicationprotocolthatinvolveswritingandreadingatoken.Awriteoperationusesonlytheputcommunicationprim-itive,likewewouldhavedoneincaseofaFIFOChannel.ThisprimitiveprovidestotheReorderChannelatokenthatconsistsofataganddata.Ifplaceisavailableatthelocationgivenbythetag,thetokeniswritten;otherwisetheProducerisstalledrealizingablockingwrite.Areadoperationconsistsofthreecommunicationprimitives;acheck,apeekandaget.TogethertheyrealizethereadingofdatafromtheReorderingChannelinanydesiredorder.Eachprimitiveperformsaparticulartask:
–checkinquiresifspecificdataispresentintheReorderMemory.ThedesiredtaggiventotheReorderingChannelasaRequest.Ifthatdataispresentatthead-dressgivenbythetag,theReorderingChannelsendsanacknowledgesignaltotheConsumer;otherwisetheConsumerblocksrealizingablockingread.
–getreadsatokenfromtherequestedaddressandsetthevalidbittofalseindicatedthatthelocationifavailableforwritingagain.Agetoperationisthereforedestruc-tive.
–peekreadsatokenfromtherequestedaddressbutkeepsthevalidbithightoin-dicatedthatthelocationisstillneededforreading.Usingthepeekoperation,thelife-timeofatokeniscontrolled.
IntheOOM-communicationtype,weonlyusethecheckandgetprimitivestoreaddatasincenomultiplicityisinvolved.Amemorylocationcanbeimmediatelyreleasedwhenreadingatoken.
2.6OutofOrderwithmultiplicitycommunication(OOM+)
IntheOut-of-OrderwithMultiplicitycommunication(OOM+)case,achannelhasthesamecharacteristicsastheOOM-case.AdditionalreleaselogicishoweveraddedattheConsumersidetokeeptrackofthelife-timeoftokens.Consequently,theOOM+communicationtypealsousesthepeekprimitiveifatokendoesnotyetneedtobereleasedfrommemory.Thegetprimitiveisusedwhenthereleaselogicindicatesthatthelife-timeofatokenhascometoitsend.
3MemoryAllocationSchemes
WhenimplementingaChannel,whetheritisaFIFOoraReorderingChannel,weneedtodeterminealowerboundontheamountoftokensthatcanbestoredinaChannel,withoutcausingadeadlocktooccur.Findingthislowerboundinprocessnetworks
8ClaudiuZissulescu,BartKienhuis,EdDeprettere
isingeneraladifficultproblem.Aprocessnetworkisspecifiedintermsofpartialorderingsbetweenaproducerandconsumer,i.e.,theProducer/Consumerpair.TofindalowerboundonaChannelrequires,however,atotalorderontheexecutionoftheprocessesinthenetwork.Manydifferenttotalordersexist;leadingtodifferenttrade-offsinevaluationspeedandmemoryrequirement.AtotalorderforaPNcanbeobtainedbyschedulingorbydoingarun-timeevaluationofaprocessnetwork.Nevertheless,wewouldliketoavoidbothmethods.Theschedulinginterfereswiththenotionofdistributedcontrolandtherun-timeevaluationisnotacompiletimeanalysisthatcanbeperformedaspartoftheCOMPAANtool.Also,boththerun-timeandscheduleapproachcannothandletheparameterizednatureofthePNswederive.Instead,weusecompiletimeallocationschemes.Withoutgoingtodeepintodetail,wecurrentlydistinguishtwocompiletimeallocationschemes:
–MemoryAllocationwithoutreleasingthememory(SAC)
Thismemoryallocationschemeisderivedfromthesingleassignmentdataallo-cationfoundbyoneoftheintermediatestepofourtoolchain.Inthisscheme,amemorylocationisassigneddataonlyonce.HenceaConsumerdoesnotneedtoreleaseamemorylocationafterithasconsumeddatafrommemory.ThememoryrequiredonaChannelisthedimensionneededtoaccommodateallproduceddata,whichisgivenbytheBoundingBox.
–MemoryAllocationWithreleasingthememory(SEQ)
ThismemoryallocationschemeisderivedfromtheoriginaldataallocationfoundinthesequentialinputMatlabprogram.Inthisscheme,memorylocationsarere-usedandhenceeachreadorwriteoperationontheChannelhastocheckifamemorylocationiseitherfreeorcontainsvaliddata.
TheSACallocationschemeresultsinthefastestexecutionofaPNasthenetworkcanrunwithmaximumparallelism.Anyotherallocationscheme,liketheSEQallocationscheme,mayrestrictthenetworkparallelismtoamoresequentialexecutionscheme,asexplicitchecksneedtobeperformedonthevalidityofdata.Theguaranteethatthetwodescribedallocationschemesdoindeednotintroducedeadlockduetounderdimensionedcommunicationbuffers,isgivenbythefactthattheallocationschemesareabletoruninboundedamountofmemoryinthesequentialexecution.Weknowfromsimulationthattighterlowerboundsarepossible.Furtherresearchisneededtodeveloptechniquestoderivethesetighterboundsatcompiletime.WeexpectthatresultsfromresearchinmemoryoptimizationforsequentialcodecanalsobeappliedtoimprovethememoryallocationinPNs.
4Hardwarerealization
AllfourcommunicationtypescanbeautomaticallygeneratedbyLAURAandhavebeenimplementedonaVirtexII-6000platformfromXilinx.AFIFOchannelisrealizedusingdifferenttypesofmemoryatcompiletimedependingonthecalculatedsize.Iflessthan1024bitsarerequired,weuseRAM16x1Dmemories,otherwiseweuseRAMB16memoryblocks.Ifonlyasinglelocationisrequired,wesimplyinstantiateaFIFO
CommunicationSynthesisinamultiprocessorenvironment9
channelthatusesonlyaregister.IncaseoftheReorderChannels,wealwaysuseBlockRAMstorealizetheReorderingMemoryasafulldualportmemoryisrequired.
ThehardwarerealizationofaFIFObufferisthefastestandmostefficientoneintermsofcyclesperoperation,asitrequiresonecycleperreadorwriteoperation.InthecaseofaReorderingChannel,theSACimplementationrequiresstillonecycleperwriteoperationastheavailabilityofalocationdoesnotneedtobechecked.However,thereadoperationrequiresthreecyclestoreadonescalarfromthechannel.TheSEQimplementationoftheReorderChannelrequiresupto2cyclesforawriteand3cyclesforareadoperation.Inboththereadandwriteoperation,theavailabilityofalocationneedstobechecked.
5Example
Tohighlightthedifferentcharacteristicsofthecommunicationchannels,welookedattheQRalgorithm[11].Thisalgorithmrequires1IOM+and10IOM-channels.How-ever,wecanimplementthesechannelsalsousing1OOM+and10OOM-channels.Thishighlightsthedifferencesinthehardwareimplementationandperformanceofthedifferentchannelsandthebenefitsofselectingthetherightchanneltypeatcompiletime.Inbothcases,thelowerboundsonthechannelsaredeterminedatcompiletimeusingtheSACallocationscheme.
15432
CyclesMemorySizeFrequency
1285120bits102Mhz
Compiletime
estimations
MemorySize111771
FIFOChannel
RAM16x1DSlices
Table1.Experimentalresultsforvarioushardwarechannels.
FromTable1,weobservethattheReorderChannelsaremoreinefficientthanFIFOchannelsintermsofusageofFPGAmemories.TheReorderMemoryisrealizedusingonlySelectBlockRAMs(RAMB16).Forthe11channels,11RAMB16memoryblocksallocated,whichis180224bitsinmemoryontheFPGA.ForeachReorderingchannel,weneedtoallocationatleastaRAMB16blockwhichcanaccommodate512tokensof32bit,evenifweonlyneedspacefor10tokens.Hence,itisdifficulttousethememoryblockefficiently;therecanbequitesomespill.Ontheotherhand,foraFIFOchan-nel,wecanselectbetweenSelectBlockRAMsorRAM16x1Dmemories,dependingontherequiredchannelsize.Inthisparticularcase,weuse320RAM16x1DmemoryblocksandthememoryusedfortheFIFOimplementationisonlyslightlylargerthanthecalculatedsizeusingtheSACallocationscheme,i.e.,5120bitsversusthe4928bitscalculated.SincetheSACallocationschemeisused,theReorderChannelsimplementafastwriteoperation.Nevertheless,ittakestwicethenumberofcyclestoruntheQRal-gorithm(258cycles)comparedtowhenusingFIFOchannels(128cycles).Thatareadoperationrequires3cycles,hasabiginfluenceontheexecutionspeedofthealgorithm.ThenumberofslicesusedtoimplementtheQRalgorithmwithFIFOsis890andwithReorderingChannelsis1771.TheReorderingChannelrequirestwiceasmanyslices
10ClaudiuZissulescu,BartKienhuis,EdDeprettere
duetothehardwaretaggeneratorsandtheuseofRAMB16memories.Boththeexecu-tionspeedandnumberofslicesusedshowthatitisindeedveryimportanttoselectthepropercommunicationtypeinordertoevaluateanalgorithmasfastaspossiblewiththeleastamountofresources.
6Conclusions
Thedata-flowintensiveapplications,wetargetwithCOMPAAN,requirelargecompu-tationalpoweranditisimportanttoderiveefficientcommunicationstructuresinhard-ware.Fourpoint-to-pointcommunicationtypesaredistinguishedinCOMPAANandwehaveshownthatforeachofthem,wecanderiveefficienthardwareinLAURA.TwotypesusehardwareFIFOimplementationsandtwotypesuseaReorderChannelim-plementation,aswepresentedinthispaper.Fromacasestudy,wehaveshownthataFIFOimplementationisthemostoptimalimplementationofaChannelfromanypointofview(throughput,hardwareresources,memoryusage).BecauseaFIFOcanreadandwriteinasinglecycleitcankeepupwiththemaximumdataflowthroughourvirtualprocessors.Fromexperience,weknowthatonaverage90%ofthechannelsinanapplicationcanberealizedwithFIFObuffers.Intheremaining10%,weemploytheReorderingChannelimplementation.Asfuturework,weareinterestedtofurtheropti-mizetheReorderChannelandportingittootherFPGAarchitectures,e.g.Altera.Also,wewanttoinvestigatenewcompiletimeallocationschemestoobtaintighterlowerboundsonthememoryneededinthecommunicationchannels.
References
1.A.Aho,R.Sethi,andJ.Ullman.Compilers:Principles,TechniquesandTools.Addison-WesleyPublishingCo.,1986.2.E.Ehrhart.Polynˆomesarithm´etiquesetM´ethodedesPoly´edresenCombinatoire.Birkh¨auserVerlag,Basel,internationalseriesofnumericalmathematicsvol.35edition,1977.
3.P.Feautrier.Dataflowanalysisofarrayandscalarreferences.InternationalJournalofParallelProgramming,20(1):23–53,1991.
4.P.GuerrierandA.Greiner.Agenericarchitectureforon-chippacket-switchedintercon-nections.InDATE’00:ProceedingsoftheconferenceonDesign,automationandtestinEurope,pages250–256.ACMPress,2000.
5.K.Keutzer,S.Malik,R.Newton,J.Rabaey,andA.L.Sangiovanni-Vincentelli.Systemleveldesign:Orthogonalizationofconcernsandplatform–baseddesign.IEEETrans.onCAD,2000.
6.S.Muchnick.AdvancedCompilerDesignandImplementation.MorganKaufmannPublish-ers,Inc.,1997.
7.E.Rijpkema,K.Goossens,A.adulescu,J.vanMeerbergen,P.Wielage,andE.Waterlander.Tradeoffsinthedesignofarouterwithbothguaranteedandbest-effortservicesfornetworksonchip,2003.
8.T.Stefanov,C.Zissulescu,A.Turjan,B.Kienhuis,andE.Deprettere.Systemdesignusingkahnprocessnetworks:Thecompaan/lauraapproach.InProceedingsofDATE2004,Paris,France,Feb16–202004.
CommunicationSynthesisinamultiprocessorenvironment11
9.A.Turjan,B.Kienhuis,andE.Deprettere.Acompiletimebasedapproachforsolvingout-of-ordercommunicationinKahnProcessNetworks.InProceedingsofIEEE13thInternationalConferenceonApplication-specificSystems,ArchitecturesandProcessors,July17-192002.10.A.Turjan,B.Kienhuis,andE.Deprettere.AnIntegerLinearProgrammingApproachto
ClassifyCommunicationinProcessNetworks.In8thInternationalWorkshoponSoftwareandCompilersforEmbeddedSystems(SCOPES),Amsterdam,Sept.2–32004.
11.R.WalkeandR.Smith.20gflopsqrprocessoronaxilinxvirtex-efpga.InAdvancedSignal
ProcessingAlgorithms,Architectures,andImplementationsX,volume4116,2000.
因篇幅问题不能全部显示,请点此查看更多更全内容