On the exploitation of loop-level parallelism in embedded applications
Source
ACM Transactions on Embedded Computing Systems (TECS)
Volume 8, Issue 2 (January 2009)
Article No. 10
Year of Publication: 2009
ISSN:1539-9087
Authors
Arun Kejariwal  University of California, Irvine, CA, USA
Alexander V. Veidenbaum  University of California, Irvine, CA, USA
Alexandru Nicolau  University of California, Irvine, CA, USA
Milind Girkar  Intel Corporation
Xinmin Tian  Intel Corporation
Hideki Saito  Intel Corporation
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1457255.1457257

ABSTRACT

Advances in silicon technology have enabled increasing support for hardware parallelism in embedded processors. Vector units, multiple processors/cores, multithreading, special-purpose accelerators such as DSPs or cryptographic engines, or a combination of these have appeared in a number of processors. They serve to address the increasing performance requirements of modern embedded applications. The extent to which the available hardware parallelism can be exploited depends directly on the amount of parallelism inherent in the given application and on the congruence between the granularity of hardware and application parallelism. This paper discusses how loop-level parallelism in embedded applications can be exploited in hardware and software. Specifically, it evaluates the efficacy of automatic loop parallelization and the performance potential of different types of parallelism, viz., true thread-level parallelism (TLP), speculative thread-level parallelism, and vector parallelism, when executing loops. Additionally, it discusses the interaction between parallelization and vectorization. Applications from the industry-standard EEMBC®¹ 1.1 and EEMBC 2.0 suites and from the academic MiBench embedded benchmark suite are analyzed using the Intel®² C compiler. The results show the performance that can be achieved today on real hardware using a production compiler, provide upper bounds on the performance potential of the different types of thread-level parallelism, and point out a number of issues that need to be addressed to improve performance. These issues include the parallelization of libraries such as libc and the design of parallel algorithms that allow maximal exploitation of parallelism. The results also point to the need for new benchmark suites better suited to parallel compilation and execution.

1 Other names and brands may be claimed as the property of others.

2 Intel is a trademark of Intel Corporation or its subsidiaries in the United States and other countries.
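
The abstract's distinction between loops that can be parallelized or vectorized and loops that cannot is easiest to see on two small kernels. The following is a minimal C sketch and is not taken from the paper; the function names `scale_add` and `running_sum` are hypothetical and chosen only for illustration. The first loop has independent iterations, so a compiler may in principle split it across threads or map it onto vector lanes. The second carries a dependence from one iteration to the next, which is the kind of loop where plain thread-level or vector parallelism does not apply directly.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: out[i] depends only on a[i] and b[i].
 * The restrict qualifiers promise the compiler that the arrays do not
 * overlap, which is typically what lets it prove the iterations
 * independent and consider threading or vectorizing the loop. */
void scale_add(int *restrict out, const int *restrict a,
               const int *restrict b, int k, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = k * a[i] + b[i];
    }
}

/* Loop-carried dependence: iteration i reads x[i - 1], which the
 * previous iteration has just written. Running the iterations in
 * parallel as written would change the result. */
void running_sum(int *x, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        x[i] += x[i - 1];
    }
}
```

Whether a given compiler actually parallelizes or vectorizes the first loop depends on its options, its cost model, and the target hardware; the sketch only shows the dependence structure that makes such transformations legal or illegal.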


