Basics:
We can't tune what is not being taxed, and we can't tune what can't be tracked.
(we tune the intensities, not the sleepy times)
If it runs fast enough, we are done (you can't tune it forever).
Hardware:
- date; uname -a; id; oslevel -s; lparstat -i
- check hardware (prtconf | more -- lsdev | grep vail -- lscfg | grep +)
- think about whether to tune or to move the data/workload
- check near-static structures (lvm, paging space, settings)
- check historical data (events, errors)
Memory - VMM:
- shared memory segments: ipcs -bm (most of the time not all of this memory is actually allocated)
the values shown are the maximum sizes to which the memory segments can grow (they are a major component of computational memory)
- uptime; vmstat -s (the counters accumulate since boot)
- uptime; vmstat -v (I/O goes through fsbufs -> then pbufs (each of them can be exhausted))
- vmo -L; ioo -L
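The fsbuf/pbuf exhaustion mentioned for vmstat -v shows up there as "... blocked with no ..." counter lines. A minimal sketch, assuming that wording of the counter lines, which keeps only the non-zero ones. The function name blocked_io is made up for illustration. Since these counters accumulate from boot, it is the growth between two runs that matters, not the absolute value.

```shell
# Sketch: print only the non-zero "blocked with no ..." counters.
# Reads `vmstat -v` style output on stdin (assumed: count in the first column).
blocked_io() {
  awk '/blocked with no/ && $1 ~ /^[0-9]+$/ && $1 != "0" {
         $1 = $1      # rebuild the line so leading padding is dropped
         print
       }'
}

# On a live system (run it twice and compare the numbers):
#   vmstat -v | blocked_io
```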
I/O - LVM:
- df -k (how much content is governed by 1 inode)
- tech-stack map: RAID set -> LUN -> LVM (VG:LV:FS with options) -> logical content
- iostat -a, iostat -D
- lvmo -a -v <vgname> (pbufs can be checked and increased if needed)
- iostat -AQ 2 (asynchronous I/O stats)
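The "content per inode" figure from df -k can be worked out as (1024-blocks - Free) / Iused. A rough sketch, assuming the usual AIX df -k column order (Filesystem, 1024-blocks, Free, %Used, Iused, %Iused, Mounted on). The function name kb_per_inode is hypothetical, and entries with "-" in the numeric columns (such as /proc) are skipped.

```shell
# Sketch: average KB of data per used inode, per filesystem.
# Reads `df -k` output on stdin; assumes the AIX column layout:
#   Filesystem 1024-blocks Free %Used Iused %Iused Mounted-on
kb_per_inode() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ && $5 > 0 {
         used_kb = $2 - $3                    # space in use, in KB
         printf "%s %.1f\n", $7, used_kb / $5 # mount point, KB per inode
       }'
}

# On a live system:
#   df -k | kb_per_inode
```

A high value points to a few large files, a low value to many small ones.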
Processes:
- uptime; ps -ekf | egrep "syncd|lrud|nfsd|biod|wait"
compare the accumulated TIME of lrud with that of syncd: if lrud's is greater, it should grab your attention (if lower, it is fine)
if lrud's time is high, it is scanning and freeing and scanning and freeing...
lrud has high priority, so while it is running not much other work can be done (reduce lrud activity to let other processing run)
- ps -kelmo THREAD (shows the threaded world)
- ps guww (sorted by descending %CPU; RSS: size in real memory, SZ: size in virtual memory, STIME: start time, TIME: accumulated CPU time)
- ps gvww (sorted by ascending PID; PGIN: number of page-ins)
- ps -ef | grep -v "Oct 20" (the boot day, here Oct 20, is grepped out, which leaves the processes started since then)
- ps -ef | grep "LOCAL=NO" (for Oracle client sessions)
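The lrud versus syncd comparison from the list above is easier when both TIME values are converted to seconds. A small sketch, assuming TIME is written as [[dd-]hh:]mm:ss. The helper names time_to_seconds and lrud_vs_syncd are made up for illustration, and the two values are read off the ps output by hand.

```shell
# Sketch: convert a ps TIME value ([[dd-]hh:]mm:ss) to seconds.
time_to_seconds() {
  echo "$1" | awk -F'[-:]' '{
    n = NF
    secs = $n + 60 * $(n - 1)                # seconds + minutes
    if (n >= 3) secs += 3600 * $(n - 2)      # hours, if present
    if (n >= 4) secs += 86400 * $(n - 3)     # days, if present
    print secs
  }'
}

# Compare the two values: args are lrud TIME, then syncd TIME.
lrud_vs_syncd() {
  l=$(time_to_seconds "$1")
  s=$(time_to_seconds "$2")
  if [ "$l" -gt "$s" ]; then
    echo "investigate"    # lrud has used more CPU time than syncd
  else
    echo "ok"
  fi
}

# Example with values taken from: ps -ekf | egrep "lrud|syncd"
#   lrud_vs_syncd "45:10" "12:34"
```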
Network:
- netstat -ss (check non-zero values)
- netstat -v (queue overflow)
- nfsstat
6 in 1 tool:
- vmstat -Iwt 2
-------------------
if CPU bound -> tprof is used to spot the processes that are using the most CPU
if memory bound -> svmon helps to find what is using the most memory
if I/O bound -> filemon helps to find what is causing all of the disk activity
-------------------
CPU wait is too high, how can I reduce it?
A CPU in Wait for I/O mode is not a problem. The CPU is actually idle, but it has noted that there is disk I/O outstanding, so the time is reported as Wait instead of Idle. Many workloads that process (and throw away) data faster than it can be read from disk will show high Wait. A CPU in Wait for I/O mode is fully available to run more application code.
In benchmarks, Wait for I/O is seen positively, as an opportunity: we can throw in more work to boost throughput.
Any workload in which the CPU does little work compared to the volume of disk I/O is going to give you high Wait for I/O.
If high Wait for I/O is a sudden change from the normal pattern, then it needs investigating, and you should make sure that as many disks as possible are involved in the disk I/O.
In fact, faster CPUs would mean even higher Wait values.
-------------------
Which process consumes most memory?
topas -P; you can tab to the "page space" column to sort on it. It is called the "page space" column because it shows the memory usage that is backed by that amount of paging space (which is the size of the process in memory).
-------------------
Which process has used the DISK I/O most frequently?
Start nmon --> t for top processes -->Hit 5 to list them in I/O order, then look at the Char I/O column
-------------------
Free memory is near zero, how do I free more memory?
This is just how AIX works and is perfectly normal. After a reasonable length of time all memory will be soaked up with copies of filesystem blocks and the free memory will be near zero. If your filesystem cache is a large percentage of memory, then you are avoiding disk I/O. This is a good thing. You should NOT try to reduce it; this could damage performance.
AIX will then use the lrud process to keep the free list at a reasonable level. If you see the lrud process taking more than 30% of a CPU then you need to investigate and make memory parameter changes.
-------------------
Paging space usage is at 20%, how does this affect performance?
20% of the paging space can be allocated without any actual paging I/O taking place. You need to look at the paging statistics to determine whether paging I/O is actually happening. Allocating paging space by itself does not have a performance impact.
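One way to check whether paging I/O is really happening is to compare the "paging space page ins" and "paging space page outs" counters from two vmstat -s runs. A minimal sketch, assuming those two line labels. The function names paging_counters and paging_delta are hypothetical.

```shell
# Sketch: pull the paging space counters out of `vmstat -s` output on stdin.
# Prints: "<page ins> <page outs>"
paging_counters() {
  awk 'BEGIN { ins = 0; outs = 0 }
       /paging space page ins/  { ins = $1 }
       /paging space page outs/ { outs = $1 }
       END { print ins, outs }'
}

# Difference between two snapshots; each argument is "<ins> <outs>".
paging_delta() {
  set -- $1 $2          # split both snapshots into four positional values
  echo "$(( $3 - $1 )) $(( $4 - $2 ))"
}

# On a live system:
#   before=$(vmstat -s | paging_counters); sleep 60
#   after=$(vmstat -s | paging_counters)
#   paging_delta "$before" "$after"
```

If both differences stay at 0, the allocated paging space is not causing any I/O.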
-------------------
15 comments:
Which fields do we need to monitor in the NMON Analyser report to point out the bottlenecks?
I usually use these fields:
SYS_SUMM - for general overview
LPAR - for CPU usage
DISK_SUMM - for disk usage
FILE - for read/write
MEM, MEMNEW - for RAM usage
NET - for network usage
PAGE - for paging space activity
PROC - to see how many processes are in the run queue
TOP - to see which process used how much CPU%
Hi,
Can you please provide detailed information about the default NMON Analyser output report? How can we modify it? What do the different parameters indicate? etc.
Thanks
Hi, OK, I'll try my best. The official page can also help you: if you download the .zip file and unzip it, you will find documentation inside. (I would use that documentation too...)
Official page: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Power+Systems/page/nmon_analyser
Hi, I am Adeel. We are using Oracle RAC 10g R2 on 2 nodes with OS 6.1. The issue is with one node: its paging increases and does not go down by itself at non-peak times, while the other node releases its paging. As a result, in 5-6 days its paging increases so much that the machine gets rebooted. We have checked with IBM and Oracle but have still not received a proper response. The other node's behaviour is completely normal. Now every week our admin stops the Oracle instance on this node and restarts it in order to reduce the paging and avoid the reboot. This practice is not good, as this is our CORE BANKING DB node. Kindly suggest. The memory size on both machines is 16 GB and the paging space we assigned is 18-19 GB.
Hi, I can give you a workaround for this, but I can't tell you what is causing it. If you dynamically decrease the size of the paging space, AIX will create a temporary paging space containing only the necessary data and then rename it to your original paging space. The command is, for example: chps -d 1 paging00. It will run for a long time, and there must be enough free space in rootvg, but after it has finished, paging space usage usually drops to a low percentage (unnecessary data will be flushed away). You can monitor its progress with nmon.
For the real cause I think it should be some Oracle issue...but IBM or Oracle should investigate it more.
Thanks for your reply. Can I run this command without taking any downtime? Normally paging increases in the daytime. Is it safe to run while users are online?
Yes, this does everything online, no reboot is needed. However, I would choose a time when there is no heavy load on the system, and you can test this command on a test/development system before executing it on the prod. system, to have some confidence in it.
Hi admin...
In an interview I faced a question that I was unable to answer. Can you please tell me?
The question is: "What is a jumping process? If a process is jumping, what should I do, shall I kill it?"
Hi, I have never heard of a "jumping process". Do you know what that is?
It must be a zombie process...
You can find some description about zombie processes here: http://aix4admins.blogspot.hu/2011/08/commands-and-processes-process-you-use.html
When looking at "topas" we see the Kern% measurement higher than we think it should be. Often it is on par with User%, but we do see "spikes" in activity where Kern% is actually quite a bit higher than User%. What is the best way to identify which processes are being measured under this category?
ps -eo pid,vsz,user,comm | awk 'NR==1 {print; next} {print | "sort -k2,2nr | head -20"}' will help to find which processes are taking the most memory (the header line is printed once and kept out of the sort)